The High Scalability blog posted a link to Facebook’s new posts search system, and to the Facebook Note written about it by a member of the engineering team.
One of the sections mentioned the Wormhole publish/subscribe system that they developed to push data across multiple data centres in near real time. At a very basic level, they appear to have built a three-tier push system that can also be queried to retrieve updates if part of the system fails.
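To make that idea concrete, here is a minimal sketch of a subscriber that normally applies pushed updates but falls back to querying the source when it detects a gap (for example, after a crash). This is purely illustrative: the class, method and parameter names are hypothetical and not taken from Wormhole itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Update:
    seq: int        # monotonically increasing sequence number for the stream
    payload: bytes  # the change being propagated


class Subscriber:
    """Receives pushed updates; queries the source only to fill gaps."""

    def __init__(self, fetch_missed: Callable[[int, int], List[Update]]):
        self._last_seq = 0
        self._fetch_missed = fetch_missed  # query path, used only on gaps

    def on_push(self, update: Update) -> None:
        if update.seq > self._last_seq + 1:
            # We missed some updates (failure or restart): query the source
            # for the missing range, then resume consuming the stream.
            for missed in self._fetch_missed(self._last_seq + 1, update.seq - 1):
                self._apply(missed)
        self._apply(update)

    def _apply(self, update: Update) -> None:
        self._last_seq = update.seq
        print(f"applied update {update.seq}")
```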
A statistic on its effects is:
“Compared to the previous system, Wormhole reduced CPU utilization on UDBs by 40% and I/O utilization by 60%, and reduced latency from a day down to a few seconds.”
(A UDB is a User Database: the store that other services connect to and replicate from.) That might suggest the original system was a polling one, and that they have now moved to a data-streams model. Given the size of Facebook, I doubt it was a small task!
The use of pub/sub in this way is becoming a familiar pattern (think DataSift and High Scalability’s post about their architecture), and I think Gnip does something similar. RabbitMQ covers it in its tutorials on topics and logging; a small sketch of the topic pattern follows.
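Here is a minimal sketch of that topic-based pub/sub pattern using RabbitMQ’s Python client (pika), in the spirit of the RabbitMQ topics tutorial. The exchange name, routing keys and message body are my own illustrative choices, not anything from Facebook’s system.

```python
import pika

# Connect to a local broker and declare a topic exchange.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="updates", exchange_type="topic")

# Publisher: the routing key encodes where the change came from and what changed.
channel.basic_publish(
    exchange="updates",
    routing_key="dc1.users.update",
    body='{"user_id": 42, "field": "posts"}',
)

# Subscriber: bind a private queue with a wildcard pattern so it receives
# user updates from every data centre.
result = channel.queue_declare(queue="", exclusive=True)
queue_name = result.method.queue
channel.queue_bind(exchange="updates", queue=queue_name, routing_key="*.users.*")


def handle(ch, method, properties, body):
    print(f"received {method.routing_key}: {body!r}")


channel.basic_consume(queue=queue_name, on_message_callback=handle, auto_ack=True)
channel.start_consuming()
```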
What changes here is doing it in near real time and being able to recover from crashes and failures.