Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
Tuesday, March 22, 2011 at 9:55AM
HighScalability Team in Example, analytics, facebook

Facebook did it again. They've built another system capable of doing something useful with ginormous streams of realtime data. Last time we saw Facebook release their New Real-Time Messaging System: HBase To Store 135+ Billion Messages A Month. This time it's a realtime analytics system handling over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds

Alex Himel, Engineering Manager at Facebook, explains what they've built (video) and the scale required:

Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. 

Alex does an excellent job with the presentation. Highly recommended. But let's take a little deeper look at what's going on...

The need for such a high powered analytics system is driven by Facebook's brilliant plan for world wide web domination via the viral propagation of social plugins, all tying the non-Facebook web back into Facebook and the Facebook web back into the non-Facebook web. Basically anything that people can do is captured and fed back through Facebook and anything done on Facebook can be displayed on your website, building closer relations between the two.

Facebook's Social Plugins are Roman Empire Management 101. You don't have to conquer everyone to build an empire. You just have control everyone with the threat they could be conquered while making them realize, oh by the way, there's lots of money to be made being friendly with Rome. This strategy worked for quite a while as I recall.

You've no doubt seen Social Plugins on websites out the wild. A social plugin lets you see what your friends have liked, commented on or shared on sites across the web. The idea is putting social plugins on a site makes the content more engaging. Your friends can see what you are liking and in turn websites can see what everyone is liking. Content that is engaging gives you more clicks, more likes, and more comments. For a business or brand, or even an individual, the more engaging the content is, the more people see it, the more it pops up in news feeds, the more it drives traffic to a site.

The formerly lone-wolf web, where content hunters stalked web sites silently and singly, has been turned into a charming little village, where everyone knows your name. That's the power of social.

Posts here on HighScalability, for example, now have Like buttons. TechCrunch has famously moved to using Facebook's commenting system. Immediately the debate centered on the quality of the comment system itself, but that was hardly the point, the point was to plunge TechCrunch more deeply into Facebook's ecosystem of 500+ million users. Other plugins include: Recommendations, Activity Feed, Login, Registration, Facepile, and Live Stream.

All that data doesn't mean much unless you can make sense of it and also prove to content providers that social plugins actually do make their sites more engaging. That's where Facebook's Insights System comes in. It's an analytics system giving you access to all that juicy data being collected. It offers stats like Like button analytics, Comments box analytics, Popular pages, Demographics, and Organic sharing. 

Imagine millions of websites and billions of pages and millions of people continually streaming data in via these social plugins. How do you make sense of all that data in real-time? It's a challenging problem.

Value Proposition

With an Insights System content producers can see what people like, which will enable content producers to generate more of what people like, which raises the content quality of the web, which gives users a better Facebook experience.

System Goals

Challenges

 Implemented a Bunch of Different Prototypes

The Winner: HBase + Scribe + Ptail + Puma

When we look at the messaging system and this analytics system, we notice what the two systems have in common: large numbers, HBase, real-time. The challenge of dealing with huge write loads in a reliable and timely fashion is a common substrate to these problems. Facebook is focussing on the HBase, Hadoop, HDFS ecosystem and counting on the operational quirks to be ironed out later. Others choose Cassandra because they love it's scalability, multi-datacenter functionality, and ease of operational use, but it doesn't fit as cleanly into the overall analytics stack.

What does this mean for you? Even if you aren't Facebook this architecture is simple enough and composed of enough off the shelf tools that it could work for much small projects too. 

Related Articles

Article originally appeared on (http://highscalability.com/).
See website for complete article licensing information.