Friday, February 19, 2010

Twitter’s Plan to Analyze 100 Billion Tweets

If Twitter is the “nervous system of the web,” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System, and Kevin Weil, as Analytics Lead at Twitter, is the homunculus within, in charge of figuring out what those more than 100 billion tweets (approximately the number of neurons in the human brain) mean.

Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to answer key business questions.

What type of questions is Twitter interested in answering? Questions that help them better understand Twitter. Questions like:

  1. How many requests do we serve in a day?
  2. What is the average latency?
  3. How many searches happen in a day?
  4. How many unique queries, how many unique users, what is their geographic distribution?
  5. What can we tell about a user from their tweets?
  6. Who retweets more?
  7. How does usage differ for mobile users?
  8. What went wrong at the same time?
  9. Which features get users hooked?
  10. What is a user’s reputation?
  11. How deep do retweets go?
  12. Which new features worked?

And many, many more. The questions help them understand Twitter; their analytics system helps them get the answers faster.

Hadoop and Pig are Used for Analysis

Any question you can think of requires analyzing big data for answers. 100 billion is a lot of tweets. That’s why Twitter uses Hadoop and Pig as their analysis platform. Hadoop provides: key-value storage on a distributed file system, horizontal scalability, fault tolerance, and map-reduce for computation. Pig is a query mechanism that makes it possible to write complex queries on top of Hadoop.
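
To make that concrete, a question like “how many requests do we serve in a day?” boils down to a classic map-reduce job: the mapper emits a day key for each log record, and the reducer sums the counts. In practice a query like this would likely be a few lines of Pig, which compiles down to jobs of this shape; the hand-written Hadoop version below is only a minimal sketch in Java, assuming newline-delimited request logs that start with a YYYY-MM-DD timestamp, and is not Twitter’s actual code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical job: count requests per day from plain text logs whose
    // lines start with a YYYY-MM-DD timestamp. A sketch only, not Twitter's code.
    public class RequestsPerDay {

      public static class DayMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text day = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String s = line.toString();
          if (s.length() >= 10) {              // grab the leading date
            day.set(s.substring(0, 10));
            context.write(day, ONE);           // emit (day, 1)
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text day, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : counts) {
            total += c.get();                  // sum the 1s for this day
          }
          context.write(day, new LongWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "requests-per-day");
        job.setJarByClass(RequestsPerDay.class);
        job.setMapperClass(DayMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Point it at a log directory and an output path and Hadoop spreads the work across the cluster; every question in the list above is some variation on this pattern.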

Saying you are using Hadoop is really just the beginning of the story. The rest of the story is figuring out the best way to use Hadoop. For example, how do you store data in Hadoop?

This may seem an odd question, but the answer has big consequences. In a relational database you don’t store the data, the database stores you, er, it stores the data for you. APIs move that data around in a row format.

Not so with Hadoop. Hadoop’s key-value model means it’s up to you how data is stored. Your choice has a lot to do with performance, how much data can be stored, and how agile you can be in reacting to future changes.

Each tweet has 12 fields, 3 of which have substructure, and the fields can and will change over time as new features are added. What is the best way to store this data?

Data is Stored in Protocol Buffers to Keep it Efficient and Flexible

Twitter considered CSV, XML, JSON, and Protocol Buffers as possible storage formats. Protocol Buffers is a way of encoding structured data in an efficient yet extensible format; Google uses it for almost all of its internal RPC protocols and file formats. BSON (binary JSON) was not evaluated, but would probably not work because it doesn’t have an IDL (interface definition language). Avro is one potential option they’ll look into in the future.
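
For a feel of what that looks like in practice, here is a hedged sketch in Java using the standard protobuf builder/parse API; the Tweet message and its fields are invented for illustration and are not Twitter’s actual schema.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    // Assumes a hypothetical tweets.proto compiled with protoc, roughly:
    //
    //   message Tweet {
    //     optional int64  id    = 1;
    //     optional string text  = 2;
    //     optional string user  = 3;
    //     optional Place  place = 4;   // nested sub-structure
    //     // new fields can be added later without breaking old readers
    //   }
    //
    // The generated class exposes the usual builder / parseFrom API.
    public class TweetRoundTrip {
      public static void main(String[] args) throws Exception {
        Tweet tweet = Tweet.newBuilder()
            .setId(12345L)
            .setText("hello, world")
            .setUser("example")
            .build();

        // Compact binary encoding on the way out to disk (or HDFS)...
        try (FileOutputStream out = new FileOutputStream("tweet.bin")) {
          tweet.writeTo(out);
        }

        // ...and parsed back; fields a reader does not know about are
        // carried along harmlessly, which is what makes the format
        // safe to evolve as new tweet features are added.
        try (FileInputStream in = new FileInputStream("tweet.bin")) {
          Tweet parsed = Tweet.parseFrom(in);
          System.out.println(parsed.getText());
        }
      }
    }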

An evaluation matrix was created which declared Protocol Buffers the winner. Protocol Buffers won because it allows data to be split across different nodes; it is reusable for data other than just tweets (logs, file storage, RPC, etc); it parses efficiently; fields can be added, changed, and deleted without having to change deployed code; the encoding is small; it supports hierarchical relationships. All the other options failed one or more of these criteria.

IDL Used for Codegen

Surprisingly, efficiency, flexibility, and other sacred geek metrics were not the only reasons Twitter liked Protocol Buffers. What is often considered a weakness, Protocol Buffers’ use of an IDL to describe data structures, is actually considered a big win by Twitter. Having to define a data structure IDL is often seen as a useless waste of time. But from the IDL they generate, as part of the build process, all Hadoop related code: Protocol Buffer InputFormats, OutputFormats, Writables, Pig LoadFuncs, Pig StoreFuncs, and more.

All the code that once was written by hand for each new data structure is now simply auto generated from the IDL. This saves a ton of effort and the code is much less buggy. The IDL actually saves time.
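
To give a sense of the glue being generated, here is a hand-written sketch (again in Java, with the same hypothetical Tweet message) of one such piece: a Hadoop Writable that carries a protobuf record by length-prefixing its serialized bytes. Twitter generates this kind of code from the IDL rather than writing it by hand.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // A sketch of the kind of adapter that gets generated from the IDL: a
    // Writable that carries the (hypothetical) protobuf Tweet message through
    // Hadoop by length-prefixing its serialized bytes. Hand-written here for
    // illustration; in the scheme described above it would be emitted by the
    // build from the .proto definition.
    public class TweetWritable implements Writable {
      private Tweet tweet;                        // the protobuf-generated message

      public TweetWritable() {}                   // Hadoop needs a no-arg constructor
      public TweetWritable(Tweet tweet) { this.tweet = tweet; }

      public Tweet get() { return tweet; }

      @Override
      public void write(DataOutput out) throws IOException {
        byte[] bytes = tweet.toByteArray();       // protobuf binary encoding
        out.writeInt(bytes.length);               // length prefix
        out.write(bytes);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        tweet = Tweet.parseFrom(bytes);           // rebuild the message
      }
    }

Multiply this by InputFormats, OutputFormats, and Pig load/store functions for every message type, and generating it all from a single IDL definition starts to look very attractive.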

At one point model driven auto generation was a common tactic on many projects. Then fashion moved to hand coding everything. Codegen, it seems, wasn't agile enough. Once you hand code everything you start really worrying about the verbosity of your language, which moved everyone to more dynamic languages, where ironically DSLs were still often listed as an advantage of languages like Ruby. Another consequence of hand coding was framework-of-the-week-itis. Frameworks help blunt the damage caused by thinking everything must be written from scratch.

It’s good to see code generation coming into fashion again. There’s a lot of power in using a declarative specification and then writing highly efficient, system specific code generators. Data structures are the low hanging fruit, but it’s also possible to automate larger more complex processes.

Overall it was a very interesting and useful talk. I like seeing the careful evaluation of different options based on knowing what you want and why.  It's refreshing to see how these smart choices can synergize and make a better and more stable system.

Related Articles

  1. Hadoop and Protocol Buffers at Twitter
  2. A Peek Under Twitter's Hood - Twitter’s open source page goes live.
  3. Hadoop
  4. ProtocolBuffers
  5. Hadoop Bay Area User Group - Feb 17th at Yahoo! - RECAP
  6. Twitter says "Today, we are seeing 50 million tweets per day—that's an average of 600 tweets per second."

Reader Comments (13)

That’s why Twitter uses Hadoop and Pig as their analysis platform
has a broken url

February 19, 2010 | Unregistered CommenterAnon

Thanks. It looks like all the URLs were escaped for some reason. Should be fixed now.

February 19, 2010 | Registered CommenterHighScalability Team

99 percent of all those tweets is junk. how on earth will they analyse anything out of that ?

Theoretically, every tweet is a human thought or a sense of communication. And analyzing this mass of data would mean identifying how human brain functions at various aspects - such as given time of day, day of week, geographic location, culture, trends and so on--- so understanding this human behavior constitutes identifying the brains of online mass. (restricting to the twitters only).

At the end its all Statistics - a proportion of the web which shall contribute to this phenomenal data.

Hope we learn more about how human chain of thoughts function and wish this contributes to the Neuroscientists as a research tool.

My $0.02

~Vj

February 19, 2010 | Unregistered Commentermoronkreacionz

Strange, twitter tweet ID overflowed just a couple months ago, how they can possibly have 100 billion of tweets?

February 20, 2010 | Unregistered Commenterantirez

i would substitute "nervous system", which implies these signals/tweets are outside the brain, with "synaptic firings", which states that tweets are more aptly the signals *inside* the brain. it is in the analysis of the signal firings or re-tweets that we find patterns, and in those patterns meaning.

more at the Synaptic Web, here:

http://synapticweb.pbworks.com/

@khrisloux

February 20, 2010 | Unregistered Commenter@khrisloux

I'll bet that Vivisimo's Velocity search engine could help them a lot.

February 20, 2010 | Unregistered CommenterSaracen

Interesting post, Todd.

A few copy errors in your post:

"5. What can we tall about as user from their tweets?" | Probably should be: What can we tell about a user...

"8. What when wrong at the same time?" | What went...

"11.How deep to retweets go?" | How deep do...

It's not about what can be analysed but what business purpose it serves.

The URIs within tweets offer much more value for profiling users anyway.

February 22, 2010 | Unregistered Commenter@angelmaldonado

"Protocol Buffers won because it allows data to be split across different nodes; it is reusable for data other than just tweets (logs, file storage, RPC, etc); it parses efficiently; fields can be added, changed, and deleted without having to change deployed code; the encoding is small; it supports hierarchical relationships. All the other options failed one or more of these criteria".

Just replace Protocol Buffers with Plain Text and enjoy the same :) Seriously, I did not get the rationale behind the PB choice, but eventually it is a matter of taste. Unless you are building something like HBase on top of HDFS, your internal storage format does not matter when you run lengthy batch jobs.

February 22, 2010 | Unregistered CommenterVlad Rodionov

"Today, we are seeing 50 million tweets per day—that's an average of 600 tweets per second."

What!?

600 Tweets per second!?

...

Wow... I'm more than incredibly underwhelmed... I mean I kind of imagined Twitter had all these scalability problems to solve because they had thousands or maybe even hundred thousands of tweets per second.

Gosh, if they wrote the Twitter system in C they could easily handle "ALL" that traffic with a single server.

Either way I still think Twitter has been a great idea, and they deserve all their credit, they have done a great job with the whole package and reputation of Twitter. But why not hire a C programmer to rewrite their stack, and reduce their costs hundred fold?

February 23, 2010 | Unregistered CommenterRobert Gould

@Robert Gould: I too was surprised by that low number. Though the problem as I understand is not with handling the incoming tweets, but sending them out to all subscribers. Keep in mind that some users have several thousands (or millions?) of followers.

Rewriting in C doesn't help much with scalability. It would make the system a little faster, but not more than an order of magnitude, at the expense of maintainability, stability and development speed.

February 24, 2010 | Unregistered CommenterMartin Vilcans

I think after a huge amount of analysis, and after a huge amount of data has been taken from the big Twitter database, we'll find out that we're not nearly as logical as we think and in fact very predictable.

Patterns will emerge that conclude that, like all other creatures on this earth, we are affected by gravity, seasons, night and day, and that x type of person is more likely to tweet x, etc. I think this experiment will certainly confirm the phrase that humans are creatures of habit.

July 29, 2010 | Unregistered CommenterRob Playford
