Entries in Twitter (10)

Monday
Dec 19, 2011

How Twitter Stores 250 Million Tweets a Day Using MySQL

Jeremy Cole, a DBA Team Lead/Database Architect at Twitter, gave a really good talk at the O'Reilly MySQL conference, Big and Small Data at @Twitter, which looked at Twitter from the data perspective.

One of the interesting stories he told was of the transition from Twitter's old way of storing tweets using temporal sharding to a more distributed approach: a new tweet store called T-bird, built on top of Gizzard, which in turn is built on MySQL.
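
To make the old scheme concrete, here is a minimal Ruby sketch of temporal sharding (purely illustrative; the shard table, names, and DSNs are invented, not Twitter's actual code). Each shard owns a window of time, so a tweet's creation date alone decides which MySQL box it lives on:

    # Hypothetical sketch of temporal sharding: each MySQL shard holds a
    # contiguous slice of time, so routing a tweet only needs its timestamp.
    require 'date'

    SHARDS = [
      { before: Date.new(2010, 1, 1), dsn: 'mysql://tweets-2009' },
      { before: Date.new(2011, 1, 1), dsn: 'mysql://tweets-2010' },
      { before: Date.new(2012, 1, 1), dsn: 'mysql://tweets-2011' }
    ].freeze

    def shard_for(created_at)
      shard = SHARDS.find { |s| created_at < s[:before] }
      raise 'no shard for this date -- time to provision a new one' unless shard
      shard[:dsn]
    end

    puts shard_for(Date.new(2011, 12, 19))   # => mysql://tweets-2011

The catch, and part of the motivation for T-bird, is that every new tweet lands on the newest shard, so the write load never spreads out.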

Twitter's original tweet store:

Click to read more ...

Tuesday
Nov 29, 2011

DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second

I remember the excitement when Twitter first opened up their firehose. As an early adopter of the Twitter API I could easily imagine some of the cool things you could do with all that data. I also remember the disappointment of learning that in the land of BigData, data has a price, and that price would be too high for little fish like me. It was like learning for the first time there would be no BigData Santa Claus.

For a while though I had the pleasure of pondering just how I would handle all that data. It's a fascinating problem. You have to be able to reliably consume it, normalize it, merge it with other data, apply functions on it, store it, query it, distribute it, and oh yeah, monetize it. Most of that in realish-time. And if you are trying to create a platform for allowing the entire Internet to do the same thing to the firehose, the challenge is exponentially harder.

DataSift is in the exciting position of creating just such a firehose eating, data chomping machine. You see, DataSift has bought multi-year re-syndication rights from Twitter, which grants them access to the full Twitter firehose with the ability to resell subsets of it to other parties, which could be anyone, but the primary target is of course businesses. Gnip is the only other company to have these rights.

DataSift grew out of the experience of Nick Halstead, Founder and CTO of DataSift, with TweetMeme, a popular real-time Twitter news aggregator that at one time handled 1.1 billion page views per day. TweetMeme is famous for inventing the social signaling mechanism better known as the retweet, with their retweet button, an idea that came out of an even earlier startup called fav.or.it (favorite). Imagine, if you will, a time before like buttons were plastered all over the virtual place.

So processing tweets at scale is nothing new for the folks at DataSift; the challenge has been turning that experience into an Internet-scale platform so that everyone else can do the same thing. That has been a multi-year odyssey.

DataSift is positioning itself as a realtime datamining platform. The platform angle here is really the key take-home message. They are pursuing a true platform strategy for processing real-time streams. TweetMeme, while successful, could never be a billion dollar company, but a BigData platform could grow that large, so that's the direction they are headed. A money quote from Nick highlights the logic in neon: "There's no money in buttons, there's money in data."

Part of the strategy behind a platform play is to become the incumbent player by building a giant technological moat around your core value proposition. When others come a knockin they can't cross over your moat because of your towering technological barrier to entry. That's what DataSift is trying to do. The drawbridge on the moat is favored access to Twitter's firehose, but the real power is in the Google quality real-time data processing platform infrastructure that they are trying to create. 

DataSift's real innovation is in creating an Internet scale filtering system that can quickly evaluate very large filters (think Lady Gaga follower size) combined with the virtuous economics of virtualization, where the more customers you have the more money you make because they are sharing resources.
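
DataSift's real filtering language and engine aren't shown here, so treat the following as a rough Ruby illustration of the core idea only: compile each customer's filter into a predicate once, then run every tweet in the shared stream past all of them.

    # Illustrative only: customer filters as compiled predicates evaluated
    # against one shared firehose. A real engine would index and batch the
    # predicates rather than looping over them one by one.
    Filter = Struct.new(:name, :predicate)

    FILTERS = [
      Filter.new('gaga_mentions', ->(t) { t[:text].downcase.include?('gaga') }),
      Filter.new('london_users',  ->(t) { t[:user][:location] == 'London' })
    ]

    def matching_filters(tweet)
      FILTERS.select { |f| f.predicate.call(tweet) }.map(&:name)
    end

    tweet = { text: 'Lady Gaga live tonight!', user: { location: 'London' } }
    puts matching_filters(tweet).inspect   # => ["gaga_mentions", "london_users"]

The virtuous economics described above fall out of exactly this shape: one pass over the shared stream serves every customer at once.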

How are they making all this magic happen? Let's see...

Click to read more ...

Monday
Mar 14, 2011

Twitter by the Numbers - 460,000 New Accounts and 140 Million Tweets Per Day

Twitter has taken some hits lately, but they are still pumping out the tweets and adding massive numbers of new users. Here are some numbers they just published, hot off the analytics engine:

Click to read more ...

Friday
Feb 19, 2010

Twitter’s Plan to Analyze 100 Billion Tweets

If Twitter is the “nervous system of the web” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System, and Kevin Weil, Analytics Lead at Twitter, is the homunculus within, in charge of figuring out what those 100 billion-plus tweets (approximately the number of neurons in the human brain) mean.

Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to answer key business questions.
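
The real pipeline is Hadoop jobs over Protocol Buffer-encoded logs, which is beyond an excerpt; as a purely illustrative stand-in, here is the shape of one such business question ("how many tweets per day?") written as a tiny map/reduce-style aggregation in Ruby:

    # Purely illustrative: a "key business question" expressed as a
    # map/reduce-style aggregation. The real thing runs as Hadoop jobs
    # over billions of records, not an in-memory array.
    tweets = [
      { id: 1, created_at: '2010-02-18', text: 'hello' },
      { id: 2, created_at: '2010-02-18', text: 'world' },
      { id: 3, created_at: '2010-02-19', text: 'scaling' }
    ]

    per_day = tweets
                .map { |t| [t[:created_at], 1] }                  # map: emit (day, 1)
                .group_by(&:first)                                # shuffle: group by day
                .transform_values { |pairs| pairs.sum(&:last) }   # reduce: sum the 1s

    puts per_day.inspect   # => {"2010-02-18"=>2, "2010-02-19"=>1}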

What type of questions is Twitter interested in answering? Questions that help them better understand Twitter. Questions like:

Click to read more ...

Saturday
Jun 27, 2009

Scaling Twitter: Making Twitter 10000 Percent Faster

Update 6: Some interesting changes from Twitter's Evan Weaver: everything is in RAM now, the database is a backup; peaks at 300 tweets/second; every tweet is followed by an average of 126 people; vector cache of tweet IDs; row cache; fragment cache; page cache; keep separate caches; GC makes Ruby resistant to optimization so they went with Scala; Thrift and HTTP are used internally; 100s of internal requests for every external request; rewrote the MQ but kept the interface the same; 3 queues are used to load balance requests; extensive A/B testing for backwards compatibility; switched to the C memcached client for speed; optimize the critical path; it's faster to get cached results from network memory than to recompute them locally.
Update 5: Twitter on Scala. A Conversation with Steve Jenson, Alex Payne, and Robey Pointer by Bill Venners. A fascinating discussion of why Twitter moved to the Java JVM for their server infrastructure (long lived processes) and why they moved to Scala to program against it (high level language, static typing, functional). Ruby is used on the front-end but wasn't performant or reliable enough for the back-end.
Update 4: Improving Running Components at Twitter by Evan Weaver. Tells how Twitter changed their infrastructure to go from handling 3 requests per second to 139 requests per second. They moved to a messaging model, asynchronous processing, 3 levels of cache, and moved their middleware to a mixture of C and Scala/JVM.
Update 3: Upgrading Twitter without service disruptions by Gojko Adzic. Lots of good updates on the new Twitter architecture.
Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
Update: Twitter releases Starling - light-weight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.
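
Because Starling speaks the memcached protocol, a producer or consumer needs nothing more exotic than a memcached client: set appends a job to a named queue and get pops one off. A hedged sketch, assuming the old memcache-client gem and Starling's default port; the queue name and job payload are made up:

    # Sketch of using Starling through a plain memcached client.
    # Assumes the memcache-client gem and Starling listening on 22122.
    require 'memcache'

    starling = MemCache.new('localhost:22122')

    # Producer: enqueue by "setting" onto the queue's name.
    starling.set('tweets_to_deliver', { tweet_id: 42, follower_id: 99 })

    # Consumer: dequeue by "getting" the queue's name; nil means empty.
    while (job = starling.get('tweets_to_deliver'))
      puts "deliver tweet #{job[:tweet_id]} to follower #{job[:follower_id]}"
    end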

Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:

For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.

If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?

Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz

Site: http://twitter.com

Information Sources

  • Scaling Twitter Video by Blaine Cook.
  • Scaling Twitter Slides
  • Good News blog post by Rick Denatale
  • Scaling Twitter blog post by Patrick Joyce.
  • Twitter API Traffic is 10x Twitter’s Site.
  • A Small Talk on Getting Big. Scaling a Rails App & all that Jazz - really cute dog pics

    The Platform

  • Ruby on Rails
  • Erlang
  • MySQL
  • Mongrel - hybrid Ruby/C HTTP server designed to be small, fast, and secure
  • Munin
  • Nagios
  • Google Analytics
  • AWStats - real-time logfile analyzer to get advanced statistics
  • Memcached

    The Stats

  • Over 350,000 users. The actual numbers are as always, very super super top secret.
  • 600 requests per second.
  • Average 200-300 connections per second. Spiking to 800 connections per second.
  • MySQL handled 2,400 requests per second.
  • 180 Rails instances. Uses Mongrel as the "web" server.
  • 1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
  • 30+ processes for handling odd jobs.
  • 8 Sun X4100s.
  • Process a request in 200 milliseconds in Rails.
  • Average time spent in the database is 50-100 milliseconds.
  • Over 16 GB of memcached.

    The Architecture

  • Ran into very public scaling problems. The little bird of failure popped up a lot for a while.
  • Originally they had no monitoring, no graphs, no statistics, which makes it hard to pinpoint and solve problems. Added Munin and Nagios. There were difficulties using tools on Solaris. Had Google analytics but the pages weren't loading so it wasn't that helpful :-)
  • Use caching with memcached a lot.
    - For example, if getting a count is slow, you can memoize the count into memcache in a millisecond (a sketch of this pattern follows this list).
    - Getting your friends status is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in cache instead. It never touches the database. This gives a predictable response time frame (upper bound 20 msecs).
    - ActiveRecord objects are huge so that's why they aren't cached. So they want to store critical attributes in a hash and lazy load the other attributes on access.
    - 90% of requests are API requests. So don't do any page/fragment caching on the front-end. The pages are so time sensitive it doesn't do any good. But they cache API requests.
  • Messaging
    - Use messaging a lot. Producers produce messages, which are queued and then distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc).
    - Send messages to invalidate a friend's cache in the background instead of doing them all individually and synchronously.
    - Started with DRb, which stands for Distributed Ruby, a library that lets you send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and a single point of failure.
    - Moved to Rinda, which is a shared queue that uses a tuplespace model, along the lines of Linda. But the queues aren't persistent and messages are lost on failure.
    - Tried Erlang. Problem: how do you get a broken server running on a Sunday morning with 20,000 users waiting? The developer didn't know. Not a lot of documentation. So it violates the use-what-you-know rule.
    - Moved to Starling, a distributed queue written in Ruby.
    - Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well.
  • SMS is handled using APIs supplied by third-party gateways. It's very expensive.
  • Deployment
    - They do a review and push out new mongrel servers. No graceful way yet.
    - An internal server error is given to the user if their mongrel server is replaced.
    - All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels and a rolling approach would cause all the queues in the remaining mongrels to fill up.
  • Abuse
    - A lot of down time because people crawl the site and add everyone as friends. 9000 friends in 24 hours. It would take down the site.
    - Build tools to detect these problems so you can pinpoint when and where they are happening.
    - Be ruthless. Delete them as users.
  • Partitioning
    - Plan to partition in the future. Currently they don't. These changes have been enough so far.
    - The partition scheme will be based on time, not users, because most requests are very temporally local.
    - Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. May write to a read-only slave, which is really bad.
  • Twitter's API Traffic is 10x Twitter’s Site
    - Their API is the most important thing Twitter has done.
    - Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create.
  • Monit is used to kill processes if they get too big.
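
The count-memoization pattern mentioned in the caching bullets above looks roughly like this; a hedged sketch in which the memcache-client gem, key format, TTL, and query are illustrative stand-ins, not Twitter's code:

    # Hedged sketch of memoizing a slow count into memcached.
    require 'memcache'

    CACHE = MemCache.new('localhost:11211')

    def follower_count(user_id)
      key = "user:#{user_id}:follower_count"
      cached = CACHE.get(key)
      return cached if cached

      count = slow_count_from_mysql(user_id)   # the expensive part
      CACHE.set(key, count, 60)                # memoize for 60 seconds
      count
    end

    def slow_count_from_mysql(user_id)
      # stand-in for: SELECT COUNT(*) FROM followings WHERE followee_id = ?
      sleep 0.05
      1234
    end

The friend-status cache described above works the same way, except the cache is written at tweet time, so reads never have to fall through to the database at all.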

    Lessons Learned

  • Talk to the community. Don't hide and try to solve all problems yourself. Many brilliant people are willing to help if you ask.
  • Treat your scaling plan like a business plan. Assemble a board of advisers to help you.
  • Build it yourself. Twitter spent a lot of time trying other people's solutions that just almost seemed to work, but not quite. It's better to build some things yourself so you at least have some control and you can build in the features you need.
  • Build in user limits. People will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed.
  • Don't make the database the central bottleneck of doom. Not everything needs to require a gigantic join. Cache data. Think of other creative ways to get the same result. A good example is talked about in Twitter, Rails, Hammers, and 11,000 Nails per Second.
  • Make your application easily partitionable from the start. Then you always have a way to scale your system.
  • Realize your site is slow. Immediately add reporting to track problems.
  • Optimize the database.
    - Index everything. Rails won't do this for you.
    - Use EXPLAIN to see how your queries are running. Indexes may not be being used as you expect.
    - Denormalize a lot. It single-handedly saved them. For example, they store all of a user's friend IDs together, which prevents a lot of costly joins (see the sketch after this list).
    - Avoid complex joins.
    - Avoid scanning large sets of data.
  • Cache the hell out of everything. Individual active records are not cached, yet. The queries are fast enough for now.
  • Test everything.
    - You want to know when you deploy an application that it will render correctly.
    - They have a full test suite now. So when the caching broke they were able to find the problem before going live.
  • Long running processes should be abstracted to daemons.
  • Use exception notifier and exception logger to get immediate notification of problems so you can address them right away.
  • Don't do stupid things.
    - Scale changes what can be stupid.
    - Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it works great.
  • Most performance comes not from the language, but from application design.
  • Turn your website into an open service by creating an API. Their API is a huge reason for Twitter's success. It allows users to create an ever-expanding ecosystem around Twitter that is difficult to compete with. You can never do all the work your users can do and you probably won't be as creative. So open your application up and make it easy for others to integrate your application with theirs.
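
The denormalization lesson above is the clearest win to show in code. A hedged sketch, with invented table and column names: instead of joining against a followings table on every timeline render, keep one row per user whose friend_ids column is a serialized list.

    # Hedged sketch of denormalizing friend IDs into a single value.
    # Table and column names are invented for illustration.

    # Normalized: a scan/join against the followings table per request.
    #   SELECT friend_id FROM followings WHERE user_id = ?
    #
    # Denormalized: one indexed row fetch.
    #   SELECT friend_ids FROM user_friends WHERE user_id = ?

    # Turning the stored value back into IDs on read:
    def friend_ids(row)
      row['friend_ids'].split(',').map(&:to_i)
    end

    puts friend_ids('friend_ids' => '7,19,23').inspect   # => [7, 19, 23]

The trade-off is that adding or removing a friend means rewriting the whole list, which is why this works best for data that is read far more often than it is written.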

    Related Articles

  • For a discussion of partitioning take a look at Amazon Architecture, An Unorthodox Approach to Database Design: The Coming of the Shard, and Flickr Architecture
  • The Mailinator Architecture has good strategies for abuse protection.
  • GoogleTalk Architecture addresses some interesting issues when scaling social networking sites.
Monday
Apr 20, 2009

Some things about Memcached from a Twitter software developer

Memcached is generally treated as a black box. But what if you really need to know what's in there? Not for runtime purposes, but for optimization and capacity planning?
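
Evan's own approach (covered in the full post) digs deeper than this, but the simplest way to poke at a live memcached is its plain-text protocol: stats for the big picture, plus the unofficial stats items and stats cachedump for a rough look at what keys are in there. A hedged Ruby sketch:

    # Peeking into a running memcached over its text protocol.
    # 'stats cachedump' is unofficial and may be limited or absent in some
    # versions -- treat this as exploration, not production tooling.
    require 'socket'

    def memcached_command(cmd, host = 'localhost', port = 11211)
      sock = TCPSocket.new(host, port)
      sock.write("#{cmd}\r\n")
      lines = []
      while (line = sock.gets) && line.strip != 'END'
        lines << line.strip
      end
      sock.close
      lines
    end

    puts memcached_command('stats').grep(/curr_items|evictions|limit_maxbytes/)
    puts memcached_command('stats items').first(5)       # per-slab item counts
    puts memcached_command('stats cachedump 1 10')       # sample keys in slab 1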

Read more from Evan Weaver, a software developer at Twitter (and a contributor to Rails core and Mongrel).

Click to read more ...

Tuesday
Apr 14, 2009

Designing a Scalable Twitter

There has been a lot of talk recently about Twitter's scalability and their choice of a language like Scala to address the limits of their existing Ruby-based system. In this post I try to provide a more methodical approach to handling Twitter's scalability challenges, one centered around the right choice of architecture patterns rather than the language itself. The architecture patterns are given in a generic fashion that is not specific to Twitter and can serve anyone looking to build a scalable real-time web application in the near future.

Click to read more ...

Sunday
Apr 5, 2009

At Some Point the Cost of Servers Outweighs the Cost of Programmers

This is the intriguing quote from Bill Venners in an interview with Twitter's Alex Payne on Twitter's heretical switch from a pure Ruby stack to a mixed stack: Ruby on Rails on the front-end and JVM/Scala on the back-end:

    So performance was also one of the problems with JRuby, which I [Bill Venners] think helps explain better why they'd [Twitter] prefer Scala over Ruby or JRuby for some things. I have often heard Rubyists say that although Ruby is slower than Java, for many things it is plenty fast enough, and they are right. The logic goes further, saying that servers are cheap, and programmers expensive, so it makes sense to tradeoff some runtime performance for programmer productivity. And I think that's very often true too, but not always. If you have enough traffic, at some point the cost of servers outweighs the cost of programmers. I'm not sure whether Twitter is past that point, but they get a lot of traffic. And frankly this isn't an intrinsic tradeoff. Other dynamic languages are faster than Ruby, and Scala is too. And people can be quite productive in these other languages too, including Scala.
I feel Alex's Max Payne. You might wonder why the geekosphere cares so passionately which technology stack Twitter uses? Well, it's Twitter and it's Ruby on Rails. That's like the Lindsay Lohan and Samantha Ronson of tech buzz. It creates its own self-sustaining posting reaction. Boom!

It took some giant cojones to switch from a well defended platform like Ruby on Rails to an obscure language like Scala. Few people would have been brave enough to pull the trigger on that decision. Twitter didn't take this large leap out of ignorance or incompetence. Twitter's Steve Jenson said they "spent several weeks going over our options, running extensive load tests, and presented our findings to the team at each stage. We did our due diligence." They did the work and came to conclusions valid for their situation. They have to follow their own bliss. They aren't telling you to use Scala. They aren't telling you not to use Ruby. Have at it. But they have chosen the path less traveled and seem happy with the direction they are heading. If you aren't happy with their decision then that's a you problem, not a them problem.

This points out for me the evolving nature of the two tier web: a client tier and a back-end tier accessed only via an API. That back-end tier can be implemented in anything at all. Twitter likes the JVM for its undeniable performance, reliability, and scalability. They may like Scala because it has a lot of cool features, is concise, has static typing, performs well, and is a pleasure to develop in. When people jumped with a happy little grin on their face to Ruby, they loved a lot of things about Ruby that helped them make that decision to switch. Maybe Scala is cool, effective, and fun to use too?

So instead of worrying how the homeboys have ever so indirectly dissed Ruby, maybe it's worth taking a look at Scala to see if it's actually any good? As a start try Bill Venners on the rise of Scala. It's a great overview of Scala and why it's a worthy next evolution. Then maybe take a peek at First Steps to Scala. And if you want to look at some real-life code try Kestrel, a tiny queue system based on Starling.

Knowing only the vaguest bits about Scala before my own little exploration, I come out impressed and interested. I like a lot of things Bill had to say: Scala means scalable language, as in it scales to different sized tasks and domains; Scala's static typing is valuable, especially in a team project; Scala has the feel of a scripting language, but can also work as a systems language; Scala is an artful blend of a fully object-oriented and fully functional language; Scala supports imperative programming, but with a functional bent; Scala can express more in types than Java so you get more rigorous type checking; Scala is concise so you can say more with less; Scala treats libraries like creating internal DSLs; Scala helps programmer productivity like Ruby and Python. Sounds like Scala is worth a deeper look to me.

Can't we all just get along? Well, that's not how this post started at least. It started with Bill's refreshingly anti-establishment statement: if you have enough traffic, at some point the cost of servers outweighs the cost of programmers. This is an idea we don't hear much of anymore in this era of abundance. Servers cost real cash though, especially for bootstrappers and the truly humongous. So if you can cut your cash requirements by putting in a little programmer elbow grease, that makes a lot of sense to me. Performance still matters.

    Related Articles

  • Twitter on Scala by Bill Venners
  • My Reasoned Response about Scala at Twitter by Obie Fernandez.
  • Twitter Blaming Ruby for their Mistakes by Tony Arcieri
  • Scala's home page
  • Hacker News Thread on Message Queue Options
  • The Secret Behind Twitter's Growth by Kate Greene
  • Why Scala for Web 2.0? by Alex Payne
  • Mending The Bitter Absence of Reasoned Technical Discussion by Alex Payne

Click to read more ...

Wednesday
Mar 18, 2009

QCon London 2009: Upgrading Twitter without service disruptions

Evan Weaver from Twitter presented a talk on Twitter software upgrades, titled Improving Running Components, as part of the Systems That Never Stop track at the QCon London 2009 conference last Friday. The talk focused on several upgrades performed since last May, while Twitter was experiencing serious performance problems.

Click to read more ...

Monday
May 19, 2008

Twitter as a scalability case study

A lot has been said already about Twitter's scalability issues. Many have held Twitter up as an anti-pattern of how not to deal with scalability and have suggested different solutions for scaling it. As Twitter is famously a Ruby on Rails deployment, this case has also been used as a weapon in the language/platform wars between the RoR and Java camps, and to a lesser degree, the LAMP (PHP) camp as well.

Click to read more ...