Entries in Example (248)

Tuesday
May272008

eBay Architecture

Update 2: EBay's Randy Shoup spills the secrets of how to service hundreds of millions of users and over two billion page views a day in Scalability Best Practices: Lessons from eBay on InfoQ. The practices: Partition by Function, Split Horizontally, Avoid Distributed Transactions, Decouple Functions Asynchronously, Move Processing To Asynchronous Flows, Virtualize At All Levels, Cache Appropriately. Update: eBay Serves 5 Billion API Calls Each Month. Aren't we seeing more and more traffic driven by mashups composed on top of open APIs? APIs are no longer a bolt on, they are your application. Architecturally that argues for implementing your own application around the same APIs developers and users employ. Who hasn't wondered how eBay does their business? As one of the largest most loaded websites in the world, it can't be easy. And the subtitle of the presentation hints at how creating such a monster system requires true engineering: Striking a balance between site stability, feature velocity, performance, and cost. You may not be able to emulate how eBay scales their system, but the issues and possible solutions are worth learning from. Site: http://ebay.com

Information Sources

  • The eBay Architecture - Striking a balance between site stability, feature velocity, performance, and cost.
  • Podcast: eBay’s Transactions on a Massive Scale
  • Dan Pritchett on Architecture at eBay interview by InfoQ

    Platform

  • Java
  • Oracle
  • WebSphere, servlets
  • Horizontal Scaling
  • Sharding
  • Mix of Windows and Unix

    What's Inside?

    This information was adapted from Johannes Ernst's Blog

    The Stats

  • On an average day, it runs through 26 billion SQL queries and keeps tabs on 100 million items available for purchase.
  • 212 million registered users, 1 billion photos
  • 1 billion page views a day, 105 million listings, 2 petabytes of data, 3 billion API calls a month
  • Something like a factor of 35 in page views, e-mails sent, bandwidth from June 1999 to Q3/2006.
  • 99.94% availability, measured as "all parts of site functional to everybody" vs. at least one part of a site not functional to some users somewhere
  • The database is virtualized and spans 600 production instances residing in more than 100 server clusters.
  • 15,000 application servers, all J2EE. About 100 groups of functionality aka "apps". Notion of a "pool": "all the machines that deal with selling"...

    The Architecture

  • Everything is planned with the question "what if load increases by 10x". Scaling only horizontal, not vertical: many parallel boxes.
  • Architectures is strictly divided into layers: data tier, application tier, search, operations,
  • Leverages MSXML framework for presentation layer (even in Java)
  • Oracle databases, WebSphere Java (still 1.3.1)
  • Split databases by primary access path, modulo on a key.
  • Every database has at least 3 on-line databases. Distributed over 8 data centers
  • Some database copies run 15 min behind, 4 hours behind
  • Databases are segmented by function: user, item account, feedback, transaction, over 70 in all.
  • No stored procedures are used. There are some very simple triggers.
  • Move cpu-intensive work moved out of the database layer to applications applications layer: referential integrity, joins, sorting done in the application layer! Reasoning: app servers are cheap, databases are the bottleneck.
  • No client-side transactions. no distributed transactions
  • J2EE: use servlets, JDBC, connection pools (with rewrite). Not much else.
  • No state information in application tier. Transient state maintained in cookie or scratch database.
  • App servers do not talk to each other -- strict layering of architecture
  • Search, in 2002: 9 hours to update the index running on largest Sun box available -- not keeping up.
  • Average item on site changes its search data 5 times before it is sold (e.g. price), so real-time search results are extremely important.
  • "Voyager": real-time feeder infrastructure built by eBay.. Uses reliable multicast from primary database to search nodes, in-memory search index, horizontal segmentation, N slices, load-balances over M instances, cache queries.

    Lessons Learned

  • Scale Out, Not Up – Horizontal scaling at every tier. – Functional decomposition.
  • Prefer Asynchronous Integration – Minimize availability coupling. – Improve scaling options.
  • Virtualize Components – Reduce physical dependencies. – Improve deployment flexibility.
  • Design for Failure – Automated failure detection and notification. – “Limp mode” operation of business features.
  • Move work out of the database into the applications because the database is the bottleneck. Ebay does this in the extreme. We see it in other architecture using caching and the file system, but eBay even does a lot of traditional database operations in applications (like joins).
  • Use what you like and toss what you don't need. Ebay didn't feel compelled to use full blown J2EE stack. They liked Java and Servlets so that's all they used. You don't have to buy into any framework completely. Just use what works for you.
  • Don't be afraid to build solutions that meet and evolve with your needs. Every off the shelf solution will fail you at some point. You have to go the rest of the way on your own.
  • Operational controls become a larger and larger part of scalability as you grow. How do you upgrade, configure, and monitor thousands of machines will running a live system?
  • Architectures evolve. You need to be able to change, refine, and develop your new system while keeping your existing site running. That's the primary challenge of any growing website.
  • It's a mistake to worry too much about scalability from the start. Don't suffer from paralysis by analysis and worrying about traffic that may never come.
  • It's also a mistake not to worry about scalability at all. You need to develop an organization capable of dealing with architecture evolution. Understand you are never done. Your system will always evolve and change. Build those expectations and capabilities into your business from the start. Don't let people and organizations be why your site fails. Many people will think the system should be perfect from the start. It doesn't work that way. A good system is developed overtime in response to real issues and concerns. Expect change and adapt to change.

    Click to read more ...

  • Wednesday
    May142008

    New Facebook Chat Feature Scales to 70 Million Users Using Erlang

    UpdateErlang at Facebook by Eugene Letuchy. How Facebook uses Erlang to implement Chat, AIM Presence, and Chat Jabber support. 

    I've done some XMPP development so when I read Facebook was making a Jabber chat client I was really curious how they would make it work. While core XMPP is straightforward, a number of protocol extensions like discovery, forms, chat states, pubsub, multi user chat, and privacy lists really up the implementation complexity. Some real engineering challenges were involved to make this puppy scale and perform. It's not clear what extensions they've implemented, but a blog entry by Facebook's Eugene Letuchy hits some of the architectural challenges they faced and how they overcame them.

    A web based Jabber client poses a few problems because XMPP, like most IM protocols, is an asynchronous event driven system that pretty much assumes you have a full time open connection. After logging in the server sends a client roster information and presence information. Your client has to be present to receive the information. If your client wants to discover the capabilities of another client then a request is sent over the wire and some time later the response comes back. An ID is used to map the reply to the request. All responses are intermingled. IM messages can come in at any time. Subscription requests can come in at any time.

    Facebook has the client open a persistent connection to the IM server and uses long polling to send requests and continually get data from the server. Long polling is a mixture of client pull and server push. It works by having the client make a request to the server. The client connection blocks until the server has data to return. When it does data is returned, the client processes it, and then is in position to make another request of the server and get any more data that has queued up in the mean time. Obviously there are all sorts of latency, overhead, and resource issues with this approach. The previous link discusses them in more detail and for performance information take a look at Performance Testing of Data Delivery Techniques for AJAX Applications by Engin Bozdag, Ali Mesbah and Arie van Deursen.

    From a client perspective I think this approach is workable, but obviously not ideal. Your client's IMs, presence changes, subscription requests, and chat states etc are all blocked on the polling loop, which wouldn't have a predictable latency. Predictable latency can be as important as raw performance.

    The real scaling challenge is on the server side. With 70 million people how do you keep all those persistent connections open? Well, when you read another $100 million was invested in Facebook for hardware you know why. That's one hella lot of connections. And consider all the data those IM servers must store up in between polling intervals. Looking at the memory consumption for their servers would be like watching someone breath. Breath in- streams of data come in and must be stored waiting for the polling loop. Breath out- the polling loops hit and all the data is written to the client and released from the server. A ceaseless cycle. In a stream based system data comes in and is pushed immediately out the connection. Only socket queue is used and that's usually quite sufficient. Now add network bandwidth for all the XMPP and TCP protocol overhead and CPU to process it all and you are talking some serious scalability issues.

    So, how do you handle all those concurrent connections? They chose Erlang. When you first hear Erlang and Jabber you think ejabberd, an open source Erlang based XMPP server. But since the blog doesn't mention ejabberd it seems they haven't used it .

    Why Erlang? First, the famous Yaws vs Apache shootout where "Apache dies at about 4,000 parallel sessions. Yaws is still functioning at over 80,000 parallel connections." Erlang is naturally good at solving high concurrency problems. Yet following the rule that no benchmark can go unchallenged, Erik Onnen calls this the Worst Measurement Ever and has some good reasoning behind it.

    In any case, Erlang does nicely match the problem space. Erlang's approach to a concurrency problem is to throw a very light weight Erlang process at each state machine you want to be concurrent. Code-wise that's more natural than thread pools, async IO, or thread per connection systems. Until Linux 2.6 it wasn't even possible to schedule large numbers of threads on a single machine. And you are still devoting a lot of unnecessary stack space to each thread. Erlang will make excellent use of machine resources to handle all those connections. Something anyone with a VPS knows is hard to do with Apache. Apache sucks up memory with joyous VPS killing abandon.

    The blog says C++ is used to log IM messages. Erlang is famously excellent for its concurrency prowess and equally famous for being poor at IO, so I imagine C++ was needed for efficiency.

    One of the downsides of multi-language development is reusing code across languages. Facebook created Thrift to tie together the Babeling Tower of all their different implementation languages. Thrift is a software framework for scalable cross-language services development. It combines a powerful software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, and Ruby. Another approach might be to cross language barriers using REST based services.

    A problem Facebook probably doesn't have to worry about scaling is the XMPP roster (contact list). Handling that many user accounts would challenge most XMPP server vendors, but Facebook has that part already solved. They could concentrate on scaling the protocol across a bunch of shiny new servers without getting bogged down in database issues. Wouldn't that be nice :-) They can just load balance users across servers and scalability is solved horizontally, simply by adding more servers. Nice work.

    Friday
    May022008

    Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App

    Update: Jake in Does Django really scale better than Rails? thinks apps like FFS shouldn't need so much hardware to scale.

    In a short three months Friends for Sale (think Hot-or-Not with a market economy) grew to become a top 10 Facebook application handling 200 gorgeous requests per second and a stunning 300 million page views a month. They did all this using Ruby on Rails, two part time developers, a cluster of a dozen machines, and a fairly standard architecture. How did Friends for Sale scale to sell all those beautiful people? And how much do you think your friends are worth on the open market? 

    Site: http://www.facebook.com/apps/application.php?id=7019261521

    Information Sources

  • Siqi Chen and Alexander Le, co-creators of Friends for Sale, answering my standard questionairre.
  • Virality on Facebook

    The Platform

  • Ruby on Rails
  • CentOS 5 (64 bit)
  • Capistrano - update and restart application servers.
  • Memcached
  • MySQL
  • Nginx
  • Starling - distributed queue server
  • Softlayer - hosting service
  • Pingdom - for website monitoring
  • LVM - logical volume manager
  • Dr. Nics Magic Multi-Connections Gem - split database reads and writes to servers

    The Stats

  • 10th most popular application on Facebook.
  • Nearly 600,000 active users.
  • Half a million unique visitors a day and growing fast.
  • 300 million page views a month.
  • 300% monthly growth rate, but that is plateauing.
  • 2.1 million unique visitors in the past month
  • 200 requests per second.
  • 5TB of bandwidth per month.
  • 2 part time (now full time), and 1 remote DBA contractor.

  • 4 DB servers, 6 application servers, 1 staging server, and 1 front end server.
    - 6, 4 core 8 GB application servers.
    - Each application server runs 16 mongrels for a total of 96 mongrels. -
    - 4 GB memcache instance on each application server
    - 2 32GB 4 core servers with 4x 15K SCSI RAID 10 disks in a master-slave setup

    Getting to Know You

  • What is your system is for?

    Our system is designed for our Facebook application, Friends for Sale.
    It's basically Hot-or-Not with a market economy. At the time of this
    writing it's the 10th most popular application on Facebook.

    Their Facebook description reads: Buy and sell your friends as pets! You can make your pets poke, send gifts, or just show off for you.
    Make money as a shrewd pets investor or as a hot commodity! Friends for Sale is the bees knees!


  • Why did you decide to build this system?

    We designed this as more of an experiment to see if we understood virality concepts and metrics on Facebook. I guess we do. =)

  • What particular design/architecture/implementation challenges do your system have?

    As a Facebook application, every request is dynamic so no page caching is possible. Also, it is a very interactive, write heavy application so scaling the database was a challenge.

  • What did you do to meet these challenges?

    We memcached extensively early on - every page reload results in 0 SQL calls. We use Rail's fragment caching with custom expiration logic mostly.

  • How big is your system?

    We had more than half a million unique visitors yesterday and growing fast. We're on track to do more than 300 million page views this month.

  • What is your in/out bandwidth usage?

    We used around 3 terabytes of bandwidth last month. This month should be at least 5TB or so. This number is just for a few icons and XHTML/CSS.

  • How many documents, do you server? How many images? How much data?

    We don't really have unique documents ... we do have around 10 million user profiles though.

    The only images we store are a few static image icons.

  • How fast are you growing?

    We went from around 3M page views per day a month ago to more than 10M page views a day. A month before that we were doing 1M page views per day. So that's around a 300% monthly growth rate but that is plateauing. On a request per second basis, we get around 200 requests per second.

  • What is your ratio of free to paying users?

    It's all free.

  • What is your user churn?

    It's around 1% per day, with a growth rate of 3% or so per day in terms of installed users.

  • How many accounts have been active in the past month?

    We had roughtly 2.1 million unique visitors in the past month according to Google.

  • What is the architecture of your system?

    It's a relatively standard Rails cluster. We have a dedicated front end proxy balancer / static web server running nginx, which proxies directly to 6, 4 core 8 GB application servers. Each application server runs 16 mongrels for a total of 96 mongrels. The front end load balancer proxies directly to the mongrel ports. In addition, we run a 4 GB memcache instance on each application server, along with a local starling distributed queue server and misc background processes.

    We use god to monitor our processes.

    On the DB layer, we have 2 32GB 4 core servers with 4x 15K SCSI RAID 10 disks in a master-slave setup. We use Dr Nic's magic multi-connection's gem in production split reads and writes to each
    box.

    We are adding more slaves right now so we can distribute the read load better and have better redundancy and backup policies. We also get help from Percona (the mysqlperformanceblog guys) for remote DBA work.

    We're hosted on Softlayer - they're a fantastic host. The only problem was that their hardware load balancing server doesn't really work very well ... we had lots of problems with hanging connections and latency. Switching a dedicated box running just nginx fixed everything.

  • How is your system architected to scale?

    It really isn't. On the application layer we are shared-nothing so it's pretty trivial. On the database side we're still with a monolithic master and we're trying to push off sharding for as long as we can. We're still vertically scaled on the database side and I think we can get away with it for quite some time.

  • What do you do that is unique and different that people could best learn from?

    The three things that are unique is -

    1. Neither of the two developers in involved had previous experience in large scale Rails deployment.
    2. Our growth trajectory is relatively rare in the history of Rails deployments
    3. We had very little opportunity for static page caching - each request does hit the full Rails stack

  • What lessons have you learned? Why have you succeeded? What do you wish you would have done differently? What wouldn't you change?

    We learned that a good host, good hardware, and a good DBA are very important. We used to be hosted on Railsmachine, which to be fair is an excellent shared hosting company and they did go out of there way to support us. In the end though, we were barely responsive for a good month due to hardware problems, and it only took two hours to get up and running on Softlayer without a hitch. Choose a good host if you plan on scaling, because migrating isn't fun.

    The most important thing we learned is that your scalability problems is pretty much always, always, always the database. Check it first, and if you don't find anything, check again. Then check again. Without exception, every performance problem we had can be traced to the database server, the database configuration, the query, or the use and non-use of indices.

    We definitely should have gotten on to a better host earlier in the game so we would have been up.

    We definitely wouldn't change our choice of framework - Rails was invaluable for rapid application development, and I think we've pretty much proven that two guys without a lot of scaling experience can scale a Rails app up. The whole 'but does Rails scale?' discussion sounds like a bunch of masturbation - the point is moot.

  • How is your team setup?

    We have two Rails developers, inclusive of me. We very recently retained the services of a remote DBA for help on the database end.

  • How many people do you have?

    On the technical side, 2 part time (now full time), and 1 remote DBA contractor.

  • Where are they located?

    The full time employees are also located in the SOMA area of San Francisco.

  • Who performs what roles?

    The two developers server as co-founders . I (Siqi) was responsible for front end design and development early on, but since I had some experience with deployment I also ended up handling network operations and deployment as well. My co founder Alex is responsible for the bulk of the Rails code - basically all the application logic is from him. Now I find myself doing more deep back end network operations tasks like MySQL optimization and replication - it's hard to find time to get back to the front end which is what I love. But it's been a real fun learning experience so I've been eating up all I can from this.

  • Do you have a particular management philosophy?

    Yes - basically find the smartest people you can, give them the best deal possible, and get out of their way. The best managers GET OUT OF THE WAY, so I try to run the company as much as I can with that in mind. I think I usually fail at it.

  • If you have a distributed team how do you make that work?

    We'd have to have some really good communication tools in the cloud - somebody would have to be a Basecamp nazi. I think remote work / outsourcing is really difficult - I prefer to stay away with from it
    for core development. For something like MySQL DBA or even sysadmin - it might make more sense.

    What do you use?

    We use Rails with a bunch of plugins, most notable cache-fu from Chris Wanstrath and magic multi connections from Dr. Nic. I use VIM as the editor with the rails.vim plugin.

  • Which languages do you use to develop your system?

    Ruby / Rails

  • How many servers do you have?

    We now have 12 servers in the cluster.

  • How are they allocated?

    4 DB servers, 6 application servers, 1 staging server, and 1 front end server.

  • How are they provisioned?

    We order them from Softlayer - there's a less than 4 hour turn around for most boxes, which is awesome.

  • What operating systems do you use?

    CentOS 5 (64 bit)

  • Which web server do you use?

    nginx

  • Which database do you use?

    MySQL 5.1

  • Do you use a reverse proxy?

    We just use nginx's built in proxy balancer.

  • How is your system deployed in data centers?

    We use a dedicated hosting service, Softlayer.

  • What is your storage strategy?

    We use NAS for backups but internal SCSI drives for our production boxes.

  • How much capacity do you have?

    Across all of our boxes we probably have around ... 5 TB of storage or
    thereabouts.

  • How do you grow capacity?

    Ad-hoc. We haven't done a proper capacity planning study, to our detriment.

  • Do you use a storage service?

    Nope.

  • Do you use storage virtualization?

    Nope.

  • How do you handle session management?

    Right now we just persist it to the database - it would be fairly easy to use memcache directly for this purpose though.

  • How is your database architected? Master/slave? Shard? Other?

    Master/slave right now. We're moving towards a Master/Multi-slave with a read only load balancing proxy to the slave cluster.

  • How do you handle load balancing?

    We do it in software via nginx.

  • Which web framework/AJAX Library do you use?

    Rails.

  • Which real-time messaging frame works do you use?

    None.

  • Which distributed job management system do you use?

    Starling

  • How do you handle ad serving?

    We run network ads. We also weight our various ad networks by eCPM on our application layer.

  • Do you have a standard API to your website?

    Nope.

  • How many people are in your team?

    2 developers.

  • What skill sets does your team possess?

    Me: Front end design, development, limited Rails. Obviously, recently proficient in MySQL optimization and large scale Rails deployment.
    Alex: application logic development, front end design, general software engineering.

  • What is your development environment?

    Alex develops on OSX while I develop on Ubuntu. We use SVN for version control. I use VIM for editing and Alex uses TextMate.

  • What is your development process?

    On the logic layer, it's very test driven - we test extensively. On the application layer, it's all about quick iterations and testing.

  • What is your object and content caching strategy?

    We cache both in memcache with no TTL, and we just manually expire.

  • What is your client side caching strategy?

    None.

    How do you manage your system?

  • How do check global availability and simulate end-user performance?

    We use Pingdom for external website monitoring - they're really good.

  • How do you health check your server and networks?

    Right now we're just relying on our external monitoring and Softlayer's ping monitoring. We're investigating FiveRuns for monitoring as a possible solution to server monitoring.

  • How you do graph network and server statistics and trends?

    We don't.

  • How do you test your system?

    We deploy to staging and run some sanity tests, then we do a deploy to all application servers.

  • How you analyze performance?

    We trace back every SQL query in development to make sure we're not doing any unnecessary calls or model instantiations. Other than that, we haven't done any real benchmarking.

  • How do you handle security?

    Carefully.

  • How do you decide what features to add/keep?

    User feedback and critical thinking. We are big believers in simplicity so we are pretty careful to consider before we add any major features.

  • How do you implement web analytics?

    We use a home grown metrics tracking system for virality optimization,
    and we also use Google Analytics.

  • Do you do A/B testing?

    Yes, from the time to time we will tweak aspects of our design to optimize for virality.

    How is your data center setup?

  • Which firewall product do you use?
  • Which DNS service do you use?
  • Which routers do you use?
  • Which switches do you use?
  • Which email system do you use?
  • How do you handle spam?
  • How do you handle virus checking of email and uploads?

    Don't know to all of the above.

  • How do you backup and restore your system?

    We use LVM to do incrementals on a weekly and daily basis.

  • How are software and hardware upgrades rolled out?

    Right now they are done manually, except for new Rails application deployments. We use capistrano to update and restart our application servers.

  • How do you handle major changes in database schemas on upgrades?

    We usually migrate on a slave first and then just switch masters.

  • What is your fault tolerance and business continuity plan?

    Not very good.

  • Do you have a separate operations team managing your website?

    Oh we wish.

  • Do you use a content delivery network? If so, which one and what for?

    Nope

  • What is your revenue model?

    CPM - more page views more money. We also have incentivized direct offers through our virtual currency.

  • How do you market your product?

    Word of mouth - the social graph. We just leverage viral design tactics to grow.

  • Do you use any particularly cool technologies are algorithms?

    I think Ruby is pretty particularly cool. But no, not really - we're not doing rocket science, we're just trying to get people laid.

  • Do your store images in your database?

    No, that wouldn't be very smart.

  • How much up front design should you do?

    Hm. I'd say none if you haven't scaled up anything before, and a lot if you have. It's hard to know what's actually going to be the problem until you've actually been through and see what real load problems look like. Once you've done that, then you have enough domain knowledge to do some actual meaningful up front design on our next go around.

  • Has anything surprised your either for the good or bad?

    How unreliable vendor hardware can be, and how different support can be from host to host. The number one most important thing you will need is a scaled up dedicated host who can support your needs. We use Softlayer and we can't recommend them highly enough.

    On the other hand, it's surprising how far just a master-multislave setup can take you on commodity hardware. You can easily do a Billion page views per month on this setup.

  • How does your system evolve to meet new scaling challenges?

    It doesn't really, we just fix bottle necks as they come and we see them coming.

  • Who do you admire?

    Brad Fitzpatrick for inventing memcache, and anyone who has successfully horizontally scaled anything.

  • How are you thinking of changing your architecture in the future?

    We will have to start sharding by users soon as we hit database size and write limits.

    Their Thoughts on Facebook Virality

  • Facebook models the social graph in digital form as accurately and completely as possible.
  • Social graph is more important that features.
  • Facebook enables rapid social distribution of new applications through the social graph.
  • Your application idea should be: social, engaging, and universal.
  • The social aspect makes it viral.
  • Engaging makes it monetizable.
  • Universal gives it potential.
  • Friends for Sale is social because you are buying and selling your social graph.
  • It's engaging because it's a twist on an idea, low pressure, flirty, and a bit cynical.
  • It's universal because everyone is vain, has a price, and wants to flirt with hot people.
  • Every touch point in the application is a potential for recruiting new users.
  • Every user converts 1.4 other users which is the basis for exponential growth.
  • For every new user track the number of invites, notifications, minifeed items, profile clicks, and other channels.
  • For every channel track the percent clicked, converted, uninstalls.

    Lessons Learned

  • Scaling from the start is a requirement on Facebook. They went to 1 million pages/day in 4 weeks.
  • Ruby on Rails can scale.
  • Anything scales on the right architecture. Focus on architecture and operations.
  • You need a good DBA, good host, and good well configured hardware.
  • With caching and the heavy duty servers available today, you can go a long time without adopting more complicated database architectures.
  • The social graph is real. It's truly staggering the number of accessible users on Facebook with the right well implemented viral application.
  • Most performance problems are in the database. Look to the database server, the database configuration, the query, or the use and non-use of indexes.
  • People still use Vi!

    I'd really like to thank Siqi taking the time to answer all my questions and provide this fascinating look in to their system. It's amazing what you've done in so little time. Excellent job and thanks again.
  • Saturday
    Apr052008

    Skype Plans for PostgreSQL to Scale to 1 Billion Users

    Skype uses PostgreSQL as their backend database. PostgreSQL doesn't get enough run in the database world so I was excited to see how PostgreSQL is used "as the main DB for most of [Skype's] business needs." Their approach is to use a traditional stored procedure interface for accessing data and on top of that layer proxy servers which hash SQL requests to a set of database servers that actually carry out queries. The result is a horizontally partitioned system that they think will scale to handle 1 billion users.

  • Skype's goal is an architecture that can handle 1 billion plus users. This level of scale isn't practically solvable with one really big computer, so our masked superhero horizontal scaling comes to the rescue.
  • Hardware is dual or quad Opterons with SCSI RAID.
  • Followed common database progression: Start with one DB. Add new databases partitioned by functionality. Replicate read-mostly data for better read access. Then horizontally partition data across multiple nodes..
  • In a first for this blog anyway, Skype uses a traditional database architecture where all database access is encapsulated in stored procedures. This allows them to make behind the scenes performance tweaks without impacting frontend servers. And it fits in cleanly with their partitioning strategy using PL/Proxy.
  • PL/Proxy is used to scale the OLTP portion of their system by creating a horizontally partitioned cluster: - Database queries are routed by a proxy across a set of database servers. The proxy creates partitions based on a field value, typically a primary key. - For example, you could partition users across a cluster by hashing based on user name. Each user is slotted into a shard based on the hash. - Remote database calls are executed using a new PostgreSQL database language called plproxy. An example from Kristo Kaiv's blog:
    First, code to insert a user in a database:
    CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS text AS $$
    BEGIN
        PERFORM 1 FROM users WHERE username = i_username;
        IF NOT FOUND THEN
            INSERT INTO users (username) VALUES (i_username);
            RETURN 'user created';
        ELSE
            RETURN 'user already exists';
        END IF;
    END;
    $$ LANGUAGE plpgsql SECURITY DEFINER;
    
    Heres the proxy code to distribute the user insert to the correct partition:
    queries=#
    CREATE OR REPLACE FUNCTION insert_user(i_username text) RETURNS TEXT AS $$
        CLUSTER 'queries'; RUN ON hashtext(i_username);
    $$ LANGUAGE plproxy;
    
    Your SQL query looks normal:
    SELECT insert_user("username");
    
    - The result of a query is exactly that same as if was executed on the remote database. - Currently they can route 1000-2000 requests/sec on Dual Opteron servers to a 16 parition cluster.
  • They like PL/Proxy approach for OLTP because: - PL/Proxy servers form a scalable and uniform "DB-bus." Proxies are robust because in a redundant configuration if one fails you can just connect to another. And if the proxy tier becomes slow you can add more proxies and load balance between them. - More partitions can be added to improve performance. - Only data on a failed partition is unavailable during a failover. All other partitions operate normally.
  • PgBouncer is used as a connection pooler for PostgreSQL. PL/Proxy "somewhat wastes connections as it opens connection to each partition from each backend process" so the pooler helps reduce the number of connections.
  • Hot-standby servers are created using WAL (Write Ahead Log) shipping. It doesn't appear that these servers can be used for read-only operations.
  • More sophisticated organizations often uses an OLTP database system to handle high performance transaction needs and then create seperate systems for more non-transactional needs. For example, an OLAP (Online analytical processing) system is often used for handling complicated analysis and reporting problems. These differ in schema, indexing, etc from the OLTP system. Skype also uses seperate systems for the presentation layer of web applications, sending email, and prining invoices. This requires data be moved from the OLTP to the other systems. - Initially Slony1 was used to move data to the other systems, but "as the complexity and loads grew Slony1 started to cause us greater and greater pains." - To solve this problem Skype developed their on lighter weight queueing and replication toolkit called SkyTools. The proxy approach is interesting and is an architecture we haven't seen previously. Its power comes from the make another level of indirection school of problem solving, which has advantages:
  • Applications are independent of the structure of the database servers. That's encapsulated in the proxy servers.
  • Applications do not need to change in response to partition, mapping, or other changes.
  • Load balancing, failover, and read/write splitting are invisible to applications. The downsides are:
  • Reduced performance. Another hop is added and queries must be parsed to perform all the transparent magic.
  • Inability to perform joins and other database operations across partitions.
  • Added administration complexity of dealing with proxy configuration and HA for the proxy servers. It's easy to see how the advantages can outweigh the disadvantages. Without changing your application you can slip in a proxy layer and get a lot of very cool features for what seems like a low cost. If you are a MySQL user and this approach interests you then take a look at MySQL Proxy, which accomplishes something similar in a different sort of way.

    Related Articles

  • An Unorthodox Approach to Database Design : The Coming of the Shard
  • PostgreSQProducts - Scaling infinitely with PL/Proxy
  • PL/Proxy
  • Heroku also uses PostgreSQL.
  • MySQL Proxy
  • PostgreSQL cluster: partitioning with plproxy (part I) by Kristo Kaiv'.
  • PostgreSQL cluster: partitioning with plproxy (part II) by Kristo Kaiv'.
  • PostgreSQL at Skype.
  • Skytools database scripting framework & PgQ by Kristo Kaiv'.
  • PostgreSQL High Availability.

    Click to read more ...

  • Wednesday
    Mar122008

    YouTube Architecture

    Update 3: 7 Years Of YouTube Scalability Lessons In 30 Minutes and YouTube Strategy: Adding Jitter Isn't A Bug

    Update 2: YouTube Reaches One Billion Views Per Day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour. 

    Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

    YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

    Information Sources

  • Google Video

    Platform

  • Apache
  • Python
  • Linux (SuSe)
  • MySQL
  • psyco, a dynamic python->C compiler
  • lighttpd for video instead of Apache

    What's Inside?

    The Stats

  • Supports the delivery of over 100 million videos per day.
  • Founded 2/2005
  • 3/2006 30 million video views/day
  • 7/2006 100 million video views/day
  • 2 sysadmins, 2 scalability software architects
  • 2 feature developers, 2 network engineers, 1 DBA

    Recipe for handling rapid growth

    while (true) { identify_and_fix_bottlenecks(); drink(); sleep(); notice_new_bottleneck(); } This loop runs many times a day.

    Web Servers

  • NetScalar is used for load balancing and caching static content.
  • Run Apache with mod_fast_cgi.
  • Requests are routed for handling by a Python application server.
  • Application server talks to various databases and other informations sources to get all the data and formats the html page.
  • Can usually scale web tier by adding more machines.
  • The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.
  • Python allows rapid flexible development and deployment. This is critical given the competition they face.
  • Usually less than 100 ms page service times.
  • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
  • For high CPU intensive activities like encryption, they use C extensions.
  • Some pre-generated cached HTML for expensive to render blocks.
  • Row level caching in the database.
  • Fully formed Python objects are cached.
  • Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.

    Video Serving

  • Costs include bandwidth, hardware, and power consumption.
  • Each video hosted by a mini-cluster. Each video is served by more than one machine.
  • Using a a cluster means: - More disks serving content which means more speed. - Headroom. If a machine goes down others can take over. - There are online backups.
  • Servers use the lighttpd web server for video: - Apache had too much overhead. - Uses epoll to wait on multiple fds. - Switched from single process to multiple process configuration to handle more connections.
  • Most popular content is moved to a CDN (content delivery network): - CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network. - CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
  • Less popular content (1-20 views per day) uses YouTube servers in various colo sites. - There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disks blocks are being accessed. - Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior. - Tune RAID controller and pay attention to other lower level issues to help. - Tune memory on each machine so there's not too much and not too little.

    Serving Video Key Points

  • Keep it simple and cheap.
  • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
  • Use commodity hardware. More expensive hardware gets the more expensive everything else gets too (support contracts). You are also less likely find help on the net.
  • Use simple common tools. They use most tools build into Linux and layer on top of those.
  • Handle random seeks well (SATA, tweaks).

    Serving Thumbnails

  • Surprisingly difficult to do efficiently.
  • There are a like 4 thumbnails for each video so there are a lot more thumbnails than videos.
  • Thumbnails are hosted on just a few machines.
  • Saw problems associated with serving a lot of small objects: - Lots of disk seeks and problems with inode caches and page caches at OS level. - Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea. - A high number of requests/sec as web pages can display 60 thumbnails on page. - Under such high loads Apache performed badly. - Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20. - Tried using lighttpd but with a single threaded it stalled. Run into problems with multiprocesses mode because they would each keep a separate cache. - With so many images setting up a new machine took over 24 hours. - Rebooting machine took 6-10 hours for cache to warm up to not go to disk.
  • To solve all their problems they started using Google's BigTable, a distributed data store: - Avoids small file problem because it clumps files together. - Fast, fault tolerant. Assumes its working on a unreliable network. - Lower latency because it uses a distributed multilevel cache. This cache works across different collocation sites. - For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.

    Databases

  • The Early Years - Use MySQL to store meta data like users, tags, and descriptions. - Served data off a monolithic RAID 10 Volume with 10 disks. - Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered. - They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach. - Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master. - Updates cause cache misses which goes to disk where slow I/O causes slow replication. - Using a replicating architecture you need to spend a lot of money for incremental bits of write performance. - One of their solutions was prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster.
  • The later years: - Went to database partitioning. - Split into shards with users assigned to different shards. - Spreads writes and reads. - Much better cache locality which means less IO. - Resulted in a 30% hardware reduction. - Reduced replica lag to 0. - Can now scale database almost arbitrarily.

    Data Center Strategy

  • Used manage hosting providers at first. Living off credit cards so it was the only way.
  • Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
  • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
  • Use 5 or 6 data centers plus the CDN.
  • Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
  • Video bandwidth dependent, not really latency dependent. Can come from any colo.
  • For images latency matters, especially when you have 60 images on a page.
  • Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.

    Lessons Learned

  • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
  • Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
  • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more writes performance.
  • Constant iteration on bottlenecks: - Software: DB, caching - OS: disk I/O - Hardware: memory, RAID
  • You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

    Click to read more ...

  • Sunday
    Feb242008

    Yandex Architecture

    Update: Anatomy of a crash in a new part of Yandex written in Django. Writing to a magic session variable caused an unexpected write into an InnoDB database on every request. Writes took 6-7 seconds because of index rebuilding. Lots of useful details on the sizing of their system, what went wrong, and how they fixed it. Yandex is a Russian search engine with 3.5 billion pages in their search index. We only know a few fun facts about how they do things, nothing at a detailed architecture level. Hopefully we'll learn more later, but I thought it would still be interesting. From Allen Stern's interview with Yandex's CTO Ilya Segalovich, we learn:

  • 3.5 billion pages in the search index.
  • Over several thousand servers.
  • 35 million searches a day.
  • Several data centers around Russia.
  • Two-layer architecture.
  • The database is split in pieces and when a search is requested, it pulls the bits from the different database servers and brings it together for the user.
  • Languages used: c++, perl, some java.
  • FreeBSD is used as their server OS.
  • $72 million in revenue in 2006.

    Click to read more ...

  • Wednesday
    Jan302008

    How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

    How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text file stored on each machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. Then came a scripted version of the same process. The next big evolution was a single machine MySQL version. Inserts quickly became the bottleneck as the huge torrents of data flooding caused a lot of index churn. Perdiodic bulk loading was the remedy to this problem, but the shear size of the indexes slowed it down. Data was then broken into Merge Tables based on time so index updates weren't a problem. As more and more data this solution broke down with a combination of load and operational problems. Facing exponential growth they spent about 3 months building a new log processing system using Hadoop (an open-source implementation of Google File System and MapReduce), Lucene and Solr. Moving to a partitioned MySQL data set was an option, but they thought it would only buy time until and a more scalable solution would need to be created in the future anyway. The future came a little early this year. The advantage of their new system is that they can now look at their data in anyway they want:

  • Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins.
  • When they wanted to find out which part of the the world their customers logged in from, a quick MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system. This switch has changed how they run their business. Stu Hood nicely sums up the impact: "Now whenever we think of complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff." In the rest of this post Bill describes the evolution of their system and the forces that caused them to move from a relational database solution to a MapReduce system. Before getting started, I'd really like to thank Bill Boebel for spending so much time and effort in creating this very valuable experience report.

    Information Sources

  • MapReduce at Rackspace
  • A document sent to me by Bill Boebel, CTO of Mailtrust (Rackspace's mail division). This post is a little different than normal because most all the content past this point is by Bill, I've just organized it a little differently.

    The Platform

  • Hadoop
  • Hadoop Distributed File System (HDFS)
  • Lucene
  • Solr
  • Tomcat

    The Stats

  • Rackspace has more than 50K devices and 7 data centers.
  • The mail system and logging servers are currently in 3 of the Rackspace data centers.
  • The system stores over 800 million objects (an object = a user event such as receiving an email or logging into IMAP) within Solr and 9.6 billion within Hadoop, which equals 6.3 TB compressed.
  • Several hundred gigabytes of email log data is generated each day.

    Background on Mailtrust

  • Email hosting company
  • Founded in 1999, merged with Rackspace in 2007, previous name: Webmail.us
  • 80K business customers, 700K mailboxes.
  • 2 hosted mail products: Noteworthy, MS Exchange
  • The Noteworthy System: * Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds, shared calendaring, Outlook sync, Blackberry sync. * ~600 servers, commodity hardware, designed to work around frequent failures.
  • The MS Exchange System: * MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync. * ~100 servers, higher-end hardware, SAN & DAS storage.

    The Architecture

    The way the current Hadoop based system works is:
  • Raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System (”HDFS”) in real time.
  • MapReduce jobs are scheduled run to index the new data using Apache Lucene and Solr.
  • Once the indexes have been built, they are compressed and stored away in HDFS.
  • Each Hadoop datanode runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.

    The System Evolution

    The Problem

    Mailtrust is a very customer service focused company. It is extremely important for our support techs to be able to examine mail logs in order to troubleshoot problems for our customers. Our support techs need to search the logs hundreds of times per day, so the tools that provide this functionality must be fast and accurate. With over 600 mail servers, and hundreds of gigabytes of raw log data produced each day, this can be tricky to manage. Here is a brief history of the Mailtrust logging architecture, problems we faced, how we over came them, and what the system looks like today...

    Logging v1.0

    Logs were stored in flat text files on the local disk of each mail server and were kept for 14 days. Our support techs did not have login access to the servers, so in order to search the logs they would have to escalate a ticket to our engineers. The engineers would then have to ssh into each mail server and grep /var/log/maillog. Problems: Once we grew much past a dozen servers, this manual process of logging into each server become too time consuming for our engineers.

    Logging v1.1

    Sped up the search process by writing a script that would search multiple servers via one command run from a centralized server. An engineer could tell the script what type of mail server to search (inbound smtp, outbound smtp, backend mailbox). The script would look at /etc/hosts for a list of servers of that type, and then iterate through each server, ssh in, perform the grep and then output the results. The script could also search in the past via "gunzip -c /var/log/maillog.* | grep" Problems: The support techs still had to escalate a ticket to the engineers in order to perform a search. As the number of customers and servers increased, this began to take too much of our engineers' scarce time. Also, storing and searching the logs on a live server was negatively affecting the performance of the servers. To make matters worse, the engineering team had grown and we started running into the problem where two engineers would perform a search at the same time, which really slowed things down.

    Logging v2.0

    We released a log search tool that the support techs could use directly, without involving the engineers. The support team was given a web-based tool where they could search the logs. It allowed searching by the sender or recipient's email address, domain name or IP address. All of these were indexed fields in a MySQL database. Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the data set was very large and these queries would be horribly slow. Each day's logs were stored in a separate table, so that we could cleanup old data by simply dropping and recreating MySQL tables. This made cleanup really fast compared to running a conditional DELETE command on a large table. Log data was only kept for 3 days in order to keep the MySQL database down to a reasonable size. To get the logs into the database, each mail server initially wrote its log data to a local 16MB tempfs partition. Logrotate was called via cron every 60 seconds to rotate the temporary log file and then preprocess the data before sending it on to the centralized log server. This preprocessing step reduced the volume of data that had to be transmitted over the network to the log server, and this also distributed the processing workload to avoid creating bottleneck on the log server. After the data was processed locally, the script would send comma delimited log data back to syslog-ng on the local server, and syslog-ng would then send it over the network to the centralized log server. The log server was configured to receive data on 6 different ports, one for each type of log data... inbound smtp, outbound smtp, backend smtp, spam/virus filtering, POP3 and IMAP. As log data was received, the records were inserted one by one into the database via MySQL INSERT commands. Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As the tables grew, indexing each entry as it was inserted became slow. Within the first hours of testing, the inserts began slowing and could not keep up with the rate at which data was received. Version 2.0 of the logging system was never used in production.

    Logging v2.1

    Fixed the MySQL INSERT bottleneck by queuing up the log entries in local text files on the centralized log server and periodically bulk loading them into the database. As syslog-ng received logs on its 6 ports, the data would be streamed to 6 separate text files. Every 10 minutes a script would rotate those text files and execute a MySQL LOAD to load the data into the database. This was magnitudes faster than inserting the log data one record at a time. Problems: The LOADs would get progressively slower as the database grew because MySQL indexing performance decreases as the table you are inserting into gets larger. This version was fast enough to be released into production, but we knew the system would not scale too far without additional work.

    Logging v2.2

    Introduced Merge Tables in order to speed up loading the log data into the database. With this version, every 10 minutes our script would create a new database table and then load the text logs into the empty table. This made the LOAD command extremely fast because there were no existing database indexes that could negatively affect performance. After the data was loaded, the script would modify a set of Merge Tables that combined all of the 10-minute tables together. The web search tool was modified to allow searching within the following time ranges: all day, past 12-hours, past 6-hours, past 2-hours. Corresponding Merge Tables existed for each of those time ranges, and were modified every 10 minutes as new tables were created. Problems: This version of the logging system worked reliably for about one year. But we began having problems with it as our support team, customer base and server count grew. When we reached about 100 servers the database LOAD operations would take 2-3 minutes to run, which was acceptable, but the server was now always under a heavy cpu and disk IO load. Searches were being performed more frequently and were becoming slow. We started to see some strange problems such as random errors while trying to create new tables or modify the Merge Tables. These errors progressively became more frequent, resulting in missing log data. The support team began to lose confidence in the system's accuracy. Also, there were several occasions where our engineers performed a software upgrade to a particular application, which changed log format in such a way that broke the preprocessing script. Since our raw logs were deleted from the local mail servers every 60 seconds, we'd have no way to recover the missing logs when this occurred. Additionally, the log search tool was becoming ever more critical to our support team's daily operations; however, the logging system had no redundancy. There was no RAID, no backups, no failover system. We also do not have a good plan for scaling the log system beyond a single monolithic server. Incrementally patching problems and tweaking performance with the log system was taking up a lot of time and we needed something better. We needed a new solution that would be fast, reliable and could scale indefinitely with our growth. We needed something truly scalable.

    Logging v3.0

    While designing v3.0, we looked at several commercial log processing applications. Splunk stood out, and did just about everything we wanted; however, we worried that using a vendor product like this might limit our abilities to build new features down the road. For example, we wanted to build a tool that would allow our customers to search their logs directly. We had been keeping an eye on the Apache Hadoop project since its inception, and were extremely impressed with its progress and direction. Hadoop is an open-source implementation of Google File System and MapReduce... a system that is designed specifically for large scale distributed data processing. It scales out it's workload horizontally by adding servers and distributing the data and MapReduce jobs amongst the servers. Other companies were already using it for their own log processing. So chose to go with Hadoop. In about 3 months we build a fresh new log processing system using Hadoop, Lucene and Solr. The system is described here: http://blog.racklabs.com/?p=66 We believe this new system will be able to scale with us as our company grows. And there is a lot of momentum behind the Hadoop project, which gives us a lot of confidence that its scalability will continue to improve. Yahoo is one of the major contributors to the project and has built Hadoop clusters that contain thousands of servers, and they are aggressively working to get Hadoop to support tens of thousands of servers. Problems: To date, the only problems we have found have been our own bugs; and we fix those as we find them. We are actively running v3.0 today, but we're not going to stop here. We have a lot of plans for new features...

    The Future

    Version 3.1 is being coded currently. It includes new MapReduce jobs that support Microsoft Exchange log processing. (currently we only process Noteworthy logs with this system). We plan to go live in March. In version 4.0 we plan to put the log search tool in the hands of our customers so that they can have the same troubleshooting power that our support team has. This will most likely require reorganizing the way we store log index shards so that they are grouped by user, rather than letting Solr randomly group them. Our resellers seem to be excited about this, because it should allow them to better support their customers. Who knows what we'll build after v4.0...

    Related Articles

  • Google Architecture
  • Database People Hating on MapReduce
  • Product: Hadoop
  • Running Hadoop MapReduce on Amazon EC2 and Amazon S3
  • Solr

    Click to read more ...

  • Thursday
    Jan242008

    Mailinator Architecture

    Update: A fun exploration of applied searching in How to search for the word "pen1s" in 185 emails every second. When indexOf doesn't cut it you just trie harder. Has a drunken friend ever inspired you to create a first of its kind internet service that is loved by millions, deemed subversive by thousands, all while handling over 1.2 billion emails a year on one rickity old server? That's how Paul Tyma came to build Mailinator. Mailinator is a free no-setup web service for thwarting evil spammers by creating throw-away registration email addresses. If you don't give web sites you real email address they can't spam you. They spam Mailinator instead :-) I love design with a point-of-view and Mailinator has a big giant harry one: performance first, second, and last. Why? Because Mailinator is free and that allows Paul to showcase his different perspective on design. While competitors buy big Iron to handle load, Paul uses a big idea instead: pick the right problem and create a design to fit the problem. No more. No less. The result is a perfect system architecture sonnet, beauty within the constraints of form. How does Mailinator carry out its work as a spam busting super hero? Site: http://mailinator.com/

    Information Sources

  • The Architecture of Mailinator
  • Mailinator's 2006 Stats

    The Platform

  • Linux
  • Tomcat
  • Java

    The Stats

  • Will process an estimated 1.29 BILLION emails for 2007. 450.74 million in 2006. 280.68 million in 2005.
  • Peak rate of 6.5 million emails/day or 4513/min or 75/sec.
  • Mailinator runs on a very modest machine with an AMD 2Ghz Athlon processor, 1GB of RAM (much less is used), and a low-performance 80G IDE hard drive. And the machine is not very busy at all.
  • Mailinator runs for months unattended and very few emails are lost, even under constant spam attacks and high peak loads.

    The Architecture

  • Having a free system means the system doesn't have to be perfect. So the design goals are: - Design a system that values survival above all else, even users. Survival is key because Mailinator must fight off attacks on a daily basis. - Provide 99.99% uptime and accuracy for users. Higher uptime goals would be impractical and costly. And since the service is free this is just part of rules of the game for users. - Support the following service model: user signs-up for something, goes to Mailinator, clicks on the subscription link, and forgets about it. This means email doesn't have to be stored persistently on disk. Email can reside in RAM because it is temporary (3-4 hours). If you want a real mailbox then use another service.
  • The original flow of email handling was: - Sendmail received email in a single on-disk mailbox. - The Java based Mailinator grabbed emails using IMAP and/or POP (it changed over time) and deleted them. - The system then loaded all emails into memory and let them sit there. - The oldest email was pushed out once the 20,000 in memory limit was reached.
  • The original architecture worked well: - It was stable and stayed up for months at a time. - It used almost all the 1GB of RAM. - Problems started when the incoming email rate started surpassing 800,000 a day. The system broke down because of disk contention between Mailinator and the email subsystem.
  • The New Architecture: - The idea was to remove the path through the disk which was accomplished with a complete system rewrite. - The web application, the email server, and all email storage run in one JVM. - Sendmail was replaced with a custom built SMTP server. Because of the nature of Mailinator a full SMTP server was not necessary. Mailinator does not need to send email. And it's primary duty is to accept or reject email as fast as possible. This is the downside of layering. Layering is very often given as a key strategy in scaling, but it can kill performance because crucial decisions are best handled at the highest levels of the stack. So work flows through the system only to be dumped at the lower layers when many of the RAM and cycle stealing operations have already been accomplished. So the decision to go with a custome SMTP server is an interesting and brave decision. Most people at this point would just add more hardware. And they wouldn't be wrong, but it's interesting to see this path taken as well. Maybe with more DOM and AOP like architectures we can flatten the stack and get better performance when needed. - Now Mailinator receives an email directly, parses it, and stores it into memory. The disk is bypassed completely and the disk remains fairly idle. - Emails are written to disk when the system is coming down so they can be reloaded on startup. - Logging was shut-off to remove the risk of subpoenaes. When logging was performed log data was written in batches so several thousand logs lines would be written in one disk write. This minimized at disk contention at the risk of losing helpful diagnostic information. - The system uses under 300 threads. More aren't needed. - On arrival each email passes through a filter system and is stored in RAM if all filters are passed. - Every inbox is limited to only 10 emails so popular inboxes, like joe@mailinator.com, can't blow the system. - No incoming email can be over 100k and all attachments are immediately discarded. This saves on RAM.
  • Emails are compressed in RAM: - Since 99% of emails are never looked at, compressed email saves RAM. They are only ever decompressed when someone looks at them. - Mailinator can store about 80,000 emails in RAM, using under 300MB of RAM compared to the 20,000 emails which were stored in 1GB RAM in the original design. - With this pool the average email lifespan is about 3-4 hours. - It's likely 200,000 emails could fit in memory, but there hasn't been a real need. - This is one of the design details I love because it's based on real application usage patterns. RAM is precious and CPU is not, so use compression to save RAM at the expense of CPU, knowing you won't have to take the CPU hit twice, most of the time.
  • Mailinator does not guarantee anonymity and privacy: - There is no privacy. Anyone can read any inbox at anytime. - Relaxing these constrains, while shocking, makes the design much simpler. - For the user it is simple because there is no sign up needed. When a web site asks you for an email address you can just enter an mailinator address. You don't need to create a separate account. Typing in the email address effectively creates the mailinator account. Simple. - In practice users still get a high level of privacy.
  • Goal of survivability leads to aggressive SPAM filtering. - Mailinator doesn't have anything against SPAM, but because it gets so much SPAM, it must be filtered out when it threatens the up time of the system. - Which leads to this rule: If you do anything (spammer or not) that starts affecting the system - your emails will be refused and you may be locked out.
  • To be accepted an email must pass the following filter chain: - Bounce: all bounced emails are dropped. - IP: too much email from a single IP are dropped - Subject: too much email on the same subject is dropped - Potty: subjects containing words that indicate hate or crimes or just downright nastiness are dropped.
  • Surviving Email Floods from a Single IP Adress - An AgingHashmap is used to filter out spammers from a particular IP address. When an email arrives on a IP address the IP is put in the map and a counter is increased for all subsequent emails. - After a certain period of time with no emails the counter is cleared. - When a sender reaches a threshold email count the sender is blocked. This prevents a sender from flooding the system. - Many systems use this sort of logic to protect all sorts of resources, like comments. You can use memcached for the same purpose in a distributed system.
  • Protecting Against Zombie Attacks: - Spam can be sent from a large coordinates sets of different IP addresses, called zombie networks. The same message is sent from thousands of different IP addresses so the techniques for stopping email from a single IP address are not sufficient. - This filtering is a little more complex than IP blocking because you have to parse enough of the email to get the subject line and matching subject strings is a little more resource intensive. - When something like 20 emails with the same subject within 2 minutes, all emails with that subject are then banned for 1 hour. - Interestingly, subjects are not banned forever because that would mean Mailinator would have to track subjects forever and the system design is inherently transient. This is pretty clever I think. At the cost of a few "bad" emails getting through the system is much simpler because no persistent list must be managed and that list surely would become a bottleneck. A system with more stringent SPAM filtering goals would have to create a much more complex and less robust architecture. - Nealy 9% of emails are blocked with this filter. - From my reading Mailinator filters only on IP and subject, so it doesn't have to read the body of the email body to accept or reject the email. This minimizes resource usage when most email will be rejected.
  • To lessen the danger from DOS attacks: - All connections that are silent for a specific period of time are droped. - Mailinator sends replies to email senders very slowly, like 10 or 20 or 30 seconds, even for a very small amount of data. This slows down spammers who are trying to send out spam as fast as possible and may make them rethink sending email again to that address. The wait period is reduced during busy periods so email isn't dropped.

    Lessons Learned

  • Perfection is a trap. How many systems are made much more complicated by the drive to be 100% everything. If you've been in those meetings you know what they are like. Oh, we can't do it this way or that way because there's .01% chance of something going wrong. Instead ask: how imperfect can you be and be good enough?
  • What you throw out is as important as what you keep in. We have many preconceptions of how to design systems. We make take for granted that you need to scale-out, you need to have email accessible days later, and that you must provide private accounts for everyone. But you really need these things? What can you toss?
  • Know the purpose of your system and design accordingly. Being everything to everyone means you are nothing to nobody. Keeping emails for a short period of time, allowing some SPAM to get through, and accepting less than 100% uptime create a strong vision for the system that help drive the design in all areas. You would only build your own SMTP server if you had a very strong idea of what your system was about and what you needed. I know this would have never occurred to me as an idea. I would have added more hardware.
  • Fail fast for the common case before committing resources. A high percentage of email is rejected so it makes sense to reject it as early as possible in the stack to minimize resources to accomplish the task. Figure out how to short circuit frequently failed items as fast as possible. This is important and often over looked scaling strategy.
  • Efficiency often means build it yourself. Off the shelf tools tend to do the whole job. If you only need part of the job done you may be able to write a custom component that runs much faster.
  • Adaptively forget. A little failure is OK. All the blocked IP addresses don't need to be remembered forever. Let the block decisions build up from local data rather than global state. This is powerfully simple and robust architecture.
  • Java doesn't have to be slow. Enough said.
  • Avoid the disk. Many applications need to hit the disk, but the disk is always a bottleneck. Can you design around the disk using other creative strategies?
  • Constrain resource usage. Put in constraints, like inbox size, that will keep your system for spiking uncontrollably. Unconstrained resource usage must be avoided with limited resources.
  • Compress data. Compression can be a major win when trying to conserve RAM. I've seen memory usage drop by more than half when using compression with very little overhead. If you are communicating locally, just have the client encode the data and keep it encoded. Build APIs to access the data without have to decode the full message.
  • Use fixed size resource pools to handle load. Many applications don't control resource usage, like memory, and they crash when too much is used. To create a really robust system fix your resources and drop work when those resources are full. You can age resources, give priority access, give fair access, or use any other logic to arbitrate resource access, but because the resource will be limited, you will stay up under load.
  • If you don't keep data it can't be subpoenaed. Because Mailinator doesn't store email or logs on disk noting can be subpoenaed.
  • Use what you know. We've seen this lesson a few times. Paul knew Java better than anything else, so he used it, made it work, and he got the job got done.
  • Find your own Mailinators. Sure, Mailinator is a small system. In a large system it would just be a small feature, but your system is composed of many Mailinator sized projects. What if you developed some of those like Mailinator?
  • KISS exists, though it's rare. Keeping it simple is always talked about, but rarely are we shown real examples. It's mostly just your way is complex and my way is simple because it's my way. Mailinator is a good example of simple design.
  • Robustness is a function of architecture. To create a design that efficiently uses memory and survives massive spam attacks required an architectural approach that looked at the entire stack.

    Related Articles

  • PlentyOfFish champions straight forward bare bones simplicity.
  • Varnish smartly uses OS features to find incredible performance.
  • ThemBid gracefully pieces together open source components.

    Click to read more ...

  • Monday
    Jan072008

    How Ruby on Rails Survived a 550k Pageview Digging

    Shanti Braford details how his Ruby on Rails based website survived a 24 hour 550,000+ pageview digg attack. His post cleanly lays out all the juicy setup details, so there's not much I can add. Hosting costs $370 a month for 1 web server, 1 database server, and sufficient bandwidth. The site is built on RoR, nginx, MySQL, and 7 mongrel servers. He thinks Rails 2.0 has improved performance and credits database avoidance and fragment caching for much of the performance boost. Keep in mind his system is relatively static, but it's a very interesting and useful experience report.

    Click to read more ...

    Monday
    Nov192007

    Tailrank Architecture - Learn How to Track Memes Across the Entire Blogosphere

    Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that? This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

    Sites

  • Tailrank - We track the hottest news in the blogosphere!
  • Spinn3r - A blog spider you can specialize with your own behavior instead of creating your own.
  • Kevin Burton's Blog - his blog is an indexing mix of politics and technical talk. Both are always interesting.

    Platform

  • MySQL
  • Java
  • Linux (Debian)
  • Apache
  • Squid
  • PowerDNS
  • DAS storage.
  • Federated database.
  • ServerBeach hosting.
  • Job scheduling system for work distribution.

    Interview

  • What is your system is for? Tailrank originally a memetracker to track the hottest news being discussed within the blogosphere. We started having a lot of requests to license our crawler and we shipped that in the form of Spinn3r about 8 months ago. Spinn3r is self contained crawler for companies that want to index the full blogosphere and consumer generated media. Tailrank is still a very important product alongside Spinn3r and we're working on Tailrank 3.0 which should be available in the future. No ETA at the moment but it's actively being worked on.
  • What particular design/architecture/implementation challenges does your system have? The biggest challenge we have is the sheer amount of data we have to process and keeping that data consistent within a distributed system. For example, we process 52TB of content per month. this has to be indexed in a highly available storage architecture so the normal distributed database problems arise.
  • What did you do to meet these challenges? We've spent a lot of time in building out a distributed system that can scale and handle failure. For example, we've built a tool called Task/Queue that is analogous to Google's MapReduce. It has a centralized queue server which hands out units of work to robots which make requests. It works VERY well for crawlers in that slower machines just fetch work at a slower rate while more modern machines (or better tuned machines) request work at a higher rate. This ends up easily solving one of the main distributed computing fallacies that the network is homogeneous. Task/Queue is generic enough that we could actually use it to implement MapReduce on top of the system. We'll probably open source it at some point. Right now it has too many tentacles wrapped into other parts of our system.
  • How big is your system? We index 24M weblogs and feeds per hour and process content at about 160-200Mbps. At the raw level we're writing to our disks at about 10-15MBps continuously.
  • How many documents, do you serve? How many images? How much data? Right now the database is about 500G. We're expecting it to grow well beyond this in 2008 as we expand our product offering.
  • What is your rate of growth? It's mostly a function of customer feature requests. If our customers want more data we sell it to them. In 2008 we're planning on expanding our cluster to index larger portions of the web and consumer generated media.
  • What is the architecture of your system? We use Java, MySQL and Linux for our cluster. Java is a great language for writing crawlers. The library support is pretty solid (though it seems like Java 7 is going to be killer when they add closures). We use MySQL with InnoDB. We're mostly happy with it though it seems I end up spending about 20% of my time fixing MySQL bugs and limitations. Of course nothing is perfect. MySQL for example was really designed to be used on single core systems. The MySQL 5.1 release goes a bit farther to fix multi-core scalability locks. I recently blogged about how these the new multi-core machines should really be considered N machines instead of one logical unit: Distributed Computing Fallacy #9.
  • How is your system architected to scale? We use a federated database system so that we can split the write load as we see more IO. We've released a lot of our code as Open Source a lot of our infrastructure and this will probably be released as Open Source as well. We've already opened up a lot of our infrastructure code:
  • http://code.tailrank.com/lbpool - load balancing JDBC driver for use with DB connection pools.
  • http://code.tailrank.com/feedparser - Java RSS/Atom parser designed to elegantly support all versions of RSS
  • http://code.google.com/p/benchmark4j/ - Java (and UNIX) equivalent of Windows' perfmon
  • http://code.google.com/p/spinn3r-client/ - Client bindings to access the Spinn3r web service
  • http://code.google.com/p/mysqlslavesync/ - Clone a MySQL installation and setup replication.
  • http://code.google.com/p/log5j/ - Logger facade that supports printf style message format for both performance and ease of use.
  • How many servers do you have? About 15 machines so far. We've spent a lot of time tuning our infrastructure so it's pretty efficient. That said, building a scalable crawler is not an easy task so it does take a lot of hardware. We're going to be expanding FAR past this in 2008 and will probably hit about 2-3 racks of machines (~120 boxes).
  • What operating systems do you use? Linux via Debian Etch on 64 bit Opterons. I'm a big Debian fan. I don't know why more hardware vendors don't support Debian. Debian is the big secret in the valley that no one talks about. Most of the big web 2.0 shops like Technorati, Digg, etc use Debian.
  • Which web server do you use? Apache 2.0. Lighttpd is looking interesting as well.
  • Which reverse proxy do you use? About 95% of the pages of Tailrank are served from Squid.
  • How is your system deployed in data centers? We use ServerBeach for hosting. It's a great model for small to medium sized startups. They rack the boxes, maintain inventory, handle network, etc. We just buy new machines and pay a flat markup. I wish Dell, SUN, HP would sell directly to clients in this manner. One right now. We're looking to expand into two for redundancy.
  • What is your storage strategy? Directly attached storage. We buy two SATA drives per box and set them up in RAID 0. We use the redundant array of inexpensive databases solution so if an individual machine fails there's another copy of the data on another box. Cheap SATA disks rule for what we do. They're cheap, commodity, and fast.
  • Do you have a standard API to your website? Tailrank has RSS feeds for every page. The Spinn3r service is itself an API and we have extensive documentation on the protocol. It's also free to use for researchers so if any of your readers are pursuing a Ph.D and generally doing research work and needs access to blog data we'd love to help them out. We already have the Ph.D students at the University of Washington and University of Maryland (my Alma Matter) using Spinn3r.
  • Which DNS service do you use? PowerDNS. It's a great product. We only use the recursor daemon but it's FAST. It uses async IO though so it doesn't really scale across processors on multicore boxes. Apparenty there's a hack to get it to run across cores but it isn't very reliable. AAA caching might be broken though. I still need to look into this.
  • Who do you admire? Donald Knuth is the man!
  • How are you thinking of changing your architecture in the future? We're still working on finishing up a fully sharded database. MySQL fault tolerance and autopromotion is also an issue.

    Click to read more ...