Entries in Example (248)

Monday
Mar302009

Lavabit Architecture - Creating a Scalable Email Service

Ladar Levison of Lavabit has written an incredible article on how they took a centralized off-the-shelf email server that could handle only few thousand users and built their own custom distributed infrastructure for handling hundreds of thousands of email users. Lavabit processes 70 gigabytes of data per day, is made up of 26 servers, hosts 260,000 email addresses, and processes 600,000 emails a day. That's a lot of email.

Lavabit's mission has a little edge to it too:

Lavabit was founded as a direct reaction to the larger free e-mail services available. We felt it was possible to create an e-mail service that was fast, reliable, feature rich and didn't achieve profitability by prostituting its user base to marketers.

What I really like about this article is that Lavabit has some challenging elements in dealing with different email protocols while being able to scale to a lot of users. There's more going on than just trying to scale out a database. Many products contain complicated bits like this, so it's interesting to see how Ladar handled them. There are lots of useful details that will help anyone build their own system. Putting in this extra work in is what Ladar thinks makes Lavabit different:

One of the ways to gain an advantage over your competition is to invest the time and money needed to build systems that are better than what is easily available to your competition. It is the custom platform we developed that has allowed us to thrive while many other free email companies either stopped offering their service for free, or shut down altogether.

Since Ladar was so thorough I saved article as a separate html file. Please select the visit link to read the entire article. I'd like to thank Ladar again for taking the time and making the effort to document their architecture for the benefit of the community at large to learn from.

Saturday
Feb142009

Scaling Digg and Other Web Applications

Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn't find the actual presentation, but fortunately Kris Jordan took some great notes. That's how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call. In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me however was how Joe thought MemcacheDB Will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I've never thought of it in that way. Well learn why Joe is so excited by MemcacheDB at the end of the post.

Impressive Stats

  • 80th-100th largest site in the world
  • 26 million uniques a month
  • 30 million users.
  • Uniques are only half that traffic. Traffic = unique web visitors + APIs + Digg buttons.
  • 2 billion requests a month
  • 13,000 requests a second, peak at 27,000 requests a second.
  • 3 Sys Admins, 2 DBAs, 1 Network Admin, 15 coders, QA team
  • Lots of servers.

    Scaling Strategies

  • Scaling is specialization. When off the shelf solutions no longer work at a certain scale you have to create systems that work for your particular needs.
  • Lesson of web 2.0: people love making crap and sharing it with the world.
  • Web 2.0 sucks for scalability. Web 1.0 was flat with a lot of static files. Additional load is handled by adding more hardware. Web 2.0 is heavily interactive. Content can be created at a crushing rate.
  • Languages don't scale. 100% of the time bottlenecks are in IO. Bottlenecks aren't in the language when you are handling so many simultaneous requests. Making PHP 300% faster won't matter. Don't optimize PHP by using single quotes instead of double quotes when the database is pegged.
  • Don’t share state. Decentralize. Partitioning is required to process a high number of requests in parallel.
  • Scale out instead of up. Expect failures. Just add boxes to scale and avoid the fail.
  • Database-driven sites need to be partitioned to scale both horizontally and vertically. Horizontal partitioning means store a subset of rows on a different machines. It is used when there's more data than will fit on one machine. Vertical partitioning means putting some columns in one table and some columns in another table. This allows you to add data to the system without downtime.
  • Data are separated into separate clusters: User Actions, Users, Comments, Items, etc.
  • Build a data access layer so partitioning is hidden behind an API.
  • With partitioning comes the CAP Theorem: you can only pick two of the following three: Strong Consistency, High Availability, Partition Tolerance.
  • Partitioned solutions require denormalization and has become a big problem at Digg. Denormalization means data is copied in multiple objects and must be kept synchronized.
  • MySQL replication is used to scale out reads.
  • Use an asynchronous queuing architecture for near-term processing. - This approach pushes chunks of processing to another service and let's that service schedule the processing on a grid of processors. - It's faster and more responsive than cron and only slightly less responsive than real-time. - For example, issuing 5 synchronous database requests slows you down. Do them in parallel. - Digg uses Gearman. An example use is to get a permalink. Three operations are done parallel: get the current logged, get the permalink, and grab the comments. All three are then combined to return a combined single answer to the client. It's also used for site crawling and logging. It's a different way of thinking. - See Flickr - Do the Essential Work Up-front and Queue the Rest and The Canonical Cloud Architecture for more information.
  • Bottlenecks are in IO so you have tune the database. When the database is bigger than RAM the disk is hit all the time which kills performance. As the database gets larger the table can't be scanned anymore. So you have to: - denormalize - avoid joins - avoid large scans across databases by partitioning - cache - add read slaves - don't use NFS
  • Run numbers before you try and fix a problem to make sure things actually will work.
  • Files like for icons and photos are handled by using MogileFS, a distributed file system. DFSs support high request rates because files are distributed and replicated around a network.
  • Cache forever and explicitly expire.
  • Cache fairly static content in a file based cache.
  • Cache changeable items in memcached
  • Cache rarely changed items in APC. APC is a local cache. It's not distributed so no other program have access to the values.
  • For caching use the Chain of Responsibility pattern. Cache in MySQL, memcached APC, and PHP globals. First check PHP globals as the fastest cache. If not present check APC, memcached and on up the chain.
  • Digg's recommendation engine is a custom graph database that is eventually consistent. Eventually consistent means that writes to one partition will eventually make it to all the other partitions. After a write reads made one after another don't have to return the same value as they could be handled by different partitions. This is a more relaxed constraint than strict consistency which means changes must be visible at all partitions simultaneously. Reads made one after another would always return the same value.
  • Assume 1 million people a day will bang on any new feature so make it scalable from the start. Example: the About page on Digg did a live query against the master database to show all employees. Just did a quick hack to get out. Then a spider went crazy and took the site down.

    Miscellaneous

  • Digg buttons were a major key to generating traffic.
  • Uses Debian Linux, Apache, PHP, MySQL.
  • Pick a language you enjoy developing in, pick a coding standard, add inline documentation that's extractable, use a code repository, and a bug tracker. Likes PHP, Track, and SVN.
  • You are only as good as your people. Have to trust guy next to you that he's doing his job. To cultivate trust empower people to make decisions. Trust that people have it handled and they'll take care of it. Cuts down on meetings because you know people will do the job right.
  • Completely a Mac shop.
  • Almost all developers are local. Some people are remote to offer 24 hour support.
  • Joe's approach is pragmatic. He doesn't have a language fetish. People went from PHP, to Python/Ruby, to Erlang. Uses vim. Develops from the command line. Has no idea how people constantly change tool sets all the time. It's not very productive.
  • Services (SOA) decoupling is a big win. Digg uses REST. Internal services return a vanilla structure that's mapped to JSON, XML, etc. Version in URL because it costs you nothing, for example: /1.0/service/id/xml. Version both internal and external services.
  • People don't understand how many moving parts are in a website. Something is going to happen and it will go down.

    MemcacheDB: Evolutionary Step for Code, Revolutionary Step for Performance

    Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear. You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that. The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that’s 300 million diggs day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers. With such a heavy write load MySQL wasn’t going to work for Digg. That’s where MemcacheDB comes in. In Initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge. What is MemcacheDB? It's a distributed key-value storage system designed for persistent. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to memcache protocol(not completed, see below), so any memcached client can have connectivity with it. MemcacheDB uses Berkeley DB as a storing backend, so lots of features including transaction and replication are supported. Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast. Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top. I asked Joe why he didn't turn to one of the in-memory data grid solutions? Some of the reasons were:
  • This data is generated from many different databases and takes a long time to generate. So they want it in a persistent store.
  • MemcacheDB uses the memcache protocol. Digg already uses memcache so it's a no-brainer to start using MemcacheDB. It's easy to use and easy to setup.
  • Operations is happy with deploying it into the datacenter as it's not a new setup.
  • They already have memcached high availability and failover code so that stuff already works.
  • Using a new system would require more ramp-up time.
  • If there are any problems with the code you can take a look. It's all open source.
  • Not sure those other products are stable enough. So it's an evolutionary step for code and a revolutionary step for performance. Digg is looking at using MemcacheDB across the board.

    Related Articles

  • Scaling Digg and Other Web Applications by Kris Jordan.
  • MemcacheDB
  • Joe Stump's Blog
  • MemcachedRelated Tags on HighScalability
  • Caching Related Tags on HighScalability
  • BigTable
  • SimpleDB
  • Anti-RDBMS: A list of distributed key-value stores
  • An Unorthodox Approach to Database Design : The Coming of the Shard
  • Flickr Architecture
  • Episode 4: Scaling Large Web Sites with Joe Stump, Lead Architect at DIGG

    Click to read more ...

  • Thursday
    Feb122009

    MySpace Architecture

    Update:Presentation: Behind the Scenes at MySpace.com. Dan Farino, Chief Systems Architect at MySpace shares details of some of MySpace's cool internal operations tools. MySpace.com is one of the fastest growing site on the Internet with 65 million subscribers and 260,000 new users registering each day. Often criticized for poor performance, MySpace has had to tackle scalability issues few other sites have faced. How did they do it? Site: http://myspace.com

    Information Sources

  • Presentation: Behind the Scenes at MySpace.com
  • Inside MySpace.com

    Platform

  • ASP.NET 2.0
  • Windows
  • IIS
  • SQL Server

    What's Inside?

  • 300 million users.
  • Pushes 100 gigabits/second to the internet. 10Gb/sec is HTML content.
  • 4,500+ web servers windows 2003/IIS 6.0/APS.NET.
  • 1,200+ cache servers running 64-bit Windows 2003. 16GB of objects cached in RAM.
  • 500+ database servers running 64-bit Windows and SQL Server 2005.
  • MySpace processes 1.5 Billion page views per day and handles 2.3 million concurrent users during the day
  • Membership Milestones: - 500,000 Users: A Simple Architecture Stumbles - 1 Million Users:Vertical Partitioning Solves Scalability Woes - 3 Million Users: Scale-Out Wins Over Scale-Up - 9 Million Users: Site Migrates to ASP.NET, Adds Virtual Storage - 26 Million Users: MySpace Embraces 64-Bit Technology
  • 500,000 accounts was too much load for two web servers and a single database.
  • At 1-2 Million Accounts - They used a database architecture built around the concept of vertical partitioning, with separate databases for parts of the website that served different functions such as the log-in screen, user profiles and blogs. - The vertical partitioning scheme helped divide up the workload for database reads and writes alike, and when users demanded a new feature, MySpace would put a new database online to support it. - MySpace switched from using storage devices directly attached to its database servers to a storage area network (SAN), in which a pool of disk storage devices are tied together by a high-speed, specialized network, and the databases connect to the SAN. The change to a SAN boosted performance, uptime and reliability.
  • At 3 Million Accounts - the vertical partitioning solution didn't last because they replicated some horizontal information like user accounts across all vertical slices. With so many replications one would fail and slow down the system. - individual applications like blogs on sub-sections of the Web site would grow too large for a single database server - Reorganized all the core data to be logically organized into one database - split its user base into chunks of 1 million accounts and put all the data keyed to those accounts in a separate instance of SQL Server
  • 9 Million–17 Million Accounts - Moved to ASP.NET which used less resources than their previous architecture. 150 servers running the new code were able to do the same work that had previously required 246. - Saw storage bottlenecks again. Implementing a SAN had solved some early performance problems, but now the Web site's demands were starting to periodically overwhelm the SAN's I/O capacity—the speed with which it could read and write data to and from disk storage. - Hit limits with the 1 million-accounts-per-database division approach as these limits were exceeded. - Moved to a virtualized storage architecture where the entire SAN is treated as one big pool of storage capacity, without requiring that specific disks be dedicated to serving specific applications. MySpace now standardized on equipment from a relatively new SAN vendor, 3PARdata
  • Added a caching tier—a layer of servers placed between the Web servers and the database servers whose sole job was to capture copies of frequently accessed data objects in memory and serve them to the Web application without the need for a database lookup.
  • 26 Million Accounts - Moved to 64-bit SQL server to work around their memory bottleneck issues. Their standard database server configuration uses 64 GB of RAM.
  • Horizontally Federated Database. Databases are partition by purpose. Have profile, email databases etc. Partition is based on user range. 1 Million users live in each database. So you have Profile1, Profile2 all the way up to Profile300 as they have 300 million users.
  • Doesn't use ASP cache because they don't have a high enough hit rate on the front-end. The middle tier cache does have a high hit rate.
  • Failure isolation. Segment requests into web server by database. Allow only 7 threads per database. So if the database is slow only those threads will slowdown and the traffic in the other threads will flow.

    Operations

  • PerfCollector. Centralized collection of performance data via UDP. More reliable than Windows and allows any client to connect and see stats.
  • Web Based Stack Dump Tool. Can right-click on a problem server and get stack dump of the .Net managed threads. Used to have to RDC into system and attach a debugger and 1/2 later get an answer. Slow, nonscalable, and tedious. Not just a stack dump, gives a lot of context about what the thread is doing. Troubleshooting is easier because you can see 90 threads are blocked on a database so the database may be down.
  • Web Base Heap Dump Tool. Dumps all memory allocations. Very useful for developers. Save hours of doing it by hand.
  • Profiler. Traces a request from start to finish and produces a report. See URL, methods, status, everything that will help you identify a slow request. Looks at lock contentions, are a lot of exceptions being thrown, anything that might be interesting. Very light weight. It's running on one box in every VIP (group of 100 servers) in production. Samples 1 thread every 10 seconds. Always tracing in background.
  • Powershell. Microsoft's new shell that runs in process and pass objects between commands versus parsing text output. MySpace develops a lot of commandlets to support operations.
  • Developed their own asynchronous communication technology to get around windows networking problems and treat servers as a group. Can ship a .cs file, compile it, run it, and ship the response back.
  • Codespew. Pushes code updates on their communication technology. Used to do 5 code pushes a day, now down to 1 a week.

    Lessons Learned

  • You can build big websites using Microsoft tech.
  • A cache should have been used from the beginning.
  • The cache is a better place to store transitory data that doesn't need to be recorded in a database, such as temporary files created to track a particular user's session on the Web site.
  • Built in OS features to detect denial of service attacks can cause inexplicable failures.
  • Distribute your data to geographically diverse data centers to handle power failures.
  • Consider using virtualized storage/clustered file systems from the start. It allows you to massively parallelize IO access while being able to add disk as needed without any reorganization needed.
  • Develop tools that work in a production environment. Can't simulate everything in test environment. The scale and variety of uses APIs are put to can't be simulated in QA during testing. Legitimate users and hackers will run into corner cases that weren't hit in testing, though QA will find most of the problems.
  • Throw hardware at problems. Easier than changing their backend software to a new way of doing things. The example is they add a new database server for every million users. It might be more efficient to change their approach to more efficiently use the database hardware, but it's easier just to add servers. For now.

    Click to read more ...

  • Saturday
    Dec202008

    Second Life Architecture - The Grid

    Update:Presentation: Second Life’s Architecture. Ian Wilkes, VP of Systems Engineering, describes the architecture used by the popular game named Second Life. Ian presents how the architecture was at its debut and how it evolved over years as users and features have been added. Second Life is a 3-D virtual world created by its Residents. Virtual Worlds are expected to be more and more popular on the internet so their architecture might be of interest. Especially important is the appearance of open virtual worlds or metaverses. What happens when video games meet Web 2.0? What happens is the metaverse.

    Information Sources

    Platform

    • MySQL
    • Apache
    • Squid
    • Python
    • C++
    • Mono
    • Debian

    What's Inside?

    The Stats

    • ~1M active users
    • ~95M user hours per quarter
    • ~70K peak concurrent users (40% annual growth)
    • ~12Gbit/sec aggregate bandwidth (in 2007)

    Staff (in 2006)

    • 70 FTE + 20 part time
    "about 22 are programmers working on SL itself. At any one time probably 1/3 of the team is on infrastructure, 1/3 is on new features and 1/3 is on various maintenance tasks (bug fixes, general stability and speed improvements) or improvements to existing features. But it varies a lot."

    Software

    Client/Viewer
    • Open Source client
    • Render the Virtual World
    • Handles user interaction
    • Handles locations of objects
    • Gets velocities and does simple physics to keep track of what is moving where
    • No collision detection
    Simulator (Sim) Each geographic area (256x256 meter region) in Second Life runs on a single instantiation of server software, called a simulator or "sim." And each sim runs on a separate core of a server. The Simulator is the primary SL C++ server process which runs on most servers. As the viewer moves through the world it is handled off from one simulator to another.
    • Runs Havok 4 physics engine
    • Runs at 45 frames/sec. If it can't keep up, it will attempt time dialation without reducing frame rate.
    • Handles storing object state, land parcel state, and terrain height-map state
    • Keeps track of where everything is and does collision detection
    • Sends locations of stuff to viewer
    • Transmits image data in a prioritized queue
    • Sends updates to viewers only when needed (only when collision occurs or other changes in direction, velocity etc.)
    • Runs Linden Scripting Language (LSL) scripts
    • Scripting has been recently upgraded to the much faster Mono scripting engine
    • Handles chat and instant messages
      • Asset Server
        • One big clustered filesystem ~100TB
        • Stores asset data such as textures.
        MySQL database Second Life has started with One Database, and have subsequently been forced into clustering. They use a ton of MySQL databases running on Debian machines to handle lots of centralized services. Rather than attempt to build the one, impossibly large database – all hail the Central Database – or one, impossibly large central cluster – all hail the Cluster – Linden Lab instead adopted a divide and conquer strategy based around data partitioning. The good thing is that UUIDs– 128-bit unique identifiers – are associated with most things in Second Life, so partitioning is generally doable. Backbone Linden Lab has converted much of their backend architecture away from custom C++/messaging into web services. Certain services has been moved off of MySQL – or cached (Squid) between the queries and MySQL. Presence, in particular Agent Presence, ie are you online and where are you on the grid, is a particularly tricky kind of query to partition, so there is now a Python service running on the SL grid called Backbone. It proved to be easier to scale, develop and maintain than many of their older technologies, and as a result, it plays an increasingly important role in the Second Life platform as Linden Lab migrates their legacy code to web services. Two main components of the backbone are open source:
        • Eventlet is a networking library written in Python. It achieves high scalability by using non-blocking io while at the same time retaining high programmer usability by using coroutines to make the non-blocking io operations appear blocking at the source code level.
        • Mulib is a REST web service framework built on top of eventlet

        Hardware

        • 2000+ Servers in 2007
        • ~6000 Servers in early 2008
        • Plans to upgrade to ~10000 (?)
        • 4 sims per machine, for both class 4 and class 5
        • Used all-AMD for years, but are moving from the Opteron 270 to the Intel Xeon 5148
        • The upgrade to "class 5" servers doubled the RAM per machine from 2GB to 4GB and moved to a faster SATA disk
        • Class 1 - 4 are on 100Mb with 1Gb uplinks to the core. Class 5 is on pure 1Gb
        Do you have more details?

        Click to read more ...

    Saturday
    Nov222008

    Google Architecture

    Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing higher scaling infrastructure to support their products. How do they do that?

    Information Sources

  • Video: Building Large Systems at Google
  • Google Lab: The Google File System
  • Google Lab: MapReduce: Simplified Data Processing on Large Clusters
  • Google Lab: BigTable.
  • Video: BigTable: A Distributed Structured Storage System.
  • Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
  • How Google Works by David Carr in Baseline Magazine.
  • Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
  • Dare Obasonjo's Notes on the scalability conference.

    Platform

  • Linux
  • A large diversity of languages: Python, Java, C++

    What's Inside?

    The Stats

  • Estimated 450,000 low-cost commodity servers in 2006
  • In 2005 Google indexed 8 billion web pages. By now, who knows?
  • Currently there over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
  • Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
  • BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

    The Stack

    Google visualizes their infrastructure as a three layer stack:
  • Products: search, advertising, email, maps, video, chat, blogger
  • Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
  • Computing Platforms: a bunch of machines in a bunch of different data centers
  • Make sure easy for folks in the company to deploy at a low cost.
  • Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

    Reliable Storage Mechanism with GFS (Google File System)

  • Reliable scalable storage is a core need of any application. GFS is their core storage platform.
  • Google File System - large distributed log structured file system in which they throw in a lot of data.
  • Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required: - high reliability across data centers - scalability to thousands of network nodes - huge read/write bandwidth requirements - support for large blocks of data which are gigabytes in size. - efficient distribution of operations across nodes to reduce bottlenecks
  • System has master and chunk servers. - Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the needed they need on disk. - Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
  • A new application coming on line can use an existing GFS cluster or they can make your own. It would be interesting to understand the provisioning process they use across their data centers.
  • Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.

    Do Something With the Data Using MapReduce

  • Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across a 1000 machines. Databases don't scale or cost effectively scale to those levels. That's where MapReduce comes in.
  • MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
  • Why use MapReduce? - Nice way to partition tasks across lots of machines. - Handle machine failure. - Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc. - Computation can automatically move closer to the IO source.
  • The MapReduce system has three different types of servers. - The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks. - The Map servers accept user input and performs map operations on them. The results are written to intermediate files - The Reduce servers accepts intermediate files produced by map servers and performs reduce operation on them.
  • For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically. - The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS. - In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count. - Shuffling aggregates key types. - The reductions sums up all the key value pairs and produces the final answer.
  • The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data with a whole bunch of records and aggregating keys. A second map-reduce comes a long, takes that result and does something else. And so on.
  • Programs can be very small. As little as 20 to 50 lines of code.
  • One problem is stragglers. A straggler is a computation that is going slower than others which holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple of the same computations and when one is done kill all the rest.
  • Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.

    Storing Structured Data in BigTable

  • BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
  • BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
  • It provides lookup mechanism to access structured data by key. GFS stores opaque data and many applications needs has data with structure.
  • Commercial databases simply don't scale to this level and they don't work across 1000s machines.
  • By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
  • Machines can be added and deleted while the system is running and the whole system just works.
  • Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp.
  • Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
  • BigTable has three different types of servers: - The Master servers assign tablets to tablet servers. They track where tablets are located and redistributes tasks as needed. - The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, then a 100 tablet servers each pickup 1 new tablet and the system recovers. - The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master aribtration, and access control checking require mutual exclusion.
  • A locality group can be used to physically store related bits of data together for better locality of reference.
  • Tablets are cached in RAM as much as possible.

    Hardware

  • When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
  • Use ultra cheap commodity hardware and built software on top to handle their death.
  • A 1,000-fold computer power increase can be had for a 33 times lower cost if you you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
  • Linux, in-house rack design, PC class mother boards, low end storage.
  • Price per wattage on performance basis isn't getting better. Have huge power and cooling issues.
  • Use a mix of collocation and their own data centers.

    Misc

  • Push changes out quickly rather than wait for QA.
  • Libraries are the predominant way of building programs.
  • Some are applications are provided as services, like crawling.
  • An infrastructure handles versioning of applications so they can be release without a fear of breaking things.

    Future Directions for Google

  • Support geo-distributed clusters.
  • Create a single global namespace for all data. Currently data is segregated by cluster.
  • More and better automated migration of data and computation.
  • Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

    Lessons Learned

  • Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at scale at few others can compete with. Many companies take a completely different approach. Many companies treat infrastructure as an expense. Each group will use completely different technologies and their will be little planning and commonality of how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.
  • Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.
  • Take a look at Hadoop (product) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.
  • An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.
  • Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.
  • Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.
  • Create a Darwinian infrastructure. Perform time consuming operation in parallel and take the winner.
  • Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.
  • Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.

    Click to read more ...

  • Monday
    Oct272008

    Notify.me Architecture - Synchronicity Kills

    What's cool about starting a new project is you finally have a chance to do it right. You of course eventually mess everything up in your own way, but for that one moment the world has a perfect order, a rightness that feels satisfying and good. Arne Claassen, the CTO of notify.me, a brand new real time notification delivery service, is in this honeymoon period now. Arne has been gracious enough to share with us his philosophy of how to build a notification service. I think you'll find it fascinating because Arne goes into a lot of useful detail about how his system works. His main design philosophy is to minimize the bottlenecks that form around synchronous access, that is when some resource is requested and the requestor ties up more resources, waiting for a response. If the requested resource can’t be delivered in a timely manner, more and more requests pile up until the server can’t accept any new ones. Nobody gets what they want and you have an outage. Breaking synchronous operations into asynchronous operations by separating request and response into separate message passing actions, stops the resource overload. Instead of a system going down from too many parallel requests, it can works its way through a backlog of requests as fast as it can. And in most cases the request/response cycles are so fast that they appear like a linear sequence of events. Notify.me is taking the innovative and risky strategy of using ejabberd, an XMPP based system, as their internal messaging and routing layer. Will Erlang and Mnesia (Erlang's database) be able to keep up with traffic and keep low latencies as traffic scales? It will be interesting to find out. If you are interested in notify.me they've kindly offered 500 beta accounts for HS readers: http://notify.me/user/account/create/highscale

    Who are you?

    My name is Arne Claassen, the CTO of notify.me. I've been working on highly scalable web based applications and services for the past decade. These sites have employed various combinations of traditional scaling techniques such as server farms, caching, content pregeneration and highly available databases using replication and clustering. All of these techniques are ways to mitigate scarce resources (generally the database) being in contention by many users. Knowing the benefits and pitfalls of these techniques, it has become my focus to architect systems that circumvent scarce resource scenarios.

    What is notify.me, why did you make it, and why is it a good thing?

    notify.me is the brainchild of Jason Wieland, our CEO. It's a near real time notification service that alerts users to new content published on the web. It was created to address the common user pain of staying up to date of time critical events that occur on the web. For instance, a user searching for an apartment on Craigslist would want to be alerted once a new one matching their search criteria is posted. notify.me does the grunt work of repeatedly checking Craigslist for new listings and alerts the user once a new one is posted. Notifications can be delivered to instant messenger, desktop application, mobile device, email, and web application. Our goal is to create and publish open APIs allowing people to build new and interesting applications for generating and delivering information.

    How does your service compare to other services people might be familiar with? Like Twitter, Friend Feed, Gnip, or Yahoo Pipes?

    There are quite a few companies that are in our competitive landscape, some of which are direct competitors, like yotify.com or alerts.com. The main difference is our approach. Yotify and alerts are focused on being notification portal sites for users to visit. notify.me is a utility, with a focus of offering all the functionality available on the website via XMPP and REST APIs, allowing users to interact with notifications from the application of their choosing. We also allow for escalation of messages to destinations. For example if a user is not logged into their IM or have a status of away, notifications can be escalated and routed to their mobile device. In the messaging arena, we are nearly the opposite of twitter. Twitter is inward facing publishing model based on its own user generated content. Someone makes a tweet and it gets published to their followers. notify.me is creating an externally facing message delivery system. Users add any website that supports web feed standards, or redirect existing notification emails to us. If anything, we are a messaging pipeline that is complementary to twitter (more on that below). Friendfeed does a great job at combining all your social networks together into one centralized area. They're primary focus is to build features and tools to interact with the mashed feed. This feed would be perfect to add as a source to notify.me, allowing a user to receive all social network updates over instant messenger. Being able to know in near real time when you have a new posting on your wall so that you can immediately respond is a feature the social addicts want. Yahoo Pipes would be considered as a possible partner, similar to how they upsell Netvibes and Newsgator. Their focus is to provide an intuitive programming interface to be able to manipulate feeds and create a useful mashup. For example the Hot Deals Search is a nifty pipe that searches over a collection of sites for the best deal. Users might not want to use Yahoo's own notification options due to the limited options of destinations. In our beta group we've seen similar activity with users adding ebay links. Ebay has a competitive notification pipeline to notify.me however, users still add ebay search links to our service. It turns out they they would like one central place to manage their various news feeds. Gnip is a pure infrastructure play. We have similar technology but we are going after completely different markets. An additional core feature of our product that we have not yet exposed is that our pipelines is bi-directional, i.e. any data source can also be a destination and vice-verse. The primary benefit of this is the ability of allowing messages to be responded to, such as acknowledging a support ticket you received. Bi-directional communication will require integration with the notify.me API through which a source can communicate reply options. We currently are developing a deep integration with the Twitter API to provide two-way capabilities for tweets via IM in the same channel that you already receive your other notifications.

    Can you explain the different parts of notify.me and how they connect together?

    In general terms our system consists of three subsystems, each of which has a number of implementations. 1. Ingestion consists of rss and email ingestors, which constantly check the user's email address (username@notify.me) and the user's feeds for new data. New data is turned into notifications which are propagated to routing. 2. Routing is responsible for getting the user's notifications to the right delivery components. Routing is the point in the system that the user interacts with for management, such as changing sources and destinations, and viewing history. Notification history is a specialized delivery component, allowing all messages to be perused via the website, even after they have gone through the entire pipeline. 3. Delivery is currently comprised of history (which always gets the messages), Xmpp IM, SMS and email, with private RSS, AIM and MSN in development. On a more technical level, the topology of this system is comprised of two separate message busses: 1. Store-and-forward queues (using simpleMQ) 2. XMPP (using ejabberd) Store and forward queues are used by the ingestion side to distribute the work of ingestion and generally anywhere where data is handled before it becomes visible to the routing rules of the user. This allows for scaling flexibility as well as process isolation during a component failure. The Xmpp bus is called the Avatar bus, named thusly because every data owning entity is represented by a daemon process that is the sole authority for that entities data. We have four types of avatars, Monitor, Agent, Source and User 1. Monitor avatars are simply the responsible parties for observing instance health and spinning up and shutting down additional computing nodes per demand. 2. Agent avatars are the delivery gateways that provide presence information of our users into the bus and deliver messages to our users. 3. Source avatars are ingestors, such as an RSS. This avatar pulls new messages from a store and forward queue and notifies its subscribers of the new message. 4. User avatars persists all the configuration and messaging data for a particular user. It is responsible for receiving new notifications from ingestion avatars, deciding on the routing and pushing messages to the appropriate delivery agent as that agent declares the ability to execute that delivery.

    What particular challenges did you face and how did you overcome them? What options did you consider and why did you decide to do it another way?

    From the onset, our primary goal was to avoid bottlenecks and hindrances to scaling horizontally. Initially we planned on building the entire system as a stateless flow of messages through queues, with each daemon along the line being responsible for the data flowing through it, merging, multiplexing and routing it to the next point until delivery was achieved. This meant no single part's failure would ever affect the whole, other than queues getting backed up. However early on we realized that once a message was designated for a user we needed the ability to track where it was and be able to re-route it dynamically depending on user configuration and presence. This lead us to add a bit of inappropriate coupling between daemons via REST apis. Some of this plumbing still exists, as we're still migrating processing over to the combination queue/async bus architecture. As we realized that pure message passing without state was not going to satisfy our dynamic needs, the easy solution was to return to the tried state keeper, the central relational database. Knowing our scaling goals, this would have introduced a point of failure that sooner or later we would not be able to mitigate. We decided to look at our state in a different way and instead of thinking of creating processing units based on function (i.e. ingestion, parsing, transformation, routing, delivery) that queried state based on the data flowing through it, we thought of the units in terms of data ownership, i.e. sources and destinations (users). Once on that track, there was precious little shared state and we were able to change our storage pattern to have each owner of data be responsible for its own data, allowing horizontal scaling of the persistence layer, as well as much more efficient caching. The remaining need for accessing data across owner is analytics. In many systems analytics is a primary reason for the existence of a central database, since too often facts and dimensions are kept intermingled in the production schema. For our purposes, this data is not a production concern and therefore should never affect live capacity. Usage and state changes are treated as immutable events, which are queued at the point of occurence into our store-and-forward system. The nature of our store-and-forward queue allows us to automatically gather all these events from all hosts to a central archive which can then be processed into fact and dimension data by ETL processes. This allows for near real-time tracking of usage without affecting user facing systems.

    Could you explain your choice of XMPP a little more? Is it used mainly as a message bus between federated XMPP servers sitting on EC2 nodes? Is the XMPP queue used as the queue for each user's messages from all sources before they are pushed to users?

    We have three different xmpp clusters which take advantage of federation for cross-chatter: user, agent and the avatar bus.

    Users

    This is a regular xmpp IM server on which we create accounts for each of our users, providing them an IM account that they can use from any Xmpp capable client. This account also serves as the user our desktop app signs in as and that will be the authentication for our API for third party message ingestion and distribution

    Agents

    The daemons connecting to this cluster serve as communication bridges between our internal Avatar bus and outside clients. Currently this is primarily for communicating with chat clients, as every user is assigned an agent that they communicate without us through, regardless of whether they use their default account or some third party account such as jabber.org, googletalk, etc. We also are testing a client API that uses Xmpp RPC via these agents for dedicated Apps. In the future we will also offer full XMPP and REST APIs for third party integration that will use the agents to communicate with the Avatar bus. I mentioned earlier that Agents are avatars as well, however they are a little special in that they do not have a user on the Avatar bus but talk to other avatars through cross server federation. We currently are also building agents for the Oscar and MSN networks that will sit directly on the avatar bus since their native transport is not federated. We also plan to evaluate other networks for possible future support.

    Avatars

    Avatars is our internal message bus that we use to route and process all commands and messages. We primarily use direct messaging and IQ based RPC stanzas between avatars, although we do take advantage of presence for monitoring. So what is an avatar? It's a daemon (where a single physical daemon process can host many avatar daemons) that is the authority for some external entity's data. I.e. every user registered with notify.me has an avatar that monitors agents for status changes, receives messages in care of that user and is responsible for routing those messages to the appropriate delivery channel. This means that every avatar is the single authority for all data about that user and is responsible to persisting the data. If some other part of the system wants to know something about that user or modify its data, it uses Xmpp RPC to talk to the avatar, rather than some central database. Right now, avatars persist to disk and SimpleDB, while keeping a ttl-regulated cache in process. Since only the avatar can write its own data, it does not need to check the DB but can treat its memory and disk cache as authoritative and SDB is used primarily for writes. Reads are needed only in the case of a node failure to bring up the avatar on another node. At the other end of the bus we have our ingestors. Ingestors are made up of a number of daemons, generally running on polling loops against external sources, queueing new data into our store-and-forward queues, where the appropriate ingestor avatar picks up new messages and distributes them to its subscribers. In the ingestor avatar scenario, it is the authority on subscription and routing data. Here's a typical use case: A user subscribes to an RSS feed via the web interface. The web interface sends the request to the user's avatar, which persists the subscription for reference and then requests the subscription from the rss ingestor. As new rss items arrive, the rss ingestor multiplexes items to all user avatars that subscribe to that feed. The user avatars in turn determine the appropriate delivery mechanism and schedule delivery. In general that means that the user avatar is subscribed to the user's Xmpp presence via the users' agent. Until the user is in the proper state for accepting messages, the user avatar queues the rss items. Once the user is ready to receive the notification, the presence change is propagated from the agent into the internal bus and the user avatar then sends the rss items to agent, which in turn delivers it to the user. Right now, all avatars are always online (even if mostly idle), which is fine for our present user base size. Our plan is to mod the offline storage module of ejabberd so that we can blind fire stanzas and have queued messages signal a monitor to spin up the appropriate avatar for the destination XmppId. Once this system is in place we will be able to spin up on avatars on demand and shut down them down on idle.

    At what traffic load do you expect your current architecture to break and what's your plan?

    Since our system is distributed and asynchronous by design, we should avoid systemwide failures under load. However, while avoiding all the usual bottlenecks, the reality is that our message bus, which makes all this possible, will likely become our limiting factor, either because it cannot handle the number of avatars (nodes on the bus) or because latency on the bus becomes unacceptable. We're only starting to use the avatar system as our backbone, so it's still a bit fragile and we're still doing load testing on ejabberd to determine at what point we run into limiting factors. While we are already clustering ejabberd, the load of mnesia database replication and cross node chatter means that either number of connections or latency will eventually cause the cluster to fail or simply consume too much memory to be managable. Since our messaging is primarily point-to-point, we anticipate that we can split our user base into avatar silos, each hosted on a dedicated avatar subdomain cluster, reducing message and connection load. As long as our silos are appropriately designed to keep crosssubdomain chatter to a minimum, we should be able to have n silos to keep on top of load. Our single greatest challenge to avoid this architecture failing is eternal vigilance against introducing features that create messaging bottlenecks. A significant amount of our message traffic passing through a single processor or family of processors, would introduce dependencies we cannot scale ourselves out of with subdomain division.

    Related Articles

  • notify.me tech blog - INotification
  • Flickr - Do the Essential Work Up-front and Queue the Rest

    Click to read more ...

  • Wednesday
    Oct222008

    EVE Online Architecture

    EVE Online is "The World's Largest Game Universe", a massively multiplayer online game (MMO) made by CCP. EVE Online's Architecture is unusual for a MMOG because it doesn't divide the player load among different servers or shards. Instead, the same cluster handles the entire EVE universe. It is an interesting to compare this with the Architecture of the Second Life Grid. How do they manage to scale?

    Information Sources

    Platform

    • Stackless Python used for both server and client game logic. It allows programmers to reap the benefits of thread-based programming without the performance and complexity problems associated with conventional threads.
    • SQL Server
    • Blade servers with SSDs for high IOPS
    • Plans to use Infiniband interconnects for low latency networking
    • What's Inside?

      The Stats

      • Founded in 1997
      • ~300K active users
      • Up to 40K concurrent users
      • Battles involving hundreds of ships
      • 250M transactions per day

      Architecture

      The EVE Cluster is broken into 3 distinct layers
      • Proxy Blades - These are the public facing segment of the EVE Cluster - they are responsible for taking player connections and establishing player communication within the rest of the cluster.
      • SOL Blades - These are the workhorses of Tranquility. The cluster is divided across 90 - 100 SOL blades which run 2 nodes each. A node is the primarily CPU intensive EVE server process running on one core. There are some SOL blades dedicated to one busy solar systems such as Jita, Motsu and Saila.
      • Database Cluster - This is the persistence layer of EVE Online. The running nodes interact heavily with the Database, and of course pretty much everything to do with the game lives here. Thanks to Solid-state drives, the database is able to keep up with the enormous I/O load that Tranquility generates.

      Lessons Learned

      There are many interesting facts about the architecture of the EVE Online MMOG such as the use of Stacless Python and SSDs.
      • With innovative ideas MMO games can scale up to the hundreds of players in the same battle.
      • SSDs will in fact bridge the gap huge performance gap between the memory and disks to some extent.
      • Low latency Infiniband network interconnect will enable larger clusters.
      Check out the information sources for detailed insights to the development and operation of the EVE Online game.

      Click to read more ...

    Tuesday
    Jul222008

    Scaling Bumper Sticker: A 1 Billion Page Per Month Facebook RoR App  

    Several months ago I attended a Joyent presentation where the spokesman hinted that Joyent had the chops to support a one billion page per month Facebook Ruby on Rails application. Even under a few seconds of merciless grilling he would not give up the name of the application. Now we have the big reveal: it was LinkedIn's Bumper Sticker app. For those not currently sticking things on bumps, Bumper Sticker is quite surprisingly a viral media sharing application that allows users to express their individuality by sticking small virtual stickers on Facebook profiles. At the time I was quite curious how Joyent's cloud approach could be leveraged for this kind of app. Now that they've released a few details, we get to find out.

    Site: http://www.Facebook.com/apps/application.php?id=2427603417

    Information Sources

  • Video: Scaling to 1 Billion Page Views Per MonthVideo (very flashy)
  • Web Scalability Practices: Bumper Sticker on Rails by Ikai Lan and Jim Meyer from LinkedIn
  • 1 Billion Page Views a Month by David Young from Joyent
  • Ruby on Rails: scaling to 1 billion page views per month by Dennis Howlettby from Zdnet
  • Joyent's Grid Accelerators for Web Applications by Jason Hoffman from Joyent
  • On Grids, the Ambitions of Amazon and Joyent by Jason Hoffman from Joyent
  • Scaling Ruby on Rails to 1 Billion Page Views a Month by Joe Pruitt from DevCentral

    The Platform

  • MySQL
  • Nginx
  • Mongrel
  • CDN
  • Ruby on Rails (rapid prototype development approach)
  • Facebook
  • Joyent Accelerator - provides a highly scalable on-demand infrastructure for running web sites, including rich web applications written in Ruby on Rails, PHP, Python and Java. Joyent Accelerators are next-generation virtual computers that can grow and multiply (or shrink and consolidate) depending on the real world demands faced by your Web application. Accelerators are built on OpenSolaris, multi-core (8+), RAM-rich servers (32GB+ each) and vast amounts of NAS storage.
  • Masochism Plugin - provides an easy solution for Ruby on Rails applications to work in a replicated database environment. Connection proxy sends some database queries (those in a transaction, update statements, and ActiveRecord::Base#reload) to a master database, and the rest to the slave database.

    The Stats

  • 1 billion page views per month
  • 13.5 million installations
  • 1.5 million daily active users. Recruited 1 million users in first 46 days.
  • 20-27 million canvas page views a day
  • 13 web application servers running Nginx and Mongrel
  • 8 static asset servers serving over 3,500,000 stickers (migrating to a CDN)
  • 4 MySQL servers in a master/slave configuration using Masochism as a proxy to load balance database operations.
  • Cost is about $25K/month.

    The Architecture

  • Bumper Sticker was an experiment to see how fast the Light Engineering Development (LED) team at LinkedIn could build a Ruby on Rails Facebook application.
  • RoR was an easy an environment to prototype in, but they needed a production environment in which they could quickly develop, deploy, and scale. Joyent was selected.
  • Some Notes on Joyent:
    * Joyent is a scale on demand cloud. Allows customers to have a dynamic data center instead of being stuck using their own rigid infrastructure.
    * There's an API if you need one. The service is unmanaged, you get root on all your boxes.
    * They consider their infrastructure to be better and more open than Amazon. You get access to a high end load balancer and the capabilities of OpenSolaris (Dtrace, Zones, lower request processing overhead, sub 10 second reboot times).
    * Joyent's primary scalability principle is to organize apps around silos built from their powerful Accelerator blocks: put applications on different servers based on the quality of service you want to give them. For example, put static content on their own servers so the static content is always served fast and reliably. This allows you to prioritize based on what's important to you. You could, for example, prioritize the virality of your application by putting the Invite Friends functionality on their own servers, thus assuring the growth of your application through your viral functionality possibly at the expense of less important functionality.
    * Has three data centers in the US and are opening a fourth, none in Europe.
    * Considers their secret sauce to be their highly sophisticated administration system which allows a few people to easily manage a large infrastructure.
    * Has a peering relationship with Facebook. That means there are direct high-speed fiber links between Joyent’s data center in Emeryville and Facebook’s data center in San Francisco.

  • 80% of the content for Bumber Sticker is static. The Facebook API can directly render content at a specified memory location. Bumper Sticker was able to use the scripting feature of F5 BIG-IP load balancer to directly load static content by passing a pointer to the Facebook API.

    The Lessons

  • Rails scales exactly like any other app. Take into account all the components from the moment the request is received at the load balancer all the way down and all the way back again.
  • The development process is: put some measurements in place, find problems, fix problems, more people adopt and scale you out of your solution, and the cycle repeats. Sun's Dtrace feature makes it easy to instrument the stack to identify bottlenecks.
  • Rails scales as long as the development team using it understands that many of the bottlenecks are exactly those faced by developers on any other database-driven web platform.
  • Hit a disk spindle and you are screwed. Avoid going to the database or the file system. The more they avoided disk the fewer timeouts they experienced.
  • Convert anything dynamic into static content. Dynamic content is your enemy. Convert anything dynamic into static content so it can be removed from the disk path.
  • Push content to the edge. Move content as close to the client as possible. Move cache to the CDN. Reduce time going across the network.
  • Faster means more viral. On a viral system the better the performance the more people can play with your application. The more people who play with your system the more likely they are to pull more people in, which means the more the app will spread and go viral. Bumper Sticker has been successful at creating a community of fans who enjoy uploading and sharing their own stickers.

    Some issues:
  • Since most of the content is static and served by the load balancer, the impact of Rails in the system is not clear.
  • The functionality of Bumper Sticker is relatively simple. What would the impact be on scalability if other often requested features like search were added?

    Related Articles

  • Friends for Sale Architecture - A 300 Million Page View/Month Facebook RoR App
  • Wednesday
    Jul092008

    Federation at Flickr: Doing Billions of Queries Per Day

    Flickr's lone database guy Dathan Pattishall made his excellent presentation available on how on how Flickr scales its backend to handle tremendous loads. Some of this information is available in Flickr Architecture, but the paper is so good it's worth another read. If you want to see sharding done right, at scale, take a look.

    Click to read more ...

    Monday
    Jun092008

    FaceStat's Rousing Tale of Scaling Woe and Wisdom Won

    Lukas Biewald shares a fascinating slam by slam recount of how his FaceStat (upload your picture and be judged by the masses) site was battered by a link on Yahoo's main page that caused an almost instantaneous 650,000 page view jump on their site. Yahoo spends considerable effort making sure its own properties can handle the truly massive flow from the main page. Turning the Great Eye of the Internet towards an unsuspecting newborn site must be quite the diaper ready experience. Theo Schlossnagle eerily prophesized about such events in The Implications of Punctuated Scalabilium for Website Architecture: massive, unexpected and sudden traffic spikes will become more common as a fickle internet seeks ever for new entertainments (my summary). Exactly FaceStat's situation. This is also one of our first exposures to an application written on Merb, a popular Ruby on Rails competitor. For those who think Ruby is the problem, their architecture now serves 100 times the original load. How did our fine FaceStat fellowship fair against Yahoo’s onslaught? Not a lot of details of FaceStat’s architecture are available, so it’s not that kind of post. What interested me is that it’s a timely example of Theo’s traffic spike phenomena and I was also taken with how well the team handled the challenge. Few could do better so quickly. In fact, let’s apply Theo’s rubric for how to handle these situations to FaceStat:
  • Be Alert: build automated systems to detect and pinpoint the cause of these issues quickly (in less than 60 seconds). None initially, but they are building in more monitoring to better handle future situations. Better monitoring would have alerted them to the problems long before they actually were alerted. Perhaps many more potential customers might have been converted to actual customers. You can never have enough monitoring!
  • Be Prepared: understand the bottlenecks of your service systemically. As the system was relatively simple, new, and quickly changed, my assumption is they were fully aware of their system’s shortcomings, they were just busy with adding features rather than worrying about performance and scalability.
  • Perform Triage: understand the importance of the various services that make up your site. Definitely. They “started ripping out every database intensive feature” in response to the load.
  • Be Calm: any action that is not analytically driven is a waste of time and energy. They stayed amazingly calm as can be seen from the following quote: “It’s one thing to code scalably and grow slowly under increasing load, but it’s been a blast to crazily rearchitect a live site like FaceStat in a day or two.” I’m not sure how analytically driven they were however  All-in-all an impressive response to the Great Eye’s undivided attention. But not everyone was impressed as I. A commenter named Bernard said: Sorry, but this is a really dumb story. Given how dirt cheap things like slicehost and linode are, it is crazy that you launched a web app and had not already prepared a redundant, highly-scalable architecture… I’d say you were damn lucky that the disappointed users came back at all. Commenter Will thought it was a “Nice problem to be having!” Which it is, of course, being noticed is better than being ignored. But Lukas was spot on when he lamented about being noticed too soon has a downside: After working so hard to get users to come to your site, it’s amazingly frustrating to see hundreds of thousands of people suddenly locked out. Clearly we still don’t have the ability for developers to create scalable systems as simply as they create exploratory systems. Ed from Rackspace posted that they could help with their Auto Scale of Arrays feature. And Rackspace would be an excellent solution, but the cost would be $500/month and a $2500 setup fee. No “let’s put on a show” startup can afford those costs. The mode FaceStat was in is typical: We find that a Rails-like platform is invaluable for rapidly prototyping a new site, especially since we started FaceStat as a pure experiment with no idea whether people would like it or not, and with a very different feature set in mind compared to what it later became. A pay as you grow model is essential for scalability because that’s the only way you can bake scalability in from the start. And even with all the impressive advances in the industry we still don’t have the software infrastructure to make scaling second nature.

    Information Sources

  • Scaling Fast by Lukas Biewald
  • FaceStat scales! on Dlores BLog

    Platform

  • Merb. Ruby based MVC framework that is ORM-agnostic.
  • Thin. A fast and very simple Ruby web server.
  • Slicehost. Hosting service. Able to quickly provision servers as needed.
  • Amazon’s S3. Image server. Latency is high but it handles the load.
  • Capistrano. Automated deployment.
  • Git with github. Source code control system. Supports efficient simultaneous development, quick merging and deployment.
  • God. Server monitoring and management.
  • Memcached. Application caching layer.
  • PostgreSQL

    The Stats

  • Six app servers.
  • One big database machine.

    The Architecture

  • FaceStat is a write heavy application and performs involved calculations on data.
  • S3 is used to offload the responsibility for storing images. This freed them from the massive bandwidth requirements and complexity of managing their own images.
  • Memcached offloads reads from the database to allow the database to have more time for writes.

    Lessons Learned

  • Monitor the site. The sooner you know about a problem the faster it can be fixed. Don't rely on user email or email from exception handlers or you'll never get ahead of problems.
  • Communicate with your users with an error page. A meaningful error pages shows you care and that you are working on the problem. That's enough for a second chance with most people.
  • Use a cached statically generated homepage. Hard to beat that for performance.
  • Big sites might want to give a heads up when they mention smaller sites. Just a short polite email saying how your world will soon turn upside down would do.
  • High-level platform really doesn’t matter compared to overall architecture. How you handle writes, reads, caching, deployment, monitoring, etc are relatively framework independent and it's how you solve those problems that matter.
  • Ruby and Merb supported rapid prototyping to experiment and create a radically different system form the one they intended.

    Click to read more ...