Entries in Example (248)

Thursday
Aug062009

An Unorthodox Approach to Database Design : The Coming of the Shard

Update 4: Why you don’t want to shard. by Morgon on the MySQL Performance Blog. Optimize everything else first, and then if performance still isn’t good enough, it’s time to take a very bitter medicine.
Update 3: Building Scalable Databases: Pros and Cons of Various Database Sharding Schemes by Dare Obasanjo. Excellent discussion of why and when you would choose a sharding architecture, how to shard, and problems with sharding.
Update 2: Mr. Moore gets to punt on sharding by Alan Rimm-Kaufman of 37signals. Insightful article on design tradeoffs and the evils of premature optimization. With more memory, more CPU, and new tech like SSD, problems can be avoided before more exotic architectures like sharding are needed. Add features not infrastructure. Jeremy Zawodny says he's wrong wrong wrong. we're running multi-core CPUs at slower clock speeds. Moore won't save you.
Update: Dan Pritchett shares some excellent Sharding Lessons: Size Your Shards, Use Math on Shard Counts, Carefully Consider the Spread, Plan for Exceeding Your Shards

Once upon a time we scaled databases by buying ever bigger, faster, and more expensive machines. While this arrangement is great for big iron profit margins, it doesn't work so well for the bank accounts of our heroic system builders who need to scale well past what they can afford to spend on giant database servers. In a extraordinary two article series, Dathan Pattishall, explains his motivation for a revolutionary new database architecture--sharding--that he began thinking about even before he worked at Friendster, and fully implemented at Flickr. Flickr now handles more than 1 billion transactions per day, responding in less then a few seconds and can scale linearly at a low cost.

What is sharding and how has it come to be the answer to large website scaling problems?

Information Sources

What is sharding?

While working at Auction Watch, Dathan got the idea to solve their scaling problems by creating a database server for a group of users and running those servers on cheap Linux boxes. In this scheme the data for User A is stored on one server and the data for User B is stored on another server. It's a federated model. Groups of 500K users are stored together in what are called shards.

The advantages are:

  • High availability. If one box goes down the others still operate.
  • Faster queries. Smaller amounts of data in each user group mean faster querying.
  • More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
  • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data, because there are parallel paths through your system. You can load balance web servers, which access shards over different network paths, which are processed by separate CPUs, which use separate caches of RAM and separate disk IO paths to process work. Very few bottlenecks limit your work.

    How is sharding different than traditional architectures?

    Sharding is different than traditional database architecture in several important ways:

  • Data are denormalized. Traditionally we normalize data. Data are splayed out into anomaly-less tables and then joined back together again when they need to be used. In sharding the data are denormalized. You store together data that are used together.

    This doesn't mean you don't also segregate data by type. You can keep a user's profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole. This is a very fast approach. You just get a blob and store a blob. No joins are needed and it can be written with one disk write.

  • Data are parallelized across many physical instances. Historically database servers are scaled up. You buy bigger machines to get more power. With sharding the data are parallelized and you scale by scaling out. Using this approach you can get massively more work done because it can be done in parallel.

  • Data are kept small. The larger a set of data a server handles the harder it is to cash intelligently because you have such a wide diversity of data being accessed. You need huge gobs of RAM that may not even be enough to cache the data when you need it. By isolating data into smaller shards the data you are accessing is more likely to stay in cache.

    Smaller sets of data are also easier to backup, restore, and manage.

  • Data are more highly available. Since the shards are independent a failure in one doesn't cause a failure in another. And if you make each shard operate at 50% capacity it's much easier to upgrade a shard in place. Keeping multiple data copies within a shard also helps with redundancy and making the data more parallelized so more work can be done on the data. You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard. If one server goes down the other can take over.

  • It doesn't use replication. Replicating data from a master server to slave servers is a traditional approach to scaling. Data is written to a master server and then replicated to one or more slave servers. At that point read operations can be handled by the slaves, but all writes happen on the master.

    Obviously the master becomes the write bottleneck and a single point of failure. And as load increases the cost of replication increases. Replication costs in CPU, network bandwidth, and disk IO. The slaves fall behind and have stale data. The folks at YouTube had a big problem with replication overhead as they scaled.

    Sharding cleanly and elegantly solves the problems with replication.

    Some Problems With Sharding

    Sharding isn't perfect. It does have a few problems.

  • Rebalancing data. What happens when a shard outgrows your storage and needs to be split? Let's say some user has a particularly large friends list that blows your storage capacity for the shard. You need to move the user to a different shard.

    On some platforms I've worked on this is a killer problem. You had to build out the data center correctly from the start because moving data from shard to shard required a lot of downtime.

    Rebalancing has to be built in from the start. Google's shards automatically rebalance. For this to work data references must go through some sort of naming service so they can be relocated. This is what Flickr does. And your references must be invalidateable so the underlying data can be moved while you are using it.

  • Joining data from multiple shards. To create a complex friends page, or a user profile page, or a thread discussion page, you usually must pull together lots of different data from many different sources. With sharding you can't just issue a query and get back all the data. You have to make individual requests to your data sources, get all the responses, and the build the page. Thankfully, because of caching and fast networks this process is usually fast enough that your page load times can be excellent.

  • How do you partition your data in shards? What data do you put in which shard? Where do comments go? Should all user data really go together, or just their profile data? Should a user's media, IMs, friends lists, etc go somewhere else? Unfortunately there are no easy answer to these questions.

  • Less leverage. People have experience with traditional RDBMS tools so there is a lot of help out there. You have books, experts, tool chains, and discussion forums when something goes wrong or you are wondering how to implement a new feature. Eclipse won't have a shard view and you won't find any automated backup and restore programs for your shard. With sharding you are on your own.

  • Implementing shards is not well supported. Sharding is currently mostly a roll your own approach. LiveJournal makes their tool chain available. Hibernate has a library under development. MySQL has added support for partioning. But in general it's still something you must implement yourself.

    See Also

  • The Flickr Architecture for more interesting ideas on how to implement sharding.
  • The Google Arhitecture.
  • The LiveJournal Architecture. They talk quite a bit about their sharding approach and give a lot of helpful details.
  • The Shard category.
  • Wednesday
    Aug052009

    Stack Overflow Architecture

    Update 2: Stack Overflow Architecture Update - Now At 95 Million Page Views A Month
    Update: Startup – ASP.NET MVC, Cloud Scale & Deployment shows an interesting alternative approach for a Windows stack using ServerPath/GoGrid for a dedicated database machine, elastic VMs for the front end, and a free load balancer.

    Stack Overflow is a much loved programmer question and answer site written by two guys nobody has ever heard of before. Well, not exactly. The site was created by top programmer and blog stars Jeff Atwood and Joel Spolsky. In that sense Stack Overflow is like a celebrity owned restaurant, only it should be around for a while. Joel estimates 1/3 of all the programmers in the world have used the site so they must be serving up something good.

    I fell in deep like with Stack Overflow for purely selfish reasons, it helped me solve a few difficult problems that were jabbing my eyes out with pain. I also appreciate their no-apologies anthropologically based design philosophy. Use design to engineer in the behaviours you want to encourage and minimize the responses you want to discourage. It's the conscious awareness of the mechanisms that creates such a satisfying synergy.

    What is key about the Stack Overflow story for me is the strong case they make for scale up as a viable solution for a certain potentially large class of problems. The publicity these days is all going scale out using NoSQL databases.

    If you need to Google scale then you really have no choice but to go the NoSQL direction. But Stack Overflow is not Google and neither are most sites. When thinking about your design options keep Stack Overflow in mind. In this era of multi-core, large RAM machines and advances in parallel programming techniques, scale up is still a viable strategy and shouldn't be tossed aside just because it's not cool anymore. Maybe someday we'll have the best of both worlds, but for now there's a big painful choice to be made and that choice decides your fate.

    Joel boasts that for 1/10 the hardware they have performance comparable to similarly size sites. He wonders if these other sites have good programmers. Let's see how they did it and you be the judge.

    Site: http://stackoverflow.com

    The Stats

  • 16 million page views a month
  • 3 million unique visitors a month (Facebook reaches 77 million unique visitors a month)
  • 6 million visits a month
  • 86% of traffic comes from Google
  • 9 million active programmers in the world and 30% have used Stack Overflow.
  • Cheaper licensing was attained through Microsoft's BizSpark program. My impression is they pay about $11K for OS and SQL licensing.
  • Monitization strategy: unobtrusive adds, job placement ads, DevDays conferences, extend the software to target other related niches (Server Fault, Super User), develop StackExchange as a white label and self hosted version of Stack Overflow, and perhaps develop some sort of programmer rating system.

    Platform

  • Microsoft ASP.NET MVC
  • SQL Server 2008
  • C#
  • Visual Studio 2008 Team Suite
  • JQuery
  • LINQ to SQL
  • Subversion
  • Beyond Compare 3
  • VisualSVN 1.5
  • Web Tier
    - 2 x Lenovo ThinkServer RS110 1U
    - 4 cores, 2.83 Ghz, 12 MB L2 cache
    - 500 GB datacenter hard drives, mirrored
    - 8 GB RAM
    - 500 GB RAID 1 mirror array
  • Database Tier
    - 1 x Lenovo ThinkServer RD120 2U
    - 8 cores, 2.5 Ghz, 24 MB L2 cache
    - 48 GB RAM
  • A fourth server was added to run superuser.com. All together the servers also run Stack Overflow, Server Fault, and Super User.
  • QNAP TS-409U NAS for backups. Decided not to use a cloud solution because the bandwidth costs of transferring 5 GB of data per day becomes prohibitive.
  • Hosting at http://www.peakinternet.com/. Impressed with their detailed technical responses and reasonable hosting rates.
  • SQL Server's full text search is used extensively for the site search and detecting if a question has already been asked. Lucene.net is considered an attractive alternative.

    Lessons Learned

    This is a mix of lessons taken from Jeff and Joel and comments from their posts.

  • If you’re comfortable managing servers then buy them. The two biggest problems with renting costs were: 1) the insane cost of memory and disk upgrades 2) the fact that they [hosting providers] really couldn’t manage anything.
  • Make larger one time up front investments to avoid recurring monthly costs which are more expensive in the long term.
  • Update all network drivers. Performance went from 2x slower to 2x faster.
  • Upgrading to 48GB RAM required upgrading MS Enterprise edition.
  • Memory is incredibly cheap. Max it out for almost free performance. At Dell, for example, upgrading from 4G memory to 128G is $4378.
  • Stack Overflow copied a key part of the Wikipedia database design. This turned out to be a mistake which will need massive and painful database refactoring to fix. The refactorings will be to avoid excessive joins in a lot of key queries. This is the key lesson from giant multi-terabyte table schemas (like Google’s BigTable) which are completely join-free. This is significant because Stack Overflow's database is almost completely in RAM and the joins still exact too high a cost.
  • CPU speed is surprisingly important to the database server. Going from 1.86 GHz, to 2.5 GHz, to 3.5 GHz CPUs causes an almost linear improvement in typical query times. The exception is queries which don’t fit in memory.
  • When renting hardware nobody pays list price for RAM upgrades unless you are on a month-to-month contract.
  • The bottleneck is the database 90% of the time.
  • At low server volume, the key cost driver is not rackspace, power, bandwidth, servers, or software; it is NETWORKING EQUIPMENT. You need a gigabit network between your DB and Web tiers. Between the cloud and your web server, you need firewall, routing, and VPN devices. The moment you add a second web server, you also need a load balancing appliance. The upfront cost of these devices can easily be 2x the cost of a handful of servers.
  • EC2 is for scaling horizontally, that is you can split up your work across many machines (a good idea if you want to be able to scale). It makes even more sense if you need to be able to scale on demand (add and remove machines as load increases / decreases).
  • Scaling out is only frictionless when you use open source software. Otherwise scaling up means paying less for licenses and a lot more for hardware, while scaling out means paying less for the hardware, and a whole lot more for licenses.
  • RAID-10 is awesome in a heavy read/write database workload.
  • Separate application and database duties so each can scale independently of the other. Databases scale up and the applications scale out.
  • Applications should keep state in the database so they scale horizontally by adding more servers.
  • The problem with a scale up strategy is a lack of redundancy. A cluster ads more reliability, but is very expensive when the individual machines are expensive.
  • Few applications can scale linearly with the number of processors. Locks will be taken which serializes processing and ends up reducing the effectiveness of your Big Iron.
  • With larger form factors like 7U power and cooling become critical issues. Using something between 1U and 7U might be easier to make work in your data center.
  • As you add more and more database servers the SQL Server license costs can be outrageous. So by starting scale up and gradually going scale out with non-open source software you can be in a world of financial hurt.

    It's true there's not much about their architecture here. We know about their machines, their tool chain, and that they use a two-tier architecture where they access the database directly from the web server code. We don't know how they implement tags, etc. If interested you'll be able to glean some of this information from an explanation of their schema.

    Discussion

    As an architecture profile candidate Stack Overflow has earned two important HighScalability badges: the Microsoft Stack Badge and and the Scale Up Badge. Both controversial and interesting topics of discussion.

    Microsoft Stack Badge

    The Microsoft Stack Badge was earned because Stack Overflow uses the entire Microsoft Stack: OS, database, C#, Visual Studio, and ASP .NET. People are always interested in how MS compares to LAMP, but I don't have many case studies to show them.

    Markus Frind of Plenty of Fish fame is often used as a Microsoft stack poster child, but since he explicitly uses as little of the stack as possible he's not really a good example. Stack Overflow on the other hand is brash in proclaiming their love for MS, even when that love is occasionally spurned.

    It's hard to separate out the Microsoft stack and the scale up approach because for licensing reasons they tend to go together. If you find yourself in the position of transitioning from scale up to scale out by adding dozens of cores, MS licensing will bite you.

    Licensing aside I personally find C#, Visual Studio, and .Net a very productive environment. C#/.Net is at least as good as Java/JVM. ASP .NET has always been a confusing mess to me. The knock against SQL Server is you have to pay for it and if that doesn't bother you then it's a solid choice. The Windows OS may not be as solid as other alternatives but it works well enough.

    So for a scale up solution a Microsoft stack works, especially if you are already Windows centric.

    Scale Up Badge

    This won't be a reenactment of the scale out vs scale up vs rent vs buy wars. For a thorough discussion of these issues please take a look at Scaling Up vs. Scaling Out and Server Hosting — Rent vs. Buy?. If you aren't confused and if your head doesn't hurt after reading all that then you haven't properly understood the material :-)

    The Scale Up Badge was awarded because Stack Overflow uses a scale up strategy to meet their scaling requirements. When they reach a limit they scale vertically by buying a bigger machine and adding more memory.

    Stack Overflow is in the sweet spot for scale up. It's not too large, but with an Alexa ranking of 1,666 and 16 million page views a month it's still a substantial site. Not Google scale, and probably will never have to be, but those are numbers many sites would be thrilled to have. Yet they aren't uploading large amounts of media. They aren't dealing with billions of tweets across complex social networks with millions of users. Their number of users is self limiting. And there are still directions they can take if they need to scale (caching, more web servers, faster disks, more denormalization, more memory, some partitioning, etc). All-in-all it's a well done and very useful two-tier CRUD application.

    NoSQL is Hard

    So should Stack Overflow have scaled out instead of up, just in case?

    What some don't realize is NoSQL is hard. Relational databases have many many faults, but they make a lot of common tasks simple while hiding both the cost and complexity. If you want to know how many black Prius cars are in inventory, for example, then that's pretty easy to do.

    Not so with most NoSQL databases (I'll speak generally here, some NoSQL databases have more features than others). You would have program a counter of black Prius cars yourself, up front, in code. There are no aggregate operators. You must maintain secondary indexes. There's no searching. There are no distributed queries across partitions. There's no Group By or Order By. There are no cursors for easy paging through result sets. Returning even 100 large records at time may timeout. There may be quotas that are very restrictive because they must limit the amount of IO for any one operation. Query languages may lack expressive power.

    The biggest problem of all is that transactions can not span arbitrary boundaries. There are no ACID guarantees beyond a single record or small entity group. Once you wrap your head around what this means for the programmer it's not a pleasant prospect at all. References must be manually maintained. Relationships must be manually maintained. There are no cascading deletes that act correctly during a failure. Every copy of denormalized data must be manually tracked and updated taking into account the possibility of partial failures and externally visible inconsistency.

    All this functionality must be written manually by you in your code. While flexibility to write your own code is great in an OLAP/map-reduce situation, declarative approaches still cover a lot of ground and make for much less brittle code.

    What you gain is the ability to write huge quantities of data. What you lose is complacency. The programmer must be very aware at all times that they are dealing with a system where it costs a lot to perform distribute operations and failure can occur at anytime.

    All this may be the price of building a truly scalable and distributed system, but is this really the price you want to pay?

    The Multitenancy Problem

    With StackExchange Stack Overflow has gone into the multi-tenancy business. They are offering StackExchange either self-hosted or as a hosted white label application.

    It will be interesting to see if their architecture can scale to handle a large number of sites. Salesorce is the king of multitenancy and although it's true they use Oracle as their database, they basically use very little of Oracle and have written their own table structure, indexing and query processor on top of Oracle. All in order to support multitenancy.

    Salesforce went extreme because supporting a lot of different customers is way more difficult than it seems, especially once you allow customization and support versioning.

    Clearly all customers can't run in one server for security, customization, and scaling reasons.

    You may think just create a database for each customer, share a server for a certain number of customers, and then add more servers as needed. As long as a customer doesn't need more than one server you are golden.

    This doesn't seem to work well in practice. Oddly database managers aren't optimized for adding or updating databases. Creating databases is a heavyweight operation and can degrade performance for existing customers as system locks are taken. Upgrade issues are also problematic. Adding columns locks tables which causes problems in high traffic situations. Adding new indexes can also take a very long time and degrade performance. Plus each customer will likely have specializations that makes upgrading even more complicated.

    To get around these problems Salesforce's Craig Weissman, Chief Architect, created an innovative approach where tables are not created for each customer. All data from all customers is mapped into the same data table, including indexes. The schema for that table looks something like orgid, oid, value0, value1...value500. "orgid" is the organization ID and is how data is never mixed up. It's a very wide and sparse table, which Oracle seems to handle well. Hundreds and hundreds of "tables" and custom fields are mapped into the data table.

    With this approach Salesforce has no option other than to build their own infrastructure to interpret what's in that table. Oracle is left to handle transactions, concurrency, and deadlock detection. The advatange is because there's an interpreted layer handling versions and upgrades is relatively simple because the handling logic can be baked in. Strange but true.

    Related Articles

    This list includes a number of posts by Jeff as he chronicles their journey with Stack Overflow. Jeff is wonderful about being open about what they are doing and why. The comment threads are often tremendous. There's a lot to learn.

  • Learning from StackOverflow.com by Joel Spolsky
  • Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood
  • What Was Stack Overflow Built With?
  • New Stack Overflow Server Glamour Shots
  • New Stack Overflow Servers Ready
  • Server Hosting — Rent vs. Buy? - this is a very informative discussion the pros and cons of renting vs buying.
  • Rent vs. Buy (or EC2 vs. building your own iron) by Michael Friis
  • Oh, You Wanted "Awesome" Edition - We recently upgraded our database server to 48 GB of memory -- because hardware is cheap, and programmers are expensive.
  • Our Backup Strategy - Inexpensive NAS
  • The Economics of Bandwidth
  • Understanding the StackOverflow Database Schema by Brent Ozar
  • Server Speed Tests - new hardware 2x slower - it was the network.
  • ASP.NET MVC: A New Framework for Building Web Applications
  • Three key things to know about moving MySQL into the cloud by morgan
  • NoSQL Conference
  • Decline of the Enterprise Data Warehouse by Bradford Stephens
  • Webinar: Multitenant Magic - Under the Covers of the Force.com Data Architecture by Craig Weissman, Chief Architect, salesforce.com.
  • Tuesday
    Jul282009

    37signals Architecture

    Update 7: Basecamp, now with more vroom. Basecamp application servers running Ruby code were upgraded and virtualization was removed. The result: A 66 % reduction in the response time while handling multiples of the traffic is beyond what I expected. They still use virtualization (Linux KVM), just less of it now.
    Update 6: Things We’ve Learned at 37Signals. Themes: less is more; don't worry be happy.
    Update 5: Nuts & Bolts: HAproxy . Nice explanation (post, screencast) by Mark Imbriaco of why HAProxy (load balancing proxy server) is their favorite (fast, efficient, graceful configuration, queues requests when Mongrels are busy) for spreading dynamic content between Apache web servers and Mongrel application servers.
    Update 4: O'Rielly's Tim O'Brien interviews David Hansson, Rails creator and 37signals partner. Says BaseCamp scales horizontally on the application and web tier. Scales up for the database, using one "big ass" 128GB machine. Says: As technology moves on, hardware gets cheaper and cheaper. In my mind, you don't want to shard unless you positively have to, sort of a last resort approach.
    Update 3: The need for speed: Making Basecamp faster. Pages now load twice as fast, cut CPU usage by a third and database time by about half. Results achieved by: Analysis, Caching, MySQL optimizations, Hardware upgrades.
    Update 2: customer support is handled in real-time using Campfire.
    Update: highly useful information on creating a customer billing system.


    In the giving spirit of Christmas the folks at 37signals have shared a bit about how their system works. 37signals is most famous for loosing Ruby on Rails into the world and they've use RoR to make their very popular Basecamp, Highrise, Backpack, and Campfire products. RoR takes a lot of heat for being a performance dog, but 37signals seems to handle a lot of traffic with relatively normal sounding resources. This is just an initial data dump, they promise to add more details later. As they add more I'll update it here.

    Site: http://www.37signals.com

    Information Sources

  • Ask 37signals: Numbers?
  • Ask 37signals: How do you process credit cards?
  • Behind the scenes at 37signals: Support
  • Ask 37signals: Why did you restart Highrise?

    Platform

  • Ruby on Rails
  • Memcached
  • Xen
  • MySQL
  • S3 for image storage

    The Stats

  • 30 servers ranging from single processor file servers to 8 CPU application servers for about 100 CPUs and 200GB of RAM.
  • Plan to diagonally scale by reducing the number of servers to 16 for about 92 CPU cores (each significantly faster than what are used today) and 230 GB of combined RAM.
  • Xen virtualization will be used to improve system management.
  • Basecamp (web based project management)
    * 2,000,000 people with accounts
    * 1,340,000 projects
    * 13,200,000 to-do items
    * 9,200,000 messages
    * 12,200,000 comments
    * 5,500,000 time tracking entries
    * 4,000,000 milestones

  • Backpack (personal and small business information management)
    * Just under 1,000,000 pages
    * 6,800,000 to-do items
    * 1,500,000 notes
    * 829,000 photos
    * 370,000 files

  • Overall storage stats (Nov 2007)
    * 5.9 terabytes of customer-uploaded files
    * 888 GB files uploaded (900,000 requests)
    * 2 TB files downloaded (8,500,000 requests)

    The Architecture

  • Memcached caching is used and they are looking to add more. Yields impressive performance results.
  • URL helper methods are used rather than building the URLs by hand.
  • Standard ActiveRecord built queries are used, but for performance reasons they will also "dig in and use" find_by_sql when necessary.
  • They fix Rails when they run into performance problems. It pays to be king :-)
  • Amazon’s S3 is used for storage of files upload by users. Extremely happy with results.

    Credit Card Processing Process

  • Bill monthly. It makes credit card companies more comfortable because they won't be on the hook for a large chunk of change if your company goes out of business. Customers also like it better because it costs less up front and you don't need a contract. Just pay as long as you want the service.

  • Get a Merchant Account. One is needed to process credit cards. They use Chase Bank. Use someone you trust and later negotiate rates when you get enough volume that it matters.
  • Authorize.net is the gateway they use to process the credit card charge.
  • A custom built system handles the monthly billing. It runs each night and bills the appropriate people and records the result.
  • On success an invoice is sent via email.
  • On failure an explanation is sent to the customer.
  • If the card is declined three times the account is frozen until a valid card number is provided.
  • Error handling is critical because problems with charges are common. Freeze to fast is bad, freezing too slow is also bad.
  • All products are being converted to using a centralized billing service.
  • You need to be PCI DSS (Payment Card Industry Data Security Standard) compliant.
  • Use a gateway service that makes it so you don't have to store credit card numbers on your site. That makes your life easier because of the greater security. Some gateway services do have reoccurring billing so you don't have to do it yourself.

    Customer Support

  • Campfire is used for customer service. Campfire is a web-based group chat tool, password-protectable, with chatting, file sharing, image previewing, and decision making.
  • Issues discussed are used to drive code changes and the subversion commit is shown in the conversation. Seems to skip a bug tracking system, which would make it hard to manage bugs and features in any traditional sense, ie, you can't track subversion changes back to a bug and you can't report what features and bugs are in a release.
  • Support can solve problems by customers uploading images, sharing screens, sharing files, and chatting in real-time.
  • Developers are always on within Campfire addressing problems in real-time with the customers.

    Lessons Learned

  • Take a lesson from Amazon and build internal functions as services from the start. This make it easier to share them across all product lines and transparently upgrade features.
  • Don't store credit card numbers on your site. This greatly reduces your security risk.
  • Developers and customers should interact in real-time on a public forum. Customers get better service as developers handle issues as they come up in the normal flow of their development cycle. Several layers of the usual BS are removed. Developers learn what customers like and dislike which makes product development more agile. Customers can see the responsiveness of the company to customers by reading the interactions. This goes a long ways to give potential customers the confidence and the motivation to sign up.
  • Evolve your software by actual features needed by users instead of making up features someone might need someday. Otherwise you end up building something that nobody wants and won't work anyway.
  • Saturday
    Jun272009

    Scaling Twitter: Making Twitter 10000 Percent Faster

    Update 6: Some interesting changes from Twitter's Evan Weaver: everything in RAM now, database is a backup; peaks at 300 tweets/second; every tweet followed by average 126 people; vector cache of tweet IDs; row cache; fragment cache; page cache; keep separate caches; GC makes Ruby optimization resistant so went with Scala; Thrift and HTTP are used internally; 100s internal requests for every external request; rewrote MQ but kept interface the same; 3 queues are used to load balance requests; extensive A/B testing for backwards capability; switched to C memcached client for speed; optimize critical path; faster to get the cached results from the network memory than recompute them locally.
    Update 5: Twitter on Scala. A Conversation with Steve Jenson, Alex Payne, and Robey Pointer by Bill Venners. A fascinating discussion of why Twitter moved to the Java JVM for their server infrastructure (long lived processes) and why they moved to Scala to program against it (high level language, static typing, functional). Ruby is used on the front-end but wasn't performant or reliable enough for the back-end.
    Update 4: Improving Running Components at Twitter by Evan Weaver. Tells how Twitter changed their infrastructure to go from handling 3 requests to 139 requests a second. They moved to a messaging model, asynchronous process, 3 levels of cache, and moved their middleware to a mixture C and Scala/JVM.
    Update 3: Upgrading Twitter without service disruptions by Gojko Adzic. Lots of good updates on the new Twitter architecture.
    Update 2: a commenter in Twitter Fails Macworld Keynote Test said this entry needs to be updated. LOL. My uneducated guess is it's not a language or architecture problem, but more a problem of not being able to add hardware fast enough into their data center. The predictability of this problem is debatable, but once you have it, it's hard to fix.
    Update: Twitter releases Starling - light-weight persistent queue server that speaks the MemCache protocol. It was built to drive Twitter's backend, and is in production across Twitter's cluster.

    Twitter started as a side project and blew up fast, going from 0 to millions of page views within a few terrifying months. Early design decisions that worked well in the small melted under the crush of new users chirping tweets to all their friends. Web darling Ruby on Rails was fingered early for the scaling problems, but Blaine Cook, Twitter's lead architect, held Ruby blameless:

    For us, it’s really about scaling horizontally - to that end, Rails and Ruby haven’t been stumbling blocks, compared to any other language or framework. The performance boosts associated with a “faster” language would give us a 10-20% improvement, but thanks to architectural changes that Ruby and Rails happily accommodated, Twitter is 10000% faster than it was in January.

    If Ruby on Rails wasn't to blame, how did Twitter learn to scale ever higher and higher?

    Update: added slides Small Talk on Getting Big. Scaling a Rails App & all that Jazz

    Site: http://twitter.com

    Information Sources

  • Scaling Twitter Video by Blaine Cook.
  • Scaling Twitter Slides
  • Good News blog post by Rick Denatale
  • Scaling Twitter blog post Patrick Joyce.
  • Twitter API Traffic is 10x Twitter’s Site.
  • A Small Talk on Getting Big. Scaling a Rails App & all that Jazz - really cute dog picks

    The Platform

  • Ruby on Rails
  • Erlang
  • MySQL
  • Mongrel - hybrid Ruby/C HTTP server designed to be small, fast, and secure
  • Munin
  • Nagios
  • Google Analytics
  • AWStats - real-time logfile analyzer to get advanced statistics
  • Memcached

    The Stats

  • Over 350,000 users. The actual numbers are as always, very super super top secret.
  • 600 requests per second.
  • Average 200-300 connections per second. Spiking to 800 connections per second.
  • MySQL handled 2,400 requests per second.
  • 180 Rails instances. Uses Mongrel as the "web" server.
  • 1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
  • 30+ processes for handling odd jobs.
  • 8 Sun X4100s.
  • Process a request in 200 milliseconds in Rails.
  • Average time spent in the database is 50-100 milliseconds.
  • Over 16 GB of memcached.

    The Architecture

  • Ran into very public scaling problems. The little bird of failure popped up a lot for a while.
  • Originally they had no monitoring, no graphs, no statistics, which makes it hard to pinpoint and solve problems. Added Munin and Nagios. There were difficulties using tools on Solaris. Had Google analytics but the pages weren't loading so it wasn't that helpful :-)
  • Use caching with memcached a lot.
    - For example, if getting a count is slow, you can memoize the count into memcache in a millisecond.
    - Getting your friends status is complicated. There are security and other issues. So rather than doing a query, a friend's status is updated in cache instead. It never touches the database. This gives a predictable response time frame (upper bound 20 msecs).
    - ActiveRecord objects are huge so that's why they aren't cached. So they want to store critical attributes in a hash and lazy load the other attributes on access.
    - 90% of requests are API requests. So don't do any page/fragment caching on the front-end. The pages are so time sensitive it doesn't do any good. But they cache API requests.
  • Messaging
    - Use message a lot. Producers produce messages, which are queued, and then are distributed to consumers. Twitter's main functionality is to act as a messaging bridge between different formats (SMS, web, IM, etc).
    - Send message to invalidate friend's cache in the background instead of doing all individually, synchronously.
    - Started with DRb, which stands for distributed Ruby. A library that allows you to send and receive messages from remote Ruby objects via TCP/IP. But it was a little flaky and single point of failure.
    - Moved to Rinda, which a shared queue that uses a tuplespace model, along the lines of Linda. But the queues are persistent and the messages are lost on failure.
    - Tried Erlang. Problem: How do you get a broken server running at Sunday Monday with 20,000 users waiting? The developer didn't know. Not a lot of documentation. So it violates the use what you know rule.
    - Moved to Starling, a distributed queue written in Ruby.
    - Distributed queues were made to survive system crashes by writing them to disk. Other big websites take this simple approach as well.
  • SMS is handled using an API supplied by third party gateway's. It's very expensive.
  • Deployment
    - They do a review and push out new mongrel servers. No graceful way yet.
    - An internal server error is given to the user if their mongrel server is replaced.
    - All servers are killed at once. A rolling blackout isn't used because the message queue state is in the mongrels and a rolling approach would cause all the queues in the remaining mongrels to fill up.
  • Abuse
    - A lot of down time because people crawl the site and add everyone as friends. 9000 friends in 24 hours. It would take down the site.
    - Build tools to detect these problems so you can pinpoint when and where they are happening.
    - Be ruthless. Delete them as users.
  • Partitioning
    - Plan to partition in the future. Currently they don't. These changes have been enough so far.
    - The partition scheme will be based on time, not users, because most requests are very temporally local.
    - Partitioning will be difficult because of automatic memoization. They can't guarantee read-only operations will really be read-only. May write to a read-only slave, which is really bad.
  • Twitter's API Traffic is 10x Twitter’s Site
    - Their API is the most important thing Twitter has done.
    - Keeping the service simple allowed developers to build on top of their infrastructure and come up with ideas that are way better than Twitter could come up with. For example, Twitterrific, which is a beautiful way to use Twitter that a small team with different priorities could create.
  • Monit is used to kill process if they get too big.

    Lessons Learned

  • Talk to the community. Don't hide and try to solve all problems yourself. Many brilliant people are willing to help if you ask.
  • Treat your scaling plan like a business plan. Assemble a board of advisers to help you.
  • Build it yourself. Twitter spent a lot of time trying other people's solutions that just almost seemed to work, but not quite. It's better to build some things yourself so you at least have some control and you can build in the features you need.
  • Build in user limits. People will try to bust your system. Put in reasonable limits and detection mechanisms to protect your system from being killed.
  • Don't make the database the central bottleneck of doom. Not everything needs to require a gigantic join. Cache data. Think of other creative ways to get the same result. A good example is talked about in Twitter, Rails, Hammers, and 11,000 Nails per Second.
  • Make your application easily partitionable from the start. Then you always have a way to scale your system.
  • Realize your site is slow. Immediately add reporting to track problems.
  • Optimize the database.
    - Index everything. Rails won't do this for you.
    - Use explain to how your queries are running. Indexes may not be being as you expect.
    - Denormalize a lot. Single handedly saved them. For example, they store all a user IDs friend IDs together, which prevented a lot of costly joins.
    - Avoid complex joins.
    - Avoid scanning large sets of data.
  • Cache the hell out of everything. Individual active records are not cached, yet. The queries are fast enough for now.
  • Test everything.
    - You want to know when you deploy an application that it will render correctly.
    - They have a full test suite now. So when the caching broke they were able to find the problem before going live.
  • Long running processes should be abstracted to daemons.
  • Use exception notifier and exception logger to get immediate notification of problems so you can address the right away.
  • Don't do stupid things.
    - Scale changes what can be stupid.
    - Trying to load 3000 friends at once into memory can bring a server down, but when there were only 4 friends it works great.
  • Most performance comes not from the language, but from application design.
  • Turn your website into an open service by creating an API. Their API is a huge reason for Twitter's success. It allows user's to create an ever expanding and ecosystem around Twitter that is difficult to compete with. You can never do all the work your user's can do and you probably won't be as creative. So open you application up and make it easy for others to integrate your application with theirs.

    Related Articles

  • For a discussion of partitioning take a look at Amazon Architecture, An Unorthodox Approach to Database Design : The Coming of the Shard, Flickr Architecture
  • The Mailinator Architecture has good strategies for abuse protection.
  • GoogleTalk Architecture addresses some interesting issues when scaling social networking sites.
  • Friday
    Jun262009

    PlentyOfFish Architecture

    Update 5: PlentyOfFish Update - 6 Billion Pageviews And 32 Billion Images A Month
    Update 4: Jeff Atwood costs out Markus' scale up approach against a scale out approach and finds scale up wanting. The discussion in the comments is as interesting as the article. My guess is Markus doesn't want to rewrite his software to work across a scale out cluster so even if it's more expensive scale up works better for his needs.
    Update 3: POF now has 200 million images and serves 10,000 images served per second. They'll be moving to a 250,000 IOPS RamSan to handle the load. Also upgraded to a core database machine with 512 GB of RAM, 32 CPU’s, SQLServer 2008 and Windows 2008.
    Update 2: This seems to be a POF Peer1 love fest infomercial. It's pretty content free, but the production values are high. Lots of quirky sounds and fish swimming on the screen.
    Update: by Facebook standards Read/WriteWeb says POF is worth a cool one billion dollars. It helps to talk like Dr. Evil when saying it out loud.

    PlentyOfFish is a hugely popular on-line dating system slammed by over 45 million visitors a month and 30+ million hits a day (500 - 600 pages per second). But that's not the most interesting part of the story. All this is handled by one person, using a handful of servers, working a few hours a day, while making $6 million a year from Google ads. Jealous? I know I am. How are all these love connections made using so few resources?

    Site: http://www.plentyoffish.com/

    Information Sources

  • Channel9 Interview with Markus Frind
  • Blog of Markus Frind
  • Plentyoffish: 1-Man Company May Be Worth $1Billion

    The Platform

  • Microsoft Windows
  • ASP.NET
  • IIS
  • Akamai CDN
  • Foundry ServerIron Load Balancer

    The Stats

  • PlentyOfFish (POF) gets 1.2 billion page views/month, and 500,000 average unique logins per day. The peak season is January, when it will grow 30 percent.
  • POF has one single employee: the founder and CEO Markus Frind.
  • Makes up to $10 million a year on Google ads working only two hours a day.
  • 30+ Million Hits a Day (500 - 600 pages per second).
  • 1.1 billion page views and 45 million visitors a month.
  • Has 5-10 times the click through rate of Facebook.
  • A top 30 site in the US based on Competes Attention metric, top 10 in Canada and top 30 in the UK.
  • 2 load balanced web servers with 2 Quad Core Intel Xeon X5355 @ 2.66Ghz), 8 Gigs of RAM (using about 800 MBs), 2 hard drives, runs Windows x64 Server 2003.
  • 3 DB servers. No data on their configuration.
  • Approaching 64,000 simultaneous connections and 2 million page views per hour.
  • Internet connection is a 1Gbps line of which 200Mbps is used.
  • 1 TB/day serving 171 million images through Akamai.
  • 6TB storage array to handle millions of full sized images being uploaded every month to the site.

    What's Inside

  • Revenue model has been to use Google ads. Match.com, in comparison, generates $300 million a year, primarily from subscriptions. POF's revenue model is about to change so it can capture more revenue from all those users. The plan is to hire more employees, hire sales people, and sell ads directly instead of relying solely on AdSense.
  • With 30 million page views a day you can make good money on advertising, even a 5 - 10 cents a CPM.
  • Akamai is used to serve 100 million plus image requests a day. If you have 8 images and each takes 100 msecs you are talking a second load just for the images. So distributing the images makes sense.
  • 10’s of millions of image requests are served directly from their servers, but the majority of these images are less than 2KB and are mostly cached in RAM.
  • Everything is dynamic. Nothing is static.
  • All outbound Data is Gzipped at a cost of only 30% CPU usage. This implies a lot of processing power on those servers, but it really cuts bandwidth usage.
  • No caching functionality in ASP.NET is used. It is not used because as soon as the data is put in the cache it's already expired.
  • No built in components from ASP are used. Everything is written from scratch. Nothing is more complex than a simple if then and for loops. Keep it simple.
  • Load balancing
    - IIS arbitrarily limits the total connections to 64,000 so a load balancer was added to handle the large number of simultaneous connections. Adding a second IP address and then using a round robin DNS was considered, but the load balancer was considered more redundant and allowed easier swap in of more web servers. And using ServerIron allowed advanced functionality like bot blocking and load balancing based on passed on cookies, session data, and IP data.
    - The Windows Network Load Balancing (NLB) feature was not used because it doesn't do sticky sessions. A way around this would be to store session state in a database or in a shared file system.
    - 8-12 NLB servers can be put in a farm and there can be an unlimited number of farms. A DNS round-robin scheme can be used between farms. Such an architecture has been used to enable 70 front end web servers to support over 300,000 concurrent users.
    - NLB has an affinity option so a user always maps to a certain server, thus no external storage is used for session state and if the server fails the user loses their state and must relogin. If this state includes a shopping cart or other important data, this solution may be poor, but for a dating site it seems reasonable.
    - It was thought that the cost of storing and fetching session data in software was too expensive. Hardware load balancing is simpler. Just map users to specific servers and if a server fails have the user log in again.
    - The cost of a ServerIron was cheaper and simpler than using NLB. Many major sites use them for TCP connection pooling, automated bot detection, etc. ServerIron can do a lot more than load balancing and these features are attractive for the cost.
  • Has a big problem picking an ad server. Ad server firms want several hundred thousand a year plus they want multi-year contracts.
  • In the process of getting rid of ASP.NET repeaters and instead uses the append string thing or response.write. If you are doing over a million page views a day just write out the code to spit it out to the screen.
  • Most of the build out costs went towards a SAN. Redundancy at any cost.
  • Growth was through word of mouth. Went nuts in Canada, spread to UK, Australia, and then to the US.
  • Database
    - One database is the main database.
    - Two databases are for search. Load balanced between search servers based on the type of search performed.
    - Monitors performance using task manager. When spikes show up he investigates. Problems were usually blocking in the database. It's always database issues. Rarely any problems in .net. Because POF doesn't use the .net library it's relatively easy to track down performance problems. When you are using many layers of frameworks finding out where problems are hiding is frustrating and hard.
    - If you call the database 20 times per page view you are screwed no matter what you do.
    - Separate database reads from writes. If you don't have a lot of RAM and you do reads and writes you get paging involved which can hang your system for seconds.
    - Try and make a read only database if you can.
    - Denormalize data. If you have to fetch stuff from 20 different tables try and make one table that is just used for reading.
    - One day it will work, but when your database doubles in size it won't work anymore.
    - If you only do one thing in a system it will do it really really well. Just do writes and that's good. Just do reads and that's good. Mix them up and it messes things up. You run into locking and blocking issues.
    - If you are maxing the CPU you've either done something wrong or it's really really optimized. If you can fit the database in RAM do it.
  • The development process is: come up with an idea. Throw it up within 24 hours. It kind of half works. See what user response is by looking at what they actually do on the site. Do messages per user increase? Do session times increase? If people don't like it then take it down.
  • System failures are rare and short lived. Biggest issues are DNS issues where some ISP says POF doesn't exist anymore. But because the site is free, people accept a little down time. People often don't notice sites down because they think it's their problem.
  • Going from one million to 12 million users was a big jump. He could scale to 60 million users with two web servers.
  • Will often look at competitors for ideas for new features.
  • Will consider something like S3 when it becomes geographically load balanced.

    Lessons Learned

  • You don't need millions in funding, a sprawling infrastructure, and a building full of employees to create a world class website that handles a torrent of users while making good money. All you need is an idea that appeals to a lot of people, a site that takes off by word of mouth, and the experience and vision to build a site without falling into the typical traps of the trade. That's all you need :-)
  • Necessity is the mother of all change.
  • When you grow quickly, but not too quickly you have a chance grow, modify, and adapt.
  • RAM solves all problems. After that it's just growing using bigger machines.
  • When starting out keep everything as simple as possible. Nearly everyone gives this same advice and Markus makes a noticeable point of saying everything he does is just obvious common sense. But clearly what is simple isn't merely common sense. Creating simple things is the result of years of practical experience.
  • Keep database access fast and you have no issues.
  • A big reason POF can get away with so few people and so little equipment is they use a CDN for serving large heavily used content. Using a CDN may be the secret sauce in a lot of large websites. Markus thinks there isn't a single site in the top 100 that doesn’t use a CDN. Without a CDN he thinks load time in Australia would go to 3 or 4 seconds because of all the images.
  • Advertising on Facebook yielded poor results. With 2000 clicks only 1 signed up. With a CTR of 0.04% Facebook gets 0.4 clicks per 1000 ad impressions, or .4 clicks per CPM. At 5 cent/CPM = 12.5 cents a click, 50 cent/CPM = $1.25 a click. $1.00/CPM = $2.50 a click. $15.00/CPM = $37.50 a click.
  • It's easy to sell a few million page views at high CPM’s. It's a LOT harder to sell billions of page views at high CPM’s, as shown by Myspace and Facebook.
  • The ad-supported model limits your revenues. You have to go to a paid model to grow larger. To generate 100 million a year as a free site is virtually impossible as you need too big a market.
  • Growing page views via Facebook for a dating site won't work. Having a visitor on you site is much more profitable. Most of Facebook's page views are outside the US and you have to split 5 cent CPM’s with Facebook.
  • Co-req is a potential large source of income. This is where you offer in your site's sign up to send the user more information about mortgages are some other product.
  • You can't always listen to user responses. Some users will always love new features and others will hate it. Only a fraction will complain. Instead, look at what features people are actually using by watching your site.

    Related Articles

  • MySpace also uses Windows to run their site.
  • Markus Frind's posts on Webmaster World.
  • And the Money Comes Rolling In by Max Chafkin
  • How I started A Dating Empire by Markus Frind

    Thanks to Erik Osterman for recommending profiling PlentyOfFish.
  • Friday
    Jun052009

    HotPads Shows the True Cost of Hosting on Amazon

    Mather Corgan, president of HotPads, gave a great talk on how HotPads uses AWS to run their real estate search engine. I loved the presentation for a few reasons:

  • It gives real costs on on their servers, how many servers they have, what they are used for, and exactly how they use S2, EBS, CloudFront and other AWS services. This is great information for anybody trying to architect a system and wondering where to run it.
  • HotPads is a "real" application. It's a small company and at 4.5 million page-views/month it's large but not super large. It has custom server side components like indexing engines, image processing, and background database update engines for syncing new real estate data. And it also stores a lot of images and has low latency requirements.

    This a really good example mix of where many companies are or would like to be with their applications.

    Their total costs are about $11K/month, which is about what they were paying at their previous provider. I found this is a little surprising as I thought the cloud would be more expensive, but they only pay for what they need instead of having to over provision for transient uses like testing. And some servers aren't necessary anymore as EBS handles backups so database slave servers are no longer required.

    There are lots more lessons like this that I've abstracted down below.

    Site: http://hotpads.com - a map-based real estate search engine, listing homes for sale, apartments, condos, and rental houses.

    Stats

  • 800,000 visits/month
  • 4.5 million page-views/month
  • 3.5 million real-estate listings updated daily

    Platform

  • Java
  • MySQL
  • AWS

    Costs

  • EC2 - $7400/month - run 20 of various size instances at anyone time. Most work is in the background processing of images, not web serving.
    * $150: 2 Small HAProxy Load Balancers - 2 for failover, these have the elastic IPs, round robin DNS point at the elastic IPs.
    * $1,200: 3-5 Large Tomcat Web Servers - an array of 3 run at night and 5 during the day.
    * $1,500: 5 Large Tomcat Job Servers
    * $900: 1 X-Large 1 Large Index Server - used to power property search and have several GB of RAM for the JVM
    * $1,200: 1 X-Large 2 Large MySQL masters
    * $1,200: 1 X-Large 2 Large MySQL slaves
    * $300: 1 Large Messaging Server ActiveMQ - will be replaced with SQS
    * $300: 1 Large Map tile creation servers Tilecache
    * $600: Development/testing/migration/ servers
  • S3 - $1500/month - few hundred million objects for files for maps and real-estate listing photos. 4TB of database backup stored as EBS diffs ($600/month).
  • Elastic Block Storage - $500/month
  • CloudFront - $460/month - is used to serve static files and map files throughout the world. It serves static files, map tiles, and listing photos.
  • Elastic IP Addresses - $8/month
  • RightScale - $500/month - used for management and deployment.

    Lessons Learned

  • Major reason for choosing EC2 was the cloud API which allows adding servers at any time. In their previous hosting service they had to prepay for a month at a time so they would order the minimum necessary to get by that month. That doesn't leave room for servers for development, test, preview servers for customers or making live database servers upgrades (which requires 2x servers)?
  • Overall cost is about the same as with previous hosting site but the overall speed of development and ease of management is night and day different. Getting more servers and lots more flexibility.
  • HotPads is a small company and doesn't think added trouble of colocation isn't worth it for them yet.
  • Advantage of Amazon over something like Google App Engine is that Amazon allows you to innovate by building your own services on your own machines.
  • S3 is better for larger objects because for small files that are not viewed often the cost of puts outweighs everything. Not a cache to use for short lived objects because the put costs start to dominate.
    * For a 67 KB object (600 px image) which is where the cost of putting an image into S3 equals the cost of storing it there and about equal the cost of storing it once.
    * For a 6.7 KB object (15 px thumb nail) the put (small fee for putting an object into S3) cost is 10x the storage transfer costs.
  • Costs have to figured into the algorithms you use.
    * In April 330 GB of images downloaded at $.15/GB cost $49. 55mm GETs at $1/mm cost $55. 42mm PUTs at $1/1k cost $420!
    * $100 download and GETs of maptiles.
    * So S3 very cheap for larger files, watch out for lots of short lived small files.
  • CloudFront is 10 times faster than S3 but is more expensive for infrequently viewed files.
    * Makes frequently viewed listings faster.
    * For infrequently viewed listings the CloudFront has to go to S3 to get the file the first time which means you have to pay twice for a file that will be viewed only once.
  • EBS
    * Used on database servers because it's faster than local storage (especially for random writes), blocks of data redundant, and supports easy backups and versioning via cloning.
    * Only 10% cost overhead.
    * Allowed them to get rid of second set of slaves because the backups were so CPU intensive they had to have slaves to do the backups. EBS allows snapshots of running drives so the extra slaves are unnecessary.
    * Databases are I/O bound and the CPU is vastly underutilized so there's extra capacity when you need it.
  • SimpleDB - not using, pretty proprietary. May be of value because you only pay for what you use given how under utilized your own database servers can be.
  • Reserved Instances
    * 1 year for the cost of 6 months and guaranteed (denied one time) to get an instance.
    * Con is tied to an instance type and they want more flexibility to choose instance types as their software changes and take advantage of new instance types as they are released.
  • Rather than having dedicated memcached machines they've scavenged 8 GB of memory from their existing servers.

    Related Sites

  • AWS Start-Up Event DC 2009: HotPads On AWS Slideshow.
  • Cloud Programming Directly Feeds Cost Allocation Back into Software Design
  • AWS Elastic Load Balancer Tutorial

  • Friday
    May152009

    Wolfram|Alpha Architecture

    Making the world's knowledge computable

    Today's Wolfram|Alpha is the first step in an ambitious, long-term project to make all systematic knowledge immediately computable by anyone. You enter your question or calculation, and Wolfram|Alpha uses its built-in algorithms and growing collection of data to compute the answer.

    Answer Engine vs Search Engine

    When Wolfram|Alpha launches later today, it will be one of the most computationally intensive websites on the internet. The Wolfram|Alpha computational knowledge engine is an "answer engine" that is able to produce answers to various questions such as
    • What is the GDP of France?
    • Weather is Springfield when David Ortiz was born
    • 33 g of gold
    • LDL vs. serum potassium 150 smoker male age 40
    • life expectancy male age 40 finland
    • highschool teacher median wage
    Wolfram|Alpha excels at different areas like mathematics, statistics, physics, engineering, astronomy, chemistry, life sciences, geology, business and finance as demonstrated by Steven Wolfram in his Introduction screencast.

    The Stats

    • Abour 10,000 CPU cores at launch
    • 10+ trillion of pieces of data
    • 50,000+ types of algorithms
    • Able to handle about 175 million queries per day
    • 5+ million lines of symbolic Mathematica code

    The Computers Powering Computable Knowledge

    There is no way to know exactly how much traffic to expect, especially during the initial period immediately following the launch, but the Wolfram|Alpha team is working hard to put reasonable capacity in place. As Stephen writes in the Wolfram|Alpha blog Alpha will run in 5 distributed colocation facilities. What computing power have they gathered in these facilities for launch day? Two supercomputers, just about 10,000 processor cores, hundreds of terabytes of disks, a heck of a lot of bandwidth, and what seems like enough air conditioning for the Sahara to host a ski resort. One of their launch partners, R Systems, created the world’s 44th largest supercomputer (per the June 2008 TOP500 list - it is listed as 66th per the latest Top500 list). They call it the R Smarr. It will be running Wolfram|Alpha on launch day! R Smarr has a Sum Rmax of 39580 GFlops using Dell DCS CS23-SH, QC HT 2.8 GHz computers, 4608 cores, 65536 GB of RAM and Infiniband interconnect. Dell is another of the launch partners with a data center full of quad-board, dual-processor, quad-core Harpertown servers. What does it all add up to? The ability to handle 175 million queries (yielding maybe a billion) per day—over 5 billion queries (encompassing around 30 billion calculations) per month.

    The Launch of Wolfram|Alpha

    Watch a live webcast of the Wolfram|Alpha system being brought online for the first time on
    • Friday, May 15, beginning at 7pm CST

    The First Killer App of The New Kind of Science

    The Genius behind Wolfram|Alpha is Stephen Wolfram. He is best know for his ambitious projects: Mathematica and A New Kind of Science (NKS). May 14, 2009 marks the 7th anniversary of the publication of his book A New Kind of Science. Stephen explains is his blog post: But for me the biggest thing that’s happened this year is the emergence of Wolfram|Alpha. Wolfram|Alpha is, I believe, going to be the first killer app of NKS.

    Status

    That it should be possible to build Wolfram|Alpha as it exists today in the first decade of the 21st century was far from obvious. And yet there is much more to come. As of now, Wolfram|Alpha contains 10+ trillion of pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains. Built with Mathematica—which is itself the result of more than 20 years of development at Wolfram Research—Wolfram|Alpha's core code base now exceeds 5 million lines of symbolic Mathematica code. Running on supercomputer-class compute clusters, Wolfram|Alpha makes extensive use of the latest generation of web and parallel computing technologies, including webMathematica and gridMathematica.

    How Mathematica Made Wolfram|Alpha Possible?

    Wolfram|Alpha is a major software engineering development to make all systematic knowledge immediately computable by anyone. It is developed and deployed entirely with Mathematica—in fact, Mathematica has uniquely made Wolfram|Alpha possible. Here's why.
    • Computational knowledge and intelligence
    • High-performance enterprise deployment
    • One coherent architecture
    • Smart method selection
    • Dynamic report generation
    • Database connectivity
    • Built-in, computable data
    • High-level programming language
    • Efficient text processing and linguistic analysis
    • Wide-ranging, automated visualization capabilities
    • Automated importing
    • Development environment

    Information Sources

    Congratulations Stephen!

    Click to read more ...

    Friday
    Apr242009

    Heroku - Simultaneously Develop and Deploy Automatically Scalable Rails Applications in the Cloud

    Update 4: Heroku versus GAE & GAE/J

    Update 3: Heroku has gone live!. Congratulations to the team. It's difficult right now to get a feeling for the relative cost and reliability of Heroku, but it's an impressive accomplishment and a viable option for people looking for a delivery platform.

    Update 2: Heroku Architecture. A great interactive presentation of the Heroku stack. Requests flow into Nginx used as a HTTP Reverse Proxy. Nginx routes requests into a Varnish based HTTP cache. Then requests are injected into an Erlang based routing mesh that balances requests across a grid of dynos. Dynos are your application "VMs" that implement application specific behaviors. Dynos themselves are a stack of: POSIX, Ruby VM, App Server, Rack, Middleware, Framework, Your App. Applications can access PostgreSQL. Memcached is used as an application caching layer.

    Update: Aaron Worsham Interview with James Lindenbaum, CEO of Heroku. Aaron nicely sums up their goal: Heroku is looking to eliminate all the reasons companies have for not doing software projects.


    Adam Wiggins of Heroku presented at the lollapalooza that was the Cloud Computing Demo Night. The idea behind Heroku is that you upload a Rails application into Heroku and it automatically deploys into EC2 and it automatically scales using behind the scenes magic. They call this "liquid scaling." You just dump your code and go. You don't have to think about SVN, databases, mongrels, load balancing, or hosting. You just concentrate on building your application. Heroku's unique feature is their web based development environment that lets you develop applications completely from their control panel. Or you can stick with your own development environment and use their API and Git to move code in and out of their system.

    For website developers this is as high up the stack as it gets. With Heroku we lose that "build your first lightsaber" moment marking the transition out of apprenticeship and into mastery. Upload your code and go isn't exactly a heroes journey, but it is damn effective...

    I must confess to having an inherent love of Heroku's idea because I had a similar notion many moons ago, but the trendy language of the time was Perl instead of Rails. At the time though it just didn't make sense. The economics of creating your own "cloud" for such a different model wasn't there. It's amazing the niches utility computing will seed, fertilize, and help grow. Even today when using Eclipse I really wish it was hosted in the cloud and I didn't have to deal with all its deployment headaches. Firefox based interfaces are pretty impressive these days. Why not?

    Adam views their stack as:
    1. Developer Tools
    2. Application Management
    3. Cluster Management
    4. Elastic Compute Cloud

    At the top level developers see a control panel that lets them edit code, deploy code, interact with the database, see logs, and so on. Your website is live from the first moment you start writing code. It's a powerful feeling to write normal code, see it run immediately, and know it will scale without further effort on your part. Now, will you be able toss your Facebook app into the Heroku engine and immediately handle a deluge of 500 million hits a month? It will be interesting to see how far a generic scaling model can go without special tweaking by a certified scaling professional. Elastra has the same sort of issue.

    Underneath Heroku makes sure all the software components work together in Lennon-McCartney style harmony. They take care (or will take care of) starting and stopping VMs, deploying to those VMs, billing, load balancing, scaling, storage, upgrades, failover, etc. The dynamic nature of Ruby and the development and deployment infrastructure of Rails is what makes this type of hosting possible. You don't have to worry about builds. There's a great infrastructure for installing packages and plugins. And the big hard one of database upgrades is tackled with the new migrations feature.

    A major issue in the Rails world is versioning. Given the precambrian explosion of Rails tools, how does Heroku make sure all the various versions of everything work together? Heroku sees this as their big value add. They are in charge of making sure everything works together. We see a lot companies on the web taking on the role of curator ([1], [2], [3]). A curator is a guardian or an overseer. Of curators Steve Rubel says: They acquire pieces that fit within the tone, direction and - above all - the purpose of the institution. They travel the corners of the world looking for "finds." Then, once located, clean them up and make sure they are presentable and offer the patron a high quality experience. That's the role Heroku will play for their deployable Rails environment.

    With great automated power comes great restrictions. And great opportunity. Curating has a cost for developers: flexibility. The database they support is Postgres. Out of luck if you wan't MySQL. Want a different Ruby version or Rails version? Not if they don't support it. Want memcache? You just can't add it yourself. One forum poster wanted, for example, to use the command line version of ImageMagick but was told it wasn't installed and use RMagick instead. Not the end of the world. And this sort of curating has to be done to keep a happy and healthy environment running, but it is something to be aware of.

    The upside of curation is stuff will work. And we all know how hard it can be to get stuff to work. When I see an EC2 AMI that already has most of what I need my heart goes pitter patter over the headaches I'll save because someone already did the heavy curation for me. A lot of the value in services like rPath offers, for example, is in curation. rPath helps you build images that work, that can be deployed automatically, and can be easily upgraded. It can take a big load off your shoulders.

    There's a lot of competition for Heroku. Mosso has a hosting system that can do much of what Heroku wants to do. It can automatically scale up at the webserver, data, and storage tiers. It supports a variery of frameworks, including Rails. And Mosso also says all you have to do is load and go.

    3Tera is another competitor. As one user said: It lets you visually (through a web ui) create "applications" based on "appliances". There is a standard portfolio of prebuilt applications (SugarCRM, etc.) and templates for LAMP, etc. So, we build our application by taking a firewall appliance, a CentOS appliance, a gateway, a MySql appliance, glue them together, customize them, and then create our own template. You can specify down to the appliance level, the amount of cpu, memory, disk, and bandwidth each are assigned which let's you scale up your capacity simply by tweaking values through the UI. We can now deploy our Rails/Java hosted offering for new customers in about 20 minutes on our grid. AppLogic has automatic failover so that if anything goes wrong, it reploys your application to a new node in your grid and restarts it. It's not as cheap as EC2, but much more powerful. True, 3Tera won't help with your application directly, but most of the hard bits are handled.

    RightScale is another company that combines curation along with load balancing, scaling, failover, and system management.

    What differentiates Heroku is their web based IDE that allows you to focus solely on the application and ignore the details. Though now that they have a command line based interface as well, it's not as clear how they will differentiate themselves from other offerings.

    The hosting model has a possible downside if you want to do something other than straight web hosting. Let's say you want your system to insert commercials into podcasts. That sort of large scale batch logic doesn't cleanly fit into the hosting model. A separate service accessed via something like a REST interface needs to be created. Possibly double the work. Mosso suffers from this same concern. But maybe leaving the web front end to Heroku is exactly what you want to do. That would leave you to concentrate on the back end service without worrying about the web tier. That's a good approach too.

    Heroku is just getting started so everything isn't in place yet. They've been working on how to scale their own infrastructure. Next is working on scaling user applications beyond starting and stopping mongrels based on load. They aren't doing any vertical scaling of the database yet. They plan on memcaching reads, implementing read-only slaves via Slony, and using the automatic partitioning features built into Postgres 8.3. The idea is to start a little smaller with them now and grow as they grow. By the time you need to scale bigger they should have the infrastructure in place.

    One concern is that pricing isn't nailed down yet, but my gut says it will be fair. It's not clear how you will transfer an existing database over, especially from a non-Postgres database. And if you use the web IDE I wonder how you will normal project stuff like continuous integration, upgrades, branching, release tracking, and bug tracking? Certainly a lot of work to do and a lot of details to work out, but I am sure it's nothing they can't handle.

    Related Articles

  • Heroku Rails Podcast
  • Heroku Open Source Plugins etc
  • Tuesday
    Apr072009

    Six Lessons Learned Deploying a Large-scale Infrastructure in Amazon EC2 

    Lessons learned from OpenX's large-scale deployment to Amazon EC2:

  • Expect failures; what's more, embrace them
  • Fully automate your infrastructure deployments
  • Design your infrastructure so that it scales horizontally
  • Establish clear measurable goals
  • Be prepared to quickly identify and eliminate bottlenecks
  • Play wack-a-mole for a while, until things get stable

    Click to read more ...

  • Saturday
    Apr042009

    Digg Architecture

    Update 4:: Introducing Digg’s IDDB Infrastructure by Joe Stump. IDDB is a way to partition both indexes (e.g. integer sequences and unique character indexes) and actual tables across multiple storage servers (MySQL and MemcacheDB are currently supported with more to follow). Update 3:: Scaling Digg and Other Web Applications. Update 2:: How Digg Works and How Digg Really Works (wear ear plugs). Brought to you straight from Digg's blog. A very succinct explanation of the major elements of the Digg architecture while tracing a request through the system. I've updated this profile with the new information. Update: Digg now receives 230 million plus page views per month and 26 million unique visitors - traffic that necessitated major internal upgrades. Traffic generated by Digg's over 22 million famously info-hungry users and 230 million page views can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle billions of requests a month? Site: http://digg.com

    Information Sources

  • How Digg Works by Digg
  • How Digg.com uses the LAMP stack to scale upward
  • Digg PHP's Scalability and Performance

    Platform

  • MySQL
  • Linux
  • PHP
  • Lucene
  • Python
  • APC PHP Accelerator
  • MCache
  • Gearman - job scheduling system
  • MogileFS - open source distributed filesystem
  • Apache
  • Memcached

    The Stats

  • Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL. 4.0 using the default MyISAM storage engine
  • Over 22 million users.
  • 230 million plus page views per month
  • 26 million unique visitors per month
  • Several billion page views per month
  • None of the scaling challenges faced had anything to do with PHP. The biggest issues faced were database related.
  • Dozens of web servers.
  • Dozens of DB servers.
  • Six specialized graph database servers to run the Recommendation Engine.
  • Six to ten machines that serve files from MogileFS.

    What's Inside

  • Specialized load balancer appliances monitor the application servers, handle failover, constantly adjust the cluster according to health, balance incoming requests and caching JavaScript, CSS and images. If you don't have the fancy load balancers take a look at Linux Virtual Server and Squid as a replacement.
  • Requests are passed to the Application Server cluster. Application servers consist of: Apache+PHP, Memcached, Gearman and other daemons. They are responsible for making coordinating access to different services (DB, MogileFS, etc) and creating the response sent to the browser.
  • Uses a MySQL master-slave setup. - Four master databases are partitioned by functionality: promotion, profiles, comments, main. Many slave databases hang off each master. - Writes go to the masters and reads go to the slaves. - Transaction-heavy servers use the InnoDB storage engine. - OLAP-heavy servers use the MyISAM storage engine. - They did not notice a performance degradation moving from MySQL 4.1 to version 5. - The schema is denormalized more than "your average database design." - Sharding is used to break the database into several smaller ones.
  • Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
  • They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve the appearance of their performance. But what it does is leave a giant data integrity whole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup.
  • To lighten their database load they used the APC PHP accelerator MCache.
  • Memcached is used for caching and memcached servers seemed to be spread across their database and application servers. A specialized daemon monitors connections and kills connections that have been open too long.
  • You can configure PHP not parse and compile on each load using a combination of Apache 2’s worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiles so any subsequent page loads are very fast.
  • MogileFS, a distributed file system, serves story icons, user icons, and stores copies of each story’s source. A distributed file system spreads and replicates files across a lot of disks which supports fast and scalable file access.
  • A specialized Recommendation Engine service was built to act as their distributed graph database. Relational databases are not well structured for generating recommendations so a separate service was created. LinkedIn did something similar for their graph.

    Lessons Learned

  • The number of machines isn't as important what the pieces are and how they fit together.
  • Don't treat the database as a hammer. Recommendations didn't fit will with the relational model so they made a specialized service.
  • Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, transactional tables on the master can use MyISAM for read-only slaves.
  • At some point in their growth curve they were unable to grow by adding RAM so had to grow through architecture.
  • People often complain Digg is slow. This is perhaps due to their large javascript libraries rather than their backend architecture.
  • One way they scale is by being careful of which application they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if that infrastructure doesn't grow along with the features. So push back until your system can handle the new features. This goes to capacity planning, something the Flickr emphasizes in their scaling process.
  • You have to wonder if by limiting new features to match their infrastructure might Digg lose ground to other faster moving social bookmarking services? Perhaps if the infrastructure was more easily scaled they could add features faster which would help them compete better? On the other hand, just adding features because you can doesn't make a lot of sense either.
  • The data layer is where most scaling and performance problems are to be found and these are language specific. You'll hit them using Java, PHP, Ruby, or insert your favorite language here.

    Related Articles

    * LinkedIn Architecture * Live Journal Architecture * Flickr Architecture * An Unorthodox Approach to Database Design : The Coming of the Shard
  • Ebay Architecture

    Click to read more ...