Entries in Linux (20)

Monday
Nov 12 2007

Slashdot Architecture - How the Old Man of the Internet Learned to Scale

Slashdot effect: overwhelming unprepared sites with an avalanche of readers' clicks after being mentioned on Slashdot. Sure, we now have the "Digg effect" and other hot new stars, but Slashdot was the original. And like many stars from generations past, Slashdot plays the elder statesman's role with class, dignity, and restraint. Yet with millions and millions of users Slashdot is still box office gold and more than keeps up with the young'uns. And with age comes the wisdom of learning how to handle all those users. Just how does Slashdot scale and what can you learn by going old school? Site: http://slashdot.org

Information Sources

  • Slashdot's Setup, Part 1 - Hardware
  • Slashdot's Setup, Part 2 - Software
  • History of Slashdot Part 3 - Going Corporate
  • The History of Slashdot Part 4 - Yesterday, Today, Tomorrow

    The Platform

  • MySQL
  • Linux (CentOS/RHEL)
  • Pound
  • Apache
  • Perl
  • Memcached
  • LVS

    The Stats

  • Started building the system in 1999.
  • 5.5 million user visits per month.
  • 7,000 comments are added every day.
  • Over 9 million page views daily.
  • Over 21 million comments.
  • Average monthly bandwidth usage is around 40-50 Mbit/sec.
  • For the same story Kottke.org found Slashdot delivered 4 times more users than Digg. So Slashdot ain't dead yet.
  • From The History of Slashdot Part 4: On [September 11th] the mainstream news websites buckled under the loads, and although we had to turn off logging, we managed to stay up, sharing news in a time where it was often difficult to get. That was the day where the team of engineers that make this site happen pulled together and did the impossible, forcing our limited little hardware cluster to handle traffic that was probably triple or quadruple a normal day.

    The Hardware Architecture

  • Data center design is similar to all the other SourceForge, Inc. sites and has proven to scale well.
  • Two Active-Active gigabit uplinks.
  • A pair of Cisco 7301s serve as gateway/border routers. They perform some basic filtering, which is tiered to spread the load.
  • Foundry BigIron 8000s act as core switches/routers.
  • Foundry FastIron 9604s are used as switches for some racks.
  • A pair of Rackable Systems servers (1U; P4 Xeon 2.66GHz, 2GB RAM, 2x80GB IDE, running CentOS and LVS) serve as load-balancing firewalls, distributing traffic to the web servers. BIG-IP F5s are being deployed in their new data center.
  • All servers are at least RAID 1.
  • 16 web servers: - Running Red Hat 9. - Rackable 1U servers with 2 Xeon 2.66GHz processors, 2GB of RAM, and 2x80GB IDE hard drives. - Two serve static content: JavaScript, images, and the front page for non-logged-in users. - Four serve the front page to logged-in users. - Ten handle comment pages. - Host roles are changed in response to load. - All NFS mounts are in read-only mode.
  • NFS server is a Rackable 2U with 2 Xeon 2.4GHz processors, 2GB of RAM, and 4x36GB 15K RPM SCSI drives.
  • 7 database servers: - All run CentOS 4. - 2 in a master-master configuration: -- Dual Opteron 270s with 16GB RAM, 4x36GB 15K RPM SCSI drives. -- One master is the write-only database. -- One master is the read-only database. -- They can fail over at any time and switch roles. - 2 reader databases: -- Dual Opteron 270s with 8GB RAM, 4x36GB 15K RPM SCSI drives. -- Each syncs from one of the master databases. -- Can add more to scale, but they're plenty fast enough for now. - 3 miscellaneous databases: -- Quad P3 Xeon 700MHz with 4GB RAM, 8x36GB 10K RPM SCSI drives. -- Accesslog writer and accesslog reader. Separate databases are used because moderation and stats require a lot of CPU time for computation. -- Search database.

    The Software Architecture

  • Logged-in and non-logged-in users are treated differently. - Non-logged-in users all see the same page, a static page that is updated every couple of minutes. - Logged-in users have custom options which can't be cached, so generating pages for these users takes more resources.
  • 6 Pound servers (1 for SSL) are used as reverse proxies: - If a request can't be handled locally it is forwarded to a web server. - The Pound servers run on the same machines as the web servers. - They are distributed for load balancing and redundancy. - SSL is handled by the Pound server so the web servers don't need to support SSL.
  • 16 Apache web servers (version 1.3): - Software is mounted from /usr/local on the read-only NFS server. - The images are kept simple. All that is compiled in is: -- mod_perl -- lingerd to free up RAM during delivery. -- mod_auth_useragent to block bots. - 1 for SSL. - 2 for static (.shtml) requests. - 4 for the dynamic homepage. - 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl). - 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose).
  • Reasons for segregating Apache servers into different roles: - Isolate the servers in case there are performance problems or a DDoS attack on a specific page. The rest of the system will function even when one part is failing. - For efficiency reasons like httpd-level caching and MaxClients tuning. The web server can be tuned differently for each role. MaxClients is set to 5-15 for dynamic web servers and 25 for static servers. The bottleneck is CPU, not RAM, so if requests aren't processed quickly then something's wrong, and queuing more requests won't help the CPU process them any faster.
  • Using read-only NFS mounts has contributed to the robustness of the system. Tasks that write to /usr/local, for example to update index.html every second, run on the NFS server.
  • Use their own SQL API built on top of DBD::mysql and DBI.pm.
  • A huge performance boost was provided by caching users, stories, and comment text using memcached (a minimal sketch follows this list).
  • Most data access is through get and set methods written custom for each data type and through methods that perform one specific update or select.
  • The Multiple-master replication architecture allows keeping the site fully live even during blocking queries like ALTER TABLE.
  • Multi-pass log processing is used to detect abuse and to pick which users get mod points.
  • The moderation system was created in response to spam. It was just a few friends at first and then a lot of friends. This didn't scale. So the 'mod points' system was introduced so that any user who contributed to the system could moderate the system.
  • Excessively active users are banned to protect the site from abusive bots.
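
For a concrete feel of the memcached caching mentioned above, here is a minimal read-through cache sketch in Python (Slashdot's own code is Perl built on their SQL API). The memcached address, key format, TTL, and the load_story_from_db/save_story_to_db helpers are illustrative assumptions, not Slashdot's actual code.

        # Minimal read-through cache sketch in the spirit of Slashdot's
        # memcached use for users, stories, and comment text.
        import memcache

        mc = memcache.Client(["127.0.0.1:11211"])

        STORY_TTL = 120  # seconds; illustrative expiry, tune to how fresh stories must be

        def get_story(story_id, load_story_from_db):
            key = "story:%d" % story_id
            story = mc.get(key)
            if story is None:
                story = load_story_from_db(story_id)   # hit MySQL only on a cache miss
                mc.set(key, story, STORY_TTL)
            return story

        def update_story(story_id, story, save_story_to_db):
            save_story_to_db(story_id, story)                # write to the database first
            mc.set("story:%d" % story_id, story, STORY_TTL)  # then keep the cache fresh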

    Lessons Learned

  • The most creatively satisfying period was when money was tight, the group was small, and everyone was helping everyone else with anything that needed to be done.
  • Don't waste your time optimizing code because you are too cheap to buy more machines. Buy the hardware and spend your time working on features.
  • Sell out to a large corporation and you lose control. There's continual pressure to go to the dark side of creating new products, blending in advertiser supplied content, and serving giant ads.
  • Say no to the forces that want you to become just like everyone else. Though many competitors have come and gone, Slashdot is still around because they continue to maintain editorial independence, moderate advertising quantity with a clear distinction between advertising and content, and, in their words, "continue to select the right stories to appeal to our existing audience... not to spend our time courting other audiences that would only dilute the discussions that bring so many of you here day after day."
  • Segregate servers into different policy domains so you can optimize their configuration.
  • Optimizing usually means caching, caching, caching.
  • Tables are mostly, but not fully, normalized. This improves performance in most cases.
  • Over the last seven years the process of developing database backed websites has changed: The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.


  • Tuesday
    Nov 06 2007

    Product: ChironFS

    If you are trying to create highly available file systems, especially across data centers, then ChironFS is one potential solution. It's relatively new, so there aren't lots of experience reports, but it looks worth considering. What is ChironFS and how does it work? Adapted from the ChironFS website: The Chiron Filesystem is a FUSE-based filesystem that frees you from single points of failure. Its main purpose is to guarantee filesystem availability using replication. But it isn't a RAID implementation. RAID replicates DEVICES, not FILESYSTEMS. Why not just use RAID over some network block device? Because it is a block device, and if one server mounts that device in RW mode, no other server will be able to mount it in RW mode. Any real network may have many servers and offer a variety of services. Keeping everything running can become a real nightmare!


    Sunday
    Oct 28 2007

    Scaling Early Stage Startups

    Mark Maunder of No VC Required--who advocates not taking VC money lest you be turned into a frog instead of the prince (or princess) you were dreaming of--has an excellent slide deck on how to scale an early stage startup. His blog also has some good SEO tips and a very spooky widget showing the geographical location of his readers. Perfect for Halloween! What are Mark's otherworldly scaling strategies for startups? Site: http://novcrequired.com/

    Information Sources

  • Slides from Seattle Tech Startup Talk.
  • Scaling Early Stage Startups blog post by Mark Maunder.

    The Platform

  • Linux
  • An ISAM type data store.
  • Perl
  • Httperf is used for benchmarking.
  • Websitepulse.com is used for perf monitoring.

    The Architecture

  • Performance matters because being slow could cost you 20% of your revenue. The UIE guys disagree saying this ain't necessarily so. They explain their reasoning in Usability Tools Podcast: The Truth About Page Download Time. The idea is: "There was still another surprising finding from our study: a strong correlation between perceived download time and whether users successfully completed their tasks on a site. There was, however, no correlation between actual download time and task success, causing us to discard our original hypothesis. It seems that, when people accomplish what they set out to do on a site, they perceive that site to be fast." So it might be a better use of time to improve the front-end rather than the back-end.
  • MySQL was dumped because of performance problems: MySQL didn't handle a high number of writes and deletes on large tables, writes blow away the query cache, large numbers of small tables (over 10,000) are not well supported, it uses a lot of memory to cache indexes, and it maxed out at 200 concurrent read/write queries per second with over 1 million records.
  • For data storage they evolved to a fixed-length, ISAM-like record scheme that allows seeking directly to the data (see the sketch after this list). It still uses file-level locking and is benchmarked at 20,000+ concurrent reads/writes/deletes. They are considering moving to BerkeleyDB, which performs very well and is used by many large websites, especially when you primarily need key-value lookups. I think it might be interesting to store JSON if a lot of this data ends up being displayed on the web page.
  • Moved to httpd prefork for Perl. That, with no keepalive on the application servers, uses less RAM and works well.
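
    To make the fixed-length record idea above concrete, here is a small Python sketch of an ISAM-style store: because every record has the same size, record N lives at byte offset N * RECORD_SIZE and can be read or rewritten with a single seek. The record layout and field widths are invented for illustration; the real system also wraps these operations in file-level locks.

        import os
        import struct

        # 4-byte record id plus a 64-byte fixed-width payload; struct pads or truncates.
        RECORD_FMT = "<I64s"
        RECORD_SIZE = struct.calcsize(RECORD_FMT)

        class FixedRecordStore:
            def __init__(self, path):
                mode = "r+b" if os.path.exists(path) else "w+b"
                self.f = open(path, mode)   # real code adds file-level locking here

            def write(self, rec_no, rec_id, payload):
                self.f.seek(rec_no * RECORD_SIZE)          # seek straight to the record
                self.f.write(struct.pack(RECORD_FMT, rec_id, payload.encode()))
                self.f.flush()

            def read(self, rec_no):
                self.f.seek(rec_no * RECORD_SIZE)
                data = self.f.read(RECORD_SIZE)
                if len(data) < RECORD_SIZE:
                    return None                            # past the end of the file
                rec_id, payload = struct.unpack(RECORD_FMT, data)
                return rec_id, payload.rstrip(b"\x00").decode()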

    Lessons Learned

  • Configure your DB and web server correctly. MySQL and Apache's memory usage can easily spiral out of control, which leads to grindingly slow performance as swapping increases. Here are a few resources to help with configuration issues.
  • Serve only the users you care about. Block content thieves that crawl your site and use a lot of valuable resources for nothing. Monitor the number of content pages they fetch per minute. If a threshold is exceeded, do a reverse lookup on their IP address and configure your firewall to block them (a sketch follows this list).
  • Cache as much DB data and static content as possible. Perl's Cache::FileCache was used to cache DB data and rendered HTML on disk.
  • Use two different host names in URLs to enable browser clients to load images in parallel.
  • Make content as static as possible. Create a separate image and CSS server to serve the static content. Use keepalives on static content, as static content uses little memory per thread/process.
  • Leave plenty of spare memory. Spare memory allows Linux to use more memory for file system caching, which increased performance by about 20 percent.
  • Turn keepalive off for your dynamic content. Keeping connections open for more HTTP requests can exhaust the thread and memory resources needed to serve them.
  • You may not need a complex RDBMS for accessing data. Consider a lighter-weight database like BerkeleyDB.
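
    A rough sketch of the crawler-blocking idea from the "serve only the users you care about" lesson above: count page fetches per IP over a one-minute window, and past a threshold do the reverse lookup and emit a firewall rule. The threshold, window, allow-list check, and the iptables command are illustrative assumptions, not the article's actual values.

        import socket
        import time
        from collections import defaultdict

        THRESHOLD = 60          # pages per minute before an IP is considered abusive
        WINDOW = 60.0           # seconds

        hits = defaultdict(list)

        def record_hit(ip):
            now = time.time()
            recent = [t for t in hits[ip] if now - t < WINDOW]
            recent.append(now)
            hits[ip] = recent
            if len(recent) > THRESHOLD:
                try:
                    host = socket.gethostbyaddr(ip)[0]   # reverse lookup before deciding
                except OSError:
                    host = "unknown"
                if "googlebot" not in host:              # spare the crawlers you care about
                    print("iptables -A INPUT -s %s -j DROP  # blocking %s" % (ip, host))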


  • Tuesday
    Sep 18 2007

    Amazon Architecture

    This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included. Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles. Site: http://amazon.com

    Information Sources

  • Early Amazon by Greg Linden
  • How Linux saved Amazon millions
  • Interview with Werner Vogels - Amazon's CTO
  • Asynchronous Architectures - a nice summary of Werner Vogels' talk by Chris Loosley
  • Learning from the Amazon technology platform - A Conversation with Werner Vogels
  • Werner Vogels' Weblog - building scalable and robust distributed systems

    Platform

  • Linux
  • Oracle
  • C++
  • Perl
  • Mason
  • Java
  • JBoss
  • Servlets

    The Stats

  • More than 55 million active customer accounts.
  • More than 1 million active retail partners worldwide.
  • Between 100-150 services are accessed to build a page.

    The Architecture

  • What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added. Increasing performance in general means serving more units of work, but it can also be to handle larger units of work, such as when datasets grow.
  • The big architectural change that Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications.
  • Started as one application talking to a back end. Written in C++.
  • It grew. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. In 2001 it became clear that the front-end application couldn't scale anymore. The databases were split into small parts, and around each part they created a services interface that was the only way to access the data.
  • The databases became a shared resource that made it hard to scale-out the overall business. The front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.
  • Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently.
  • Grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server. So are the applications that serve the Web-services interface, the customer service application, and the seller interface.
  • Many third party technologies are hard to scale to Amazon size. Especially communication infrastructure technologies. They work well up to a certain scale and then fail. So they are forced to build their own.
  • Not stuck with one particular approach. In some places they use JBoss/Java, but they use only servlets, not the rest of the J2EE stack.
  • C++ is used to process requests. Perl/Mason is used to build content.
  • Amazon doesn't like middleware because it tends to be a framework and not a tool. If you use a middleware package you get locked into the software patterns they have chosen. You'll only be able to use their software, so if you want to use different packages you won't be able to. You're stuck: one event loop for messaging, data persistence, AJAX, etc. Too complex. If middleware were available in smaller components, more as a tool than a framework, they would be more interested.
  • The SOAP web stack seems to want to solve all the same distributed systems problems all over again.
  • Offer both SOAP and REST web services. 30% use SOAP. These tend to be Java and .NET users who use WSDL files to generate remote object interfaces. 70% use REST. These tend to be PHP or Perl users.
  • With either SOAP or REST, developers can get an object interface to Amazon. Developers just want to get the job done. They don't care what goes over the wire.
  • Amazon wanted to build an open community around their services. Web services were chosen because they're simple. But that's only on the perimeter. Internally it's a service-oriented architecture. You can only access the data via the interface. It's described in WSDL, but they use their own encapsulation and transport mechanisms.
  • Teams are Small and are Organized Around Services - Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams. - If you have a new business idea or problem you want to solve you form a team. Limit the team to 8-10 people because communication is hard. They are called two-pizza teams: the number of people you can feed with two pizzas. - Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit. - As an example, they created a team to find phrases within a book that are unique to the text. This team built a separate service interface for that feature and they had authority to do what they needed. - Extensive A/B testing is used to integrate a new service. They see what the impact is and take extensive measurements.
  • Deployment - They create special infrastructure for managing dependencies and doing a deployment. - Goal is to have all the right services deployed on a box. All application code, monitoring, licensing, etc. should be on a box. - Everyone has a home-grown system to solve these problems. - Output of the deployment process is a virtual machine. You can use EC2 to run them.
  • Work From the Customer Backwards to Verify a New Service is Worth Doing - Work from the customer backward. Focus on value you want to deliver for the customer. - Force developers to focus on value delivered to the customer instead of building technology first and then figuring how to use it. - Start with a press release of what features the user will see and work backwards to check that you are building something valuable. - End up with a design that is as minimal as possible. Simplicity is the key if you really want to build large distributed systems.
  • State Management is the Core Problem for Large Scale Systems - Internally they can deliver infinite storage. - Not all that many operations are stateful. Checkout steps are stateful. - The most recently clicked web page service keeps recommendations based on session IDs. - They keep track of everything anyway, so it's not a matter of keeping state. There's little separate state that needs to be kept for a session. The services will already be keeping the information, so you just use the services.
  • Eric Brewer's CAP Theorem, or the Three Properties of Systems - Three properties of a system: consistency, availability, tolerance to network partitions. - You can have at most two of these three properties for any shared-data system. - Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone. - Consistency: write a value and then read the value, and you get the same value back. In a partitioned system there are windows where that's not true. - Availability: you may not always be able to write or read. The system will say you can't write because it wants to keep the system consistent. - To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency. - Choose a specific approach based on the needs of the service. - For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later. - When a customer submits an order you favor consistency because several services--credit card processing, shipping and handling, reporting--are simultaneously accessing the data. A sketch of the two write policies follows this list.
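
    To make the availability-versus-consistency choice above concrete, here is an illustrative Python sketch (not Amazon's code) of the two write policies: the cart write succeeds if any replica is reachable and divergence is reconciled later, while the order write refuses to proceed unless every replica can acknowledge it. The Replica class is a stand-in for real datastores.

        class Replica:
            def __init__(self, name, up=True):
                self.name, self.up, self.data = name, up, {}

            def write(self, key, value):
                if not self.up:
                    raise ConnectionError("%s unreachable" % self.name)
                self.data[key] = value

        def add_to_cart(replicas, user, item):
            # Favor availability: accept the write on any reachable replica and
            # reconcile the divergent copies later.
            for r in replicas:
                try:
                    r.write(("cart", user), item)
                    return True              # revenue-producing write always accepted
                except ConnectionError:
                    continue
            return False                     # fails only if *every* replica is down

        def place_order(replicas, user, order):
            # Favor consistency: billing and shipping all read this record, so
            # require every replica to acknowledge the write or reject it.
            if not all(r.up for r in replicas):
                return False                 # refuse rather than risk divergence
            for r in replicas:
                r.write(("order", user), order)
            return True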

    Lessons Learned

  • You must change your mentality to build really scalable systems. Approach chaos in a probabilistic sense and expect that things will still work well. In traditional systems we present a perfect world where nothing goes down and then we build complex algorithms (agreement technologies) on this perfect world. Instead, take it for granted that stuff fails, that's reality, embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create self-healing, self-organizing, lights-out operations.
  • Create a shared nothing infrastructure. Infrastructure can become a shared resource for development and deployment with the same downsides as shared resources in your logic and data tiers. It can cause locking and blocking and dead lock. A service oriented architecture allows the creation of a parallel and isolated development process that scales feature development to match your growth.
  • Open up your system with APIs and you'll create an ecosystem around your application.
  • The only way to manage a large distributed system is to keep things as simple as possible. Keep things simple by making sure there are no hidden requirements and hidden dependencies in the design. Cut technology to the minimum you need to solve the problem you have. It doesn't help the company to create artificial and unneeded layers of complexity.
  • Organizing around services gives agility. You can do things in parallel because the output is a service. This allows fast time to market. Create an infrastructure that allows services to be built very fast.
  • There are bound to be problems with anything that produces hype before a real implementation.
  • Use SLAs internally to manage services.
  • Anyone can very quickly add web services to their product. Just implement one part of your product as a service and start using it.
  • Build your own infrastructure for performance, reliability, and cost control reasons. By building it yourself you never have to say you went down because it was company X's fault. Your software may not be more reliable than others, but you can fix, debug, and deploy much quicker than when working with a 3rd party.
  • Use measurement and objective debate to separate the good from the bad. I've been to several presentations by ex-Amazoners and this is the aspect of Amazon that strikes me as uniquely different and interesting from other companies. Their deep-seated ethic is to expose real customers to a choice and see which one works best and to make decisions based on those tests. Avinash Kaushik calls this getting rid of the influence of the HiPPOs, the highest paid people in the room. This is done with techniques like A/B testing and web analytics (a bucketing sketch follows this list). If you have a question about what you should do, code it up, let people use it, and see which alternative gives you the results you want.
  • Create a frugal culture. Amazon used doors for desks, for example.
  • Know what you need. Amazon has a bad experience with an early recommender system that didn't work out: "This wasn't what Amazon needed. Book recommendations at Amazon needed to work from sparse data, just a few ratings or purchases. It needed to be fast. The system needed to scale to massive numbers of customers and a huge catalog. And it needed to enhance discovery, surfacing books from deep in the catalog that readers wouldn't find on their own."
  • People's side projects, the ones they follow because they are interested, are often the ones where you get the most value and innovation. Never underestimate the power of wandering where you are most interested.
  • Involve everyone in making dog food. Go out into the warehouse and pack books during the Christmas rush. That's teamwork.
  • Create a staging site where you can run thorough tests before releasing into the wild.
  • A robust, clustered, replicated, distributed file system is perfect for read-only data used by the web servers.
  • Have a way to rollback if an update doesn't work. Write the tools if necessary.
  • Switch to a deep services-based architecture (http://webservices.sys-con.com/read/262024.htm).
  • Look for three things in interviews: enthusiasm, creativity, competence. The single biggest predictor of success at Amazon.com was enthusiasm.
  • Hire a Bob. Someone who knows their stuff, has incredible debugging skills and system knowledge, and most importantly, has the stones to tackle the worst high pressure problems imaginable by just leaping in.
  • Innovation can only come from the bottom. Those closest to the problem are in the best position to solve it. Any organization that depends on innovation must embrace chaos. Loyalty and obedience are not your tools.
  • Creativity must flow from everywhere.
  • Everyone must be able to experiment, learn, and iterate. Position, obedience, and tradition should hold no power. For innovation to flourish, measurement must rule.
  • Embrace innovation. In front of the whole company, Jeff Bezos would give an old Nike shoe as "Just do it" award to those who innovated.
  • Don't pay for performance. Give good perks and high pay, but keep it flat. Recognize exceptional work in other ways. Merit pay sounds good but is almost impossible to do fairly in large organizations. Use non-monetary awards, like an old shoe. It's a way of saying thank you, somebody cared.
  • Get big fast. The big guys like Barnes and Nobel are on your tail. Amazon wasn't even the first, second, or even third book store on the web, but their vision and drive won out in the end.
  • In the data center, only 30 percent of the staff time spent on infrastructure issues relates to value creation, with the remaining 70 percent devoted to dealing with the "heavy lifting" of hardware procurement, software management, load balancing, maintenance, scalability challenges and so on.
  • Prohibit direct database access by clients. This means you can make your service scale and be more reliable without involving your clients. This is much like Google's ability to independently distribute improvements in their stack to the benefit of all applications.
  • Create a single unified service-access mechanism. This allows for the easy aggregation of services, decentralized request routing, distributed request tracking, and other advanced infrastructure techniques.
  • Making Amazon.com available through a Web services interface to any developer in the world free of charge has also been a major success because it has driven so much innovation that they couldn't have thought of or built on their own.
  • Developers themselves know best which tools make them most productive and which tools are right for the job.
  • Don't impose too many constraints on engineers. Provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, allow teams to function as independently as possible.
  • Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. Have many support tools that are of a self-help nature. Support an environment around the service development that never gets in the way of the development itself.
  • You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
  • Developers should spend some time with customer service every two years. There they'll actually listen to customer service calls, answer customer service e-mails, and really understand the impact of the kinds of things they do as technologists.
  • Use a "voice of the customer," which is a realistic story from a customer about some specific part of your site's experience. This helps managers and engineers connect with the fact that we build these technologies for real people. Customer service statistics are an early indicator if you are doing something wrong, or what the real pain points are for your customers.
  • Infrastructure for Amazon, like for Google, is a huge competitive advantage. They can build very complex applications out of primitive services that are by themselves relatively simple. They can scale their operation independently, maintain unparalleled system availability, and introduce new services quickly without the need for massive reconfiguration.
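
    As a small illustration of the measurement culture described above, here is a hypothetical deterministic A/B bucketing sketch in Python: hashing the user id with the experiment name keeps each user in the same variant, and the per-bucket conversion rates are what you compare. The experiment name and event format are invented for illustration, not Amazon's tooling.

        import hashlib

        def bucket(user_id, experiment, variants=("control", "treatment")):
            # Hash experiment + user so assignment is stable across requests.
            digest = hashlib.sha1(("%s:%s" % (experiment, user_id)).encode()).hexdigest()
            return variants[int(digest, 16) % len(variants)]

        def summarize(events):
            # events: iterable of (user_id, converted) pairs observed in the logs.
            totals = {}
            for user_id, converted in events:
                b = bucket(user_id, "new-checkout-flow")
                seen, wins = totals.get(b, (0, 0))
                totals[b] = (seen + 1, wins + int(converted))
            # Conversion rate per variant; the better-measured variant wins the debate.
            return {b: wins / float(seen) for b, (seen, wins) in totals.items()}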


  • Wednesday
    Aug 22 2007

    Wikimedia architecture

    Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/

    Information Sources

  • Wikimedia architecture
  • http://meta.wikimedia.org/wiki/Wikimedia_servers
  • scale-out vs scale-up in the From Oracle to MySQL blog.

    Platform

  • Apache
  • Linux
  • MySQL
  • PHP
  • Squid
  • LVS
  • Lucene for Search
  • Memcached for Distributed Object Cache
  • Lighttpd Image Server

    The Stats

  • 8 million articles spread over hundreds of language projects (English, Dutch, ...)
  • 10th busiest site in the world (source: Alexa)
  • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
  • 30 000 HTTP requests/s during peak-time
  • 3 Gbit/s of data traffic
  • 3 data centers: Tampa, Amsterdam, Seoul
  • 350 servers, ranging between 1x P4 to 2x Xeon Quad-Core, 0.5 - 16 GB of memory
  • managed by ~ 6 people
  • 3 clusters on 3 different continents

    The Architecture

  • Geographic load balancing, based on the source IP of the client's resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters.
  • HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
  • 55 Squid servers currently, plus 20 waiting for setup.
  • 1,000 HTTP requests/s per server, up to 2,500 under stress
  • ~ 100 - 250 Mbit/s per server
  • ~ 14 000 - 32 000 open connections per server
  • Up to 40 GB of disk caches per Squid server
  • Up to 4 disks per server (1U rack servers)
  • 8 GB of memory, half of that used by Squid
  • Hit rates: 85% for Text, 98% for Media, since the use of CARP.
  • PowerDNS provides geographical distribution.
  • In their primary and regional data centers they build text and media clusters on LVS, CARP Squid, and cache Squid. The primary data center holds the media storage.
  • To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches.
  • One centrally managed & synchronized software installation for hundreds of wikis.
  • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
  • Hardware shared with External Storage and Memcached tasks
  • Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
  • Actual revision text is stored as blobs in External storage
  • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
  • Separate database per wiki (not separate server!)
  • One master, many replicated slaves
  • Read operations are load balanced over the slaves, write operations go to the master
  • The master is used for some read operations in case the slaves are not yet up to date (lagged); see the sketch after this list.
  • External Storage - Article text is stored on separate data storage clusters, simple append-only blob storage. Saves space on expensive and busy core databases for largely unused data - Allows use of spare resources on application servers (2x 250-500 GB per server) - Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability
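
    Here is a minimal Python sketch of the read/write split and lag handling described above: writes always go to the master, reads are balanced over the slaves, and a slave that has fallen too far behind is skipped in favor of the master. The lag threshold and the get_lag_seconds callback are illustrative assumptions, not MediaWiki's actual code.

        import random

        MAX_LAG_SECONDS = 5   # illustrative threshold

        class DBCluster:
            def __init__(self, master, slaves, get_lag_seconds):
                self.master = master
                self.slaves = slaves
                self.get_lag_seconds = get_lag_seconds   # e.g. parsed from SHOW SLAVE STATUS

            def connection_for(self, query_is_write, needs_fresh_data=False):
                if query_is_write:
                    return self.master                   # all writes go to the master
                candidates = [s for s in self.slaves
                              if self.get_lag_seconds(s) <= MAX_LAG_SECONDS]
                if needs_fresh_data or not candidates:
                    return self.master                   # lagged slaves: read the master
                return random.choice(candidates)         # balance reads over fresh slaves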

    Lessons Learned

  • Focus on architecture, not so much on operations or nontechnical stuff.
  • Sometimes caching costs more than recalculating or looking up at the data source...profiling!
  • Avoid expensive algorithms, database queries, etc.
  • Cache every result that is expensive and has temporal locality of reference.
  • Focus on the hot spots in the code (profiling!).
  • Scale by separating: - Read and write operations (master/slave) - Expensive operations from cheap and more frequent operations (query groups) - Big, popular wikis from smaller wikis
  • Improve caching: exploit temporal and spatial locality of reference and reduce the data set size per server.
  • Text is compressed and only the differences between revisions are stored.
  • Simple-seeming library calls, like using stat to check for a file's existence, can take too long under load.
  • Disk I/O is seek-limited: the more disk spindles, the better!
  • Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
  • It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
  • Anyone who wants to design their database architecture so that it'll allow them to inexpensively grow from one box, ranked nothing, to the top ten or hundred sites on the net should start out by designing it to handle slightly out-of-date data from replication slaves, know how to load balance to slaves for all read queries, and if at all possible design it so that chunks of data (batches of users, accounts, whatever) can go on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!


  • Monday
    Aug 20 2007

    TypePad Architecture

    TypePad is considered the largest paid blogging service in the world. After experiencing problems because of their meteoric growth, they eventually transitioned to an architecture patterned after their sister company, LiveJournal. Site: http://www.typepad.com/

    The Platform

  • MySQL
  • Memcached
  • Perl
  • MogileFS
  • Apache
  • Linux

    The Stats

  • As of 2005 TypePad sent 250 Mbps of traffic over multiple network pipes, for 3TB of traffic a day, and was growing by 10-20% each month. I was unable to find more recent statistics.

    The Architecture

  • Original Architecture: - Single server running Linux, Apache, Postgres, Perl, mod_perl - Storage was NFS on a filer.
  • A Devastating Crash Caused a New Direction - A RAID controller failed and spewed data across all RAID disks. - The database was corrupted and the backups were corrupted. - Their redundant filers suffered from "split brain" syndrome.
  • They moved to a LiveJournal-type architecture, which isn't surprising since TypePad and LiveJournal are both owned by Six Apart. - Replicated MySQL clusters partitioned by ID. - A global DB generated globally unique sequence numbers and mapped users to partitions (see the sketch after this list). - Other data was mapped by role.
  • Highly Available Database Configuration: - A master-master MySQL replication model is used. - Linux Heartbeat clustering was used to fail over using virtual IP addresses.
  • MogileFS is used to serve images.
  • Perlbal is used as reverse proxy and to load balance requests.
  • A reliable, asynchronous job dispatch system called TheSchwartz is used to support moblogging, adding comments, future publishing, cache invalidation, and publishing.
  • Memcached is used to store counts, sets, stats, and heavyweight data.
  • Migration from the old architecture to the new architecture was tricky: - All users were migrated over without service interruption. - Postgres was removed. - During the migration images were served from NFS and MogileFS.
  • Benefits of their new architecture: - Can easily add new machines and adjust workload. - More highly available and cheaply scalable.
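
    A minimal sketch of the global DB idea above: one small directory hands out globally unique sequence numbers and records which partition each user lives on, so every later query can be routed to the right MySQL cluster. The round-robin placement and in-memory maps are stand-ins for the real global database.

        import itertools

        class GlobalDirectory:
            def __init__(self, partitions):
                self.partitions = partitions            # e.g. ["cluster1", "cluster2"]
                self.user_to_partition = {}
                self._next_id = itertools.count(1)      # globally unique sequence numbers
                self._rr = itertools.cycle(partitions)  # simple round-robin placement

            def create_user(self):
                user_id = next(self._next_id)
                self.user_to_partition[user_id] = next(self._rr)
                return user_id

            def partition_for(self, user_id):
                return self.user_to_partition[user_id]  # route the query to this cluster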

    Lessons Learned

  • Small details are important.
  • Every mistake is a learning experience.
  • Success requires coordination and cooperation.

    Related Articles

  • LiveJournal Architecture.
  • Linux High Availability.


  • Thursday
    Jul 26 2007

    ThemBid Architecture

    ThemBid provides a market where people needing work done broadcast their request and accept bids from people competing for the job. Unlike many of the sites profiled at HighScalability, ThemBid is not in the popular press as often as Paris Hilton. It's not a media darling or a giant of the industry. But what I like is they have a strategy, a point-of-view for building websites and were gracious enough to share very detailed instructions on how to go about building a website. They even delve into actual installation details of the various software packages they use. Anyone can benefit by taking a look at their work. Site: http://www.thembid.com/

    Information Sources

  • Build Scalable Web 2.0 Sites with Ubuntu, Symfony, and Lighttpd

    Platform

  • Linux (Ubuntu)
  • Symfony
  • Lighttpd
  • PHP
  • eAccelerator
  • Eclipse
  • Munin
  • AWStats

    What's Inside?

    The Stats

  • Started work in December of 2006 and had a full demo by March 2007.
  • One developer/sys admin worked with a part-time graphics designer.
  • Targeted a few thousand users after launch.

    The Architecture

  • Hardware. Dual core server with 2GB RAM
  • Storage. 2 x 36GB SCSI 10K RPM drives in RAID 1.
  • Data Center. They went with Layeredtech for the managed server because of past positive experiences.
  • Development Environment. Ubuntu and Eclipse.
  • OS. They chose the server distribution of Ubuntu because that's what they use on the client side and Ubuntu supports "simpler installation and easier maintenance than typical IT deployments."
  • Web Server. Lighttpd is used to handle static content and forward the dynamic PHP page requests to FastCGI.
  • Database. MySQL. When growth makes it necessary, the idea is to move to a master-slave arrangement and then maybe MySQL Cluster.
  • Web Framework. Went with PHP because they knew it and other successful sites like Digg and Yahoo successfully deploy PHP. They chose Symfony as their framework because of its nice documentation and active development community. And Yahoo also uses Symfony. It's a decision that has worked well for them.
  • PHP Cache. eAccelerator is used to compile and cache PHP scripts.
  • Object and Content Cache. The plan is to cache a lot of content. For a bid site like theirs this makes sense. Many of the pieces are used over and over again, so putting them in memory will speed up the entire system and take pressure off the database and the IO system. Initially they used a SQLite cache on top of a memory-based file system. This choice was made because it was supported by Symfony. When a memcached plugin is available they'll try that (a minimal cache wrapper sketch follows this list).
  • Client Side Cache. Lighttpd's mod_expire module is used to prevent JavaScript, style sheets, and images that rarely change from being unnecessarily redownloaded by the browser.
  • Monitoring. Munin is used to monitor their resource usage. It's as simple as visiting "yoursite.com/status" to see what's going on.
  • Log Analysis. AWStats is used to track hits and types of requests. This information can be used to target bottlenecks.
  • Scalability Plan. - Use Munin to tell when to think about upgrading. When your growth trend will soon cross your resources trend, it's time to do something. - Move MySQL to a separate server. This frees up resources (CPU, disk, memory). What you run on this server depends on its capabilities. Maybe run a memcached server on it. - Move to a distributed memory cache using memcached. - Add a MySQL master/slave configuration. - If more web servers are needed, use LVS on the front end as a load balancer.
  • Future Directions. Work on fault tolerance.
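
    A minimal get-or-compute cache wrapper in the spirit of the object cache above: rendered fragments are stored under a key with a TTL, and the backend (SQLite on a memory file system today, memcached later) can be swapped without touching callers. The dict backend and TTL value are illustrative, not ThemBid's actual code.

        import time

        class SimpleCache:
            def __init__(self, ttl=300):
                self.ttl = ttl
                self.store = {}                    # swap for SQLite-on-tmpfs or memcached

            def get_or_compute(self, key, compute):
                entry = self.store.get(key)
                now = time.time()
                if entry and now - entry[1] < self.ttl:
                    return entry[0]                # fresh cached fragment, skip the DB
                value = compute()                  # e.g. render the bid listing from MySQL
                self.store[key] = (value, now)
                return value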

    Lessons Learned

  • It's possible to create a nice site fairly quickly with just a few people using commonly available low-cost tools. And your system will be solid and powerful, with no corners cut.
  • Use feedback from your system to know what needs optimizing and when it's time to scale.
  • Good documentation and an active community draw people. These are very attractive qualities for people making decisions about what to use. It's hard to go with a tool chain when it looks like you may get stuck in the future with no way out and no help. If you make tools make them dead easy to understand, learn, use, and deploy.
  • Stick with the familiar. It may not be optimal, it may not be the best, but it's more important that you get started and make progress. You don't want to delay releasing your site so you can learn a completely different tool chain that may make your life somewhat easier in some projected future. The future is now.
  • Use what works for other people. The fact that Yahoo and Digg use PHP is a good recommendation. Certainly PHP is not the only way to build a site, but it does cut your risk level and help you sleep at night. It also means there's an active community that can help you when you have problems.


  • Monday
    Jul 23 2007

    GoogleTalk Architecture

    Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small low-latency messages and integrating with many other systems. How do they do it? Site: http://www.google.com/talk

    Information Sources

  • GoogleTalk Architecture

    Platform

  • Linux
  • Java
  • Google Stack
  • Shard

    What's Inside?

    The Stats

  • Support presence and messages for millions of users.
  • Handles billions of packets per day in under 100ms.
  • IM is different than many other applications because the requests are small packets.
  • Routing and application logic are applied per packet for sender and receiver.
  • Messages must be delivered in-order.
  • Architecture extends to new clients and Google services.

    Lessons Learned

  • Measure the right thing. - People ask how many IMs you deliver or how many active users you have. That turns out not to be the right engineering question. - The hard part of IM is how to show correct presence to all connected users, because growth is non-linear: ConnectedUsers * BuddyListSize * OnlineStateChanges (a worked example follows this list). - Linear user growth can mean very non-linear server growth, which requires serving many billions of presence packets per day. - Have a large number of friends and presence explodes. The number of IMs is not that big of a deal.
  • Real Life Load Tests - Lab tests are good, but don't tell you enough. - Did a backend launch before the real product launch. - Simulate presence requests and going on-line and off-line for weeks and months, even if real data is not returned. It works out many of the kinks in network, failover, etc.
  • Dynamic Resharding - Divide user data or load across shards. - Google Talk backend servers handle traffic for a subset of users. - Make it easy to change the number of shards with zero downtime. - Don't shard across data centers. Try to keep users local. - Servers can be brought down and backups take over. Then you can bring up new servers, data is migrated automatically, and clients auto-detect and go to the new servers.
  • Add Abstractions to Hide System Complexity - Different systems should have little knowledge of each other, especially when separate groups are working together. - Gmail and Orkut don't know about sharding, load-balancing, or fail-over, data center architecture, or number of servers. Can change at anytime without cascading changes throughout the system. - Abstract these complexities into a set of gateways that are discovered at runtime. - RPC infrastructure should handle rerouting.
  • Understand Semantics of Lower Level Libraries - Everything is abstracted, but you must still have enough knowledge of how libraries work to architect your system. - Does your RPC create TCP connections to all or some of your servers? Very different implications. - Does the library perform health checking? This has architectural implications, as you can have separate systems failing independently. - Which kernel operation should you use? IM requires a lot of connections but few have any activity. Use epoll vs poll/select.
  • Protect Against Operational Problems - Smooth out all spikes in server activity graphs. - What happens when servers restart with an empty cache? - What happens if traffic shifts to a new data center? - Limit cascading problems. Back off from busy servers. Don't accept work when sick. - Isolate in emergencies. Don't infect others with your problems. - Have intelligent retry logic policies abstracted away. Don't sit in hard 1 msec retry loops, for example.
  • Any Scalable System is a Distributed System - Add fault tolerance to every component of the system. Everything fails. - Add ability to profile live servers without impacting server. Allows continual improvement. - Collect metrics from server for monitoring. Log everything about your system so you see patterns in cause and effects. - Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.
  • Software Development Strategies - Make sure binaries are both backward and forward compatible so you can have old clients work with new code. - Build an experimentation framework to try new features. - Give engineers access to product machines. Gives end-to-end ownership. This is very different than many companies who have completely separate OP teams in their data centers. Often developers can't touch production machines.
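
    A back-of-the-envelope illustration of the presence formula from "Measure the right thing" above; the numbers are purely illustrative, not Google's.

        connected_users = 1000000          # users online at once
        buddy_list_size = 100              # average buddies per user
        state_changes_per_day = 10         # sign-ons, sign-offs, status changes per user

        presence_packets = connected_users * buddy_list_size * state_changes_per_day
        print(presence_packets)            # 1,000,000,000 presence packets/day for 1M users

        # Doubling both the users and the average buddy list quadruples the packets,
        # which is why linear user growth can mean very non-linear server growth:
        print((2 * connected_users) * (2 * buddy_list_size) * state_changes_per_day)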


  • Wednesday
    Jul 11 2007

    Friendster Architecture

    Friendster is one of the largest social network sites on the web. It emphasizes genuine friendships and the discovery of new people through friends. Site: http://www.friendster.com/

    Information Sources

  • Friendster - Scaling for 1 Billion Queries per day

    Platform

  • MySQL
  • Perl
  • PHP
  • Linux
  • Apache

    What's Inside?

  • Dual x86-64 AMD Opterons with 8 GB of RAM
  • Faster disk (SAN)
  • Optimized indexes
  • Traditional 3-tier architecture with hardware load balancer in front of the databases
  • Clusters based on types: ad, app, photo, monitoring, DNS, gallery search DB, profile DB, user info DB, IM status cache, message DB, testimonial DB, friend DB, graph servers, gallery search, object cache.

    Lessons Learned

  • No persistent database connections.
  • Removed all sorts.
  • Optimized indexes
  • Don’t go after the biggest problems first
  • Optimize without downtime
  • Split load
  • Moved sorting query types into the application and added LIMITs (see the sketch after this list).
  • Reduced ranges
  • Range on primary key
  • Benchmark -> Make Change -> Benchmark -> Make Change (Cycle of Improvement)
  • Stabilize: always have a plan to rollback
  • Work with a team
  • Assess: Define the issues
  • A key design goal for the new system was to move away from maintaining session state toward a stateless architecture that would clean up after each request
  • Rather than buy big, centralized boxes, [our philosophy] was about buying a lot of thin, cheap boxes. If one fails, you roll over to another box.
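
    A sketch of the "range on primary key plus LIMIT" idea from the lessons above: instead of asking MySQL to sort and page with large offsets, walk the primary key in fixed-size ranges and do any remaining ordering in the application. The table and column names are invented for illustration.

        def fetch_messages(cursor, user_id, last_seen_id, page_size=50):
            # Walk the primary key: only rows with id > last_seen_id, capped by LIMIT,
            # so the database never builds a huge sorted result set.
            cursor.execute(
                "SELECT id, sender, body FROM messages "
                "WHERE user_id = %s AND id > %s "
                "ORDER BY id LIMIT %s",
                (user_id, last_seen_id, page_size),
            )
            rows = cursor.fetchall()
            # Any presentation-order sorting (by sender, subject, ...) happens here in
            # the application instead of as a filesort inside MySQL.
            return sorted(rows, key=lambda row: row[1])   # e.g. order the page by sender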


  • Monday
    Jul 09 2007

    LiveJournal Architecture

    A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious which causes a lot of writes and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder. Site: http://www.livejournal.com/

    Information Sources

  • LiveJournal - Behind The Scenes Scaling Storytime
  • Google Video
  • Tokyo Video
  • 2005 version

    Platform

  • Linux
  • MySQL
  • Perl
  • Memcached
  • MogileFS
  • Apache

    What's Inside?

  • Scaling from 1, 2, and 4 hosts to cluster of servers.
  • Avoid single points of failure.
  • Using MySQL replication only takes you so far.
  • Becoming IO bound kills scaling.
  • Spread out writes and reads for more parallelism.
  • You can't keep adding read slaves and scale.
  • Shard storage approach, using DRBD, for maximal throughput. Allocate shards based on roles.
  • Caching to improve performance with memcached. Two-level hashing to distribute RAM across servers (see the sketch after this list).
  • Perlbal for web load balancing.
  • MogileFS, a distributed file system, for parallelism.
  • TheSchwartz and Gearman for distributed job queuing to do more work in parallel.
  • Solving persistent connection problems.
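
    A sketch of the two-level hashing mentioned above: keys hash to a fixed number of virtual buckets, and a small bucket-to-server map decides which memcached host owns each bucket, so adding or removing a host only remaps a few buckets instead of rehashing every key. The server list and bucket count are illustrative, not LiveJournal's actual setup.

        import hashlib

        NUM_BUCKETS = 1024
        servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]  # illustrative
        bucket_map = {b: servers[b % len(servers)] for b in range(NUM_BUCKETS)}

        def server_for(key):
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            bucket = h % NUM_BUCKETS            # level 1: key -> virtual bucket
            return bucket_map[bucket]           # level 2: bucket -> memcached server

        # When a server is added, reassign only some buckets in bucket_map;
        # keys hashing to untouched buckets keep their cached data.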

    Lessons Learned

  • Don't be afraid to write your own software to solve your own problems. LiveJournal has provided incredible value to the community through their efforts.
  • Sites can evolve from small 1, 2 machine setups to larger systems as they learn about their users and what their system really needs to do.
  • Parallelization is key to scaling. Remove choke points by caching, load balancing, sharding, clustering file systems, and making use of more disk spindles.
  • Replication has a cost. You can't just keep adding more and more read slaves and expect to scale.
  • Low-level issues like which OS event notification mechanism to use, file system and disk interactions, threading and event models, and connection types matter at scale.
  • Large sites eventually turn to a distributed queuing and scheduling mechanism to distribute large work loads across a grid.

