
Monday, November 12, 2007

Slashdot Architecture - How the Old Man of the Internet Learned to Scale

Slashdot effect: overwhelming unprepared sites with an avalanche of readers' clicks after being mentioned on Slashdot. Sure, we now have the "Digg effect" and other hot new stars, but Slashdot was the original. And like many stars from generations past, Slashdot plays the elder statesman's role with class, dignity, and restraint. Yet with millions and millions of users Slashdot is still box office gold and more than keeps up with the young'ins. And with age comes the wisdom of learning how to handle all those users. Just how does Slashdot scale and what can you learn by going old school? Site: http://slashdot.org

Information Sources

  • Slashdot's Setup, Part 1 - Hardware
  • Slashdot's Setup, Part 2 - Software
  • History of Slashdot Part 3 - Going Corporate
  • The History of Slashdot Part 4 - Yesterday, Today, Tomorrow

    The Platform

  • MySQL
  • Linux (CentOS/RHEL)
  • Pound
  • Apache
  • Perl
  • Memcached
  • LVS

    The Stats

  • Started building the system in 1999.
  • 5.5 million user visits per month.
  • 7,000 comments are added every day.
  • Over 9 million page views daily.
  • Over 21 million comments.
  • Average monthly bandwidth usage is around 40-50 Mbit/sec.
  • For the same story Kottke.org found Slashdot delivered 4 times more users than Digg. So Slashdot ain't dead yet.
  • From The History of Slashdot Part 4: "On [September 11th] the mainstream news websites buckled under the loads, and although we had to turn off logging, we managed to stay up, sharing news in a time where it was often difficult to get. That was the day where the team of engineers that make this site happen pulled together and did the impossible, forcing our limited little hardware cluster to handle traffic that was probably triple or quadruple a normal day."

    The Hardware Architecture

  • Data center design is similar to all the other SourceForge, Inc. sites and has proven to scale well.
  • Two Active-Active gigabit uplinks.
  • A pair of Cisco 7301s serve as gateway/border routers. They perform some basic filtering; the filtering is tiered to spread the load.
  • Foundry BigIron 8000s act as core switches/routers.
  • Foundry FastIron 9604s are used as switches for some racks.
  • A pair of Rackable Systems servers (1U; P4 Xeon 2.66GHz, 2GB RAM, 2x80GB IDE, running CentOS and LVS) serve as load-balancing firewalls, distributing traffic to the web servers. BIG-IP F5s are being deployed in their new datacenter.
  • All servers are at least RAID 1.
  • 16 web servers:
    - Running Red Hat 9.
    - Rackable 1U servers with 2 Xeon 2.66GHz processors, 2GB of RAM, and 2x80GB IDE hard drives.
    - Two serve static content: JavaScript, images, and the front page for non-logged-in users.
    - Four serve the front page to logged-in users.
    - Ten handle comment pages.
    - Host roles are changed in response to load.
    - All NFS mounts are in read-only mode.
  • NFS server is a Rackable 2U with 2 Xeon 2.4GHz processors, 2GB of RAM, and 4x36GB 15K RPM SCSI drives.
  • 7 database servers:
    - All run CentOS 4.
    - 2 in a master-master configuration:
      -- Dual Opteron 270s with 16GB RAM, 4x36GB 15K RPM SCSI drives.
      -- One master is the write-only database.
      -- One master is the read-only database.
      -- They can fail over at any time and switch roles.
    - 2 reader databases:
      -- Dual Opteron 270s with 8GB RAM, 4x36GB 15K RPM SCSI drives.
      -- Each syncs from one of the master databases.
      -- More can be added to scale, but they are plenty fast enough for now.
    - 3 miscellaneous databases:
      -- Quad P3 Xeon 700MHz with 4GB RAM, 8x36GB 10K RPM SCSI drives.
      -- Accesslog writer and accesslog reader. Separate databases are used because moderation and stats require a lot of CPU time for computation.
      -- Search database.

    The Software Architecture

  • Logged-in and non-logged-in users are treated differently.
    - Non-logged-in users all see the same page. This page is a static page that is updated every couple of minutes.
    - Logged-in users have custom options which can't be cached, so generating pages for these users takes more resources.
  • 6 Pound servers (1 for SSL) are used as reverse proxies:
    - If a request can't be handled it is forwarded on to a web server.
    - Pound servers are run on the same machines as the web servers.
    - They are distributed for load balancing and redundancy.
    - SSL is handled by the Pound server so the web servers don't need to support SSL.
  • 16 Apache web servers (version 1.3):
    - Software is mounted from /usr/local on the read-only NFS server.
    - The server images are kept simple. All that is compiled in is:
      -- mod_perl
      -- lingerd to free up RAM during delivery.
      -- mod_auth_useragent to block bots.
    - 1 for SSL.
    - 2 for static (.shtml) requests.
    - 4 for the dynamic homepage.
    - 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl).
    - 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose).
  • Reasons for segregating Apache servers into different roles:
    - Isolate the servers in case there are performance problems or a DDoS attack on a specific page. The rest of the system will function even when one part is failing.
    - Efficiency gains from httpd-level caching and MaxClients tuning. The web server can be tuned differently for each role. MaxClients is set to 5-15 for dynamic web servers and 25 for static servers. The bottleneck is CPU, not RAM, so if requests aren't processed quickly then something's wrong, and queuing more requests won't help the CPU process them any faster.
  • Using read-only mounts has contributed to the robustness of the system. Tasks that write to /usr/local, for example updating index.html every second, run on the NFS server (a minimal sketch of this pattern follows this list).
  • Use their own SQL API built on top of DBD::mysql and DBI.pm.
  • A huge performance boost came from caching users, stories, and comment text using memcached (see the read-through cache sketch after this list).
  • Most data access is through get and set methods written custom for each data type and through methods that perform one specific update or select.
  • The multiple-master replication architecture keeps the site fully live even during blocking queries like ALTER TABLE.
  • Multi-pass log processing is used to detect abuse and to pick which users get mod points.
  • The moderation system was created in response to spam. It was just a few friends at first and then a lot of friends. This didn't scale. So the 'mod points' system was introduced so that any user who contributed to the system could moderate the system.
  • Overly active users are banned to protect the system from excessive usage by bots.
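
A couple of the items above combine into one small but important pattern: the non-logged-in front page is a plain static file, and the only host allowed to write it is the NFS server, since every web server mounts /usr/local read-only. Below is a minimal Perl sketch of such a rebuild job; the paths, the render_front_page() helper, and the write-then-rename step are illustrative assumptions, not Slashdot's actual code.

    #!/usr/bin/perl
    # Hypothetical rebuild job run only on the NFS server (the one host with a
    # read-write mount of /usr/local). The web servers mount the same directory
    # read-only and simply serve the resulting static file.
    use strict;
    use warnings;

    my $docroot = '/usr/local/slash/htdocs';   # assumed path
    my $target  = "$docroot/index.shtml";

    # Build the anonymous-user homepage. render_front_page() is a stand-in for
    # whatever template/database code actually assembles the page.
    my $html = render_front_page();

    # Write to a temporary file and rename it into place so the web servers
    # never serve a half-written page.
    my $tmp = "$target.tmp.$$";
    open my $fh, '>', $tmp or die "open $tmp: $!";
    print {$fh} $html;
    close $fh or die "close $tmp: $!";
    rename $tmp, $target or die "rename $tmp -> $target: $!";

    sub render_front_page {
        # Placeholder: in reality this would pull the current stories from the
        # database and run them through the site templates.
        return "<html><body>front page goes here</body></html>\n";
    }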
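
The custom DBI-based SQL API and the memcached layer mentioned above naturally meet in a read-through cache: check memcached, fall back to MySQL on a miss, then repopulate the cache. The sketch below is only a guess at the shape of such a method; the memcached pool, the DSN, the stories table, and the 5-minute expiry are assumptions, not Slashdot's real code or schema.

    use strict;
    use warnings;
    use DBI;
    use Cache::Memcached;

    # Assumed memcached pool and read-only database handle.
    my $memd = Cache::Memcached->new({
        servers => [ '10.0.0.1:11211', '10.0.0.2:11211' ],
    });
    my $dbh = DBI->connect('DBI:mysql:database=slash;host=db-reader',
                           'slash', 'secret', { RaiseError => 1 });

    # Read-through cache for a story row, keyed by story id.
    sub get_story {
        my ($sid) = @_;
        my $key = "story:$sid";

        my $story = $memd->get($key);
        return $story if $story;          # cache hit

        # Cache miss: ask the read-only database...
        $story = $dbh->selectrow_hashref(
            'SELECT * FROM stories WHERE sid = ?', undef, $sid);

        # ...and prime memcached for the next request (5-minute expiry).
        $memd->set($key, $story, 300) if $story;
        return $story;
    }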

    Lessons Learned

  • The most creatively satisfying period was when money was tight, the group was small, and everyone was helping everyone else with anything that needed to be done.
  • Don't waste your time optimizing code because you are too cheap to buy more machines. Buy the hardware and spend your time working on features.
  • Sell out to a large corporation and you lose control. There's continual pressure to go to the dark side of creating new products, blending in advertiser supplied content, and serving giant ads.
  • Say no to the forces that want you to become just like everyone else. Though many competitors have come and gone, Slashdot is still around because they continue to maintain editorial independence, moderate advertising quantity with a clear distinction between advertising and content, and, in their own words, "continue to select the right stories to appeal to our existing audience... not to spend our time courting other audiences that would only dilute the discussions that bring so many of you here day after day."
  • Segregate servers into different policy domains so you can optimize their configuration.
  • Optimizing usually means caching, caching, caching.
  • Tables are mostly, but not fully, normalized. This improves performance in most cases.
  • Over the last seven years the process of developing database backed websites has changed: The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.


    Wednesday, August 22, 2007

    Wikimedia architecture

    Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/

    Information Sources

  • Wikimedia architecture
  • http://meta.wikimedia.org/wiki/Wikimedia_servers
  • scale-out vs scale-up, from the "from Oracle to MySQL" blog.

    Platform

  • Apache
  • Linux
  • MySQL
  • PHP
  • Squid
  • LVS
  • Lucene for Search
  • Memcached for Distributed Object Cache
  • Lighttpd Image Server

    The Stats

  • 8 million articles spread over hundreds of language projects (English, Dutch, ...)
  • 10th busiest site in the world (source: Alexa)
  • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
  • 30 000 HTTP requests/s during peak-time
  • 3 Gbit/s of data traffic
  • 3 data centers: Tampa, Amsterdam, Seoul
  • 350 servers, ranging from 1x P4 to 2x quad-core Xeon, with 0.5 - 16 GB of memory
  • managed by ~ 6 people
  • 3 clusters on 3 different continents

    The Architecture

  • Geographic load balancing, based on the source IP of the client's DNS resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters (see the sketch after this list).
  • HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
  • 55 Squid servers currently, plus 20 waiting for setup.
  • 1,000 HTTP requests/s per server, up to 2,500 under stress
  • ~ 100 - 250 Mbit/s per server
  • ~ 14 000 - 32 000 open connections per server
  • Up to 40 GB of disk caches per Squid server
  • Up to 4 disks per server (1U rack servers)
  • 8 GB of memory, half of that used by Squid
  • Hit rates: 85% for text and 98% for media, since the introduction of CARP.
  • PowerDNS provides geographical distribution.
  • In their primary and regional data centers they build text and media clusters on LVS, CARP Squids, and cache Squids. The primary data center also holds the media storage.
  • To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches (see the purge sketch after this list).
  • One centrally managed & synchronized software installation for hundreds of wikis.
  • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
  • Hardware shared with External Storage and Memcached tasks
  • Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
  • Actual revision text is stored as blobs in External storage
  • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
  • Separate database per wiki (not separate server!)
  • One master, many replicated slaves
  • Read operations are load balanced over the slaves, write operations go to the master
  • The master is used for some read operations in case the slaves are not yet up to date (lagged); a rough sketch of this read/write split appears after this list.
  • External Storage:
    - Article text is stored on separate data storage clusters: simple append-only blob storage. This saves space on the expensive and busy core databases for largely unused data.
    - Allows use of spare resources on application servers (2x 250-500 GB per server).
    - Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability.
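
    Wikimedia does this geographic routing in DNS (PowerDNS) rather than in application code, but the underlying idea from the first bullet above, a static mapping from resolver IP to country to cluster, fits in a few lines. The Perl sketch below uses Geo::IP and made-up cluster names purely as an illustration; it is not the production setup.

        use strict;
        use warnings;
        use Geo::IP;

        # Toy country-code -> cluster table; the real mapping is much larger
        # and maintained statically.
        my %cluster_for_country = (
            NL => 'amsterdam',
            KR => 'seoul',
            US => 'tampa',
        );
        my $default_cluster = 'tampa';    # primary data center as the fallback

        my $gi = Geo::IP->new(GEOIP_MEMORY_CACHE);

        sub cluster_for_ip {
            my ($resolver_ip) = @_;
            my $cc = $gi->country_code_by_addr($resolver_ip) || '';
            return $cluster_for_country{$cc} || $default_cluster;
        }

        # A DNS front end would return the chosen cluster's addresses.
        print cluster_for_ip('192.0.2.1'), "\n";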
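
    The invalidation mentioned above is commonly done by sending an HTTP PURGE request for the edited page's URL to every Squid server. A rough Perl sketch with LWP follows; the Squid addresses and URL are placeholders, and MediaWiki's real purge path is more elaborate.

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTTP::Request;

        # Placeholder list of Squid caches; the real deployment has dozens.
        my @squids = ('10.0.1.1:80', '10.0.1.2:80', '10.0.1.3:80');

        my $ua = LWP::UserAgent->new(timeout => 2);

        # After a page edit, purge the article's URL from every cache so the
        # next request is fetched from the backend with the latest revision.
        sub purge_url {
            my ($path) = @_;
            for my $squid (@squids) {
                my $req = HTTP::Request->new(PURGE => "http://$squid$path");
                my $res = $ua->request($req);
                warn 'purge failed on ' . $squid . ': ' . $res->status_line . "\n"
                    unless $res->is_success;
            }
        }

        purge_url('/wiki/Main_Page');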
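
    The master/slave bullets above (reads load-balanced over the slaves, writes to the master, the master picking up reads when slaves lag) reduce to a small routing decision per query. This Perl sketch shows the idea only; hostnames, credentials, and the 30-second lag threshold are assumptions, and MediaWiki's actual load balancer is far more sophisticated.

        use strict;
        use warnings;
        use DBI;
        use List::Util qw(shuffle);

        my $master_dsn = 'DBI:mysql:database=wiki;host=db-master';        # assumed
        my @slave_dsns = map { "DBI:mysql:database=wiki;host=$_" }
                         qw(db-slave1 db-slave2 db-slave3);               # assumed
        my $max_lag    = 30;    # seconds of replication lag tolerated for reads

        my $master = DBI->connect($master_dsn, 'wiki', 'secret', { RaiseError => 1 });

        # Writes always go to the master.
        sub write_dbh { return $master }

        # Reads go to a random slave that is not too far behind; fall back to
        # the master if every slave is lagged or unreachable.
        sub read_dbh {
            for my $dsn (shuffle @slave_dsns) {
                my $dbh = DBI->connect($dsn, 'wiki', 'secret',
                                       { RaiseError => 0, PrintError => 0 })
                    or next;
                my $status = $dbh->selectrow_hashref('SHOW SLAVE STATUS');
                my $lag    = $status ? $status->{Seconds_Behind_Master} : undef;
                return $dbh if defined $lag && $lag <= $max_lag;
            }
            return $master;
        }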

    Lessons Learned

  • Focus on architecture, not so much on operations or nontechnical stuff.
  • Sometimes caching costs more than recalculating or looking the data up at the source... profile!
  • Avoid expensive algorithms, database queries, etc.
  • Cache every result that is expensive and has temporal locality of reference.
  • Focus on the hot spots in the code (profiling!).
  • Scale by separating:
    - Read and write operations (master/slave)
    - Expensive operations from cheap and more frequent operations (query groups)
    - Big, popular wikis from smaller wikis
  • Improve caching: exploit temporal and spatial locality of reference and reduce the data set size per server.
  • Revision text is compressed, and only the differences between revisions are stored (a compression sketch appears after this list).
  • Simple-seeming library calls, like using stat to check for a file's existence, can take too long under load.
  • Disk I/O is seek-limited: the more disk spindles, the better!
  • Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
  • It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
  • Anyone who wants to design their database architecture so that it can inexpensively grow from one unranked box to one of the top ten or hundred sites on the net should start by designing it to handle slightly out-of-date data from replication slaves, know how to load balance all read queries to the slaves, and, if at all possible, design it so that chunks of data (batches of users, accounts, whatever) can go on different servers (a toy sharding sketch appears after this list). You can do this from day one using virtualisation, proving the architecture while you're small. It's a LOT easier than doing it while load is doubling every few months!
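
    For the compression half of the revision-storage bullet above, a minimal version is just deflating the text before it goes into the external-storage blob table and inflating it on the way out. The table and column names below are invented for illustration; they are not MediaWiki's external-storage schema, and the diff-between-revisions half is omitted.

        use strict;
        use warnings;
        use Compress::Zlib qw(compress uncompress);

        # $dbh is assumed to be a connected DBI handle to an external-storage host.
        sub store_revision_text {
            my ($dbh, $rev_id, $text) = @_;
            my $blob = compress($text);
            $dbh->do('INSERT INTO blobs (rev_id, blob_data) VALUES (?, ?)',
                     undef, $rev_id, $blob);
            printf "revision %d: %d bytes raw, %d bytes stored\n",
                   $rev_id, length($text), length($blob);
        }

        sub load_revision_text {
            my ($dbh, $rev_id) = @_;
            my ($blob) = $dbh->selectrow_array(
                'SELECT blob_data FROM blobs WHERE rev_id = ?', undef, $rev_id);
            return defined $blob ? uncompress($blob) : undef;
        }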
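
    The closing advice above about letting chunks of data live on different servers comes down to routing every user-scoped query through a function that answers "which server owns this user?". Here is a toy Perl illustration with made-up shard names; a real system would keep the assignment in a small directory table so batches of users can be moved later.

        use strict;
        use warnings;

        my @shards = ('db-shard0', 'db-shard1', 'db-shard2');   # assumed host names
        my %assigned;   # user_id => shard; persisted in a directory table in reality

        # Look up (or lazily create) the shard assignment for a user. Using a
        # lookup table instead of bare modulo arithmetic makes it possible to
        # move batches of users between servers without rehashing everyone.
        sub shard_for_user {
            my ($user_id) = @_;
            $assigned{$user_id} //= $shards[ $user_id % @shards ];   # default placement
            return $assigned{$user_id};
        }

        # Every user-scoped query asks for its shard first, even when there is
        # only one box; the one-server and many-server code paths stay identical.
        printf "user %d -> %s\n", $_, shard_for_user($_) for 101, 102, 103;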
