High Scalability

Entries in Apache (13)

Wikimedia architecture

Wednesday, August 22, 2007 at 9:56AM

Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/

Information Sources

Wikimedia architecture

http://meta.wikimedia.org/wiki/Wikimedia_servers

Scale-out vs. scale-up, in the "From Oracle to MySQL" blog.

Platform

Apache

Linux

MySQL

PHP

Squid

LVS

Lucene for Search

Memcached for Distributed Object Cache

Lighttpd Image Server

The Stats

8 million articles spread over hundreds of language projects (English, Dutch, ...)

10th busiest site in the world (source: Alexa)

Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers

30 000 HTTP requests/s during peak-time

3 Gbit/s of data traffic

3 data centers: Tampa, Amsterdam, Seoul

350 servers, ranging from 1x P4 to 2x Xeon Quad-Core, with 0.5 - 16 GB of memory

managed by ~ 6 people

3 clusters on 3 different continents

The Architecture

Geographic load balancing, based on the source IP of the client's resolver, directs clients to the nearest server cluster. It works by statically mapping IP addresses to countries and countries to clusters.
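A minimal sketch of that static two-step mapping. The lookup tables are hypothetical: the networks are documentation-only address ranges and the country assignments are illustrative, not Wikimedia's actual configuration.

```python
import ipaddress

# Hypothetical tables. A real deployment would load these from a GeoIP database.
NETWORK_TO_COUNTRY = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "NL"),
    (ipaddress.ip_network("203.0.113.0/24"), "KR"),
]
COUNTRY_TO_CLUSTER = {"US": "tampa", "NL": "amsterdam", "KR": "seoul"}
DEFAULT_CLUSTER = "tampa"  # assumed fallback for unknown resolvers


def cluster_for_resolver(resolver_ip):
    """Map a DNS resolver's IP to a country, then the country to a cluster."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, country in NETWORK_TO_COUNTRY:
        if addr in network:
            return COUNTRY_TO_CLUSTER.get(country, DEFAULT_CLUSTER)
    return DEFAULT_CLUSTER
```

Because the decision is made on the resolver's address rather than the end user's, a user behind a distant resolver can be sent to a cluster that is not actually the nearest one.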

HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.

55 Squid servers currently, plus 20 waiting for setup.

1,000 HTTP requests/s per server, up to 2,500 under stress

~ 100 - 250 Mbit/s per server

~ 14 000 - 32 000 open connections per server

Up to 40 GB of disk caches per Squid server

Up to 4 disks per server (1U rack servers)

8 GB of memory, half of that used by Squid

Hit rates: 85% for Text, 98% for Media, since the use of CARP.
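CARP (Cache Array Routing Protocol) routes each URL to one member of the cache array by scoring the URL against every member and picking the highest score, so an object is cached once in the array rather than once per server. The sketch below shows the idea only: it uses SHA-1 from the standard library instead of the hash function CARP defines, ignores per-member load factors, and the cache names are made up.

```python
import hashlib
from typing import Sequence


def _score(url: str, cache: str) -> int:
    """Combine the URL and the cache name into one pseudo-random score."""
    digest = hashlib.sha1(f"{cache}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_cache(url: str, caches: Sequence[str]) -> str:
    """Return the cache with the highest score for this URL."""
    return max(caches, key=lambda cache: _score(url, cache))
```

With this scheme, removing a cache only reassigns the URLs that cache owned; every other URL keeps its owner, which keeps the remaining caches warm.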

PowerDNS provides geographical distribution.

In the primary and regional data centers, text and media clusters are built on LVS, CARP Squid, and cache Squid. The primary data center also holds the media storage.

To make sure the latest revision of every page is served, invalidation requests are sent to all Squid caches.
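The invalidation flow can be sketched with an in-memory stand-in for the caches. The `EdgeCache` class and `save_page` function are hypothetical names for illustration; a real setup would send purge requests over the network rather than call a method.

```python
class EdgeCache:
    """In-memory stand-in for one reverse proxy cache."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def get(self, url, fetch_from_origin):
        # Serve from cache when possible, otherwise fetch and remember.
        if url not in self.store:
            self.store[url] = fetch_from_origin(url)
        return self.store[url]

    def purge(self, url):
        # Dropping an entry forces the next request to go to the origin.
        self.store.pop(url, None)


def save_page(origin, caches, url, new_text):
    """Write the new revision, then invalidate the URL on every cache."""
    origin[url] = new_text
    for cache in caches:
        cache.purge(url)
```

The purge goes to every cache, not just the one that served the editor, because any of them may hold a copy of the old revision.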

One centrally managed & synchronized software installation for hundreds of wikis.

MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)

Hardware shared with External Storage and Memcached tasks

Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text.

Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings, is stored in the core databases

Actual revision text is stored as blobs in External storage

Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches

Separate database per wiki (not separate server!)

One master, many replicated slaves

Read operations are load balanced over the slaves, write operations go to the master

The master is used for some read operations in case the slaves are not yet up to date (lagged)
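A minimal sketch of the routing rule in the last three points: writes go to the master, reads are spread over slaves, and the master takes reads when every slave is too far behind. The `Server` type, the `max_lag_seconds` threshold, and its default value are assumptions for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    lag_seconds: float = 0.0  # replication lag behind the master


def choose_server(is_write, master, replicas, max_lag_seconds=5.0, rng=random):
    """Pick the database server for one query."""
    if is_write:
        return master
    # Only consider slaves that are reasonably up to date.
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not fresh:
        return master  # every slave is lagged: read from the master
    return rng.choice(fresh)
```

Picking uniformly at random is the simplest balancing policy; weighting by server capacity would be a natural refinement.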

External Storage: article text is stored on separate data storage clusters, as simple append-only blob storage.
- Saves space on expensive and busy core databases for largely unused data
- Allows use of spare resources on application servers (2x 250-500 GB per server)
- Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability
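The split between core databases and external storage can be sketched as follows. This is an in-memory illustration with hypothetical names (`AppendOnlyBlobStore`, `core_revisions`), not MediaWiki's schema: the core table keeps only a small pointer, and the bulky text lives in a store that is only ever appended to.

```python
class AppendOnlyBlobStore:
    """Blobs are only appended; existing blobs are never rewritten."""

    def __init__(self):
        self._blobs = []

    def append(self, text):
        self._blobs.append(text.encode("utf-8"))
        return len(self._blobs) - 1  # the blob's permanent address

    def read(self, blob_id):
        return self._blobs[blob_id].decode("utf-8")


external_store = AppendOnlyBlobStore()
core_revisions = {}  # stand-in for a core DB table: revision id -> metadata


def save_revision(rev_id, page, text):
    """Store the text externally and keep only a pointer in the core DB."""
    core_revisions[rev_id] = {"page": page, "blob_id": external_store.append(text)}


def load_revision_text(rev_id):
    return external_store.read(core_revisions[rev_id]["blob_id"])
```

Since old revisions never change, append-only storage needs no in-place updates, which is what makes it cheap to replicate and to host on spare disks.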

Lessons Learned

Focus on architecture, not so much on operations or nontechnical stuff.

Sometimes caching costs more than recalculating the value or looking it up at the data source. Profile to find out!

Avoid expensive algorithms, database queries, etc.

Cache every result that is expensive and has temporal locality of reference.

Focus on the hot spots in the code (profiling!).
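One way to cache results that are expensive and likely to be requested again soon is a small time-to-live cache. The decorator below is a sketch under assumptions: the name `ttl_cache`, the injectable `clock`, and the choice of a fixed expiry are illustrative, and a production system would more likely use a shared cache such as Memcached.

```python
import time
from functools import wraps


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache a function's results for ttl_seconds, keyed by its arguments."""

    def decorator(func):
        entries = {}  # args -> (time stored, value)

        @wraps(func)
        def wrapper(*args):
            now = clock()
            hit = entries.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]  # still fresh: skip the expensive call
            value = func(*args)
            entries[args] = (now, value)
            return value

        return wrapper

    return decorator
```

Whether this pays off depends on the hit rate and on how expensive the wrapped call is, which is why the advice to profile comes first.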

Scale by separating:
- Read and write operations (master/slave)
- Expensive operations from cheap and more frequent operations (query groups)
- Big, popular wikis from smaller wikis

Separating this way also improves caching: it increases temporal and spatial locality of reference and reduces the data set size per server.

Text is compressed, and only the differences between revisions of an article are stored.
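A sketch of compressed delta storage, assuming the point is that a revision can be stored as its differences from the previous revision. It uses `difflib` and `zlib` from the Python standard library; the function names and the JSON encoding are illustrative choices, not MediaWiki's storage format.

```python
import difflib
import json
import zlib


def make_delta(old_lines, new_lines):
    """Describe new_lines as copies from old_lines plus inserted lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(["copy", i1, i2])
        elif j2 > j1:  # "replace" or "insert"; pure deletions add nothing
            delta.append(["insert", new_lines[j1:j2]])
    return delta


def apply_delta(old_lines, delta):
    """Rebuild the new revision from the old one and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def pack(delta):
    return zlib.compress(json.dumps(delta).encode("utf-8"))


def unpack(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The trade-off is read cost: reconstructing a revision means applying deltas on top of a stored base, so long delta chains are usually broken up with periodic full copies.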

Simple-seeming library calls, like using stat to check for a file's existence, can take too long under load.

Disk seek I/O is the limiting factor: the more disk spindles, the better!

Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16 GB dual- or quad-core boxes with six 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16 GB is right for the working set size, and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly, the web servers are currently 8-core boxes because that happens to work well for load balancing and gives good PHP throughput.

It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.

Anyone who wants a database architecture that can grow inexpensively from one box and no ranking to the top ten or hundred sites on the net should design it from the start to tolerate slightly out-of-date data from replication slaves, to load balance all read queries across the slaves, and, if at all possible, to let chunks of data (batches of users, accounts, whatever) live on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!

Todd Hoff

Apache, Example, Geo-distributed Clusters, LVS, Linux, Lucene, MySQL, PHP, Squid

TypePad Architecture

Monday, August 20, 2007 at 9:53AM

TypePad is considered the largest paid blogging service in the world. After experiencing problems because of their meteoric growth, they eventually transitioned to an architecture patterned after their sister company, LiveJournal. Site: http://www.typepad.com/

The Platform

MySQL

Memcached

Perl

MogileFS

Apache

Linux

The Stats

As of 2005, TypePad sent 250 Mbps of traffic over multiple network pipes, for about 3 TB of traffic a day. They were growing by 10-20% each month. I was unable to find more recent statistics.
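As a rough consistency check on those two figures, assuming the bandwidth number is a sustained average and using decimal units (1 TB = 10^12 bytes):

```python
def terabytes_per_day(megabits_per_second):
    """Convert a sustained rate in Mbit/s to a daily volume in TB."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * 86_400 / 1e12  # 86,400 seconds in a day


print(round(terabytes_per_day(250), 2))  # 2.7
```

A sustained 250 Mbps works out to about 2.7 TB a day, which matches the rounded 3 TB figure.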

The Architecture

Original Architecture:
- Single server running Linux, Apache, Postgres, Perl, mod_perl
- Storage was NFS on a filer.

A Devastating Crash Caused a New Direction:
- A RAID controller failed and spewed data across all RAID disks.
- The database was corrupted and the backups were corrupted.
- Their redundant filers suffered from "split brain" syndrome.

They moved to a LiveJournal-style architecture, which isn't surprising since TypePad and LiveJournal are both owned by Six Apart.
- Replicated MySQL clusters partitioned by ID.
- A global DB generated globally unique sequence numbers and mapped users to partitions.
- Other data was mapped by role.
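The two jobs of the global DB, handing out unique IDs and remembering which partition each user lives on, can be sketched in memory. The class name, the round-robin assignment policy, and the partition names are assumptions for illustration, not TypePad's implementation.

```python
import itertools


class GlobalDirectory:
    """Stand-in for a global DB: unique IDs plus a user -> partition map."""

    def __init__(self, partition_names):
        self._ids = itertools.count(1)
        self._partitions = list(partition_names)
        self._user_partition = {}

    def next_id(self):
        """Return an ID that is unique across all partitions."""
        return next(self._ids)

    def create_user(self):
        user_id = self.next_id()
        # Illustrative policy: spread new users round-robin over partitions.
        index = len(self._user_partition) % len(self._partitions)
        self._user_partition[user_id] = self._partitions[index]
        return user_id

    def partition_for(self, user_id):
        return self._user_partition[user_id]

    def move_user(self, user_id, partition):
        # After copying the user's rows, only this mapping has to change.
        self._user_partition[user_id] = partition
```

Because the mapping is looked up rather than computed, a user can be moved to a less loaded partition without touching anyone else's data. The cost is that the directory itself becomes a dependency for every request.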

Highly Available Database Configuration:
- A master-master MySQL replication model is used.
- The Linux clustering heartbeat was used to failover using virtual IP addresses.

MogileFS is used to serve images.

Perlbal is used as reverse proxy and to load balance requests.

A reliable, asynchronous job dispatch system called TheSchwartz is used to support moblogging, adding comments, future publishing, cache invalidation, and publishing.

Memcached is used to store counts, sets, stats, and heavyweight data.

Migration from the old architecture to the new architecture was tricky:
- All users were migrated over without service interruption.
- Postgres was removed.
- During the migration images were served from NFS and MogileFS.

Benefits of their new architecture:
- Can easily add new machines and adjust workload.
- More highly available and cheaply scalable.

Lessons Learned

Small details are important.

Every mistake is a learning experience.

Success requires coordination and cooperation.

Related Articles

LiveJournal Architecture

Linux High Availability

Todd Hoff

Apache, Linux, MySQL, Perl

mixi.jp Architecture

Tuesday, July 10, 2007 at 7:55AM

Mixi is a fast-growing social networking site in Japan. They provide services like diary, community, message, review, and photo album. Because they have a lot in common with LiveJournal, they developed many of the same approaches. Their write-up on how they scaled their system is easily one of the best out there. Site: http://mixi.jp

Information Sources

mixi.jp - scaling out with open source

Platform

Linux

Apache

MySQL

Perl

Memcached

Squid

Shard

What's Inside?

They grew to approximately 4 million users in two years and add over 15,000 new users/day.

Ranks 35th on Alexa and 3rd in Japan.

More than 100 MySQL servers

Add more than 10 servers/month

Use non-persistent connections.

Diary traffic is 85% read and 15% write.

Message traffic is 75% read and 25% write.

Ran into replication performance problems so they had to split the database.

Considered splitting horizontally by user or splitting vertically by table type.

They ended up partitioning by both table type and user, so all the messages for a group of users are assigned to a particular database. A partitioning key is used to decide in which database data should be stored.
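A minimal sketch of a two-level partitioning key, assuming users are grouped by ID range and each table type has its own list of databases. The group size, database names, and the modulo placement are all hypothetical.

```python
USERS_PER_GROUP = 1000  # hypothetical group size

# Hypothetical layout: each table type has its own set of databases.
DATABASES = {
    "message": ["message_db_0", "message_db_1"],
    "diary": ["diary_db_0", "diary_db_1", "diary_db_2"],
}


def database_for(table_type, user_id):
    """Pick a database from the table type and the user's group."""
    shards = DATABASES[table_type]
    group = user_id // USERS_PER_GROUP
    return shards[group % len(shards)]
```

Splitting by table type first lets a write-heavy table such as messages get more databases than a read-heavy one, independently of the others.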

For caching they use memcached with 39 machines x 2 GB memory.

Stores more than 8 TB of images with about 23 GB added per day.

MySQL is only used to store metadata about the images, not the images themselves.

Images are either frequently accessed or rarely accessed.

Frequently accessed images are cached using Squid on multiple machines.

Rarely accessed images are served from the file system. There's no profit in caching them.

Lessons Learned

When using dynamic partitioning it's difficult to pick keys and algorithms for where data should be stored.

Once you partition data you can no longer do joins and you have to open a lot of connections to different databases to merge the data back together.

It's hard to add new hosts and rearrange data when you partition. For example, let's say your partitioning algorithm stores all the messages for users 1-N on host 1. Now let's say host 1 becomes overburdened and you want to repartition users across more hosts. This is very difficult to do.
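To put a number on that difficulty, consider a different but common placement rule, assumed here purely for illustration: host = user_id % host_count. The sketch counts how many users land on a different host when one host is added.

```python
def fraction_moved(user_ids, old_hosts, new_hosts):
    """Fraction of users whose host changes under modulo placement."""
    user_ids = list(user_ids)
    moved = sum(1 for u in user_ids if u % old_hosts != u % new_hosts)
    return moved / len(user_ids)


print(fraction_moved(range(20_000), 4, 5))  # 0.8
```

Going from 4 to 5 hosts relocates 80% of the users, even though the new host only needs about 20% of the data. Schemes that look the location up in a directory, or that use consistent hashing, are the usual ways to avoid moving that much.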

By using distributed memory caching they rarely hit the DB, and their average page load time is about .02 seconds. This reduces the problems associated with partitioning.

You will often have to develop strategies based on the type of content. For example, images are treated differently than short text posts.

Social networking sites are very time oriented, so it might be useful to partition data by time as well as user and type.

Todd Hoff
