Tuesday, Nov 13, 2007

Flickr Architecture

Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

Site: http://www.flickr.com

Information Sources

  • Flickr and PHP (an early document)
  • Capacity Planning for LAMP
  • Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
  • Building Scalable Web Sites by Cal Henderson from Flickr.
  • Database War Stories #3: Flickr by Tim O'Reilly
  • Cal Henderson's Talks. A lot of useful PowerPoint presentations.


  • Monday, Nov 12, 2007

    Slashdot Architecture - How the Old Man of the Internet Learned to Scale

    Slashdot effect: overwhelming unprepared sites with an avalanche of readers' clicks after being mentioned on Slashdot. Sure, we now have the "Digg effect" and other hot new stars, but Slashdot was the original. And like many stars from generations past, Slashdot plays the elder statesman's role with class, dignity, and restraint. Yet with millions and millions of users Slashdot is still box office gold and more than keeps up with the young'ins. And with age comes the wisdom of learning how to handle all those users. Just how does Slashdot scale and what can you learn by going old school? Site: http://slashdot.org

    Information Sources

  • Slashdot's Setup, Part 1 - Hardware
  • Slashdot's Setup, Part 2 - Software
  • The History of Slashdot Part 3 - Going Corporate
  • The History of Slashdot Part 4 - Yesterday, Today, Tomorrow

    The Platform

  • MySQL
  • Linux (CentOS/RHEL)
  • Pound
  • Apache
  • Perl
  • Memcached
  • LVS

    The Stats

  • Started building the system in 1999.
  • 5.5 million user visits per month.
  • 7,000 comments are added every day.
  • Over 9 million page views daily.
  • Over 21 million comments.
  • Average monthly bandwidth usage is around 40-50 Mbit/sec.
  • For the same story Kottke.org found Slashdot delivered 4 times more users than Digg. So Slashdot ain't dead yet.
  • From The History of Slashdot Part 4: On [September 11th] the mainstream news websites buckled under the loads, and although we had to turn off logging, we managed to stay up, sharing news in a time where it was often difficult to get. That was the day where the team of engineers that make this site happen pulled together and did the impossible, forcing our limited little hardware cluster to handle traffic that was probably triple or quadruple a normal day.

    The Hardware Architecture

  • Data center design is similar to all the other SourceForge, Inc. sites and has proven to scale well.
  • Two Active-Active gigabit uplinks.
  • A pair of Cisco 7301s serve as gateway/border routers. Perform some basic filtering. Filtering is tiered to spread the load.
  • Foundry BigIron 8000s act as core switches/routers.
  • Foundry FastIron 9604s are used as switches for some racks.
  • A pair of Rackable Systems servers (1U; P4 Xeon 2.66GHz, 2GB RAM, 2x80GB IDE, running CentOS and LVS) serve as load balancing firewalls, distributing traffic to the web servers. BIG-IP F5s are being deployed in their new datacenter.
  • All servers are at least RAID 1.
  • 16 web servers: - Running Red Hat 9. - Rackable 1U servers with 2 Xeon 2.66GHz processors, 2GB of RAM, and 2x80GB IDE hard drives. - Two serve static content: JavaScript, images, and the front page for non-logged-in users. - Four serve the front page to logged-in users. - 10 handle comment pages. - Host roles are changed in response to load. - All NFS mounts are in read-only mode.
  • NFS server is a Rackable 2U with 2 Xeon 2.4GHz processors, 2GB of RAM, and 4x36GB 15K RPM SCSI drives.
  • 7 database servers: - All run CentOS 4. - 2 in a master-master configuration: -- Dual Opteron 270s with 16GB RAM, 4x36GB 15K RPM SCSI drives. -- One master is the write-only database. -- One master is the read-only database. -- They can failover at any time and switch roles. - 2 reader databases: -- Dual Opteron 270s with 8GB RAM, 4x36GB 15K RPM SCSI drives. -- Each syncs from one of the master databases. -- Can add more to scale, but plenty fast enough for now. - 3 miscellaneous databases: -- Quad P3 Xeon 700MHz with 4GB RAM, 8x36GB 10K RPM SCSI drives. -- Accesslog writer and accesslog reader. Separate databases are used because moderation and stats require a lot of CPU time for computation. -- Search database.

    The Software Architecture

  • Logged-in and non-logged-in users are treated differently. - Non-logged-in users all see the same page. This page is a static page that is updated every couple of minutes. - Logged-in users have custom options which can't be cached, so generating pages for these users takes more resources.
  • 6 Pound servers (1 for SSL) are used as reverse proxies: - If a request can't be handled it is forwarded on to a web server. - Pound servers are run on the same machines as the web servers. - They are distributed for load balancing and redundancy. - SSL is handled by the Pound server so the web server doesn't need to support SSL.
  • 16 Apache web servers (version 1.3): - Software is mounted from /usr/local on the read-only NFS server. - The images are kept simple. All that is compiled in is: -- mod_perl -- lingerd to free up RAM during delivery. -- mod_auth_useragent to block bots. - 1 for SSL. - 2 for static (.shtml) requests. - 4 for the dynamic homepage. - 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl). - 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose).
  • Reasons for segregating apache servers to different roles: - Isolate the servers in case there are performance problems or a DDoS attack on a specific page. The rest of the system will function even when one part is failing. - For efficiency reasons like httpd-level caching and MaxClients tuning. The web server can be tuned differently for each role. MaxClients is set to 5-15 for dynamic web servers and 25 for static servers. The bottleneck is CPU, not RAM, so if requests aren't processed quickly then something's wrong and queuing more requests won't help the CPU process them any faster.
  • Using read-only NFS mounts has contributed to the robustness of the system. Tasks that write to /usr/local, for example, to update index.html every second, run on the NFS server.
  • Use their own SQL API built on top of DBD::mysql and DBI.pm.
  • A huge performance boost was provided by caching users, stories, and comment text using memcached (a minimal sketch of this caching pattern follows this list).
  • Most data access is through get and set methods written custom for each data type and through methods that perform one specific update or select.
  • The Multiple-master replication architecture allows keeping the site fully live even during blocking queries like ALTER TABLE.
  • Multi-pass log processing is used to detect abuse and to pick which users get mod points.
  • The moderation system was created in response to spam. It was just a few friends at first and then a lot of friends. This didn't scale. So the 'mod points' system was introduced so that any user who contributed to the system could moderate the system.
  • Excessively active users are banned to protect the site from abusive usage by bots.
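
    Slashdot's code is Perl on top of DBI, but the caching pattern itself is language-agnostic. Below is a minimal Python sketch of the idea, assuming a local memcached daemon and the pymemcache client library; the key naming and the fetch_story_from_db() helper are hypothetical stand-ins, not Slashdot's actual code.

      # Minimal sketch of caching story text in memcached in front of the database.
      # Assumes memcached on localhost:11211 and the pymemcache library; key names
      # and fetch_story_from_db() are hypothetical stand-ins.
      import json
      from pymemcache.client.base import Client

      cache = Client(("localhost", 11211))
      STORY_TTL = 300  # seconds; cached copies may lag the database slightly

      def fetch_story_from_db(story_id):
          # Stand-in for the real SQL query layer.
          return {"id": story_id, "title": "Example story", "body": "..."}

      def get_story(story_id):
          key = "story:%d" % story_id
          cached = cache.get(key)
          if cached is not None:
              return json.loads(cached)           # cache hit: no database work
          story = fetch_story_from_db(story_id)    # cache miss: query, then populate
          cache.set(key, json.dumps(story), expire=STORY_TTL)
          return story

      if __name__ == "__main__":
          print(get_story(42))   # first call hits the database stand-in
          print(get_story(42))   # second call is served from memcached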

    Lessons Learned

  • The most creatively satisfying period was when money was tight, the group was small, and everyone was helping everyone else with anything that needed to be done.
  • Don't waste your time optimizing code because you are too cheap to buy more machines. Buy the hardware and spend your time working on features.
  • Sell out to a large corporation and you lose control. There's continual pressure to go to the dark side of creating new products, blending in advertiser supplied content, and serving giant ads.
  • Say no to the forces that want you to become just like everyone else. Though many competitors have come and gone, Slashdot is still around because they continue to maintain editorial independence, moderate advertising quantity with a clear distinction between advertising and content, and, in their own words, "continue to select the right stories to appeal to our existing audience... not to spend our time courting other audiences that would only dilute the discussions that bring so many of you here day after day."
  • Segregate servers into different policy domains so you can optimize their configuration.
  • Optimizing usually means caching, caching, caching.
  • Tables are mostly, but not fully, normalized. This improves performance in most cases.
  • Over the last seven years the process of developing database backed websites has changed: The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.


  • Tuesday, Oct 30, 2007

    Feedblendr Architecture - Using EC2 to Scale

    A man had a dream. His dream was to blend a bunch of RSS/Atom/RDF feeds into a single feed. The man is Beau Lebens of Feedville and, like most dreamers, he was a little short on coin. So he took refuge in the home of a cheap hosting provider and Beau realized his dream, creating FEEDblendr. But FEEDblendr chewed up so much CPU creating blended feeds that the cheap hosting provider ordered Beau to find another home. Where was Beau to go? He eventually found a new home in the virtual machine room of Amazon's EC2. This is the story of how Beau was finally able to create his blended feeds safe within the cradle of affordable CPU cycles. Site: http://feedblendr.com/

    The Platform

  • EC2 (Fedora Core 6 Lite distro)
  • S3
  • Apache
  • PHP
  • MySQL
  • DynDNS (for round robin DNS)

    The Stats

  • Beau is a developer with some sysadmin skills, not a web server admin, so a lot of learning was involved in creating FEEDblendr.
  • FEEDblendr uses 2 EC2 instances. The same Amazon Machine Image (AMI) is used for both instances.
  • Over 10,000 blends have been created, containing over 45,000 source feeds.
  • Approx 30 blends created per day. Processors on the 2 instances are actually pegged pretty high (load averages at ~ 10 - 20 most of the time).

    The Architecture

  • Round robin DNS is used to load balance between instances. - The DNS is updated by hand: an instance is validated to be working correctly before the DNS is updated. - Instances seem to be more stable now than they were in the past, but you must still assume they can be lost at any time and that no data will be persisted between reboots.
  • The database is still hosted on an external service because EC2 does not have a decent persistent storage system.
  • The AMI is kept as minimal as possible. It is a clean instance with some auto-deployment code to load the application off of S3. This means you don't have to create new instances for every software release.
  • The deployment process is (a sketch of such a bootstrap script follows this list): - Software is developed on a laptop and stored in Subversion. - A makefile is used to get a revision, fix permissions etc., package, and push to S3. - When the AMI launches it runs a script to grab the software package from S3. - The package is unpacked and a specific script inside is executed to continue the installation process. - Configuration files for Apache, PHP, etc. are updated. - Server-specific permissions, symlinks etc. are fixed up. - Apache is restarted and email is sent with the IP of that machine. Then the DNS is updated by hand with the new IP address.
  • Feeds are intelligently cached independently on each instance. This is to reduce the costly polling for feeds as much as possible. S3 was tried as a common feed cache for both instances, but it was too slow. Perhaps feeds could be written to each instance so they would be cached on each machine?
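
    Beau's actual pipeline uses a makefile and shell scripts; what follows is only a Python sketch, using boto3, of the boot-time step that pulls the release bundle off S3 and hands off to an installer inside it. The bucket name, key, paths, and install.sh are hypothetical.

      # Sketch of an instance bootstrap: download the latest application bundle
      # from S3, unpack it, and run the installer shipped inside the bundle.
      # Assumes boto3 and AWS credentials on the instance; all names are hypothetical.
      import subprocess
      import tarfile

      import boto3

      BUCKET = "feedblendr-releases"        # hypothetical bucket
      KEY = "app-latest.tar.gz"             # hypothetical release object
      LOCAL_ARCHIVE = "/tmp/app-latest.tar.gz"
      INSTALL_DIR = "/opt/app"

      def bootstrap():
          s3 = boto3.client("s3")
          s3.download_file(BUCKET, KEY, LOCAL_ARCHIVE)       # fetch the bundle
          with tarfile.open(LOCAL_ARCHIVE) as tar:
              tar.extractall(INSTALL_DIR)                    # unpack it
          # The bundle's own installer fixes permissions, rewrites Apache/PHP
          # config for this host, and restarts services.
          subprocess.check_call([INSTALL_DIR + "/install.sh"])

      if __name__ == "__main__":
          bootstrap()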

    Lesson Learned

  • A low budget startup can effectively bootstrap using EC2 and S3.
  • For the budget conscious the free ZoneEdit service might work just as well as the $50/year DynDNS service (which works fine).
  • Round robin load balancing is slow and unreliable. Even with a short TTL for the DNS, some systems hold on to the IP addresses for a long time, so new machines are not load balanced to.
  • Many problems exist with RSS implementations that keep feeds from being effectively blended. A lot of CPU is spent reading and blending feeds unnecessarily because there's no reliable cross-implementation way to tell when a feed has really changed or not (see the conditional-GET sketch after this list for one partial mitigation).
  • It's really a big mindset change to consider that your instances can go away at any time. You have to change your architecture and design to live with this fact. But once you internalize this model, most problems can be solved.
  • EC2's poor load balancing and persistence capabilities make development and deployment a lot harder than it should be.
  • Use the AMI's ability to be passed a parameter to select which configuration to load from S3. This allows you to test different configurations without moving/deleting the current active one.
  • Create an automated test system to validate an instance as it boots. Then automatically update the DNS if the tests pass. This makes it easy to create new instances and takes the slow human out of the loop.
  • Always load software from S3. The last thing you want happening is your instance loading, and for some reason not being able to contact your SVN server, and thus failing to load properly. Putting it in S3 virtually eliminates the chances of this occurring, because it's on the same network.
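
    One partial mitigation for the wasted polling is to honor HTTP conditional GETs where a feed publisher supports them, so unchanged feeds return a cheap 304. This is not FEEDblendr's PHP code, just a Python sketch using the requests library; the in-memory cache dict stands in for whatever per-instance cache the blender keeps.

      # Sketch: poll a feed with a conditional GET (ETag / Last-Modified) so an
      # unchanged feed costs almost nothing. The cache dict and feed URL are
      # illustrative only.
      import requests

      _cache = {}  # url -> {"etag": ..., "modified": ..., "body": ...}

      def poll_feed(url):
          headers = {}
          entry = _cache.get(url)
          if entry and entry.get("etag"):
              headers["If-None-Match"] = entry["etag"]
          if entry and entry.get("modified"):
              headers["If-Modified-Since"] = entry["modified"]
          resp = requests.get(url, headers=headers, timeout=10)
          if resp.status_code == 304 and entry:
              return entry["body"]                 # unchanged: reuse the cached copy
          _cache[url] = {
              "etag": resp.headers.get("ETag"),
              "modified": resp.headers.get("Last-Modified"),
              "body": resp.text,
          }
          return resp.text

      if __name__ == "__main__":
          print(len(poll_feed("https://example.com/feed.xml")))  # hypothetical feed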

    Related Articles

  • What is a 'River of News' style aggregator?
  • Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services


  • Sunday, Oct 28, 2007

    Scaling Early Stage Startups

    Mark Maunder of No VC Required--who advocates not taking VC money lest you be turned into a frog instead of the prince (or princess) you were dreaming of--has an excellent slide deck on how to scale an early stage startup. His blog also has some good SEO tips and a very spooky widget showing the geographical location of his readers. Perfect for Halloween! What are Mark's otherworldly scaling strategies for startups? Site: http://novcrequired.com/

    Information Sources

  • Slides from Seattle Tech Startup Talk.
  • Scaling Early Stage Startups blog post by Mark Maunder.

    The Platform

  • Linux
  • An ISAM type data store.
  • Perl
  • Httperf is used for benchmarking.
  • Websitepulse.com is used for perf monitoring.

    The Architecture

  • Performance matters because being slow could cost you 20% of your revenue. The UIE guys disagree saying this ain't necessarily so. They explain their reasoning in Usability Tools Podcast: The Truth About Page Download Time. The idea is: "There was still another surprising finding from our study: a strong correlation between perceived download time and whether users successfully completed their tasks on a site. There was, however, no correlation between actual download time and task success, causing us to discard our original hypothesis. It seems that, when people accomplish what they set out to do on a site, they perceive that site to be fast." So it might be a better use of time to improve the front-end rather than the back-end.
  • MySQL was dumped because of performance problems: MySQL didn't handle a high number of writes and deletes on large tables, writes blow away the query cache, large numbers of small tables (over 10,000) are not well supported, it uses a lot of memory to cache indexes, and it maxed out at 200 concurrent read/write queries per second with over 1 million records.
  • For data storage they evolved to a fixed-length, ISAM-like record scheme that allows seeking directly to the data (a toy sketch of the idea follows this list). It still uses file-level locking and is benchmarked at 20,000+ concurrent reads/writes/deletes. They are considering moving to BerkeleyDB, which performs very well and is used by many large websites, especially when you primarily need key-value type lookups. I think it might be interesting to store JSON if a lot of this data ends up being displayed on the web page.
  • Moved to httpd.prefork for Perl. That, combined with no keepalives on the application servers, uses less RAM and works well.
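
    Mark's store is custom and not public; the following is only a toy Python sketch of the core trick behind fixed-length records: because every record is the same size, record N lives at byte offset N times the record size and can be read or overwritten with a single seek. The field layout is hypothetical.

      # Toy fixed-length record file: seek directly to a record by id.
      # The 4-byte id + 60-byte payload layout is made up for illustration.
      import os
      import struct

      RECORD_FMT = "<I60s"                       # little-endian uint32 + 60 bytes
      RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 64 bytes per record

      def write_record(path, rec_id, payload):
          data = struct.pack(RECORD_FMT, rec_id, payload.encode("utf-8")[:60])
          mode = "r+b" if os.path.exists(path) else "w+b"
          with open(path, mode) as f:
              f.seek(rec_id * RECORD_SIZE)       # jump straight to the slot
              f.write(data)

      def read_record(path, rec_id):
          with open(path, "rb") as f:
              f.seek(rec_id * RECORD_SIZE)
              raw = f.read(RECORD_SIZE)
          rid, payload = struct.unpack(RECORD_FMT, raw)
          return rid, payload.rstrip(b"\0").decode("utf-8")

      if __name__ == "__main__":
          write_record("records.dat", 0, "first")
          write_record("records.dat", 7, "eighth record")
          print(read_record("records.dat", 7))   # -> (7, 'eighth record')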

    Lessons Learned

  • Configure your DB and web server correctly. MySQL and Apache's memory usage can easily spiral out of control, which leads to grindingly slow performance as swapping increases. Here are a few resources for helping with configuration issues.
  • Serve only the users you care about. Block content thieves that crawl your site using a lot of valuable resources for nothing. Monitor the number of content pages they fetch per minute. If a threshold is exceeded, do a reverse lookup on their IP address and configure your firewall to block them.
  • Cache as much DB data and static content as possible. Perl's Cache::FileCache was used to cache DB data and rendered HTML on disk (a minimal sketch of the same idea follows this list).
  • Use two different host names in URLs to enable browser clients to load images in parallel.
  • Make content as static as possible. Create a separate image and CSS server to serve the static content. Use keepalives on static content, as static content uses little memory per thread/process.
  • Leave plenty of spare memory. Spare memory allows Linux to use more memory for file system caching, which increased performance about 20 percent.
  • Turn Keepalive off on your dynamic content. Increasing http requests can exhaust the thread and memory resources needed to serve them.
  • You may not need a complex RDBMS for accessing data. Consider a lighter weight database like BerkeleyDB.
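
    The site's cache is Perl's Cache::FileCache; here is only a small Python sketch of the same disk-cache-with-TTL idea for rendered HTML. The directory layout and 300-second TTL are arbitrary choices, not Mark's.

      # Minimal disk cache for rendered HTML: key -> file on disk, expired by mtime.
      # A stand-in for Cache::FileCache; paths and TTL are arbitrary.
      import hashlib
      import os
      import time

      CACHE_DIR = "/tmp/html_cache"
      TTL = 300  # seconds

      def _path(key):
          return os.path.join(CACHE_DIR, hashlib.md5(key.encode()).hexdigest() + ".html")

      def cache_get(key):
          path = _path(key)
          try:
              if time.time() - os.path.getmtime(path) > TTL:
                  return None                        # entry is stale
              with open(path, "r", encoding="utf-8") as f:
                  return f.read()
          except OSError:
              return None                            # cache miss

      def cache_set(key, html):
          os.makedirs(CACHE_DIR, exist_ok=True)
          with open(_path(key), "w", encoding="utf-8") as f:
              f.write(html)

      def render_page(key, render_fn):
          html = cache_get(key)
          if html is None:
              html = render_fn()                     # expensive DB + template work
              cache_set(key, html)
          return html

      if __name__ == "__main__":
          print(render_page("home", lambda: "<html>rendered at %f</html>" % time.time()))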


  • Friday, Oct 26, 2007

    Paper: Wikipedia's Site Internals, Configuration, Code Examples and Management Issues

    Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource.

    Related Articles

  • Wikimedia Architecture
  • Domas Mituzas' Blog


  • Monday, Oct 8, 2007

    Lessons from Pownce - The Early Years

    Pownce is a new social messaging application competing micromessage to micromessage with the likes of Twitter and Jaiku. Still in closed beta, Pownce has generously shared some of what they've learned so far. Like going to a barrel tasting of a young wine and then tasting the same wine after some aging, I think what will be really interesting is to follow Pownce and compare the Pownce of today with the Pownce of tomorrow, after a few years spent in the barrel. What lessons lie in wait for Pownce as they grow? Site: http://www.pownce.com/

    Information Sources

  • Pownce Lessons Learned - FOWA 2007
  • Scoble on Twitter vs Pownce
  • Founder Leah Culver's Blog

    The Platform

  • Python
  • Django for the website framework
  • Amazon's S3 for file storage.
  • Adobe AIR (Adobe Integrated Runtime) for desktop application
  • Memcached
  • Available on Facebook
  • Timeplot for charts and graphs.

    The Stats

  • Developed in 4 months and went to an invite-only launch in June.
  • Began as Leah's hobby project and then it snowballed into a real horse with the addition of Digg's Daniel Burka and Kevin Rose.
  • Small 4 person team with one website developer.
  • Self funded.
  • One MySQL database.
  • Features include: - Short messaging, invites for events, links, file sharing (you can attach mp3s to messages, for example). - You can limit usage to a specific subset of friends and friends can be grouped in sets. So you can send your mp3 to a specific group of friends. - It does not have an SMS gateway, IM gateway, or an API.

    The Architecture

  • Chose Django because it had an active community, good documentation, and good readability, it is open to growth, and it auto-generates the administration interface.
  • Chose S3 because it minimized maintenance and was inexpensive. It has been reliable for them.
  • Chose AIR because it had a lot of good buzz, ease of development, creates a nice UI, and is cross platform.
  • Database has been the main bottleneck. Attack and fix slow queries.
  • Static pages, objects, and lists are cached using memcached.
  • Queuing is used to defer more complex work, like sending notes, until later (a minimal sketch of this pattern follows this list).
  • Use pagination and a good UI to limit the amount of work performed.
  • Good indexing helped improve the performance for friend searching.
  • In a social site: - Make it easy to create and destroy relationships. - Friend relationships are the most important information to display correctly because people really care about it. - Friends in the online world have real-world effects.
  • Features are "biased" for scalability - You must get an invite from someone already on Pownce. - Invites are limited to their data center's ability to keep up with the added load. Blindly uploading address books can bring on new users exponentially. Limiting that unnatural growth is a good idea.
  • Their feature set will expand but they aren't ready to commit to an API yet.
  • Revenue model: ads between posts.
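
    Pownce doesn't say how its queue is implemented; the snippet below is just a Python sketch of the defer-it pattern using a standard-library queue and a background thread. The send_note() function is a hypothetical stand-in for the real work.

      # Sketch of deferring slow work (like sending notes) to a background worker
      # so the web request can return immediately. Standard library only;
      # send_note() is a hypothetical stand-in.
      import queue
      import threading
      import time

      jobs = queue.Queue()

      def send_note(recipient, text):
          time.sleep(0.5)                  # pretend this is slow (email, fan-out, ...)
          print("sent %r to %s" % (text, recipient))

      def worker():
          while True:
              recipient, text = jobs.get()
              try:
                  send_note(recipient, text)
              finally:
                  jobs.task_done()

      threading.Thread(target=worker, daemon=True).start()

      def handle_request(recipient, text):
          jobs.put((recipient, text))      # enqueue and return right away
          return "accepted"

      if __name__ == "__main__":
          print(handle_request("leah", "hello"))
          jobs.join()                      # in this demo, wait for the send to finish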

    Lessons Learned

  • The four big lessons they've experienced so far are: - Think about technology choices. - Do a lot with a little. - Be kind to your database. - Expect anything.
  • Have a small dedicated team where people handle multiple jobs.
  • Use open source. There's lots of it, it's free, and there's a lot of good help.
  • Use your resources. Learn from website docs, use IRC, network, and participate in communities and knowledge exchange.
  • Shed work off the database by making sure that complex features are really needed before implementing them.
  • Cultivate a prepared mind. Expect the unexpected and respond quickly to the inevitable problems.
  • Use version control and make backups.
  • Maintain a lot of performance related stats.
  • Don't promise users a deadline because you just might not make it.
  • Commune with your community. I especially like this one and I wish it was done more often. I hope this attitude can survive growth. - Let them know what you are working on and about new features and bug fixes. - Respond personally to individual bug creators.
  • Take a look at your framework's automatically generated queries. They might suck (see the Django sketch after this list for one way to inspect them).
  • A sexy UI and a good buzz marketing campaign can get you a lot of users.
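
    Since Pownce is built on Django, one concrete way to follow that advice is to look at the SQL the ORM generates. The fragment below is a sketch meant to run inside a configured Django project: the Note model and app name are hypothetical, str(queryset.query) works on any queryset, and django.db.connection.queries is only populated when settings.DEBUG is True.

      # Sketch: inspect the SQL Django generates for a queryset.
      # Must run inside a Django project; model and app names are hypothetical.
      from django.db import connection
      from myapp.models import Note          # hypothetical app and model

      def show_friend_feed_sql(user):
          qs = Note.objects.filter(recipients=user).order_by("-created")
          print(qs.query)                     # the SELECT the ORM will issue
          list(qs[:20])                       # execute one page of it
          for q in connection.queries[-3:]:   # recorded only when DEBUG = True
              print(q["time"], q["sql"])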

    Related Articles

  • Scaling Twitter: Making Twitter 10000 Percent Faster.


  • Tuesday, Oct 2, 2007

    Secrets to Fotolog's Scaling Success

    Fotolog, a social blogging site centered around photos, grew from about 300 thousand users in 2004 to over 11 million users in 2007. Though they initially experienced the inevitable pains of rapid growth, they overcame their problems and now manage over 300 million photos, with 800,000 new photos added each day. Generating all that fabulous content are 20 million unique monthly visitors and a volunteer army of 30,000 new users each day. They did so well that a very impressed suitor bought them out for a cool $90 million. That's scale meets success by anyone's standards. How did they do it? Site: http://www.fotolog.com/

    Information Sources

  • Scaling the World's Largest Photo Blogging Community
  • Congrats to Fotolog on $90mm sale to Hi-Media
  • Fotolog overtaking Flickr?
  • Fotolog Hits 11 Million Members and 300 Million Photos Posted
  • Site of the Week: Fotolog.com by PC Magazine
  • CEO John Borthwick's Blog.
  • DBA Frank Mash's Blog
  • Fotolog, lessons learnt by John Borthwick.

    The Platform

  • Java
  • PHP
  • Sun
  • Solaris 10
  • MySQL
  • Apache
  • Hibernate
  • Memcached
  • 3PAR (a simple, efficient and scalable tiered-storage array for utility computing)
  • IBRIX (a single namespace parallel file system, a scalable volume manager, high availability feature)
  • StrongMail
  • CDN: Akamai/Panther

    The Stats

  • Started in 2002. In 2004 they had around 300k or 400k members, 3 employees, no scalable infrastructure, and no revenue model.
  • Due to the rapid growth the site had frequent technical problems, and in 2005 they had to limit new free members to 1,000 a day.
  • In 2007 they had over 11 million users and were sold for $90 million to Hi-Media.
  • Members are from over 200 countries with a majority in South America. Over 20% of page views are from Europe. They rejected a US centric strategy, developing a global and engaged audience.
  • Generates over 3.5 billion page views and receives over 20 million unique visitors each month and has earned a top 20 Alexa ranking.
  • Manages over 300 Million photos and over 500,000 photos are uploaded each day.
  • Over 30,000 new members are added each day and the site attracts more than 4.6 million daily users. It has expanded with no marketing or member incentives.
  • Over 500 user-generated communities.
  • 20% of members visit the site daily and spend an average of 24 minutes.
  • 32 MySQL servers and a 30-server memcached cluster.

    The Architecture

  • Site originally written in PHP. - Their new "Fotolog memberpage" feature is written in Java with significant performance improvement. Page is cleaner with an improved response time. - They are now serving the site on less than half the boxes they were using. - Daily registrations are up over 35% given the improved performance and a requirement to register to post a guest book message. - The new code base allows them to innovate much more on the member experience.
  • They have surpassed Flickr in popularity while being a firmly Web 1.0 application. - There are no tags, no APIs, no JavaScript widgets, no Ajax. - They have a Spanish language option which extends the site to a broad user base. - They use very little text. It's mostly visual, so it is usable by a broad range of users. - Their interface is customizable and many people like to express their individual identities. - Their unique visitors are 1MM less than Yahoo's, yet the total minutes on the site are twice that of Yahoo and pages are 3x.
  • Revenue model: - Gold camera membership for about $5/month means you can upload 6 photos a day instead of 1, have 200 comments per photo instead of 20, a custom title image for your profile, a mini-thumbnail of your most recent photo displayed next to your name in guest books, plus the possibility of having your photo featured on the front page. - AdSense. Revenue lift from Google is trending up approximately 15% given additional contextual data from guest books. - Will move to peer-to-peer advertising among their members. - Members will have the ability to buy and sell real and virtual items using a micro-payment service.
  • They have a one-post-per-day rule where users can only post one photo a day. Rather than inhibit growth, this rule ensures quality and generates exceptional usage by increasing the chance of a photo being seen and by attracting positive comments. Whereas people usually run out of things to say on a blog, people can always find a picture to take, upload, and talk about.
  • Only photos less than 2,000 kb in size can be uploaded. These are automatically resized to a 500x500 format. Pages look cleaner and load faster.
  • Model is browsing over searching. Opportunistic serendipitous treasure hunting is encouraged.
  • Friends are added automatically without needing permission. This generates an audience for your photos.
  • Supports a browse by groups feature, which have categories like "Colors" and "Emotions."
  • The site is intentionally simple. - They have resisted the temptation to add feature after feature. Instead their vision is to offer a handful of features, similar to Craig's list, the focus being on content and the conversations. - Pages need to be social. - Pages need to include not only your images, but also images from across the network, providing a visual navigation that today drives much of the time their members spend on the site, a self formed, organic distribution system, letting members see and be seen. - Complementing this social network of images are comments and guest book entries — making the experience one where media intersects with communications, day in day out, millions of images collide with billions of conversations.
  • Photobucket vs Fotolog - Photobucket stores image-based media, then distributes it to your page on social networking sites such as Myspace, Bebo, Piczo, Friendster, etc. - Fotolog is a destination. - The first generation of social-networking sites stressed self-publishing over connections (from Geocities, to Tripod to Blogger). The next generation focused mostly on connections (sixdegrees, and friendster are the classic examples here — tools to gather friends and connections, as social capital accrues in theory to the people with the most connections). The third and current generation of sites blends media with connections — each with a different emphasis.
  • Backup: Sun 6540 disk array
  • Their 32 SQL servers are divided into four clusters - user, GB (guest book), PH (photos), FF (friends and favorites lists) - Uses non-persistent connections. - Connection pooling on the Java side. - InnoDB - Partitioning is handled by the application layer.
  • Each cluster: - Is fronted by a set of application servers. - Is divided into a set of shards. - Each shard has a MySQL write-only master-master configuration feeding a few read-only slaves. - Application servers send their read requests to the slaves and their write requests to the masters. - Data are assigned to shards based on a cluster-specific partitioning key. Naive partitioning algorithms can lead to very uneven shard loads; you want a more balanced load on each shard (a minimal routing sketch follows this list).
  • MySQL is used to store image metadata only. This seems pretty standard. Almost nobody seems to store important blobs in the database because it slows down database operations.
  • Photo storage uses 3PAR and IBRIX. A CDN is used for hot content.
  • The virtual storage system, though expensive, has worked very well.
  • As more selects are used lock contention for auto-incremented keys grows.
  • Through database optimizations they've been able to grow from 4 million members to 11 million members on the same 32 database servers. This is also due to the efficiency of MySQL running on Solaris 10, and to increasing the memcached cluster, porting to Java, and increasing RAM.
  • Happy with memcached. - Created a distributed cluster of 50 memcached servers with a total cache size of approximately 150 gigabytes, supporting around 4 billion page views/month. Peak load times dropped from 10 seconds to 2 seconds. - Quote from CTO:
    "I have a new memcached user to add to your list: we here at Fotolog, the world's largest photo blogging community, now use it and we love it. I just rolled our first code to use it into production today and it has been a lifesaver. I can't wait to start using it in places where we had been relying on Berkeley databases to offload some database work. We are not some wimpy million page a day site, either. Fotolog is a billion+ pages/month site (35 to 40 million views/day is pretty typical for us). We had recently overcome some significant DB-related performance issues which allowed our site traffic to explode, and it started to bog down again under the heavy traffic load (getting back up towards 10 seconds for a page to load sometimes during the peak periods). The servers were churning away each recreating a list every time when it could easily be shared in the same form for at least 5 or 10 minutes. So we introduced memcache, creating a distributed 30-server cluster with 4 gigs available in total and made a very minor code mod to use memcache, and our peak period load times dropped back down to the 2 second or so range. It has allowed for continued growth and incredible efficiency. I can't say when I've ever been so pleased with something that worked so simply."
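
    Fotolog's partitioning lives in their Java application layer; the snippet below is only a Python sketch of the routing rule described above: a partitioning key picks the shard, writes go to that shard's master pair, reads go to one of its slaves. The host names and the modulo scheme are hypothetical, and simple modulo hashing is exactly the kind of naive partitioning that can load shards unevenly.

      # Sketch of application-level shard routing for one cluster (e.g. guest books).
      # Hosts and the modulo partitioning scheme are hypothetical.
      import random

      SHARDS = [
          {"masters": ["gb-m1a", "gb-m1b"], "slaves": ["gb-s1a", "gb-s1b"]},
          {"masters": ["gb-m2a", "gb-m2b"], "slaves": ["gb-s2a", "gb-s2b"]},
      ]

      def shard_for(user_id):
          return SHARDS[user_id % len(SHARDS)]     # naive partitioning key -> shard

      def host_for_read(user_id):
          return random.choice(shard_for(user_id)["slaves"])

      def host_for_write(user_id):
          # One master of the pair takes writes; the other is its failover twin.
          return shard_for(user_id)["masters"][0]

      if __name__ == "__main__":
          print("read guest book for user 123 from", host_for_read(123))
          print("write guest book for user 123 to", host_for_write(123))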

    Lessons Learned

  • Popularity is driven by a base of active users, not a rich set of cool features.
  • The web is global and its tail is very long. By courting users outside the US with language and culturally specific design you can compete with the big boys. Some of the hardest competition for Google, Yahoo, etc. comes from local startups with an ear to what the locals want.
  • If you want to get a lot of buzz then do whatever the alpha geeks want you to do. If you want a lot of happy users, do what they want you to do.
  • Constraints in web sites can, like in poetry, make something unexpectedly better. The rule that users are only allowed to post one photo per day creates an environment where people comment more on each other's photos, which creates a more engaged community. Who knew?
  • Protect your website with limits. Limit the size of pictures, comments, etc so your resource usage doesn't grow outrageously.
  • Have a vision. Have a strong sense of what your site is supposed to be and why, then use that vision to decide what you should build and how you should build it. Their vision of a social site built around daily photographs led to a very different site than one where your goal is to store all your photos.
  • Revenue generation features can be added without destroying the integrity of your site. I really like how they give people a reasonable set of features for free and then charge for the resources they need to have more. Those features also serve to extend and reinforce the social vision of their site. It will be interesting to see how their new monetization strategies play out.
  • Don't be afraid to scale up and out. By adding more cache, more RAM, more CPUs, and more efficient CPUs you handle dramatically more load with the same number of machines. And that's a good thing from a datacenter space and power POV.
  • Making MySQL perform: - Find the source of the problem. - Mature systems are mostly disk bound. - The query cache may be hurting you. - Add RAM to help dodge the bullet. - Stripe your disks. - Restructure tables for optimal performance. - Use libumem.so to find memory leaks.
  • Things to remember: - Know the problem - Know your application - Know your storage engine - Know your requirements - Know your budget - Use all this information to decide what parts of your system really require the investment of time, money, and testing to be highly available.

    Related Articles

  • Flickr Architecture
  • An Unorthodox Approach to Database Design : The Coming of the Shard


  • Tuesday, Sep 18, 2007

    Amazon Architecture

    This is a wonderfully informative Amazon update based on Joachim Rohde's discovery of an interview with Amazon's CTO. You'll learn about how Amazon organizes their teams around services, the CAP theorem of building scalable systems, how they deploy software, and a lot more. Many new additions from the ACM Queue article have also been included. Amazon grew from a tiny online bookstore to one of the largest stores on earth. They did it while pioneering new and interesting ways to rate, review, and recommend products. Greg Linden shared his version of Amazon's birth pangs in a series of blog articles. Site: http://amazon.com

    Information Sources

  • Early Amazon by Greg Linden
  • How Linux saved Amazon millions
  • Interview Werner Vogels - Amazon's CTO
  • Asynchronous Architectures - a nice summary of Werner Vogels' talk by Chris Loosley
  • Learning from the Amazon technology platform - A Conversation with Werner Vogels
  • Werner Vogels' Weblog - building scalable and robust distributed systems

    Platform

  • Linux
  • Oracle
  • C++
  • Perl
  • Mason
  • Java
  • Jboss
  • Servlets

    The Stats

  • More than 55 million active customer accounts.
  • More than 1 million active retail partners worldwide.
  • Between 100-150 services are accessed to build a page.

    The Architecture

  • What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added. Increasing performance in general means serving more units of work, but it can also be to handle larger units of work, such as when datasets grow.
  • The big architectural change that Amazon made was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications.
  • Started as one application talking to a back end. Written in C++.
  • It grew. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, more orders, and to support multiple international sites. In 2001 it became clear that the front-end application couldn't scale anymore. The databases were split into small parts, and around each part they created a services interface that was the only way to access the data.
  • The databases became a shared resource that made it hard to scale-out the overall business. The front-end and back-end processes were restricted in their evolution because they were shared by many different teams and processes.
  • Their architecture is loosely coupled and built around services. A service-oriented architecture gave them the isolation that would allow building many software components rapidly and independently.
  • Grew into hundreds of services and a number of application servers that aggregate the information from the services. The application that renders the Amazon.com Web pages is one such application server. So are the applications that serve the Web-services interface, the customer service application, and the seller interface.
  • Many third party technologies are hard to scale to Amazon size. Especially communication infrastructure technologies. They work well up to a certain scale and then fail. So they are forced to build their own.
  • Not stuck with one particular approach. Some places they use jboss/java, but they use only servlets, not the rest of the J2EE stack.
  • C++ is used to process requests. Perl/Mason is used to build content.
  • Amazon doesn't like middleware because it tends to be framework and not a tool. If you use a middleware package you get lock-in around the software patterns they have chosen. You'll only be able to use their software. So if you want to use different packages you won't be able to. You're stuck. One event loop for messaging, data persistence, AJAX, etc. Too complex. If middleware was available in smaller components, more as a tool than a framework, they would be more interested.
  • The SOAP web stack seems to want to solve all the same distributed systems problems all over again.
  • Offer both SOAP and REST web services. 30% use SOAP. These tend to be Java and .NET users and use WSDL files to generate remote object interfaces. 70% use REST. These tend to be PHP or PERL users.
  • In either SOAP or REST developers can get an object interface to Amazon. Developers just want to get job done. They don't care what goes over the wire.
  • Amazon wanted to build an open community around their services. Web services were chosen because they're simple. But that's only on the perimeter. Internally it's a service oriented architecture. You can only access the data via the interface. It's described in WSDL, but they use their own encapsulation and transport mechanisms.
  • Teams are Small and are Organized Around Services - Services are the independent units delivering functionality within Amazon. It's also how Amazon is organized internally in terms of teams. - If you have a new business idea or problem you want to solve you form a team. Limit the team to 8-10 people because communication is hard. They are called two-pizza teams: the number of people you can feed with two pizzas. - Teams are small. They are assigned authority and empowered to solve a problem as a service in any way they see fit. - As an example, they created a team to find phrases within a book that are unique to the text. This team built a separate service interface for that feature and they had authority to do what they needed. - Extensive A/B testing is used to integrate a new service. They see what the impact is and take extensive measurements.
  • Deployment - They create special infrastructure for managing dependencies and doing a deployment. - The goal is to have all the right services deployed on a box. All application code, monitoring, licensing, etc. should be on a box. - Everyone has a home grown system to solve these problems. - The output of the deployment process is a virtual machine. You can use EC2 to run them.
  • Work From the Customer Backwards to Verify a New Service is Worth Doing - Work from the customer backward. Focus on value you want to deliver for the customer. - Force developers to focus on value delivered to the customer instead of building technology first and then figuring how to use it. - Start with a press release of what features the user will see and work backwards to check that you are building something valuable. - End up with a design that is as minimal as possible. Simplicity is the key if you really want to build large distributed systems.
  • State Management is the Core Problem for Large Scale Systems - Internally they can deliver infinite storage. - Not all that many operations are stateful. Checkout steps are stateful. - Most recent clicked web page service has recommendations based on session IDs. - They keep track of everything anyway so it's not a matter of keeping state. There's little separate state that needs to be kept for a session. The services will already be keeping the information so you just use the services.
  • Eric Brewer's CAP Theorem or the Three Properties of Systems - Three properties of a system: consistency, availability, tolerance to network partitions. - You can have at most two of these three properties for any shared-data system. - Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone. - Consistency: if you write a value and then read it, you get the same value back. In a partitioned system there are windows where that's not true. - Availability: you may not always be able to write or read. The system may say you can't write because it wants to keep the system consistent. - To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency. - Choose a specific approach based on the needs of the service. - For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later. - When a customer submits an order you favor consistency because several services--credit card processing, shipping and handling, reporting--are simultaneously accessing the data. (A toy illustration of the availability-first cart choice follows this list.)
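
    The snippet below is not Amazon's code, only a toy Python illustration of the availability-first choice for add-to-cart: both replicas keep accepting adds during a partition and reconcile by union once it heals, so no add is ever refused, at the cost of the two copies being temporarily inconsistent.

      # Toy illustration of favoring availability over consistency for a cart:
      # each replica accepts adds even when partitioned from the other, and a
      # later merge reconciles the divergent copies. Deliberately simplified;
      # not Amazon's actual mechanism.
      class CartReplica:
          def __init__(self, name):
              self.name = name
              self.items = set()

          def add(self, item):
              self.items.add(item)          # always accepted: the revenue-producing path

          def merge(self, other):
              self.items |= other.items     # reconcile after the partition heals

      if __name__ == "__main__":
          east, west = CartReplica("east"), CartReplica("west")
          # During a network partition the two replicas serve different requests.
          east.add("book")
          west.add("camera")
          # The partition heals: merge both ways; neither add was ever rejected.
          east.merge(west)
          west.merge(east)
          print(sorted(east.items), sorted(west.items))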

    Lessons Learned

  • You must change your mentality to build really scalable systems. Approach chaos in a probabilistic sense that things will work well. In traditional systems we present a perfect world where nothing goes down and then we build complex algorithms (agreement technologies) on this perfect world. Instead, take it for granted stuff fails, that's reality, embrace it. For example, go more with a fast reboot and fast recover approach. With a decent spread of data and services you might get close to 100%. Create self-healing, self-organizing lights out operations.
  • Create a shared nothing infrastructure. Infrastructure can become a shared resource for development and deployment with the same downsides as shared resources in your logic and data tiers. It can cause locking, blocking, and deadlock. A service oriented architecture allows the creation of a parallel and isolated development process that scales feature development to match your growth.
  • Open up your system with APIs and you'll create an ecosystem around your application.
  • The only way to manage a large distributed system is to keep things as simple as possible. Keep things simple by making sure there are no hidden requirements and hidden dependencies in the design. Cut technology to the minimum you need to solve the problem you have. It doesn't help the company to create artificial and unneeded layers of complexity.
  • Organizing around services gives agility. You can do things in parallel because the output is a service. This allows fast time to market. Create an infrastructure that allows services to be built very fast.
  • There's bound to be problems with anything that produces hype before real implementation.
  • Use SLAs internally to manage services.
  • Anyone can very quickly add web services to their product. Just implement one part of your product as a service and start using it.
  • Build your own infrastructure for performance, reliability, and cost control reasons. By building it yourself you never have to say you went down because it was company X's fault. Your software may not be more reliable than others, but you can fix, debug, and deploy much quicker than when working with a third party.
  • Use measurement and objective debate to separate the good from the bad. I've been to several presentations by ex-Amazoners and this is the aspect of Amazon that strikes me as uniquely different and interesting from other companies. Their deep seated ethic is to expose real customers to a choice and see which one works best and to make decisions based on those tests. Avinash Kaushik calls this getting rid of the influence of the HiPPO's, the highest paid people in the room. This is done with techniques like A/B testing and Web Analytics. If you have a question about what you should do code it up, let people use it, and see which alternative gives you the results you want.
  • Create a frugal culture. Amazon used doors for desks, for example.
  • Know what you need. Amazon has a bad experience with an early recommender system that didn't work out: "This wasn't what Amazon needed. Book recommendations at Amazon needed to work from sparse data, just a few ratings or purchases. It needed to be fast. The system needed to scale to massive numbers of customers and a huge catalog. And it needed to enhance discovery, surfacing books from deep in the catalog that readers wouldn't find on their own."
  • People's side projects, the ones they follow because they are interested, are often the ones where you get the most value and innovation. Never underestimate the power of wandering where you are most interested.
  • Involve everyone in making dog food. Go out into the warehouse and pack books during the Christmas rush. That's teamwork.
  • Create a staging site where you can run thorough tests before releasing into the wild.
  • A robust, clustered, replicated, distributed file system is perfect for read-only data used by the web servers.
  • Have a way to rollback if an update doesn't work. Write the tools if necessary.
  • Switch to a deep services-based architecture (http://webservices.sys-con.com/read/262024.htm).
  • Look for three things in interviews: enthusiasm, creativity, competence. The single biggest predictor of success at Amazon.com was enthusiasm.
  • Hire a Bob. Someone who knows their stuff, has incredible debugging skills and system knowledge, and most importantly, has the stones to tackle the worst high pressure problems imaginable by just leaping in.
  • Innovation can only come from the bottom. Those closest to the problem are in the best position to solve it. Any organization that depends on innovation must embrace chaos. Loyalty and obedience are not your tools.
  • Creativity must flow from everywhere.
  • Everyone must be able to experiment, learn, and iterate. Position, obedience, and tradition should hold no power. For innovation to flourish, measurement must rule.
  • Embrace innovation. In front of the whole company, Jeff Bezos would give an old Nike shoe as "Just do it" award to those who innovated.
  • Don't pay for performance. Give good perks and high pay, but keep it flat. Recognize exceptional work in other ways. Merit pay sounds good but is almost impossible to do fairly in large organizations. Use non-monetary awards, like an old shoe. It's a way of saying thank you, somebody cared.
  • Get big fast. The big guys like Barnes and Nobel are on your tail. Amazon wasn't even the first, second, or even third book store on the web, but their vision and drive won out in the end.
  • In the data center, only 30 percent of the staff time spent on infrastructure issues is related to value creation, with the remaining 70 percent devoted to dealing with the "heavy lifting" of hardware procurement, software management, load balancing, maintenance, scalability challenges and so on.
  • Prohibit direct database access by clients. This means you can make your service scale and be more reliable without involving your clients. This is much like Google's ability to independently distribute improvements in their stack to the benefit of all applications.
  • Create a single unified service-access mechanism. This allows for the easy aggregation of services, decentralized request routing, distributed request tracking, and other advanced infrastructure techniques.
  • Making Amazon.com available through a Web services interface to any developer in the world free of charge has also been a major success because it has driven so much innovation that they couldn't have thought of or built on their own.
  • Developers themselves know best which tools make them most productive and which tools are right for the job.
  • Don't impose too many constraints on engineers. Provide incentives for some things, such as integration with the monitoring system and other infrastructure tools. But for the rest, allow teams to function as independently as possible.
  • Developers are like artists; they produce their best work if they have the freedom to do so, but they need good tools. Have many support tools that are of a self-help nature. Support an environment around the service development that never gets in the way of the development itself.
  • You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
  • Developers should spend some time with customer service every two years. There they'll actually listen to customer service calls, answer customer service e-mails, and really understand the impact of the kinds of things they do as technologists.
  • Use a "voice of the customer," which is a realistic story from a customer about some specific part of your site's experience. This helps managers and engineers connect with the fact that we build these technologies for real people. Customer service statistics are an early indicator if you are doing something wrong, or what the real pain points are for your customers.
  • Infrastructure for Amazon, like for Google, is a huge competitive advantage. They can build very complex applications out of primitive services that are by themselves relatively simple. They can scale their operation independently, maintain unparalleled system availability, and introduce new services quickly without the need for massive reconfiguration.


  • Friday, Sep 7, 2007

    Joost Network Architecture

    Colm MacCarthaigh, Network Architect at Joost, gave this presentation at the UK Network Operators' Forum Meeting in Manchester on April 3rd, 2007.


    Wednesday, Aug 22, 2007

    Wikimedia architecture

    Wikimedia is the platform on which Wikipedia, Wiktionary, and the other seven wiki dwarfs are built. This document is just excellent for the student trying to scale the heights of giant websites. It is full of details and innovative ideas that have been proven on some of the most used websites on the internet. Site: http://wikimedia.org/

    Information Sources

  • Wikimedia architecture
  • http://meta.wikimedia.org/wiki/Wikimedia_servers
  • scale-out vs scale-up in the From Oracle to MySQL blog.

    Platform

  • Apache
  • Linux
  • MySQL
  • PHP
  • Squid
  • LVS
  • Lucene for Search
  • Memcached for Distributed Object Cache
  • Lighttpd Image Server

    The Stats

  • 8 million articles spread over hundreds of language projects (English, Dutch, ...)
  • 10th busiest site in the world (source: Alexa)
  • Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers
  • 30 000 HTTP requests/s during peak-time
  • 3 Gbit/s of data traffic
  • 3 data centers: Tampa, Amsterdam, Seoul
  • 350 servers, ranging between 1x P4 to 2x Xeon Quad-Core, 0.5 - 16 GB of memory
  • managed by ~ 6 people
  • 3 clusters on 3 different continents

    The Architecture

  • Geographic load balancing, based on the source IP of the client's resolver, directs clients to the nearest server cluster. IP addresses are statically mapped to countries, and countries to clusters.
  • HTTP reverse proxy caching implemented using Squid, grouped by text for wiki content and media for images and large static files.
  • 55 Squid servers currently, plus 20 waiting for setup.
  • 1,000 HTTP requests/s per server, up to 2,500 under stress
  • ~ 100 - 250 Mbit/s per server
  • ~ 14 000 - 32 000 open connections per server
  • Up to 40 GB of disk caches per Squid server
  • Up to 4 disks per server (1U rack servers)
  • 8 GB of memory, half of that used by Squid
  • Hit rates: 85% for Text, 98% for Media, since the use of CARP.
  • PowerDNS provides geographical distribution.
  • In their primary and regional data centers they run text and media clusters built on LVS, CARP Squids, and cache Squids. The primary datacenter holds the media storage.
  • To make sure the latest revision of all pages are served invalidation requests are sent to all Squid caches.
  • One centrally managed & synchronized software installation for hundreds of wikis.
  • MediaWiki scales well with multiple CPUs, so we buy dual quad-core servers now (8 CPU cores per box)
  • Hardware shared with External Storage and Memcached tasks
  • Memcached is used to cache image metadata, parser data, differences, users and sessions, and revision text. Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
  • Actual revision text is stored as blobs in External storage
  • Static (uploaded) files, such as images, are stored separately on the image server - metadata (size, type, etc.) is cached in the core database and object caches
  • Separate database per wiki (not separate server!)
  • One master, many replicated slaves
  • Read operations are load balanced over the slaves, write operations go to the master (a minimal routing sketch follows this list)
  • The master is used for some read operations in case the slaves are not yet up to date (lagged)
  • External Storage - Article text is stored on separate data storage clusters, simple append-only blob storage. Saves space on expensive and busy core databases for largely unused data - Allows use of spare resources on application servers (2x 250-500 GB per server) - Currently replicated clusters of 3 MySQL hosts are used; this might change in the future for better manageability
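
    MediaWiki's load balancer is written in PHP; the snippet below is only a minimal Python sketch of the routing rule in the bullets above: writes always go to the master, reads are spread over the slaves unless a slave is too far behind, in which case the read falls back to the master. Host names, lag figures, and the lag threshold are hypothetical.

      # Sketch of master/slave query routing with a replication-lag check.
      # Hosts, lag numbers, and the threshold are hypothetical.
      import random

      MASTER = "db-master"
      SLAVES = {"db-slave1": 0.4, "db-slave2": 12.0}   # host -> replication lag (seconds)
      MAX_LAG = 5.0

      def pick_host(is_write):
          if is_write:
              return MASTER
          host = random.choice(list(SLAVES))
          if SLAVES[host] > MAX_LAG:
              return MASTER      # slave too lagged; serve the read from the master
          return host

      if __name__ == "__main__":
          print("UPDATE goes to:", pick_host(is_write=True))
          print("SELECT goes to:", pick_host(is_write=False))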

    Lessons Learned

  • Focus on architecture, not so much on operations or nontechnical stuff.
  • Sometimes caching costs more than recalculating or looking up at the data source...profiling!
  • Avoid expensive algorithms, database queries, etc.
  • Cache every result that is expensive and has temporal locality of reference.
  • Focus on the hot spots in the code (profiling!).
  • Scale by separating: - Read and write operations (master/slave) - Expensive operations from cheap and more frequent operations (query groups) - Big, popular wikis from smaller wikis
  • Improve caching: exploit temporal and spatial locality of reference and reduce the data set size per server
  • Text is compressed and only the differences between revisions are stored (a toy sketch of this idea follows this list).
  • Simple seeming library calls like using stat to check for a file's existence can take too long when loaded.
  • Disk I/O is seek limited: the more disk spindles, the better!
  • Scale-out using commodity hardware doesn't require using cheap hardware. Wikipedia's database servers these days are 16GB dual or quad core boxes with 6 15,000 RPM SCSI drives in a RAID 0 setup. That happens to be the sweet spot for the working set and load balancing setup they have. They would use smaller/cheaper systems if it made sense, but 16GB is right for the working set size and that drives the rest of the spec to match the demands of a system with that much RAM. Similarly the web servers are currently 8 core boxes because that happens to work well for load balancing and gives good PHP throughput with relatively easy load balancing.
  • It is a lot of work to scale out, more if you didn't design it in originally. Wikipedia's MediaWiki was originally written for a single master database server. Then slave support was added. Then partitioning by language/project was added. The designs from that time have stood the test well, though with much more refining to address new bottlenecks.
  • Anyone who wants to design their database architecture so that it'll allow them to inexpensively grow from one box rank nothing to the top ten or hundred sites on the net should start out by designing it to handle slightly out of date data from replication slaves, know how to load balance to slaves for all read queries and if at all possible to design it so that chunks of data (batches of users, accounts, whatever) can go on different servers. You can do this from day one using virtualisation, proving the architecture when you're small. It's a LOT easier than doing it while load is doubling every few months!
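
    The actual revision-storage code is part of MediaWiki (PHP); below is only a toy Python sketch of the store-compressed-diffs idea: keep the first revision in full (compressed) and later revisions as compressed diffs against their predecessor. difflib and zlib are illustrative stand-ins for whatever MediaWiki really uses.

      # Toy sketch of delta + compression for revision text: full compressed base
      # revision, then compressed unified diffs for each later revision.
      import difflib
      import zlib

      def store_revisions(revisions):
          stored = [zlib.compress(revisions[0].encode("utf-8"))]
          for prev, cur in zip(revisions, revisions[1:]):
              diff = "\n".join(difflib.unified_diff(
                  prev.splitlines(), cur.splitlines(), lineterm=""))
              stored.append(zlib.compress(diff.encode("utf-8")))
          return stored

      if __name__ == "__main__":
          revs = [
              "Wikipedia is a free encyclopedia.\nIt has many articles.",
              "Wikipedia is a free online encyclopedia.\nIt has many articles.",
          ]
          blobs = store_revisions(revs)
          full = sum(len(zlib.compress(r.encode("utf-8"))) for r in revs)
          print("delta-compressed bytes:", sum(len(b) for b in blobs), "vs full:", full)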
