Entries in Apache (13)

Sunday
May 31, 2009

Need help on Site loading & database optimization - URGENT

Hi Friends,

I need some help in making my site load faster. On average the site gets around 2,500 hits per day, but on 16th May it had 60,000 hits. On that day the site was loading very slowly and was even timing out. I also checked the running processes using the "top" command, and it showed that mysql was taking too much load.

There are around 166 tables (including the PHPBB forum) in my database. All content on the site is displayed by fetching it from the database. I have also added indexes to the respective tables where required. Plain PHP/HTML coding is used.

Technology:

PHP -- 5.2
MySQL -- 5.0
Apache -- 2.0
Linux

Following are the server details of my site:

CPU : Single Socket Dual Core AMD Opteron 1212HE
Memory: 2GB DDR RAM
Hard Drive: 250GB SATA
Ethernet: 100Mb Primary Ethernet Card

(/var/log) # uname -a
Linux 2.6.9-67.0.15.ELsmp #1 SMP Tue Apr 22 13:50:33 EDT 2008 i686 athlon i386 GNU/Linux

kernel version:
2.6.9-67.0.15.ELsmp

(/var/log) # free -m
total used free shared buffers cached
Mem: 2026 1976 49 0 143 1474
-/+ buffers/cache: 359 1667
Swap: 1027 0 1027

RAM: 2 G

(/var/log) # df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 227G 20G 196G 10% /
/dev/sda1 99M 12M 82M 13% /boot
none 1014M 0 1014M 0% /dev/shm
/dev/sda2 2.0G 196M 1.7G 11% /tmp

Disk usage: 10% used/ 196 G available.

It's a dedicated server and only one website is hosted on it.

Can anybody please suggest how I can optimize the site so that it will not go down if traffic increases?

Thanks
Sandy

Saturday
Dec 20, 2008

Second Life Architecture - The Grid

Update: Presentation: Second Life’s Architecture. Ian Wilkes, VP of Systems Engineering, describes the architecture used by the popular game Second Life. Ian presents what the architecture looked like at its debut and how it evolved over the years as users and features were added. Second Life is a 3-D virtual world created by its Residents. Virtual worlds are expected to become more and more popular on the internet, so their architecture might be of interest. Especially important is the appearance of open virtual worlds, or metaverses. What happens when video games meet Web 2.0? What happens is the metaverse.

Information Sources

Platform

  • MySQL
  • Apache
  • Squid
  • Python
  • C++
  • Mono
  • Debian

What's Inside?

The Stats

  • ~1M active users
  • ~95M user hours per quarter
  • ~70K peak concurrent users (40% annual growth)
  • ~12Gbit/sec aggregate bandwidth (in 2007)

Staff (in 2006)

  • 70 FTE + 20 part time
"about 22 are programmers working on SL itself. At any one time probably 1/3 of the team is on infrastructure, 1/3 is on new features and 1/3 is on various maintenance tasks (bug fixes, general stability and speed improvements) or improvements to existing features. But it varies a lot."

Software

Client/Viewer
  • Open Source client
  • Render the Virtual World
  • Handles user interaction
  • Handles locations of objects
  • Gets velocities and does simple physics to keep track of what is moving where
  • No collision detection
Simulator (Sim): Each geographic area (256x256 meter region) in Second Life runs on a single instantiation of server software, called a simulator or "sim," and each sim runs on a separate core of a server. The simulator is the primary SL C++ server process which runs on most servers. As the viewer moves through the world it is handed off from one simulator to another.
  • Runs Havok 4 physics engine
  • Runs at 45 frames/sec. If it can't keep up, it will attempt time dilation without reducing the frame rate.
  • Handles storing object state, land parcel state, and terrain height-map state
  • Keeps track of where everything is and does collision detection
  • Sends locations of stuff to viewer
  • Transmits image data in a prioritized queue
  • Sends updates to viewers only when needed (only when collision occurs or other changes in direction, velocity etc.)
  • Runs Linden Scripting Language (LSL) scripts
  • Scripting has been recently upgraded to the much faster Mono scripting engine
  • Handles chat and instant messages
    • Asset Server
      • One big clustered filesystem ~100TB
      • Stores asset data such as textures.
    • MySQL database
      Second Life started with one database and was subsequently forced into clustering. They use a ton of MySQL databases running on Debian machines to handle lots of centralized services. Rather than attempt to build the one, impossibly large database – all hail the Central Database – or one, impossibly large central cluster – all hail the Cluster – Linden Lab instead adopted a divide and conquer strategy based around data partitioning. The good thing is that UUIDs – 128-bit unique identifiers – are associated with most things in Second Life, so partitioning is generally doable.
    • Backbone
      Linden Lab has converted much of their backend architecture away from custom C++/messaging into web services. Certain services have been moved off of MySQL – or are cached (Squid) between the queries and MySQL. Presence, in particular Agent Presence (i.e., are you online and where are you on the grid), is a particularly tricky kind of query to partition, so there is now a Python service running on the SL grid called Backbone. It proved to be easier to scale, develop and maintain than many of their older technologies, and as a result it plays an increasingly important role in the Second Life platform as Linden Lab migrates their legacy code to web services. Two main components of the backbone are open source:
      • Eventlet is a networking library written in Python. It achieves high scalability by using non-blocking I/O while at the same time retaining high programmer usability by using coroutines to make the non-blocking I/O operations appear blocking at the source code level (see the sketch after this list).
      • Mulib is a REST web service framework built on top of eventlet
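      Not Linden Lab's code, but a minimal sketch of the coroutine style Eventlet enables (Python 2-era API, URLs made up): each fetch reads like ordinary blocking code, yet the green socket yields to other coroutines while it waits, so many requests can be in flight at once.

        import eventlet
        from eventlet.green import urllib2   # cooperative, non-blocking drop-in for urllib2

        urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

        def fetch(url):
            # Looks like a blocking call; under the hood the green socket
            # yields to the event hub while waiting for the response.
            return urllib2.urlopen(url).read()

        pool = eventlet.GreenPool(200)   # up to 200 concurrent coroutines
        for body in pool.imap(fetch, urls):
            print(len(body))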

      Hardware

      • 2000+ Servers in 2007
      • ~6000 Servers in early 2008
      • Plans to upgrade to ~10000 (?)
      • 4 sims per machine, for both class 4 and class 5
      • Used all-AMD for years, but are moving from the Opteron 270 to the Intel Xeon 5148
      • The upgrade to "class 5" servers doubled the RAM per machine from 2GB to 4GB and moved to a faster SATA disk
      • Class 1 - 4 are on 100Mb with 1Gb uplinks to the core. Class 5 is on pure 1Gb
      Do you have more details?

      Click to read more ...

Wednesday
Apr 30, 2008

Rather small site architecture.

Website stats:

Webserver: Apache 2.2
Database: MySQL 5.0
APC cache for PHP
CMS: Drupal 6.2 (bleeding-edge version)*
*Aggressive caching ON, Page Compression ON, Block Cache ON (can't use CCS), Optimize CSS/JS ON
2 servers: Apache/MySQL (low-tech servers - Celeron processors, 512 MB RAM, 7200 RPM HDD)
Bandwidth: 10 Mb/s

The benchmark:

Used ab: ab -n 1000 -c 20 howwhatwho.com

    Server Software:        Apache/2.2.3
    Server Hostname:        howwhatwho.com
    Server Port:            80
    Document Path:          /
    Document Length:        41639 bytes
    Concurrency Level:      20
    Time taken for tests:   13.556796 seconds
    Complete requests:      1000
    Failed requests:        0
    Write errors:           0
    Total transferred:      42118000 bytes
    HTML transferred:       41639000 bytes
    Requests per second:    73.76 [#/sec] (mean)
    Time per request:       271.136 [ms] (mean)
    Time per request:       13.557 [ms] (mean, across all concurrent requests)
    Transfer rate:          3033.90 [Kbytes/sec] received

The Apache server is also running postfix and bind, although they aren't resource-intensive applications. The cron job for Drupal runs every 50 minutes, and the aggregator module is enabled and fetches more than 30 RSS feeds each time. The site used to be hosted on a single Celeron machine, but at peak times the CPU went up to 80%.

Question: Does anybody know a website hosted on an IBM Mainframe? :) Todd?

Click to read more ...

Monday
Apr 07, 2008

Scalr - Open Source Auto-scaling Hosting on Amazon EC2

Scalr is a fully redundant, self-curing and self-scaling hosting environment built on Amazon's EC2. It has recently been open sourced on Google Code. Scalr allows you to create server farms through a web-based interface using prebuilt AMIs for load balancers (pound or nginx), app servers (apache, others), databases (mysql master-slave, others), and a generic AMI to build on top of.

Scalr promises automatic high availability and scaling for developers through health and load monitoring. The health of the farm is continuously monitored and maintained. When the load average on a type of node goes above a configurable threshold, a new node is inserted into the farm to spread the load and the cluster is reconfigured. When a node crashes, a new machine of that type is inserted into the farm to replace it.

4 AMIs are provided: load balancers, mysql databases, application servers, and a generic base image to customize. Scalr allows you to further customize each image, bundle it, and use it for future nodes inserted into the farm. You can make changes to one machine and use that image for a specific type of node; new machines of this type will be brought online to meet current load levels and the old machines are terminated one by one.

The open source Scalr platform, combined with static EC2 IP addresses, makes elastic computing easier to implement. Check out the blog announcement by Intridea for more info. As AWS conquers the scalable web application hosting space it is time to check out the new Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB book on amazon.com. What do you think of the opportunities of using Scalr for automatic scalability?
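Scalr's actual implementation lives in the Google Code project; the following is only a conceptual sketch of the monitor-and-scale loop described above, with every helper (get_load_average, launch_node, is_alive) hypothetical rather than part of Scalr:

    import time

    LOAD_THRESHOLD = 4.0     # hypothetical load-average trigger per role
    CHECK_INTERVAL = 60      # seconds between health checks

    def get_load_average(role, nodes):
        # Hypothetical: poll this role's nodes and return their mean 1-minute load average.
        return 0.0

    def launch_node(role):
        # Hypothetical: boot a new instance from the role's AMI and register it with the farm.
        return {"role": role, "healthy": True}

    def is_alive(node):
        # Hypothetical: health-check a single node.
        return node.get("healthy", False)

    def monitor(farm):
        # farm maps a role name ("lb", "app", "mysql") to its list of nodes.
        while True:
            for role, nodes in farm.items():
                # Self-curing: replace crashed nodes with a fresh machine of the same type.
                for node in list(nodes):
                    if not is_alive(node):
                        nodes.remove(node)
                        nodes.append(launch_node(role))
                # Self-scaling: add a node when the role's load average crosses the threshold.
                if get_load_average(role, nodes) > LOAD_THRESHOLD:
                    nodes.append(launch_node(role))
            time.sleep(CHECK_INTERVAL)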

Click to read more ...

Wednesday
Mar 12, 2008

YouTube Architecture

Update 3: 7 Years Of YouTube Scalability Lessons In 30 Minutes and YouTube Strategy: Adding Jitter Isn't A Bug

Update 2: YouTube Reaches One Billion Views Per Day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour. 

Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

Information Sources

  • Google Video

    Platform

  • Apache
  • Python
  • Linux (SuSe)
  • MySQL
  • psyco, a dynamic python->C compiler
  • lighttpd for video instead of Apache

    What's Inside?

    The Stats

  • Supports the delivery of over 100 million videos per day.
  • Founded 2/2005
  • 3/2006 30 million video views/day
  • 7/2006 100 million video views/day
  • 2 sysadmins, 2 scalability software architects
  • 2 feature developers, 2 network engineers, 1 DBA

    Recipe for handling rapid growth

    while (true) { identify_and_fix_bottlenecks(); drink(); sleep(); notice_new_bottleneck(); } This loop runs many times a day.

    Web Servers

  • NetScaler is used for load balancing and caching static content.
  • Run Apache with mod_fastcgi.
  • Requests are routed for handling by a Python application server.
  • The application server talks to various databases and other information sources to get all the data and formats the HTML page.
  • Can usually scale web tier by adding more machines.
  • The Python web code is usually NOT the bottleneck; it spends most of its time blocked on RPCs.
  • Python allows rapid flexible development and deployment. This is critical given the competition they face.
  • Usually less than 100 ms page service times.
  • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
  • For high CPU intensive activities like encryption, they use C extensions.
  • Some pre-generated cached HTML for expensive to render blocks.
  • Row level caching in the database.
  • Fully formed Python objects are cached.
  • Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.
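    The local-memory strategy in the last point can be sketched roughly like this (names and the push mechanism are hypothetical, not YouTube's code):

        import time

        # Process-local cache: the fastest cache is inside the application server itself.
        _local_cache = {}

        def precalculate():
            # Hypothetical: recompute expensive, rarely-changing values
            # (e.g. a "top videos" list) from the database.
            return {"top_videos": ["vid123", "vid456"], "generated_at": time.time()}

        def receive_push(values):
            # Runs on every app server when the agent pushes a fresh snapshot.
            _local_cache.update(values)

        def get_cached(key):
            # Request handlers read straight from local memory: no network hop at all.
            return _local_cache.get(key)

        def agent_loop(push_to_servers, interval=30):
            # The agent watches for changes, precalculates, and fans the result out to
            # all app servers; push_to_servers stands in for the real RPC/HTTP push.
            while True:
                push_to_servers(precalculate())
                time.sleep(interval)

        # Single-process demo: the "push" just calls receive_push locally.
        receive_push(precalculate())
        print(get_cached("top_videos"))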

    Video Serving

  • Costs include bandwidth, hardware, and power consumption.
  • Each video is hosted by a mini-cluster, meaning each video is served by more than one machine.
  • Using a cluster means:
    - More disks serving content, which means more speed.
    - Headroom. If a machine goes down, others can take over.
    - There are online backups.
  • Servers use the lighttpd web server for video:
    - Apache had too much overhead.
    - Uses epoll to wait on multiple fds.
    - Switched from a single-process to a multiple-process configuration to handle more connections.
  • Most popular content is moved to a CDN (content delivery network):
    - CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network.
    - CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
  • Less popular content (1-20 views per day) uses YouTube servers in various colo sites.
    - There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed.
    - Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point: if you have a long-tail product, caching won't always be your performance savior.
    - Tune the RAID controller and pay attention to other lower-level issues to help.
    - Tune memory on each machine so there's not too much and not too little.
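    The long-tail point is easy to see in a toy simulation (made-up numbers, unrelated to YouTube's real traffic): with a head-heavy popularity curve a cache holding a small fraction of the catalog absorbs most requests, while with a flat long tail nearly every request misses and falls through to disk.

        import random
        from itertools import accumulate
        from collections import OrderedDict

        def lru_hit_rate(num_items, cache_size, num_requests, zipf_s=1.0, seed=0):
            # Toy model: the item at popularity rank r is requested with weight 1/(r+1)**s.
            # s=1.0 is a head-heavy curve; s=0.0 is a completely flat long tail.
            rng = random.Random(seed)
            items = range(num_items)
            cum_weights = list(accumulate(1.0 / (rank + 1) ** zipf_s for rank in items))
            cache, hits = OrderedDict(), 0
            for _ in range(num_requests):
                item = rng.choices(items, cum_weights=cum_weights)[0]
                if item in cache:
                    hits += 1
                    cache.move_to_end(item)
                else:
                    cache[item] = True
                    if len(cache) > cache_size:
                        cache.popitem(last=False)   # evict the least recently used item
            return hits / num_requests

        # Head-heavy popularity: a cache holding 1% of the catalog absorbs most requests.
        print(lru_hit_rate(num_items=100000, cache_size=1000, num_requests=50000, zipf_s=1.0))
        # Flat long tail: the same cache barely helps, so random disk reads dominate.
        print(lru_hit_rate(num_items=100000, cache_size=1000, num_requests=50000, zipf_s=0.0))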

    Serving Video Key Points

  • Keep it simple and cheap.
  • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
  • Use commodity hardware. The more expensive the hardware gets, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net.
  • Use simple common tools. They use most tools built into Linux and layer on top of those.
  • Handle random seeks well (SATA, tweaks).

    Serving Thumbnails

  • Surprisingly difficult to do efficiently.
  • There are about 4 thumbnails for each video, so there are a lot more thumbnails than videos.
  • Thumbnails are hosted on just a few machines.
  • Saw problems associated with serving a lot of small objects:
    - Lots of disk seeks and problems with inode caches and page caches at the OS level.
    - Ran into per-directory file limits, Ext3 in particular. Moved to a more hierarchical directory structure (see the sketch after this list). Recent improvements in the 2.6 kernel may improve Ext3 large-directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
    - A high number of requests/sec, as web pages can display 60 thumbnails per page.
    - Under such high loads Apache performed badly.
    - Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
    - Tried using lighttpd, but with a single thread it stalled. Ran into problems with multiprocess mode because each process kept a separate cache.
    - With so many images, setting up a new machine took over 24 hours.
    - Rebooting a machine took 6-10 hours for the cache to warm up enough to not go to disk.
  • To solve all their problems they started using Google's BigTable, a distributed data store:
    - Avoids the small-file problem because it clumps files together.
    - Fast and fault tolerant. Assumes it's working on an unreliable network.
    - Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites.
    - For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.
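    The "more hierarchical structure" fix mentioned above is commonly implemented by hashing the object name and using the first characters of the digest as nested directory names, so no single directory ever holds millions of files. A generic sketch, not YouTube's actual layout:

        import hashlib
        import os

        def thumbnail_path(root, video_id):
            # Spread files over 256*256 subdirectories using the first two bytes of an
            # MD5 digest of the name, giving paths like <root>/<xx>/<yy>/<video_id>.jpg
            digest = hashlib.md5(video_id.encode("utf-8")).hexdigest()
            return os.path.join(root, digest[:2], digest[2:4], video_id + ".jpg")

        print(thumbnail_path("/var/thumbs", "someVideoId"))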

    Databases

  • The Early Years:
    - Used MySQL to store metadata like users, tags, and descriptions.
    - Served data off a monolithic RAID 10 volume with 10 disks.
    - They were living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get it delivered.
    - They went through a common evolution: single server, then a single master with multiple read slaves, then partitioning the database, and then settling on a sharding approach.
    - Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single-threaded and usually run on lesser machines, and replication is asynchronous, so the slaves can lag significantly behind the master.
    - Updates cause cache misses which go to disk, where slow I/O causes slow replication.
    - Using a replicating architecture you need to spend a lot of money for incremental bits of write performance.
    - One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video, so that function should get the most resources. The social networking features of YouTube are less important, so they can be routed to a less capable cluster.
  • The Later Years:
    - Went to database partitioning.
    - Split into shards with users assigned to different shards.
    - Spreads writes and reads.
    - Much better cache locality, which means less IO.
    - Resulted in a 30% hardware reduction.
    - Reduced replica lag to 0.
    - Can now scale the database almost arbitrarily.
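    The user-assigned sharding above can be illustrated with a generic sketch (made-up shard DSNs, not YouTube's actual scheme):

        # Each shard is an independent MySQL instance holding a slice of the users.
        SHARD_DSNS = [
            "mysql://db-shard-0.internal/site",   # hypothetical hosts
            "mysql://db-shard-1.internal/site",
            "mysql://db-shard-2.internal/site",
            "mysql://db-shard-3.internal/site",
        ]

        def shard_for_user(user_id, num_shards=len(SHARD_DSNS)):
            # All of a user's rows live on one shard, so their reads and writes hit a
            # single, smaller database with much better cache locality.
            return user_id % num_shards

        def dsn_for_user(user_id):
            return SHARD_DSNS[shard_for_user(user_id)]

        print(dsn_for_user(1234567))   # picks one of the four shards

    A fixed modulo makes adding shards painful, which is why a directory service or consistent hashing is the usual refinement.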

    Data Center Strategy

  • Used managed hosting providers at first. They were living off credit cards, so it was the only way.
  • Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
  • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
  • Use 5 or 6 data centers plus the CDN.
  • Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
  • Video is bandwidth dependent, not really latency dependent. It can come from any colo.
  • For images latency matters, especially when you have 60 images on a page.
  • Images are replicated to different data centers using BigTable. Code looks at different metrics to know who is closest.

    Lessons Learned

  • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
  • Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
  • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance.
  • Constant iteration on bottlenecks: - Software: DB, caching - OS: disk I/O - Hardware: memory, RAID
  • You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

    Click to read more ...

    Wednesday
    Dec 05, 2007

    Easier Production Releases 

    I’ve been a part of some late night release procedures and they’re never fun. You’ve got QA, Dev, IT and a handful of managers sitting in their jammies in a group IM (or worse, a conference call) from 2:00 AM until way too early in the morning. Everyone’s grumpy and sleepy, causing the release to be more difficult and take longer. Sometimes the dreaded “rollback!” is yelled. All this because you’re running a high profile website that needs to be accessible 24/7, and 2:00 AM - 5:00 AM downtime is better than daytime downtime. If you're a site that doesn't have 10s of thousands to drop on a real http load balancer, use this strategy to release software during business hours with no downtime using apache's mod_proxy_balancer....

    Click to read more ...

    Tuesday
    Nov 13, 2007

    Flickr Architecture

    Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

    Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge, they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

    Site: http://www.flickr.com

    Information Sources

  • Flickr and PHP (an early document)
  • Capacity Planning for LAMP
  • Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
  • Building Scalable Web Sites by Cal Henderson from Flickr.
  • Database War Stories #3: Flickr by Tim O'Reilly
  • Cal Henderson's Talks. A lot of useful PowerPoint presentations.

    Platform

    Click to read more ...

    Monday
    Nov 12, 2007

    Slashdot Architecture - How the Old Man of the Internet Learned to Scale

    Slashdot effect: overwhelming unprepared sites with an avalanche of readers' clicks after being mentioned on Slashdot. Sure, we now have the "Digg effect" and other hot new stars, but Slashdot was the original. And like many stars from generations past, Slashdot plays the elder statesman's role with class, dignity, and restraint. Yet with millions and millions of users Slashdot is still box office gold and more than keeps up with the young'ins. And with age comes the wisdom of learning how to handle all those users. Just how does Slashdot scale and what can you learn by going old school? Site: http://slashdot.org

    Information Sources

  • Slashdot's Setup, Part 1- Hardware
  • Slashdot's Setup, Part 2- Software
  • History of Slashdot Part 3- Going Corporate
  • The History of Slashdot Part 4 - Yesterday, Today, Tomorrow

    The Platform

  • MySQL
  • Linux (CentOS/RHEL)
  • Pound
  • Apache
  • Perl
  • Memcached
  • LVS

    The Stats

  • Started building the system in 1999.
  • 5.5 million user visits per month.
  • 7,000 comments are added every day.
  • Over 9 million page views daily.
  • Over 21 million comments.
  • Average monthly bandwidth usage is around 40-50 mbit/sec.
  • For the same story Kottke.org found Slashdot delivered 4 times more users than Digg. So Slashdot ain't dead yet.
  • From The History of Slashdot Part 4: On [September 11th] the mainstream news websites buckled under the loads, and although we had to turn off logging, we managed to stay up, sharing news in a time where it was often difficult to get. That was the day where the team of engineers that make this site happen pulled together and did the impossible, forcing our limited little hardware cluster to handle traffic that was probably triple or quadruple a normal day.

    The Hardware Architecture

  • Data center design is similar to all the other SourceForge, Inc. sites and has proven to scale well.
  • Two Active-Active gigabit uplinks.
  • A pair of Cisco 7301s serve as gateway/border routers. Perform some basic filtering. Filtering is tiered to spread the load.
  • Foundry BigIron 8000s act as core switches/routers.
  • Foundry FastIron 9604s are used as switches for some racks.
  • A pair of Rackable Systems servers (1Us; P4 Xeon 2.66GHz, 2GB RAM, 2x80GB IDE, running CentOS and LVS) serve as load balancing firewalls, distributing traffic to the web servers. BIG-IP F5s are being deployed in their new datacenter.
  • All servers are at least RAID 1.
  • 16 web servers:
    - Running Red Hat 9.
    - Rackable 1U servers with 2 Xeon 2.66GHz processors, 2GB of RAM, and 2x80GB IDE hard drives.
    - Two serve static content: javascript, images, and the front page for non-logged-in users.
    - Four serve the front page to logged-in users.
    - 10 handle comment pages.
    - Host roles are changed in response to load.
    - All NFS mounts are in read-only mode.
  • NFS server is a Rackable 2U with 2 Xeon 2.4GHz processors, 2GB of RAM, and 4x36GB 15K RPM SCSI drives.
  • 7 database servers:
    - All run CentOS 4.
    - 2 in a master-master configuration:
      -- Dual Opteron 270s with 16GB RAM, 4x36GB 15K RPM SCSI drives.
      -- One master is the write-only database.
      -- One master is the read-only database.
      -- They can fail over at any time and switch roles.
    - 2 reader databases:
      -- Dual Opteron 270s with 8GB RAM, 4x36GB 15K RPM SCSI drives.
      -- Each syncs from one of the master databases.
      -- Can add more to scale, but they are plenty fast enough for now.
    - 3 miscellaneous databases:
      -- Quad P3 Xeon 700MHz with 4GB RAM, 8x36GB 10K RPM SCSI drives.
      -- Accesslog writer and accesslog reader. Separate databases are used because moderation and stats require a lot of CPU time for computation.
      -- Search database.

    The Software Architecture

  • Logged-in and non-logged-in users are treated differently:
    - Non-logged-in users all see the same page. This page is a static page that is updated every couple of minutes.
    - Logged-in users have custom options which can't be cached, so generating pages for these users takes more resources.
  • 6 pound servers (1 for SSL) are used as reverse proxies:
    - If a request can't be handled it is forwarded on to a web server.
    - Pound servers are run on the same machines as the web servers.
    - They are distributed for load balancing and redundancy.
    - SSL is handled by the pound server so the web servers don't need to support SSL.
  • 16 Apache web servers (version 1.3):
    - Software is mounted from /usr/local on the read-only NFS server.
    - The images are kept simple. All that is compiled in is:
      -- mod_perl
      -- lingerd to free up RAM during delivery.
      -- mod_auth_useragent to block bots.
    - 1 for SSL.
    - 2 for static (.shtml) requests.
    - 4 for the dynamic homepage.
    - 6 for dynamic comment-delivery pages (comments, article, pollBooth.pl).
    - 3 for all other dynamic scripts (ajax, tags, bookmarks, firehose).
  • Reasons for segregating Apache servers into different roles:
    - Isolate the servers in case there are performance problems or a DDoS attack on a specific page. The rest of the system will function even when one part is failing.
    - For efficiency reasons like httpd-level caching and MaxClients tuning. The web server can be tuned differently for each role. MaxClients is set to 5-15 for dynamic web servers and 25 for static servers. The bottleneck is CPU, not RAM, so if requests aren't processed quickly then something's wrong and queuing more requests won't help the CPU process them any faster.
  • Using read-only mounts has contributed to the robustness of the system. Tasks that write to /usr/local, for example to update index.html every second, run on the NFS server.
  • Use their own SQL API built on top of DBD::mysql and DBI.pm.
  • A huge performance boost was provided by caching users, stories, and comment text using memcached (an illustrative sketch of this pattern follows the list below).
  • Most data access is through get and set methods written custom for each data type and through methods that perform one specific update or select.
  • The Multiple-master replication architecture allows keeping the site fully live even during blocking queries like ALTER TABLE.
  • Multi-pass log processing is used to detect abuse and to pick which users get mod points.
  • The moderation system was created in response to spam. It was just a few friends at first and then a lot of friends. This didn't scale. So the 'mod points' system was introduced so that any user who contributed to the system could moderate the system.
  • Overly active users are banned to protect against excessive usage by bots.
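    Slashdot implements its caching in Perl on top of their own SQL API; purely to illustrate the read-through caching pattern, here is a sketch using the python-memcached client (the key scheme and the database accessor are hypothetical):

        import memcache   # python-memcached client

        mc = memcache.Client(["127.0.0.1:11211"])

        def get_comment_text(comment_id, db):
            # Read-through cache: try memcached first, fall back to the database,
            # then populate the cache so the next request is served from memory.
            key = "comment_text:%d" % comment_id           # hypothetical key scheme
            text = mc.get(key)
            if text is None:
                text = db.fetch_comment_text(comment_id)   # hypothetical DB accessor
                mc.set(key, text, time=300)                # cache for 5 minutes
            return text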

    Lessons Learned

  • The most creatively satisfying period was when money was tight, the group was small, and everyone was helping everyone else with anything that needed to be done.
  • Don't waste your time optimizing code because you are too cheap to buy more machines. Buy the hardware and spend your time working on features.
  • Sell out to a large corporation and you lose control. There's continual pressure to go to the dark side of creating new products, blending in advertiser supplied content, and serving giant ads.
  • Say no to the forces that want you to become just like everyone else. Though many competitors have come and gone, Slashdot is still around because they: continue to maintain editorial independence, moderate advertising quantity with a clear distinction between advertising and content, and of course, that we continue to select the right stories to appeal to our existing audience... not to spend our time courting other audiences that would only dilute the discussions that bring so many of you here day after day.
  • Segregate servers into different policy domains so you can optimize their configuration.
  • Optimizing usually means caching, caching, caching.
  • Tables are mostly, but not fully, normalized. This improves performance in most cases.
  • Over the last seven years the process of developing database backed websites has changed: The database used to be the bottleneck: centralized, hard to expand, slow. Now even a cheap DB server can run a pretty big site if you code defensively, and thanks to Moore's Law, memcached, and improvements in open-source database software, that part of the scaling issue isn't really a problem until you're practically the size of eBay. It's an exciting time to be coding web applications.

    Click to read more ...

    Thursday
    Nov 08, 2007

    scaling drupal - an open-source infrastructure for high-traffic drupal sites

    The authors of Drupal have paid considerable attention to performance and scalability. Consequently, even a default install running on modest hardware can easily handle the demands of a small website. If you are lucky, eventually the time comes when you need to service more users than your system can handle, and at some point you'll start looking at your hardware and network deployment.


    Click to read more ...

    Saturday
    Sep 08, 2007

    Making the case for PHP at Yahoo! (Oct 2002)

    This presentation by Michael Radwin describes why Yahoo! standardized on PHP going forward. It describes how, after reviewing all the available web technologies including their own internal ones, PHP was chosen. It shows that not only technical reasons, but also business and development processes, were taken into account.

    Click to read more ...