Entries in Shard (18)

Wednesday, March 12, 2008

YouTube Architecture

Update 3: 7 Years Of YouTube Scalability Lessons In 30 Minutes and YouTube Strategy: Adding Jitter Isn't A Bug

Update 2: YouTube Reaches One Billion Views Per Day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour. 

Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

Information Sources

  • Google Video

    Platform

  • Apache
  • Python
  • Linux (SuSE)
  • MySQL
  • psyco, a dynamic python->C compiler
  • lighttpd for video instead of Apache

    What's Inside?

    The Stats

  • Supports the delivery of over 100 million videos per day.
  • Founded 2/2005
  • 3/2006 30 million video views/day
  • 7/2006 100 million video views/day
  • 2 sysadmins, 2 scalability software architects
  • 2 feature developers, 2 network engineers, 1 DBA

    Recipe for handling rapid growth

    while (true) { identify_and_fix_bottlenecks(); drink(); sleep(); notice_new_bottleneck(); }

    This loop runs many times a day.

    Web Servers

  • A NetScaler is used for load balancing and caching static content.
  • Run Apache with mod_fast_cgi.
  • Requests are routed for handling by a Python application server.
  • The application server talks to various databases and other information sources to get all the data and formats the HTML page.
  • Can usually scale web tier by adding more machines.
  • The Python web code is usually NOT the bottleneck; it spends most of its time blocked on RPCs.
  • Python allows rapid flexible development and deployment. This is critical given the competition they face.
  • Usually less than 100 ms page service times.
  • Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
  • For CPU-intensive activities like encryption, they use C extensions.
  • Some pre-generated cached HTML for expensive-to-render blocks.
  • Row level caching in the database.
  • Fully formed Python objects are cached.
  • Some data are calculated and sent to each application server so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server, and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends (a minimal sketch of this pattern follows this list).
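    The article doesn't show YouTube's actual agent, but a minimal sketch of the "precalculate and push to local memory" pattern might look like the following. Everything here (the server list, the /local-cache endpoint, and recompute_popular_videos) is an illustrative assumption, not YouTube's real implementation.

    # Sketch of the "precalculate and push" local-cache pattern described above.
    import json
    import time
    import urllib.request

    APP_SERVERS = ["http://app1:8080", "http://app2:8080"]  # hypothetical fleet

    def recompute_popular_videos():
        # Placeholder for an expensive calculation done once, centrally.
        return {"popular_videos": ["abc123", "def456"]}

    def push_to_servers(values):
        # Send the precalculated values to every app server; each server keeps
        # them in a local dict so request handlers never make an RPC for them.
        body = json.dumps(values).encode()
        for server in APP_SERVERS:
            req = urllib.request.Request(
                f"{server}/local-cache", data=body,
                headers={"Content-Type": "application/json"}, method="PUT")
            urllib.request.urlopen(req)

    def agent_loop(interval_seconds=60):
        # Watch for changes, precalculate, and push.
        while True:
            push_to_servers(recompute_popular_videos())
            time.sleep(interval_seconds)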

    Video Serving

  • Costs include bandwidth, hardware, and power consumption.
  • Each video is hosted by a mini-cluster, meaning each video is served by more than one machine.
  • Using a cluster means: - More disks serving content, which means more speed. - Headroom. If a machine goes down others can take over. - There are online backups.
  • Servers use the lighttpd web server for video: - Apache had too much overhead. - Uses epoll to wait on multiple fds. - Switched from a single-process to a multi-process configuration to handle more connections.
  • Most popular content is moved to a CDN (content delivery network): - CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network. - CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
  • Less popular content (1-20 views per day) uses YouTube servers in various colo sites. - There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed. - Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point: if you have a long tail product, caching won't always be your performance savior. - Tune the RAID controller and pay attention to other lower level issues to help. - Tune memory on each machine so there's not too much and not too little.

    Serving Video Key Points

  • Keep it simple and cheap.
  • Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
  • Use commodity hardware. The more expensive the hardware, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net.
  • Use simple common tools. They use mostly tools built into Linux and layer on top of those.
  • Handle random seeks well (SATA, tweaks).

    Serving Thumbnails

  • Surprisingly difficult to do efficiently.
  • There are about 4 thumbnails for each video, so there are a lot more thumbnails than videos.
  • Thumbnails are hosted on just a few machines.
  • Saw problems associated with serving a lot of small objects: - Lots of disk seeks and problems with inode caches and page caches at the OS level. - Ran into a per-directory file limit, with ext3 in particular. Moved to a more hierarchical structure (a sketch of that idea follows this list). Recent improvements in the 2.6 kernel may improve ext3 large-directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea. - A high number of requests/sec, as web pages can display 60 thumbnails on a page. - Under such high loads Apache performed badly. - Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20. - Tried using lighttpd, but single-threaded it stalled. Ran into problems with multi-process mode because each process kept a separate cache. - With so many images, setting up a new machine took over 24 hours. - Rebooting a machine took 6-10 hours for the cache to warm up enough to stop going to disk.
  • To solve all their problems they started using Google's BigTable, a distributed data store: - Avoids the small file problem because it clumps files together. - Fast and fault tolerant. Assumes it's working on an unreliable network. - Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites. - For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.
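    The write-up doesn't describe the exact directory scheme, but a toy sketch of the "more hierarchical structure" workaround could look like this: hash the object name and fan files out across nested directories so no single directory holds millions of entries. The two-level /ab/cd layout and the /var/thumbs root are assumptions for illustration, not YouTube's real layout.

    import hashlib
    from pathlib import Path

    THUMBS_ROOT = Path("/var/thumbs")  # hypothetical root

    def thumbnail_path(video_id: str, index: int) -> Path:
        name = f"{video_id}_{index}.jpg"
        digest = hashlib.md5(name.encode()).hexdigest()
        # e.g. /var/thumbs/3f/a9/abc123_0.jpg -- at most 256*256 directories,
        # each holding a manageable number of files.
        return THUMBS_ROOT / digest[:2] / digest[2:4] / name

    def store_thumbnail(video_id: str, index: int, data: bytes) -> Path:
        path = thumbnail_path(video_id, index)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return path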

    Databases

  • The Early Years - Used MySQL to store metadata like users, tags, and descriptions. - Served data off a monolithic RAID 10 volume with 10 disks. - They were living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered. - They went through a common evolution: a single server, then a single master with multiple read slaves, then a partitioned database, and then settled on a sharding approach. - Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single-threaded and usually run on lesser machines, and replication is asynchronous, so the slaves can lag significantly behind the master. - Updates cause cache misses, which go to disk, where slow I/O causes slow replication. - Using a replicating architecture you need to spend a lot of money for incremental bits of write performance. - One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people want to watch video, so that function should get the most resources. The social networking features of YouTube are less important so they can be routed to a less capable cluster.
  • The later years: - Went to database partitioning. - Split into shards with users assigned to different shards (a minimal routing sketch follows this list). - Spreads writes and reads. - Much better cache locality, which means less IO. - Resulted in a 30% hardware reduction. - Reduced replica lag to 0. - Can now scale the database almost arbitrarily.
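    A minimal sketch of user-based shard routing, as described above. The shard count, connection strings, and function names are illustrative assumptions, not YouTube's actual scheme.

    SHARDS = [
        "mysql://db-shard0/youtube",
        "mysql://db-shard1/youtube",
        "mysql://db-shard2/youtube",
        "mysql://db-shard3/youtube",
    ]

    def shard_for_user(user_id: int) -> str:
        # Every table keyed by user lives together on the user's shard, which
        # is what gives the better cache locality mentioned above.
        return SHARDS[user_id % len(SHARDS)]

    # Usage: route both reads and writes for a user to the same shard.
    dsn = shard_for_user(42)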

    Data Center Strategy

  • Used managed hosting providers at first. They were living off credit cards, so it was the only way.
  • Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
  • So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
  • Use 5 or 6 data centers plus the CDN.
  • Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
  • Video is bandwidth dependent, not really latency dependent. It can come from any colo.
  • For images latency matters, especially when you have 60 images on a page.
  • Images are replicated to different data centers using BigTable. Code looks at different metrics to know which data center is closest (a rough sketch of that selection follows this list).
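    A rough sketch of picking the "closest" data center from measured metrics. The single metric used here (observed latency in milliseconds) and the data center names are illustrative assumptions; the real code likely weighs several signals.

    DC_LATENCY_MS = {"dc-east": 45.0, "dc-west": 80.0, "dc-eu": 120.0}  # hypothetical measurements

    def closest_datacenter(latencies=DC_LATENCY_MS) -> str:
        # Serve images from the data center with the lowest observed latency,
        # since image-heavy pages (60 thumbnails) are latency sensitive.
        return min(latencies, key=latencies.get)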

    Lessons Learned

  • Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.
  • Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.
  • Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.
  • Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.
  • Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance.
  • Constant iteration on bottlenecks: - Software: DB, caching - OS: disk I/O - Hardware: memory, RAID
  • You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.


    Tuesday, November 13, 2007

    Flickr Architecture

    Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

    Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge: they must handle a vast sea of ever-expanding new content, ever-increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

    Site: http://www.flickr.com

    Information Sources

  • Flickr and PHP (an early document)
  • Capacity Planning for LAMP
  • Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
  • Building Scalable Web Sites by Cal Henderson from Flickr.
  • Database War Stories #3: Flickr by Tim O'Reilly
  • Cal Henderson's Talks. A lot of useful PowerPoint presentations.


    Thursday, August 16, 2007

    Scaling Secret #2: Denormalizing Your Way to Speed and Profit

    Alan Watts once observed how, after we accepted Descartes' separation of the mind and body, we've been trying to smash them back together again ever since, when really they were never separate to begin with. The database normalization-denormalization dualism has the same Möbius-shaped reverberations as Descartes' error. We separate data into a million jagged little pieces and then spend all our time stooping over, picking them up, and joining them back together again. Normalization has been standard practice now for decades. But times are changing. Many mega-website architects are concluding Watts was right: the data was never separate to begin with. And even more radical, we may need to store multiple copies of data.

    Information Sources

  • Normalization Is for Sissies by Pat Helland
  • Data normalization, is it really that good? by Arnon Rotem-Gal-Oz
  • When Not to Normalize your SQL Database by Dare Obasanjo
  • MegaData by Joe Gregorio
  • Audio of talk by Adam Bosworth at the MySQL Users Conference 2005

    We normalize data to prevent anomalies. Anomalies are bad things like forgetting to update someone's address in all the places it's been stored when they move. This anomaly happens because the address has been duplicated. So to prevent the anomaly we don't duplicate data. We split everything up so it is stored once and exactly once. Bad things are far less likely to happen if we follow this strategy. And that's a good thing. The process of getting rid of all potential bad things is called normalization, and we have a bunch of rules to follow to normalize our data.

    The price of normalization is that when we want a person's address we have to go find the person and their address in separate operations and bring the data together again. This is called a join. The problem is that joins are relatively slow, especially over very large data sets, and if they are slow your website is slow. It takes a long time to get all those separate bits of information off disk and put them all together again. Flickr decided to denormalize because it took 13 SELECTs for each INSERT, DELETE, or UPDATE.

    If you say your database is the bottleneck, then the finger is pointed back at you and you are asked what you are doing wrong. Have you created proper indexes? Is your schema design good? Is your database efficient? Are you tuning your queries? Have you cached in the database? Have you used views? Have you cached complicated queries in memcached? Can you get more parallel IO out of your database? All these are valid and good questions. For your typical transactional database these would be your normal paths of attack. But we aren't talking about your normal database. We are talking about web-scale services that have to process loads higher than any single database can scale to. At some point you need a different approach.

    What many mega-scale websites with billions of records, petabytes of data, many thousands of simultaneous users, and millions of queries a day are doing is using a sharding scheme, and some are even advocating denormalization as the best strategy for architecting the data tier. We see this with eBay, who moved all significant functionality out of the database and into applications. Flickr shards and replicates their data to reach high performance levels. For Flickr this moves transaction logic back into their application layer, but the win is higher scalability. Joe Gregorio has identified some common themes across these new mega-data systems:
  • Distributed - The data has to be distributed across multiple machines.
  • Joinless - No joins, and no referential integrity, at least at the data store level.
  • De-Normalized - De-normalization is needed if you are avoiding joins.
  • Transactionless - No transactions.

    It's the web model pushed to the data tier. Ironically, it may take a web model on the back end to support a web model on the front end. A toy illustration of the normalized-versus-denormalized trade-off follows.
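    The following sketch uses plain dictionaries as stand-ins for tables to show the trade-off described above. The schema and helper names are illustrative assumptions, not Flickr's or eBay's actual data model.

    # Normalized: address stored once; reading a user's address needs a "join".
    users = {1: {"name": "Alice", "address_id": 10}}
    addresses = {10: {"street": "1 Main St", "city": "Springfield"}}

    def get_address_normalized(user_id):
        return addresses[users[user_id]["address_id"]]  # two lookups (a join)

    # Denormalized: address copied onto every row that needs it; reads are one
    # lookup, but every update must touch all the copies (the anomaly risk).
    orders = {
        100: {"user_id": 1, "ship_to": {"street": "1 Main St", "city": "Springfield"}},
        101: {"user_id": 1, "ship_to": {"street": "1 Main St", "city": "Springfield"}},
    }

    def update_address_denormalized(user_id, new_address):
        addresses[users[user_id]["address_id"]] = new_address
        for order in orders.values():          # the application, not the database,
            if order["user_id"] == user_id:    # is now responsible for keeping the
                order["ship_to"] = new_address # duplicated copies consistent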

    The Great Data Ownership Wars: The Database vs. The Application

    A not so subtle clue as to who won the data wars is to look at the words used. Data that are split up are considered "normal." Those who keep their data whole are considered "de-normal." All right, that's not what those words mean, but it was too good to pass up. :-)

    Traditionally the database owns the data. Referential integrity, triggers, stored procedures, and everything else that keeps the data safe and whole is in the database. Applications are prevented from screwing up the data. And this makes sense until you scale. Centralizing all behavior in the database won't mega-scale as the web does, which is why eBay went completely the other way. eBay maintains data integrity through a service layer that encapsulates all data access. The service layer handles referential integrity, managing replicated copies, doing joins, and so on. It's more error prone than having the database do all this work, but you are able to scale past what even the highest-end databases can handle.

    All this sharding and denormalization and duplication at some level feels so wrong because it's so different from what we were all taught. And unless you are a really large website you probably don't need to worry about this level of complexity. But it's a really fascinating and unexpected evolution in design. Scaling to handle the world wide web requires techniques and strategies that are often at odds with our years of experience. It will be fun to see where it all leads.

    Related Articles

  • Flickr both denormalizes and duplicates data. Horror!
  • eBay is the most radical in moving almost all functionality out of the database and into the application.
  • Plenty of Fish also advocates denormalization as a key strategy.
  • Hadoop - a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce.


    Tuesday, July 24, 2007

    Product: Hibernate Shards

    If you want to adopt a shard architecture but don't want to start from scratch, you may want to consider Hibernate's sharding system. Hibernate Shards is a framework designed to encapsulate and minimize this complexity by adding support for horizontal partitioning to Hibernate Core. Hibernate Shards key features:

  • Standard Hibernate programming model - Hibernate Shards allows you to continue using the Hibernate APIs you know and love: SessionFactory, Session, Criteria, Query. If you already know how to use Hibernate, you already know how to use Hibernate Shards.
  • Flexible sharding strategies - Distribute data across your shards any way you want. Use one of the default strategies we provide or plug in your own application-specific logic.
  • Support for virtual shards - Think your sharding strategy is never going to change? Think again. Adding new shards and redistributing your data is one of the toughest operational challenges you will face once you've deployed your shard-aware application. Hibernate Shards supports virtual shards, a feature designed to simplify the process of resharding your data.
  • Free/open source - Hibernate Shards is licensed under the LGPL (GNU Lesser General Public License).


    Monday, July 23, 2007

    GoogleTalk Architecture

    Google Talk is Google's instant communications service. Interestingly, the IM messages aren't the major architectural challenge; handling user presence indications dominates the design. They also have the challenge of handling small, low-latency messages and integrating with many other systems. How do they do it? Site: http://www.google.com/talk

    Information Sources

  • GoogleTalk Architecture

    Platform

  • Linux
  • Java
  • Google Stack
  • Shard

    What's Inside?

    The Stats

  • Support presence and messages for millions of users.
  • Handles billions of packets per day in under 100ms.
  • IM is different than many other applications because the requests are small packets.
  • Routing and application logic are applied per packet for sender and receiver.
  • Messages must be delivered in-order.
  • Architecture extends to new clients and Google services.

    Lessons Learned

  • Measure the right thing. - People ask how many IMs you deliver or how many active users you have. That turns out not to be the right engineering question. - The hard part of IM is how to show correct presence to all connected users, because growth is non-linear: ConnectedUsers * BuddyListSize * OnlineStateChanges. - Linear user growth can mean very non-linear server growth, which requires serving many billions of presence packets per day. - Have a large number of friends and presence explodes. The number of IMs is not that big of a deal. (A back-of-the-envelope sketch of this follows the list.)
  • Real Life Load Tests - Lab tests are good, but don't tell you enough. - Did a backend launch before the real product launch. - Simulate presence requests and going on-line and off-line for weeks and months, even if real data is not returned. It works out many of the kinks in network, failover, etc.
  • Dynamic Resharding - Divide user data or load across shards. - Google Talk backend servers handle traffic for a subset of users. - Make it easy to change the number of shards with zero downtime. - Don't shard across data centers. Try to keep users local. - They can bring down servers and backups take over. Then you can bring up new servers, data is migrated automatically, and clients auto-detect and go to the new servers.
  • Add Abstractions to Hide System Complexity - Different systems should have little knowledge of each other, especially when separate groups are working together. - Gmail and Orkut don't know about sharding, load-balancing, or fail-over, data center architecture, or number of servers. Can change at anytime without cascading changes throughout the system. - Abstract these complexities into a set of gateways that are discovered at runtime. - RPC infrastructure should handle rerouting.
  • Understand Semantics of Lower Level Libraries - Everything is abstracted, but you must still have enough knowledge of how they work to architect your system. - Does your RPC create TCP connections to all or some of your servers? Very different implications. - Does the library perform health checking? This has architectural implications, as you can have separate systems failing independently. - Which kernel operation should you use? IM requires a lot of connections, but few have any activity. Use epoll vs poll/select.
  • Protect Against Operational Problems - Smooth out all spikes in server activity graphs. - What happens when servers restart with an empty cache? - What happens if traffic shifts to a new data center? - Limit cascading problems. Back off from busy servers. Don't accept work when sick. - Isolate in emergencies. Don't infect others with your problems. - Have intelligent retry logic policies abstracted away. Don't sit in hard 1 msec retry loops, for example.
  • Any Scalable System is a Distributed System - Add fault tolerance to every component of the system. Everything fails. - Add ability to profile live servers without impacting server. Allows continual improvement. - Collect metrics from server for monitoring. Log everything about your system so you see patterns in cause and effects. - Log end-to-end so you can reconstruct an entire operation from beginning to end across all machines.
  • Software Development Strategies - Make sure binaries are both backward and forward compatible so you can have old clients work with new code. - Build an experimentation framework to try new features. - Give engineers access to production machines. Gives end-to-end ownership. This is very different from many companies, which have completely separate ops teams in their data centers. Often developers can't touch production machines.
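    A back-of-the-envelope sketch of the non-linear presence growth described in "Measure the right thing" above. All of the numbers here are made-up illustrative assumptions, not Google Talk's actual figures.

    def presence_packets_per_day(connected_users, buddy_list_size, state_changes_per_user):
        # Every state change fans out to everyone on the user's buddy list.
        return connected_users * buddy_list_size * state_changes_per_user

    # Doubling users while buddy lists also double quadruples the presence load:
    small = presence_packets_per_day(1_000_000, 20, 10)   # 200 million packets/day
    large = presence_packets_per_day(2_000_000, 40, 10)   # 800 million packets/day
    print(large // small)  # -> 4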


    Monday, July 16, 2007

    Paper: Replication Under Scalable Hashing

    Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution. Typical algorithms for decentralized data distribution work best in a system that is fully built before it is first used; adding or removing components results in either extensive reorganization of data or load imbalance in the system. We have developed a family of decentralized algorithms, RUSH (Replication Under Scalable Hashing), that maps replicated objects to a scalable collection of storage servers or disks. RUSH algorithms distribute objects to servers according to user-specified server weighting. While all RUSH variants support addition of servers to the system, different variants have different characteristics with respect to lookup time in petabyte-scale systems, performance with mirroring (as opposed to redundancy codes), and storage server removal. All RUSH variants redistribute as few objects as possible when new servers are added or existing servers are removed, and all variants guarantee that no two replicas of a particular object are ever placed on the same server. Because there is no central directory, clients can compute data locations in parallel, allowing thousands of clients to access objects on thousands of servers simultaneously.
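    RUSH itself is more involved, but a toy hash-based stand-in can illustrate the interface the abstract describes: clients compute replica locations directly, with no central directory, and no two replicas of an object land on the same server. This is NOT the RUSH algorithm; the server list and names are assumptions for illustration.

    import hashlib

    SERVERS = [f"server-{i}" for i in range(16)]  # hypothetical cluster

    def replica_locations(object_id: str, replicas: int = 3):
        assert replicas <= len(SERVERS)
        # Rank every server by a per-object hash and take the top `replicas`;
        # distinct servers in the ranking guarantee distinct replica placements.
        ranked = sorted(
            SERVERS,
            key=lambda s: hashlib.md5(f"{object_id}:{s}".encode()).hexdigest())
        return ranked[:replicas]

    # Any client can compute this independently, in parallel:
    print(replica_locations("object-42"))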


    Tuesday, July 10, 2007

    mixi.jp  Architecture

    Mixi is a fast growing social networking site in Japan. They provide services like: diary, community, message, review, and photo album. Having a lot in common with LiveJournal they also developed many of the same approaches. Their write up on how they scaled their system is easily one of the best out there. Site: http://mixi.jp

    Information Sources

  • mixi.jp - scaling out with open source

    Platform

  • Linux
  • Apache
  • MySQL
  • Perl
  • Memcached
  • Squid
  • Shard

    What's Inside?

  • They grew to approximately 4 million users in two years and add over 15,000 new users/day.
  • Ranks 35th on Alexa and 3rd in Japan.
  • More than 100 MySQL servers
  • Add more than 10 servers/month
  • Use non-persistent connections.
  • Diary traffic is 85% read and 15% write.
  • Message traffic is 75% read and 25% write.
  • Ran into replication performance problems so they had to split the database.
  • Considered splitting vertically by user or splitting horizontally by table type.
  • They ended up partitioning by table type and user, so all the messages for a group of users would be assigned to a particular database. A partitioning key is used to decide in which database data should be stored (a minimal sketch of this follows the list).
  • For caching they use memcached with 39 machines x 2 GB memory.
  • Stores more than 8 TB of images with about 23 GB added per day.
  • MySQL is only used to store metadata about the images, not the images themselves.
  • Images are either frequently accessed or rarely accessed.
  • Frequently accessed images are cached using Squid on multiple machines.
  • Rarely accessed images are served from the file system. There's no profit in caching them.
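    A minimal sketch of partitioning by table type and user, as described above. The database names, table types, and modulo scheme are illustrative assumptions, not mixi's actual partitioning algorithm.

    PARTITIONS = {
        "message": ["mysql://msg-db0/mixi", "mysql://msg-db1/mixi"],
        "diary":   ["mysql://diary-db0/mixi", "mysql://diary-db1/mixi"],
    }

    def database_for(table_type: str, user_id: int) -> str:
        # First split by table type, then spread users across that type's databases.
        databases = PARTITIONS[table_type]
        return databases[user_id % len(databases)]

    # Usage: all of user 1234's messages live in one database.
    dsn = database_for("message", 1234)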

    Lessons Learned

  • When using dynamic partitioning it's difficult to pick keys and algorithms for where data should be stored.
  • Once you partition data you can no longer do joins and you have to open a lot of connections to different databases to merge the data back together.
  • It's hard to add new hosts and rearrange data when you partition. For example, let's say your partitioning algorithm stores all the messages for users 1-N on host 1. Now let's say host 1 becomes overburdened and you want to repartition users across more hosts. This is very difficult to do.
  • By using distributed memory caching they rarely hit the DB, and their average page load time is about .02 seconds. This reduces the problems associated with partitioning.
  • You will often have to develop strategies based on the type of content. For example, images will be treated differently from short text posts.
  • Social networking sites are very time oriented, so it might be useful to partition data by time as well as user and type.


    Monday, July 9, 2007

    LiveJournal Architecture

    A fascinating and detailed story of how LiveJournal evolved their system to scale. LiveJournal was an early player in the free blog service race and faced issues from quickly adding a large number of users. Blog posts come fast and furious, which causes a lot of writes, and writes are particularly hard to scale. Understanding how LiveJournal faced their scaling problems will help any aspiring website builder. Site: http://www.livejournal.com/

    Information Sources

  • LiveJournal - Behind The Scenes Scaling Storytime
  • Google Video
  • Tokyo Video
  • 2005 version

    Platform

  • Linux
  • MySQL
  • Perl
  • Memcached
  • MogileFS
  • Apache

    What's Inside?

  • Scaling from 1, 2, and 4 hosts to cluster of servers.
  • Avoid single points of failure.
  • Using MySQL replication only takes you so far.
  • Becoming IO bound kills scaling.
  • Spread out writes and reads for more parallelism.
  • You can't keep adding read slaves and scale.
  • Shard storage approach, using DRBD, for maximal throughput. Allocate shards based on roles.
  • Caching to improve performance with memcached. Two-level hashing to distribute RAM (a minimal sketch follows this list).
  • Perlbal for web load balancing.
  • MogileFS, a distributed file system, for parallelism.
  • TheSchwartz and Gearman for distributed job queuing to do more work in parallel.
  • Solving persistent connection problems.
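    A minimal sketch of two-level hashing for a memcached-style cache, as mentioned above: keys hash to a fixed set of virtual buckets, and buckets map to physical servers, so rebalancing means reassigning buckets rather than rehashing every key. The bucket count and server names are illustrative assumptions, not LiveJournal's actual configuration.

    import hashlib

    NUM_BUCKETS = 1024
    BUCKET_TO_SERVER = {b: f"cache-{b % 4}" for b in range(NUM_BUCKETS)}  # 4 hypothetical servers

    def server_for_key(key: str) -> str:
        bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS  # level 1: key -> bucket
        return BUCKET_TO_SERVER[bucket]                                        # level 2: bucket -> server

    # Adding a cache server only requires reassigning some entries in
    # BUCKET_TO_SERVER, not changing the key hash.
    print(server_for_key("user:42:profile"))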

    Lessons Learned

  • Don't be afraid to write your own software to solve your own problems. LiveJournal has provided incredible value to the community through their efforts.
  • Sites can evolve from small 1, 2 machine setups to larger systems as they learn about their users and what their system really needs to do.
  • Parallelization is key to scaling. Remove choke points by caching, load balancing, sharding, clustering file systems, and making use of more disk spindles.
  • Replication has a cost. You can't just keep adding more and more read slaves and expect to scale.
  • Low level issues like which OS event notification mechanism to use, file system and disk interactions, threading and event models, and connection types matter at scale.
  • Large sites eventually turn to a distributed queuing and scheduling mechanism to distribute large work loads across a grid.

