Wednesday, August 22, 2007
How many machines do you need to run your site?

Amazingly, TechCrunch runs their website on one web server and one database server, according to the fascinating survey What the Web’s most popular sites are running on by Pingdom, a provider of uptime and response time monitoring.
Earlier we learned that PlentyOfFish catches and releases many millions of hits a day on just one web server and three database servers. Google runs a Dalek army of servers. YouSendIt, a company making it easy to send and receive large files, has 24 web servers, 3 database servers, 170 storage servers, and a few miscellaneous servers. Vimeo, a video sharing company, has 100 servers for streaming video, 4 web servers, and 2 database servers. Meebo, an AJAX-based instant messaging company, uses 40 servers to handle messaging, over 40 web servers, and 10 servers for forums, Jabber, testing, and so on. FeedBurner, a news feed management company, has 70 web servers, 15 database servers, and 10 miscellaneous servers. Now double FeedBurner's server count, because they maintain two geographically separate sites in an active-passive configuration for high availability.
How many servers will you need and how can you trick yourself into using fewer?
Find Someone Like You and Base Your Resource Estimates Off Them
We see quite a disparity in the number of servers needed for popular web sites. It ranges from just a few servers to many hundreds. Where do you fit?
The easiest approach to figuring out how many servers you'll need is to find a company similar to yours and look at how many they need. You won't need that many right away, but as you grow it's something to think about. Can your data center handle your growth? Do they have enough affordable bandwidth and rack space? How will you install and manage all the machines? Who will do the work? And a million other similar questions are easier to handle if you have some idea of where you're going.
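To make that concrete, here's a minimal back-of-envelope sketch. Every number in it is an illustrative assumption, not a measurement; plug in traffic and throughput figures from a site that looks like yours.

```python
# Back-of-envelope capacity estimate. Every number here is an illustrative
# assumption -- substitute figures measured from a site that looks like yours.

import math

peak_requests_per_sec = 1200    # assumed peak traffic
requests_per_server = 300       # assumed sustainable throughput of one web server
target_utilization = 0.5        # leave headroom for spikes and failures

web_servers = math.ceil(peak_requests_per_sec / (requests_per_server * target_utilization))
print(f"Web servers needed at peak: {web_servers}")   # -> 8
```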
Get Someone Else to Do it
Clearly content sites end up needing a lot of servers. Videos, music, pictures, blogs, and attachments all eat up space, and since that's your business you have no alternative but to find a way to store all that data. This is unstructured data that can be stored outside the database in a SAN or NAS.
Or, rather than building your own storage infrastructure, you can follow the golden rule of laziness: get someone else to do it.
That's what SmugMug, an image sharing company, did. They use S3 to store many hundreds of terabytes of data. This drops the expense of creating a large, highly available storage infrastructure so much that it creates a whole new level of competition for content-rich sites. At one time expertise in creating massive storage farms would have been enough to keep competition away, but no more. These sorts of abilities are becoming commoditized, affordable, and open.
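The mechanics are about as simple as storage gets. Here's a minimal sketch of the idea using the boto3 client library; the bucket name, key layout, and helper function are made-up examples, not SmugMug's actual setup.

```python
# Minimal sketch: push uploaded content to S3 instead of local disk.
# The bucket name and key layout are made-up examples.

import boto3

s3 = boto3.client("s3")

def store_image(local_path: str, image_id: str) -> str:
    """Upload one image and return the key it was stored under."""
    key = f"images/{image_id}.jpg"
    s3.upload_file(local_path, "example-photo-bucket", key)
    return key
```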
PlentyOfFish and YouTube make use of CDNs to reduce the amount of infrastructure they need to create for themselves. If you need to stream video, why not let a CDN do it instead of building out your own expensive infrastructure?
You can take a "let other people do it" approach for services like email, DNS, backup, forums, and blogs too. These are all now outsourceable. Does it make sense to put these services in your data center if you don't need to?
If you have compute-intensive tasks you can use Amazon's services without needing to perform your own build-out.
And an approach I am really excited to investigate in the future is a new breed of grid-based virtual private data centers like 3tera and mediatemple. Their claim to fame is that you can componentize your infrastructure in such a way that you can scale automatically and transparently using their grid as demand fluctuates. I don't have any experience with this approach yet, but it's interesting and probably where the world is heading.
If your web site is a relatively simple blog with mostly static content, then you can get away with far fewer servers. Even a popular site like Digg has only 30GB of data to store.
How do your resources scale with the number of users?
Another question you have to ask is whether your resources scale linearly, exponentially, or hardly at all with the number of users. A blog site may not scale much with the number of users. Some sites scale linearly as users are added. And other sites that rely on social interaction, like Google Talk, may scale exponentially as users are added. Getting a feel for the type of site you have can help more realistic numbers pop up on your magic server eight-ball.
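Purely as an illustration, here's what those growth curves might look like in code. The constants are arbitrary assumptions, and the social case is modeled as pairwise interactions (roughly N squared, which, as a commenter notes below, is the more typical growth for social graphs).

```python
# Illustrative only: how resource demand might grow with users under
# different scaling models. All constants are arbitrary assumptions.

def static_blog(users):
    # Mostly static content: demand barely moves with the user count.
    return 1.0

def per_user_storage(users):
    # Linear: every new user adds a roughly fixed amount of work/storage.
    return users * 0.001

def social_interactions(users):
    # Interaction-heavy: potential pairwise connections grow ~ N^2 / 2.
    return users * (users - 1) / 2 * 1e-7

for n in (1_000, 10_000, 100_000):
    print(n, static_blog(n), per_user_storage(n), social_interactions(n))
```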
What's your caching strategy?
A lot of sites use Memcached and Squid for caching. You can fill up a few racks with caching servers. How many servers will you need for caching? Or can you get away with just beefing up the database server cache?
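For reference, the common pattern here is cache-aside: check the cache, fall back to the database on a miss, and populate the cache on the way out. A minimal sketch, assuming a single memcached instance; pymemcache is just one client choice (the post doesn't name one), and fetch_user_from_db stands in for a real query function.

```python
# Cache-aside sketch: check memcached first, fall back to the database on a
# miss, and populate the cache on the way out.

import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_user(user_id, fetch_user_from_db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                # served from memory
    row = fetch_user_from_db(user_id)            # hit the database on a miss
    cache.set(key, json.dumps(row), expire=300)  # keep it warm for 5 minutes
    return row
```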
Do you need servers for application specific tasks?
Servers aren't just for storage, databases, and web serving. You may have a fair bit of computation going on. YouTube offloads tag calculations to a server farm. Google Talk has to have servers for handling presence calculations. PlentyOfFish has servers to handle geographical searches because they are so resource intensive. GigaVox needs servers to transcode podcasts into different formats and insert fresh commercial content. If you are a calendar service you may need servers to calculate more complicated schedule availability schemes and to sync address books. So depending on your site, you may have to budget for many application-related servers like these.
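The usual shape of these setups is a queue between the web tier and a farm of workers. Here's a minimal sketch, using Redis as the broker purely for illustration (none of the companies above are documented as using it); the host name and transcode function are hypothetical.

```python
# Sketch of offloading heavy work (transcoding, geo search, etc.) from the web
# tier to a separate worker farm through a shared queue.

import json
import redis

r = redis.Redis(host="queue-host", port=6379)   # hypothetical queue host

def enqueue_transcode(podcast_id, target_format):
    """Called on a web server: hand the job off and return immediately."""
    r.rpush("transcode-jobs", json.dumps({"id": podcast_id, "fmt": target_format}))

def worker_loop(transcode):
    """Runs on each worker box: block for a job, then do the expensive part."""
    while True:
        _, raw = r.blpop("transcode-jobs")
        job = json.loads(raw)
        transcode(job["id"], job["fmt"])
```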
The Pingdom folks also created a sweet table of the technologies the companies profiled on this site are using. You can find it at What nine of the world’s largest websites are running on. I'm very jealous of their masterful, colorful graphics-fu style. Someday I hope to rise to that level of presentation skill.
Reader Comments (6)
Huh? TechCrunch is hosted by mediatemple, aka Gridserver.com, which claims:
"We've eliminated roadblocks and single points of failure by using hundreds of servers working in tandem for your site, applications, and email."
-- http://www.mediatemple.net/webhosting/gs/
I have read that "Varnish" is up to 20 times faster than Squid. http://en.wikipedia.org/wiki/Varnish_cache
A newspaper in Norway used Squid, but when they switched to Varnish they went from using 12 servers to just one.
Hi there, I want to create a server (a chat server). Can you please tell me how I can create one and which software is best for a chat server? What software do I use to make a chat server and what do I have to do to set up a server? Thanks
"And others sites that rely on social interaction, like Google Talk, may scale exponentially as users are added"
I think you mean quadratic. Social networks can scale around N^2 .... certainly not 2^N.
All right, so better architecture means lower server costs.
Regu posted an interesting question recently: "Is scalability a factor of the number of machines/CPUs?". His answer can ultimately be summed up as "yes, but..." -- it was qualified in terms of threads: "... scalability in a well designed system is a factor of number of threads that can be efficiently executed in parallel". The word "efficiently" meaning that the threads are actually doing work and not just waiting. However, the question of how many machines do we need is a hard one. Nick calls out a very important point on this, "An asymmetric farm, with machines of varying capabilities, is really hard to tune." In all cases we find that load-leveling mechanisms like queues are good for scalability.
Just as a slight sidebar for anybody who deals with systems where work needs to be divided up and run in parallel to meet latency requirements: we have to deal with all the above problems and more. For instance, suppose we have to process images, finishing each image within one minute. Now, we have an algorithm that processes part of an image at 1MB per second, single threaded on a dedicated machine with a standard 3GHz processor. So, how can we process a 1GB image in 60 seconds? Simple, get 17 processors, right? Well, if you were running a 16 or 32 way SMP machine then probably yes. But what if you want to scale out, say, because you're receiving one image every 2 seconds on average? Well, once we scale out, time is impacted quite significantly by the cost of just moving data between servers - one of the fallacies of distributed computing. It becomes a much more difficult problem - the kind that I just love sinking my teeth into :)
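A rough sketch of that arithmetic, keeping the 1MB/s processing rate and one-minute deadline from the example and assuming a 100MB/s network feed from a single source machine (that last figure is purely an assumption, included to show how moving data eats into the budget):

```python
# Rough arithmetic behind the example above. The 1MB/s processing rate and
# one-minute deadline come from the comment; the 100MB/s network figure is an
# assumed value.

import math

image_mb = 1000        # ~1GB image
deadline_s = 60
process_mb_s = 1.0     # per single-threaded worker
network_mb_s = 100.0   # assumed: one source machine feeds all the workers

# On a single big SMP box there is no transfer cost:
print(math.ceil(image_mb / (process_mb_s * deadline_s)))            # -> 17 CPUs

# Scaled out, the source must first ship the image over the wire, which
# shrinks the time each worker has left for actual processing:
transfer_s = image_mb / network_mb_s                                # 10 seconds
processing_budget_s = deadline_s - transfer_s                       # 50 seconds
print(math.ceil(image_mb / (process_mb_s * processing_budget_s)))   # -> 20 workers
```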
Anyway, a lot of us aren't dealing in these massively parallel problem spaces but are just looking for good scalability advice. Well, one of the characteristics of a scalable system is that load is evenly distributed between machines (up to a point - if we have more machines than work that needs to be done, some will be idle). Load can be broken up in terms of resource usage - CPU, memory, disk, network, etc and we should be looking at all parameters. I've noticed a tendency of people to focus only on CPU usage. One case I consulted on was a system that was having performance problems although average CPU utilization was around 50%. They did a costly hardware upgrade at the time from single-CPU machines to all double-CPU, hoping to drive down the utilization and improve performance. They only succeeded half way - CPU utilization did drop, but performance (in terms of response time and throughput) didn't improve - quite simply because the network was the bottleneck, and not processor power. As Dan so eloquently states: "Latency exists, Cope!"
If you use the Pipeline architectural pattern (page 5) that is so well known in the embedded/real-time space at the macro level (inside the service, not between services - that's SOA), and SEDA (Staged Event-Driven Architecture) at the micro level you can create an environment where you can know the amount of resources you need to buy/provision for the expected load at a high degree of accuracy. An additional, maybe even more important benefit has to do with the resiliency of such a system. If there is a degradation in resource performance or availability, the system won't come crashing down but rather "limp along". Conversely, if load continues to increase beyond expected maxima, the performance (in terms of throughput) of such a system would not degrade. By monitoring response time per request, you could notice the upward trend and provision more resources. If you were working with a grid-like infrastructure, you could set these rules up so that they would be executed automatically. These are the building blocks for building "self healing" systems - one of my current favorite areas of interest.
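A toy sketch of the staged idea, in the spirit of SEDA but not tied to any particular framework: each stage gets its own bounded queue and worker threads, so a slow stage shows up as a growing queue you can monitor and provision against rather than a system-wide collapse. The stage functions are placeholders.

```python
# Toy staged-pipeline sketch: each stage has its own bounded queue and workers.

import queue
import threading

def start_stage(inbox, outbox, work, workers=2):
    def loop():
        while True:
            item = inbox.get()
            result = work(item)
            if outbox is not None:
                outbox.put(result)
            inbox.task_done()
    for _ in range(workers):
        threading.Thread(target=loop, daemon=True).start()

q_parse = queue.Queue(maxsize=100)    # bounded queues provide back-pressure
q_render = queue.Queue(maxsize=100)

start_stage(q_parse, q_render, work=lambda req: ("parsed", req))
start_stage(q_render, None, work=lambda item: print("handled", item))

for i in range(5):
    q_parse.put(f"request-{i}")
q_parse.join()
q_render.join()
```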
Bottom line, I've found the layered-architecture/tiered-distribution pair to be rather limited in terms of scalability (in terms of load). I would say that the solution isn't necessarily to move to a Space-Based Architecture, as Guy mentions in this post, although many of the event-based concepts are definitely broadly applicable. Werner Vogels (Amazon's CTO) mentions the CAP (consistency, availability, partitioning - choose 2) model for distributed systems in this podcast, which I think is critical in analyzing the different parts of a complex system. On the flip side, Patrick does an excellent job of warning about the dangers of other appealing, siren-esque paths - follow them at your peril.
I'm afraid that there aren't any easy answers, but at least we have some models that have proven themselves viable in the most strenuous scenarios. These models sometimes contradict popular architectural styles and it's good to be aware of that. At the end of the day, it is our job to make the difficult technical tradeoffs.