6 Ways to Kill Your Servers - Learning How to Scale the Hard Way

This is a guest post by Steffen Konerow, author of the High Performance Blog.
Learning how to scale isn't easy without any prior experience. Nowadays you have plenty of websites like highscalability.com to get inspiration from, but unfortunately there is no solution that fits all websites and needs. You still have to think on your own to find a concept that works for your requirements – and that's exactly what I had to do.
A few years ago, my bosses came to me and said: "We've got a new project for you. It's the relaunch of a website that already has one million users a month. You have to build the website and make sure we'll be able to grow afterwards." I was already an experienced coder, but not in these dimensions, so I had to start learning how to scale – the hard way.
The software behind the website was a PHP content management system based on Smarty and MySQL. The first task was finding a proper hosting company that had the experience and would also manage the servers for us. After some research we found one, told them our requirements and ordered the suggested setup:
- Load balancer (+ fallback)
- 2 webservers
- MySQL server (+ fallback)
- Development machine
They said that would be all we needed – and we believed them. What we got was:
- Load balancer (single core, 1 GB RAM, Pound)
- 2 webservers (dual core, 4 GB RAM, Apache)
- MySQL server (quad core, 8 GB RAM)
- Development machine (single core, 1 GB RAM)
The setup was very basic, without any further optimization. To synchronize the files (PHP + media files) they installed DRBD in an active-active configuration.
Eventually the day of the relaunch came – of course we were all excited. Very early in the morning we switched the domains to the new IPs, started our monitoring scripts and stared at the screens. Traffic hit the machines almost instantly and everything seemed to work just fine. The pages loaded quickly, MySQL was serving lots of queries and we were all happy.
Then, suddenly, our telephones started to ring: "We can't access our website, what's going on?" We looked at our monitoring software and indeed – the servers were frozen and the site was offline! Of course, the first thing we did was call our hoster: "Hey, all our servers are dead. What's going on?" They promised to check the machines and call back immediately afterwards. The call came: "Well… erm… your filesystem is completely fubar. What did you do? It's totally screwed." They stopped the load balancers and told me to have a look at one of the webservers. Looking at the index.php file, I was shocked. It contained weird fragments of C code, error messages and something that looked like log files. After some further investigation we found out that DRBD was the cause of this mess.
Lesson #1 learned
Put Smarty compile and template caches on an active-active DRBD cluster with high load and your servers will DIE!
While our hoster was fixing the webservers, I rewrote parts of the CMS to store the Smarty cache files on each server's local filesystem (a sketch of the change follows below). Issue found & fixed. We went back online! Hurray!!!
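In essence, the fix was a small change in the Smarty setup: point the compile and cache directories at local disks so that Smarty's constant write traffic stays off the replicated DRBD volume. A minimal sketch – the paths and bootstrap code here are illustrative, not the actual CMS code:

    require_once 'Smarty.class.php';

    $smarty = new Smarty();

    // Templates stay on the shared, replicated storage (read-only)...
    $smarty->template_dir = '/var/www/shared/templates/';

    // ...but everything Smarty *writes* at runtime goes to the local
    // disk of each webserver, keeping write load off DRBD entirely.
    $smarty->compile_dir = '/var/cache/smarty/compile/';
    $smarty->cache_dir   = '/var/cache/smarty/cache/';

    $smarty->caching = 1;            // enable template output caching
    $smarty->cache_lifetime = 300;   // seconds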
By now it was early afternoon. The website usually reaches its peak from late afternoon until early evening; at night the traffic drops back to almost nothing. We kept staring at the monitoring software and we were all sweating. The website was loading, but the later it got, the higher the system load climbed and the slower the responses became. I increased the lifetime of the Smarty template caches and hoped it would do the trick – it didn't! Very soon the servers started producing timeouts, white pages and error messages. The two machines couldn't handle the load.
Our customer was getting a bit nervous, but he said: "OK, relaunches usually cause some issues. As long as you fix them quickly, it will be fine!"
We needed a plan to reduce the load and discussed the issue with our hoster. One of their administrators came up with a good idea: "Guys, your servers are currently running a pretty common Apache + mod_php setup. How about switching to an alternative webserver like Lighttpd? It's a fairly small project, but even Wikipedia is using it." We agreed.
Lesson #2 learned
Put an out-of-the-box webserver configuration on your machines, do not optimize it at all and your servers will DIE!
The administrator did his best and reconfigured both webservers as quickly as he could. He threw away the Apache configuration and switched to Lighttpd + FastCGI + XCache. When we went back online later, we could hardly stand the tension. How long would the servers last this time?
The servers did surprisingly well. The load was MUCH lower than before and the average response time was good. After this huge relief we went home and got some sleep. It was already late and we came to the conclusion there was nothing left we could do.
Over the next days the website did rather well, but at peak times it still came close to crashing. We identified MySQL as the bottleneck and called our hoster again. They suggested MySQL master-slave replication with a slave on each webserver.
Lesson #3 learned
Even a powerful database server has its limits and when you reach them – your servers will DIE!
In this case the database had become so slow at some point that the incoming and queued network connections killed our webservers – again. Unfortunately this issue wasn't easy to fix. The content management system was pretty simple in this regard and had no built-in support for separating reading and writing SQL queries. It took a while to rewrite everything (the basic idea is sketched below), but the result was astonishing and worth every minute of lost sleep.
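The core idea of such a rewrite can be boiled down to a small wrapper that routes reads to the local slave and writes to the master. A rough sketch – the class name, credentials and the old mysql_* API are illustrative of the era, not the CMS's actual code:

    class Db
    {
        private $master;
        private $slave;

        public function __construct()
        {
            // All writes go to the central master...
            $this->master = mysql_connect('master.db.internal', 'cms', 'secret');
            // ...all reads go to the replication slave running
            // on the local webserver itself.
            $this->slave = mysql_connect('127.0.0.1', 'cms', 'secret');
        }

        public function query($sql)
        {
            // Crude but effective routing: SELECTs hit the slave,
            // everything else (INSERT/UPDATE/DELETE) hits the master.
            $isRead = preg_match('/^\s*SELECT/i', $sql);
            return mysql_query($sql, $isRead ? $this->slave : $this->master);
        }
    }

One caveat any wrapper like this has to live with: replication lag means a read issued immediately after a write may return stale data, so read-after-write paths may need to be pinned to the master.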
The MySQL replication really did the trick and the website was finally stable! YEAH! Over the next weeks and months the website became a success and the number of users grew steadily. It was only a matter of time until the traffic would exceed our resources again.
Lesson #4 learned
Stop planning in advance and your servers are likely to DIE.
Fortunately we kept thinking and planning. We optimized the code, reduced the number of SQL queries needed per pageload, and then stumbled upon memcached. At first I added memcached support to some of the core functions, as well as to the heaviest (slowest) functions. When we deployed the changes we couldn't believe the results – it felt a bit like finding the Holy Grail. We reduced the number of queries per second by at least 50%. Instead of buying another webserver we decided to make even more use of memcached (the pattern is sketched below).
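The pattern itself is simple: before running an expensive query, check memcached; on a miss, run the query and store the result with a short TTL. A minimal sketch using PHP's Memcache extension – the key name, query and $db helper are made up for illustration:

    $memcache = new Memcache();
    $memcache->connect('127.0.0.1', 11211);

    function get_popular_articles($memcache, $db)
    {
        $key = 'popular_articles';

        // Serve from the cache while the entry is still fresh.
        $result = $memcache->get($key);
        if ($result !== false) {
            return $result;
        }

        // Cache miss: run the expensive query once...
        $result = $db->fetchAll(
            'SELECT id, title FROM articles ORDER BY views DESC LIMIT 10'
        );

        // ...and keep the result for 5 minutes (0 = no compression flag).
        $memcache->set($key, $result, 0, 300);

        return $result;
    }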
Lesson #5 learned
Forget about caching and you will either waste a lot of money on hardware or your servers will die!
It turned out that memcached helped us reduce the load on the MySQL servers by 70-80%, which also resulted in a huge performance boost – on the webservers too. The pages loaded much more quickly!
Eventually our setup seemed to be perfect. Even at peak times we no longer had to worry about crashes or slowly responding pages. Had we made it? No! Out of the blue, one of the webservers started having some kind of hiccups: error messages, white pages and so on. The system load was fine, and in most cases the server worked – but only in "most cases".
Lesson #6 learned
Put a few hundred thousand small files in one folder, run out of inodes, and your servers will DIE!
Yes, you read that correctly. We were so focused on MySQL, PHP and the webservers themselves that we hadn't paid enough attention to the filesystem. The Smarty cache files were stored on the local filesystem – all in one single directory. The solution was to put Smarty on a dedicated ReiserFS partition and, furthermore, to enable Smarty's "use_subdirs" option (see the snippet below).
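The change itself is tiny. Roughly – note the property name varies slightly between Smarty releases ($use_sub_dirs in later 2.x versions):

    // Let Smarty hash its compile and cache files into a tree of
    // subdirectories instead of one flat folder, so no single
    // directory accumulates hundreds of thousands of files.
    $smarty->use_sub_dirs = true;

    // Inode usage per filesystem can be watched with "df -i".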
Over the following years we kept optimizing the pages. We put the Smarty caches into memcached (a sketch follows below), installed Varnish to reduce the I/O load and serve static files more quickly, switched to Nginx (Lighttpd randomly produced error 500 messages), installed more RAM, bought better hardware, more hardware… the list is endless.
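For the curious: Smarty 2 lets you replace the file-based page cache with a custom cache handler function, which is one way to move the caches into memcached. A hedged sketch of such a handler – the key scheme and connection details are illustrative, not our production code:

    function memcached_cache_handler($action, &$smarty, &$cache_content,
        $tpl_file = null, $cache_id = null, $compile_id = null, $exp_time = null)
    {
        static $mc = null;
        if ($mc === null) {
            $mc = new Memcache();
            $mc->connect('127.0.0.1', 11211);
        }
        $key = md5($tpl_file . '|' . $cache_id . '|' . $compile_id);

        switch ($action) {
            case 'read':   // fetch a rendered page from memcached
                $cache_content = $mc->get($key);
                return $cache_content !== false;
            case 'write':  // store the rendered page
                return $mc->set($key, $cache_content, 0, (int) $exp_time);
            case 'clear':  // invalidate a single entry
                return $mc->delete($key);
        }
        return false;
    }

    $smarty->cache_handler_func = 'memcached_cache_handler';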
Conclusion
Scaling a website is a never-ending process. As soon as you fix one bottleneck, you're very likely to stumble into the next one. Never start thinking "that's it, we're done" and lean back. It will kill your servers and perhaps even your business. Scaling is a constant process of planning and learning. If you can't get a job done on your own because you lack the experience and/or the resources – find a competent and reliable partner to work with. And never stop talking with your team and partners about the current issues and the ones that might arise in the (near) future. Think ahead and be proactive!
Reader Comments
Haha, loved your "and your servers will DIE!" rallying cry.
So let me sum up this post in a few words: use MySQL and PHP and your servers will DIE! ;)
What kind of monitoring scripts were you using? Custom?
Enjoyed the write-up, but it pains me to think that most of it could have been easily avoided. Testing, careful planning and a little up-front reading (Web Operations by John Allspaw, Scalable Internet Architectures by Theo Schlossnagle) would have provided the foresight to implement many of these "fixes" as sound decision choices from the get-go.
Excellent article!! ... your servers will DIE LoL !!
Well done Steffen! I love case studies like this one. I liked your post so much that I just wrote a blog about it entitled "Web Performance Tuning Never Ends". http://loadstorm.com/2010/web-performance-tuning-never-ends
You are given credit and an inbound link. There is additional commentary from me of course. Seemed like it needed a few mentions of load testing. ;-)
If you want me to link to your High Performance blog too, please let me know.
I look forward to reading more of your posts. Thanks for sharing.
I would maybe add to this that you should "try and get 64bit servers if you are running MySQL + InnoDB. This was a problem for me because I wanted to do 'innodb_buffer_pool_size=4GB' and InnoDB can only handle a max. of 2GB on 32bit machines"
Steffen, I'm curious why you used Varnish? It doesn't seem like your infrastructure is so large that you'd need a Varnish-type service/server..
Nginx (or Lighttpd) is supposed to serve static content much faster than Apache..
Marko: one should rather say "use PHP (or any other interpreted language) and your servers will DIE". ;-P And Apache isn't very efficient at all. Some keywords: JavaScript frameworks, Comet, Bayeux, nginx, POE... Generally speaking – let the browsers do the hard work, while your servers keep serving data.
Please go into more detail on how DRBD screwed up..
What filesystem were you using for DRBD to be active/active?
Thanks for your feedback!
@Greg:
We were using the MySQL tools (live queries/sec graph) and Munin – and sometimes a simple "top" call is worth more than a thousand colored pictures :)
@Jason:
I agree with you to a certain degree. Some issues could have been spotted earlier with more intense testing and, of course, a bit more foresight. Unfortunately we ran a bit out of time, and when you don't know where to start looking for issues, you can't find them. Reading books is very important, but in my opinion theoretical knowledge is worth nothing as long as you can't gain your own experience. Sometimes the "hard way" of doing things teaches much more than a book. ;)
@Maxim R,
We decided to use Varnish to reduce the I/O load as well as the network latency, to get a more responsive site. After the huge DRBD disaster and the messed-up start in general, we quickly decided to add another webserver. This decision also required a new strategy for storing the files, since the DRBD config only works with 2 servers. So we had to buy a storage server (SAS RAID 5), put the files on that machine and mount it via NFS (that's another annoying story). So, in order to keep the latency low, we tried Varnish and it works like a charm.
@Cowmix:
I can't tell you much about the details... it was a simple "active-active" setup. Both servers started writing tons of cache files and probably also ran into race conditions – trying to write the same files at the same time. By "tons of files" I mean something between 20k-50k. At the same time the system load was pretty high, and suddenly both machines crashed with a totally f* up filesystem. I've never seen anything like that again (fortunately!)... I have to say, all this happened a few years ago and DRBD has made good progress in the meantime.
If you have any further questions, let me know and/or stop by my blog. I'll probably come up with a new article soon – why "tuning and benchmarking can lead to serious addiction"!
Replacing Apache with Lighttpd, LiteSpeed, Nginx, etc, at first signs of an increasing server load is not the best move to make.
You have to 1) figure out which MPM module to use: prefork, worker or event, 2) decide on mod_php vs. FastCGI, and 3) tune the MPM settings for the job.
Doing so can easily increase requests/second and decrease load by a factor of 3x, sometimes even more.
Erm... active-active DRBD was never intended to be used with standard filesystems. It's one of the first things mentioned in the manual:
http://www.drbd.org/users-guide/ch-fundamentals.html
You'd need a real clustering filesystem (like OCFS) to do that sort of thing. Or NFS.
Nice article! This is something that even applies to sites that don't need to scale to 1000s of servers :)
Put Smarty compile and template caches on an active-active DRBD cluster with high load and your servers will DIE!
I don't think this is very fair. If cache files can mess up your filesystem, then so can other files. The real lesson here should be "if you use file or storage systems, however widely used, stress-test them in advance". A lot of sites use DRBD in all kinds of scenarios today without problems.
Put an out-of-the-box webserver configuration on your machines, do not optimize it at all and your servers will DIE!
The thing that usually kills you here is not so much the performance of the webserver, but the fact that if you use PHP "in process" (i.e. via mod_php), then every static request, even for a tiny GIF, ties up a process that is several megabytes in size. If you get lots of page views and every page pulls in lots of images, JavaScript and CSS files, this can add up quickly. So you don't have to go all the way and throw out Apache completely; you can instead drop mod_php and use FastCGI. That way you can still leverage the many resources out there helping you tune Apache.
If you start from scratch, though, I'd recommend Nginx + PHP 5.3.3 with FPM.
Forget about caching and you will either waste a lot of money on hardware or your servers will die!
That should probably be the number 1 lesson of all. This alone might have prevented the DRBD blow-out and the database load issues if it had been introduced earlier. This cannot be stressed enough: even a little bit of caching can reduce your overall infrastructure load by a huge amount.
A final point: measure, measure, measure.
You cannot optimize your infrastructure if you don't know what's going on. Understand where the choke points are and understand *why* they are the choke points. Adding CPU power isn't going to help make your database faster when you are I/O bound. The first step is to find out that you are I/O bound, but the second step is to find out *why* you are I/O bound. Maybe an index is all that is needed to fix the problem, or maybe 50% of the load is created by an ugly query for a bit of content that could be cached on the webserver? Having I/O problems doesn't necessarily mean you have to buy more or faster disks.
Very interesting. I could have spent weeks finding the inode bug myself.
Just wondering though, had you done any load testing beforehand? It seems like it should have caught the first issue with DRBD.
I don't pretend to be an expert on scalability, but this post just saddened me.
Each “lesson” brought up a giant *facepalm*.
Throwing Apache out when you hadn't customized its config, only to substitute it with Lighttpd and then Nginx because of "random error messages", is like substituting one kind of magic with another. Might as well pray that Nginx never produces "random" error messages too.
And then I just don't understand how using a caching solution only occurred to you by the time your SQL servers were crashing left and right.
This is why companies hire sysadmins. Coders, even experienced coders, are not supposed to do this job (beyond coding sanely, that is). Trial and error system administration is no way to ensure scalability.
What was the reason for choosing DRBD? Was it for uploaded files or for storing the PHP code of the CMS?
This reeks of incompetence. DRBD is blamed for trashing a filesystem. I find that hard to believe, I've used it (for several years) in much larger scenarios and it performs admirably.
Also, there is no mention _at_all_ of any profiling of SQL statements which is fairly easy to do even with MySQL. That can buy you an order of magnitude if the coder was lazy (most are, in my experience). After that I would profile PHP execution. That's harder but you can get at least an idea with a bit of timing statements.
Then the very idea of deploying a production system without testing the load ... that's just... unprofessional, is the word I'll use.
http://wp.me/p4XzQ-9j has a reply to this post.
Thanks for the post! We need more people with this kind of brutal honesty so we can all get better at what we do.
The solution can be wrapped up with one word: MICROSOFT
@ltp Yes I'll have to agree. Microsoft caused all of their problems.
Umm... did anyone think of doing performance/scalability testing before launching this whole mess? As others have said, you could have avoided a lot of the problems if performance/scalability testing had been factored in from the beginning. In this day and age it should be one of the first things you think of and plan for on a project. We are no longer living in 1999/2000, the wild west of the internet.
My question is how much did the downtime and rework cost you and your customers? I bet it was quite large (money and loss of customer good faith due to a failing site), and the cost of having a consultant come in and do the Performance Test work would have been small in comparison.
Hopefully you, your company, and its customers have learned a valuable lesson regarding this. Do Performance Testing and start on it from the beginning. Find related articles by Scott Barber on this and learn to use the information he provides. Scott is one of the top Performance Test consultants around.
Jim
Thanks for sharing your experiences. I'm sure we all can learn something from it.