6 Ways to Kill Your Servers - Learning How to Scale the Hard Way

This is a guest post by Steffen Konerow, author of the High Performance Blog.
Learning how to scale isn’t easy without any prior experience. Nowadays you have plenty of websites like highscalability.com to get some inspiration, but unfortunately there is no solution that fits all websites and needs. You still have to think on your own to find a concept that works for your requirements. So did I.
A few years ago, my bosses came to me and said “We’ve got a new project for you. It’s the relaunch of a website that already has 1 million users a month. You have to build the website and make sure we’ll be able to grow afterwards”. I was already an experienced coder, but not at this scale, so I had to start learning how to scale – the hard way.
The software behind the website was a PHP content management system, based on Smarty and MySQL. The first task was finding a proper hosting company who had the experience and would also manage the servers for us. After some research we found one, told them our requirements and ordered the suggested setup:
- Load balancer (+ fallback)
- 2 webservers
- MySQL server (+ fallback)
- Development machine
They said that would be all we needed – and we believed them. What we got was:
- Load balancer (single core, 1GB RAM, Pound)
- 2 webservers (dual core, 4GB RAM, Apache)
- MySQL server (quad core, 8GB RAM)
- Development machine (single core, 1GB RAM)
The setup was very basic, without any further optimization. To synchronize the files (PHP and media files) between the two webservers, they installed DRBD in an active-active configuration.
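For illustration, an active-active DRBD resource of that era looked roughly like the sketch below (hostnames, devices and IP addresses are made up, not our actual configuration). Both nodes mount the same replicated block device read-write – which, as we were about to learn, is only safe with a cluster-aware filesystem on top.

# /etc/drbd.d/web.res – illustrative sketch, not the original configuration
resource web {
  protocol C;              # synchronous replication between the two nodes
  net {
    allow-two-primaries;   # both nodes may be primary at the same time (active-active)
  }
  startup {
    become-primary-on both;
  }
  on web1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on web2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
# Note: with allow-two-primaries you need a cluster filesystem such as OCFS2 or GFS2 on top;
# mounting an ordinary filesystem read-write on both nodes will corrupt it sooner or later.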
Eventually the relaunch came – of course we were all excited. Very early in the morning we switched the domains to the new IPs, started our monitoring scripts and stared at the screens. Traffic hit the machines almost instantly and everything seemed to be working fine. The pages loaded quickly, MySQL was serving lots of queries and we were all happy.
Then, suddenly, our telephones started to ring: “We can’t access our website, what’s going on?”. We looked at our monitoring software and indeed – the servers were frozen and the site was offline! Of course, the first thing we did was call our hosting company: “Hey, all our servers are dead. What’s going on?”. They promised to check the machines and call back immediately afterwards. The call came: “Well,… erm… your filesystem is completely fubar. What did you do? It’s totally screwed”. They stopped the load balancers and told me to have a look at one of the webservers. Looking at the index.php file I was shocked. It contained weird fragments of C code, error messages and something that looked like log files. After some further investigation we found out that DRBD was the cause of this mess.
Lesson #1 learned
Put Smarty compile and template caches on an active-active DRBD cluster with high load and your servers will DIE!
While our hosting company was fixing the webservers, I rewrote parts of the CMS to store the Smarty cache files on the servers’ local filesystems. Issue found & fixed. We went back online! Hurray!!!
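The change itself was only a few lines in the CMS bootstrap – roughly something like this (paths and lifetimes are illustrative, not the original values). The important part is that compile_dir and cache_dir point to a purely local disk, while the read-only templates can stay on the shared storage:

<?php
// Illustrative sketch of the Smarty setup after the fix – not the original CMS code.
require_once 'Smarty.class.php';

$smarty = new Smarty();

// Templates are only read at runtime, so they may remain on the replicated share.
$smarty->template_dir = '/var/www/shared/templates/';

// Compiled templates and page caches are rewritten constantly under load,
// so they live on each webserver's local filesystem instead of the DRBD device.
$smarty->compile_dir = '/var/cache/smarty/compile/';
$smarty->cache_dir   = '/var/cache/smarty/cache/';

$smarty->caching        = 1;    // enable Smarty's output caching
$smarty->cache_lifetime = 300;  // seconds; we raised this later to buy time under load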
Now it was early afternoon. The website usually reaches its peak between late afternoon and early evening; at night the traffic drops to almost nothing. We kept staring at the monitoring software and we were all sweating. The website was loading, but the later it got, the higher the system load and the slower the responses. I increased the lifetime of the Smarty template caches and hoped it would do the trick – it didn’t! Very soon the servers started returning timeouts, white pages and error messages. The two machines couldn’t handle the load.
Our customer was getting a bit nervous, but he said: “OK, relaunches usually cause some issues. As long as you fix them quickly, it will be fine!”
We needed a plan to reduce the load and discussed the issue with our hosting company. One of their administrators came up with a good idea: “Guys, your servers are currently running a pretty common Apache + mod_php setup. How about switching to an alternative webserver like Lighttpd? It’s a fairly small project, but even Wikipedia is using it”. We agreed.
Lesson #2 learned
Put an out-of-the-box webserver configuration on your machines, do not optimize it at all and your servers will DIE!
The administrator did his best and reconfigured both webservers as quickly as he could. He threw away the Apache configuration and switched to Lighttpd + FastCGI + XCache. When we went back online later, we could hardly stand the tension anymore. How long would the servers last this time?
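I no longer have the original configuration, but a minimal Lighttpd + FastCGI setup of that time looked roughly like this (paths and process counts are illustrative); XCache itself is simply an opcode-cache extension enabled in php.ini:

# /etc/lighttpd/lighttpd.conf (excerpt, illustrative values)
# PHP runs as a small pool of FastCGI processes instead of inside each Apache worker.
server.modules += ( "mod_fastcgi" )

fastcgi.server = ( ".php" =>
  (( "bin-path"        => "/usr/bin/php-cgi",
     "socket"          => "/tmp/php-fastcgi.socket",
     "max-procs"       => 4,
     "bin-environment" => (
       "PHP_FCGI_CHILDREN"     => "16",
       "PHP_FCGI_MAX_REQUESTS" => "10000"
     )
  ))
)

# php.ini – enable the XCache opcode cache:
#   extension   = xcache.so
#   xcache.size = 64M

Unlike Apache prefork with mod_php, this keeps a small, fixed pool of PHP processes instead of one heavyweight worker per connection.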
The servers did surprisingly well. The load was MUCH lower than before and the average response time was good. After this huge relief we went home and got some sleep. It was already late and we came to the conclusion there was nothing left we could do.
Over the next days the website did rather well, but at peak times it was still close to crashing. We identified MySQL as the bottleneck and called our hosting company again. They suggested MySQL master-slave replication with a slave on each webserver.
Lesson #3 learned
Even a powerful database server has its limits and when you reach them – your servers will DIE!
In this case the database became so slow at some point that the incoming and queued network connections killed our webservers – again. Unfortunately this issue wasn’t easy to fix. The content management system was pretty simple in this regard and had no built-in support for separating reading and writing SQL queries. It took a while to rewrite everything, but the result was astonishing and worth every minute of lost sleep.
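Conceptually the rewrite boiled down to routing every query through one helper that decides between the master and the local slave. A very simplified sketch of the idea (host names, credentials and the helper itself are made up for illustration):

<?php
// Simplified read/write split – an illustration, not the original CMS code.
// Writes go to the master, reads go to the replication slave on the local webserver.
$master = new mysqli('db-master.example.com', 'cms', 'secret', 'cms');
$slave  = new mysqli('127.0.0.1',             'cms', 'secret', 'cms');

function db_query($sql)
{
    global $master, $slave;

    // Treat SELECT/SHOW statements as reads, everything else as writes.
    $is_read = preg_match('/^\s*(SELECT|SHOW)\b/i', $sql);

    $conn = $is_read ? $slave : $master;
    return $conn->query($sql);
}

// Usage:
//   $res = db_query('SELECT id, title FROM articles ORDER BY created DESC LIMIT 10');
//   db_query('UPDATE articles SET views = views + 1 WHERE id = 42');

One caveat with this pattern: replication lag means a read issued right after a write may return stale data, so reads that absolutely must be fresh have to go to the master.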
The MySQL replication really did the trick and the website was finally stable! YEAH! Over the next weeks and months the website became a success and the number of users started to increase constantly. It was only a matter of time until the traffic would exceed our resources again.
Lesson #4 learned
Stop planning in advance and your servers are likely to DIE.
Fortunately we kept thinking and planning. We optimized the code, reduced the number of SQL queries needed per pageload and then stumbled upon MemCached. At first I added MemCached support to some of the core functions, as well as to the heaviest (slowest) functions. When we deployed the changes we couldn’t believe the results – it felt a bit like finding the Holy Grail. We reduced the number of queries per second by at least 50%. Instead of buying another webserver we decided to make even more use of MemCached.
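The pattern was plain cache-aside: look in MemCached first, fall back to MySQL on a miss, then store the result with a short lifetime. A stripped-down sketch using the classic Memcache extension (key names, lifetimes and the db_query() helper are illustrative):

<?php
// Cache-aside sketch – an illustration, not the original CMS code.
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

function get_article($id)
{
    global $memcache;

    $key  = 'article_' . (int)$id;
    $data = $memcache->get($key);

    if ($data !== false) {
        return $data;                 // cache hit – no SQL query at all
    }

    // Cache miss: fetch from the database (db_query() as sketched above).
    $result = db_query('SELECT * FROM articles WHERE id = ' . (int)$id);
    $data   = $result->fetch_assoc();

    // Cache for five minutes – slightly stale but fast beats fresh but dead.
    $memcache->set($key, $data, 0, 300);

    return $data;
}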
Lesson #5 learned
Forget about caching and you will either waste a lot of money on hardware or your servers will die!
It turned out that MemCached helped us reduce the load on the MySQL servers by 70-80%, which also resulted in a huge performance boost on the webservers. The pages were loading much more quickly!
Eventually our setup seemed to be perfect. Even at peak times we no longer had to worry about crashes or slow-loading pages. Had we made it? No! Out of the blue, one of the webservers started having some kind of hiccups: error messages, white pages and so on. The system load was fine and in most cases the server worked – but only in “most cases”.
Lesson #6 learned
Put a few hundred thousand small files in one folder, run out of inodes and your server will die!
Yes, you read that correctly. We were so focused on MySQL, PHP and the webservers themselves that we didn’t pay enough attention to the filesystem. The Smarty cache files were stored on the local filesystem – all in one single directory. The solution was to put the Smarty caches on a dedicated ReiserFS partition. Furthermore, we enabled Smarty’s “use_sub_dirs” option.
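The option itself is a one-liner; with it enabled, Smarty hashes its cache and compile files into a tree of subdirectories instead of dumping them all into a single folder (the surrounding setup is the same sketch as above):

<?php
// In addition to the dedicated partition for the Smarty caches:
$smarty->use_sub_dirs = true;  // spread cache/compile files across subdirectories
                               // instead of hundreds of thousands of files in one folder

A quick df -i on the affected filesystem would have shown the exhausted inodes long before the white pages did.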
Over the following years we kept optimizing the pages. We put the Smarty caches into MemCached, installed Varnish to reduce the I/O load and serve static files more quickly, switched to Nginx (Lighttpd randomly produced error 500 messages), installed more RAM, bought better hardware, more hardware… the list is endless.
Conclusion
Scaling a website is a never-ending process. As soon as you fix one bottleneck, you’re very likely to stumble into the next one. Never start thinking “that’s it, we’re done” and lean back. It will kill your servers and perhaps even your business. Scaling is a constant process of planning and learning. If you can’t get a job done on your own because you lack the experience and/or resources, find a competent and reliable partner to work with. Never stop talking with your team and partners about the current issues and the ones that might arise in the (near) future. Think ahead and be proactive!
Reader Comments (47)
Yeah, great article, I really enjoyed it. It's simple, I guess: the more visitors on a webserver, the more load it gets. Keeping machines running at optimal efficiency is called system administration, I guess ;) Nevertheless, good job on the article and on fixing the performance issues. Thumbs up.
Thank you for sharing. It's good to learn from our mistakes. Unfortunately, quite often time constraints impose that on us. Then the "smart" experts can gather the experience and make money writing a book about it...
Cheers
I'm glad to see that you've at least learned how to blame your hosting provider for your lack of competency. Good show!
It's probably pretty hard to identify a competent hosting company if you don't know what to look for. Most of this was clearly a failure on their part, given that you are a coder, not an operations guy. I hope you know that you don't achieve high availability to any degree with that setup.
Don't read this article and you will DIE!
Since when has going live been the first point of stress testing?
Step 1: Load test
Step 2: Bug fix load test results and repeat
Step 3: Stress test
Step 4: Bug fix stress test results and repeat
Step 5: Go live
10 years ago, 1M users a month seemed like a lot. These days my phone can handle that. Maybe we won't need that many datacenters in 10 years, because by then every server will be 100 times as powerful as it is today.
I definitely agree with the posts about load & performance testing early and often, especially with Agile development. The cost of poorly performing sites is measurable. Google, Bing, Yahoo, Shopzilla, Aberdeen, et al. have shown real-world examples of 7-12% improvements in revenue associated with performance enhancements (e.g. response times). Want details? See web performance tuning.
Who would turn down an extra 10% of cash flow? Then why overlook load testing? It shouldn't be optional before any product release, or for that matter, any upgrade.
As for 1 million users/month, that's under 100 concurrent users at peak load (based on a few assumptions). We have customers testing with 50,000 concurrent users. While I don't know if my phone can handle 50 concurrent users, I agree that the load against web apps has increased dramatically in the past couple of years.
Umm, what about Lesson #0: always have backups, as something CAN and WILL fail.
Very enjoyable post, very nice style of writing. Found it a great insight into the problems that can occur when scaling a large website. Thanks Todd!
Nice to see that you folks were learning. But if I'd been your boss I'd have fired your ass for learning in a production environment at the expense of the customers and the company's reputation.
- Bob
It's good to have a text like this, giving a clear overview of the mistakes to look out for, although most of the things mentioned here are very basic. Limiting the number of files per folder, using memcached, DB replication – you really should have planned for that ahead of time. As well as doing some proper benchmarks and stress tests BEFORE the launch...
But, again, don't take this as criticism, I totally know how it goes: short deadlines, and you always postpone the things that seem non-critical. But from my experience, when making an estimate for a high-traffic site, always plan some extra time to stress test it properly – that time is never wasted.
Good article; nice and to the point. Don't quite understand why so many people feel the need to vent the obvious opinions. Time constraints force us all into compromising to meet deadlines and then fire-fighting to make things stable. That's the world of dev :o)
One element that I'm curious about, and doesn't seem to be touched on too much here, is the previous site.
What were the motivations behind the relaunch and/or move to a new data center? Unless I read the post incorrectly, you had an existing site that was already handling the load.
Was the site just being given a new face lift with new features? Or was it a complete rewrite on a different platform/architecture and the current hosting facility couldn't handle it? Was it a move designed to reduce costs by reducing hardware?
The post mentions the boss wanting to be able to scale/grow after the relaunch. What did the analysis of the current site consist of that led to the conclusion that it couldn't handle the required scale/growth?
Most important lesson here: if the site is important enough, it is worth the cost to hire a company to load test your setup before you go live. LoadRunner and other tools are necessary.
Interesting story! What was the matter with the disk-block replication? Did it corrupt the disks? Could it have worked in an active-passive configuration? Anyways, I was inspired by the story and wrote a blogpost with some thoughts and classifications on scalability patterns and how it could classify your lessons learned. It's at http://thebigsoftwareblog.blogspot.com/2010/08/scalability-fundamentals-and.html.
LESSON #7: Skip load tests only when your users will return even after repeated downtime.
No seriously tnx for sharing :)
Really, you should be testing applications before just pushing them into a production environment. There are so many tools available – and there were even a few years ago – to perform load and functional testing.
Great article, these are all great points. Particularly about how maintaining a website is simply tending to the next performance bottleneck, ad infinitum.
However, what I've found is that while most new websites start with load-balanced configurations, few seem to start with a MySQL master-slave pair. MySQL tends to be the biggest bottleneck in most LAMP applications, but for some reason it is ignored. Any ideas why that is?
I hope the negative posters never attend an AA meeting.
More "Thanks for sharing" comments with constructive discussion and less comments akin to "Well that's your problem right there - too much alcohol! Ok, I guess we are done here...".
Solutions are great, personal commentary is unpleasant to read (and cannot be sent to /dev/null).
More testing? Of course! But why was there not enough testing? Sounds like root-cause analysis time.
The title of this article could also be "What happens when a developer does not understand systems engineering and migrates to a new platform": the server DIES and the admins don't sleep!
It seems that you can save your hair (and big $$$) by just using more efficient programs.
This study has shown how pointless caching can be when Web servers can do it much faster:
http://nbonvin.wordpress.com/2011/03/24/serving-small-static-files-which-server-to-use/
And, adding insult to injury, the fastest server in the test is an application server which supports a KV store – meaning that you can do without a proxy, a cache, a database backend, a FastCGI backend, and several sloppy web servers...