- High Scalability

This is a guest a post by Alvaro Videla describing their architecture for Poppen.de, a popular German dating site. This site is very much NSFW, so be careful before clicking on the link. What I found most interesting is how they manage to sucessfully blend a little of the old with a little of the new, using technologies like Nginx, MySQL, CouchDB, and Erlang, Memcached, RabbitMQ, PHP, Graphite, Red5, and Tsung.

What is Poppen.de?

Poppen.de (NSFW) is the top dating website in Germany, and while it may be a small site compared to giants like Flickr or Facebook, we believe it's a nice architecture to learn from if you are starting to get some scaling problems.

The Stats

2.000.000 users
20.000 concurrent users
300.000 private messages per day
250.000 logins per day
We have a team of eleven developers, two designers and two sysadmins for this project.

Business Model

The site works with a freemium model, where users can do for free things like:

Search for other users.
Write private messages to each other.
Upload pictures and videos.
Have friends.
Video Chat.
Much more…

If they want to send unlimited messages or have unlimited picture uploads then they can pay for different kinds of membership according to their needs. The same applies for the video chat and other parts of the web site.

Toolbox

Nginx

All our site is served via Nginx. We have 2 frontend Nginx servers delivering 150.000 requests to www.poppen.de per minute during peak time. They are four years old machines with one CPU and 3GB of RAM each. Then we have separate machines to serve the site images. There are 80.000 requests per minute to *.bilder.poppen.de (image servers) served by 3 nginx servers.

One of the cool things that Nginx lets us do is to deliver many requests out of Memcached, without the need of hitting the PHP machines to get content that is already cached. So for example, the users profiles are one of the most CPU intensive pages in the site. Once the profile has been requested we cache the whole content on Memcached. Then the Nginx server will hit the Memcached and deliver the content from there. There are 8000 requests per minute delivered out of the Memcached.

We have 3 Nginx servers that are delivering the images from a local cache. The users upload their pictures to a central file server. A picture request will then hit one of the 3 Nginx servers. If the picture is not in the local cache filesystem, the Nginx will download the picture from the central server, store in its local cache and serve it. This lets us load balance the image distribution and alleviate the load in the main storage machine.

PHP-FPM

The site is running on PHP-FPM. We use 28 PHP machines with two CPUs and 6GB of memory each. They run 100 PHP-FPM worker processes each. We use PHP 5.3.x with APC enabled. The 5.3 version of PHP allowed us to reduce 30%+ of both CPU and Memory usage.

The code is written using the symfony 1.2 PHP framework. On one hand this means extra resource footprint, on the other hand it gives us speed of development and a well know framework that lets us integrate new developers to the team with ease. Not everything is "Flowers and Roses" here. So while we have a lot of advantages provided by the framework, we had to tweak it a lot to get it up to the task of serving www.poppen.de.

What we did was profile the site using XHProf –Facebook's profiling library for PHP– and then optimize the bottlenecks. Thanks to the fact that the framework is easy to customize and configure, we were able to cache most of the expensive calculations that were adding extra load to the servers in APC.

MySQL

MySQL is our main RDBMS. We have several MySQL servers: A 32GB machine with 4 CPUs storing all the users related information, like profiles, pictures metadata, etc. This machine is 4 years old. We are planning to replace it by a sharded cluster. We are still working on the design of this system, trying to have a low impact in our data access code. We want to partition the data by user id, since most of the information on the site is centered on the user itself, like images, videos, messages, etc.

We have 3 machines working in a master-slave-slave configuration for the users' forum. Then there's a cluster of servers that runs as storage for the web site custom message system. Currently it has more than 250 million messages. They are 4 machines configured in a master slave master/slave slave system.

We also have an NDB cluster composed by 4 machines for write intensive data, like the statistics of which user visited which other user's profile.

We try to avoid joins like the plague and cache as much as possible. The datastructure is heavily denormalized. For that we have created summary tables, to ease searching.

Most of the tables are MyISAM which provides fast lookups. The problem we are seeing more and more are full table locks. We are moving to the XtraDB storage engine.

Memcached

We use Memcached heavily. We have 45 GB of cache over 51 nodes. We use it for Session Storage, View Cache –like for user profiles–, and Function Execution cache –like queries–, etc. Most of the queries by primary key that we have to the users table are cached in Memcached and then delivered from there. We have a system that lets automatically invalidate the cache every time one record of that table is modified. One possible solution to improve cache invalidation in the future is to use the new Redis Hash API or MongoDB. With those databases we can update the cache with enough granularity to not need to invalidate it.

RabbitMQ

Since mid 2009 we introduced RabbitMQ into our stack. It's been a solution that was easy to deploy and integrate with our system. We run two RabbitMQ servers behind LVS. During the last month we have been moving more and more stuff to the queue, meaning that at the moment the 28 PHP frontend machines are publishing around 500.000 jobs per day. We send logs, email notifications, system messages, image uploads, and much more to the queue.

To enqueue messages we use one of the coolest features provided by PHP-FPM which is the fastcgi_finish_request() function. This allows us to send messages to the queue in an asynchronous fashion. After we generate the HTML or JSON response that must be sent to the user, we call that function, this means that the user doesn't have to wait for our PHP script to cleanup, like closing Memcached connections, DB connections, etc. At the same time, all the messages that where hold in an array in memory are then sent to RabbitMQ. In this way the user doesn't have to wait for this either.

We have two machines dedicated to consume those messages, running at the moment 40 PHP processes in total to consume the jobs. Each PHP process consumes 250 jobs and then dies and respawns again. We do that to avoid any kind of garbage collection problems with PHP. In the future we may increase the number of jobs consumed per session in order to improve the performance, since respawing a PHP process proved to be quite CPU intensive.

This system lets us improve the resource management. For example during peak time we can even have 1000 logins per minute. This means that we will have 1000 concurrent updates to the users table, to store the user last login time. Because now we enqueue those queries, we can run each of them sequentially instead. If we need more processing speed we can add more consumers to the queue, even joining machines to the cluster, without the need of modifying any configuration or deploying any new code.

CouchDB

To store the logs we run CouchDB in one machine. From there we can query/group the logs by module/action; by kind of errors, etc. It proved to be useful to detect where the problem is. Before having CouchDB as a log aggregator, we had to login and tail -f in each of the PHP machines and from there try to find where the problem was. Now we relay all the logs to the queue, and then a consumer inserts them into CouchDB. In this way we can check for problems at a centralized place.

Graphite

We use Graphite to collect real time information and statistics from the website. From requests per module/action to Memcached hits/misses, RabbitMQ status monitoring, Unix Load of the servers and much more. The Graphite server is getting around 4800 update operations per minute. This tool has proven to be really useful to see what's going on in the site. It's simple text protocol and the graphing capabilities make it easy to use and nearly plug and play to any system that we want to monitor.

One cool thing that we did with Graphite was monitoring two versions of the site running at the same time. Last January we deployed our code backed by a new version of the symfony framework. This meant that we will probably encounter performance regressions. We were able to run one version of the site in half of the servers while the new version was running in the others. Then in Graphite we created Unix load graphs for each half and then compared them live.

Since we found that the Unix load of the new version was higher, we launched the XHProf profiler and compared both versions. We use APC to store "flags" that lets us enable/disable XHProf without the need of redeploying our code. We have a separate server where we send the XHProf profiles and from there we aggregate them and analyze them to find where the problems are.

Red5

Our site also serves video to the users. We have two kinds of them. One are videos from the user profiles which are movies produced and uploaded by the users. Also we have a Video Chat to let our users interact and share their videos. On mid 2009 we were streaming 17TB of video per month to our users.

Tsung

Tsung is a distributed benchmarking tool written in Erlang. We use it to do HTTP benchmarks and also to compare different storage engines for MySQL that we plan to use, for example the new XtraDB. We have a tool to record traffic to the main MySQL server and convert that traffic to Tsung benchmarking sessions. Then we replayed back that traffic and hit the machines in our lab with thousands of concurrent users generated by Tsung. The cool thing is that we could produce test scenarios that look closer to what's happening in the real production environment.

Lessons Learned

While Buzz Oriented Development is cool, look for tools with an important community behind them. Documentation and a good community are invaluable when there are problems to solve, or when you need to incorporate people to your team. symfony provides that, with more than 5 official books published which can be obtained for free. CouchDB and RabbitMQ also have good support from their developers, with active mailing list where questions are answered in time.
Get to know what you are using and what the limitations of those systems/libraries are. We learned a lot from symfony. Where it could be tweaked and what could be improved. The same we can say about PHP-FPM, just by reading the docs we found the mighty fastcgi_finish_request() function which proved to be immensely useful. Another example is Nginx, several problems that we had were already solved by the Nginx community, like what we explained about the image storage cache.
Extend the tools. If they are working well there's no need to introduce new software into the current stack. We have written several Nginx modules that have even been tested by the Nginx community. In this way you contribute back.
Don't be conservative with what doesn't matter. Graphite seemed to be a cool tool for running in production, but there wasn't so many reports about it. We just had to give it a try. If it hadn't worked, we could have just disabled it. Nobody will cry if we couldn't get a nice graph of Unix Load in our systems. Today even our Product Managers love it.
Meassure everything: Unix Load, Site Usage, Memcached Hits/Misses ratio, Requests per module/action, etc. Learn to interpret those metrics.
Create tools that let you react to problems as fast as possible. If you have to rollback a deployment, you don't want to spend more than a second doing that.
Create tools that let you profile the site live. In the lab most tests give optimistic information, but then fail to cope with production load.

The Future

Build a new more scalable Message System, since the current version is quite old and wasn't designed for such an amount of messages.
Move more and more processing tasks to the queue system.
Add more Erlang applications to our system. RabbitMQ has proven to work well for us and the same we can say of CouchDB. They were systems easy to install and deploy, increasing our trust in Erlang as a language/platform.
We are looking for a replacement for Red5, probably the Oneteam Media Server which is written in Erlang. While at the moment we are using open source Erlang products, we may start writing our own applications with the language because now we have experience with it.
Improve our log visualization tools.

I'd like to thanks Alvaro Videla for this excellent write up. If you would like to share the architecture for your fablous system, please contact me and we'll get started.