Monday, April 12, 2010

Poppen.de Architecture

This is a guest post by Alvaro Videla describing the architecture of Poppen.de, a popular German dating site. The site is very much NSFW, so be careful before clicking on the link. What I found most interesting is how they manage to successfully blend a little of the old with a little of the new, using technologies like Nginx, MySQL, CouchDB, Erlang, Memcached, RabbitMQ, PHP, Graphite, Red5, and Tsung.

What is Poppen.de?

Poppen.de (NSFW) is the top dating website in Germany, and while it may be a small site compared to giants like Flickr or Facebook, we believe it's a nice architecture to learn from if you are starting to run into scaling problems.

The Stats

  • 2,000,000 users
  • 20,000 concurrent users
  • 300,000 private messages per day
  • 250,000 logins per day
  • We have a team of eleven developers, two designers, and two sysadmins for this project.

Business Model

The site works with a freemium model, where users can do the following for free:

  • Search for other users.
  • Write private messages to each other.
  • Upload pictures and videos.
  • Have friends.
  • Video Chat.
  • Much more…

If they want to send unlimited messages or have unlimited picture uploads, they can pay for different kinds of membership according to their needs. The same applies to the video chat and other parts of the website.

Toolbox

Nginx

The whole site is served via Nginx. We have two frontend Nginx servers delivering 150,000 requests per minute to www.poppen.de during peak time. They are four-year-old machines with one CPU and 3GB of RAM each. We also have separate machines serving the site's images: 80,000 requests per minute to *.bilder.poppen.de (the image servers) are handled by three Nginx servers.

One of the cool things Nginx lets us do is serve many requests straight out of Memcached, without hitting the PHP machines for content that is already cached. For example, user profiles are among the most CPU-intensive pages on the site. Once a profile has been requested, we cache the whole page in Memcached; Nginx then fetches it from Memcached and delivers it directly. Around 8,000 requests per minute are served out of Memcached this way.
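
A minimal sketch of what such a setup can look like, using Nginx's built-in memcached module (the hostnames, key scheme, and @php fallback below are illustrative assumptions, not Poppen.de's actual configuration):

```nginx
# Try Memcached first; regenerate via PHP-FPM only on a cache miss.
location /profile/ {
    set            $memcached_key "view:$request_uri";  # hypothetical key scheme
    memcached_pass 10.0.0.1:11211;
    default_type   text/html;
    # The memcached module returns 404 on a miss; fall through to PHP.
    error_page     404 502 504 = @php;
}

location @php {
    include       fastcgi_params;
    fastcgi_param SCRIPT_FILENAME /var/www/index.php;
    fastcgi_pass  10.0.0.2:9000;
}
```

The PHP side then stores the rendered page under the same key when it builds the profile, so subsequent requests never touch PHP at all.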

We have three Nginx servers delivering the images from a local cache. Users upload their pictures to a central file server. A picture request then hits one of the three Nginx servers; if the picture is not in its local cache filesystem, Nginx downloads it from the central server, stores it in the local cache, and serves it. This lets us load balance image delivery and take load off the main storage machine.
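
The behavior described above matches Nginx's proxy_store pattern; a hedged sketch, with made-up paths and upstream names:

```nginx
# Serve images from local disk; on a miss, fetch from the central
# file server, keep a copy locally, and serve it.
location /images/ {
    root       /var/cache/images;
    error_page 404 = @fetch;
}

location @fetch {
    proxy_pass         http://central-file-server;  # illustrative upstream
    proxy_store        on;                          # write the fetched file to disk
    proxy_store_access user:rw group:rw all:r;
    root               /var/cache/images;           # same tree the cache serves from
}
```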

PHP-FPM

The site runs on PHP-FPM. We use 28 PHP machines with two CPUs and 6GB of memory each, each running 100 PHP-FPM worker processes. We use PHP 5.3.x with APC enabled. Moving to PHP 5.3 reduced both CPU and memory usage by more than 30%.

The code is written using the symfony 1.2 PHP framework. On one hand this means an extra resource footprint; on the other hand it gives us speed of development and a well-known framework that lets us integrate new developers into the team with ease. Not everything is "flowers and roses" here, though: while the framework gives us a lot of advantages, we had to tweak it considerably to get it up to the task of serving www.poppen.de.

What we did was profile the site using XHProf –Facebook's profiling library for PHP– and then optimize the bottlenecks. Because the framework is easy to customize and configure, we were able to cache in APC most of the expensive calculations that were adding extra load to the servers.
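
As an illustration of that kind of caching, here is a minimal sketch with APC's user cache (the key name and the computation are hypothetical):

```php
<?php
// Cache an expensive calculation in APC so each web server computes it
// at most once per TTL instead of on every request.
function get_expensive_value($userId)
{
    $key   = "calc:user:$userId";                   // hypothetical key scheme
    $value = apc_fetch($key, $hit);
    if (!$hit) {
        $value = compute_expensive_value($userId);  // placeholder for the real work
        apc_store($key, $value, 300);               // cache for five minutes
    }
    return $value;
}
```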

MySQL

MySQL is our main RDBMS. We have several MySQL servers. A 32GB machine with four CPUs stores all the user-related information: profiles, picture metadata, etc. This machine is four years old, and we are planning to replace it with a sharded cluster. We are still working on the design of this system, trying to keep the impact on our data access code low. We want to partition the data by user id, since most of the information on the site is centered on the user itself: images, videos, messages, etc.
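
Partitioning by user id usually means routing every query through a shard lookup; here is a sketch under the assumption of a simple fixed-modulo scheme (the article does not describe the final design, and the host names are invented):

```php
<?php
// Route a user's queries to the shard that owns their data.
// A fixed modulo is the simplest scheme; a directory service or
// consistent hashing makes it easier to add shards later.
const NUM_SHARDS = 4;

function shard_host_for_user($userId)
{
    $hosts = array('db-shard0', 'db-shard1', 'db-shard2', 'db-shard3');
    return $hosts[$userId % NUM_SHARDS];
}

// All user-centered data (images, videos, messages) lives on one shard,
// so pages about a single user never need cross-shard joins.
$pdo = new PDO('mysql:host=' . shard_host_for_user(42) . ';dbname=poppen', 'user', 'pass');
```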

We have three machines working in a master-slave-slave configuration for the users' forum. Then there's a cluster of servers that serves as storage for the site's custom message system; it currently holds more than 250 million messages on four machines in a master/slave configuration.

We also have an NDB cluster composed of four machines for write-intensive data, like the statistics of which user visited which other user's profile.

We try to avoid joins like the plague and cache as much as possible. The data structure is heavily denormalized, and we have created summary tables to ease searching.

Most of the tables are MyISAM, which provides fast lookups. The problem we are seeing more and more of is full table locks, so we are moving to the XtraDB storage engine.

Memcached

We use Memcached heavily, with 45GB of cache over 51 nodes. We use it for session storage, view cache (like user profiles), function execution cache (like queries), etc. Most of the primary-key queries against the users table are cached in Memcached and delivered from there. We have a system that automatically invalidates the cache every time a record in that table is modified. One possible way to improve cache invalidation in the future is to use the new Redis Hash API or MongoDB; with those databases we could update the cache with enough granularity to avoid invalidating it at all.
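
A minimal sketch of that write-through invalidation idea, using the pecl/memcached client (the key scheme and the save hook are assumptions):

```php
<?php
// Cache users by primary key; drop the entry whenever the row changes.
$cache = new Memcached();
$cache->addServer('memcached-host', 11211);

function cache_user(Memcached $cache, array $user)
{
    $cache->set('user:' . $user['id'], $user, 3600);
}

// Called from the model's save path (e.g. an ORM save() override).
function on_user_saved(Memcached $cache, array $user)
{
    // Delete rather than overwrite: the next primary-key lookup
    // repopulates the entry from MySQL with fresh data.
    $cache->delete('user:' . $user['id']);
}
```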

RabbitMQ

We introduced RabbitMQ into our stack in mid-2009. It has been a solution that was easy to deploy and integrate with our system. We run two RabbitMQ servers behind LVS. Over the last month we have been moving more and more work to the queue; at the moment the 28 PHP frontend machines publish around 500,000 jobs per day. We send logs, email notifications, system messages, image uploads, and much more to the queue.

To enqueue messages we use one of the coolest features provided by PHP-FPM: the fastcgi_finish_request() function. It lets us send messages to the queue asynchronously. After we generate the HTML or JSON response that must be sent to the user, we call that function, which means the user doesn't have to wait for our PHP script to clean up (closing Memcached connections, DB connections, etc.). At the same time, all the messages that were held in an array in memory are then sent to RabbitMQ, so the user doesn't have to wait for that either.
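
A hedged sketch of that publish-after-response pattern (class names are from the php-amqplib client as it exists today; the exchange name and message format are assumptions):

```php
<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$pending = array();  // jobs buffered in memory during the request

// ... request handling appends jobs to the buffer ...
$pending[] = json_encode(array('type' => 'log', 'msg' => 'user 42 logged in'));

echo $html;                // send the finished response
fastcgi_finish_request();  // flush it; the client is no longer waiting

// Everything below runs after the user already has the page.
$conn    = new AMQPStreamConnection('rabbitmq-host', 5672, 'guest', 'guest');
$channel = $conn->channel();
foreach ($pending as $body) {
    $channel->basic_publish(new AMQPMessage($body), 'jobs');  // 'jobs' exchange is illustrative
}
$channel->close();
$conn->close();
```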

We have two machines dedicated to consuming those messages, running 40 PHP processes in total. Each PHP process consumes 250 jobs, then dies and is respawned. We do that to avoid any kind of garbage collection problems with PHP. In the future we may increase the number of jobs consumed per session to improve performance, since respawning a PHP process proved to be quite CPU intensive.
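
A sketch of such a bounded-lifetime consumer (again php-amqplib; the queue name and job dispatch are placeholders):

```php
<?php
use PhpAmqpLib\Connection\AMQPStreamConnection;

const JOBS_PER_LIFE = 250;  // exit after this many jobs to sidestep PHP GC issues

$conn     = new AMQPStreamConnection('rabbitmq-host', 5672, 'guest', 'guest');
$channel  = $conn->channel();
$consumed = 0;

$channel->basic_consume('jobs', '', false, false, false, false,
    function ($msg) use (&$consumed) {
        handle_job($msg->body);  // placeholder for the real job dispatch
        $msg->ack();
        $consumed++;
    }
);

// Drain messages until the per-life limit, then exit; a supervisor
// (e.g. supervisord) respawns a fresh process with a clean heap.
while ($consumed < JOBS_PER_LIFE && $channel->is_consuming()) {
    $channel->wait();
}
```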

This system lets us improve resource management. For example, during peak time we can have as many as 1000 logins per minute. This means we would have 1000 concurrent updates to the users table to store each user's last login time. Because we now enqueue those queries, we can run each of them sequentially instead. If we need more processing speed we can add more consumers to the queue, even joining new machines to the cluster, without modifying any configuration or deploying new code.

CouchDB

To store the logs we run CouchDB on one machine. From there we can query/group the logs by module/action, by kind of error, etc. It has proven useful for detecting where problems are. Before we had CouchDB as a log aggregator, we had to log in and run tail -f on each of the PHP machines and try to find the problem from there. Now we relay all the logs to the queue, and a consumer inserts them into CouchDB, so we can check for problems in one centralized place.
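
Writing a log entry to CouchDB is a single HTTP POST of a JSON document; here is a sketch of the consumer-side insert (the database name and document fields are made up):

```php
<?php
// POST one log entry to CouchDB; the server assigns the document id.
function couchdb_log(array $entry)
{
    $ch = curl_init('http://couchdb-host:5984/logs');  // 'logs' db is illustrative
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($entry),
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
        CURLOPT_RETURNTRANSFER => true,
    ));
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

couchdb_log(array(
    'module'    => 'profile',
    'action'    => 'show',
    'level'     => 'error',
    'message'   => 'template render failed',
    'timestamp' => time(),
));
```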

Graphite

We use Graphite to collect real-time information and statistics from the website: requests per module/action, Memcached hits/misses, RabbitMQ status, Unix load of the servers, and much more. The Graphite server receives around 4,800 update operations per minute. This tool has proven really useful for seeing what's going on in the site. Its simple text protocol and its graphing capabilities make it easy to use and nearly plug-and-play with any system we want to monitor.
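
That text protocol really is just a line of the form "metric value timestamp" sent to Graphite's plaintext listener (port 2003 by default); a minimal sender, with an invented metric name:

```php
<?php
// Push one data point to Graphite's plaintext protocol listener.
function graphite_send($metric, $value)
{
    $sock = @fsockopen('graphite-host', 2003, $errno, $errstr, 1);
    if ($sock) {
        fwrite($sock, sprintf("%s %s %d\n", $metric, $value, time()));
        fclose($sock);
    }
}

graphite_send('www.memcached.hits', 8000);  // hypothetical metric name
```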

One cool thing we did with Graphite was monitor two versions of the site running at the same time. Last January we deployed our code backed by a new version of the symfony framework, which meant we would probably encounter performance regressions. We were able to run one version of the site on half of the servers while the new version ran on the others. Then in Graphite we created Unix load graphs for each half and compared them live.

Since we found that the Unix load of the new version was higher, we launched the XHProf profiler and compared both versions. We use APC to store "flags" that let us enable/disable XHProf without redeploying our code. We have a separate server where we send the XHProf profiles; from there we aggregate and analyze them to find where the problems are.
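
A sketch of the APC-flag idea (the flag key, profiling flags, and the way profiles are shipped off-box are assumptions):

```php
<?php
// Profile a request only when an operator has set the APC flag,
// so XHProf can be toggled on and off without a deploy.
$profiling = apc_fetch('flag:xhprof_enabled');
if ($profiling) {
    xhprof_enable(XHPROF_FLAGS_CPU | XHPROF_FLAGS_MEMORY);
}

// ... handle the request as usual ...

if ($profiling) {
    $data = xhprof_disable();
    // Spool the profile for later aggregation on the analysis server.
    file_put_contents(
        '/var/spool/xhprof/' . uniqid('run_') . '.xhprof',  // illustrative path
        serialize($data)
    );
}
```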

Red5

Our site also serves video to users. There are two kinds: videos on user profiles, which are movies produced and uploaded by the users, and a video chat that lets users interact and share their video. As of mid-2009 we were streaming 17TB of video per month.

Tsung

Tsung is a distributed benchmarking tool written in Erlang. We use it for HTTP benchmarks and also to compare the different storage engines for MySQL that we plan to use, for example the new XtraDB. We have a tool that records traffic to the main MySQL server and converts it into Tsung benchmarking sessions. We then replay that traffic, hitting the machines in our lab with thousands of concurrent users generated by Tsung. The cool thing is that we can produce test scenarios that look close to what's happening in the real production environment.
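
For reference, a Tsung HTTP session is declared in XML inside the tool's config file; a minimal fragment (URLs and think times are illustrative, not the recorded production traffic):

```xml
<!-- One session: load the homepage, pause, then view a profile. -->
<session name="browse-profile" probability="100" type="ts_http">
  <request>
    <http url="/" method="GET"/>
  </request>
  <thinktime value="3"/>
  <request>
    <http url="/profile/12345" method="GET"/>
  </request>
</session>
```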

Lessons Learned

  • While buzz-oriented development is cool, look for tools with a strong community behind them. Documentation and a good community are invaluable when there are problems to solve, or when you need to bring new people onto your team. symfony provides that, with more than five official books that can be obtained for free. CouchDB and RabbitMQ also have good support from their developers, with active mailing lists where questions are answered promptly.
  • Get to know what you are using and what the limitations of those systems/libraries are. We learned a lot about symfony: where it could be tweaked and what could be improved. We can say the same about PHP-FPM; just by reading the docs we found the mighty fastcgi_finish_request() function, which proved immensely useful. Another example is Nginx: several problems we had were already solved by the Nginx community, like the image storage cache described above.
  • Extend the tools. If they are working well, there's no need to introduce new software into the current stack. We have written several Nginx modules that have even been tested by the Nginx community. In this way you contribute back.
  • Don't be conservative about what doesn't matter. Graphite seemed to be a cool tool to run in production, but there weren't many reports about it, so we just had to give it a try. If it hadn't worked, we could have simply disabled it; nobody would have cried over missing a nice graph of Unix load on our systems. Today even our product managers love it.
  • Measure everything: Unix load, site usage, Memcached hit/miss ratio, requests per module/action, etc. Learn to interpret those metrics.
  • Create tools that let you react to problems as fast as possible. If you have to roll back a deployment, you don't want to spend more than a second doing it.
  • Create tools that let you profile the site live. In the lab, most tests give optimistic results but fail to predict behavior under production load.

The Future

  • Build a new, more scalable message system, since the current version is quite old and wasn't designed for such a volume of messages.
  • Move more and more processing tasks to the queue system. 
  • Add more Erlang applications to our system. RabbitMQ has proven to work well for us, and we can say the same of CouchDB. Both systems were easy to install and deploy, increasing our trust in Erlang as a language/platform.
  • Find a replacement for Red5, probably the Oneteam Media Server, which is written in Erlang. While at the moment we use open source Erlang products, we may start writing our own applications in the language, now that we have experience with it.
  • Improve our log visualization tools.

I'd like to thank Alvaro Videla for this excellent write-up. If you would like to share the architecture of your fabulous system, please contact me and we'll get started.

Reader Comments (21)

Let's do the math. 150k requests per minute to www.* means 2500 requests per second.
They have 28 PHP boxes with 100 processes each; that means 2800 PHP processes.
You need as many PHP processes as you have concurrent requests (not requests per second). That means either their scripts take 1 second each to execute or they have way too many processes.
Either way, something is broken.

I know of sites that serve this number of requests with PHP using 2-4 servers. Not 28.

Quote:
This system lets us improve resource management. For example, during peak time we can have as many as 1000 logins per minute. This means we would have 1000 concurrent updates to the users table to store each user's last login time.

No, that does not mean you have 1000 concurrent updates. It means you have ~16 logins per second, which means you maybe have 10-20 concurrent updates, most of the time a lot less.

Also note they have 50 memcached nodes. How many servers do they have to handle this moderate amount of load? It's insane.

Conclusion: not impressive, and I have not seen any new insights. I question the efficiency of their code a lot.

April 12, 2010 | Unregistered Commenterfrost

Hi Alvaro, thanks for that interesting insight into your architecture. I came up with two questions: how do you measure your concurrent users (timeout?), and why do you use so many nodes for Memcached with a quite small dataset? Regards, paul

April 12, 2010 | Unregistered Commenterpaul p

Can you provide a link to Graphite? It sounds interesting, and we're beginning to look at those systems, but it's such a common word that simple Google searches aren't coming up with anything I think is correct.

April 12, 2010 | Unregistered CommenterRichard

The Graphite site is available at http://graphite.wikidot.com/

April 12, 2010 | Unregistered CommenterSilas

@frost:

- 150,000 requests per minute to poppen.de doesn't mean that all of them hit the PHP backends.

"I know of sites that serve this number of requests with PHP using 2-4 servers. Not 28." That depends on what they do, what do they cache, how much they can cache from their request. How many partials components do they show? Is the site information completely dynamic? And the list of questions can go on. Besides that we keep the load avg on the quite low and we have enough servers for our planned growth.

Besides that, when you build a website you have to make business decisions. It's not like you just pick up your best book about website programming theory. In our case we use a framework and an ORM, which lets us develop quite fast. You have to take that into account too. I've learned that it is hard to talk about the business decisions of other companies without knowing the background behind them.

Regarding the concurrent queries to the database and the login numbers, you are right: I made a mistake with the numbers, and I apologize to the readers for giving misleading information. On the other hand, I hope you and other readers of the site can understand what you can accomplish with a queue server. If you already know that and don't need to learn it from me, then better for you. I hope this is useful for at least one developer.

50 Memcached nodes doesn't mean 50 dedicated machines for them.

@paul p:

We have a Who Is Online server that tracks the online users. It uses a timeout to mark them as logged out.

We use several Memcached nodes because we have specialized buckets depending on what we want to cache: a view cache for templates, a function cache for queries to the database, one Memcached dedicated to caching queries against a single table, etc. That way the usage of one Memcached doesn't affect the others.

@Richard:

http://graphite.wikidot.com/

April 12, 2010 | Unregistered CommenterAlvaro

Hi, Alvaro. I want to introduce you to a better streaming server: erlyvideo. It is worth testing how many users it can handle in your situation (for me it can serve 1800 connections from one machine).

Write max@maxidoors.ru if you wish.

April 12, 2010 | Unregistered CommenterMax Lapshin

> We want to partition the data by user id, since most of the information on the site is centered on the user itself, like images, videos, messages, etc.

Hmm... this is interesting... won't it create too many partitions? I am not very familiar with MySQL, but the database I work on recommends that we don't create more than 2000 partitions.

However, partitioning on user id would mean several hundred thousand partitions.

April 12, 2010 | Unregistered CommenterAbhishek

@Alvaro:

So if they don't even hit PHP then I'm even more correct that you have either too-slow scripts or too many processes. But that's not really a problem.

The sites I am talking about have a lot of dynamic content but very clever caching, plus they don't use any framework or ORM wrappers.
And here is where I think you lose probably 90% of performance; these things tend to be absolute performance killers.
Granted, you get some advantage in development time, but once you reach a certain size you will wish you hadn't gone that route.
It's not that hard to code some classes for your objects which use more intelligent queries and caching.

You have 2.5k req/s on the frontend, 133 req/s get served by memcached. Does that mean your cache hit ratio is 5%?

And please, don't use "requests per minute"; nobody with an interest in scale uses that term. It's usually "requests per second", and suddenly your numbers don't seem so big anymore, because they're only a sixtieth as large.

April 12, 2010 | Unregistered Commenterfrost

@Abhishek:

He did not say one partition per user, he said partition by user id. That does not suggest anything about partition size. It can be 100 users or 1 million users per partition. It only tells you what key is used to decide in which partition a value is stored.
Also, that does not have anything to do with MySQL per se. Which one do you work on? And 2000 partitions? Yeah, right...

April 12, 2010 | Unregistered Commenterfrost

Interesting...
Do you use any type of virtualization/cloud services, or is it all physical hardware? Any CDN?
What OS/distro?

April 12, 2010 | Unregistered CommenterMaxim R

Alvaro,

Great post. It was interesting to read and see how you solved many of your issues. Also a nice tip regarding Graphite.

Regards,
niklas

April 12, 2010 | Unregistered CommenterNiklas

Hi Alvaro. You said you use Memcached to cache view components like the user profile. Can you explain in more detail how you invalidate these view caches?

I understand that you wrote your own code to invalidate the "data cache" when the data changes. But a view cache covers lots of data, and a change to any of it should invalidate the whole view cache. How do you do that?

April 13, 2010 | Unregistered CommenterNeil H

My first feeling: too many PHP servers. I think Symfony is too slow a PHP framework for them in this case. I learned from my own experience that Symfony eats a lot of CPU.

I really think they should replace Symfony with a more scalable and lightweight PHP framework.

April 13, 2010 | Unregistered Commenterpcdinh

@Max Lapshin

Thanks for the tip on Erlyvideo; we looked into it some months ago as well. We haven't decided yet.

@Maxim R

We use EC2 for video delivery; the other systems are hosted on our own physical servers. The servers run SLES10.

@Neil H

We "namespaces" the keys, so we can invalidate related set of keys at once. But it depends on which part of the site. So is hard to explain all this here. See here for a lot of common patterns: http://code.google.com/p/memcached/wiki/FAQ

@pcdinh

If this is you: http://twitter.com/pcdinh, then I can tell you that we don't use machines like 16-core Dell/HP boxes. We use old blade servers with 6GB of RAM and 8 cores each. Besides that, we keep the load average of those machines *very* low.

Then regarding symfony, or any PHP framework: while they are not as fast as plain PHP code or more lightweight frameworks, speed is not the only thing you consider when choosing a framework. symfony has great support, a great community, and tons of documentation. This means we can easily hire people who already know the technology we use. What happens if we use a super-fast custom framework and then the "hacker" who wrote it leaves the company? Who will maintain his code? And your suggestion about moving to another framework sounds nice in theory, but do you know how many months of development it could take to port the site's code to another framework? We also have to pay our developers' salaries, which most of the time cost more than one of these blade servers. So as I said in an earlier answer, companies make business decisions; they don't just choose this or that framework because it is fast.

So please, don't blame the number of servers on symfony. While yes, it is heavier than plain PHP code, it is not the reason we use so many. Otherwise, why do you use PHP? C is much faster.

April 13, 2010 | Unregistered CommenterAlvaro

Alvaro, I'm in no way questioning your infrastructure, since you know it better than anybody else here, especially better than some of those "armchair system architects." :)
However, the statement

We also have to pay our developers' salaries, which most of the time cost more than one of these blade servers.
can lead you into a corner. You can throw money at hardware only for so long before your investment starts to produce diminishing returns and your infrastructure becomes too unwieldy, and then it'll be that much harder to make any meaningful code changes.

Also, what is the point of keeping server load "*very* low"? Does it matter what the load is (short of malfunction or severe overload) if the server returns data within an acceptable amount of time?

It sounds like all of your servers are in the Memcached pool. I'm curious: wouldn't it be better for the PHP servers to have a larger APC cache rather than using that memory for Memcached?

April 13, 2010 | Registered Commentermxx

@mxx:

Thanks for your insights. I agree with you. It's not that you just go and throw money at hardware; there should be an equilibrium. We also try to improve our code when we can, i.e. whenever we add functionality to site features we try to improve performance, refactor the code, etc., because we need to maintain it before it rots. We are working on a lightweight solution for SQL queries which, according to our benchmarks, will remove quite a lot of load from the site, since it lets us drop the ORM we use, which is quite heavy. Our site is evolving and we are learning from our mistakes, as everyone should.

Regarding the load average statement: I said that because some commenters made it sound like we have 28 completely overloaded machines. Besides that, we have those machines in place because we are planning for future growth, and by future we mean imminent, if everything goes as planned.

About APC vs. Memcached: we have to ponder that more; sometimes we discuss exactly what you just said. At the same time, some experienced PHP developers have told me that APC doesn't work that well with huge amounts of RAM. I don't have first-hand experience with that, so I can't give an opinion. Also, the APC cache is not shared between machines; we have to consider whether that is a problem too. We do cache several computations in APC as well.

April 13, 2010 | Unregistered CommenterAlvaro

Alvaro,

Thanks for a great article! I have one question about your Nginx and Memcached setup. You wrote that many requests don't even hit PHP because Nginx gets the cached content from Memcached. Can you describe that in a bit more detail? Do you cache whole HTML pages?

Regards,

April 20, 2010 | Unregistered Commenterbmf

@Alvaro

Erlyvideo is being developed very rapidly. Several months ago you would have seen a previous generation of it that couldn't do much. So if you are interested, it's better to communicate via email.

April 20, 2010 | Unregistered CommenterMax Lapshin

+1, thanks for a great article!

Alvaro, I think you should try Erlyvideo: it's written in Erlang, and it is developing very fast. I think Max Lapshin, erlyvideo's author (hi, Max :), can provide support and implement all the features you need.

Alvaro, did you try Facebook's HipHopPHP compiler?

May 1, 2010 | Unregistered CommenterPavel Sh

What I found most interesting is how they manage to successfully blend a little of the old with a little of the new

No, they didn't manage to blend old and new successfully.
poppen.de is famous for being slow. It's almost never really fast, and every few (6-8) weeks it gets extra slow for at least a few hours, often for 1-2 weeks.
The current slow period has been going on for about 6 weeks, and there's still no end in sight. The performance of poppen.de is a pain in the a*s... far from being a success...

May 3, 2010 | Unregistered Commenterpoppen.de user

We have two frontend Nginx servers
Alvaro, are these two servers in a master/master setup? What solution do you use to make them highly available / load balanced? Regards

October 24, 2010 | Unregistered CommenterLubumbax
