Aditya Agarwal, Director of Engineering at Facebook, gave an excellent Scale at Facebook talk that covers their architecture, but the talk is really more about how to scale an organization by preserving the best parts of its culture. The key take home of the talk is:
You can get the code right, you can get the products right, but you need to get the culture right first. If you don't get the culture right then your company won't scale.
This leads into the four meta secrets of scaling at Facebook:
1. Scaling takes Iteration. Solutions often work in the beginning, but you'll have to modify them as you go on. What works in year one may not work later. PHP, for example, is simple to use at first, but is not a good choice when you have 10s of thousands of web servers.
Another example is photos. They currently serve 1.2 million photos a second. The first generation was built the easy way: don't worry about scaling that much, focus on getting the functionality right. The uploader stored the file in NFS and the meta-data was stored in MySQL. It worked for the first 3 months and caused a lot of sleepless nights. Even so, he would still do it the same way today. Time to market was the biggest competitive advantage they had. Having the feature was more important than making sure it was a fully thought out scalable solution.
The second phase was optimization. Are there different access patterns that can be optimized for? It turns out smaller images are accessed more frequently, so those were cached. They also started using a CDN. NFS was not designed to store 80 billion small files, so not all of the meta-data would fit in memory. As a result, a lookup would take 2 or 3 disk IOs, which was slow.
The third generation is an overlay system that creates a file that is a blob stored in the file system. Images are stored in the blob and you know the offset of the photo in the blob. One IO per photo.
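The idea of packing many photos into one blob and locating each by its offset can be sketched as follows. This is only a minimal illustration, not Facebook's implementation: the `BlobStore` class, its `put` and `get` methods, and the in-memory dictionary index are all hypothetical names and choices made for the example.

```python
import io


class BlobStore:
    """Append-only blob holding many photos, plus an in-memory offset index.

    Hypothetical sketch: the index maps photo_id -> (offset, length), so a
    read needs a single seek and a single read on the blob.
    """

    def __init__(self, blob):
        self._blob = blob    # any seekable binary file object
        self._index = {}     # photo_id -> (offset, length)

    def put(self, photo_id, data):
        # Append the photo bytes at the end of the blob and remember where.
        self._blob.seek(0, io.SEEK_END)
        offset = self._blob.tell()
        self._blob.write(data)
        self._index[photo_id] = (offset, len(data))
        return offset

    def get(self, photo_id):
        # The offset is already known, so no directory or inode lookups
        # are needed: one seek, one read.
        offset, length = self._index[photo_id]
        self._blob.seek(offset)
        return self._blob.read(length)


# Usage: an in-memory buffer stands in for a large file on disk.
store = BlobStore(io.BytesIO())
store.put("cat.jpg", b"cat-bytes")
store.put("dog.jpg", b"dog-bytes")
print(store.get("dog.jpg"))
```

Because the index holds only an offset and a length per photo, it is far smaller than per-file filesystem meta-data, which is what makes keeping it in memory plausible in a design like this.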
2. Don't Over Design. Just use what you need to use as you scale your system out. Figure out where you need to iterate on a solution, optimize something, or completely build a part of the stack yourself. They spent a lot of time trying to optimize PHP and ended up writing HipHop, a code transformer that converts PHP into C++. It generated a massive amount of memory and CPU savings. You don't have to do this on day one, but you may have to eventually. Focus on the product first before you write an entire new language.
3. Choose the right tool for the job, but realize that your choice comes with overhead. If you really need to use Python then go ahead and do so, we'll try to help you succeed. Realize that the choice brings overhead, usually across deployment, monitoring, ops, and so on.
If you choose to use a services architecture you'll have to build most of the backend yourself and that often takes quite a bit of time. With the LAMP stack you get a lot for free. Once you move away from the LAMP stack, how you do things like service configuration and monitoring is up to you. As you go deeper into the services approach you have to reinvent the wheel.
4. Get the culture right. Move Fast - break things. Huge Impact - small teams. Be bold - innovate. Build an environment internally which promotes building the right thing first and then fixing as needed, not worrying about innovating, not worrying about breaking things, thinking big, thinking about what the next thing is that you need to build after building the first thing. You can get the code right, you can get the products right, but you need to get the culture right first. If you don't get the culture right then your company won't scale.
There are no product owners at Facebook. Everyone owns the product. Give people ownership of what they work on. If you give ownership to one person then the chances are nobody else will contribute to pushing it to the next level. Ideas come from users and people internally. If you can't push responsibility down and you limit the number of people who feel they are real owners, then the only people you'll be able to motivate are the people who think they are the real owners. So instead why not distribute that entire responsibility?
Isolate the part of the culture that you value and want to preserve. It doesn't happen automatically. Facebook organizes hackathons, the point of which is to show new engineers that if they come in at 8AM they can get a new feature up on the site in 12 hours. Move fast isn't just a platitude. A company has to come up with ways to make people feel it's a reality.