The Instagram Architecture Facebook Bought for a Cool Billion Dollars
It's been a well kept secret, but you may have heard Facebook will Buy Photo-Sharing Service Instagram for $1 Billion. Just what is Facebook buying? Here's a quick gloss I did a little over a year ago on a presentation Instagram gave on their architecture. In that article I called Instagram's architecture the "canonical description of an early stage startup in this era." Little did we know how true that would turn out to be. If you want to learn how they did it then don't take a picture, just keep on reading...
Instagram is a free photo sharing and social networking service for your iPhone that has been an instant success. Growing to 14 million users in just over a year (now 30 million users), they reached 150 million photos in August while amassing several terabytes of photos, and they did this with just 3 Instaneers, all on the Amazon stack.
The Instagram team has written up what can be considered the canonical description of an early stage startup in this era: What Powers Instagram: Hundreds of Instances, Dozens of Technologies.
Instagram uses a pastiche of different technologies and strategies. The team is small yet has experienced rapid growth, riding the crest of a rising social and mobile wave. It uses a hybrid of SQL and NoSQL and a ton of open source projects. They chose the cloud over colo and lean heavily on Amazon services rather than building their own. Reliability comes from availability zones, and async work scheduling links components together. The system is composed as much as possible of services exposing an API and of external services they don't have to build. Data is stored in memory and in the cloud, most code is in a dynamic language, custom bits have been coded to link everything together, and they have gone fast and kept small. A very modern construction.
We'll just tl;dr the article here. It's very well written, to the point, and definitely worth reading. Here are the essentials:
- Lessons learned: 1) Keep it very simple 2) Don’t re-invent the wheel 3) Go with proven and solid technologies when you can.
- 3 engineers. (They now reportedly have 13 employees; remember, this was a while back.)
- Amazon shop. They use many of Amazon's services. With only 3 engineers, they don’t have the time to look at self-hosting.
- 100+ EC2 instances total for various purposes.
- Ubuntu Linux 11.04 (“Natty Narwhal”). Solid, other Ubuntu versions froze on them.
- Amazon’s Elastic Load Balancer routes requests and 3 nginx instances sit behind the ELB.
- SSL terminates at the ELB, which lessens the CPU load on nginx.
- Amazon’s Route53 for the DNS.
- 25+ Django application servers on High-CPU Extra-Large machines.
- Traffic is CPU-bound rather than memory-bound, so High-CPU Extra-Large machines are a good balance of memory and CPU.
- Gunicorn as their WSGI server. Apache was harder to configure and more CPU intensive.
- Fabric is used to execute commands in parallel on all machines. A deploy takes only seconds.
- PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances.
- Twelve PostgreSQL replicas run in a different availability zone.
- PostgreSQL instances run in a master-replica setup using Streaming Replication. EBS is used for snapshotting, to take frequent backups.
- EBS is deployed in a software RAID configuration. Uses mdadm to get decent IO.
- All of their working set is stored in memory. EBS doesn’t support enough disk seeks per second.
- Vmtouch (portable file system cache diagnostics) is used to manage what data is in memory, especially when failing over from one machine to another, where there is no active memory profile already.
- XFS as the file system. Used to get consistent snapshots by freezing and unfreezing the RAID arrays when snapshotting.
- Pgbouncer is used to pool connections to PostgreSQL.
- Several terabytes of photos are stored on Amazon S3.
- Amazon CloudFront as the CDN.
- Redis powers their main feed, activity feed, sessions system, and other services.
- Redis runs on several Quadruple Extra-Large Memory instances. They occasionally shard across instances.
- Redis runs in a master-replica setup. Replicas constantly save to disk. EBS snapshots back up the DB dumps. Dumping the DB on the master was too taxing.
- Apache Solr powers the geo-search API. They like the simple JSON interface.
- 6 memcached instances for caching. Connect using pylibmc & libmemcached. Amazon's ElastiCache service isn't any cheaper.
- Gearman is used to asynchronously share photos to Twitter, Facebook, etc.; notify real-time subscribers of a new photo posted; and perform feed fan-out.
- 200 Python workers consume tasks off the Gearman task queue.
- Pyapns (Apple Push Notification Service) handles over a billion push notifications. Rock solid.
- Munin to graph metrics across the system and alert on problems. They write many custom plugins using Python-Munin to graph signups per minute, photos posted per second, etc.
- Pingdom for external monitoring of the service.
- PagerDuty for handling notifications and incidents.
- Sentry for Python error reporting.
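The Gearman bullets above describe feed fan-out as asynchronous work: the request path enqueues a task and returns, and a pool of Python workers pushes the new photo onto each follower's feed. Here is a minimal sketch of that pattern, using only the Python standard library so it runs anywhere. It is an illustration under stated assumptions, not Instagram's code: the in-process `queue.Queue` stands in for Gearman, the `feeds` dict stands in for Redis lists, and the names `post_photo`, `fan_out_worker`, `FOLLOWERS`, and `FEED_LIMIT` are all hypothetical.

```python
import queue
import threading
from collections import defaultdict, deque

# Hypothetical follower graph: user -> followers (assumption for illustration).
FOLLOWERS = {"alice": ["bob", "carol"], "bob": ["alice"]}
FEED_LIMIT = 100  # assumed cap on entries kept per feed

tasks = queue.Queue()  # stand-in for the Gearman task queue
feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))  # stand-in for Redis lists
feeds_lock = threading.Lock()


def post_photo(author, photo_id):
    """Called in the request path: enqueue the fan-out and return at once."""
    tasks.put((author, photo_id))


def fan_out_worker():
    """Worker loop: push each new photo onto every follower's feed."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: shut this worker down
            tasks.task_done()
            return
        author, photo_id = task
        with feeds_lock:
            for follower in FOLLOWERS.get(author, []):
                feeds[follower].appendleft(photo_id)  # newest first
        tasks.task_done()


def run_workers(count=4):
    """Start `count` worker threads and return them."""
    workers = [threading.Thread(target=fan_out_worker) for _ in range(count)]
    for worker in workers:
        worker.start()
    return workers


def stop_workers(workers):
    """Wait for queued tasks to finish, then stop every worker."""
    tasks.join()
    for _ in workers:
        tasks.put(None)
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    pool = run_workers()
    post_photo("alice", "photo-1")
    post_photo("bob", "photo-2")
    stop_workers(pool)
    print({user: list(feed) for user, feed in feeds.items()})
```

The point of the design is that `post_photo` costs the same whether the author has two followers or two million. The expensive loop runs in the workers, which can be scaled independently of the application servers.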
And now you know the secret to getting bought for a billion dollars...getting your architecture written up on HighScalability!
Reader Comments (9)
Just noticed one of the related articles has the wrong link.
Article: "Storing hundreds of millions of simple key-value pairs in Redis"
Current URL: http://instagram-engineering.tumblr.com/post/12651721845/instagram-engineering-challenge-the-unshredder
Should be: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value-pairs
I am wondering how they connect to and maintain their database, as this is their single key point of data. How do they store data across their databases and still have fast queries?
How much does it cost monthly to maintain this architecture on AWS? $100k? $200k?
I have a little doubt about the database choice. Why PostgreSQL instead of Cassandra?
Does anyone have any idea?
Tks
Henrique
Henrique, If you look at the Scaling Instagram link under Related Articles, slide 59 says 'Why PG? PostGIS'. I'm not well versed on geo support in other database systems, but PostGIS is a very mature choice so it makes sense that they would have chosen that a couple of years ago (and indeed today). Their use of Django would naturally have led them towards Postgres as well. Then, of course, they combined this with Redis to take care of some of the relational db shortcomings.
I would like to know about the database choice as well.
Inio Joln, why not answer your question about their monthly AWS bill yourself with http://calculator.s3.amazonaws.com/calc5.html?
Plugging in 25 High-CPU XL instances (I'm assuming reserved @ 1yr), and 75 other Medium instances (they mention ~100 total), and then 5TB on/in/out of S3, you get about $12K/month. Obviously there is more going on, but EC2 costs generally dwarf others on AWS, so even if we double it for comfort, it probably isn't more than $25K/month. On the other hand if they are using on-demand on reserved instances more, it could be under $10K.
For an updated view that talks about Sensu, Redis, RabbitMQ, etc., check out this PyCon 2013 talk video by Rick Branson.
Hello, I would like to know if there is a more recent article regarding the current Instagram stack, especially the feed.