Tuesday, December 6, 2011

Instagram Architecture: 14 Million users, Terabytes of Photos, 100s of Instances, Dozens of Technologies

Instagram is a free photo sharing and social networking service for your iPhone that has been an instant success. Growing to 14 million users in just over a year, they reached 150 million photos in August while amassing several terabytes of photos, and they did this with just 3 Instaneers, all on the Amazon stack.

The Instagram team has written up what can be considered the canonical description of an early stage startup in this era: What Powers Instagram: Hundreds of Instances, Dozens of Technologies.

Instagram uses a pastiche of different technologies and strategies. The team is small yet has experienced rapid growth riding the crest of a rising social and mobile wave. They use a hybrid of SQL and NoSQL, lean on a ton of open source projects, and chose the cloud over colo; Amazon services are heavily leveraged rather than building their own; reliability comes from availability zones; async work scheduling links components together; the system is composed as much as possible of services exposing an API and of external services they don't have to build; data is stored in memory and in the cloud; most code is written in a dynamic language; custom bits have been coded to link everything together; and they have gone fast and kept small. A very modern construction.

We'll just tl;dr the article here; it's very well written and to the point. Definitely worth reading. Here are the essentials: 

  • Lessons learned: 1) Keep it very simple 2) Don’t re-invent the wheel 3) Go with proven and solid technologies when you can.
  • 3 Engineers.
  • Amazon shop. They use many of Amazon's services. With only 3 engineers, they don't have the time to look into self-hosting.
  • 100+ EC2 instances total for various purposes.
  • Ubuntu Linux 11.04 (“Natty Narwhal”). Solid, other Ubuntu versions froze on them.
  • Amazon’s Elastic Load Balancer routes requests and 3 nginx instances sit behind the ELB.
  • SSL terminates at the ELB, which lessens the CPU load on nginx.
  • Amazon’s Route53 for the DNS.
  • 25+ Django application servers on High-CPU Extra-Large machines.
  • Traffic is CPU-bound rather than memory-bound, so High-CPU Extra-Large machines are a good balance of memory and CPU.
  • Gunicorn as their WSGI server. Apache was harder to configure and more CPU intensive (see the config sketch after this list).
  • Fabric is used to execute commands in parallel on all machines; a deploy takes only seconds (see the fabfile sketch after this list).
  • PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances.
  • Twelve PostgreSQL replicas run in a different availability zone.
  • PostgreSQL instances run in a master-replica setup using Streaming Replication. EBS is used for snapshotting, to take frequent backups. 
  • EBS is deployed in a software RAID configuration. Uses mdadm to get decent IO.
  • All of their working set is stored in memory. EBS doesn't support enough disk seeks per second.
  • Vmtouch (portable file system cache diagnostics) is used to manage what data is in memory, especially when failing over from one machine to another, where there is no active memory profile already.
  • XFS as the file system. Used to get consistent snapshots by freezing and unfreezing the RAID arrays when snapshotting.
  • PgBouncer is used to pool connections to PostgreSQL.
  • Several terabytes of photos are stored on Amazon S3.
  • Amazon CloudFront as the CDN.
  • Redis powers their main feed, activity feed, sessions system, and other services (see the feed fan-out sketch after this list).
  • Redis runs on several Quadruple Extra-Large Memory instances. Occasionally shard across instances.
  • Redis runs in a master-replica setup. Replicas constantly save to disk. EBS snapshots back up the DB dumps; dumping the DB on the master was too taxing.
  • Apache Solr powers the geo-search API. They like its simple JSON interface.
  • 6 memcached instances for caching. They connect using pylibmc & libmemcached (see the caching sketch after this list). Amazon's ElastiCache service isn't any cheaper.
  • Gearman is used to asynchronously share photos to Twitter, Facebook, etc., notify real-time subscribers of a new photo, and fan out feeds (see the worker sketch after this list).
  • 200 Python workers consume tasks off the Gearman task queue.
  • Pyapns (Apple Push Notification Service) handles over a billion push notifications. Rock solid.
  • Munin graphs metrics across the system and alerts on problems. They write many custom plugins using Python-Munin to graph signups per minute, photos posted per second, etc. (see the plugin sketch after this list).
  • Pingdom for external monitoring of the service.
  • PagerDuty for handling notifications and incidents.
  • Sentry for Python error reporting.
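
To make a few of the items above more concrete, here are some short sketches. First, Gunicorn: its config file is plain Python, so a minimal setup for a CPU-bound Django app might look like the following. The numbers and the bind address are assumptions for illustration, not values from the article.

```python
# gunicorn.conf.py -- hypothetical Gunicorn settings for a Django app behind nginx.
# Worker counts and timeouts are illustrative guesses, not Instagram's values.
import multiprocessing

bind = "0.0.0.0:8000"                      # nginx proxies requests to this port
workers = multiprocessing.cpu_count() * 2  # common rule of thumb for sync workers
worker_class = "sync"                      # plain synchronous workers
max_requests = 5000                        # recycle workers to bound memory growth
timeout = 30                               # kill requests stuck for more than 30 seconds
```

You'd start it with something like `gunicorn -c gunicorn.conf.py mysite.wsgi:application`, where `mysite` is a stand-in for the real project module.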
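
The Fabric deploy that takes only seconds boils down to running the same commands on every app server at once. Here is a minimal fabfile sketch; the host names, checkout path, and restart command are made up.

```python
# fabfile.py -- a sketch of a parallel deploy with Fabric 1.x.
# Hosts, the checkout path, and the restart command are hypothetical.
from fabric.api import env, run, sudo, cd, parallel

env.hosts = ["app1.example.com", "app2.example.com"]  # hypothetical app servers
env.user = "deploy"

@parallel  # execute on all hosts concurrently instead of one at a time
def deploy():
    with cd("/srv/app"):                 # hypothetical code directory
        run("git pull origin master")    # fetch the new release
    sudo("service gunicorn restart")     # hypothetical restart command
```

Then `fab deploy` pushes to every host in parallel.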
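
The article doesn't show how the Redis-backed feeds are structured, but a common pattern that fits the description (fan-out on write into one capped list per follower) looks roughly like this with redis-py. Key names, the list cap, and the follower lookup are assumptions.

```python
# Feed fan-out on write with redis-py; key layout and sizes are hypothetical.
import redis

r = redis.Redis(host="redis1.example.com", port=6379)  # hypothetical Redis master

FEED_LEN = 500  # keep only the newest N photo ids per user feed

def fan_out_photo(photo_id, follower_ids):
    """Push a new photo id onto every follower's feed list."""
    pipe = r.pipeline()
    for follower_id in follower_ids:
        key = "feed:%d" % follower_id
        pipe.lpush(key, photo_id)          # newest first
        pipe.ltrim(key, 0, FEED_LEN - 1)   # cap the list so memory stays bounded
    pipe.execute()

def get_feed(user_id, count=30):
    """Read the newest photo ids for a user's feed."""
    return r.lrange("feed:%d" % user_id, 0, count - 1)
```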
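
The memcached tier is reached through pylibmc, the libmemcached bindings. A cache-aside sketch, with hypothetical host names, key scheme, and TTL:

```python
# Caching with pylibmc; hosts, keys, and TTLs are assumptions for illustration.
import pylibmc

mc = pylibmc.Client(
    ["cache1.example.com", "cache2.example.com"],      # hypothetical memcached nodes
    binary=True,
    behaviors={"tcp_nodelay": True, "ketama": True},   # consistent hashing across nodes
)

def get_profile(user_id, load_from_db):
    """Classic cache-aside: try memcached first, fall back to the database."""
    key = "profile:%d" % user_id
    profile = mc.get(key)
    if profile is None:
        profile = load_from_db(user_id)   # e.g. a Django ORM lookup
        mc.set(key, profile, time=300)    # cache for five minutes
    return profile
```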
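
Gearman is the glue for the async work: the web tier enqueues a background job and one of the 200 Python workers picks it up. A sketch with python-gearman; the task name, hosts, and payload format are hypothetical.

```python
# Queueing and consuming async jobs with python-gearman.
# Task names, hosts, and the payload format are assumptions for illustration.
import json
import gearman

GEARMAN_HOSTS = ["gearman1.example.com:4730"]

# --- producer side, e.g. inside the Django view that accepts a new photo ---
client = gearman.GearmanClient(GEARMAN_HOSTS)

def enqueue_share(photo_id, user_id):
    payload = json.dumps({"photo_id": photo_id, "user_id": user_id})
    # background=True returns immediately; a worker handles the job later
    client.submit_job("share_photo", payload, background=True)

# --- worker side, one of many identical worker processes ---
def handle_share(gearman_worker, gearman_job):
    task = json.loads(gearman_job.data)
    # ... share to Twitter/Facebook, notify subscribers, fan out the feed ...
    return ""  # background jobs don't need a meaningful result

if __name__ == "__main__":
    worker = gearman.GearmanWorker(GEARMAN_HOSTS)
    worker.register_task("share_photo", handle_share)
    worker.work()  # block and process jobs forever
```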
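
Finally, the custom Munin plugins. The article says they're written with Python-Munin; the plugin protocol itself is simple enough that a hand-rolled version is only a few lines, so here's a plain-Python sketch instead. Pulling the metric out of a Redis counter is an assumption.

```python
#!/usr/bin/env python
# A custom Munin plugin sketch (plain plugin protocol, not the Python-Munin helper).
# Reading the metric from a Redis counter is an assumption for illustration.
import sys
import redis

def photos_per_second():
    r = redis.Redis(host="redis1.example.com")   # hypothetical stats store
    return int(r.get("stats:photos_per_second") or 0)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        # Munin calls the plugin with "config" to learn how to draw the graph
        print("graph_title Photos posted per second")
        print("graph_category instagram")
        print("photos.label photos/sec")
    else:
        # Normal run: emit the current value
        print("photos.value %d" % photos_per_second())
```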

Related Articles

Reader Comments (8)

Hey Todd,

Mike from Instagram here. Thanks for the write-up—High Scalability has been a fantastic resource for us as we've been growing the infrastructure, thanks for all the great info you compile!

December 6, 2011 | Unregistered CommenterMike Krieger

My pleasure Mike. And I really appreciate how open you guys are being. Great stuff. Thanks.

December 6, 2011 | Registered CommenterHighScalability Team

@Mike (Instagram)
I've read that you use Solr for geo search. Can you explain your solution a little bit? Do you use Solr 3.1 with geofilt or have you developed something special?

December 7, 2011 | Unregistered CommenterDominik

How much does this cost? Just to have an idea.

May 11, 2012 | Unregistered CommenterStefano

Dear Mike,

Do you have some information concerning the basic hardware of the Instagram servers? We have to find out data like CPU, RAM, fixed-disk storage, and processor for our "information management" lecture at university.

We would be very pleased if you could help us.

Thanks a lot!

J, J and L

December 18, 2012 | Unregistered CommenterJJL

A CDN needs content to be publicly readable, right?
Then how does Instagram handle images that should be shared with only a few people?

April 7, 2015 | Unregistered CommenterJoe

You can have a private bucket item on S3, still routed through CDN though it won't be visible until you generate a signed key in S3. Each user that's granted permission can receive a signed key. I don't know if that's how they do it specifically though.

October 20, 2016 | Unregistered CommenterJREAM

Link in related articles for "Storing hundreds of millions of simple key-value pairs in Redis" is incorrect. It redirects to some other medium post. Actual link is https://instagram-engineering.com/storing-hundreds-of-millions-of-simple-key-value-pairs-in-redis-1091ae80f74c

April 5, 2019 | Unregistered Commenterameya
