Monday, January 11, 2016

A Beginner's Guide to Scaling to 11 Million+ Users on Amazon's AWS

How do you scale a system from one user to more than 11 million users? Joel Williams, Amazon Web Services Solutions Architect, gives an excellent talk on just that subject: AWS re:Invent 2015 Scaling Up to Your First 10 Million Users.

If you are an advanced AWS user this talk is not for you, but it’s a great way to get started if you are new to AWS, new to the cloud, or if you haven’t kept up with the constant stream of new features Amazon keeps pumping out.

As you might expect, since this is a talk by Amazon, Amazon services are always front and center as the solution to any problem. Their platform play is impressive and instructive. It's obvious from how the pieces all fit together that Amazon has done a great job of mapping out what users need and then making sure they have a product in that space.

Some of the interesting takeaways:

  • Start with SQL and only move to NoSQL when necessary.
  • A consistent theme is taking components and separating them out. This allows those components to scale and fail independently. It applies to breaking up tiers and creating microservices.
  • Only invest in tasks that differentiate you as a business, don't reinvent the wheel.
  • Scalability and redundancy are not two separate concepts, you can often do both at the same time.
  • There's no mention of costs. That would be a good addition to the talk as that is one of the major criticisms of AWS solutions.

 

The Basics

  • AWS is in 12 regions around the world.

    • A Region is a physical location in the world where Amazon has multiple Availability Zones. There are regions in: North America; South America; Europe; Middle East; Africa; Asia Pacific.

    • An Availability Zone (AZ) is generally a single datacenter, though they can be constructed out of multiple datacenters.

    • Each AZ is separate enough that they have separate power and Internet connectivity.

    • The only connection between AZs is a low latency network. AZs can be 5 or 15 miles apart, for example. The network is fast enough that your application can act like all AZs are in the same datacenter.

    • Each Region has at least two Availability Zones. There are 32 AZs total.

    • Using AZs it’s possible to create a high availability architecture for your application.

    • At least 9 more Availability Zones and 4 more Regions are coming in 2016.

  • AWS has 53 edge locations around the world.

    • Edge locations are used by CloudFront, Amazon’s Content Distribution Network (CDN), and Route 53, Amazon’s managed DNS service.

    • Edge locations enable users to access content with a very low latency no matter where they are in the world.

  • Building Block Services

    • AWS has created a number of services that use multiple AZs internally to be highly available and fault tolerant. Here is a list of what services are available where.

    • You can use these services in your application, for a fee, without having to worry about making them highly available yourself.

    • Some of these services that span multiple AZs: CloudFront, Route 53, S3, DynamoDB, Elastic Load Balancing, EFS, Lambda, SQS, SNS, SES, SWF.

    • A highly available architecture can be created using these services even though your own instances may live within a single AZ.

 

1 User

  • In this scenario you are the only user and you want to get a website running.

  • Your architecture will look something like this (a rough boto3 sketch of the setup follows the list):

    • Run on a single instance, maybe a t2.micro. Instance types comprise varying combinations of CPU, memory, storage, and networking capacity and give you the flexibility to choose the appropriate mix of resources for your applications.

    • The one instance would run the entire web stack, for example: web app, database, management, etc.

    • Use Amazon Route 53 for the DNS.

    • Attach a single Elastic IP address to the instance.

    • Works great, for a while.
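
As a rough illustration (not from the talk), here is a minimal boto3 sketch of this single-instance setup. The AMI ID, key pair, and region are placeholders, and it assumes your AWS credentials are already configured:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one t2.micro that runs the entire stack (web app, database, management).
# The AMI ID and key pair name below are placeholders.
reservation = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
    KeyName="my-key-pair",
    MinCount=1,
    MaxCount=1,
)
instance_id = reservation["Instances"][0]["InstanceId"]

# Wait until the instance is running, then attach a single Elastic IP to it.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
allocation = ec2.allocate_address(Domain="vpc")
ec2.associate_address(InstanceId=instance_id, AllocationId=allocation["AllocationId"])
print("Instance", instance_id, "is reachable at", allocation["PublicIp"])
```

In Route 53 you would then point your domain's A record at that Elastic IP.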

 

Vertical Scaling

  • You need a bigger box. The simplest approach to scaling is to choose a larger instance type, a c4.8xlarge or m3.2xlarge, for example.

  • This approach is called vertical scaling.

  • Just stop your instance, choose a new instance type, and start it again and you’re running with more power (see the sketch at the end of this section).

  • There is a wide mix of different hardware configurations to choose from. You can have a system with 244 GB of RAM (instance types with 2 TB of RAM are coming soon), or one with 40 cores. There are High I/O instances, High CPU instances, and High Storage instances.

  • Some Amazon services come with a Provisioned IOPS option to guarantee performance. The idea is you can perhaps use a smaller instance type for your service and lean on Amazon services like DynamoDB to deliver scalable performance so you don’t have to build it yourself.

  • Vertical scaling has a big problem: there’s no failover, no redundancy. If the instance has a problem your website will die. All your eggs are in one basket.

  • Eventually a single instance can only get so big. You need to do something else.
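
A hedged sketch of that stop / resize / start cycle with boto3; the instance ID and target type are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder

# Vertical scaling: stop the instance, switch it to a bigger type, start it again.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "m3.2xlarge"},  # any larger instance type
)

ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Note the downtime while the instance is stopped; that is exactly the single-point-of-failure problem described above.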

 

Users > 10

  • Separate out a single host into multiple hosts

    • One host for the web site.

    • One host for the database. Run any database you want, but you are on the hook for the database administration.

    • Using separate hosts allows the web site and the database to be scaled independently of each other. Perhaps your database will need a bigger machine than your web site, for example.

  • Or instead of running your own database you could use a database service.

    • Are you a database admin? Do you really want to worry about backups? High availability? Patches? Operating systems?

    • A big advantage of using a service is you can have a multi Availability Zone database setup with a single click. You won’t have to worry about replication or any of that sort of thing. Your database will be highly available and reliable.
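
As a sketch of what that "single click" looks like through the API, here is a hypothetical boto3 call that creates a Multi-AZ MySQL instance on RDS; the identifier, instance class, storage size, and credentials are all placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# A managed MySQL database with a standby replica in another AZ.
# MultiAZ=True is the "single click": RDS handles replication and failover.
rds.create_db_instance(
    DBInstanceIdentifier="myapp-db",        # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m3.medium",         # placeholder size
    AllocatedStorage=100,                   # GB
    MasterUsername="admin",
    MasterUserPassword="change-me-please",  # placeholder; use a secrets store in practice
    MultiAZ=True,
)
```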

  • As you might imagine, Amazon has several fully managed database services to sell you:

    • Amazon RDS (Relational Database Service). There are many options: Microsoft SQL Server, Oracle, MySQL, PostgreSQL, MariaDB, Amazon Aurora.

    • Amazon DynamoDB. A NoSQL managed database.

    • Amazon Redshift. A petabyte scale data warehouse system.

  • More on Amazon Aurora:

    • Automatic storage scaling up to 64TB. You no longer have to provision the storage for your data.

    • Up to 15 read replicas.

    • Continuous (incremental) backups to S3.

    • 6-way replication across 3 AZs. This helps you handle failure.

    • MySQL compatible.

  • Start with a SQL database instead of a NoSQL database.

    • The suggestion is to start with a SQL database.

    • The technology is established.

    • There’s lots of existing code, communities, support groups, books, and tools.

    • You aren’t going to break a SQL database with your first 10 million users. Not even close (unless your data is huge).

    • There are clear patterns for scalability.

  • When might you need to start with a NoSQL database?

    • If you need to store > 5 TB of data in year one or you have an incredibly data intensive workload.

    • Your application has super low-latency requirements.

    • You need really high throughput. You need to really tune the IO you are getting on both reads and writes.

    • You don’t have any relational data.

 

Users > 100

  • Use a separate host for the web tier.

  • Store the database on Amazon RDS. It takes care of everything.

  • That’s all you have to do.

 

Users > 1000

  • As architected your application has availability issues. If the host for your web service fails then your web site goes down.

  • So you need another web instance in another Availability Zone. That’s OK because the latency between the AZs is in the low single digit milliseconds, almost as if they were right next to each other.

  • You also need to add a slave database in RDS that runs in another AZ. If there’s a problem with the master your application will automatically switch over to the slave. There are no application changes necessary on the failover because your application always uses the same endpoint.

  • An Elastic Load Balancer (ELB) is added to the configuration to load balance users between your two web host instances in the two AZs.

  • Elastic Load Balancer (ELB):

    • ELB is a highly available managed load balancer. The ELB exists in all AZs. It’s a single DNS endpoint for your application. Just put it in Route 53 and it will load balance across your web host instances.

    • The ELB has Health Checks that make sure traffic doesn’t flow to failed hosts.

    • It scales without your doing anything. If it sees additional traffic it scales behind the scenes both horizontally and vertically. You don’t have to manage it. As your application scales, so does the ELB.
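
A minimal boto3 sketch of the classic ELB setup described above; the names, AZs, health check path, and instance IDs are placeholders (in a VPC you would pass subnets instead of AZs):

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")

# Create a classic Elastic Load Balancer spanning two Availability Zones.
elb.create_load_balancer(
    LoadBalancerName="myapp-elb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstanceProtocol": "HTTP", "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)

# Health checks keep traffic away from failed hosts.
elb.configure_health_check(
    LoadBalancerName="myapp-elb",
    HealthCheck={"Target": "HTTP:80/health", "Interval": 30, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 2},
)

# Register the web instances that live in each AZ (placeholder IDs).
elb.register_instances_with_load_balancer(
    LoadBalancerName="myapp-elb",
    Instances=[{"InstanceId": "i-0aaaaaaaaaaaaaaaa"},
               {"InstanceId": "i-0bbbbbbbbbbbbbbbb"}],
)
```

You then point Route 53 at the DNS name the ELB returns; users never hit the instances directly.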

 

Users > 10,000s - 100,000s

  • The previous configuration has 2 instances behind the ELB; in practice you can have 1000s of instances behind the ELB. This is horizontal scaling.

  • You’ll need to add more read replicas to the RDS database. This will take load off the write master.

  • Improve performance and efficiency by lightening the load on your web tier servers and moving some of the traffic elsewhere. Move static content in your web app to Amazon S3 and Amazon CloudFront. CloudFront is Amazon’s CDN that stores your data in the 53 edge locations across the world (an S3 upload sketch follows the S3 notes below).

  • Amazon S3 is an object-based store.

    • It’s not like EBS, it’s not storage that’s attached to an EC2 instance, it’s an object store, not a block store.

    • It’s a great place to store static content, like JavaScript, CSS, images, and videos. This sort of content does not need to sit on an EC2 instance.

    • Highly durable, with 11 9’s of durability.

    • Infinitely scalable, throw as much data at it as you want. Customers store multiple petabytes of data in S3.

    • Objects of up to 5TB in size are supported.

    • Encryption is supported. You can use Amazon’s encryption, your encryption, or an encryption service.
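
For example, pushing a static asset to S3 might look like the following sketch; the file, bucket, and key names are placeholders, and CloudFront would be configured with that bucket as an origin:

```python
import boto3

s3 = boto3.client("s3")

# Upload a static asset with a long cache lifetime so CloudFront and browsers
# can serve it from the edge instead of hitting your web tier.
s3.upload_file(
    "build/logo.png",             # local file (placeholder)
    "myapp-static-assets",        # bucket name (placeholder)
    "img/logo.png",               # object key (placeholder)
    ExtraArgs={"ContentType": "image/png", "CacheControl": "max-age=86400"},
)
```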

  • Amazon CloudFront is a cache for your content.

    • It caches content at the edge locations to provide your users the lowest latency access possible.

    • Without a CDN your users will experience higher latency access to your content. Your servers will also be under higher load as they are serving the content as well as handling the web requests.

    • One customer needed to serve content at 60 Gbps. The web tier didn’t even know that was going on, CloudFront handled it all.

  • You can also lighten the load by shifting session state off your web tier.

    • Store the session state in ElastiCache or DynamoDB.

    • This approach also sets your system up to support auto scaling in the future.
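
A rough sketch of stateless session handling backed by DynamoDB; the table name and attributes are hypothetical, and the table is assumed to be keyed on session_id:

```python
import time
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
sessions = dynamodb.Table("sessions")  # hypothetical table keyed on session_id

def create_session(user_id):
    """Store session state in DynamoDB instead of on the web host."""
    session_id = str(uuid.uuid4())
    sessions.put_item(Item={
        "session_id": session_id,
        "user_id": user_id,
        "expires_at": int(time.time()) + 3600,  # could back a DynamoDB TTL attribute
    })
    return session_id

def load_session(session_id):
    """Any web instance can load the session, so instances stay interchangeable."""
    response = sessions.get_item(Key={"session_id": session_id})
    return response.get("Item")
```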

  • You can also lighten the load by caching data from your database into ElastiCache.

    • Your database doesn’t need to handle all the gets for data. A cache can handle a lot of that work and leaves the database to handle more important traffic.
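
One common way to do this is a cache-aside pattern against an ElastiCache Redis endpoint. Here is a hedged sketch using the redis Python package; the endpoint, key format, TTL, and the db.fetch_product call are assumptions:

```python
import json
import redis

# ElastiCache exposes a Redis endpoint; the hostname below is a placeholder.
cache = redis.Redis(host="myapp-cache.example.use1.cache.amazonaws.com", port=6379)

def get_product(product_id, db):
    """Cache-aside: try the cache, fall back to the database, then populate the cache."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    row = db.fetch_product(product_id)      # hypothetical database call
    cache.setex(key, 300, json.dumps(row))  # cache for 5 minutes
    return row
```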

  • Amazon DynamoDB - A managed NoSQL database

    • You provision the throughput you want. You dial up the read and write performance you want to pay for.

    • Supports fast, predictable performance.

    • Fully distributed and fault tolerant. It exists in multiple Availability Zones.

    • It’s a key-value store. JSON is supported.

    • Documents up to 400KB in size are supported.

  • Amazon ElastiCache - a managed Memcached or Redis

    • Managing a memcached cluster isn’t making you more money so let Amazon do that for you. That’s the pitch.

    • The clusters are automatically scaled for you. It’s a self-healing infrastructure, if nodes fail new nodes are started automatically.

  • You can also lighten the load by shifting dynamic content to CloudFront.

    • A lot of people know CloudFront can handle static content, like files, but it can also handle some dynamic content. This topic is not discussed further in the talk, but here’s a link.

 

Auto Scaling

  • If you provision enough capacity to always handle your peak traffic load, Black Friday, for example, you are wasting money. It would be better to match compute power with demand. That’s what Auto Scaling lets you do: the automatic resizing of compute clusters.

  • You can define the minimum and maximum size of your pools. As a user you decide what’s the smallest number of instances in your cluster and the largest number of instances.

  • CloudWatch is a monitoring service that’s embedded into all AWS services and your applications.

    • CloudWatch events drive scaling.

    • Are you going to scale on CPU utilization? Are you going to scale on latency? On network traffic?

    • You can also push your own custom metrics into CloudWatch. If you want to scale on something application specific you can push that metric into CloudWatch and then tell Auto Scaling you want to scale on that metric.
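
A minimal boto3 sketch of an auto scaled web tier; every name, AMI, size, and threshold below is a placeholder, and the target tracking policy is just one of several policy types you could attach:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Describe how new instances are launched (AMI and type are placeholders).
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-launch-config",
    ImageId="ami-0123456789abcdef0",
    InstanceType="t2.micro",
)

# Keep between 2 and 10 instances across two AZs, registered with the ELB.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-launch-config",
    MinSize=2,
    MaxSize=10,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["myapp-elb"],
)

# Scale on a CloudWatch metric: here, keep average CPU utilization around 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="target-50-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0,
    },
)
```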

 

Users > 500,000+

  • The addition over the previous configuration is that auto scaling groups are added to the web tier. The auto scaling group includes the two AZs, but can expand to 3 AZs, up to as many as are in the same region. Instances can pop up in multiple AZs not just for scalability, but also for availability.

  • The example has 3 web tier instances in each AZ, but it could be thousands of instances. You could say you want a minimum of 10 instances and a maximum of 1000.

  • ElastiCache is used to offload popular reads from the database.

  • DynamoDB is used to offload Session data.

  • You need to add monitoring, metrics and logging.

    • Host level metrics. Look at a single instance within an auto scaling group and figure out what’s going wrong.

    • Aggregate level metrics. Look at metrics on the Elastic Load Balancer to get a feel for the performance of the entire set of instances.

    • Log analysis. Look at what the application is telling you using CloudWatch Logs. CloudTrail helps you analyze and manage API activity logs.

    • External Site Performance. Know what your customers are seeing as end users. Use a service like New Relic or Pingdom.
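
To complement the built-in host and ELB metrics, you can publish your own application metrics; a hedged sketch, with a made-up namespace and metric name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Push an application-specific metric; alarms and Auto Scaling policies can then act on it.
cloudwatch.put_metric_data(
    Namespace="MyApp",                       # hypothetical namespace
    MetricData=[{
        "MetricName": "CheckoutQueueDepth",  # hypothetical metric
        "Value": 42,
        "Unit": "Count",
    }],
)
```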

  • You need to know what your customers are saying. Is their latency bad? Are they getting an error when they go to your web page?

  • Squeeze as much performance as you can from your configuration. Auto Scaling can help with that. You don’t want systems that are at 20% CPU utilization.

 

Automation

  • The infrastructure is getting big; it can scale to 1000s of instances. We have read replicas and horizontal scaling, but we need some automation to help manage it all; we don’t want to manage each individual instance.

  • There’s a hierarchy of automation tools.

    • Do it yourself: Amazon EC2, AWS CloudFormation.

    • Higher-level services: AWS Elastic Beanstalk, AWS OpsWorks

  • AWS Elastic Beanstalk: manages the infrastructure for your application automatically. It’s convenient but there’s not a lot of control.

  • AWS OpsWorks: an environment where you build your application in layers, you use Chef recipes to manage the layers of your application.

    • Also enables the ability to do Continuous Integration and deployment.

  • AWS CloudFormation: has been around the longest.

    • Offers the most flexibility because it offers a templatized view of your stack. It can be used to build your entire stack or just components of the stack.

    • If you want to update your stack, you update the CloudFormation template and it will update just that one piece of your application.

    • Lots of control, but less convenient.
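
As an illustration (not from the talk), here is a hedged sketch of driving CloudFormation from boto3 with a deliberately tiny inline template; a real template would describe a whole tier or stack:

```python
import json
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# A tiny template: one S3 bucket. The bucket name is a placeholder.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "StaticAssetsBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "myapp-static-assets-example"},
        }
    },
}

cloudformation.create_stack(
    StackName="myapp-static-assets",
    TemplateBody=json.dumps(template),
)
```

Updating the stack later is an update_stack call with a modified template; CloudFormation works out and applies only the differences.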

  • AWS CodeDeploy: Deploys your code to a fleet of EC2 instances.

    • Can deploy to one or thousands of instances.

    • CodeDeploy can point to an auto scaling configuration so code is deployed to a group of instances.

    • Can also be used in conjunction with Chef and Puppet.

 

Decouple Infrastructure

  • Use SOA/microservices. Take components from your tiers and separate them out. Create separate services, just as you separated the web tier from the database tier.

  • The individual services can then be scaled independently. This gives you a lot of flexibility for scaling and high availability.

  • SOA is a key component of the architectures built by Amazon.

  • Loose coupling sets you free.

    • You can scale and fail components independently.

    • If a worker node fails in pulling work from SQS does it matter? No, just start another one. Things are going to fail, let’s build an architecture that handles failure.

    • Design everything as a black box.

    • Decouple interactions.

    • Favor services with built-in redundancy and scalability rather than building your own.

 

Don’t Reinvent the Wheel

  • Only invest in tasks that differentiate you as a business.

  • Amazon has a lot of services that are inherently fault tolerant because they span multiple AZs. For example: queuing, email, transcoding, search, databases, monitoring, metrics, logging, compute. You don’t have to build these yourself.

  • SQS: queueing service.

    • The first Amazon service offered.

    • It spans multiple AZs so it’s fault tolerant.

    • It’s scalable, secure, and simple.

    • Queuing helps you pass messages between the different components of your infrastructure.

    • Take for example a Photo CMS. The system that collects the photos and the system that processes them should be two different systems. They should be able to scale independently. They should be loosely coupled. Ingest a photo, put it in a queue, and workers can pull photos off the queue and do something with them.
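
A rough sketch of that decoupling with boto3 and SQS; the queue name, message fields, and process_photo call are hypothetical:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="photo-processing")["QueueUrl"]

# Ingest side: record that a new photo needs processing.
def enqueue_photo(bucket, key):
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=json.dumps({"bucket": bucket, "key": key}))

# Worker side: pull photos off the queue and process them; workers scale independently.
def work():
    while True:
        response = sqs.receive_message(QueueUrl=queue_url,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)  # long polling
        for message in response.get("Messages", []):
            photo = json.loads(message["Body"])
            process_photo(photo["bucket"], photo["key"])    # hypothetical processing step
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=message["ReceiptHandle"])
```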

  • AWS Lambda: lets you run code without provisioning or managing servers.

    • Great tool for allowing you to decouple your application.

    • In the Photo CMS example Lambda can respond to S3 events, so when a file is added to S3 the Lambda function that processes it is automatically triggered.

    • We’ve done away with EC2. It scales out for you and there’s no OS to manage.
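
A hedged sketch of such a Lambda handler in Python; the event shape is the standard S3 notification format, and resize_photo is a placeholder for whatever processing you do:

```python
import urllib.parse

def handler(event, context):
    """Triggered by S3 ObjectCreated events; there is no EC2 instance or OS to manage."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        resize_photo(bucket, key)  # hypothetical processing step

def resize_photo(bucket, key):
    # Placeholder: download from S3, resize, write the thumbnail back, etc.
    print(f"would process s3://{bucket}/{key}")
```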

 

Users > 1,000,000+

  • Reaching a million users and above requires bits of all the previous points:

    • Multi-AZ

    • Elastic Load Balancing between tiers. Not just on the web tier, but also on the application tier, data tier, and any other tier you have.

    • Auto Scaling

    • Service Oriented Architecture

    • Serve Content Smartly with S3 and CloudFront

    • Put caching in front of the DB

    • Move state off the web tier.

  • Use Amazon SES to send email.

  • Use CloudWatch for monitoring.

 

Users > 10,000,000+

  • As we get bigger we’ll hit issues in the data tier. You will potentially start to run into issues with your database around contention with the write master, which basically means you can only send so much write traffic to one server.

  • How do you solve it?

    • Federation

    • Sharding

    • Moving some functionality to other types of DBs (NoSQL, graph, etc)

  • Federation - splitting into multiple DBs based on function

    • For example, create a Forums Database, a User Database, a Products Database. You might have had these in a single database before, now spread them out.

    • The different databases can be scaled independently of each other.

    • The downsides: you can’t do cross database queries; it delays getting to the next strategy, which is sharding.

  • Sharding -  splitting one dataset across multiple hosts

    • More complex at the application layer, but there’s no practical limit on scalability.

    • For example, in a Users Database, one third of the users might be sent to one shard, another third to a second shard, and the last third to a third shard.
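
A very simplified sketch of shard selection at the application layer; the modulo scheme and connection map are illustrative only, and real systems often use consistent hashing or a directory service so shards can be rebalanced:

```python
# Hypothetical map of shard number to database host.
SHARDS = {
    0: "users-shard-0.example.internal",
    1: "users-shard-1.example.internal",
    2: "users-shard-2.example.internal",
}

def shard_for_user(user_id: int) -> str:
    """Each user lives on exactly one shard; the application must route every query there."""
    return SHARDS[user_id % len(SHARDS)]

# A given user ID always maps to the same shard.
print(shard_for_user(1234567))
```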

  • Moving some functionality to other types of DBs

    • Start thinking about a NoSQL database.

    • If you have data that doesn’t require complex joins, like say a leaderboard, rapid ingest of clickstream/log data, temporary data, hot tables, metadata/lookup tables, then consider moving it to a NoSQL database.

    • This means they can be scaled independently of each other.

 

Users > 11 Million

  • Scaling is an iterative process. As you get bigger there's always more you can do.

  • Fine tune your application.

  • More SOA of features/functionality.

  • Go from Multi-AZ to multi-region.

  • Start to build custom solutions to solve your particular problem that nobody has ever done before. If you need to serve a billion customers you may need custom solutions.

  • Deep analysis of your entire stack.

 

In Review

  • Use a multi-AZ infrastructure for reliability.

  • Make use of self-scaling services like ELB, S3, SQS, SNS, DynamoDB, etc.

  • Build in redundancy at every level. Scalability and redundancy are not two separate concepts, you can often do both at the same time.

  • Start with a traditional relational SQL database.

  • Cache data both inside and outside your infrastructure.

  • Use automation tools in your infrastructure.

  • Make sure you have good metrics/monitoring/logging in place. Make sure you are finding out what your customers experience with your application.

  • Split tiers into individual services (SOA) so they can scale and fail independently of each other.

  • Use Auto Scaling once you’re ready for it.

  • Don’t reinvent the wheel, use a managed service instead of coding your own, unless it’s absolutely necessary.

  • Move to NoSQL if and when it makes sense.

 


Reader Comments (19)

Nice article. Do correct the link for start-ups (opening to http://aws.amazon.comstart-ups, should be http://aws.amazon.com/start-ups/).

January 12, 2016 | Unregistered Commenteranil

This is an excellent, approachable article even for scalability professionals. Well done

January 12, 2016 | Unregistered CommenterPush

This setup would cost an insane amount of $ per month. AWS is a great way to feed VC dollars into Amazon's bank account indirectly.

People need to stop falling for the hype of overpriced VMs and get back to bare metal solutions if they're actually serious about TCO/ROI.

January 12, 2016 | Unregistered CommenterPatrick Barnes

This is very application dependent, but what range of transactions/requests per second can be expected for each level of users?

January 12, 2016 | Unregistered CommenterMike

@Patrick

Yeah, but no...

We've run an absurd amount of cost analysis on precisely this topic. In virtually every area of our system operations, AWS presents us with significant savings. Some AWS services cost as little as 1/3 of what we can do it for internally.

Whenever you see the entire world moving towards a certain trend, especially when that "world" is made primarily of smart people and millionaires (who REALLY like to keep/make money), you should assume that it's probably not just a "hype" or "fad".

Five years ago the numbers were different, but AWS has cut its prices in HALF two or more times since then. Unless your calculator is broken, I would find it hard to arrive at any other conclusion.

My experience has been, mainly, with AWS, but Google's cloud is comparably priced and I assume that the price difference on the other IaaS and PaaS providers is negligible.

So, my advice: Look closer.. then jump on the wagon or start planning for retirement

January 13, 2016 | Unregistered CommenterLuke Chavers

The problem with AWS CloudFormation has been the unwieldiness of large JSON scripts - no consistent way of adding comments, a klunky modularization story, and no backwards referencing.

We've been getting around this by writing our templates in YAML, and then using YAML Stratus for converting to the required JSON. YAML supports comments and backward references, and YAML Stratus adds extensions for including other YAML scripts with overriding, as well as compile-time parameterization.

January 14, 2016 | Unregistered CommenterShannon

@Luke Chavers

Yeah, but you're wrong.

AWS is always more expensive than bare metal in a colo, anywhere from 10-100x. Even worse when deploying a complex setup like this which most apps never need.

Modern hardware is pretty fast; a basic mid-size server can support millions of requests per day, and doubling up for high availability is cheap. Even in the cloud, it's better to just stick to basic VMs rather than all the managed services unless you really need them.

Judging by your comment, it looks like it's your calculator that's broken or you're just working for AWS.

January 15, 2016 | Unregistered CommenterSpin

I kinda disagree with the RDS point. DON'T USE RDS if you have even basic db-administration skills. Amazon's RDS uses EBS drives for databases, which is very, very slow. A much better way would be to use instance-store SSD drives for temporary stuff you don't need to persist (tempdb, transaction logs (if you know what you're doing and how to restore a db without a log), etc.).

Other than this tiny comment of mine - awesome post, thanks

January 17, 2016 | Unregistered CommenterAlex

Hi,

I have used RDS for one of my applications, which deals with processing millions of real-time social media records. With all optimizations, we could only use MySQL on a medium Amazon instance to handle one or two weeks of data. When the volume of data grew, RDS helped with our relational database design.

The thing is, it is pretty costly compared with normal solutions, but it is worth the spend when you have to deal with a potentially large volume of data.

January 19, 2016 | Unregistered CommenterAnas

lol @ colo. Sure, colo your metal if you like downtime and expensive bandwidth bills!

January 22, 2016 | Unregistered CommenterLG

Hey @Alex: This [1] says Amazon RDS does support SSDs ?

[1] http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html

January 27, 2016 | Unregistered CommenterAli Gajani

Scaling on AWS is what really worries me, and this is why I love Cloudways, as they have a slider to scale the server size up or down.

March 15, 2016 | Unregistered CommenterAzaz

Very nice article, but I would temper the advice that we should start with a SQL DB and then switch what is possible to NoSQL. Transforming from SQL to NoSQL can be very painful (model transformation, business code rewriting, data migration, ...) and so risky that you will never do it.

So I would advise the following: if you suspect that your application will support more than 100,000 users (and if your business is NoSQL compatible, for sure), go directly with a NoSQL database. You will be able to shard easily, developing for NoSQL is not more expensive, and your application will scale naturally with no need for a major architectural redefinition.

Jean-Marc

July 21, 2016 | Unregistered CommenterJM. COLLIN

The thing about colo of bare metal is that you need to consider *all* the costs. Most people who decide to go bare metal woefully underestimate the time and effort required, the least of which is the hardware (which will still, however, be far more expensive than you think, assuming you go with professional quality server hardware rather than junk from your local recycling yard). In particular, you will need an implementation team to do the initial deployment, then at *least* one person on staff who does nothing but the care and feeding of the servers and the virtualization infrastructure on top of them, and that's $150K+ per year just for that one person -- and that's assuming you can stop with one person. Then there's the cost of fiber interconnects between racks in different sections of the data center (data centers typically have power and networking zones -- where racks are on different UPS and routers -- but if you want fast 10 gigabit interconnection between your two outposts, you have to pay $$$ for it). And of course the cost of the hardware itself -- 10 gigabit switches to interconnect database servers and compute servers, the 10 gigabit cards needed, and good professional quality servers and enterprise grade SSDs are *not* cheap. I've worked out the crossover point for our own business where it will be cheaper to do it inhouse rather than via Amazon, and while that crossover point is at far less than 10 million users, it's definitely at more than what we're spending now on AWS services.

Regarding NoSQL, it's good to consider it for large quantities of largely-unstructured data such as, for example, log data (e.g. ElasticSearch), but the ACID guarantees of modern SQL databases are very important for many things. Attempting to sideways-hack those into NoSQL databases generally results in far more time spent than simply optimizing and scaling your SQL database to where it needs to be. Twitter handles 150 million active users with MySQL. Granted, with Redis caches for your timelines and some custom replication code, but still: MySQL.

August 15, 2016 | Unregistered CommenterBadtux

This is a pretty awesome post! I think it might be more effective at some point to work on efficiency of the website than the scale of the servers. For instance, some people say they can handle millions of pageviews a day on tiny VPS servers by using microcaching and nginx.

But then again, I've never had a large server load so what do I know.

August 26, 2016 | Unregistered CommenterJ.T. Smith

Very useful information. Please keep updating us.

June 6, 2017 | Unregistered Commentervamshi Krishna

Thanks. I am a beginner in this field. I use a Linode VPS for hosting since AWS is expensive. This will be very helpful for me in my context. However, if you have experience with that or other cheap services, please share with us, because Linode does not have some of the automated services AWS has.

December 26, 2018 | Unregistered CommenterNamal

There is something people are not addressing: Amazon may be more expensive, but they do listen to sellers and buyers whenever fraud is suspected or involved, something which eBay does not.

March 2, 2020 | Unregistered CommenterBachi

Super clear, well written. Thanks

April 10, 2020 | Unregistered Commentersriram
