High Scalability -

6 Comments |

Permalink |

Example

Monday

Nov092015

A 360 Degree View of the Entire Netflix Stack

Monday, November 9, 2015 at 8:56AM

This is a guest repost by Chris Ueland, creator of Scale Scale, with a creative high level view of the Netflix stack.

As we research and dig deeper into scaling, we keep running into Netflix. They are very public with their stories. This post is a round up that we put together with Bryan’s help. We collected info from all over the internet. If you’d like to reach out with more info, we’ll append this post. Otherwise, please enjoy!

–Chris / ScaleScale / MaxCDN

A look at what we think is interesting about how Netflix Scales

20 Comments |

Permalink |

Example

Monday

Nov022015

How Shopify Scales to Handle Flash Sales from Kanye West and the Superbowl

Monday, November 2, 2015 at 8:56AM

This is a guest repost by Christophe Limpalair, creator of Scale Your Code.

In this article, we take a look at methods used by Shopify to make their platform resilient. Not only is this interesting to read about, but it can also be practical and help you with your own applications.

Shopify's Scaling Challenges

Shopify, an ecommerce solution, handles about 300 million unique visitors a month, but as you'll see, these 300M people don't show up in an evenly distributed fashion.

One of their biggest challenge is what they call "flash sales". These flash sales are when tremendously popular stores sell something at a specific time.

For example, Kanye West might sell new shoes. Combined with Kim Kardashian, they have a following of 50 million people on Twitter alone.

They also have customers who advertise on the Superbowl. Because of this, they have no idea how much traffic to expect. It could be 200,000 people showing up at 3:00 for a special sale that ends within a few hours.

How does Shopify scale to these sudden increases in traffic? Even if they can't scale that well for a particular sale, how can they make sure it doesn't affect other stores? This is what we will discuss in the next sections, after briefly explaining Shopify's architecture for context.

Shopify's Architecture

1 Comment |

Permalink |

Example

Wednesday

Oct282015

Five Lessons from Ten Years of IT Failures

Wednesday, October 28, 2015 at 8:56AM

IEEE Spectrum has a wonderful article series on Lessons From a Decade of IT Failures. It’s not your typical series in that there are very cool interactive graphs and charts based on data collected from past project failures. They are really fun to play with and I can only imagine how much work it took to put them together.

The overall takeaway of the series is:

Even given the limitations of the data, the lessons we draw from them indicate that IT project failures and operational issues are occurring more regularly and with bigger consequences. This isn’t surprising as IT in all its various forms now permeates every aspect of global society. It is easy to forget that Facebook launched in 2004, YouTube in 2005, Apple’s iPhone in 2007, or that there has been three new versions of Microsoft Windows released since 2005. IT systems are definitely getting more complex and larger (in terms of data captured, stored and manipulated), which means not only are they increasing difficult and costly to develop, but they’re also harder to maintain.

Here are the specific lessons:

Segment: Rebuilding Our Infrastructure with Docker, ECS, and Terraform

Monday, October 19, 2015 at 8:56AM

This is a guest repost from Calvin French-Owen, CTO/Co-Founder of Segment.

In Segment’s early days, our infrastructure was pretty hacked together. We provisioned instances through the AWS UI, had a graveyard of unused AMIs, and configuration was implemented three different ways.

As the business started taking off, we grew the size of the eng team and the complexity of our architecture. But working with production was still limited to a handful of folks who knew the arcane gotchas. We’d been improving the process incrementally, but we needed to give our infrastructure a deeper overhaul to keep moving quickly.

So a few months ago, we sat down and asked ourselves: “What would an infrastructure setup look like if we designed it today?”

Over the course of 10 weeks, we completely re-worked our infrastructure. We retired nearly every single instance and old config, moved our services to run in Docker containers, and switched over to use fresh AWS accounts.

We spent a lot of time thinking about how we could make a production setup that’s auditable, simple, and easy to use–while still allowing for the flexibility to scale and grow.

Here’s our solution.

Separate AWS Accounts

2 Comments |

Permalink |

Example

Monday

Oct122015

Making the Case for Building Scalable Stateful Services in the Modern Era

Monday, October 12, 2015 at 8:56AM

For a long time now stateless services have been the royal road to scalability. Nearly every treatise on scalability declares statelessness as the best practices approved method for building scalable systems. A stateless architecture is easy to scale horizontally and only requires simple round-robin load balancing.

What’s not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching layer required to hide database latency problems. Or even the troublesome consistency issues.

But what of stateful services? Isn’t preserving identity by shipping functions to data instead of shipping data to functions a better approach? It often is, but we don’t hear much about how to build stateful services. In fact, do a search and there’s very little in the way of a systematic approach to building stateful services. Wikipedia doesn’t even have an entry for stateful service.

Caitie McCaffrey, Tech Lead for Observability at Twitter, is fixing all that with a refreshing talk she gave at the Strange Loop conference on Building Scalable Stateful Services (slides).

Refreshing because I’ve never quite heard of building stateful services in the way Caitie talks about building them. You’ll recognize most of the ideas--Sticky Sessions, Data Shipping Paradigm, Function Shipping Paradigm, Data Locality, CAP, Cluster Membership, Gossip Protocols, Consistent Hashing, DHT---but she weaves them around the theme of building stateful services in a most compelling way.

The highlight of the talk for me is when Caitie ties the whole talk together around the discussion of her experiences developing Halo 4 using Microsoft’s Orleans on top of Azure. Orleans doesn’t get enough coverage. It’s based on an inherently stateful distributed virtual Actor model; a highly available Gossip Protocol is used for cluster membership; and a two tier system of Consistent Hashing plus a Distributed Hash Table is used for work distribution. With this approach Orleans can rebalance a cluster when a node fails, or capacity is added/contracted, or a node becomes hot. The result is Halo was able to run a stateful Orleans cluster in production at 90-95% CPU utilization across the cluster.

Orleans isn't the only example system covered. Facebook's Scuba and Uber's Ringpop are also analyzed using Caitie's stateful architecture framework. There's also a very interesting section on how Facebook cleverly implements fast database restarts for large in-memory databases by decoupling the memory lifetime from the process lifetime.

So let’s jump in and learn how to build stateful services...

Stateless Services are Wasteful

10 Comments |

Permalink |

Example

Wednesday

Oct072015

Zappos's Website Frozen for Two Years as it Integrates with Amazon

Wednesday, October 7, 2015 at 8:56AM

Here's an interesting nugget from a wonderfully written and deeply interesting article by Roger Hodge in the New Republic: A radical experiment at Zappos to end the office workplace as we know it:

Zappos's customer-facing web site has been basically frozen for the last few years while the company migrates its backend systems to Amazon's platforms, a multiyear project known as Supercloud.

It's a testament to Zappos that they still sell well with a frozen website while most of the rest of the world has adopted a model of continuous deployment and constant evolution across multiple platforms.

Amazon is requiring the move, otherwise a company like Zappos would probably be sensitive to the Conway's law implication of such a deep integration. Keep in mind Facebook is reportedly keeping WhatsApp and Instagram independent. This stop the world plan must mean something, unfortunately I don't have the strategic insight to understand why this might be. Any thoughts?

The article has more tantalizing details about what's going on with the move:

9 Comments |

Permalink |

Example

Monday

Sep282015

How Facebook Tells Your Friends You're Safe in a Disaster in Under Five Minutes

Monday, September 28, 2015 at 8:56AM

In a disaster there’s a raw and immediate need to know your loved ones are safe. I felt this way during 9/11. I know I’ll feel this way during the next wild fire in our area. And I vividly remember feeling this way during the 1989 Loma Prieta earthquake.

Most earthquakes pass beneath notice. Not this one and everyone knew it. After ceiling tiles stopped falling like snowflakes in the computer lab, we convinced ourselves the building would not collapse, and all thoughts turned to the safety of loved ones. As it must have for everyone else. Making an outgoing call was nearly impossible, all the phone lines were busy as calls poured into the Bay Area from all over the nation. Information was stuck. Many tense hours were spent in ignorance as the TV showed a constant stream of death and destruction.

It’s over a quarter of a century later, can we do any better?

Facebook can. Through a product called Safety Check, which connects friends and loved ones during a disaster. When a disaster hits Safety Check prompts people in the area to indicate if they are OK or not. Then Facebook closes the worry loop by telling their friends how they are doing.

Brian Sa, Engineer Manager at Facebook, created Safety Check out of his experience of the devastating earthquake in Fukushima Japan in 2011. He told his very moving story in a talk he gave at @Scale.

During the earthquake Brian put a banner on Facebook with helpful information sources, but he was moved to find a better way to help people in need. That impulse became Safety Check.

My first reaction to Safety Check was damn, why didn’t anyone think of this before? It’s such a powerful idea.

The answer became clear as I listened to a talk in the same video given by Peter Cottle, Software Engineer at Facebook, who also talked about building Safety Check.

It’s likely only Facebook could have created Safety Check. This observation dovetails nicely with Brian’s main lesson in his talk:

Solve real-world problem in a way that only YOU can. Instead of taking the conventional route, think about the unique role you and your company can play.

Only Facebook could create Safety Check, not because of resources as you might expect, but because Facebooks lets employees build crazy things like Safety Check and because only Facebook has 1.5 billion geographically distributed users, with a degree of separation between them of only 4.74 edges, and only Facebook has users who are fanatical about reading their news feeds. More about this later.

In fact, Peter talked about how resources were a problem in a sort of product development Catch-22 at Facebook. The team for Safety Check was small and didn’t have a lot of resources attached to it. They had to build the product and prove its success without resources before they could get the resources to build the product. The problem had to be efficiently solved at scale without the application of lots of money and lots of resources.

As is often the case constraints led to a clever solution. A small team couldn’t build a big pipeline and index, so they wrote some hacky PHP and effectively got the job done at scale.

So how did Facebook build Safety Check? Here’s my gloss on both Brian’s and Peter’s talks:

Uber Goes Unconventional: Using Driver Phones as a Backup Datacenter

Monday, September 21, 2015 at 8:56AM

In How Uber Scales Their Real-Time Market Platform one of the most intriguing hints was how Uber handles datacenter failovers using driver phones as an external distributed storage system for recovery.

Now we know a lot more about how that system works from Uber's Nikunj Aggarwal and Joshua Corbin, who gave a very interesting talk at the @Scale conference: How Uber Uses your Phone as a Backup Datacenter.

Rather than use a traditional backend replication scheme where databases sync state between datacenters to achieve a measure of k-safety, Uber did something different, what they do is store enough state on driver phones so that if a datacenter failover occurs trip information can not be lost on the failover.

Why choose this approach? The traditional approach would be much simpler. I think it is to make sure the customer always has a good customer experience and losing trip information for an active trip would make for a horrible customer experience.

By building their syncing strategy around the phone, even thought it's complicated and takes a lot work, Uber is able to preserve trip data and make for a seamless customer experience even on datacenter failures. And making the customer happy is what counts, especially in a market with near zero switching costs.

So the goal is not to lose trip information, even on a datacenter failover. Using a traditional database replication strategy it would not be possible to make this guarantee for reasons that have parallels to how network management systems have always had to work. Let me explain.

In a network devices are the authoritative source for state information like packet errors, alarms, packets sent and received, and so on. The network management system is authoritative for configuration data like alarm thresholds and customer information. The complication is devices and the network management system are not always in contact, so they get out of sync because they work independently of each other. Which means on bootup, failover, and communication reconnection all this information has to be merged in both directions using a complicated dance that ensures correctness and consistency.

Uber has the same problem, only the devices are smart phones and the authoritative state the phone contains is trip information. So on bootup, failover, and communication reconnection the trip information must be preserved because the phone is the authoritative source for trip information.

Even when connectivity is lost the phone has an accurate record all trip data. So you wouldn't want to sync trip data from the datacenter down to the phone because that would wipe out the correct data on the phone. The correct information must come from the phone.

Uber also takes another trick from network management systems. They periodically query phones to test the integrity of information in the datacenter.

Let's see how they do it...

Motivation for Using Phones as Storage for Datacenter Failure

4 Comments |

Permalink |

Example

Monday

Sep142015

How Uber Scales Their Real-time Market Platform

Monday, September 14, 2015 at 8:56AM

Reportedly Uber has grown an astonishing 38 times bigger in just four years. Now, for what I think is the first time, Matt Ranney, Chief Systems Architect at Uber, in a very interesting and detailed talk--Scaling Uber's Real-time Market Platform---tells us a lot about how Uber’s software works.

If you are interested in Surge pricing, that’s not covered in the talk. We do learn about Uber’s dispatch system, how they implement geospatial indexing, how they scale their system, how they implement high availability, and how they handle failure, including the surprising way they handle datacenter failures using driver phones as an external distributed storage system for recovery.

The overall impression of the talk is one of very rapid growth. Many of the architectural choices they’ve made are a consequence of growing so fast and trying to empower recently assembled teams to move as quickly as possible. A lot of technology has been used on the backend because their major goal has been for teams to get the engineering velocity as high as possible.

After a understandably chaotic (and very successful) start it seems Uber has learned a lot about their business and what they really need to succeed. Their early dispatch system was a typical just make it work type affair that assumed at a deep level it was moving only people. Now that Uber’s mission has grown to handle boxes and groceries as well as people, their dispatch system has been abstracted and put on very solid and smart architectural foundation.

Though Matt thinks their architecture might be a little crazy, the idea of using a consistent hash ring with a gossip protocol seems spot on for their use case.

It’s hard not to be captivated for Matt’s genuine enthusiasm for what he’s working on. When talking about DISCO, their dispatch system, he says in an excited tone that it’s like the traveling salesman problem from school. It’s a cool Computer Science thing. Even though the solution isn’t optimal, it’s the traveling salesman at an interesting scale, in real-time, in the real-world, built out of fault tolerant scalable components. How cool is that?

So let’s see how Uber works on the inside. Here’s my gloss on Matt’s talk:

Stats

2 Comments |

Permalink |