How Google Serves Data from Multiple Datacenters
Update: Streamy Explains CAP and HBase's Approach to CAP. "We plan to employ inter-cluster replication, with each cluster located in a single DC. Remote replication will introduce some eventual consistency into the system, but each cluster will continue to be strongly consistent."
Ryan Barrett, Google App Engine datastore lead, gave this talk Transactions Across Datacenters (and Other Weekend Projects) at the Google I/O 2009 conference.
While the talk doesn't necessarily break new technical ground, Ryan does an excellent job explaining and evaluating the different options you have when architecting a system to work across multiple datacenters. This is called multihoming, operating from multiple datacenters simultaneously.
As multihoming is one of the most challenging tasks in all computing, Ryan's clear and thoughtful style comfortably leads you through the various options. On the trip you learn how the options trade off the properties you would ideally want:
- lowish latency writes
- datacenter failure survival
- strong consistency guarantees.
There's still a lot more to learn. Here's my gloss on the talk:
Consistency - What do you see when you read after a write? Read/write data is one of the hardest kinds of data to serve across datacenters, because users expect a certain level of reliability and consistency.
Transactions - An extended form of consistency across multiple operations.
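To make the consistency question concrete, here's a toy Python sketch (my illustration, not from the talk) of read-after-write behavior: a primary applies writes immediately, while an asynchronously updated replica lags behind, so a read routed to the replica can return stale data.

```python
# Toy illustration: read-after-write under asynchronous replication.
# The primary sees its own write immediately; a lagging replica does not.

class Primary:
    def __init__(self):
        self.data = {}
        self.log = []          # replication log shipped to replicas later

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, log):
        for key, value in log:
            self.data[key] = value

primary, replica = Primary(), Replica()
primary.write("balance", 100)

print(primary.data.get("balance"))   # 100  -- strongly consistent read (primary)
print(replica.data.get("balance"))   # None -- stale read; replication hasn't run yet

replica.apply(primary.log)           # asynchronous replication catches up
print(replica.data.get("balance"))   # 100  -- eventually consistent
```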
Why Operate in Multiple Datacenters?
Why Not Operate in Multiple Datacenters?
Your Different Architecture Options
- Better, but not great.
- Data are usually replicated asynchronously so there's a window of vulnerability for loss.
- Data in your other datacenters may not be consistent on failure.
- Popular with financial institutions.
- You get geolocation to serve reads. Consistency depends on the technique. Writes are still limited to one datacenter.
- So some choose to do it with just two datacenters. NASDAQ has two datacenters close together (low latency) and performs a two-phase commit on every transaction, but they have very strict latency requirements.
- Using more than two datacenters is fundamentally harder. You pay for it with queuing delays, routing delays, and the speed of light. You have to talk between datacenters, which is just fundamentally slower, with a smaller pipe. You may pay for it with capacity and throughput, but you'll definitely pay in latency (a rough calculation follows this list).
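To see why geography alone hurts, here's a rough back-of-the-envelope sketch (my numbers, not Ryan's): light in fiber covers roughly 200 km per millisecond, which puts a hard floor under every cross-datacenter round trip before routers and queues add their share.

```python
# Back-of-envelope latency floor for cross-datacenter coordination.
# Assumption: signal propagation in fiber is ~200,000 km/s (about 2/3 c).

SPEED_IN_FIBER_KM_PER_MS = 200.0

def min_round_trip_ms(distance_km, round_trips=1):
    """Lower bound on latency for `round_trips` exchanges over `distance_km`."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS * round_trips

# Two nearby datacenters (NASDAQ-style, a few km apart): well under a millisecond.
print(f"{min_round_trip_ms(6):.2f} ms")                    # ~0.06 ms

# Coast-to-coast US (~4,000 km): tens of milliseconds per round trip.
print(f"{min_round_trip_ms(4000):.1f} ms")                 # ~40 ms

# A protocol needing two cross-country round trips per write:
print(f"{min_round_trip_ms(4000, round_trips=2):.1f} ms")  # ~80 ms
```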
How Do You Actually Do This?
What are the techniques and tradeoffs of the different approaches? Here's the evaluation matrix:

|              | Backups | M/S       | MM         | 2PC        | Paxos      |
|--------------|---------|-----------|------------|------------|------------|
| Consistency  | Weak    | Eventual  | Eventual   | Strong     | Strong     |
| Transactions | No      | Full      | Local      | Full       | Full       |
| Latency      | Low     | Low       | Low        | High       | High       |
| Throughput   | High    | High      | High       | Low        | Medium     |
| Data loss    | Lots    | Some      | Some       | None       | None       |
| Failover     | Down    | Read-only | Read/Write | Read/Write | Read/Write |
- M/S = master/slave, MM = multi-master, 2PC = two-phase commit
- What kind of consistency, transactions, latency, and throughput do we get for a particular approach? Will we lose data on failure? How much will we lose? When we fail over for maintenance, or we want to move things (say, decommissioning a datacenter), how well do the techniques support it?
- Replication is asynchronous so good for latency and throughput.
- Weak/eventual consistency unless you are very careful.
- You have multiple copies in the datacenters, so you'll lose a little data on failure, but not much. Failover can go read-only until the master has been moved to another datacenter.
- The Datastore currently uses this mechanism. True multihoming adds latency because of the extra hop between datacenters. App Engine is already slow on writes, so this extra hit would be painful. M/S gives you most of the benefits of the better forms while still offering lower-latency writes (see the sketch after these bullets).
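Here's a minimal sketch of the master/slave tradeoff as described above (the class and its behavior are my simplification, not App Engine's implementation): writes are acknowledged before replication, so they're fast, but unshipped writes are lost if the master's datacenter fails, and the surviving replica goes read-only during failover.

```python
# Minimal asynchronous master/slave sketch: low-latency writes, a small
# window of data loss on master failure, and read-only failover.

class MasterSlavePair:
    def __init__(self):
        self.master = {}
        self.slave = {}
        self.unshipped = []        # writes not yet replicated
        self.read_only = False

    def write(self, key, value):
        if self.read_only:
            raise RuntimeError("failed over: writes rejected until a new master exists")
        self.master[key] = value
        self.unshipped.append((key, value))   # ack before replication => low latency

    def replicate(self):
        for key, value in self.unshipped:
            self.slave[key] = value
        self.unshipped.clear()

    def fail_master(self):
        lost = list(self.unshipped)           # the window of vulnerability
        self.master, self.unshipped = {}, []
        self.read_only = True                 # slave serves reads until promoted
        return lost

pair = MasterSlavePair()
pair.write("a", 1)
pair.replicate()
pair.write("b", 2)                 # acknowledged, but not yet replicated
print(pair.fail_master())          # [('b', 2)] -- a little data lost, but not much
print(pair.slave)                  # {'a': 1}   -- reads still served, read-only
```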
- You figure out how to merge all the writes later when there's a conflict. It's like asynchronous replication, but you are serving writes from multiple locations.
- The best you can do is eventual consistency. Writes don't immediately go everywhere. This is a paradigm shift. With backups and M/S we've assumed a strongly consistent system; they don't change the programming model, they are just techniques to help us multihome. Multi-master literally changes how the system runs because the multiple writes must be merged.
- To do the merging you must find a way to serialize, that is, impose an ordering on, all your writes. There is no global clock. Things happen in parallel. You can never know what happened first. So you make up an ordering using timestamps, local timestamps + skew, local version numbers, or a distributed consensus protocol. This is the magic, and there are a number of ways to do it (one simple approach is sketched after these bullets).
- There's no way to do a global transaction. With multiple simultaneous writes you can't guarantee transactions. So you have to figure out what to do afterward.
- AppEngine wants strong consistency to make building applications easier, so they didn't consider this option.
- Failover is easy because each datacenter can handle writes.
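As one concrete example of the merging described above, here's a hedged last-write-wins sketch that orders concurrent writes by (timestamp, datacenter id); the tiebreak rule and data layout are my assumptions, and real systems may instead use version vectors or a consensus protocol.

```python
# Last-write-wins merge for multi-master replication: impose a total order
# on writes using (timestamp, datacenter_id) and keep the "latest" one.

def merge(replicas):
    """Merge per-datacenter write logs into one converged view.

    Each replica is a dict: key -> (timestamp, datacenter_id, value).
    """
    merged = {}
    for replica in replicas:
        for key, (ts, dc, value) in replica.items():
            current = merged.get(key)
            # Order by timestamp, then by datacenter id to break ties.
            if current is None or (ts, dc) > (current[0], current[1]):
                merged[key] = (ts, dc, value)
    return merged

# Two datacenters accepted writes to the same key concurrently.
us_east = {"profile:42": (1001, "us-east", "alice@old.example")}
eu_west = {"profile:42": (1003, "eu-west", "alice@new.example")}

converged = merge([us_east, eu_west])
print(converged["profile:42"][2])   # 'alice@new.example' -- the later write wins
```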
- Semi-distributed because there's always a master coordinator for a given 2PC transaction. Because there are so few datacenters you tend to go through the same set of master coordinators.
- It's synchronous. All transactions are serialized through that master which kills your throughput and increases latency.
- They never seriously considered this option because write throughput is very important to them. A single point of failure or serialization point wouldn't work for them. Latency is high because of the extra coordination; writes can be in the 200 msec range.
- This option does work though. You write to all datacenters or nothing. You get strong consistency and transactions (a minimal sketch follows this list).
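Here's a minimal two-phase commit sketch under my own simplifications (no durable logs, timeouts, or coordinator recovery) showing the "all datacenters or nothing" behavior and the single coordinator that drives each transaction.

```python
# Two-phase commit across datacenters: phase 1 asks everyone to prepare,
# phase 2 commits only if every participant voted yes, otherwise aborts.

class Participant:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.staged = {}       # prepared-but-uncommitted writes (locks held)
        self.committed = {}

    def prepare(self, key, value):
        if not self.healthy:
            return False                # vote "no"
        self.staged[key] = value
        return True                     # vote "yes"

    def commit(self, key):
        self.committed[key] = self.staged.pop(key)

    def abort(self, key):
        self.staged.pop(key, None)

def two_phase_commit(participants, key, value):
    votes = [p.prepare(key, value) for p in participants]   # phase 1
    if all(votes):
        for p in participants:                               # phase 2: commit
            p.commit(key)
        return "committed"
    for p in participants:                                   # phase 2: abort
        p.abort(key)
    return "aborted"

dcs = [Participant("dc1"), Participant("dc2"), Participant("dc3")]
print(two_phase_commit(dcs, "order:7", "paid"))      # committed everywhere
dcs[2].healthy = False
print(two_phase_commit(dcs, "order:8", "shipped"))   # aborted -- all or nothing
```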
- Need N+1 datacenters. If you take one down then you still have N to handle your load.
- Protocol: there's a propose step and then an agree step. Only a majority of nodes needs to agree for something to be considered persisted (a simplified sketch follows these bullets).
- Unlike 2PC it is fully distributed. There's no single master coordinator.
- Multiple transactions can be run in parallel. There's less serialization.
- Writes have high latency because of the two extra coordination round trips required by the protocol.
- They wanted to do this, but they didn't want to pay the 150 msec latency hit on writes, especially when competing against 5 msec writes for RDBMSes.
- They tried using physically close datacenters, but the built-in multi-datacenter overhead (routers, etc.) was too high. Even within the same datacenter it was too slow.
- Paxos is still used a ton within Google, especially for lock servers and for coordinating anything they do across datacenters, especially when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another, that coordination is done through Paxos. It's also used in managing memcache and offline processing.
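For the curious, here's a heavily simplified single-decree Paxos sketch (my own toy, omitting failures, retries, and durable state). It shows the two round trips mentioned above, prepare/promise and accept/agree, and why only a majority of acceptors needs to respond for a value to be considered chosen.

```python
# Single-decree Paxos, radically simplified: a proposer needs a majority of
# promises (round trip 1) and a majority of accepts (round trip 2).

class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.accepted = None          # (proposal_number, value) or None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1

    # Round trip 1: prepare/promise.
    responses = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in responses if ok]
    if len(granted) < majority:
        return None
    # Safety rule: if any acceptor already accepted a value, propose that value.
    already_accepted = [acc for acc in granted if acc is not None]
    if already_accepted:
        value = max(already_accepted)[1]

    # Round trip 2: accept/agree.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

# Five acceptors (say, replicas in five datacenters); three form a majority.
acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="write-A"))   # 'write-A' is chosen
print(propose(acceptors, n=2, value="write-B"))   # still 'write-A': once chosen, it sticks
```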
Miscellaneous
Discussion
A few things I wondered about during the talk. Did they ever consider a distributed MVCC approach? That might be interesting and wasn't addressed as an option. Clearly at Google scale an in-memory data grid isn't yet appropriate.

A preference for the strong consistency model was repeatedly specified as a major design goal because this makes the job of the programmer easier. A counter to this is that the programming model for Google App Engine is already very difficult. The limits and the lack of traditional relational database programming features put a lot of responsibility back on the programmer to write a scalable app. I wonder if giving up strong consistency would have been such a big deal in comparison?
I really appreciated the evaluation matrix and the discussion of why Google App Engine made the choices they did. Since writes are already slow on Google App Engine they didn't have a lot of headroom to absorb more layers of coordination. These are the kinds of things developers talk about in design meetings, but they usually don't make it outside the cubicle or conference room walls. I can just hear Ryan, with voice raised, saying "Why can't we have it all!" But it never seems we can have everything. Many thanks to Ryan for sharing.
Reader Comments (7)
Extensive use of the word multihoming with a definition I've never heard before... Multihoming traditionally means to connect to more than one upstream network as is typically the case with ISPs and Hosting companies using BGP to connect to multiple IP traffic providers to improve performance and reliability.
multihoming is not limited to the scope of connecting to the internet via multiple NSP/ISP's
multihoming can be defined simply by saying an element has multiple connections to 2 disparate resources.
thus, a server can be multihomed to 2 network(s)/switches via 2 nics
now, i wouldn't necessarily have used the term multihoming in the context used above, i think "distributed" would be a better choice of words...
Great post and great video! I usually find short videos and general information, but 1 post like this and 1 hour of information is fantastic!
It's important to mention that the use of multiple datacenters has basically 2 different purposes: throughput and backup. And I believe they are inversely proportional.
In this article, the two goals were not distinguished.
For example, there are banks and financial institutions (such as Nasdaq) that build datacenters 4 miles apart from each other to meet the Basel Accord (only for backup purposes), and all the data needs to be replicated synchronously. Google, however, mainly focuses on throughput.
I believe it is important to focus on one main objective (backup || throughput) when you begin to build a distributed architecture across multiple datacenters.
Neither of the links given at the start of the blog work now.
Bit rot is a serious disease impacting every corner of the Internet. There is no cure.
It would be great to assess these trade-offs and the performance of these approaches in today's world, 13+ years later. Alternatively, are there any useful articles that paint the scenario from 2022 onwards?