« Paper: A Co-Relational Model of Data for Large Shared Data Banks | Main | Scaling Social Ecommerce Architecture Case study »
Wednesday
Apr062011

Netflix: Run Consistency Checkers All the time to Fixup Transactions

You might have consistency problems if you have: multiple datastores in multiple datacenters, without distributed transactions, and with the ability to alternately execute out of each datacenter;  syncing protocols that can fail or sync stale data; distributed clients that cache data and then write old back to the central store; a NoSQL database that doesn't have transactions between updates of multiple related key-value records; application level integrity checks; client driven optimistic locking.

Sounds a lot like many evolving, loosely coupled, autonomous, distributed systems these days. How do you solve these consistency problems? Siddharth "Sid" Anand of Netflix talks about how they solved theirs in his excellent presentation, NoSQL @ Netflix : Part 1, given to a packed crowd at a Cloud Computing Meetup

You might be inclined to say how silly it is to have these problems in the first place, but just hold on. See if you might share some of their problems, before getting all judgy:

  • Netflix is in the process of moving an existing Oracle database from their own datacenter into the Amazon cloud. As part of that process the webscale data, the data that is proportional to user traffic and thus needs to scale, has been put in NoSQL databases like SimpleDB and Cassandra, in the cloud. Complex relational data still resides in the Oracle database. So they have a bidirectional sync problem. Data must sync between the two systems that reside in different datacenters. So there are all the usual latency, failure and timing problems that can lead to inconsistencies. Eventual consistency is the rule here. Since they are dealing with movie data and not financial data, the world won't end.
  • Users using multiple devices also leads to consistency issues. This is a surprising problem and one we may see more of as people flow seamlessly between various devices and expect their state to flow with them. If you are watching a video on your Xbox, pause it, then start watching the video on your desktop, should the movie start from where you last paused it? Ideally yes, and that's what Netflix tries to do. But think for a moment all the problems that can occur. All these distributed systems are not in a transactional context. The movie save points are all stored centrally, but since each device operates independently, they can be inconsistent. Independent systems usually cache data which means they can write stale data back to the central database which then propagates to the other devices. Or changes can happen from multiple devices at the same time. Or changes can be lost and a device will have old data. Or a device can contact a different datacenter than another device and get a different value. For a user this appears like Netflix can't pause and resume correctly, which is truish, but it's more tricky than that. The world never stands still long enough to get the right answer.
  • One of the features NoSQL databases have dropped is integrity checking. So all constraints must be implemented at the application layer. Applications can mess up and that can leave your data inconsistent.
  • Anther feature of NoSQL databases is they generally don't have transactions on multiple records. So if you update one record that has a key pointing to another record that will be updated in a different transaction, those can get out of sync. 
  • Syncing protocols are subject to the same failures of any other programs. Machines can go down, packets can be dropped or reordered. When that happens your data might be out of sync.

How does Netlix handle these consistency issues?

  • Optimistic locking. One approach Netflix uses to address the consistency issues is optimistic locking. Every write has a timestamp. The newest time stamp wins, which may or may not always be correct, but given the types of data involved, it's usually good enough. NTP is used to keep all the nodes timed synced.
  • Consistency checkers. The heavy hitter strategy they use to bring their system back into a consistent state are consistency checkers that run continuously in the background. I've used this approach very effectively in situations where events that are used to synchronize state can be dropped. So what you do is build applications that are in charge of reaching out to the various parts of the system and making them agree. They make the world stand still long enough to come to an agreement on some periodic basis. Notice they are not trying to be accurate, the ability to be accurate has been lost, but what's important is that all the different systems agree. If, for example, a user moves a movie from the second position on their queue to the third position on one device and that change hasn't propagated correctly, what matters is that all the systems eventually agree on where the item is in the queue, it doesn't matter so much what the real position is as long as they agree and the user sees a consistent experience. Dangling references can be checked an repaired. Data transforms from failed upgrades can be fixed. Any problems from devices that use different protocol can be fixed. Any order that was approved that may not now have the inventory must be addressed. Any possible incosistency must be coded for, checked, and compensated for. Another cool feature that can be tucked into consistency checkers are aggregation operations, like calculating counts, leader boards, totals, that sort of thing. 

Loosely coupled, autonomous. distributed systems are complex beasts that are eventually consistent by nature. Netflix is on the vanguard of innovation here. They have extreme scale, they are transitioning into a cloud system, and they have multiple independent devices that must cooperate. It's great for them to have shared their experiences and how they are tacking their problems with us.

The Problem of Time in Autonomous Systems

One thing this article has brought up for me again is how we have punted on the problem of time. It's too hard to keep clocks in sync so we don't even bother. Vector clocks are the standard technique of deciding which version of data to keep, but in an open, distributed, autonomous system, not all nodes can or will want to participate in this vector clock paradigm.

We actually do have an independent measure that can be used to put an order on events. It's called time. What any device can do is put a very high precision timestamp on data. Maybe it's time to tackle the problem of time again?

Reader Comments (5)

Actually IEEE1588v2 solves the time synchronization to a good extent.

April 6, 2011 | Unregistered CommenterKrishna Sankar

Uh... the "vanguard" of innovation?

This book is from 2004:

http://www.amazon.com/Java-Transaction-Processing-Design-Implementation/dp/013035290X/ref=sr_1_1?ie=UTF8&qid=1302199288&sr=8-1

And was mostly talking about well established heuristics for dealing with transactions.

Why do we keep acting liking doing the obvious is somehow "innovation"?

I started building my own systems this way in '02, and recently worked for a firm that has been using this architecture in an ERP system since the early '90s.

April 7, 2011 | Unregistered CommenterSam

@Sam
I think the innovation with the Netflix case is the scale *and* their 100% embracement of the cloud. Nobody is saying Netflix invented consistency checkers at the application layer...
So Is the system you build in 2002 performing on the same scale as Netflix? Is the system you build in 2002 fully migrating from a private RDB based system to a *public* noSQL cloud based approach?
More important where are all the posts, presentations, white papers about the system you build in 2002? I guess the system you built in 2002 is probably top secret hence you can't reveal your identity either.

April 12, 2011 | Unregistered CommenterJaime Gago

ah, the harsh realities of living in the public cloud (where the MTBF is actually magnitudes higher!) and using relaxed consistency strategies.

there are many web properties out there solving much harder problems with full ACID at much larger scale. of course, they aren't buzzword compliant like netflix is. unless they have some very, very unique pricing from amazon their hosting bill has probably increase by magnitudes too:)

it's pretty sad that "innovation" is being touted as a result of their poor decision making to be with the other cool kids in the fog..

April 15, 2011 | Unregistered CommenterPeter

Peter, I completely agree with you. Netflix was given an amazing deal by new CDN provider Level3, and I promise they are not paying standard rates for Amazon cloud services. Noone in thier right mind could afford that at thier scale.

We do SaaS, in two datacenters, 600Mbps traffic, only 12 cabinets, and under 10TB of data - the cost for AM cloud using thier caculators was nearly double.... yet my boss hears the buzzwords and just "knows" our future is the cloud.

April 15, 2011 | Unregistered CommenterChris

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>