« Scale Indefinitely on S3 With These Secrets of the S3 Masters | Main | Stuff The Internet Says On Scalability For March 2, 2012 »
Tuesday
Mar062012

Ask For Forgiveness Programming - Or How We'll Program 1000 Cores

The argument for a massively multicore future is now familiar: while clock speeds have leveled off, device density is increasing, so the future is cheap chips with hundreds and thousands of cores. That’s the inexorable logic behind our multicore future.

The unsolved question that lurks deep in the dark part of a programmer’s mind is: how on earth are we to program these things? For problems that aren’t embarrassingly parallel, we really have no idea. IBM Research’s David Ungar has an idea. And it’s radical in the extreme...

Grace Hopper once advised “It's easier to ask for forgiveness than it is to get permission.” I wonder if she had any idea that her strategy for dealing with human bureaucracy would the same strategy David Ungar thinks will help us tame  the technological bureaucracy of 1000+ core systems?

You may recognize David as the co-creator of the Self programming language, inspiration for the HotSpot technology in the JVM and the prototype model used by Javascript. He’s also the innovator behind using cartoon animation techniques to build user interfaces. Now he’s applying that same creative zeal to solving the multicore problem.

During a talk on his research, Everything You Know (about Parallel Programming) Is Wrong! A Wild Screed about the Future, he called his approach “anti-lock or “race and repair” because the core idea is that the only way we’re going to be able to program the new multicore chips of the future is to sidestep Amdhal’s Law and program without serialization, without locks, embracing non-determinism. Without locks calculations will obviously be wrong, but correct answers can be approached over time using techniques like fresheners:

A thread that, instead of responding to user requests, repeatedly selects a cached value according to some strategy, and recomputes that value from its inputs, in case the value had been inconsistent. Experimentation with a prototype showed that on a 16-core system with a 50/50 split between workers and fresheners, fewer than 2% of the queries would return an answer that had been stale for at least eight mean query times. These results suggest that tolerance of inconsistency can be an effective strategy in circumventing Amdahl’s law.


During his talk David mentioned that he’s trying  to find a better name than “anti-lock or “race and repair” for this line of thinking. Throwing my hat into the name game, I want to call it Ask For Forgiveness Programming (AFFP), based on the idea that using locks is “asking for permission” programming, so not using locks along with fresheners is really “asking for forgiveness.” I think it works, but it’s just a thought.

No Shared Lock Goes Unpunished

Amdahl’s Law is used to understand why simply having more cores won’t save us for a large class of problems. The idea is that any program is made up of a serial fraction and a parallel fraction. More cores only helps you with the parallel portion. If an operation takes 10 seconds, for example, and one second of it is serial, then having infinitely many cores will only help you make the parallelizable part faster, the serial code will always take  one second. Amdahl says you can never go faster than that 10%. As long as your code has a serial portion it’s impossible to go faster.

Jakob Engblom recounts a similar line of thought in his blog:

They also had the opportunity to test their solution [for parallel Erlang] on a Tilera 64-core machines. This mercilessly exposed any scalability limitations in their system, and proved the conventional wisdom that going beyond 10+ cores is quite different from scaling from 1 to 8… The two key lessons they learned was that no shared lock goes unpunished, and data has to be distributed as well as code.

It seems that for all big system parallelization efforts turn into a hunt for locks and the splitting up of code and data into units that can run in parallel without having to synchronize. The “upper limit” of this process is clearly a system with no synchronization points at all.

Sherlock Holmes says that when you have eliminated the impossible, whatever remains, however improbable, must be the truth, so the truth is: removing serialization is the only way to use all these cores. Since synchronization is the core serial component to applications, we must get rid of synchronization.

A Transaction View Isn’t as Strong as it Once Was

Getting rid of locks/synchronization may not be the radical notion it once was. Developments over the last few years have conditioned us to deal with more ambiguity, at least for distributed systems.

Through many discussions about CAP, we’ve come to accept a non-transactional view of databases as a way to preserve availability in the face of partitions. Even if a read doesn’t return the last write, we know it will eventually and that some merge logic will make them consistent once again.

The idea of compensating transactions as a way around the performance problems due to distributed coordination has also become more familiar.

Another related idea comes from the realm of optimistic concurrency control that lets multiple transactions proceed in parallel without locking and then checks at commit time if a conflict requires a rollback. The application then gets to try again.

And for some time now memcache has supported a compare-and-set operation that allows multiple clients to avoid writing over each other by comparing time stamps.

As relaxed as these methods are, they all still require that old enemy: synchronization.

Your Answers are Already Wrong

The core difficulty with abandoning synchronization is coming to terms with the notion that results may not be correct at all times. It’s a certainty we love certainty, so even considering we could get wrong answers for any window of time is heresy.

David says we need to change our way of thinking. The idea is to get results that are “good enough, soon enough.” Get wrong answers quickly, but that are still right enough to be useful. Perfect is the enemy of the good and perfect answers simply take too long at scale.

David emphasises repeatedly that there’s a fundamental trade-off between correctness and performance. Without locks operations happen out of order, which gives the wrong answer. Race conditions happen. To get correct answers we effectively add in delays, making one thing wait for another, which kills performance, and we don’t want to kill performance, we want to use all those cores.

But consider an aspect of working on distributed systems that people don’t like to think about: your answers are already always wrong.

Unless you are querying a read-only corpus or use a global lock, in distributed systems any answer to any query is potentially wrong, always. This is the hardest idea to get into your brain when working in distributed systems with many cores that experience simultaneous updates. The world never stops. It’s always in flux. You can never assume for a single moment there’s a stable frame of reference. The answer from a query you just made could be wrong in the next instant. A query to see if a host is up, for example, can’t ever be assumed right. That host maybe up or down in the next instant and your code won’t know about it until it finds a conflict.

So are we really that far away from accepting that all answers to queries could be wrong?

Lessons from Nature

One strategy for dealing with many cores is to move towards biological models instead of mathematical models, where complicated behaviours emerge without global determinism. Bird flocks, for example, emerge from three simple rules: avoid crowding, steer towards average heading of neighbors,  steer towards average position of neighbors. No pi-calculus required, it works without synchronization or coordination. Each bird is essentially its own thread, it simply looks around and makes local decisions. This is a more cellular automaton view of the world.

Race and Repair - Mitigating Wrong Results

The idea is that errors created by data races won’t be prevented, they will be repaired. A calculation made without locks will be wrong under concurrent updates. So why not use some of our many cores to calculate the right answers in the background, in parallel, and update the values? This approach:

  • Uses no synchronization.
  • Tolerates some wrong answers.
  • Probabilistically fixes the answers over time.

Some obvious questions: how many background threads do you need? What order should values be recalculated? And how wrong will your answers be?

To figure this out an experiment was run and described in Inconsistency Robustness for Scalability in Interactive Concurrent‑Update In-Memory MOLAP Cubes, which test updates on a complicated spreadsheet.

With locking the results were correct, but scalability was limited. Without locking, results were usually wrong. Both results are as might be expected. And when they added freshener threads they found:

Combining the frequency with the duration data suggests that in a 16-core system with a 50/50 split between workers and fresheners, fewer than 2% of the queries would return an answer that had been stale for at least eight mean query times.

I found this result quite surprising. I would have expected the answers to be wrong more of the time. Could this actually work?

Breadcrumbs

The simple idea of Race and Repair is open to a lot of clever innovations. Breadcrumbs are one such innovation that attempts to be smarter about which values need recalculating. Meta-data is attached on a value indicating that a entity is recalculating the value or has changed a dependent value such that this value is now out of date. Any entity that might want to use this data can wait until a “valid” value is calculated and/or not to insert a calculated value if it is out of data. This narrows the window of time in which errors are introduced. It’s a mitigation.


There are endless variations of this. I can imagine remembering calculations that used a value and then publishing updated values so those calculations can be rerun. The result is a roiling  event driven sea that is constantly bubbling with updates that are trying to bring values towards correctness, but probably never quite getting there.

Probabilistic Data Structures

Another area David has researched are hashtables that can be inserted into without synchronization. This would allow entries to be added to a hashtable from any number of cores without slowing down the entire system with a lock. Naive insertion into a hashtable in parallel will mess up pointers, which can either result in the loss of values or the insertion of values, but it’s possible to work around these issues and create a lock free hashtable.

His presentation goes into a lot of detail on how this might work. He rejects light-weight locking schmes like CAS because these are still a choke point and there’s a penalty for atomic instructions under contention. They won’t scale.

He thinks there’s a big research opportunity in probabilistic data structures that work without synchronization and that work with mitigation.

Is this the Future?

This is just a light weight introduction. For more details please read all the papers and watch all the videos. But I think it’s important to talk about and think about how we might make use of all these cores for traditional programming tasks, though the result may be anything but traditional.

A bad experience on one project makes me somewhat skeptical that human nature will ever be comfortable in accepting wrong answers as the norm. My experience report was on an event driven system with a large number of nodes that could generate events so fast that events had to be dropped. Imagine a city undergoing an earthquake and a huge sensor net spewing out change events to tell the world what’s going on in real-time.

The original requirement was events could never be dropped, ever, which made a certain amount of sense when the number of sensors was small. As the number of sensors expands it’s simply not possible. My proposal was to drop events and have background processes query the actual sensors in the background so that the database would be synced over time. Very much like the proposals here.

It was a no go. A vehement no go. Titanic arguments ensued.  Management and everyone on down simply could not accept the idea that their models would be out of sync with the sensors (which of course they were anyway). My eventual solution took a year to implement and radically changed everything, but that was simpler than trying to convince people to deal with uncertainty.

So there might be some convincing to do.

Related Articles

Reader Comments (15)

It is hard enough to get people to believe that RAM can be persistent storage via K-safety, trying to convince them that correctness is a trade-off is going to encounter violent opposition from mediocre minds :)

Infobright has ROUGH QUERY's, which return very fast answer-ranges (the ranges are correct) and this pretty brilliant idea lacks traction because people are wary of using non exact values, they dont trust "correct enough".

Eventually consistent is other-wise-known-as Never Consistent (in a bad way).

David Ungar is correct framing correctness & parallel performance as a trade-off, and Todd is right, in that people rely on computers for EXACT answers and FEAR lack of total correctness.

IMHO, we are all taught from the beginning to program computers as if everything is correct and exact and it is a great simplification and in the end it is a lie: not even system clocks are really correct & exact (let alone distributed clocks :)

Another flaw in debates on this topic is that preachers of correctness are widely believed to be more correct :)

Getting the industry as a whole to start even considering this trade-off is only going to be accomplished by examples that require the strategy (google search is a good example) and then there will be decades spent trying to figure out how to do this right (we may be in the middle of these decades).

March 6, 2012 | Unregistered CommenterRussell Sullivan

Certainly, new programming paradigms (and languages) will be required. Personally, I think that rather than having inconsistent, "mushy" shared memory (that basically undermines everything programmers have come to expect when they think of memory), explicit (and well defined) message passing is probably the way to go.

March 6, 2012 | Unregistered CommenterRon

If I wanted fast, not-quite-accurate results, I'd use my brain :P

March 6, 2012 | Unregistered CommenterShish

You might want to look at LoseThos Operating System with it's master/slave multicore API. It does not support GPUs but supports multicore updates of the screen with no locks, so it causes imperfections on the screen.

Here's a multicore flight sim: Eagle Dive

March 6, 2012 | Unregistered CommenterTerry A. Davis

I'm interested in the sequential processing you mention. Can you give an example of a "serial" algorithm which is CPU bound but not easily parallelizable?

March 6, 2012 | Unregistered CommenterTaylor

There are parallels (sorry) between this level of multicore programming and the distributed computing world, at least as far as the idea of inconsistent data is concerned. Your naming idea of 'Ask For Forgiveness Programming' is also very similar to 'Apology based computing' or 'Apology oriented computing', which was coined, I believe, by Pat Helland from his distributed architecture work at Amazon (see Memories, Guesses, and Apologies). While he spoke about real life apologies, to real customers, the concepts have similar roots.

March 7, 2012 | Unregistered CommenterSimon Munro

Todd, the idea of CA and emergent global order is very compelling and seductive, but a fundamental error of the notion is the implicit 'gift' of physical reality and physical phenomena (such as sight): A flock of birds that need to send IP packets to 'see' and 'measure' qualities in individual birds or the fock is I/O bound.

The notion of relaxing determinism and reaching for probablistic heuristic to beat the O() gods is hardly a radical notion due to David Ungar (with all due respect). The very notion of a hash function is precisely about this insight. Temporal locality heuristic of your basic malloc is another example.

I am nearly certain at this point that the near future is a mixed bag of custom solutions and various niche 'golden hammers'.

alphazero's law: In any system composed of inteconnected discreet units, one can only have 2 of {Correctness, Timeliness, Efficiency}

March 7, 2012 | Unregistered Commenteralphazero

alphazero,

Software has senses via messages that can be quite rich. And if you gossip those messages over a local region and incrementally decreasing radius out, it would seem to emulate the flocking scenario quite well, without a lot of overhead.

And I don't quite see how your examples are equivalent as they don't really deal with interdependent calculations.

March 7, 2012 | Registered CommenterHighScalability Team

Todd, of course one can gossip; I merely point out here that unbounded linear scaling via messaging is basically the root of this whole matter. (Agree that in a rack/dc, broadcast based flocking could work.)

High level conceptual affinities:

A hash trades determinism of a structured index (e.g. binary tree) to beat the O(log(N)) time bounds of deterministic structure ops (or space bounds of a static structure e.g. array). It does this by sacrificing space efficiency. It is arguable if adding n keys to hash (such as Cockoo) are interdependent or not.

March 7, 2012 | Unregistered Commenteralphazero

As you say perhaps this is more natural for those dealing in distributed systems, certainly I've used related techniques both in concurrent and distributed cases.

I noted there was a strong bias towards viewing these approaches to yield right and wrong answers which is interesting because when considered from the CAP perspective, we're talking about information that could be seen as "stale" or out of date instead. Either way, it's a form of approximation.

So I guess I'm with Russell Sullivan on this, it's all about mindset and how many people can cope with it.

March 8, 2012 | Unregistered CommenterDan Creswell

"If I wanted fast, not-quite-accurate results, I'd use my brain :P
March 6, 2012 | Unregistered CommenterShish"

You have used your brain all your life. They work quite good, dont't they? :)

March 20, 2012 | Unregistered Commenterbrainsss

This naive "lets just throw more cores" approach seems like a really un-intelligent way of trying to tackle the root problem of scalability/power.

Of course it will help in certain areas (i.e. Embarrassingly parallel work load).

But at the root, when you have a set of event processes that requires other events to complete before they can be completed. This can never be computed in parallel.

Other kinds of optimisations need to be found. Smarter ways such as pre-computation, partial calculation etc..

Software I think is a mountain of mess, a huge layer of 1,000 abstraction. We need to perhaps take a step back?

April 13, 2012 | Unregistered CommenterThe IT Ninja

This is very interesting, but of course some applications are more tolerant of stale results than are others. One really must ask what is worse, a system which doesn't work, or a system which seems to work? David Ungar seems to be advocating the latter.

April 17, 2012 | Unregistered CommenterCarl Gundel

This might work on a very low-level, but at a higher level it is a terrible idea that has been tried before.

Repeating calculations for the same input assumes two things:
1. That the input is small enough to buffer.
2. That the operation is purely computational and has no non-repeatable side-effects such as modifying files on disk or communicating with external systems.

Secondly, there is no way you can guarantee that this system will produce answers that are "good enough" without being able to measure the accuracy. Producing answers that contain an error range less than 5% might be acceptable for some use-cases but not others. Producing answers without having a clue of the error range is just wishful thinking. No one in their right mind should accept this loosy-goosey system. You cannot build abstractions on top of a house of cards.

Just my 2 cents.
Gili

January 1, 2015 | Unregistered CommenterGili

I used to hate threads and approaching multiple cores..
No mater how "perfect" my design was.. There was always someone at the end of the line that found a way to enter in a deadlock..

My perfect solution is never to trust anyone with "my code" and use structures that will hide all the "threads and parallel" code.

Instead of using a "List" I use a custom "ThreadSafeMultiCoreList".. and so on..

I find this concept so much better.. Yes I lose 10-15% of my power.. but I gain a happy life for all my team and projects..

My conclusion is that Threads and Cores must be kept "locked" inside private rooms.. never to be shown to .. anyone.

Never Give a "pointer" and always give a reference in form of an unique key or unique Index.
Never delete or remove an item from your lists but always mark I as "unused" and register your list to be periodically checked .. to someone else..
Avoid passing objects.. the object .. and better pass a clone or a range or a key..

MultiCore is not an exact science and I didn't yet grasp.. even at 10%.. so please excuse my lack of.. certitudes :)

January 2, 2015 | Unregistered Commenterneglewis

PostPost a New Comment

Enter your information below to add a new comment.
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>