Wednesday, August 5, 2015

How do you program a computer with 10 terabytes of RAM?

How do you program a computer with 10 terabytes of RAM in a single address space? When the great Adrian Cockcroft was interviewed for an Enterprise Initiatives episode, that was one of the answers he gave to the question “What’s the next big thing?”

Adrian says we are already taking big machines and running tiny little containers on them. He thinks another interesting workload is huge memory systems. Building computers with many terabytes of main memory will soon be affordable. We already know the JVM has problems garbage collecting on machines with tens of gigabytes of RAM. What about machines with terabytes of RAM? We don’t really have the programming models worked out yet. It may be that garbage collected languages won't make the cut.

Sounds like a good idea for a post, right? Here’s the problem: I found surprisingly little on huge memory systems. If you have any ideas on good sources, please leave a comment. Here’s some of what I did find…

SGI’s 64TB Computer

You may chuckle at the idea of a computer with 10TB of RAM. Well, one already exists. It’s from SGI. The SGI UV (Ultra Violet) 3000 scales from 4 to 256 CPU sockets with up to 64TB of shared memory as a single system. It’s aimed squarely at the enterprise space; SGI is out of the cloud business.

There’s a good thread on Hacker News that talks about some of the problems that occur when using so much RAM:

  • Cost. No pricing is given, but high-end RAM is not cheap. Guesses put the price in the $4M range, though that’s pure conjecture. StackExchange has an interesting thread, If RAM is cheap, why don't we load everything to RAM and run it from there?, that goes more into the cost aspects. Hard disks are still a lot cheaper than RAM.

  • Memory latency. Though from the programmer’s perspective the memory will look like a flat address space, under the hood memory accesses will go through a network that’s not all that different from a networked cluster of machines. Martin Thompson talks about this problem with the counterintuitive idea that memory is not truly random access. Even if your memory looks flat, you still have to worry about the penalty for accessing non-local RAM. Still, it’s faster than going to disk. (A small NUMA sketch after this list illustrates the effect.)

  • Corruption/Consistency. The UV runs a single system image, which means in practice it looks like a big desktop computer. You have huge potential problems with buggy code scribbling all over memory. You also have a problem with programs that crash. If you are doing all your computing in RAM you can’t rely on restarting a process and reading everything back from disk to fix all your problems. Crashes mean inconsistency unless you code specifically around it.

  • Problem to solve. You don’t want to use such a machine to run a single database instance. And you need a problem that isn’t just as easily solved with a cluster. The star use case seems to be as an appliance for SAP HANA, an in-memory, column-oriented, relational database management system, which presumably benefits from not running its own cache coherence protocols over the network and has some way to partition work in a CPU-aware and NUMA-aware manner.

  • Power. Big-iron uses a lot of power.

  • Address space exhaustion. From userbinator: Now we have 64TB, which is 2^46, which means there's only 18 "unused" bits of address left - 256K. If you could connect only(!) 262,144 of these machines together and present the memory on them as one big unit, you would have exhausted the 64-bit address space. That is what I think is really incredible. What's next, 128-bit addresses? Or maybe we'll realise that segmented address spaces (e.g. something like 96-bit, split as 32:64) are naturally more suited to the locality of NUMA than flat ones?
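
To make the memory-latency bullet concrete, here is a minimal sketch in C using Linux’s libnuma (nothing SGI-specific): it allocates one buffer on the local NUMA node and one on the highest-numbered node, then times a walk over each. The node choices and buffer sizes are illustrative only.

```c
/* Rough illustration of non-uniform memory access: allocate a buffer on
 * the local NUMA node and another on a (presumably) remote node, then
 * time a simple walk over each.  Assumes Linux with libnuma installed;
 * compile with: gcc numa_walk.c -o numa_walk -lnuma
 */
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)   /* 256 MB per buffer */

static double walk(volatile uint8_t *buf, size_t len) {
    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += 64)   /* touch one cache line at a time */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int local  = numa_node_of_cpu(0);   /* node that CPU 0 belongs to */
    int remote = numa_max_node();       /* highest-numbered node */

    uint8_t *near_buf = numa_alloc_onnode(BUF_SIZE, local);
    uint8_t *far_buf  = numa_alloc_onnode(BUF_SIZE, remote);
    if (!near_buf || !far_buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    printf("walk over node %d memory: %.3f s\n", local,  walk(near_buf, BUF_SIZE));
    printf("walk over node %d memory: %.3f s\n", remote, walk(far_buf,  BUF_SIZE));

    numa_free(near_buf, BUF_SIZE);
    numa_free(far_buf,  BUF_SIZE);
    return 0;
}
```

On a two-socket box the difference is modest; on a 256-socket UV-class machine the gap between best and worst case is what makes “flat” memory not really flat.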

What about NVRAM?

Intel and Micron have announced 3D XPoint memory, which is 10x denser than conventional memory (DRAM), 1000x faster than NAND (flash), and has 1000x better endurance than NAND, which wears out.

Could something like this be the basis for our new huge memory systems? Dave Farley, in The Next Big Thing?, says not yet: 3D XPoint memory is still 10 times slower than DRAM.

And even if it were as fast as DRAM, we still don’t know how to use it all yet. It would be a shame to use it merely as a faster disk. Can’t we think of a way to really reconceptualize the entire computer platform to natively take advantage of large non-volatile memory stores?
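
One concrete alternative to “just a faster disk” is direct access: map the non-volatile media straight into the address space and operate on it with ordinary loads and stores, skipping read()/write() and the page cache. A minimal C sketch; the /mnt/pmem path and file name are hypothetical and assume a DAX-style NVM mount.

```c
/* Sketch: map a file on a (hypothetical) DAX-mounted NVM device directly
 * into the address space and update it with ordinary stores -- no read()
 * or write() calls, no buffer cache in the way.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1UL << 30)   /* 1 GB persistent region */

int main(void) {
    int fd = open("/mnt/pmem/counters", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) != 0) {
        perror("open/ftruncate");
        return 1;
    }

    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Data structures live directly in the mapping; a store is a "write". */
    strcpy(region, "hello, persistent world");

    /* msync is the portable way to ask for durability; on a DAX mapping
     * it amounts to flushing CPU cache lines rather than doing disk I/O. */
    msync(region, 4096, MS_SYNC);

    munmap(region, REGION_SIZE);
    close(fd);
    return 0;
}
```

The interesting part is what’s missing: there is no serialization step and no buffer management. The in-memory data structure is the persistent format.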

That’s what HP is trying to do.

HP’s Memory-Driven Computing

Memristors were going to change everything. Not so much. Memristors are now off the roadmap.

What do you do when your key differentiating technology flops? You pivot and rebrand. That’s what HP is doing.

HP is now pitching the vision of Memory-Driven Computing. If you look at The Machine, which was a top-to-bottom rebuild of a computer system around memristors, the key ideas of the system don’t actually require memristors to work. The idea at its core was to rebuild the system around huge memory spaces. And that’s still possible. You can use RAM or any of the new memory technologies that are on the horizon. HP’s innovative photonic interconnect technology still makes it possible to have a very low-latency network between memory and CPU. So The Machine is still possible, just differently possible.

Kirk Bresniker, HP Labs Chief Architect and HP Fellow, makes HP’s case:

The Machine will fuse memory and storage, flatten complex data hierarchies, bring processing closer to the data, embed security control points throughout the hardware and software stacks, and enable management and assurance of the system at scale. It may seem counter-intuitive, but by concentrating on massive pools of non-volatile memory, we expect to spur innovation in computation by allowing many different models of computation to work on the same massive data sets. Quantum, deep neural net, carbon-nanotube, non-linear analog – all of these models could be working in concert, connected to petabytes and exabytes of information derived from a world of intelligent devices.

What does the revamped Machine look like? Kimberly Keeton, a researcher at HP Labs, in Reimagining systems and application software for The Machine, gives a very nice overview of how they are rethinking their system in the context of huge memory systems. So far it’s the most serious source of thought leadership I’ve found on the topic. There’s more detail in the talk, but here are a few highlights.

HP is taking what it calls a Shared Something approach to its computer architecture. Shared Everything is a big computer running a single operating system. Shared Nothing is a cluster of individual computers connected by a network, with each node running its own OS.

Shared Something is a middle ground. The Machine has a shared pool of non-volatile memory, but individual compute nodes run their own OS and communicate via shared memory. So shared memory might be back.

The idea is that their photonic interconnect technology will give the shared memory both low latency and consistent latency. HP is creating a specialized version of Linux, reworked to optimally handle global non-volatile shared memory management. A lot of overhead can be removed once you obliterate the assumption of slow media access times. The video goes into more detail on all the changes they are making to optimize the OS for this new computing paradigm. HP predicts, for example, that a database will perform 100x better.

You may not recall, but at one time shared memory was how different processes on Unix machines talked to each other. When sockets and TCP/UDP arrived that all changed, and shared memory became something only high performance computing people did. Now shared memory may be back. Which makes sense: there’s no reason to shuttle data around when it’s accessible in a flat address space. This insight has big implications up and down the entire stack.
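
For anyone who missed that era, the mechanism still exists in every Unix. Here is a minimal POSIX shared-memory sketch; the region name /demo_region is made up, and real code would add semaphores or atomics for synchronization.

```c
/* Two cooperating processes attach to the same named region and exchange
 * data without sockets -- the model The Machine extends from one box to a
 * shared pool of fabric-attached memory.  Link with -lrt on older glibc.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_region"   /* made-up name for this example */
#define SHM_SIZE 4096

int main(int argc, char **argv) {
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SHM_SIZE) != 0) {
        perror("shm_open/ftruncate");
        return 1;
    }
    char *shared = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (argc > 1 && strcmp(argv[1], "writer") == 0) {
        strcpy(shared, "message left in shared memory");   /* producer */
    } else {
        printf("reader sees: %s\n", shared);                /* consumer */
    }

    munmap(shared, SHM_SIZE);
    close(fd);
    return 0;
}
```

Run it once with the argument writer and again with no argument to read the message back. The Shared Something model essentially stretches this idea from one box to a pool shared by many nodes.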

Let’s say you want to move beyond the disk-based paradigm and just keep everything in memory. How do you handle consistency? The potential consistency problems that occur when a program crashes in a global shared memory system are dealt with by a system HP calls Atlas.

The idea is that a program differentiates between persistent and transient data. The persistent data lives in a persistent region, which is mapped into the process’s address space. The programmer writes the same multithreaded code they would normally write. The runtime system automatically annotates the code to do the logging and cache line flushes necessary to maintain consistency after a crash. I couldn’t tell if this is a type of software transactional memory.
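
Atlas automates this, so the sketch below is not its API. It is a hand-rolled illustration of the underlying pattern such a runtime generates: write the data, flush the affected cache lines, fence, and persist a commit flag last. It assumes x86 intrinsics and a persistent region mapped as in the earlier mmap example.

```c
/* Hand-rolled illustration of the write -> flush -> fence -> commit
 * pattern a system like Atlas generates automatically.  Assumes x86 and
 * that `slot` points into an NVM-backed mapping; this is NOT the actual
 * Atlas API.
 */
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

/* Flush every cache line covering [addr, addr+len) out to the media. */
static void persist(const void *addr, size_t len) {
    const char *p   = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1));
    const char *end = (const char *)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush(p);
    _mm_sfence();   /* keep later stores from being reordered ahead of the flushes */
}

struct record {
    char     payload[56];
    uint64_t valid;     /* commit flag, written and persisted last */
};

void durable_update(struct record *slot, const char *msg) {
    slot->valid = 0;                       /* invalidate before rewriting */
    persist(&slot->valid, sizeof slot->valid);

    strncpy(slot->payload, msg, sizeof slot->payload - 1);
    slot->payload[sizeof slot->payload - 1] = '\0';
    persist(slot->payload, sizeof slot->payload);

    slot->valid = 1;                       /* commit: after a crash, readers
                                              trust the slot only if valid==1 */
    persist(&slot->valid, sizeof slot->valid);
}
```

Because the valid flag is persisted only after the payload, a reader recovering from a crash never trusts a torn, half-written record; a full undo or redo log, which is what Atlas maintains, is needed if you also want the previous value back.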

HP has a pretty audacious vision. And they are putting a lot of talent and money behind the effort. A lot of original thinking is going into making a high margin proprietary machine. Just like the good old days. Will this work? Or will commodity/open source computing win in the end?

Wrapping Up

Huge memory systems are on the way. How are we going to program them? If you have any ideas feel free to make a comment. Your insight will be appreciated.


Reader Comments (12)

One byte at a time...

August 5, 2015 | Unregistered CommenterJoshua

32 GB of RAM can be considered the optimal level for a home PC. Surely general PC users won't be interested in NVRAM. I mean, it's not necessary to have a giant amount of memory that can't be utilized by most general users. Even if you are a gamer or an architect, 16 GB of RAM on a 1600 MHz bus is enough for you.

August 5, 2015 | Unregistered CommenterJohn Adam

"What's next, 128-bit addresses?"

Quite likely. AS/400 (or whatever IBM calls it this year) has used 128-bit (virtual) pointers for more than 25 years.

August 5, 2015 | Unregistered CommenterTim

That's interesting, Tim. I didn't know that. Every pointer could be an IPv6 address :-)

August 6, 2015 | Registered CommenterHighScalability Team

The article's statement about memory latency is key. RAM is storage that happens to be easily accessed. Thanks to virtual memory it has always been possible to build a virtual machine where extensive memory space is disk-backed. Back in the 1990s people were awed when virtual machines had gigabytes of addressable memory; today it is terabytes, and soon we could be talking petabytes or exabytes of addressable memory. That memory's performance will be wildly variable, ranging from a fraction of a nanosecond for data in the CPU cache to milliseconds or longer for data that is disk-backed or network-backed. They may all look like addresses, but not all memory access is equal.

Memory latency has been a performance concern for many decades. L1 cache is faster than L2 or L3 cache, which are faster than main memory, which is faster than remote memory or disk drives. All of them are storage space; it is only their access speed that varies. Locality of data and access times are important if you care about performance.

Just because you have 10TB of memory does not mean your program will keep the CPU busy; those 10TB have slow access times. Faster than a spinning disk, of course, but slow enough to cost millions or billions of CPU cycles waiting for data.

August 6, 2015 | Unregistered CommenterBryan

Disclaimer: I have read the present post but none of the links

HP's vision is interesting, but there are two very important factors here: concurrency and the cohabitation of multiple workloads.

Concurrency tells us that throwing in more memory is not enough: it also has to come with more cores and more caches... to avoid clogging the buses. And there is a performance penalty!

Workloads are more subtle: although any kind of task would benefit from the new architecture, more classical (= cheaper) architectures might fit many workloads well too. For example, a full scan of a dataset is served well by fast mechanical disks. Interestingly, HP is pushing the next generation of Hadoop data lakes: specialized nodes (some with more cores, others with more RAM, some with GPUs...) interconnected with high-speed networks. The key here is not raw power but cost efficiency when facing a given pattern of workloads.

That said, huge flat memory would be a big improvement for reliability and simplicity. I would dream of a world without distributed systems programming :)

August 6, 2015 | Unregistered CommenterThomas

I really enjoyed reading this post. Thanks for it.

I hope that "garbage collected languages will make the cut" and we are working on it. In partnership with the Java R&D at Oracle, we evaluated a large Java application on a SPARC M6-32 server with 16TB of memory. That will be the subject of a couple JavaOne sessions later this year (https://goo.gl/nYVe9q)

August 7, 2015 | Unregistered CommenterAntoine

Whoa, Antoine's comment is interesting.

On GC, a few things stand out. One, you usually get more CPU as you get more RAM--either you physically need the sockets to address the DIMMs or you just want lots of CPU to keep a balanced machine. Two, mostly-background collection seems to work--Go 1.5 recently produced some shiny stories (a 75x reduction in pause time in a big app), but there's also concurrent mark-sweep in Java, Azul C4, etc. Three, if you're using terabytes, often you're talking about (or can adapt things so you're talking about) huge blobs of bytes, which are less work for a collector than tangles of pointers.

I'm not sure that that really means much about the larger issues around building for silly-huge heaps, or that you can even get very far with the question until you start focusing on one specific application or another for all that RAM. But it's one interesting thing anyhow.

August 8, 2015 | Unregistered CommenterRandall

I'm not sure about the internals of the latest processors, but my understanding is that even though memory is measured in GBs, it is still read a byte at a time.
I believe the first thing to change is these 8-bit byte reads. Why read only 8 bits at a time from memory when we all use Unicode?

August 12, 2015 | Unregistered CommenterJoy

Have you looked at kdb+ from Kx Systems? It is an in-memory, column-oriented database technology that has been around for about 23 years in the finance industry. There are already numerous applications using this technology that could make use of 10TB+ RAM machines.

September 24, 2015 | Unregistered CommenterAndyC

Not sure what Joy was saying about fetching bytes. IBM and all large systems have been fetching and processing half words (16 bits), words, and double words for years.

November 29, 2016 | Unregistered CommenterJohn R

Not sure what John R was saying about fetching bytes. Intel x86 has been fetching at least 64 bits at a time since the Pentium (~25 years ago), and CPU registers in PCs have been 64 bits wide for at least a decade, allowing processing of multiple bytes. But sometimes all you need is a byte, and fortunately CPU architectures can accommodate this.

October 10, 2017 | Unregistered CommenterPandyD
