There's a deep similarity between how long-running systems like our brains and our computers accumulate errors and repair themselves.
Reboot it. Isn’t that the common treatment for most computer ailments? And have you noticed that now that your iPhone supports background processing it reboots a lot more often? Your DVR, phone, computer, router, car, and an untold number of other long-running computer systems all suffer from a nasty problem: over time they accumulate flaws and die or go crazy.
Now think about your brain. It’s a long-running program running on very complex and error-prone hardware. How does your brain keep itself sane over time? The answer may be found in something we spend a third of our lives doing: sleep.
There’s new research out on how our brains are cleansed during sleep that has some interesting parallels to how we keep long-running hardware-software systems up and running properly. This is a fun topic. Let’s explore it a little more.
One of the most frustrating system tests for a computer system is to just let it sit idle day after day and check that it doesn’t reboot, leak memory, or fail in some surprising never-thought-that-would-happen sort of way. Systems are never really idle on the inside. Stuff is always happening. Interrupts are being served, timers are firing, all kinds of metrics are being collected and sent, connections are being kept up, protocols are being serviced. Just like when you sleep, an apparently idle computer can be quite active.
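To give a concrete flavor of that kind of test, here is a minimal idle soak test sketch. It assumes the psutil package and a hypothetical target_pid; the duration and thresholds are illustrative, not from any real test plan:

```python
# Idle soak test sketch: let a supposedly idle process run for days and flag
# crashes or slow memory growth. target_pid, duration, and thresholds are
# illustrative placeholders.
import time
import psutil

def idle_soak_test(target_pid, days=3, sample_secs=60,
                   max_growth_bytes=50 * 1024 * 1024):
    proc = psutil.Process(target_pid)
    baseline = proc.memory_info().rss
    deadline = time.time() + days * 24 * 3600
    while time.time() < deadline:
        time.sleep(sample_secs)
        if not proc.is_running():
            raise AssertionError("process died while 'idle'")
        growth = proc.memory_info().rss - baseline
        if growth > max_growth_bytes:
            raise AssertionError(f"suspected leak: RSS grew by {growth} bytes")
    print("soak test passed: no crash, no runaway memory growth")
```

A real soak test would also watch file descriptors, thread counts, and log growth, but the shape is the same: do nothing and verify that nothing bad accumulates.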
Flaws in complex hardware-software systems come in many kinds: memory leaks, garbage accumulation, out-of-memory conditions, CPU starvation, data structures like lists that grow larger and larger over time, causing memory problems and increased latency that testing probably never covered, improper freeing of objects, stack corruption, deadlocks, pointer corruption, memory fragmentation, priority inversion, timer delays, timers and protocols inside a box and across nodes that tend to synchronize and blow your latency budgets, cascading failures as work stalls across a system and timeouts kick in, missed hardware interrupts, counter overflows, calendar problems like not handling daylight saving time correctly, upgrades that eat up memory or leave a system in a strange state it has never experienced before, buggy upgrades that convert data structures improperly, improper or new configuration, hardware errors, errors from remote services, unknown and unhandled errors, applications that transition between states incorrectly, logs that aren’t truncated properly, and rarely used code paths exercised in hard-to-test states that end in potentially disastrous ways.
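To make one of those failure modes concrete, here is a toy illustration (mine, not from any real system) of the ever-growing data structure problem: a list that is only ever appended to, so memory use and lookup latency creep up with uptime:

```python
# Toy illustration of unbounded growth: a "recent events" list that nothing
# ever trims. Lookups scan everything ever recorded, so latency and memory
# creep up the longer the system runs.
import time

recent_events = []  # grows forever; nobody thought to cap or expire it

def record_event(event):
    recent_events.append((time.time(), event))

def find_events(predicate):
    # O(n) over all history; fine in a short test run, painful after weeks of uptime
    return [e for _, e in recent_events if predicate(e)]

# The fix is boring and easy to forget: cap the structure or expire old entries.
MAX_EVENTS = 10_000

def record_event_capped(event):
    recent_events.append((time.time(), event))
    if len(recent_events) > MAX_EVENTS:
        del recent_events[: len(recent_events) - MAX_EVENTS]
```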
Just the general accumulation of strange little inconsistency errors over time can make it look like your system has dementia. This is why you want to do a cold reboot. Power off, let all the hardware go dark, leave nothing in memory, reload the OS, reload applications, and just start over. Rebooting is like grace for computers.
How our biology works is necessarily different. Maiken Nedergaard, who led the study Sleep Drives Metabolite Clearance from the Adult Brain, thinks she may have discovered the long-sought-after secret of why sleep is crucial for all living organisms:
We sleep to clean our brains. Through a series of experiments on mice, the researchers showed that during sleep, cerebral spinal fluid is pumped around the brain, and flushes out waste products like a biological dishwasher. The process helps to remove the molecular detritus that brain cells churn out as part of their natural activity, along with toxic proteins that can lead to dementia when they build up in the brain.
Without access to spinal fluid or nicely chunked pieces of garbage like molecules, here are some common tactics for removing software detritus from a system:
Simplicity. Create a system so simple none of these issues apply. This isn’t a good option for a human brain or usually for software. Modern systems are complex. They just are.
Random Reboot. A venerable option. The system is down only for the time it takes to restart, and it fixes a myriad of different problems. Some systems periodically reboot on purpose just to get a fresh start. It can also clear out any active security breaches. In a cloud architecture this is a good design option because you have the resources available to take over for the rebooting node. A minimal periodic-reboot sketch appears after this list of tactics.
Primary-secondary Failover. In high availability designs it’s common to have an active primary and a passive secondary ready to take over just in case the primary fails. The secondary is usually in some sort of simplified state so that it doesn’t suffer from all the degrading influences experienced by the primary. One strategy is to have the primary periodically fail over to the secondary so that the system reinitializes, removing many of the flaws that have accumulated over time. Service is offline during this period, and the period is proportional to the amount of code, state, and computation needed to transition the secondary to the primary state. That isn't to say the system is completely down for all functionality; there's almost always some low level protection and consistency code running regardless of what state the system is in. The period can be minimized by keeping the secondary closer to live state, but then the secondary will probably suffer the same flaws as the primary, which is counterproductive. As failover is a highly error-prone process, this strategy doesn’t work so well in practice.
Background consistency agents. Run software in the background that looks for flaws and then repairs them; a sketch of such a scrubber loop appears after this list. This can lead to local inconsistencies of service, but over time the system cleans itself to the degree it can. When things get complex at scale, errors are just a fact of life. Might as well embrace that fact and use it in your designs to make them more robust.
Partial shutdown. Shut down certain lower priority services when metrics show higher priority services are degraded (a load shedding sketch appears after this list). This hopefully frees up resources, but freeing up resources is always a buggy as hell process.
Whole Cluster failover. The cloud has made an entirely new form of sanity possible: whole cluster failover. Often used for software upgrades, it also works as a way of cleaning out software detritus. If you have a 100 node cluster, for example, spin up another hundred nodes and then move processing over to the newly constructed cluster; a blue/green-style rotation sketch appears after this list. Any faults accumulated on the old hardware will magically disappear.
Hardening. Put hardware on a replacement schedule to reduce its probability of failure. Upgrade software in a bug-patch-only mode to make it more robust to problems found in the field. This is an expensive approach, but it hardens and bulletproofs a system over time, at least until requirements change and the whole thing is blown away. This is an area where biologics have an advantage: their mandate never changes (within their ecological niche), while human systems are changing all the time, often at deep levels. So for a biologic, techniques like cell repair and immune system responses are quite powerful, but they aren’t enough for human systems.
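For the random reboot tactic, here is a minimal sketch of a supervisor that restarts a service on a jittered schedule so an entire fleet doesn't reboot in lockstep. The service command and intervals are placeholders, not a real deployment recipe:

```python
# Periodic-reboot supervisor sketch: restart a long-running service on a
# jittered schedule so accumulated flaws get wiped and nodes don't all
# restart at the same moment. The command and intervals are placeholders.
import random
import subprocess
import time

SERVICE_CMD = ["/usr/local/bin/my-service"]   # hypothetical service binary
BASE_INTERVAL = 24 * 3600                     # aim for roughly one restart a day
JITTER = 4 * 3600                             # spread restarts across the fleet

while True:
    proc = subprocess.Popen(SERVICE_CMD)
    lifetime = BASE_INTERVAL + random.uniform(-JITTER, JITTER)
    try:
        proc.wait(timeout=lifetime)           # service died on its own: just restart it
    except subprocess.TimeoutExpired:
        proc.terminate()                      # planned restart: ask nicely first
        try:
            proc.wait(timeout=30)
        except subprocess.TimeoutExpired:
            proc.kill()                       # then not so nicely
```

In a cloud setup you would drain traffic from the node before the planned restart, but the basic shape is the same.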
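The background consistency agent tactic is essentially a scrubber loop: sweep the data, check invariants, repair what you can, and log what you can't. A minimal sketch, where the store interface and the detect/repair callbacks are hypothetical stand-ins:

```python
# Background consistency agent sketch: periodically sweep a store, verify
# invariants, and repair what can be repaired. The store interface and the
# (detect, repair) callbacks are hypothetical stand-ins.
import logging
import threading
import time

log = logging.getLogger("scrubber")

def start_scrubber(store, checks, interval_secs=300):
    """checks is a list of (detect, repair) pairs operating on one record."""
    def loop():
        while True:
            for key in list(store.keys()):   # snapshot keys; the store may change under us
                record = store.get(key)
                for detect, repair in checks:
                    if detect(record):
                        try:
                            store.put(key, repair(record))
                            log.info("repaired %s", key)
                        except Exception:
                            log.exception("could not repair %s", key)
            time.sleep(interval_secs)
    threading.Thread(target=loop, daemon=True, name="scrubber").start()
```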
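Partial shutdown usually takes the form of load shedding: watch a health metric for the high priority path, turn lower priority services off when it degrades, and bring them back when it recovers. A sketch with assumed metric and service interfaces:

```python
# Partial-shutdown (load shedding) sketch: stop low-priority services when
# the high-priority path degrades, restart them when it recovers. The metric
# source and service handles are hypothetical; thresholds are illustrative.
import time

def shed_loop(get_p99_latency_ms, low_priority_services,
              degrade_ms=500, recover_ms=200, poll_secs=10):
    # low_priority_services is ordered from highest to lowest priority,
    # so pop() sheds the least important service first.
    shed = []
    while True:
        latency = get_p99_latency_ms()
        if latency > degrade_ms and low_priority_services:
            svc = low_priority_services.pop()
            svc.stop()
            shed.append(svc)
        elif latency < recover_ms and shed:
            svc = shed.pop()                 # restore in reverse order of shedding
            svc.start()
            low_priority_services.append(svc)
        time.sleep(poll_secs)
```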
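And whole cluster failover is essentially blue/green deployment at the cluster level: build a fresh cluster, warm it up, shift traffic over, and tear the old one down. A sketch assuming a hypothetical cloud provisioning and traffic API, not any real SDK:

```python
# Whole-cluster failover sketch: replace a running cluster with a freshly
# provisioned one so accumulated software detritus disappears with the old
# nodes. `cloud` is a hypothetical provisioning/traffic API, not a real SDK.
def rotate_cluster(cloud, old_cluster, size=100):
    new_cluster = cloud.provision_cluster(size=size)
    cloud.wait_until_healthy(new_cluster)

    # Shift traffic gradually so a bad build doesn't take everything out at once.
    for percent in (1, 10, 50, 100):
        cloud.route_traffic(new_cluster, percent=percent)
        cloud.wait_for_stable_error_rates(new_cluster)

    cloud.decommission(old_cluster)   # old flaws leave with the old nodes
    return new_cluster
```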
When you think about it, Mother Nature had a tough design task that in a biological system could only be solved by something as potentially anti-survival as sleep. But the sleep period has been a great hook into which to insert other services, like memory consolidation, something that would be difficult to do in an always-active system.
In many ways we can now design systems that are more robust than Mother Nature, though certainly not in the same space for the same power budget. Not yet at least.