On rebooting: the unreasonable effectiveness of turning computers off and on again

From first principles

Turn a misbehaving computer off and on, or stop a misbehaving program and then start it again. Often, the problem goes away.

Most users don’t think hard about this, and accept it as just another inscrutable fact about computers.

However, as you learn more about how computers work, I suspect that you start feeling uncomfortable about never outgrowing this seemingly hacky and arbitrary fix. Professional engineers working for the most celebrated technology companies on Earth are sometimes reduced to blindly rebooting everything from their personal workstation to hundred-node distributed systems clusters. Is this the best that anyone can do?

Well, I offer the following argument that restarting from the initial state is a deeply principled technique for repairing a stateful system — whether that system is a program, or an entire computer, or a collection of computers.

Before a computing system starts running, it’s in a fixed initial state. At startup, it executes its initialization sequence, which transitions the system from the initial state to a useful working state:

(init_0)
   |
   v
(init_1)
   |
   v
 [...]
   |
   v
(init_N)
   |
   v
 (w_0)

This initialization sequence has been executed many times during the development, testing, and operation of the system. It is therefore likely to be reliable: that is, the transitions from the initial state to the working state occur with very high cumulative reliability. And this is not accidental: it stems from fundamental characteristics of the engineering process that built the initialization sequence.
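In code, this initialization sequence is typically a fixed pipeline of steps, run the same way on every startup, which is exactly why it accumulates so much testing. Here is a minimal sketch in Python; the step names (load_config, open_database, start_server) are hypothetical placeholders rather than any particular system’s API:

    def load_config():
        return {"db_path": "app.db", "port": 8080}

    def open_database(config):
        return f"connection to {config['db_path']}"  # stand-in for a real handle

    def start_server(config, db):
        return f"server on port {config['port']} using {db}"

    def initialize():
        # init_0 -> init_1 -> ... -> init_N -> w_0
        # Any failure here halts startup; there is no ad-hoc recovery path.
        config = load_config()
        db = open_database(config)
        return start_server(config, db)  # w_0: the initial working state

    if __name__ == "__main__":
        print(initialize())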

As the system runs correctly, it transitions from its initial working state to other well-behaved states:

(init_0)
   |
   v
.------------------.
| (w_0) <--> [...] |
|   ^          ^   |
|   |          |   | (working states)
|   v          v   |
| [...] <--> (w_n) |
'------------------'

However, when the system encounters a defect, it leaves the set of working states and enters a broken state:

(init_0)
   |
   v
.------------------.
| (w_0) <--> [...] |
|   ^          ^   |
|   |          |   | (working states)
|   v          v   |
| [...] <--> (w_n) |
'--------------+---'
               |
               v
            (BROKEN)

By definition, this broken state is unexpected; if it had been anticipated and handled, the transition would simply have ended in another working state.

At this point, any attempt to bring your system back directly from the broken state into a working state is improvisational. We are no longer like the classically trained violist from Juilliard performing a Mozart sonata after rehearsing it a thousand times; we are now playing jazz. And in the engineering of reliable systems, we do not want our systems to improvise.

So, what should we do to fix the system?

Turn it off, and turn it on again. Anything else is less principled.

This is the basic insight behind the philosophy of crash-only software, a.k.a. recovery-oriented computing.
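As a minimal sketch of that pattern (my own illustration, not code from the crash-only literature): a component that finds itself in a broken state exits with a nonzero status instead of improvising a repair, and a supervisor reruns the well-rehearsed initialization sequence:

    import subprocess
    import sys
    import time

    def supervise(argv):
        while True:
            result = subprocess.run(argv)   # every restart replays the full init sequence
            if result.returncode == 0:
                return                      # clean shutdown: stop supervising
            print(f"child exited with {result.returncode}; restarting",
                  file=sys.stderr)
            time.sleep(1)                   # naive fixed delay; see the crash-loop discussion below

    if __name__ == "__main__":
        supervise(sys.argv[1:])             # e.g.: python supervise.py python worker.py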

Complications

Granularity

If you were paying attention, you may have noticed some sleight of hand in the above reasoning. I glossed over the distinction between two different ways of resetting a system: rebooting a computer, and restarting a program.

As often happens with a crack in a simple story, if you pry at it, a great chasm of complication opens up.

Restarting a program, as you well know from experience, is sometimes not enough to fix its misbehavior, because errant state can live elsewhere in the computer. Some bad state survives even a system reboot: if the program executable is corrupted on disk, no amount of rebooting will save you. And if your hardware is damaged badly enough, even wiping the disk and reinstalling your operating system won’t help.

And yet, of course, we do not throw out our computers and buy new ones every time a program does something wrong. So the story of system repair is one of “turning it off and on again” at various layers of abstraction. At each layer, we hope that we can purge the corruption by discarding some compartmentalized state, and replacing it with a known start state, from which we can enter a highly reliable reinitialization sequence that ends in a working state.
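One way to picture this layering is as an escalation ladder: each rung discards a larger compartment of state and replays a larger initialization sequence. A rough sketch, with hypothetical placeholder functions throughout:

    def restart_process():
        ...  # discard the program's in-memory state

    def reboot_machine():
        ...  # discard all in-memory state on the host

    def reimage_machine():
        ...  # discard on-disk state: wipe the disk, reinstall the OS and executables

    def replace_hardware():
        ...  # discard the physical machine itself

    REMEDIATIONS = [restart_process, reboot_machine, reimage_machine, replace_hardware]

    def repair(is_healthy):
        for remediation in REMEDIATIONS:
            remediation()
            if is_healthy():
                return True   # the corruption was confined to the state this layer discards
        return False          # out of layers: escalate to a human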

(There seem to be certain analogies here between computing systems and biological ones. Your body is composed of trillions of compartmentalized cells, most of which are programmed to die after a while, partly because this prevents their DNA from accumulating enough mutations to start misbehaving in serious ways. Our body even sends its own agents to destroy misbehaving cells that have neglected to destroy themselves; sometimes you just gotta kill dash nine.)

Local crashes and global equilibria

So, resetting a single component’s state is sometimes insufficient to repair the system as a whole. We can go further: sometimes resetting a component can actually exacerbate the problem.

Consider, for example, the following scenario:

  • A process P performs certain queries against a shared backend when it starts up, but not in routine operation.
  • P contains a latent defect that, under certain conditions, is encountered with high probability in a short interval after startup.
  • P contains assertions which catch the defect and crash.

What will happen when we encounter the conditions that trigger the defect? P will crash-loop, and every time it crashes, it will fire off its startup queries. Since the shared backend receives these queries relatively infrequently in ordinary operation, it may not be prepared for this load, and it may fall over. This is especially likely if the startup queries are expensive and there are many replicas of P.

Oops! Your beautiful crash-only error handling strategy has nudged your system into a new equilibrium where the backend is continually receiving too much load. A local defect has been amplified into a global system outage. Even if you remove the crashing defect, the flood of retrying startup queries may persist as a metastable failure mode of your system.

As with most software problems, there are ways to deal with the particular scenario outlined here (for example: stochastically delay restart timing after a crash, or add circuit breakers for the query load, or cache the startup query results so that they can be reused across restarts, or…). But the particular example is less important than the general insight that restarting a localized part of the system cannot be a silver bullet for reliability problems.
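To make the first of those mitigations concrete: restart delays are commonly randomized with exponential backoff and jitter, so that a crash-looping fleet of replicas does not replay its startup queries in lockstep. A minimal sketch; the constants are arbitrary:

    import random

    def restart_delay(attempt, base=1.0, cap=300.0):
        # Exponential backoff capped at `cap` seconds, with "full jitter": the
        # actual delay is drawn uniformly from [0, backoff], spreading the
        # replicas' startup queries out in time.
        backoff = min(cap, base * (2 ** attempt))
        return random.uniform(0, backoff)

    # Example: delays a supervisor might sleep between successive restarts.
    for attempt in range(6):
        print(f"attempt {attempt}: sleep {restart_delay(attempt):.1f}s before restarting")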

Crashiness is a healthy part of a balanced diet in reliable system engineering. But you should still think about what happens when you crash.

Forensic analysis vs. repair

The discussion above focuses on how to bring a broken system back into a working state. But, hopefully, you plan to continue building and operating your system for the foreseeable future, not just today.

In an ideal world, you will have designed your system for observability, and it will already have produced enough durable evidence to figure out what happened and fix the defect later. Here in the real world, the picture is often less complete. Depending on the urgency of the fix, you should consider pausing to gather forensic evidence before executing the reboot.
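For a single Python process, even a few lines of evidence capture before the restart go a long way. A sketch; the dump path and the idea of calling this from your crash handler are assumptions of mine, not a prescription:

    import faulthandler
    import os
    import sys
    import time

    def dump_forensics(reason):
        # Write a timestamped dump of every thread's stack before exiting and
        # letting the supervisor restart the process.
        path = f"/tmp/forensics-{os.getpid()}-{int(time.time())}.txt"
        with open(path, "w") as f:
            f.write(f"reason: {reason}\n")
            f.flush()  # keep ordering with the fd-level dump below
            faulthandler.dump_traceback(file=f, all_threads=True)
        print(f"wrote forensic dump to {path}", file=sys.stderr)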

(If you’re a computer science researcher looking for a good ambitious problem, consider figuring out how to instrument multi-process and multi-computer distributed systems to support post hoc reconstruction of state at arbitrary points in time, at overheads low enough to be used in production systems. Yes, I know about rr. It’s amazing! But I think it’s not quite at the point where most companies would be comfortable running literally all their production processes under it, and multi-tier systems are outside its current scope.)

The parable of Mike and the login shell

Once, a student named Mike wondered whether it was better for programs to be written so that

  1. each function would be strict about its preconditions, checking its inputs and crashing immediately with an assertion failure if a precondition was violated; or
  2. each function would be permissive about its preconditions, checking its inputs where necessary, but repairing erroneous inputs and proceeding as best it could.

So, he wrote two Unix shells: one in the strict style, and one in the permissive style.
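Concretely, the difference between the two styles might look something like this; the cd-like helper is a hypothetical illustration, not Mike’s actual code:

    def change_directory_strict(state, path):
        # Style 1: a violated precondition crashes immediately, pointing
        # straight at the defect in the caller.
        assert isinstance(path, str) and path != "", f"bad path: {path!r}"
        state["cwd"] = path

    def change_directory_permissive(state, path):
        # Style 2: repair the bad input and limp onward; the caller's defect
        # survives, and shows up later as a mysteriously wrong directory.
        if not isinstance(path, str) or path == "":
            path = "/"
        state["cwd"] = path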

The shell written in the strict style would crash, at first. Mike was fearless enough to use his work-in-progress as his login shell; crashing was incredibly inconvenient, as it would log him out of the machine completely. Nevertheless, he persisted; he found and fixed defects at a rapid rate, and soon enough the shell became a usable and useful tool.

The shell written in the permissive style also had defects. But he was never able to find and fix enough of them to make it usable. Eventually he gave up on this shell.

He concluded that it was better for most programs to be written in a strict and crashing style. Even when crashing was incredibly inconvenient, it made errors so much easier to diagnose and fix that you ended up building better software.

Mike went on to become one of the eminent programmers of his generation, earning fame and fortune.

Meta

Acknowledgments

Thanks to my teammates on Airtable’s Performance & Architecture team for being a sounding board for these ideas, and encouraging me to write them up. (Consider joining us!)

Reactions

See this essay discussed elsewhere: