"Let It Crash" Programming

This past weekend I read Joe Armstrong's paper on the history of Erlang. Now, HOPL papers in general are like candy for me, and this one did not disappoint. There's more in this paper than I can cover in one post, so today I'm going to concentrate on one particular feature of Erlang highlighted by Armstrong.

Although Erlang is designed to encourage/facilitate a massively parallel programming style, its error handling may be even more noteworthy. Like everything else in Erlang, its error handling is designed to be distributed, and for good reason:
Error handling in Erlang is very different from error handling in conventional programming languages. The key observation here is to note that the error-handling mechanisms were designed for building fault-tolerant systems, and not merely for protecting from program exceptions. You cannot build a fault-tolerant system if you only have one computer. The minimal configuration for a fault tolerant system has two computers. These must be configured so that both observe each other. If one of the computers crashes, then the other computer must take over whatever the first computer was doing.

This means that the model for error handling is based on the idea of two computers that observe each other.

Erlang is famous for features that help programmers produce stable systems in the real world. Its shared-nothing architecture and its ability to hot-swap code are well known, but these features are available in other systems. The "links" feature, on the other hand, seems to be unique. When you create a process in Erlang, you can link it to another process; this link essentially means, "If that process crashes, I'd like to crash, too; and if I crash, that process should die as well." Here is Armstrong's description:
Links in Erlang are provided to control error propagation paths for errors between processes. An Erlang process will die if it evaluates illegal code, so, for example, if a process tries to divide by zero it will die. The basic model of error handling is to assume that some other process in the system will observe the death of the process and take appropriate corrective actions. But which process in the system should do this? If there are several thousand processes in the system then how do we know which process to inform when an error occurs? The answer is the linked process. If some process A evaluates the primitive link(B) then it becomes linked to [B]. If A dies then B is informed. If B dies then A is informed.

Using links, we can create sets of processes that are linked together. If these are normal processes, they will die immediately if they are linked to a process that dies with an error. The idea here is to create sets of processes such that if any process in the set dies, then they will all die. This mechanism provides the invariant that either all the processes in the set are alive or none of them are. This is very useful for programming error-recovery strategies in complex situations. As far as I know, no other programming language has anything remotely like this.
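The "all alive or none alive" invariant is easy to see in a few lines of Erlang. What follows is only a minimal sketch of my own, not code from Armstrong's paper: the module name `link_demo`, the function `run/0`, and the exit reason `boom` are made up for illustration. Process A links itself to B and then exits abnormally; the calling process uses monitors (a separate, one-way mechanism) purely to watch both of them go down.

```erlang
-module(link_demo).
-export([run/0]).

%% Hypothetical demo: start two processes, link them, crash one,
%% and report the reason each of them died with.
run() ->
    Parent = self(),
    %% B only waits for a message; it never exits on its own.
    B = spawn(fun() -> receive stop -> ok end end),
    %% A links itself to B, tells us the link is in place,
    %% then exits abnormally when asked to.
    A = spawn(fun() ->
                  link(B),
                  Parent ! linked,
                  receive crash -> exit(boom) end
              end),
    receive linked -> ok end,
    %% Monitors let this process observe the deaths without
    %% being linked to (and killed along with) A or B.
    RefA = erlang:monitor(process, A),
    RefB = erlang:monitor(process, B),
    A ! crash,
    ReasonA = receive {'DOWN', RefA, process, A, R1} -> R1 end,
    ReasonB = receive {'DOWN', RefB, process, B, R2} -> R2 end,
    {ReasonA, ReasonB}.
```

Calling `link_demo:run()` should return `{boom,boom}`: B never did anything wrong, but because it was linked to A and is a normal process, A's exit signal took it down with the same reason.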

In addition to simply killing the linked process, the link can also function as a kind of signal to a system process that a group of processes have died, so that appropriate action can be taken, such as restarting a process group.
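Here is a rough sketch of that second mode, again with made-up names (`restart_demo`, `worker/0`, `run/0`, the reason `boom`). A process that sets the `trap_exit` flag receives an `{'EXIT', Pid, Reason}` message when a linked process dies instead of dying itself, and can react by starting a replacement. Production systems would normally lean on OTP's supervisor behaviour rather than a hand-rolled loop like this one.

```erlang
-module(restart_demo).
-export([run/0]).

%% Hypothetical worker: sits idle until told to crash.
worker() ->
    receive
        crash -> exit(boom)
    end.

%% Trap exits, start a linked worker, crash it, and restart it once.
run() ->
    %% With trap_exit set, exit signals from linked processes
    %% arrive as ordinary messages instead of killing us.
    OldFlag = process_flag(trap_exit, true),
    W1 = spawn_link(fun worker/0),
    W1 ! crash,
    Result =
        receive
            {'EXIT', W1, Reason} ->
                %% The corrective action: start a replacement worker.
                W2 = spawn_link(fun worker/0),
                Alive = is_process_alive(W2),
                %% Tidy up so the demo leaves nothing running.
                unlink(W2),
                exit(W2, kill),
                {restarted, Reason, Alive}
        end,
    %% Restore the caller's original trap_exit setting.
    process_flag(trap_exit, OldFlag),
    Result.
```

Calling `restart_demo:run()` should return `{restarted,boom,true}`: the observer learned why the first worker died and had a fresh one running in its place.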

Like Armstrong, I cannot think of another system that works quite this way. The closest analogy I can think of is a distributed transaction. But distributed transactions have quite a bit more overhead, because they're all about providing serializable access to shared data, which Erlang just doesn't allow.

Armstrong says that the idea of links was inspired by the 'C wire' in early telephone exchanges:
The C wire went back to the exchange and through all the electromechanical relays involved in setting up a call. If anything went wrong, or if either partner terminated the call, then the C wire was grounded. Grounding the C wire caused a knock-on effect in the exchange that freed all resources connected to the C line.

Armstrong says that the links feature encourages a worker/supervisor style of programming which is "not possible in a single threaded language."


  • Guest
    Stephen Hoffman Tuesday, 20 May 2008

    The OpenVMS Distributed Lock Manager has offered analogous capabilities for many years, though the application itself is expected to fully participate in the cooperative processing.

    If the application or the host within the cluster crashes, then DLM frees up the (arbitrary) locks (formerly) held by the failed application (or failed host), and cooperating applications are then granted the locks and can detect the failure and can take failure-appropriate action.

    This application processing can range from a full restart to specific recovery processing, to transactional-style sequencing. It's all fully programmable.

    Further, cluster locks can easily be used akin to the C wire that was described; where acquiring a lock can cause all processes monitoring the lock to take an application-specific action.

    A block of arbitrary and application-specific information can also be passed along with the lock itself. This could be a next invoice number, or a count of crashes, or information related to the failure or the recovery and restart; this is an entirely arbitrary block of data.

  • Guest
    miguel rodriguez Tuesday, 20 May 2008

    For years I tried to build systems without failures, and I failed. The very day I accepted failure and worked on how to handle it, everything suddenly became stable and robust, because at least from the client's perspective it doesn't matter what is behind the wall. I know this mental model is not appreciated as it should be, but it will be in time; I have no doubt about it.

  • Guest

    [...] Erlang, but just reading the chapter and following the examples made it clear how the “let it crash” philosophy together with language support simplifies the process of writing robust [...]
