The 4 Keys to 100% Uptime with Node.js

On Tuesday, I presented at Nova Node on achieving 100% uptime with Node.js. [Update: video of the talk is now available.] This topic is near and dear to our hearts: giving our users the best experience means keeping Fluencia and SpanishDict running all the time, every day. This takes work, and over the past few months we've had opportunities to dive deep into error handling/recovery and zero-downtime deployment techniques.

We've found that the four keys to (nearly) 100% uptime with Node.js are:

  1. Sensibly handle uncaught exceptions.
  2. Use domains to catch and contain errors.
  3. Manage processes with cluster.
  4. Gracefully terminate connections.

1. Sensibly handle uncaught exceptions.

Uncaught exceptions happen when an Error is thrown outsisde of a try/catch block, or an error event is emitted and nothing is listening for it. Node's default action following an uncaught exception is to exit (crash) the process. If the process is a server, this leads to downtime, so we want to avoid uncaught exceptions as much as possible, and sensibly handle them when they do happen.

Avoid them can be accomplished by writing solid code and testing it thoroughly, and by using domains.

2. Use domains to catch and contain errors.

Domains can catch asynchronous errors—for example, error events—which try/catch cannot do.

var d = require('domain').create();

d.on('error', function (err) {
  console.log("domain caught", err);
});

var f = d.bind(function() {
  console.log(domain.active === d); // <-- data-preserve-html-node="true" true
  throw new Error("uh-oh");
});

setTimeout(f, 100);

The above example would crash the Node process if function f (which throws an error) were not called in the context of a domain. Instead, the domain's error handler is called, giving you a chance to take some action: retry, abort, ignore, or even rethrow. What to do is dependent on context, but the point is that you can choose, rather than having the process crash.

When code is running in the context of a domain—that is, a domain is "active"—then domain.active contains a reference to it. New EventEmitters will automatically bind to the active domain, saving you the hassle of manually listening to their error events (though there may still be good reason to do so—but if you miss one, it will automatically go to the domain error handler).

domain.add will add an existing EventEmitter to a domain. Domains have more methods than add and bind; the full list of is available in the docs.

3. Manage processes with cluster.

Domains alone are not enough to achieve continuous uptime. Sometimes an individual process will need to go down whether due to a rethrown error or to deployment. Using node's cluster module, we can spawn a master process that manages one or more worker processes, distributing work (for example, incoming connections from clients) among them. The master process need never stop running, meaning the socket always accepts connections, and when any particular worker stops working, the master can replace it.

We can send a signal to the master process when we want to initiate a reloading of workers. For example, to deploy without downtime, we might give our master a symlink that points to worker code, then later update that symlink to point to a new version of the code, and send a signal to the master process that causes it to shut down existing workers and fork new workers. The new workers will come up running the new code, and the master process keeps running the entire time!

Coordination between master and workers is important. Workers must communicate their state, and master must know what action to take based on that. We've found recluster to be a useful wrapper around Node's native cluster module that provides worker state management.

4. Gracefully terminate connections.

When workers/servers need to shutdown, they must do so gracefully, as they may be serving any number of in-flight requests, and also have open TCP connections to many more clients. Any call to process.exit should occur inside the callback to the server.close method, which only fires once a server has closed all its connections.

To close existing connections, we can add middleware that checks whether an app is in a shutting down state, and if so, causes connections to be closed ASAP:

var afterErrorHook = function(err) { // <-- data-preserve-html-node="true" called after an unrecoverable error
  app.set("isShuttingDown", true); // <-- data-preserve-html-node="true" set state
  server.close(function() {
    process.exit(1);  // <-- data-preserve-html-node="true" all clear to exit
  });
}

var shutdownMiddle = function(req, res, next) {
  if(app.get("isShuttingDown") {  // <-- data-preserve-html-node="true" check state
    req.connection.setTimeout(1);  // <-- data-preserve-html-node="true" kill keep-alive
  }
  next();
}

To ensure all servers shutdown eventually, we can also call process.exit inside of a timer.

Back to handling uncaught exceptions.

The Node docs say not to keep running after an uncaught exception, for good reason:

An unhandled exception means your application — and by extension node.js itself — is in an undefined state. Blindly resuming means anything could happen. You have been warned.

http://nodejs.org/api/process.html#processeventuncaughtexception

Now we have the pieces in place to sensibly handle uncaught exceptions:

  • Log error.
  • Server stops accepting new connections.
  • Worker tells cluster master it's done.
  • Master forks a replacement worker.
  • Worker exits gracefully when all connections are closed, or after timeout.

If the exception was triggered by a request, it is not possible to respond to that request because the uncaught exception lacks the context to determine which request it was. But we should be able to handle any other in-flight requests, minimizing effective downtime.

Remember that while parts of Node are very stable, and it is in production use with greate success by many, many companies (including us!), other parts such as domains and the cluster module are still classified as unstable or experimental, and may change in the future.

This was a quick summary of what I covered in the talk. The slides below contain additional example code and various tips and other considerations. Hopefully they will help you to achieve (nearly) 100% uptime with your Node application!