Node.js: Run the Heartbeat in a Child Process

A Node.js process is, as I'm sure you're aware, essentially a single thread of execution for your JavaScript code. It keeps things running along smartly by using the downtime spent waiting for I/O operations to complete to execute more JavaScript code. That it is a single thread means that you don't have to wrestle with fun things like concurrency and locking, which are thorny enough topics to have merited the invention of numerous architectures that really only exist to let developers avoid encountering concurrency and locking in day-to-day coding. The flip side of this coin is that a busy Node.js process isn't all that good at doing things to a schedule when the schedule intervals are on the order of 100ms or less - unless the only thing it is doing is running that schedule.

So to pick a concrete example, my Thywill framework is basically Express, Socket.IO, and some code for defining applications, managing setup, and clustering processes. One of the cluster options uses Redis pub/sub to implement communication channels between cluster processes. One thing this code has to accomplish is letting cluster processes know when one of their number falls over. There are a few ways to go about doing this, but I decided to implement a pub/sub heartbeat and timeout check - if process A has heard a heartbeat from process B recently enough then it assumes process B is running, otherwise it assumes process B is down.

Issuing a heartbeat every 100ms runs something like this:

// Each process has its own ID.
var processId = "alpha";
var pubClient = require("redis").createClient();
setInterval(function () {
  pubClient.publish("heartbeat", processId);
}, 100);

Every process can listen for heartbeats from all the other processes:

// Keep track of timestamps.
var heartbeatTimestamps = {};
// Once subscribed, you can't use a Redis client for anything other than
// listening for published messages. So it has to be its own instance.
var subClient = require("redis").createClient();
subClient.subscribe("heartbeat");
subClient.on("message", function (channel, processId) {
  heartbeatTimestamps[processId] = Date.now();
});

Then check every so often to see if the last heartbeat received from a specific process is too old:

// Anything older than 300ms and we assume that means the process
// in question has died.
var timeout = 300;
setInterval(function () {
  var tooOld = Date.now() - timeout;
  for (var processId in heartbeatTimestamps) {
    if (heartbeatTimestamps[processId] < tooOld) {
      // Do something here - log, emit an event, send a notice, etc.
    }
  }
}, 100);

The trouble is that this doesn't really work so well in practice. The important thing to remember here is that setTimeout() and setInterval() are not guaranteed to execute on time - and they don't, given any sort of reasonable non-I/O processing load, let alone adding in computationally intensive tasks. With 100ms intervals and a 300ms timeout, there will be false positive timeouts even while running a simple, lightly loaded four member test cluster wherein the cluster members and Redis server are all on the same physical machine. Too great a length of time elapses between heartbeats, or between issuing a heartbeat and it arriving, or between sequential checks on heartbeat timestamps. That results from some combination of (a) other processing taking place in the Node.js process, (b) network and Redis delay (actually minimal as it turns out), and (c) sharing a Redis pub/sub channel and client with other messages.
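
The lateness is easy to see for yourself. The following is a minimal sketch, not taken from Thywill: it runs a repeating timer while a single simulated burst of synchronous work blocks the event loop, then reports the largest gap between ticks. The names largestGap() and measureTicks(), and the numbers in the usage comment, are invented for illustration.

```javascript
// Return the largest gap, in milliseconds, between consecutive tick times.
function largestGap(tickTimes) {
  var worst = 0;
  for (var i = 1; i < tickTimes.length; i++) {
    worst = Math.max(worst, tickTimes[i] - tickTimes[i - 1]);
  }
  return worst;
}

// Record when a repeating timer actually fires while one burst of
// synchronous work (busyMs long) blocks the event loop.
function measureTicks(intervalMs, tickCount, busyMs, done) {
  var tickTimes = [Date.now()];
  var timer = setInterval(function () {
    tickTimes.push(Date.now());
    if (tickTimes.length > tickCount) {
      clearInterval(timer);
      done(largestGap(tickTimes));
    }
  }, intervalMs);

  // Stand-in for real work: spin without yielding to the event loop.
  setTimeout(function () {
    var end = Date.now() + busyMs;
    while (Date.now() < end) {}
  }, intervalMs / 2);
}

// Example usage:
// measureTicks(100, 10, 250, function (worst) {
//   console.log("Largest gap between ticks: " + worst + "ms");
// });
```

With a 100ms interval and a 250ms burst of blocking work, the reported gap should come out well above the requested 100ms, since the timer callback cannot run until the blocking code returns.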

To get rid of these issues, the heartbeat functionality is first given its own channel and its own Redis client instances - that eliminates, for example, the possibility of heartbeats having to wait in the subscribed client for other, non-heartbeat messages to be processed first. Then we run the heartbeat in a forked child process to eliminate as much of the timing lag as possible. Launching the process runs as follows:

var heartbeatProcessArguments = {
  // Arguments to pass to the child process. e.g. Redis client options,
  // timeouts, intervals, etc.
};

// This is a lazy way to pass structured arguments to a child process
// without having to care about what they contain: convert to JSON,
// and then base 64 encode that string.
heartbeatProcessArguments = JSON.stringify(heartbeatProcessArguments);
heartbeatProcessArguments = Buffer.from(heartbeatProcessArguments, "utf8");
heartbeatProcessArguments = heartbeatProcessArguments.toString("base64");

var path = require("path");
var heartbeatProcess = require("child_process").fork(
  path.join(__dirname, "heartbeat.js"),
  [heartbeatProcessArguments],
  {
    // Pass over all of the environment.
    env: process.env,
    // Share stdout/stderr, so we can hear the inevitable errors.
    silent: false
  }
);

// Listen for messages from the heartbeat child process.
heartbeatProcess.on("message", function (message) {
  // Process the message - these will be notices that a cluster process
  // is down or has come back up.
});

// Make sure the child process dies along with this parent process; certainly
// necessary for Vows tests, possibly not so much under other circumstances.
process.on("exit", function () {
  heartbeatProcess.kill("SIGKILL");
});

// If the child process dies, then we have a problem.
heartbeatProcess.on("exit", function (code, signal) {
  // This is critical, so either restart the heartbeat or kill this process by
  // throwing an error. The latter is probably a better idea.
});

Then in the child process, we can read the provided base 64 encoded configuration as follows:

var config = Buffer.from(process.argv[2], "base64").toString("utf8");
config = JSON.parse(config);
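
As a quick sanity check, the two halves of this encoding can be written as a pair of helpers and round-tripped. This is a sketch only: encodeArguments() and decodeArguments() are names made up here, the example option values are placeholders, and Buffer.from() is used as the current replacement for the deprecated new Buffer() constructor.

```javascript
// Hypothetical helpers wrapping the JSON + base64 steps shown above.
function encodeArguments(args) {
  return Buffer.from(JSON.stringify(args), "utf8").toString("base64");
}

function decodeArguments(encoded) {
  return JSON.parse(Buffer.from(encoded, "base64").toString("utf8"));
}

// Example values only; the real options depend on your setup.
var example = {
  interval: 100,
  timeout: 300,
  redis: { host: "127.0.0.1", port: 6379 }
};
var encoded = encodeArguments(example);
var decoded = decodeArguments(encoded);
console.log(decoded.timeout); // prints 300
```

The standard base64 alphabet is limited to letters, digits, "+", "/", and "=", so the encoded argument never contains spaces or quotes, whatever the configuration holds.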

Passing messages back to the parent makes use of the process.send() function:

// Anything older than 300ms and we assume that means the process
// in question has died.
var timeout = 300;
setInterval(function () {
  var tooOld = Date.now() - timeout;
  for (var processId in heartbeatTimestamps) {
    if (heartbeatTimestamps[processId] < tooOld) {
      // Tell the parent process that a cluster process fell over.
      process.send({
        type: "processDown",
        processId: processId
      });
    }
  }
}, 100);
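
As written, the check above sends a processDown message on every pass for as long as a process stays silent. One way to report only changes of state, including a process coming back up, is sketched below. It is an illustration under assumptions rather than the Thywill implementation: createStatusTracker(), its method names, and the processUp message type are all invented here.

```javascript
// Hypothetical tracker that reports only transitions between up and down.
function createStatusTracker(timeout) {
  var heartbeatTimestamps = {};
  var down = {};

  return {
    // Call this from the subscribed client's "message" handler.
    heartbeat: function (processId, now) {
      heartbeatTimestamps[processId] = now;
    },
    // Call this on each check interval. Returns the messages to send.
    check: function (now) {
      var messages = [];
      var tooOld = now - timeout;
      for (var processId in heartbeatTimestamps) {
        var isDown = heartbeatTimestamps[processId] < tooOld;
        if (isDown && !down[processId]) {
          messages.push({ type: "processDown", processId: processId });
        } else if (!isDown && down[processId]) {
          messages.push({ type: "processUp", processId: processId });
        }
        down[processId] = isDown;
      }
      return messages;
    }
  };
}

// In the child process, the interval callback would then be something like:
// tracker.check(Date.now()).forEach(function (message) {
//   process.send(message);
// });
```

Passing the current time in as an argument, rather than calling Date.now() inside the tracker, keeps the logic easy to exercise in a test without waiting on real timers.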

And that is that in a nutshell. Now all you have to worry about is overall server load and network latency when selecting appropriate values for the heartbeat interval and timeout. You can see an actual implementation with the frills and the epicycles in the Thywill code for RedisCluster and RedisClusterHeartbeat.