Skip to content

Implementing Failover for Random Cronjobs with Zookeeper

Part of running a popular service on the internet is the never-ending search for SPOFs. Single points of failure are pretty easy to squish in the more well-understood parts of the stack; load balancers come in pairs and distribute queries to multiple web servers with health checks and so on. Inevitably, though, there are a few bits that creep in and are "unique".

One form of these special snowflakes is a job that you want to run periodically but would rather have only one copy of these jobs running at a time - every infrastructure I've seen winds up having a "batch host" of some sort. For example, every day we do some analysis on the previous day's logs and create some statistics about our service to keep ourselves honest. We want these statistics processed every day (even if a machine crashes) but it's silly to crunch all the data on two separate machines just to make sure one can crash without impacting the process.

It's easy to make sure two copies of a process don't run simultaneously on one machine; it's harder to coordinate between two (or more) machines to make sure that at least one of them takes care of the task. There are all sorts of fancy job processing systems out there, but most of them are overkill for something as small as a one-line cronjob. The most common solution to this problem is just to monitor the one server that does a bunch of these things more closely so that you can go fix a problem that comes up. Combine that with the ability to manually re-run past statistics, and you have a solution that works most of the time and doesn't cause too much extra work when it doesn't.

Still, though, especially in a virtualized environment, you want to be able to shut down any server at any time and have the whole service continue to run. Zookeeper provides a service you can use in your application to do distributed configuration, discovery, and locking, but it's primarily aimed at integration in a larger application. Dealing with the Zookeeper API is not the kind of thing you want to get into for a short cronjob or shell script. However, given that we want to use Zookeeper for other aspects of our infrastructure, it would be nice to be able to use it for simple things as well.

To make our own processes more resilient to failure, we wrote get_zk_lock, a small script that is intended to wrap cronjobs. The utility attempts to get a lock from Zookeeper and returns 0 if it succeeds, 1 if it fails. Most models for a zookeeper lock keep the connection to zookeeper open for the duration of the lock; get_zk_lock is more like leader election from that point of view. If many servers (or processes) all attempt to get the lock simultaneously, only one will succeed. Because the lock request is not a persistent processes, the definition of 'simultaneously' must extend beyond the runtime of the request itself; the default for get_zk_lock is ±30 seconds. (This value also means that the lower limit for the frequency of your job is 30s plus your clock drift.) Because the script exits with code 0 if it succeeded in getting the lock, it becomes very easy to wrap your cronjobs by putting get_zk_lock foo && before your protected cronjob.

ben@localhost:~$ get_zk_lock foo && echo "I got the lock"
I got the lock
ben@localhost:~$ dsh -g testcluster -c -M "/usr/local/bin/get_zk_lock bar && echo 'I got the lock'"
test16: I got the lock
ben@localhost:~$ dsh -g testcluster -c -M "/usr/local/bin/get_zk_lock bar && echo 'I got the lock'"
test9: I got the lock

Our first use case for this script was to eliminate the need to identify a single node that would perform Zookeeper backups. All of the Zookeeper servers in our cluster run an hourly job that cleans up old snapshots and transaction log files; one of them succeeds in getting the lock to tar up those files and store them on S3 as backups. With this in place, every Zookeeper node really is disposable; if one dies a different one will pick up the backups next time. Using this in the middle of a larger script is pretty simple:

# do stuff that happens everywhere
if /usr/local/bin/get_zk_lock zookeeper-backup
  # do stuff that happens only once

As with any distributed environment, you do have to keep your clocks synchronized for this approach to work.  The frequency of the job minus the adjustable 'locktime' threshold for get_zk_lock is the allowed variability of your clocks; with a default of 30 and a cronjob running every minute it means that across your cluster, your clocks must be no more than 30 seconds apart (though in reality, 30s is a HUGE variance). Changing the threshold allows you to whether it's more important that the job always runs (and sometimes twice) or if it's better to run only once or not at all. With a locktime of 50, if clocks drift by more than 10s, it will appear as though the lock is always taken and your job will never run. (Conversely, with a locktime of 10 and a drift of >10s, two hosts will both get a lock, running your job twice.)

Returning to my original example of jobs to do daily statistics, the cronjob entry to trigger the stats analysis changes from

0 15 * * * * bundle exec ruby lib/daily_stats.rb


0 15 * * * * /usr/local/bin/get_zk_lock daily_stats && bundle exec ruby lib/daily_stats.rb

With this change, it can be run on multiple servers, with the understanding that it will actually run on only one of the set on which its configured. It inevitably winds up that there are many scripts like this - miscellaneous jobs that want to be run somewhere but don't care so much where; dedicating two servers to running these jobs, each one protected by get_zk_lock, means that each job will be sure to run (even in the face of machine failure) but no job will run twice.

The code for get_zk_lock is available in our public github repo,