I would like to run a large cluster of nodes in the cloud (AWS, Heroku, or maybe self-managed VMs), whose clocks must be synchronized within a predefined tolerance. I'm looking for a tolerance of maybe 200 ms. That means if I have 250 nodes, the largest clock difference between any two of the 250 nodes should never exceed 200 ms. I don't really care about the actual date / time with respect to the world. The solution has to be fault tolerant, and should not need to rely on the accuracy of the clock of any one system -- in fact, it's likely that none of the clocks will be terribly accurate.
The requirement is strong enough that if clock synchronization is determined to be unreliable for any particular node, I'd prefer to drop that node from the cluster -- so on any suspected failure, I'd like to be able to perform some type of controlled shutdown of that node.
I'd love to use something like NTP, but according to the NTP known issues twiki:
NTP was not designed to run inside of a virtual machine. It requires a high resolution system clock, with response times to clock interrupts that are serviced with a high level of accuracy. No known virtual machine is capable of meeting these requirements.
And although the same twiki then goes on to describe various ways of addressing the situation (such as running ntp on the host OS), I don't believe I'll have the ability to modify the environment enough on AWS or on Heroku to comply with the workarounds.
Even if I were not running in VMs, a trusted operations manager who has years of experience running ntp tells me that ntp can and will drop synchronization (or plain get the time wrong) due to bad local clock drift every once in a while. It doesn't happen often, but it does happen, and as you add machines, you increase your chances of this happening. AFAIK, detecting how far off you are requires stopping ntpd, running a query mode command, and starting it back up again, and it can take a long time to get an answer back.
To sum up -- I need a clock synchronization whose primary goal is as follows:
From the description, it seems like the Berkeley Algorithm might be the right choice here, but is it already implemented?
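For reference, one round of a Berkeley-style exchange could look roughly like the sketch below. It is only an illustration under stated assumptions: the readings are taken to be already compensated for network round-trip time, the names `berkeley_round`, `readings`, and `tolerance` are made up for this example, and using the median to discard outliers is a simplification rather than part of any particular implementation. Nodes that land in `suspects` are the ones I would want to shut down in a controlled way.

```python
from statistics import mean, median_low


def berkeley_round(readings, tolerance=0.2):
    """Run one averaging round over the polled clock readings.

    readings:  dict of node name -> clock reading in seconds
               (assumed already corrected for round-trip delay).
    tolerance: maximum allowed distance from the median, in seconds.

    Returns (adjustments, suspects): the correction each trusted node
    should apply, and the nodes whose clocks look unreliable.
    """
    # median_low always returns an actual reading, so at least one
    # node is guaranteed to be trusted.
    mid = median_low(readings.values())

    # Readings far from the median are excluded from the average so
    # that one bad clock cannot drag the whole cluster with it.
    trusted = {n: t for n, t in readings.items() if abs(t - mid) <= tolerance}
    suspects = sorted(set(readings) - set(trusted))

    # The cluster agrees on the average of the trusted clocks; no
    # external time source is consulted.
    target = mean(trusted.values())
    adjustments = {n: target - t for n, t in trusted.items()}
    return adjustments, suspects


if __name__ == "__main__":
    polled = {"a": 100.00, "b": 100.10, "c": 99.90, "d": 105.00}
    adjustments, suspects = berkeley_round(polled)
    print(adjustments)  # small corrections for a, b and c
    print(suspects)     # nodes to drop from the cluster
```

The master sends each node a relative adjustment rather than an absolute time, which is what makes the scheme independent of any single accurate clock.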
Nice to haves:
Since the FAQ for NTP specifically states why NTP time sync doesn't work 'right' under virtual machines, it's probably an insurmountable problem.
Most machines have an RTC (real-time clock) in them. On PCs it's how the time is stored so that you have a 'rough' guess as to what the time is if ntp is unavailable. Once the system is loaded there's a higher-resolution 'tick' clock, and that's what NTP sets.
That tick clock is subject to the drift of the virtual machine since ticks may or may not happen at the correct intervals - any time mechanism you attempt to use is going to be subject to that drift.
It's probably suboptimal design to try to enforce ntp synchronization on virtual machines. If machines A and B have a delta of 200ms, and machines B and C have a delta of 200ms, C could be 400ms away from A. You can't control that.
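To make the arithmetic concrete, here is a small sketch that measures every node against one common reference and reports the worst pairwise difference. The function names and the offsets are invented for illustration. The point it shows is that a 200ms pairwise tolerance only holds if every node stays within roughly 100ms of the shared reference.

```python
def max_pairwise_spread_ms(offsets_ms):
    """Largest clock difference between any two nodes, in milliseconds.

    offsets_ms: dict of node name -> offset from a common reference.
    The worst pair is always the earliest clock versus the latest one.
    """
    values = offsets_ms.values()
    return max(values) - min(values)


def within_tolerance(offsets_ms, tolerance_ms=200):
    """True if no two nodes differ by more than tolerance_ms."""
    return max_pairwise_spread_ms(offsets_ms) <= tolerance_ms


if __name__ == "__main__":
    # Each neighbouring pair is 200ms apart, but A and C are 400ms apart.
    chained = {"A": -200, "B": 0, "C": 200}
    print(max_pairwise_spread_ms(chained), within_tolerance(chained))

    # Keeping every node within 100ms of the reference bounds all pairs.
    bounded = {"A": -100, "B": 0, "C": 100}
    print(max_pairwise_spread_ms(bounded), within_tolerance(bounded))
```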
You're better off using a centralized messaging system like zeromq to keep everybody in sync with the job queue. It's going to be more overhead, but relying on system tick time is a dodgy affair at best. There are many clustering solutions that track cluster participation using all sorts of reliable mechanisms to ensure that everyone is in sync. Take a look at corosync or spread; they've solved this already for things like two-phase commits.
Incidentally, ntp 'giving up' when drift is too high can be circumvented by instructing it to 'slam' (step) the time to the new value rather than 'slew' it. By default ntp will incrementally update the system time to account for its drift from 'real time'. I forget how to configure this in ntpd, but if you use ntpdate the flag that forces a step is -b. The related -B flag does the opposite and forces slewing, which the man page describes as follows:
-B Force the time to always be slewed using the adjtime(2) system call, even if the measured
offset is greater than +-128 ms. The default is to step the time using settimeofday(2) if the offset
is greater than +-128 ms. Note that, if the offset is much greater than +-128 ms in this case, it
can take a long time (hours) to slew the clock to the correct value. During this time, the host
should not be used to synchronize clients.
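The "hours" figure in that man page excerpt is easy to sanity-check with a rough calculation. The sketch below assumes a maximum slew rate of 500 ppm (0.5 ms of correction per second), which is the figure usually quoted for ntpd. Treat that rate, and the function name, as assumptions rather than something guaranteed on your platform.

```python
def slew_duration_seconds(offset_s, max_slew_ppm=500.0):
    """Estimate how long slewing takes to remove a clock offset.

    offset_s:     measured offset in seconds (sign is ignored).
    max_slew_ppm: assumed maximum slew rate in parts per million;
                  500 ppm means 0.5 ms corrected per elapsed second.
    """
    rate = max_slew_ppm * 1e-6  # seconds corrected per second
    return abs(offset_s) / rate


if __name__ == "__main__":
    # An offset right at the 128 ms threshold takes a few minutes.
    print(slew_duration_seconds(0.128))

    # A 10 second offset takes several hours to slew away.
    print(slew_duration_seconds(10.0) / 3600.0)
```

Under that assumption, an offset much larger than the 200 ms tolerance cannot be slewed away quickly, so a node in that state would have to be stepped or taken out of the cluster.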