Why is Google's TrueTime API hard to duplicate?

Jon Bringhurst · Aug 22, 2013 · Viewed 11k times

I'm not sure why the press (Wired, Slashdot, etc.) generally says that Google's TrueTime API is hard to replicate.

I can understand how it would be a tough thing to get the low error intervals that Google is achieving, but I don't see how the API itself would be very difficult.

For example, I whipped up a hacked-together version. Here's the interval struct.

    /* A TrueTime-style interval: the true absolute time is assumed to
     * lie somewhere between earliest and latest. */
    typedef struct TT_interval {
            struct timeval earliest;
            struct timeval latest;
    } TT_interval;

Here's the now function, along with the includes and return codes the example needs.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <sys/timex.h>

    #define SUCCESS 0
    #define ERROR  -1

    int TT_now(TT_interval* interval)
    {
        struct ntptimeval tv;
        struct timeval delta;

        struct timeval* earliest_p = &(interval->earliest);
        struct timeval* latest_p = &(interval->latest);
        struct timeval* now_p = &(tv.time);
        struct timeval* delta_p = &delta;

        timerclear(&delta);
        timerclear(&interval->earliest);
        timerclear(&interval->latest);

        if(ntp_gettime(&tv) == 0) {
            /* maxerror should never be negative, but be defensive */
            tv.maxerror = tv.maxerror > 0 ? tv.maxerror : -(tv.maxerror);

            /* maxerror is reported in microseconds */
            delta.tv_sec = tv.maxerror / 1000000;
            delta.tv_usec = tv.maxerror % 1000000;

            /* [earliest, latest] = [now - maxerror, now + maxerror] */
            timeradd(now_p, delta_p, latest_p);
            timersub(now_p, delta_p, earliest_p);
        } else {
            printf("error on ntp_gettime: %s\n", strerror(errno));
            return ERROR;
        }

        return SUCCESS;
    }

Finally, here are the before and after functions (thin wrappers around the now function that could still use a bit of DRY refactoring).

    int TT_before(TT_interval* interval, bool* success)
    {
        TT_interval now;

        if(TT_now(&now) != SUCCESS) {
            return ERROR;
        }

        /* True when the interval is entirely in the past: its latest
         * possible time precedes the earliest possible current time. */
        *success = timercmp(&(interval->latest), &(now.earliest), <);
        return SUCCESS;
    }

    int TT_after(TT_interval* interval, bool* success)
    {
        TT_interval now;

        if(TT_now(&now) != SUCCESS) {
            return ERROR;
        }

        /* Mirror image: true when the interval is entirely in the
         * future, i.e. the latest possible current time precedes its
         * earliest possible time. */
        *success = timercmp(&(now.latest), &(interval->earliest), <);
        return SUCCESS;
    }
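
A trivial driver like the following can sanity-check it; it just prints the interval width, which should come out to about 2 * maxerror:

    int main(void)
    {
        TT_interval now;
        struct timeval width;

        if(TT_now(&now) != SUCCESS) {
            return 1;
        }

        timersub(&(now.latest), &(now.earliest), &width);
        printf("interval width: %ld.%06ld s\n",
               (long) width.tv_sec, (long) width.tv_usec);

        return 0;
    }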

I seem to be getting interval errors of around 5,000us to 350,000us (synced against a public NTP server). This is a far cry from Google's numbers, but you have to start somewhere.

Other than lackluster performance, is there a major flaw in this design that would prevent something like Spanner from being built on top of it?

Answer

Eric Burnett · Aug 28, 2013

The challenge in implementing a TrueTime API lies in the guarantees you must provide. Namely, the absolute time must never be outside the TrueTime interval on any server in the system. If this can happen, then absolute ordering of events is lost, as are most of the guarantees of Spanner. For example, if one server's interval claims the latest possible time is T while the true time is already past T, that server can hand out a commit timestamp that appears to precede a transaction that in fact finished earlier.

The Spanner paper achieves this by a combination of means (section 3):

  1. Multiple time servers, with disparate sources (GPS, atomic clocks), including time servers from other datacenters.
  2. Marzullo's algorithm to detect liars and multiplex the various trusted time sources into an update of the local machine clock (a simplified sketch follows this list).
  3. An assumed clock drift of 200us/s at spanservers, applied between clock synchronizations.
  4. Kicking machines out of the system whose measured local clock drift exceeds a threshold (threshold << 200us/s by necessity).
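
To make (2) concrete, here's a simplified sketch of Marzullo-style interval intersection: given each source's [lo, hi] offset interval, it finds the smallest interval consistent with the largest number of sources, and any source whose interval misses the result is a liar. This is illustrative only, not what Spanner's time masters actually run:

    #include <stdlib.h>

    /* One edge of a source's [lo, hi] offset interval: type -1 marks
     * the start of an interval, +1 marks its end. */
    typedef struct {
        double offset;
        int type;
    } Edge;

    static int edge_cmp(const void* a, const void* b)
    {
        const Edge* x = a;
        const Edge* y = b;
        if (x->offset < y->offset) return -1;
        if (x->offset > y->offset) return 1;
        return x->type - y->type;   /* starts sort before ends */
    }

    /* Finds the smallest interval [*lo, *hi] contained in the largest
     * number of source intervals and returns that count (or -1 on
     * allocation failure). Sources whose intervals miss [*lo, *hi]
     * entirely are the liars. */
    int marzullo(const double* lo_in, const double* hi_in, int n,
                 double* lo, double* hi)
    {
        Edge* edges = malloc(2 * n * sizeof(Edge));
        int best = 0, count = 0, i;

        if (edges == NULL) return -1;

        for (i = 0; i < n; i++) {
            edges[2*i]     = (Edge){ lo_in[i], -1 };
            edges[2*i + 1] = (Edge){ hi_in[i], +1 };
        }
        qsort(edges, 2 * n, sizeof(Edge), edge_cmp);

        for (i = 0; i < 2 * n; i++) {
            count -= edges[i].type;         /* start raises the count */
            if (count > best) {             /* only true on a start edge */
                best = count;
                *lo = edges[i].offset;
                *hi = edges[i + 1].offset;  /* next edge closes the region */
            }
        }

        free(edges);
        return best;
    }

For example, given source intervals [8, 12], [10, 14], and [29, 33], it returns [10, 12] with a count of 2, flagging the third source as a liar.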

Now, you can achieve this with simpler means - NTP and an assumed error interval of 10 minutes would trivially do. But as noted in the question, there are performance implications to this. Read-write transactions (4.2.1) must wait on commit, with an expected wait time of 2*errorAverage - 20 minutes in this example. Similarly, read-only transactions (4.2.2) at time "now" - rather than a time in the past - must wait for safe time to advance far enough; at least 10 minutes in this example. So to have a high-performance system, you need to minimize the error intervals as much as possible without losing your guarantees, which is where the complexity arises.
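
To see why the error interval shows up directly in latency, here's roughly what commit wait amounts to, sketched against the TT_now() from the question. The timestamp choice and the polling loop are simplifications, not Spanner's actual implementation:

    #include <unistd.h>

    /* Commit-wait sketch: pick the commit timestamp at the latest edge
     * of the current interval, then block until every correct clock in
     * the system must agree that timestamp is in the past. */
    int commit_wait(struct timeval* commit_ts)
    {
        TT_interval now;

        if(TT_now(&now) != SUCCESS) {
            return ERROR;
        }
        *commit_ts = now.latest;    /* s = TT.now().latest */

        /* Until now.earliest > s, some correct clock may still read a
         * time <= s, so acknowledging the commit could violate external
         * consistency. Expected wait is about 2 * average error. */
        do {
            usleep(1000);           /* crude polling */
            if(TT_now(&now) != SUCCESS) {
                return ERROR;
            }
        } while(!timercmp(&(now.earliest), commit_ts, >));

        return SUCCESS;
    }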

I'm not sure how ntp_adjtime is being called in your system - it's possible it's already being set using multiple untrusted and uncorrelated time sources, in which case you're most of the way there already. If you can also ensure that the maxerror value is guaranteed to be advancing faster than the possible clock drift of your system, you should be good to go. Most of the performance of Spanner, without your own personal atomic clock :).
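
As a rough way to check that last property, you could sample maxerror twice and confirm it grows at least as fast as your assumed drift bound between synchronizations (the drift bound and the 10-second window here are arbitrary, and the includes from the question's code are assumed):

    /* Rough check that the kernel's maxerror grows at least as fast as
     * an assumed worst-case drift rate between synchronizations.
     * Returns 1 if so, 0 if not, -1 on error or if a synchronization
     * reset maxerror mid-sample. */
    int maxerror_outpaces_drift(long drift_us_per_s)
    {
        struct ntptimeval a, b;
        const int window_s = 10;

        if(ntp_gettime(&a) != 0) return -1;
        sleep(window_s);
        if(ntp_gettime(&b) != 0) return -1;

        long growth = b.maxerror - a.maxerror;  /* microseconds */
        if(growth < 0) return -1;   /* clock synced during the window */
        return growth >= drift_us_per_s * window_s;
    }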