faster equivalent of gettimeofday

Humble Debugger · Jun 27, 2011 · Viewed 37.4k times

We are building a very latency-sensitive application that needs to send hundreds of messages a second, each carrying a time field, so we want to consider optimizing gettimeofday. Our first thought was an rdtsc-based optimization. Any thoughts? Any other pointers? The required accuracy of the returned time value is in milliseconds, but it isn't a big deal if the value is occasionally out of sync with the receiver by 1-2 milliseconds. We are trying to do better than the 62 nanoseconds gettimeofday takes.

Answer

David Terei · Oct 27, 2012

POSIX Clocks

I wrote a benchmark for POSIX clock sources:

  • time (s) => 3 cycles
  • ftime (ms) => 54 cycles
  • gettimeofday (us) => 42 cycles
  • clock_gettime (ns) => 9 cycles (CLOCK_MONOTONIC_COARSE)
  • clock_gettime (ns) => 9 cycles (CLOCK_REALTIME_COARSE)
  • clock_gettime (ns) => 42 cycles (CLOCK_MONOTONIC)
  • clock_gettime (ns) => 42 cycles (CLOCK_REALTIME)
  • clock_gettime (ns) => 173 cycles (CLOCK_MONOTONIC_RAW)
  • clock_gettime (ns) => 179 cycles (CLOCK_BOOTTIME)
  • clock_gettime (ns) => 349 cycles (CLOCK_THREAD_CPUTIME_ID)
  • clock_gettime (ns) => 370 cycles (CLOCK_PROCESS_CPUTIME_ID)
  • rdtsc (cycles) => 24 cycles

These numbers are from an Intel Core i7-4771 CPU @ 3.50GHz running Linux 4.0. Each clock method was timed with the TSC register over thousands of runs, and the minimum cost was taken.
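The measurement approach described above can be sketched roughly as follows. This is not the benchmark code the answer links to, only a minimal illustration under stated assumptions: an x86 machine, GCC or Clang (for `__rdtsc()` from `<x86intrin.h>`), and a hypothetical helper name `min_cost_ticks`. The result includes the cost of reading the TSC itself, and `rdtsc` is not a serializing instruction, so treat the numbers as approximate.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* exposes Linux-specific clock ids in <time.h> */
#endif
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(); assumes x86 with GCC or Clang */

/* Hypothetical helper: smallest observed cost, in TSC ticks, of a single
 * clock_gettime(id) call across `iters` trials. Taking the minimum filters
 * out trials disturbed by interrupts or cache misses. */
static uint64_t min_cost_ticks(clockid_t id, int iters)
{
    struct timespec ts;
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < iters; i++) {
        uint64_t start = __rdtsc();
        clock_gettime(id, &ts);
        uint64_t end = __rdtsc();

        if (end - start < best)
            best = end - start;
    }
    return best;
}
```

Calling `min_cost_ticks(CLOCK_REALTIME_COARSE, 100000)` and `min_cost_ticks(CLOCK_REALTIME, 100000)` on the target machine gives a quick comparison of the two clocks.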

You'll want to test on the machines you intend to run on, though, since how these clocks are implemented varies with hardware and kernel version. The code can be found here. It relies on the TSC register for cycle counting; that code is in the same repo (tsc.h).
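Since the question only needs millisecond accuracy, the coarse clocks in the list are worth checking on the target machine first. The sketch below assumes Linux (where `CLOCK_REALTIME_COARSE` is available) and uses two hypothetical helper names, `coarse_now_ms` and `clock_resolution_ns`. The coarse clock's resolution depends on the kernel configuration, so it is queried with `clock_getres` rather than assumed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* exposes Linux-specific clock ids in <time.h> */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: wall-clock time in milliseconds since the epoch,
 * read from the coarse realtime clock. */
static int64_t coarse_now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: resolution of a clock in nanoseconds,
 * or -1 if the clock is not supported on this system. */
static int64_t clock_resolution_ns(clockid_t id)
{
    struct timespec res;
    if (clock_getres(id, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000 + res.tv_nsec;
}
```

If `clock_resolution_ns(CLOCK_REALTIME_COARSE)` reports more than the 1-2 ms of slack the question allows, the coarse clock is probably not a fit on that kernel.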

TSC

Accessing the TSC (the processor's time-stamp counter) is the most accurate and cheapest way to time things. Generally, this is what the kernel itself uses. It's also quite straightforward on modern Intel chips, as the TSC is synchronized across cores and unaffected by frequency scaling, so it provides a simple, global time source. You can see an example of using it here with a walkthrough of the assembly code here.

The main issue with this (other than portability) is that there doesn't seem to be a good way to go from cycles to nanoseconds. As far as I can find, the Intel docs state that the TSC runs at a fixed frequency, but that this frequency may differ from the processor's stated frequency. Intel doesn't appear to provide a reliable way to figure out the TSC frequency. The Linux kernel appears to solve this by counting how many TSC cycles occur between two hardware timer events (see here).
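A userspace approximation of the same idea is to count TSC ticks across an interval timed by another clock. The sketch below is not the kernel's calibration code; it assumes x86, GCC or Clang, and Linux's `CLOCK_MONOTONIC_RAW`, and the names `estimate_tsc_hz` and `ticks_to_ns` are hypothetical. Longer intervals should give a more stable estimate.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* exposes Linux-specific clock ids in <time.h> */
#endif
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(); assumes x86 with GCC or Clang */

/* Hypothetical helper: estimate TSC ticks per second by sleeping for
 * roughly `interval_ns` and comparing the TSC delta with the elapsed
 * time reported by CLOCK_MONOTONIC_RAW. Returns 0.0 on failure. */
static double estimate_tsc_hz(long interval_ns)
{
    struct timespec t0, t1;
    struct timespec pause = { interval_ns / 1000000000L,
                              interval_ns % 1000000000L };

    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
    uint64_t c0 = __rdtsc();
    nanosleep(&pause, NULL);
    uint64_t c1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);

    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
    if (elapsed_ns <= 0.0)
        return 0.0;
    return (double)(c1 - c0) * 1e9 / elapsed_ns;
}

/* Convert a tick count to nanoseconds using a previously estimated rate. */
static double ticks_to_ns(uint64_t ticks, double tsc_hz)
{
    return (double)ticks * 1e9 / tsc_hz;
}
```

The estimate would typically be computed once at startup, for example `double hz = estimate_tsc_hz(50000000L);`, and then reused for every conversion.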

Memcached

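Memcached bothers to use the cache method, that is, keeping a cached time value that is refreshed periodically instead of querying the clock on every use. It may simply be to make sure the performance is more predictable across platforms, or to scale better with multiple cores. It may also not be a worthwhile optimization.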
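The cache method mentioned above can be sketched as follows. This is not Memcached's actual code, only a minimal illustration: the names `cached_ms`, `time_cache_refresh`, and `time_cache_now_ms` are hypothetical, and it assumes something in the application (for example an event-loop timer) calls the refresh function about once per millisecond.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      /* makes clock_gettime visible under strict -std modes */
#endif
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Last wall-clock reading in milliseconds since the epoch. Atomic so that
 * other threads can read it while one thread refreshes it. */
static _Atomic int64_t cached_ms;

/* Refresh the cached value from the real clock. Assumed to be called
 * periodically, e.g. from a 1 ms timer in the event loop. */
static void time_cache_refresh(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    int64_t now = (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    atomic_store_explicit(&cached_ms, now, memory_order_relaxed);
}

/* Hot path: a single atomic load, no clock call at all. */
static inline int64_t time_cache_now_ms(void)
{
    return atomic_load_explicit(&cached_ms, memory_order_relaxed);
}
```

The returned value can be stale by up to one refresh period, which fits the 1-2 ms of slack the question allows only if the refresh really does run that often. Whether this beats a 9-cycle coarse clock read is something to measure rather than assume.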