Searching for a fast/efficient histogram algorithm (with pre-specified bins)

ggkmath picture ggkmath · Dec 23, 2010 · Viewed 47.6k times · Source

I don't do much coding outside of Matlab, but I have a need to export my Matlab code to another language, most likely C. My Matlab code includes a histogram function, histc(), that places my input data (which is double-precision, not integer) into a specified array of bins, to form a histogram.

I'm sure I can piece together a couple nested loops to generate a histogram function, but I need this function to be fast and memory-light, as it will be accessed repeatedly and often.

To avoid re-inventing the wheel, anyone know if C language has any existing histogram function(s) available for use, or whether people needing such a thing generally create it themselves?

Anyone know an efficient algorithm for creating a histogram? Pseudo-code is fine.

Thanks in advance.

Answer

Tom picture Tom · Dec 23, 2010

The "ideal" histogram algorithm will depend upon the range you expect to capture. Generally any histogram algorithm will look like this:

const int NSAMPLES = whatever;
double samples[NSAMPLES] = { 1.0, 3.93, 1e30, ... }; // your data set
const int NBUCKETS = 10; // or whatever
int counts[NBUCKETS] = { 0 };
for (int i = 0; i != NSAMPLES; ++i) {
    counts[TRANSFER(samples[i])]++;
}

where TRANSFER() is some function that maps your inputs to a bin (0th or Nth bin mapping to "out of range" of applicable).

The exact implementation of TRANSFER() depends a lot on the expected distribution of your sample and where you are interested in detail. Some common approaches I have seen:

  • uniform distribution in range [a,b] (requires linear transform)
  • logarithmic distribution of unsigned integer values (best when combined with some bit twiddling hacks to quickly determine the nearest power-of-two or similar).

If you don't know the distribution up-front, then you really can't have an efficient mechanism to bin them effectively: you'll either have to guess (biased or uninformative results) or store everything and sort it at the end, binning into equal-sized buckets (poor performance).