Python: how to make a histogram with equally *sized* bins

astabada · Oct 12, 2012 · Viewed 11.8k times

I have a set of data and want to make a histogram of it. I need the bins to have the same size, by which I mean that they must contain the same number of objects, rather than the more common (numpy.histogram) case of equally spaced bins. This naturally comes at the expense of the bin widths, which can, and in general will, differ.

I will specify the number of desired bins and the data set, and get the bin edges in return.

Example:
data = numpy.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = somefunc(data, nbins=3)
print(bins_edges)
>> [1.,1.3,2.1,2.12]

So the bins all contain 2 points, but their widths (0.3, 0.8, 0.02) are different.
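As a quick check of the example, the stated edges can be passed back to numpy.histogram, which accepts explicit (unequally spaced) bin edges. This is only a sketch for verifying the numbers above; the variable names are the ones from the example.

```python
import numpy as np

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
bins_edges = np.array([1., 1.3, 2.1, 2.12])  # edges taken from the example

# Count the points per bin; the last bin includes its right edge.
counts, _ = np.histogram(data, bins=bins_edges)
# Width of each bin is the difference between consecutive edges.
widths = np.diff(bins_edges)

print(counts)  # number of points in each bin
print(widths)  # bin widths
```

Each bin holds 2 points while the widths differ, which is the behaviour asked for.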

There are two limitations:
- if a group of data points is identical, the bin containing them could be bigger.
- if there are N data points and M bins are requested, there will be M bins of N//M points each, plus one extra bin if N%M is not 0.
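Both limitations can be seen on tiny inputs. The sketch below is illustrative only: the arrays and variable names are made up, and the second limitation is read the way the code in the question behaves (M bins of N // M points, plus one extra bin for the remainder).

```python
import numpy as np

# Limitation 1: repeated values. Four of the six points are identical.
tied = np.array([1., 1., 1., 1., 2., 3.])
tied_sorted = np.sort(tied)
# Edges at every second sorted value plus the maximum (3 bins of 2 points intended).
tied_edges = np.append(tied_sorted[[0, 2, 4]], tied_sorted[-1])
tied_counts, _ = np.histogram(tied, bins=tied_edges)
print(tied_edges)   # two edges coincide at 1.0
print(tied_counts)  # the counts are no longer equal

# Limitation 2: N is not a multiple of M.
n_points, n_bins = 7, 3
bin_size, leftover = divmod(n_points, n_bins)
total_bins = n_bins + (1 if leftover else 0)
print(bin_size, leftover, total_bins)
```

With ties, one bin collapses to zero width and its neighbour absorbs all the identical points; with a remainder, an extra, smaller bin appears.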

Below is some rough code I wrote, which works nicely for small data sets. What if I have 10**9 or more points and want to speed up the process?

import numpy as np

def def_equbin(in_distr, binsize=None, bin_num=None):
    # binsize is accepted but not used; bin_num is the number of full bins.
    distr_size = len(in_distr)

    # Integer division: each full bin holds bin_size points.
    bin_size = distr_size // bin_num
    # Leftover points that do not fill a whole bin.
    odd_bin_size = distr_size % bin_num

    # Indices that would sort the data.
    args = in_distr.argsort()

    # Row i holds the points that fall in bin i.
    hist = np.zeros((bin_num, bin_size))
    for i in range(bin_num):
        hist[i, :] = in_distr[args[i * bin_size: (i + 1) * bin_size]]

    if odd_bin_size == 0:
        odd_bin = None
        # Sorted positions of the left edge of each bin.
        bins_limits = np.arange(bin_num) * bin_size
    else:
        # The leftover points form one extra, smaller bin.
        odd_bin = in_distr[args[bin_num * bin_size:]]
        bins_limits = np.arange(bin_num + 1) * bin_size

    # Left edges, followed by the maximum as the last right edge.
    bins_limits = in_distr[args[bins_limits]]
    bins_limits = np.concatenate((bins_limits, [in_distr[args[-1]]]))

    return (hist, odd_bin, bins_limits)
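On the speed question, one option that may be worth timing is np.partition, which places only the requested ranks in their sorted positions instead of sorting everything. The sketch below is an assumption-laden illustration, not a benchmarked solution: equal_count_edges_partition is a hypothetical helper name, it returns only the edges, duplicate ranks are merged, and np.partition still makes a full copy of the data, which matters at 10**9 points.

```python
import numpy as np

def equal_count_edges_partition(data, nbins):
    """Hypothetical helper: equal-count bin edges without a full sort."""
    data = np.asarray(data)
    n = data.size
    bin_size = n // nbins

    # Sorted positions (ranks) of the left edge of every full bin.
    ranks = [i * bin_size for i in range(nbins)]
    if n % nbins:
        # Left edge of the extra bin that holds the leftover points.
        ranks.append(nbins * bin_size)
    # The maximum closes the last bin.
    ranks.append(n - 1)
    ranks = np.unique(ranks)  # sorted, duplicates merged

    # After partitioning, the element at each requested rank is the one
    # a full sort would put there; the rest is only partially ordered.
    partitioned = np.partition(data, ranks)
    return partitioned[ranks]

data = np.array([2.1, 1.2, 2.12, 1., 2.0, 1.3])
print(equal_count_edges_partition(data, nbins=3))
```

Whether this beats a full sort depends on the data size and the number of bins, so it should be measured on realistic input.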

Answer

aganders3 · Oct 12, 2012

Using your example case (bins of 2 points, 6 total data points):

from scipy import stats
bin_edges = stats.mstats.mquantiles(data, [0, 2./6, 4./6, 1])
>> array([1. , 1.24666667, 2.05333333, 2.12])
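The same quantile idea extends to any number of bins by using evenly spaced probabilities. The sketch below swaps in NumPy's np.quantile instead of scipy's mquantiles to stay NumPy-only; equal_count_edges is a hypothetical helper name. Because np.quantile uses a different default interpolation, the interior edges differ slightly from the mquantiles output above, and in both cases they are interpolated between data points rather than taken from the data.

```python
import numpy as np

def equal_count_edges(data, nbins):
    """Hypothetical helper: edges of nbins bins with (roughly) equal counts."""
    # nbins bins need nbins + 1 edges, at evenly spaced probabilities in [0, 1].
    probs = np.linspace(0.0, 1.0, nbins + 1)
    return np.quantile(np.asarray(data, dtype=float), probs)

data = np.array([1., 1.2, 1.3, 2.0, 2.1, 2.12])
edges = equal_count_edges(data, nbins=3)

# Check the result by counting the points that fall in each bin.
counts, _ = np.histogram(data, bins=edges)
print(edges)
print(counts)
```

When the data contain many repeated values, several quantiles can coincide, so the counts are only approximately equal in that case.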