plt.hist() vs np.histogram() - unexpected results

Question 1


python numpy matplotlib histogram binning

KOB · Oct 10, 2017 · Viewed 7.5k times · Source

Answer


You're assuming that plt.hist can tell the difference between an array of counts and an array of values to be counted.

It can't. When you pass the counts to plt.hist, it treats them as data: it counts them and places them in the provided bins. That can produce an empty histogram, or just a strange-looking one.

So although plt.hist and numpy.histogram bin data the same way, you cannot pass the counts returned by numpy.histogram to plt.hist, because that counts the counts rather than the original values (not what you expect):

import numpy as np
import matplotlib.pyplot as plt

# %matplotlib notebook  (IPython/Jupyter magic; uncomment only inside a notebook)

f, ax = plt.subplots(1)
arr = np.random.normal(10, 3, size=1000)
cnts, bins = np.histogram(arr, bins='auto')
ax.hist(cnts, bins=bins)  # wrong on purpose: this bins the counts themselves, not arr
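To see why the call above can come out empty, it may help to repeat the second counting step with numpy.histogram directly. This is only a sketch with made-up data: the names values, edges and recounted are illustrative and do not come from the question.

```python
import numpy as np

# Hypothetical data: 100 values, all between 0 and 1.
values = np.concatenate([np.full(40, 0.25), np.full(60, 0.75)])
edges = np.array([0.0, 0.5, 1.0])

# First pass: count the values per bin.
counts, _ = np.histogram(values, bins=edges)
print(counts)     # [40 60]

# Second pass: roughly what plt.hist(counts, bins=edges) computes.
# The numbers 40 and 60 are now treated as data, and both lie outside [0, 1].
recounted, _ = np.histogram(counts, bins=edges)
print(recounted)  # [0 0]
```

Every bin ends up empty here because the counts are far larger than the largest bin edge, which matches the blank histogram described in the question.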

However, you can use a bar plot to visualize histograms obtained from numpy.histogram:

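f, (ax1, ax2) = plt.subplots(2)
cnts, bins = np.histogram(arr, bins='auto')
# Bar centers are the bin midpoints; bar widths are the bin widths.
ax1.bar(bins[:-1] + np.diff(bins) / 2, cnts, np.diff(bins))
# For comparison: let plt.hist bin the raw values itself.
ax2.hist(arr, bins='auto')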
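If you would rather stay with hist than switch to bar, one option is the weights argument: pass one sample per bin (its left edge) and weight it by that bin's count. The sketch below is not part of the original answer. It assumes a non-interactive Agg backend and a fixed seed, both chosen only so it runs anywhere and gives repeatable numbers.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
arr = rng.normal(10, 3, size=1000)
cnts, bins = np.histogram(arr, bins="auto")

fig, ax = plt.subplots()
# Each left edge falls inside its own bin, so weighting it by the
# precomputed count makes hist draw the original histogram.
n, edges, _ = ax.hist(bins[:-1], bins=bins, weights=cnts)
```

Here n should match cnts and edges should match bins, so the picture is the same one that ax.hist(arr, bins='auto') would draw.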

Question 2

The following lines

a1, b1, _ = plt.hist(df['y'], bins='auto')
a2, b2 = np.histogram(df['y'], bins='auto')

print(a1 == a2)
print(b1 == b2)

show that every value of a1 equals the corresponding value of a2, and likewise for b1 and b2.

I then create a plot using pyplot alone (with bins='auto', it should use the same np.histogram() function):

plt.hist(df['y'], bins='auto')
plt.show()

I then try to produce the same histogram by calling np.histogram() myself and passing the results into plt.hist(), but I get a blank histogram:

a2, b2 = np.histogram(df['y'], bins='auto')
plt.hist(a2, bins=b2)
plt.show()

As I understand how plt.hist(df['y'], bins='auto') works, these two plots should be exactly the same. Why isn't my method of using NumPy working?

EDIT

Following on from @MSeifert's answer below, I believe that for

counts, bins = np.histogram(df['y'], bins='auto')

bins is a list of the starting values of the bins, and counts is the corresponding number of values in each of these bins. As my histogram above shows, this should produce a nearly perfect normal distribution. However, if I call print(counts, bins), counts shows that the very first and last bins each have a substantial count of ~11,000. Why isn't this reflected in the histogram? Why are there not two large spikes, one at either tail?
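One detail worth checking here is the shape of what np.histogram returns: the second array holds the bin edges, including the right edge of the last bin, so it has one more entry than counts. A minimal sketch with made-up data (the values array is hypothetical and is not df['y']):

```python
import numpy as np

values = np.array([1.0, 1.5, 2.0, 2.5, 3.0])  # hypothetical data
counts, bins = np.histogram(values, bins=2)

print(counts)                 # [2 3]
print(bins)                   # [1. 2. 3.]
print(len(bins), len(counts)) # 3 2
```

The last bin is closed on the right, which is why the value 3.0 is counted in it.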

EDIT 2

It was just a resolution issue: my plot was too small for the spikes at either end to render correctly. Zooming in made them visible.
