The following lines
a1, b1, _ = plt.hist(df['y'], bins='auto')
a2, b2 = np.histogram(df['y'], bins='auto')
print(a1 == a2)
print(b1 == b2)
show that every value of a1 equals the corresponding value of a2, and likewise for b1 and b2.
I then create a plot using pyplot alone (bins='auto' should use the same np.histogram() function):
plt.hist(df['y'], bins='auto')
plt.show()
I then try to achieve the same histogram by calling np.histogram() myself and passing the results into plt.hist(), but I get a blank histogram:
a2, b2 = np.histogram(df['y'], bins='auto')
plt.hist(a2, bins=b2)
plt.show()
As I understand how plt.hist(df['y'], bins='auto') works, these two plots should be exactly the same. Why isn't my approach using NumPy working?
EDIT
Following on from @MSeifert's answer below, I believe that for
counts, bins = np.histogram(df['y'], bins='auto')
bins is a list of the starting value for each bin, and counts is the corresponding number of values in each of these bins. As my histogram above shows, this should produce a nearly perfect normal distribution. However, if I call print(counts, bins), the result of counts shows that the very first and last bins have quite a substantial count of ~11,000. Why isn't this reflected in the histogram? Why are there not two large spikes at either tail?
EDIT 2
It was just a resolution issue and my plot was seemingly too small for the spikes at either end to render correctly. Zooming in allowed them to display.
You're assuming that plt.hist can differentiate between an array containing counts as values and an array containing values to count.
However, that's not what happens: when you pass the counts to plt.hist, it counts them and places them in the provided bins. That can lead to empty histograms, but also to weird ones.
So while plt.hist and numpy.histogram both work the same way, you cannot just pass the data obtained from numpy.histogram to plt.hist, because that would count the counts of the values (not what you expect):
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib notebook  # IPython magic: uncomment in a Jupyter notebook, not valid in a plain script
f, ax = plt.subplots(1)
arr = np.random.normal(10, 3, size=1000)
cnts, bins = np.histogram(arr, bins='auto')
ax.hist(cnts, bins=bins)
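To see numerically what happens when the counts are handed to hist as if they were data, here is a minimal sketch that does the same re-binning with np.histogram directly. The fixed seed and the names rng and recounted are assumptions added for illustration; they are not part of the original code.

```python
import numpy as np

# Fixed seed so the run is repeatable (an assumption for this sketch).
rng = np.random.default_rng(0)
arr = rng.normal(10, 3, size=1000)

# First pass: bin the raw data.
cnts, bins = np.histogram(arr, bins='auto')

# Second pass: bin the counts themselves over the same edges. This is
# the binning that happens when cnts is passed to hist as the data.
recounted, _ = np.histogram(cnts, bins=bins)

# bins holds the bin edges, so it has one more entry than cnts.
print(len(cnts), len(bins))

# Only count values that happen to lie inside [bins[0], bins[-1]] are
# kept; the rest are dropped, which is why the plot can look empty.
print(recounted.sum(), "of", len(cnts), "count values land in a bin")
```

The second histogram describes how often each count value occurs, not how the original data is distributed, so its shape has no relation to the first one.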
However, you can use a bar plot to visualize histograms obtained by numpy.histogram:
f, (ax1, ax2) = plt.subplots(2)
cnts, bins = np.histogram(arr, bins='auto')
ax1.bar(bins[:-1] + np.diff(bins) / 2, cnts, np.diff(bins))
ax2.hist(arr, bins='auto')
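If you would rather stay with hist than switch to bar, one alternative is to pass one point per bin and supply the precomputed counts through the weights parameter. The sketch below assumes a headless Agg backend and a fixed seed so it runs anywhere; the names rng, n, and n_ref are illustrative only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display needed)
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed, assumed for repeatability
arr = rng.normal(10, 3, size=1000)
cnts, bins = np.histogram(arr, bins='auto')

fig, (ax1, ax2) = plt.subplots(2)

# Treat each bin as a single data point located at its left edge and
# weight that point by the bin's precomputed count.
n, edges, _ = ax1.hist(bins[:-1], bins=bins, weights=cnts)

# Reference: let hist bin the raw data itself.
n_ref, edges_ref, _ = ax2.hist(arr, bins='auto')

# Both axes should now show the same bar heights.
print(np.allclose(n, n_ref))
```

Each left edge falls into its own bin, so the weighted histogram reproduces the original counts. On Matplotlib 3.4 or later, ax.stairs(cnts, bins) is another way to draw already-binned data.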