I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results is a picture:
However, bins heights don't add up to 1, which means y axis doesn't show probability, it's something different. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis shall show probability, as bins heights sum up to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn histplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bin sum up to 0.6+0.6=1.2 which is already greater than 1, so y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is probability, as 0.4+0.4+0.2=1 as expected.
The amount of bins in these 2 cases are is the same for both methods used in each case: 100 bins for geometrically distributed data, 3 bins for small array l with 3 possible values. So bins amount is not the issue.
My question is: in seaborn distplot called with norm_hist=True, what is the meaning of y axis?
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.