I noticed a difference in how pandas.DataFrame.describe() and numpy.percentile() handle NaN values. For example:
import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.rand(100000),columns=['A'])
>>> a.describe()
A
count 100000.000000
mean 0.499713
std 0.288722
min 0.000009
25% 0.249372
50% 0.498889
75% 0.749249
max 0.999991
>>> np.percentile(a,[25,50,75])
[0.24937217017643742, 0.49888913303316823, 0.74924862428575034] # Same as a.describe()
# Add in NaN values
a.iloc[1:99999:3] = np.nan
>>> a.describe()
A
count 66667.000000
mean 0.499698
std 0.288825
min 0.000031
25% 0.249285
50% 0.500110
75% 0.750201
max 0.999991
>>> np.percentile(a,[25,50,75])
[0.37341740173545901, 0.75020053461424419, nan] # Not the same as a.describe()
# Remove NaN's
b = a[pd.notnull(a.A)]
>>> np.percentile(b,[25,50,75])
[0.2492848255776256, 0.50010992119477615, 0.75020053461424419] # Now in agreement with describe()
Pandas ignores NaN values in percentile calculations, while numpy includes them. Is there any compelling reason to include NaNs in a percentile calculation? It seems pandas handles this correctly, so I wonder why numpy does not do the same.
Begin Edit
Per Jeff's comment, this becomes an issue when resampling data. If I have a time series that contains NaN values and want to resample it to percentiles (per this post),
upper = df.resample('1A').apply(lambda x: np.percentile(x, q=75))
will include NaN values in the calculation (as numpy does). To avoid this, you must instead write
upper = df.resample('1A').apply(lambda x: np.percentile(x[pd.notnull(x)], q=75))
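With np.nanpercentile, the manual pd.notnull() filtering becomes unnecessary. A sketch using an illustrative hourly series mirroring the setup above (the series name and frequency are just examples):

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Illustrative hourly series with NaNs sprinkled in, as in the question
s = pd.Series(np.random.rand(100000),
              index=pd.date_range("Jan 1 2013", freq="H", periods=100000))
s.iloc[1:99999:3] = np.nan

# np.nanpercentile skips the NaNs, so no filtering step is needed
upper = s.resample("1A").apply(lambda x: np.nanpercentile(x, 75))
print(upper.describe())
```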
Perhaps a numpy feature request is in order. Personally, I see no reason to include NaNs in percentile calculations. In my opinion, describe() and np.percentile should return exactly the same values (I think that is the expected behavior), but the discrepancy is easy to miss (the np.percentile documentation does not mention it), which can silently skew the stats. That is my concern.
End Edit
For your edited use case, I think I'd stay in pandas and use Series.quantile instead of np.percentile:
>>> df = pd.DataFrame(np.random.rand(100000),columns=['A'],
... index=pd.date_range("Jan 1 2013", freq="H", periods=100000))
>>> df.iloc[1:99999:3] = np.nan
>>>
>>> upper_np = df.resample('1A').apply(lambda x: np.percentile(x, q=75))
>>> upper_np.describe()
A
count 0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
[8 rows x 1 columns]
>>> upper_pd = df.resample('1A').apply(lambda x: x.quantile(0.75))
>>> upper_pd.describe()
A
count 12.000000
mean 0.745648
std 0.004889
min 0.735160
25% 0.744723
50% 0.747492
75% 0.748965
max 0.750341
[8 rows x 1 columns]