specifying "skip NA" when calculating mean of the column in a data frame created by Pandas

lokheart picture lokheart · Jul 30, 2014 · Viewed 42.2k times · Source

I am learning Pandas package by replicating the outing from some of the R vignettes. Now I am using the dplyr package from R as an example:

http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

R script

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
  count = n(),
  dist = mean(Distance, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

Python script

planes = hflights.groupby('TailNum')
planes['Distance'].agg({'count' : 'count',
                        'dist' : 'mean'})

How can I state explicitly in python that NA needs to be skipped?

Answer

FooBar picture FooBar · Jul 30, 2014

That's a trick question, since you don't do that. Pandas will automatically exclude NaN numbers from aggregation functions. Consider my df:

    b   c   d  e
a               
2   2   6   1  3
2   4   8 NaN  7
2   4   4   6  3
3   5 NaN   2  6
4 NaN NaN   4  1
5   6   2   1  8
7   3   2   4  7
9   6   1 NaN  1
9 NaN NaN   9  3
9   3   4   6  1

The internal count() function will ignore NaN values, and so will mean(). The only point where we get NaN, is when the only value is NaN. Then, we take the mean value of an empty set, which turns out to be NaN:

In[335]: df.groupby('a').mean()
Out[333]: 
          b    c    d         e
a                              
2  3.333333  6.0  3.5  4.333333
3  5.000000  NaN  2.0  6.000000
4       NaN  NaN  4.0  1.000000
5  6.000000  2.0  1.0  8.000000
7  3.000000  2.0  4.0  7.000000
9  4.500000  2.5  7.5  1.666667

Aggregate functions work in the same way:

In[340]: df.groupby('a')['b'].agg({'foo': np.mean})
Out[338]: 
        foo
a          
2  3.333333
3  5.000000
4       NaN
5  6.000000
7  3.000000
9  4.500000

Addendum: Notice how the standard dataframe.mean API will allow you to control inclusion of NaN values, where the default is exclusion.