Replace NaN or missing values with rolling mean or other interpolation

Alexis Eggermont picture Alexis Eggermont · Aug 11, 2014 · Viewed 19.9k times · Source

I have a pandas dataframe with monthly data that I want to compute a 12 months moving average for. Data for for every month of January is missing, however (NaN), so I am using

pd.rolling_mean(data["variable"]), 12, center=True)

but it just gives me all NaN values.

Is there a simple way that I can ignore the NaN values? I understand that in practice this would become a 11-month moving average.

The dataframe has other variables which have January data, so I don't want to just throw out the January columns and do an 11 month moving average.

Answer

JohnE picture JohnE · Aug 11, 2014

There are several ways to approach this, and the best way will depend on whether the January data is systematically different from other months. Most real-world data is likely to be somewhat seasonal, so let's use the average high temperature (Fahrenheit) of a random city in the northern hemisphere as an example.

df=pd.DataFrame({ 'month' : [10,11,12,1,2,3],
                  'temp'  : [65,50,45,np.nan,40,43] }).set_index('month')

You could use a rolling mean as you suggest, but the issue is that you will get an average temperature over the entire year, which ignores the fact that January is the coldest month. To correct for this, you could reduce the window to 3, which results in the January temp being the average of the December and February temps. (I am also using min_periods=1 as suggested in @user394430's answer.)

df['rollmean12'] = df['temp'].rolling(12,center=True,min_periods=1).mean()
df['rollmean3']  = df['temp'].rolling( 3,center=True,min_periods=1).mean()

Those are improvements but still have the problem of overwriting existing values with rolling means. To avoid this you could combine with the update() method (see documentation here).

df['update'] = df['rollmean3']
df['update'].update( df['temp'] )  # note: this is an inplace operation

There are even simpler approaches that leave the existing values alone while filling the missing January temps with either the previous month, next month, or the mean of the previous and next month.

df['ffill']   = df['temp'].ffill()         # previous month 
df['bfill']   = df['temp'].bfill()         # next month
df['interp']  = df['temp'].interpolate()   # mean of prev/next

In this case, interpolate() defaults to simple linear interpretation, but you have several other intepolation options also. See documentation on pandas interpolate for more info. Or this statck overflow question: Interpolation on DataFrame in pandas

Here is the sample data with all the results:

       temp  rollmean12  rollmean3  update  ffill  bfill  interp
month                                                           
10     65.0        48.6  57.500000    65.0   65.0   65.0    65.0
11     50.0        48.6  53.333333    50.0   50.0   50.0    50.0
12     45.0        48.6  47.500000    45.0   45.0   45.0    45.0
1       NaN        48.6  42.500000    42.5   45.0   40.0    42.5
2      40.0        48.6  41.500000    40.0   40.0   40.0    40.0
3      43.0        48.6  41.500000    43.0   43.0   43.0    43.0

In particular, note that "update" and "interp" give the same results in all months. While it doesn't matter which one you use here, in other cases one way or the other might be better.