How to drop rows of Pandas DataFrame whose value in a certain column is NaN

bigbug picture bigbug · Nov 16, 2012 · Viewed 1.1M times · Source

I have this DataFrame and want only the records whose EPS column is not NaN:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

...i.e. something like df.drop(....) to get this resulting dataframe:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?

Answer

Aman picture Aman · Nov 17, 2012

This question is already resolved, but...

...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

Pretty handy!