Pandas drop rows vs filter

ojon · Jun 5, 2018

I have a pandas dataframe and want to get rid of rows in which the column 'A' is negative. I know 2 ways to do this:

df = df[df['A'] >= 0]

or

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

What is the recommended solution? Why?

Answer

VaM · Jun 5, 2018

The recommended solution is the most efficient one, which in this case is the first:

df = df[df['A'] >= 0]

In the second solution

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

you are repeating the slicing process. Let's break it into pieces to understand why.

When you write

df['A'] >= 0

you are creating a mask: a Boolean Series with one entry for each index label of df, whose value is True or False according to a condition (in this case, whether the value of column 'A' at that index is greater than or equal to 0).
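As a minimal sketch of what such a mask looks like, assuming a small made-up DataFrame (only the column name 'A' comes from the question; the values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data; only the column name 'A' is from the question.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5]})

# Comparing a column with a scalar yields a Boolean Series
# that shares df's index, one entry per row.
mask = df["A"] >= 0

print(mask.tolist())                # [True, False, True, False, True]
print(mask.dtype)                   # bool
print(mask.index.equals(df.index))  # True
```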

When you write

df[df['A'] >= 0]

you are accessing the rows for which your mask (df['A'] >= 0) is True. This is a slicing method supported by pandas (Boolean indexing) that lets you select rows by passing a Boolean Series; it returns a new DataFrame containing only the rows for which the Series was True.
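A short illustration of that selection, again on made-up data (the column 'B' and all values are assumptions added for the example). It shows that the result keeps the original index labels and that df itself is left untouched:

```python
import pandas as pd

# Hypothetical sample data; column 'B' exists only for illustration.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5], "B": list("vwxyz")})

# Passing the Boolean Series inside [] keeps only the rows where it is True.
kept = df[df["A"] >= 0]

print(kept["A"].tolist())   # [3, 0, 5]
print(kept.index.tolist())  # [0, 2, 4]  (original labels are preserved)
print(len(df))              # 5          (the original DataFrame is unchanged)
```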

Finally, when you write this

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

you are repeating the process because

df[df['A'] < 0]

is already slicing your DataFrame (in this case, for the rows you want to drop). You are then getting the index labels of those rows, going back to the original DataFrame, and explicitly dropping them. There is no need for this: you already sliced the DataFrame in the first step, so you could have selected the rows you want to keep directly.
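To see both approaches side by side, here is a small sketch on hypothetical data. It assumes a unique index and no missing values in 'A'; under those assumptions the two results should match, and the second approach simply does the extra work described above:

```python
import pandas as pd

# Hypothetical sample data with a unique default index and no NaN in 'A'.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5]})

# First approach: one Boolean selection of the rows to keep.
filtered = df[df["A"] >= 0]

# Second approach: select the rows to discard, take their labels,
# then drop those labels from the original DataFrame.
sel_rows = df[df["A"] < 0].index
dropped = df.drop(sel_rows, axis=0)

print(filtered["A"].tolist())  # [3, 0, 5]
print(dropped["A"].tolist())   # [3, 0, 5]
```

The assumptions matter: a NaN in 'A' fails both comparisons, so the first approach would discard that row while the second would keep it, and drop removes every row that shares a dropped label if the index contains duplicates.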