I have a pandas dataframe and want to get rid of rows in which the column 'A' is negative. I know 2 ways to do this:
df = df[df['A'] >= 0]
or
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
What is the recommended solution? Why?
The recommended solution is the more efficient one, which in this case is the first:
df = df[df['A'] >= 0]
In the second solution
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the slicing process. Let's break it into pieces to understand why.
When you write
df['A'] >= 0
you are creating a mask: a Boolean Series with one entry for each index label of df, whose value is True or False according to a condition (in this case, whether the value of column 'A' at that index is greater than or equal to 0).
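To see what the mask looks like, here is a minimal sketch. The sample values and the extra column 'B' are made up for illustration; only the column name 'A' comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'A' is from the question.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5], "B": list("vwxyz")})

# Element-wise comparison: one True/False per row, aligned on df's index.
mask = df["A"] >= 0

print(mask.tolist())  # [True, False, True, False, True]
print(mask.dtype)     # bool
```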
When you write
df[df['A'] >= 0]
you are accessing the rows for which your mask (df['A'] >= 0) is True. This is Boolean indexing, a selection method supported by pandas that lets you select rows by passing a Boolean Series; it returns a new DataFrame containing only the rows for which the Series was True.
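A short sketch of that selection, again on made-up sample data (the values and column 'B' are assumptions, not from the question):

```python
import pandas as pd

# Hypothetical sample data; only the column name 'A' is from the question.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5], "B": list("vwxyz")})

# Passing the Boolean Series inside [] keeps only the rows where it is True.
kept = df[df["A"] >= 0]

print(kept.index.tolist())  # [0, 2, 4]
print(kept["A"].tolist())   # [3, 0, 5]
print(len(df))              # 5, the original DataFrame is not modified
```

Note that the surviving rows keep their original index labels; if a fresh 0..n-1 index is wanted, kept.reset_index(drop=True) is one way to get it.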
Finally, when you write this
selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)
you are repeating the process because
df[df['A'] < 0]
is already slicing your DataFrame (in this case, selecting the rows you want to drop). You then take the index labels of those rows, go back to the original DataFrame, and explicitly drop them. There is no need for this: you already sliced the DataFrame in the first step, so you could have selected the rows you want to keep directly.
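As a rough check, the sketch below runs both approaches on made-up sample data and compares the results. It assumes a unique index and no missing values in 'A'; outside those assumptions the two can differ, since a NaN in 'A' fails both comparisons (the first approach removes that row, the second keeps it) and drop removes every row that shares a dropped label.

```python
import pandas as pd

# Hypothetical sample data with a unique index and no NaN in 'A'.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5], "B": list("vwxyz")})

# First approach: one Boolean selection of the rows to keep.
first = df[df["A"] >= 0]

# Second approach: select the rows to discard, then drop them by label.
sel_rows = df[df["A"] < 0].index
second = df.drop(sel_rows, axis=0)

print(sel_rows.tolist())     # [1, 3]
print(first.equals(second))  # True
```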