I have a pandas data frame with few columns.
Now I know that certain rows are outliers based on a certain column value.
For instance
column 'Vol' has all values around
12xx
and one value is4000
(outlier).
Now I would like to exclude those rows that have Vol
column like this.
So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from mean.
What is an elegant way to achieve this?
If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.
df = pd.DataFrame(np.random.randn(100, 3))
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
description: