Search for and remove outliers from a dataframe grouped by a variable

Kole Stewart · Feb 24, 2015

I have a data frame that has 5 variables and 800 rows:

head(df)
       V1 variable    value element OtolithNum
1 24.9835       V7 130230.0      Mg         25
2 24.9835       V8 145844.0      Mg         25
3 24.9835       V9 126126.0      Mg         25
4 24.9835      V10 103152.0      Mg         25
5 24.9835      V11 129571.9      Mg         25
6 24.9835      V12 114214.0      Mg         25

I need to perform the following:

  1. Identify all values (from the "value" variable) that are more than 2 standard deviations from the median, grouped by the "element" variable.
  2. Remove the outliers from the data frame (or create a new data frame with the outliers excluded).

I have been using the dplyr package, with the following code to group by the "element" variable and compute the mean values:

df1=df %>%
  group_by(element) %>%
  summarise_each(funs(mean), value)

Can you please help me modify or extend the code above so that outliers (defined above as more than 2 SD from the median) are removed within each "element" group before I extract the means?

I have tried the following code from another post (that's why the data names don't match my data above), without luck:

#standardize each column (we use it in the outdet function)
   scale(dat)
#create function that looks for values > +/- 2 sd from mean
   outdet <- function(x) abs(scale(x)) >= 2
#index with the function to remove those values
   dat[!apply(sapply(dat, outdet), 1, any), ]

Answer

Zelazny7 · Feb 24, 2015

Here's a method using base R:

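# simulated data: 5 groups, 10,000 standard-normal values
element <- sample(letters[1:5], 1e4, replace=TRUE)
value <- rnorm(1e4)
df <- data.frame(element, value)

# per group: drop values more than 2 sd from the group median, then average
means.without.ols <- tapply(value, element, function(x) {
  mean(x[!(abs(x - median(x)) > 2*sd(x))])
})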
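
The tapply() call above returns only the group means. If the filtered data frame itself is wanted (point 2 of the question), one possible base R sketch uses ave() to compute each row's group median and sd and then subsets on a logical mask. The names grp_median, grp_sd, keep, df_clean and df_out are illustrative, and the seed and sample size are arbitrary choices for this example, not part of the original answer:

```r
# Illustrative data with the same two columns as the example above
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  element = sample(letters[1:5], 1000, replace = TRUE),
  value   = rnorm(1000)
)

# Group median and sd, repeated so each row carries its own group's value
grp_median <- ave(df$value, df$element, FUN = median)
grp_sd     <- ave(df$value, df$element, FUN = sd)

# TRUE for rows within 2 sd of their group's median
keep <- abs(df$value - grp_median) <= 2 * grp_sd

df_clean <- df[keep, ]    # new data frame with outliers excluded
df_out   <- df[!keep, ]   # the rows flagged as outliers

# Group means computed on the cleaned data
means_clean <- tapply(df_clean$value, df_clean$element, mean)
```

Keeping df_out alongside df_clean makes it easy to inspect which rows were dropped before trusting the means.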

And using dplyr:

df1 = df %>%
  group_by(element) %>%
  filter(!(abs(value - median(value)) > 2*sd(value))) %>%
  summarise_each(funs(mean), value)

Comparison of results:

> means.without.ols
           a            b            c            d            e 
-0.008059215 -0.035448381 -0.013836321 -0.013537466  0.021170663 

> df1
Source: local data frame [5 x 2]

  element        value
1       a -0.008059215
2       b -0.035448381
3       c -0.013836321
4       d -0.013537466
5       e  0.021170663
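
For readers on a more recent dplyr, where summarise_each() and funs() are deprecated, the same pipeline can be sketched with plain summarise(). This version also keeps the filtered rows as their own data frame before averaging. It is a minimal sketch, not the original answer's code: df_clean is an illustrative name, and the seed and sample size are arbitrary.

```r
library(dplyr)

# Illustrative data with the same two columns as the example above
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  element = sample(letters[1:5], 1000, replace = TRUE),
  value   = rnorm(1000)
)

# Keep rows within 2 sd of their group's median;
# median() and sd() are evaluated per group because of group_by()
df_clean <- df %>%
  group_by(element) %>%
  filter(abs(value - median(value)) <= 2 * sd(value)) %>%
  ungroup()

# Group means on the cleaned data
df1 <- df_clean %>%
  group_by(element) %>%
  summarise(value = mean(value))
```

Note that the sd used for the cutoff is computed on the unfiltered group, outliers included, which matches the definition in the question.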