Removing outliers in R

ThePerson · May 13, 2012 · Viewed 11.4k times

I have looked at a set of data and decided it would be good to remove outliers, defining an outlier as any value 2 or more standard deviations away from the mean.

If I have a set of data, say 500 rows with 15 different attributes, how can I remove every row in which one or more attributes are 2 or more standard deviations away from that attribute's mean?

Is there an easy way to do this using R? Thanks,

Answer

Tyler Rinker · May 13, 2012

There are probably lots of ways, and probably add-on packages, to deal with this. I'd suggest you try this first:

library(sos); findFn("outlier")

Here's a way you could do what you're asking for using the scale function, which standardizes vectors (it subtracts the mean and divides by the standard deviation).

#create a data set with outliers
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i) 
    sample(c(1:50, 100:101), 200, replace=TRUE)))

#standardize each column (we use it in the outdet function)
scale(dat)

#create function that flags values at least 2 sd from the mean (in either direction)
outdet <- function(x) abs(scale(x)) >= 2
#index with the function to remove those values
dat[!apply(sapply(dat, outdet), 1, any), ]
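
To make it clearer what scale is doing inside outdet, here is a minimal sketch that checks it against a z-score computed by hand. The vector x is made up for illustration and is not part of the data set above.

```r
# Hypothetical example vector (not from the original data)
x <- c(2, 4, 4, 4, 5, 5, 7, 30)

# scale() centers on the mean and divides by the sample standard deviation;
# it returns a one-column matrix, so flatten it back to a plain vector
z_scale <- as.vector(scale(x))

# The same standardization written out by hand
z_manual <- (x - mean(x)) / sd(x)

# The two agree up to floating-point error
all.equal(z_scale, z_manual)

# Values at least 2 sd from the mean (only 30 qualifies here)
x[abs(z_scale) >= 2]
```

Note that the outlier itself inflates both the mean and the standard deviation, so with very small samples a single extreme value may not reach the 2 sd cutoff.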

So, to answer your question: yes, there is an easy way. The code to do this can be boiled down to one line:

dat[!apply(sapply(dat, function(x) abs(scale(x)) >= 2), 1, any), ]
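
If you need this more than once, the one-liner could be wrapped in a small function. This is only a sketch: remove_outlier_rows and its k argument are names invented here, and it assumes every column of the data frame is numeric.

```r
# Hypothetical helper: drop rows where any column is at least k sd from its mean.
# Assumes all columns of df are numeric.
remove_outlier_rows <- function(df, k = 2) {
  # Logical matrix: TRUE where a value is at least k sd from its column mean
  flagged <- abs(scale(df)) >= k
  # Keep rows with no flagged values; NA comparisons count as not flagged
  df[rowSums(flagged, na.rm = TRUE) == 0, , drop = FALSE]
}

# Same example data as above
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i)
    sample(c(1:50, 100:101), 200, replace = TRUE)))

cleaned <- remove_outlier_rows(dat)
nrow(dat) - nrow(cleaned)  # number of rows dropped
```

Passing a different k (say 3) loosens the cutoff. Using rowSums on the logical matrix replaces the apply(..., 1, any) step and gives the same rows for data without missing values.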

And I'm guessing there's a package that may do this and more. The sos package is terrific (IMHO) for finding functions to do what you want.