I have looked at a set of data and decided it would be good to remove outliers, defining an outlier as any value 2 or more standard deviations (SD) away from the mean.
If I have a data set of, say, 500 rows and 15 attributes, how can I remove every row in which one or more attributes are 2 or more standard deviations away from the mean?
Is there an easy way to do this using R? Thanks.
There are probably lots of ways to do this, and probably add-on packages that deal with it. I'd suggest you try this first:
library(sos); findFn("outlier")
Here's a way you could do what you're asking for using the scale
function, which standardizes vectors (subtracts the mean and divides by the standard deviation).
#create a data set with outliers
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i)
sample(c(1:50, 100:101), 200, replace=TRUE)))
#standardize each column to z-scores (printed for inspection only; outdet calls scale itself)
scale(dat)
#create function that flags values 2 or more sd from the mean, in either direction
outdet <- function(x) abs(scale(x)) >= 2
#index with the function to drop every row that has at least one flagged value
dat[!apply(sapply(dat, outdet), 1, any), ]
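Before dropping anything, it can help to see how much data the rule would remove. Here is a minimal sketch of that check, reusing the same example data and outdet function; the name flags is just an illustrative variable, not part of the original approach:

```r
# Rebuild the example data and the detection function from above
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i)
  sample(c(1:50, 100:101), 200, replace = TRUE)))
outdet <- function(x) abs(scale(x)) >= 2

# flags (illustrative name) is a logical matrix: one row per observation,
# one column per attribute, TRUE where the value is 2 or more sd from the mean
flags <- sapply(dat, outdet)

colSums(flags)               # number of flagged values in each column
sum(apply(flags, 1, any))    # number of rows that would be removed
```

With 15 attributes, a fair share of rows may contain at least one flagged value, so it is worth looking at these counts before filtering.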
So, to answer your question: yes, there is an easy way. The code can be boiled down to one line:
dat[!apply(sapply(dat, function(x) abs(scale(x)) >= 2), 1, any), ]
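If you want something reusable, the one-liner could be wrapped in a function along these lines. This is only a sketch under a few assumptions of mine: the name remove_outlier_rows and the k argument are hypothetical, non-numeric columns are ignored, and missing values (or constant columns, where the sd is 0) are treated as not being outliers.

```r
# Hypothetical helper (not a function from any package): drop rows in which
# any numeric column is k or more standard deviations from that column's mean.
remove_outlier_rows <- function(df, k = 2) {
  # Only numeric columns can be standardized
  num_cols <- vapply(df, is.numeric, logical(1))
  # scale() returns a matrix of z-scores, one column per numeric attribute
  z <- abs(scale(df[, num_cols, drop = FALSE]))
  # NA/NaN z-scores (missing values, constant columns) are not flagged
  is_out <- !is.na(z) & z >= k
  # Keep only the rows with no flagged value
  df[rowSums(is_out) == 0, , drop = FALSE]
}

# Usage on the example data from above
set.seed(10)
dat <- data.frame(sapply(seq_len(5), function(i)
  sample(c(1:50, 100:101), 200, replace = TRUE)))
cleaned <- remove_outlier_rows(dat)
nrow(dat) - nrow(cleaned)    # number of rows dropped
```

Making the cutoff an argument lets you try a stricter or looser rule (say k = 3) without rewriting the indexing.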
And I'm guessing there's a package that may do this and more. The sos
package is terrific (IMHO) for finding functions to do what you want.