I have data with discrete x-values, such as
x = c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y = c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2);
How can I generate a new dataset of x and y values that drops every pair whose y-value is more than 2 standard deviations above the mean for its bin? For example, in the x=3 bin, 20 is more than 2 SDs above the mean, so that data point should be removed.
I think you want something like this, with the data in a data frame:
dat <- data.frame(x = x, y = y)
by(dat, dat$x, function(z) z$y[z$y < 2*sd(z$y)])
dat$x: 3
[1] 4 1 6 5 7 3 2
---------------------------------------------------------------------------------------------------------------
dat$x: 8
[1] 4 2 2 2 3
---------------------------------------------------------------------------------------------------------------
dat$x: 13
[1] 3 2
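EDIT after comment: the filter above compares y with 2*sd(y) directly, not with its distance from the bin mean. Measuring from the mean instead (note that abs() drops points more than 2 SDs on either side of the mean):
by(dat, dat$x,
   function(z) z$y[abs(z$y - mean(z$y)) < 2*sd(z$y)])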
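To see why 20 falls outside the x=3 bin, it can help to print the per-bin mean, SD, and upper cutoff. This is only a sketch: it rebuilds dat from the question's vectors, and the names bin_mean, bin_sd, and stats are mine, not part of the answer above.

```r
x <- c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y <- c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2)
dat <- data.frame(x = x, y = y)

# Per-bin mean and standard deviation of y (one entry per distinct x)
bin_mean <- tapply(dat$y, dat$x, mean)
bin_sd   <- tapply(dat$y, dat$x, sd)

# Collect the statistics and the upper cutoff (mean + 2 SD) in one table
stats <- data.frame(x    = as.numeric(names(bin_mean)),
                    mean = as.vector(bin_mean),
                    sd   = as.vector(bin_sd))
stats$upper <- stats$mean + 2 * stats$sd
print(stats)
```

For the x=3 bin the mean is 6 and the SD is 6, so the cutoff is 18 and 20 lies above it.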
EDIT: To get both x and y back, I changed the by function slightly so that it returns the matching rows, then combined the pieces with rbind via do.call:
do.call(rbind, by(dat, dat$x, function(z) {
  idx <- abs(z$y - mean(z$y)) < 2*sd(z$y)
  z[idx, ]
}))
Or, using plyr, in a single call:
library(plyr)
ddply(dat, .(x), function(z) {
  idx <- abs(z$y - mean(z$y)) < 2*sd(z$y)
  z[idx, ]
})
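If you only want to drop values above the mean, as the question literally asks, and keep the rows in their original order, a base R alternative is to compute the per-bin statistics with ave(). This is a sketch under those assumptions, not a replacement for the calls above; the names m, s, and clean are mine.

```r
x <- c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y <- c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2)
dat <- data.frame(x = x, y = y)

# ave() returns a vector as long as y, holding each row's own bin statistic
m <- ave(dat$y, dat$x, FUN = mean)
s <- ave(dat$y, dat$x, FUN = sd)

# One-sided filter: keep rows whose y is at most 2 SDs above the bin mean
clean <- dat[dat$y <= m + 2 * s, ]
print(clean)
```

With this data only the (3, 20) pair is removed. Note that sd() returns NA for a bin with a single observation, so such bins would need separate handling.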