a.2 <- sample(1:10, 100, replace = TRUE)
b.2 <- sample(1:100, 100, replace = TRUE)
a.3 <- data.frame(a.2, b.2)
r <- sapply(split(a.3, a.2), function(x) which.max(x$b.2))
a.3[r, ]
This returns the index within each subgroup (each list element), not the row index in the entire data.frame. I'm trying to return the largest value of b.2 for each subgroup of a.2. How can I do this efficiently?
The ddply and ave approaches are both fairly resource-intensive, I think. ave fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). tapply is a handy choice, but what I generally need to do is select the whole row that has the most extreme value (largest, smallest, and so on) of some column for each unique key, and the key is usually defined by more than one column. The best solution I've found is to sort and then use the negation of duplicated to select only the first row for each unique key. For the simple example here:
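a <- sample(1:10, 100, replace = TRUE)
b <- sample(1:100, 100, replace = TRUE)
f <- data.frame(a, b)
# Sort by key, then by value descending, so the largest b comes first within each a
sorted <- f[order(f$a, -f$b), ]
# Keep only the first row for each key
highs <- sorted[!duplicated(sorted$a), ]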
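As a sanity check, the result of the sort-and-deduplicate approach can be compared against the per-group maxima from tapply. This is only a sketch: the seed and the name expected are my additions, chosen so the run is reproducible.

```r
set.seed(42)  # assumed seed, only so the example is reproducible
a <- sample(1:10, 100, replace = TRUE)
b <- sample(1:100, 100, replace = TRUE)
f <- data.frame(a, b)

# Sort by key, then by value descending, and keep the first row per key
sorted <- f[order(f$a, -f$b), ]
highs <- sorted[!duplicated(sorted$a), ]

# Per-group maxima computed independently; names are the key values as strings
expected <- tapply(f$b, f$a, max)

# Every selected row should carry the maximum b for its key
stopifnot(all(highs$b == expected[as.character(highs$a)]))
```

If stopifnot passes silently, each row in highs holds the largest b for its value of a.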
I think the performance gains over ave or ddply, at least, are substantial. It is slightly more complicated for multi-column keys, but order accepts any number of sort columns and duplicated works on data frames, so the same approach still applies.
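For a multi-column key, the same idea might look like the sketch below. The data frame f2 and its columns k1, k2 and v are hypothetical, made up purely for illustration; they are not from my actual problem.

```r
set.seed(1)  # assumed seed, only so the example is reproducible

# Hypothetical data: two key columns (k1, k2) and one value column (v)
f2 <- data.frame(
  k1 = sample(1:3, 50, replace = TRUE),
  k2 = sample(c("x", "y"), 50, replace = TRUE),
  v  = sample(1:100, 50, replace = TRUE)
)

# Sort by both keys, then by the value in descending order
sorted2 <- f2[order(f2$k1, f2$k2, -f2$v), ]

# duplicated() on the key columns flags every row after the first
# within each (k1, k2) combination, so negating it keeps the top row
highs2 <- sorted2[!duplicated(sorted2[, c("k1", "k2")]), ]
```

Passing only the key columns to duplicated matters here: calling it on the whole data frame would compare the value column too and would not collapse each key to a single row.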