select rows with largest value of variable within a group in r

Misha picture Misha · May 12, 2010 · Viewed 8.5k times · Source
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

r<-sapply(split(a.3,a.2),function(x) which.max(x$b.2))

a.3[r,]

returns the list index, not the index for the entire data.frame

Im trying to return the largest value of b.2 for each subgroup of a.2. How can I do this efficiently?

Answer

Aaron Schumacher picture Aaron Schumacher · Sep 7, 2012

The ddply and ave approaches are both fairly resource-intensive, I think. ave fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). tapply is a handy choice, but what I generally need to do is select all the whole rows with the something-est some-value for each unique key (usually defined by more than one column). The best solution I've found is to do a sort and then use negation of duplicated to select only the first row for each unique key. For the simple example here:

a <- sample(1:10,100,replace=T)
b <- sample(1:100,100,replace=T)
f <- data.frame(a, b)

sorted <- f[order(f$a, -f$b),]
highs <- sorted[!duplicated(sorted$a),]

I think the performance gains over ave or ddply, at least, are substantial. It is slightly more complicated for multi-column keys, but order will handle a whole bunch of things to sort on and duplicated works on data frames, so it's possible to continue using this approach.