I have a dataframe with >100 columns, and I would to find the unique rows, by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it working with unique
or duplicated
myself.
In the below, I would like to unique only using id and id2:
data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
id id2 somevalue
1 1 x
1 1 y
3 4 z
I would like to obtain either:
id id2 somevalue
1 1 x
3 4 z
or:
id id2 somevalue
1 1 y
3 4 z
(I have no preference which of the unique rows is kept)
Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
id id2 somevalue
1 1 1 x
3 3 4 z
Inside the duplicated
call, I'm simply passing only those columns from dat
that I don't want duplicates of. This code will automatically always select the first of any ambiguous values. (In this case, x.)