Edit 2019: This question was asked prior to changes in data.table
in November 2016, see the accepted answer below for both the current and previous methods.
I have a data.table
table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously for a data.frame I would have done this:
df -> unique(df[,c('V1', 'V2')])
but this doesn't work with data.table. I have tried unique(df[,c(V1,V2), with=FALSE])
but it seems to still only operate on the key of the data.table and not the whole row.
Any suggestions?
Cheers, Davy
Example
>dt
V1 V2
[1,] A B
[2,] A C
[3,] A D
[4,] A B
[5,] B A
[6,] C D
[7,] C D
[8,] E F
[9,] G G
[10,] A B
in the above data.table where V2
is the table key, only rows 4,7, and 10 would be removed.
> dput(dt)
structure(list(V1 = c("B", "A", "A", "A", "A", "A", "C", "C",
"E", "G"), V2 = c("A", "B", "B", "B", "C", "D", "D", "D", "F",
"G")), .Names = c("V1", "V2"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb4c4804578>, sorted = "V2")
For v1.9.8+ (released November 2016)
From ?unique.data.table
By default all columns are being used (which is consistent with ?unique.data.frame
)
unique(dt)
V1 V2
1: A B
2: A C
3: A D
4: B A
5: C D
6: E F
7: G G
Or using the by
argument in order to get unique combinations of specific columns (like previously keys were used for)
unique(dt, by = "V2")
V1 V2
1: A B
2: A C
3: A D
4: B A
5: E F
6: G G
Prior v1.9.8
From ?unique.data.table
, it is clear that calling unique
on a data table only works on the key. This means you have to reset the key to all columns before calling unique
.
library(data.table)
dt <- data.table(
V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
Calling unique
with one column as key:
setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G