data.table objects now have a := operator. What makes this operator different from all other assignment operators? Also, what are its uses, how much faster is it, and when should it be avoided?
Here is an example showing 10 minutes reduced to 1 second (from the NEWS on the homepage). It's like subassigning to a data.frame, but it doesn't copy the entire table each time.
library(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
user system elapsed
287.062 302.627 591.984
system.time(for (i in 1:1000) DT[i,V1:=i])
user system elapsed
1.148 0.000 1.158 ( 511 times faster )
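To make the "doesn't copy" point concrete, here is a minimal sketch on a tiny table. The names alias and snapshot are made up for illustration; the only data.table functions used are data.table(), copy() and :=.

```r
library(data.table)

DT <- data.table(id = 1:3, v = c(10, 20, 30))

alias    <- DT        # plain assignment: a second name for the same table
snapshot <- copy(DT)  # copy() makes an independent copy

DT[2, v := 99]        # sub-assign by reference, no copy of the table

alias$v      # shows the new value, since alias and DT are the same object
snapshot$v   # still holds the original values
```

The flip side of the speed-up is that := changes every name bound to the same table, so take a copy() first if the original values are still needed.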
Putting the := in j like that allows more idioms:
DT["a",done:=TRUE] # binary search for group 'a' and set a flag
DT[,newcol:=42] # add a new column by reference (no copy of existing data)
DT[,col:=NULL] # remove a column by reference
and:
DT[,newcol:=sum(v),by=group] # like a fast transform() by group
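The idioms above assume a table keyed on a grouping column. A small runnable sketch, with made-up column names (group, v, done, total) and made-up data, might look like this:

```r
library(data.table)

# Toy table keyed on 'group' so that DT["a", ...] can use binary search.
DT <- data.table(group = c("a", "a", "b"), v = c(1, 2, 3), key = "group")

DT["a", done := TRUE]              # flag rows of group "a"; other rows get NA
DT[, newcol := 42]                 # add a column by reference
DT[, newcol := NULL]               # remove it again by reference
DT[, total := sum(v), by = group]  # per-group sum, repeated on each row of the group

print(DT)
```

Each call modifies DT in place, so there is no need to write DT <- DT[...].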
I can't think of any reason to avoid :=, other than inside a for loop. Since := appears inside DT[...], it comes with the small overhead of the [.data.table method; e.g., S3 dispatch and checking for the presence and type of arguments such as i, by, nomatch, etc. So for use inside for loops, there is a low-overhead, direct version of := called set. See ?set for more details and examples. The disadvantages of set include that i must be row numbers (no binary search) and that you can't combine it with by. By making those restrictions, set can reduce the overhead dramatically.
system.time(for (i in 1:1000) set(DT,i,"V1",i))
user system elapsed
0.016 0.000 0.018
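As a rough check that the two forms do the same thing, here is a sketch on a five-row toy table (the tables A and B and their contents are made up for illustration):

```r
library(data.table)

A <- data.table(V1 = rep(0L, 5), V2 = letters[1:5])
B <- copy(A)

# := inside [.data.table: flexible, but pays the method overhead on every call.
for (i in 1:5) A[i, V1 := i]

# set(): i must be row numbers and there is no by, but the call is much lighter.
for (i in 1:5) set(B, i = i, j = "V1", value = i)

identical(A$V1, B$V1)
```

When all the values are known up front, a single vectorised call such as DT[1:1000, V1 := 1:1000] avoids the loop altogether.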