I am going through the documentation of data.table and also noticed from some of the conversations here on SO that rbindlist is supposed to be better than rbind.
I would like to know why rbindlist is better than rbind, and in which scenarios rbindlist really excels over rbind.
Is there any advantage in terms of memory utilization?
rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame.
Some questions that show where rbindlist shines are:
Fast vectorized merge of list of data.frames by row
These have benchmarks that show how fast it can be.
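To get a feel for the difference yourself, a rough timing sketch along these lines should do. The list `pieces`, its size, and the column names are made up for illustration, and the timings will depend on your machine and package versions, so none are quoted here.

```r
library(data.table)

set.seed(1)
# Hypothetical input: 200 small data.frames with identical columns
pieces <- lapply(1:200, function(i) data.frame(id = i, value = rnorm(50)))

# Base R: dispatches to rbind.data.frame
t_base <- system.time(res_base <- do.call(rbind, pieces))

# data.table: one pass over the list in C
t_dt <- system.time(res_dt <- rbindlist(pieces))

# Compare elapsed times (values vary by machine)
print(t_base["elapsed"])
print(t_dt["elapsed"])
```

Both calls produce the same 10000 rows; only the class of the result (data.frame vs data.table) and the time taken differ.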
rbind.data.frame does lots of checking, and will match by name (i.e. it will account for the fact that columns may be in different orders, and match them up by name). rbindlist doesn't do this kind of checking, and will join by position, e.g.
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
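If you do want matching by name, later data.table releases added the use.names and fill arguments to rbindlist. The sketch below assumes a version recent enough to have both. The object names are only for illustration, and use.names is passed explicitly so the result does not depend on the default.

```r
library(data.table)

d1 <- data.frame(a = 1:2, b = 2:3)
d2 <- data.frame(b = 1:2, a = 2:3)

# Bind strictly by position (the behaviour described above)
by_pos <- rbindlist(list(d1, d2), use.names = FALSE)

# Bind by column name, like rbind.data.frame does
by_name <- rbindlist(list(d1, d2), use.names = TRUE)

# fill = TRUE pads columns missing from one input with NA
filled <- rbindlist(list(d1, data.frame(a = 3L)), fill = TRUE)

print(by_pos)
print(by_name)
print(filled)
```

Matching by name costs a little extra work, so binding by position remains the cheaper path when you know the columns already line up.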
It used to struggle to deal with factors, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names, see Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.
You can get in a muddle of rownames using do.call(rbind, list(...)), see How to avoid renaming of rows when using rbind inside do.call?
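A small sketch of the rownames muddle, using a made-up named list `parts`. The idcol argument used at the end is an assumption about your data.table version (it was added in a later release than the one this answer was written against).

```r
library(data.table)

# Hypothetical named list of data.frames
parts <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4))

# Base R folds the list names into the row names: "a.1", "a.2", "b.1", "b.2"
df <- do.call(rbind, parts)
print(rownames(df))

# rbindlist ignores row names altogether
dt <- rbindlist(parts)
print(dt)

# If the list names matter, keep them as an ordinary column instead
dt_id <- rbindlist(parts, idcol = "source")
print(dt_id)
```

Keeping the origin in a real column rather than in the row names means it survives later joins and subsetting.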
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R; it does lots of assigning, and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
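The by-reference point can be seen on a plain vector. This is only an illustration of the difference between setattr and attr<-, not a reproduction of what either rbind implementation does internally. The variable names and the "units" attribute are made up.

```r
library(data.table)

# setattr modifies the object in place, so a second name bound to the
# same object sees the new attribute too
x <- c(1, 2, 3)
y <- x                      # y and x refer to the same vector
setattr(x, "units", "cm")
print(attr(y, "units"))     # the attribute is visible through y as well

# attr<- works on a copy when the object is shared, so the other
# binding is left untouched
x2 <- c(1, 2, 3)
y2 <- x2
attr(x2, "units") <- "cm"
print(attr(y2, "units"))    # NULL: y2 still points at the original
```

The flip side of avoiding the copy is that setattr can change objects you did not mean to touch, so it is best reserved for objects you own.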