I have a data.table `DT` and I want to run `model.matrix` on it. Each row has a string ID, which is stored in the `ID` column of `DT`. When I run `model.matrix` on `DT`, my formula excludes the `ID` column. The problem is that `model.matrix` drops some rows because of NAs. If I set the rownames of `DT` to the `ID` column before calling `model.matrix`, then the final model matrix has rownames and I'm all set. Otherwise, I can't figure out which rows I end up with. I'm setting the rownames with `rownames(DT) = DT$ID`. However, when I try to add a new column to `DT`, I get a complaint about
"Invalid .internal.selfref detected . . . At an earlier point, this data.table has been copied by R."
So I'm wondering
There are a couple of issues here.
Firstly, it is a feature of data.tables that they do not have rownames; instead they have keys, which are far more powerful. See this great vignette.
But it isn't the end of the world: `model.matrix` returns sensible rownames when you pass it a data.table.
For example:
A <- data.table(ID = 1:5, x = c(NA, 1:4), y = c(4:2,NA,3))
mm <- model.matrix( ~ x + y, A)
rownames(mm)
## [1] "2" "3" "5"
So rows 2, 3 and 5 are the ones included in the model matrix.
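To connect this back to the string IDs in the question, the rownames of the model matrix can be used as row positions in the original table. The following is only a sketch: the string IDs "a" to "e" are made up for illustration, and it assumes the table has not been reordered between building the model matrix and looking up the IDs.

```r
library(data.table)

# Hypothetical data: string IDs, with NAs in row 1 (x) and row 4 (y)
A <- data.table(ID = c("a", "b", "c", "d", "e"),
                x  = c(NA, 1:4),
                y  = c(4:2, NA, 3))

mm <- model.matrix(~ x + y, A)

# rownames(mm) hold the original row positions as character strings
kept <- as.integer(rownames(mm))

# IDs of the rows that survived NA removal
kept_ids <- A[kept, ID]

# mm is a plain matrix, so giving it rownames is safe
rownames(mm) <- kept_ids
```

Because `mm` is an ordinary matrix rather than a data.table, assigning its rownames does not trigger the self-reference warning.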
Now you can add this sequence as a column to `A`. This will be useful if you then set the key to something else (thereby losing the original order):
A[, rowid := seq_len(nrow(A))]
You might consider making it character (like the rownames of `mm`), but it won't really matter, as you can just as easily convert `rownames(mm)` to numeric when you need to reference rows.
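As a rough illustration of why a stored row index helps, the sketch below builds the model matrix, then reorders the table with `setkey`, and still recovers the surviving rows through `rowid`. The ID values are hypothetical and are deliberately in reverse alphabetical order so that keying changes the row order.

```r
library(data.table)

# Hypothetical data: IDs in reverse order so that setkey() reorders the rows
A <- data.table(ID = c("e", "d", "c", "b", "a"),
                x  = c(NA, 1:4),
                y  = c(4:2, NA, 3))

# Record the original row positions before anything reorders the table
A[, rowid := seq_len(nrow(A))]

mm   <- model.matrix(~ x + y, A)
kept <- as.integer(rownames(mm))   # original positions of the complete rows

# Keying by ID sorts A by reference, so positions no longer match rownames(mm)
setkey(A, ID)

# rowid still identifies the rows that went into the model matrix
kept_ids <- A[rowid %in% kept, ID]
```

Note that `kept_ids` comes back in the keyed (sorted) order, not in the order of the rows of `mm`.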
As for the warning that data.table gives, read its next sentence:
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr()
Row names are an attribute, and `rownames<-` (which internally, at some point, uses the equivalent of `attr<-`) may copy the data.table in the same way.
The relevant line from `row.names<-.data.frame` is
attr(x, "row.names") <- value
That being said, data.tables don't have rownames, so there is no point setting them.
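A minimal sketch of the by-reference alternative that the warning recommends: key the table on the ID column instead of setting rownames, and use the `set*` functions and `:=` for any changes. The table contents and the column names `value` and `z` are invented for this example.

```r
library(data.table)

# Hypothetical table with string IDs
DT <- data.table(ID = c("c", "a", "b"), x = c(3, 1, NA))

# Key by ID instead of assigning rownames(DT); this works by reference
setkey(DT, ID)

# Rename by reference rather than with names(DT) <- ...
setnames(DT, "x", "value")

# Adding a column afterwards should not raise the .internal.selfref warning,
# because no base-R replacement function has been applied to DT
DT[, z := value * 2]

# The key gives lookup by ID, which is what rownames were being used for
row_c <- DT["c"]
```

With the key in place, `DT["c"]` plays the role that indexing by rowname would play on a data.frame.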