Rownames for data.table in R for model.matrix

DavidR · Dec 20, 2012

I have a data.table DT and I want to run model.matrix on it. Each row has a string ID, stored in the ID column of DT, and my formula excludes that column. The problem is that model.matrix drops some rows because of NAs. If I set the rownames of DT to the ID column before calling model.matrix, the final model matrix has rownames and I'm all set; otherwise I can't tell which rows I end up with. I'm setting the rownames with rownames(DT) = DT$ID. However, when I then try to add a new column to DT, I get a complaint:

"Invalid .internal.selfref detected . . . At an earlier point, this data.table has been copied by R."

So I'm wondering:

  1. Is there a better way to set rownames for a data.table?
  2. Is there a better approach to solving this problem?

Answer

mnel · Dec 20, 2012

There are a couple of issues here.

First, it is by design that data.tables do not have rownames; instead they have keys, which are far more powerful. See this great vignette.

But it isn't the end of the world: model.matrix returns sensible rownames when you pass it a data.table.

For example:

library(data.table)

A <- data.table(ID = 1:5, x = c(NA, 1:4), y = c(4:2, NA, 3))

mm <- model.matrix( ~ x + y, A)

rownames(mm)

## [1] "2" "3" "5"

So rows 2, 3 and 5 are the ones included in the model matrix; rows 1 and 4 were dropped because they contain an NA.

Now you can add this sequence as a column to A. This will be useful if you later set the key to something else (thereby losing the original order).

A[, rowid := seq_len(nrow(A))]

You might consider making it character (like the rownames of mm), but it won't really matter, as you can just as easily convert rownames(mm) to numeric when you need to reference the rows.
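To connect this back to the string IDs in the question, here is a minimal sketch. The ID values "a" to "e" and the names kept and kept_ids are made up for illustration; the approach simply converts rownames(mm) to integers and looks them up against the rowid column.

```r
library(data.table)

# Toy data with string IDs (hypothetical values, for illustration only)
A <- data.table(ID = c("a", "b", "c", "d", "e"),
                x  = c(NA, 1:4),
                y  = c(4:2, NA, 3))

# Record the original row positions before any reordering (e.g. by setkey)
A[, rowid := seq_len(nrow(A))]

mm <- model.matrix(~ x + y, A)

# rownames(mm) hold the original positions of the rows that survived NA removal
kept <- as.integer(rownames(mm))

# Match on rowid rather than on position, so this still works after A is re-keyed
kept_ids <- A[match(kept, rowid), ID]
kept_ids
```

Matching on rowid instead of indexing A directly by position is a deliberate choice here: it keeps the lookup valid even if the table has been reordered since mm was built.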

As for the warning that data.table gives, the next sentence of that message explains it:

Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr()

rownames are an attribute, so rownames<- (which at some point internally uses the equivalent of attr<-) may copy the data.table in the same way.

The relevant line from `row.names<-.data.frame` is:

attr(x, "row.names") <- value

That being said, data.tables don't have rownames, so there is no point setting them.
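If a model matrix labelled by ID is still what you want, one possible workaround is to set the rownames on a temporary data.frame copy rather than on the data.table itself. This is only a sketch and not something the warning message suggests: the ID values and the names df and mm_df are made up for illustration, and it assumes the IDs are unique and that an extra copy of the data is acceptable.

```r
library(data.table)

# Toy data with string IDs (hypothetical values, for illustration only)
DT <- data.table(ID = c("a", "b", "c", "d", "e"),
                 x  = c(NA, 1:4),
                 y  = c(4:2, NA, 3))

# Work on a plain data.frame copy, so DT itself is never touched by rownames<-
df <- as.data.frame(DT)
rownames(df) <- df$ID   # requires unique, non-missing IDs

mm_df <- model.matrix(~ x + y, df)
rownames(mm_df)

# DT was not modified, so adding a column by reference still works as usual
DT[, z := x + y]
```

The trade-off is memory: as.data.frame makes a full copy, which may matter for large tables, whereas the rowid approach above works on the data.table in place.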