Big matrix to run glmnet()

Flavio Barros picture Flavio Barros · Jun 10, 2013 · Viewed 10.2k times · Source

I am having a problem to run glmnet lasso with a wide data set. My data has N=50, but p > 49000, all factors. So to run glmnet i have to create a model.matrix, BUT i just run out of memory when i call model.matrix(formula, data), where formula = Class ~ .

As a worked example i will generate a dataset:

data <- matrix(rep(0,50*49000), nrow=50)
for(i in 1:50) {
x = rep(letters[2:8], 7000)
y = sample(x=1:49000, size=49000)
data[i,] <- x[y]
}

data <- as.data.frame(data)
x = c(rep('A', 20), rep('B', 15), rep('C', 15))
y = sample(x=1:50, size=50)
class = x[y]
data <- cbind(data, class)

After that i tried to create a model.matrix to enter on glmnet.

  formula <- as.formula(class ~ .)
  X = model.matrix(formula, data)
  model <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

In the last step (X = model.matrix ...) i run out of memory. What can i do?

Answer

Flavio Barros picture Flavio Barros · Jun 17, 2013

I asked Professor Trevor Hastie and received the following advice:

"Hello Flavio

model.matrix is killing you. You will have 49K factors, and model matrix is trying to represent them as contrasts which will be 6 column matrices, so 49*6 approx 300K columns. Why not make binary dummy variables (7 per factor), and simply construct this directly without using model.matrix. You could save 1/7th the space by storing this via sparseMatrix (glmnet accepts sparse matrix formats)"

I did exactly that and worked perfectly fine. I think that can be usefull to others.

An article, with code, that came form this problem: http://www.rmining.net/2014/02/25/genetic-data-large-matrices-glmnet/

In order to avoid broken links i will post part of the post here:

The problem with the formula approach is that, in general, genomic data has more columns than observations. The data that I worked in that case had 40,000 columns and only 73 observations. In order to create a small set of test data, run the following code:

for(i in 1:50) {
    x = rep(letters[2:8], 7000)
    y = sample(x=1:49000, size=49000)
    data[i,] <- x[y]
}

data <- as.data.frame(data)
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x=1:50, size=50)
class = x[y]
data <- cbind(data, class)

So, with this data set we will try to fit a model with glmnet ():

formula <- as.formula(class ~ .)
X <- model.matrix(formula, data)
model <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

And if you do not have a computer with more RAM than mine, you will probably leak memory and give a crash in R. The solution? My first idea was to try sparse.model.matrix() that creates a sparse matrix model using the same formula. Unfortunately did not work, because even with sparse matrix, the final model is still too big! Interestingly, this dataset occupies only 24MB from RAM, but when you use the model.matrix the result is an array with more than 1Gb.

The solution I found was to build the matrix on hand. To do this we encode the array with dummy variables, column by column, and store the result in a sparse matrix. Then we will use this matrix as input to the model and see if it will not leak memory:

## Creates a matrix using the first column
X <- sparse.model.matrix(~data[,1]-1)

## Check if the column have more then one level
for (i in 2:ncol(data)) {

## In the case of more then one level apply dummy coding 
if (nlevels(data[,i])>1) {
    coluna <- sparse.model.matrix(~data[,i]-1)
    X <- cBind(X, coluna)
}
## Transform fator to numeric
else {
   coluna <- as.numeric(as.factor(data[,i]))
   X <- cBind(X, coluna)
}

NOTE: Pay attention to how we are using a sparse matrix the Matrix package is required. Also note that the columns are connected using cBind () instead of cbind ().

The matrix thus generated was much lower: less than 70 Mb when I tested. Fortunately glmnet () supports a sparse matrix and you can run the model:

mod.lasso <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

So you can create models with this type of data without blowing the memory and without use R packages for large datasets like bigmemory and ff.