All possible Regression in R: Saving coefficients in a matrix

SamPassmore picture SamPassmore · Sep 11, 2013 · Viewed 7.3k times · Source

I am running code for all possible models of a phylogenetic generalised linear model. The issue I am having is extracting and saving the beta coefficients for each model.

I want to save the coefficients into a matrix, where the columns correspond to a specific variable and the rows correspond to a formula.The issue arises because the variables are different for every model. So one cannot simply row bind the coefficients to the matrix.

The example below shows where I am up to in the problem:

y = rnorm(10)
inpdv = matrix(c(rnorm(10), runif(10), rpois(10, 1)), ncol = 3)
colnames(inpdv) = c("A", "B", "C")
data = cbind(y, inpdv)

model.mat = expand.grid(c(TRUE,FALSE), c(TRUE,FALSE), c(TRUE,FALSE))
names(model.mat) = colnames(inpdv)

formula = apply(model.mat, 1, function(x)
                       paste(colnames(model.mat)[x], collapse=" + "))
formula = paste("y", formula, sep = " ~ ")
formula[8] = paste(formula[8], 1, sep = "")

beta = matrix(NA, nrow = length(formula), ncol = 3)

for(i in 1:length(formula)){
   fit = lm(formula(formula), data)
   ## Here I want to extract the beta coeffecients here into a n * k matrix
   ## However, I cannot find a way to assign the value to the right cell in the matrix

}

So I imagine each coefficient will need to be placed into the respective cell, but I cannot think of a quick and efficient way of doing so.

The true analysis will take place around 30, 000 times, so any tips on efficiency would also be appreciated.

Edit: So as an example, the output for a model of y ~ a + c will need to be in the form of

a NA b 

Where the letters represent the coefficient for that model. The next model may be y ~ b + c which would then be added in the bottom. So the result would now look like

a  NA b
NA b c

Answer

Simon O'Hanlon picture Simon O'Hanlon · Sep 11, 2013

How about using names and %in% to subset the right columns. Extract the coefficient values using coef. Like this:

beta = matrix(NA, nrow = length(formula), ncol = 3)
colnames(beta) <- colnames(inpdv)

for(i in 1:length(formula)){
   fit = lm(formula(formula[i]), data)
    coefs <- coef(fit)
    beta[ i , colnames(beta) %in% names( coefs ) ] <- coefs[ names( coefs ) %in% colnames( beta ) ]
}
#              A          B         C
#[1,] -0.4229837 -0.0519900 0.3787666
#[2,]         NA  0.7015679 0.0555350
#[3,] -0.4165834         NA 0.3692974
#[4,]         NA         NA 0.1346726
#[5,] -0.2035173  0.7049951        NA
#[6,]         NA  0.7978726        NA
#[7,] -0.2229959         NA        NA
#[8,]         NA         NA        NA

I think it's perfectly acceptable to use a for loop here. For starters using something like lapply sometimes keep increasing memory usage as you run more and more of the simulations. R will sometimes not mark objects from earlier models as trash until the lapply loop finishes so so can sometimes get a memory allocation error. Using the for loop I find that R will reuse memory allocated to the previous iteration of the loop if necessary so if you can run the loop once, you can run it lots of times.

The other reason not to use a for loop is speed, but I would assume that the time to iterate is negligible compared to the time to fit the model so I would use it.