Selecting the statistically significant variables in an R glm model

r glm
Pritish Kakodkar · Apr 22, 2013

I have an outcome variable, say Y, and a list of 100 dimensions that could affect Y (say X1, ..., X100).

After running my glm and viewing a summary of my model, I see those variables that are statistically significant. I would like to be able to select those variables and run another model and compare performance. Is there a way I can parse the model summary and select only the ones that are significant?

Answer

Maxim.K · Apr 22, 2013

Although @kith paved the way, there is more that can be done. Actually, the whole process can be automated. First, let's create some data:

x1 <- rnorm(10)
x2 <- rnorm(10)
x3 <- rnorm(10)
y <- rnorm(10)
x4 <- y + 5 # this will make a nice significant variable to test our code
(mydata <- as.data.frame(cbind(x1,x2,x3,x4,y)))

Our model is then:

model <- glm(formula=y~x1+x2+x3+x4,data=mydata)

A logical vector flagging which coefficients are significant at the 5% level (with the intercept dropped) can indeed be extracted by:

toselect.x <- summary(model)$coefficients[-1, 4] < 0.05 # column 4 holds the p-values; credit to kith
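As a hedged aside, the p-value column can also be picked by name instead of by position, which makes the intent a little clearer. The sketch below is not part of the original answer: the seed, the larger sample, and the noise added to x4 are my own assumptions, chosen so that the fit is not numerically perfect. The column is matched on the "Pr(" prefix because it is called "Pr(>|t|)" for a Gaussian glm and "Pr(>|z|)" for binomial or Poisson fits.

```r
set.seed(1)  # assumed seed, only for reproducibility
n <- 100     # assumed sample size (the answer above uses 10)

mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), y = rnorm(n))
# Assumption: a noisy version of y + 5, so x4 is strongly but not perfectly related to y
mydata$x4 <- mydata$y + 5 + rnorm(n, sd = 0.1)

model <- glm(y ~ x1 + x2 + x3 + x4, data = mydata)

# Coefficient table: one row per term, one column per statistic
coefs <- coef(summary(model))

# Locate the p-value column by name: "Pr(>|t|)" or "Pr(>|z|)"
p.col <- grep("^Pr\\(", colnames(coefs))

# Drop the intercept row by name and keep the named vector of p-values
pvals <- coefs[rownames(coefs) != "(Intercept)", p.col]

# Logical vector: TRUE where the term is significant at the 5% level
toselect.x <- pvals < 0.05
print(toselect.x)
```

Whether x1 to x3 come out as TRUE depends on the random draw, so only x4 should be expected to be flagged reliably.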

But this is not all! In addition, we can do this:

# select the names of the significant variables
relevant.x <- names(toselect.x)[toselect.x]
# formula with only the significant variables, joined by "+"
sig.formula <- as.formula(paste("y ~", paste(relevant.x, collapse = " + ")))

Note: as subsequent posters have pointed out, the inner paste(..., collapse = " + ") is needed. Without it, paste("y ~", relevant.x) returns one string per variable instead of a single formula, so not all significant variables are included.
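An alternative to pasting strings by hand is `reformulate()` from the stats package, which builds a formula from a character vector of term labels. The sketch below is only an illustration: `build_formula` is a hypothetical helper name, and the fallback to an intercept-only model when nothing is significant is my own assumption about what one might want in that case.

```r
# Hypothetical helper (not from the original answer): build a formula from
# a character vector of predictor names. Falls back to an intercept-only
# model when no predictor was selected, since reformulate() rejects an
# empty vector of term labels.
build_formula <- function(predictors, response = "y") {
  if (length(predictors) == 0) {
    return(as.formula(paste(response, "~ 1")))
  }
  reformulate(predictors, response = response)
}

# Example with assumed variable names
relevant.x <- c("x1", "x4")
sig.formula <- build_formula(relevant.x)
print(sig.formula)

# Nothing significant: intercept-only formula
empty.formula <- build_formula(character(0))
print(empty.formula)
```

The guard matters because with no significant variables the plain paste approach would produce the string "y ~ ", which is not a valid formula.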

And run the regression with only significant variables as OP originally wanted:

sig.model <- glm(formula=sig.formula,data=mydata)
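Since the question also asks about comparing performance, here is a minimal sketch of one way to put the full and reduced models side by side, using `AIC()` and a nested-model F test via `anova()`. The data setup (seed, sample size, noisy x4) and the choice of y ~ x4 as the reduced model are assumptions made for illustration, not results taken from the answer.

```r
set.seed(1)  # assumed seed, only for reproducibility
n <- 100     # assumed sample size

mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), y = rnorm(n))
# Assumption: noisy version of y + 5 to avoid a numerically perfect fit
mydata$x4 <- mydata$y + 5 + rnorm(n, sd = 0.1)

# Full model with all candidate predictors
full.model <- glm(y ~ x1 + x2 + x3 + x4, data = mydata)

# Reduced model; assumed here to keep only x4 for the sake of the example
sig.model <- glm(y ~ x4, data = mydata)

# Lower AIC suggests a better trade-off between fit and complexity
aic.table <- AIC(full.model, sig.model)
print(aic.table)

# F test of the reduced model against the full model (the models are nested);
# a large p-value means the dropped terms do not improve the fit noticeably
cmp <- anova(sig.model, full.model, test = "F")
print(cmp)
```

For non-Gaussian families, `test = "Chisq"` is the usual choice in `anova()` instead of the F test.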

In this case the estimate for x4 will be equal to 1 (and the intercept to -5), since we defined x4 as y + 5, which implies a perfect linear relationship.