Cross validation for glm() models

Error404 picture Error404 · Jan 27, 2014 · Viewed 26.6k times · Source

I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following formula:

library(boot)
cv.glm(data, glmfit, K=10)

Does the "data" argument here refer to the whole dataset or only to the test set?

The examples I have seen so far provide the "data" argument as the test set but that did not really make sense, such as why do 10-folds on the same test set? They are all going to give exactly the same result (I assume!).

Unfortunately ?cv.glm explains it in a foggy way:

data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response

My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?

Here's what my script looks like:

##data partitioning
sub <- sample(nrow(data), floor(nrow(x) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]

##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
        family = "binomial", data = training)

##cross-validation
cv.glm(testing, model, K=10)

Answer

Jake Drew picture Jake Drew · Jul 4, 2014

I am always a little cautious about using various packages 10-fold cross validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:

#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]

#Create 10 equally size folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)

#Perform 10 fold cross validation
for(i in 1:10){
    #Segement your data by fold using the which() function 
    testIndexes <- which(folds==i,arr.ind=TRUE)
    testData <- yourData[testIndexes, ]
    trainData <- yourData[-testIndexes, ]
    #Use test and train data partitions however you desire...
}