Deciding threshold for glm logistic regression model in R

user2175594 · Apr 23, 2014

I have some data with predictors and a binary target. For example:

df <- data.frame(a=sort(sample(1:100,30)), b= sort(sample(1:100,30)), 
                 target=c(rep(0,11),rep(1,4),rep(0,4),rep(1,11)))

I trained a logistic regression model using glm():

model1 <- glm(formula= target ~ a + b, data=df, family=binomial)

Now I'm trying to predict the output (for this example, the same data should suffice):

predict(model1, newdata=df, type="response")

This generates a vector of predicted probabilities, but I want to predict the actual class. I could use round() on the probabilities, but this assumes that anything below 0.5 is class '0' and anything above is class '1'. Is this a correct assumption, even when the classes are not equally (or nearly equally) represented in the population? Or is there a way to estimate this threshold?

Answer

Error404 · Apr 23, 2014

The best threshold (or cutoff) point to use in glm models is the point that maximises both the specificity and the sensitivity. This threshold might not give the highest overall prediction accuracy for your model, but it won't be biased towards positives or negatives. The ROCR package contains functions that can help you do this. Check the performance() function in this package, which can compute sensitivity and specificity across the candidate cutoffs. It is going to get you what you're looking for. Here's a picture of what you can expect to get:

[image not available]
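A minimal sketch of how this could look with ROCR, reusing the question's data and model. The seed, the variable names, and the choice of "largest sum of sensitivity and specificity" as the criterion are my assumptions; picking the point where the two curves cross is another reasonable reading of the same idea.

```r
library(ROCR)

set.seed(42)  # assumption: fixed seed so the sampled example data are reproducible
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))
model1 <- glm(target ~ a + b, data = df, family = binomial)
probs <- predict(model1, newdata = df, type = "response")

# Build a ROCR prediction object from predicted probabilities and true labels
pred <- prediction(probs, df$target)

# Sensitivity and specificity evaluated at every candidate cutoff
sens <- performance(pred, measure = "sens")
spec <- performance(pred, measure = "spec")

cutoffs   <- sens@x.values[[1]]   # candidate cutoffs (the first one is Inf)
sens_vals <- sens@y.values[[1]]
spec_vals <- spec@y.values[[1]]

# Assumed criterion: the cutoff with the largest sensitivity + specificity
best   <- which.max(sens_vals + spec_vals)
cutoff <- cutoffs[best]
cutoff

# Plot both curves against the cutoff and mark the chosen point
plot(cutoffs, sens_vals, type = "l", xlab = "Cutoff", ylab = "Value")
lines(cutoffs, spec_vals, lty = 2)
abline(v = cutoff, col = "grey")
```

Note that the cutoff here is estimated on the same data the model was fitted to, as in the question; on real data you would probably want to choose it on a held-out set.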

After finding the cutoff point, I normally write a function myself that counts the data points whose predicted value is above the cutoff and matches them against the group they actually belong to.
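Such a helper might look something like the sketch below. The function name classify_at_cutoff and the toy probabilities and labels are hypothetical, made up purely for illustration; it uses base R only.

```r
# Hypothetical helper: label each observation 1 if its predicted probability
# is above the cutoff, then cross-tabulate against the true class.
classify_at_cutoff <- function(probs, actual, cutoff) {
  predicted <- as.integer(probs > cutoff)
  table(actual    = factor(actual,    levels = c(0, 1)),
        predicted = factor(predicted, levels = c(0, 1)))
}

# Toy values for illustration only (not taken from the question's data)
probs  <- c(0.10, 0.35, 0.62, 0.80, 0.45, 0.91)
actual <- c(0, 0, 1, 1, 1, 0)

cm <- classify_at_cutoff(probs, actual, cutoff = 0.5)
print(cm)

# Sensitivity and specificity read off the confusion table
sensitivity <- cm["1", "1"] / sum(cm["1", ])
specificity <- cm["0", "0"] / sum(cm["0", ])
```

With the question's model you would pass predict(model1, newdata=df, type="response") as probs, df$target as actual, and whatever cutoff you settled on.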