I have some data with predictors and a binary target. Eg:
df <- data.frame(a=sort(sample(1:100,30)), b= sort(sample(1:100,30)),
target=c(rep(0,11),rep(1,4),rep(0,4),rep(1,11)))
I trained a logistic regression model using glm():
model1 <- glm(formula= target ~ a + b, data=df, family=binomial)
Now I'm trying to predict the output (for the example, the same data should suffice)
predict(model1, newdata=df, type="response")
This generates a vector of probabilities. But I want to predict the actual class. I could use round() on the probabilities, but this assumes that anything below 0.5 is class '0' and anything above is class '1'. Is this a correct assumption, even when the classes are not equally (or nearly equally) represented? Or is there a way to estimate this threshold?
The best threshold (or cutoff) point to use with glm models is the one that maximises specificity and sensitivity together. This threshold might not give the highest prediction accuracy for your model, but it won't be biased towards positives or negatives. The ROCR package contains functions that can help you do this; check the performance() function in that package. It should get you what you're looking for. Here's a picture of what you can expect to get:
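A plot along these lines can be produced with a minimal sketch like the one below. It assumes the ROCR package is installed, reuses the example data from the question (with a set.seed() call added here only for reproducibility), and takes "maximises specificity and sensitivity together" to mean the largest sum of the two (Youden's J). That is one reasonable reading, not the only one.

```r
library(ROCR)

# Example data from the question; set.seed() is an addition for reproducibility
set.seed(1)
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))
model1 <- glm(target ~ a + b, data = df, family = binomial)
prob <- predict(model1, newdata = df, type = "response")

# Sensitivity and specificity at every candidate cutoff
pred <- prediction(prob, df$target)
sens_perf <- performance(pred, measure = "sens")
spec_perf <- performance(pred, measure = "spec")

cutoffs <- sens_perf@x.values[[1]]
sens <- sens_perf@y.values[[1]]
spec <- spec_perf@y.values[[1]]

# ROCR includes an infinite cutoff as its first entry; drop it
keep <- is.finite(cutoffs)
cutoffs <- cutoffs[keep]
sens <- sens[keep]
spec <- spec[keep]

# Assumed criterion: the cutoff with the largest sensitivity + specificity
best_cutoff <- cutoffs[which.max(sens + spec)]

# Plot both curves against the cutoff and mark the chosen point
plot(cutoffs, sens, type = "l", ylim = c(0, 1),
     xlab = "Cutoff", ylab = "Sensitivity / specificity")
lines(cutoffs, spec, lty = 2)
abline(v = best_cutoff, lty = 3)
```

Another common criterion is the cutoff where the two curves cross, i.e. `cutoffs[which.min(abs(sens - spec))]`. The two can give different answers on small samples.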
After finding the cutoff point, I normally write a function myself to count the data points whose predicted value is above the cutoff and match them against the group they actually belong to.
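Such a helper might look like the sketch below. The function name `classify_at_cutoff` and the toy probabilities are hypothetical, chosen only for illustration. It uses base R alone and cross-tabulates the thresholded predictions against the true classes.

```r
# Hypothetical helper: label each observation 1 if its predicted probability
# exceeds the cutoff, then cross-tabulate against the actual class.
classify_at_cutoff <- function(prob, actual, cutoff) {
  predicted <- factor(as.integer(prob > cutoff), levels = c(0, 1))
  actual <- factor(actual, levels = c(0, 1))
  table(predicted = predicted, actual = actual)
}

# Toy values for illustration only (not taken from the model above)
prob <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.20)
actual <- c(0, 0, 1, 1, 1, 0)

cm <- classify_at_cutoff(prob, actual, cutoff = 0.3)
print(cm)

# Sensitivity and specificity implied by this cutoff
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
specificity <- cm["0", "0"] / sum(cm[, "0"])
```

With the fitted model you would pass `predict(model1, newdata = df, type = "response")` as `prob`, `df$target` as `actual`, and the cutoff you estimated.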