I'm trying to boost a classification tree using the gbm package in R, and I'm a bit confused about the kind of predictions I obtain from the predict function.
Here is my code:
#Load packages, set random seed
library(gbm)
set.seed(1)
#Generate random data
N<-1000
x<-rnorm(N)
y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
z<-rep(0,N)
for(i in 1:N){
  if(x[i]-y[i]+0.2*rnorm(1)>1.0){
    z[i]<-1
  }
}
#Create data frame
myData<-data.frame(x,y,z)
#Split data set into train and test
train<-sample(N,800,replace=FALSE)
test<-(-train)
#Boosting
boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
pred.boost
pred.boost is a vector with elements from the interval (0,1).
I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values (either 0 or 1) and I'm using distribution="bernoulli".
How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values, or is there anything I'm doing wrong with the predict function?
The behavior you observe is correct. From the documentation:
If type="response" then gbm converts back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli.
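So getting probabilities with type="response" is exactly what you should expect. Also, distribution="bernoulli" merely tells gbm that the labels follow a Bernoulli (0/1) pattern. You can omit it and the model will still run fine, because gbm then guesses the distribution and assumes bernoulli when the response has only two unique values.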
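To make the "probabilities" point concrete: for a Bernoulli model, gbm's default prediction scale is the log-odds, and type="response" reports the inverse logit of that value. The short sketch below illustrates the mapping with made-up log-odds values (the vector log_odds is hypothetical and merely stands in for link-scale predictions; it does not come from the model above).

```r
# Hypothetical log-odds values, standing in for link-scale predictions
log_odds <- c(-3, -0.5, 0, 0.5, 3)

# The inverse logit maps log-odds into (0, 1); this is the probability scale
prob <- plogis(log_odds)   # equivalent to 1 / (1 + exp(-log_odds))
prob
```

A log-odds of 0 corresponds to a probability of 0.5, which is why 0.5 is the natural default cutoff.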
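To get class labels, threshold the probabilities, e.g. predict_class <- as.integer(pred.boost > 0.5) for a cutoff of 0.5 (without as.integer the comparison returns TRUE/FALSE rather than 0/1), or plot the ROC curve and choose the cutoff yourself.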
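A minimal base-R sketch of both options follows. So that it runs on its own, it uses simulated stand-ins: prob plays the role of pred.boost and truth the role of myData$z[test]; both names are assumptions made for illustration. Picking the cutoff by Youden's J is just one common heuristic, not something gbm prescribes.

```r
# Stand-in data so the sketch is self-contained; in the question these
# would be pred.boost and myData$z[test].
set.seed(1)
truth <- rbinom(200, 1, 0.3)
prob  <- plogis(2 * truth - 1 + rnorm(200))  # hypothetical predicted probabilities

# Option 1: hard classification at a 0.5 cutoff, coded 0/1 like the response
pred_class <- as.integer(prob > 0.5)
table(predicted = pred_class, actual = truth)

# Option 2: compute ROC points (true/false positive rate per candidate cutoff)
cutoffs <- seq(0, 1, by = 0.01)
tpr <- sapply(cutoffs, function(k) mean(prob[truth == 1] > k))
fpr <- sapply(cutoffs, function(k) mean(prob[truth == 0] > k))
plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier

# One possible rule (an assumption): maximise Youden's J = TPR - FPR
best_cutoff <- cutoffs[which.max(tpr - fpr)]
best_cutoff
```

If false positives and false negatives carry different costs in your application, choose the cutoff from the curve according to those costs rather than by this rule.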