Boosting classification tree in R

Armin · Mar 3, 2017 · Viewed 9.3k times

I'm trying to boost a classification tree using the gbm package in R and I'm a little bit confused about the kind of predictions I obtain from the predict function.

Here is my code:

  #Load packages, set random seed
  library(gbm)
  set.seed(1)

  #Generate random data
  N<-1000
  x<-rnorm(N)
  y<-0.6^2*x+sqrt(1-0.6^2)*rnorm(N)
  z<-rep(0,N)
  for(i in 1:N){
    if(x[i]-y[i]+0.2*rnorm(1)>1.0){
      z[i]=1
    }
  }

  #Create data frame
  myData<-data.frame(x,y,z)

  #Split data set into train and test
  train<-sample(N,800,replace=FALSE)
  test<-(-train)

  #Boosting
  boost.myData<-gbm(z~.,data=myData[train,],distribution="bernoulli",n.trees=5000,interaction.depth=4)
  pred.boost<-predict(boost.myData,newdata=myData[test,],n.trees=5000,type="response")
  pred.boost

pred.boost is a vector with elements from the interval (0,1).

I would have expected the predicted values to be either 0 or 1, as my response variable z also consists of dichotomous values - either 0 or 1 - and I'm using distribution="bernoulli".

How should I proceed with my prediction to obtain a real classification of my test data set? Should I simply round the pred.boost values or is there anything I'm doing wrong with the predict function?

Answer

abhiieor · Mar 3, 2017

The behavior you observe is correct. From the documentation:

If type="response" then gbm converts back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli.

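So with type="response" you should get probabilities, which is exactly what you see. Also, distribution="bernoulli" only tells gbm that the labels follow a Bernoulli (0/1) pattern; it does not make predict return class labels. If you omit it, gbm tries to guess the distribution and assumes bernoulli for a response with only two unique values, so the model will still run fine.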
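To make the link between the two prediction scales concrete, here is a minimal sketch in base R. It assumes, as is standard for a Bernoulli model, that the default type="link" output is on the log-odds scale, so type="response" is its inverse logit. The scores below are made-up stand-ins, not output from the model in the question.

```r
# Stand-in log-odds scores (hypothetical values). With the question's model
# these would come from:
#   predict(boost.myData, newdata = myData[test, ], n.trees = 5000, type = "link")
link_scores <- c(-2.0, -0.5, 0.0, 0.5, 2.0)

# The inverse logit maps log-odds to probabilities in (0, 1)
probs <- plogis(link_scores)

# The same transformation written out explicitly
probs_manual <- 1 / (1 + exp(-link_scores))

print(round(probs, 3))
```

A score of 0 on the link scale corresponds to a probability of 0.5, so a 0.5 cutoff on the probabilities is the same as a cutoff of 0 on the log-odds.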

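To proceed, apply a cutoff to the probabilities, e.g. predict_class <- pred.boost > 0.5 (cutoff = 0.5), or plot the ROC curve and choose the cutoff yourself. Note that the comparison returns TRUE/FALSE; wrap it in as.integer() if you want 0/1 labels.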
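The two options can be sketched in base R as follows. The probabilities and labels are small made-up stand-ins so the snippet runs on its own; with the code from the question you would use pred.boost and myData$z[test] instead. Picking the cutoff by Youden's J (TPR minus FPR) is only one common rule and is my assumption here, not something the answer prescribes.

```r
# Stand-in values (hypothetical); replace with pred.boost and myData$z[test]
probs  <- c(0.05, 0.22, 0.37, 0.46, 0.55, 0.63, 0.81, 0.94)
actual <- c(0,    0,    0,    1,    0,    1,    1,    1)

# Option 1: classify at the default cutoff of 0.5, converting TRUE/FALSE to 1/0
predict_class <- as.integer(probs > 0.5)

# Confusion matrix: rows are predicted classes, columns are true classes
conf_mat <- table(predicted = predict_class, actual = actual)
print(conf_mat)

# Option 2: scan candidate cutoffs and compute the ROC coordinates for each
cutoffs <- seq(0.1, 0.9, by = 0.1)
tpr <- sapply(cutoffs, function(k) mean(probs[actual == 1] > k))  # true positive rate
fpr <- sapply(cutoffs, function(k) mean(probs[actual == 0] > k))  # false positive rate

# Pick the cutoff that maximizes Youden's J = TPR - FPR (first one on ties)
best_cutoff <- cutoffs[which.max(tpr - fpr)]
print(best_cutoff)

# plot(fpr, tpr, type = "b") would draw the ROC points for visual inspection
```

Which cutoff is best depends on the relative cost of false positives and false negatives, so treat the rule above as a starting point.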