Selecting CP value for decision tree pruning using rpart

Ivan · Jun 9, 2016 · Viewed 11.3k times

I understand that the common practice for selecting the CP value is to choose the row of the CP table with the minimum xerror. However, in my case below, cp <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"] gives 0.17647059, and pruning with this value leaves no splits at all, only the root.

> myFormula <- Kyphosis~Age+Number+Start
> set.seed(1)
> fit <- rpart(myFormula,data=data,method="class",control=rpart.control(minsplit=20,xval=10,cp=0.01))
> fit$cptable
          CP nsplit rel error   xerror      xstd
1 0.17647059      0 1.0000000 1.000000 0.2155872
2 0.01960784      1 0.8235294 1.000000 0.2155872
3 0.01000000      4 0.7647059 1.058824 0.2200975

Is there an alternative, or a good practice, for selecting the CP value?

Answer

Alan Chalk · Jul 19, 2016

Generally, a cptable like the one you have is a warning that the tree is probably of no use and is unlikely to generalise well to future data. So the answer is not to find another way to choose cp, but rather to build a useful tree if you can, or to admit defeat and say that, based on the examples and features available, we cannot create a model that is predictive of kyphosis.

In your case, all is not necessarily lost. The data set is very small, so the cross-validation that produces the xerror column is very volatile. If you set your seed to 2 or to 3, you will see very different values in that column (some even worse).
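A minimal sketch of that volatility check, assuming the question's data object is the kyphosis data set that ships with rpart (the name xerror_by_seed is introduced here purely for illustration). It refits the same tree under seeds 1 to 3 and places the xerror columns side by side:

```r
library(rpart)

# Refit the same tree under several seeds and collect the xerror column.
# Only the random assignment of rows to the 10 folds changes between fits;
# the tree grown on the full data (CP, nsplit, rel error) stays the same.
xerror_by_seed <- sapply(1:3, function(seed) {
  set.seed(seed)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class",
               control = rpart.control(minsplit = 20, xval = 10, cp = 0.01))
  fit$cptable[, "xerror"]
})
colnames(xerror_by_seed) <- paste0("seed_", 1:3)

# One row per cptable row, one column per seed.
xerror_by_seed
```

If the columns differ noticeably from one another, the row chosen by which.min depends on the seed rather than on the data.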

So one interesting thing to try on this data is to increase the number of cross-validation folds to the number of observations, which gives leave-one-out cross-validation (LOOCV). If you do this:

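library(rpart)  # also provides the kyphosis data set (81 observations)

myFormula <- Kyphosis ~ Age + Number + Start
# xval = 81 equals the number of rows, so each fold holds out one observation
rpart_1 <- rpart(myFormula, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 20, xval = 81, cp = 0.01))
rpart_1$cptable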

you will find a CP table that you will like better! (Note that setting a seed is no longer necessary: every observation forms its own fold, so the folds are the same on every run.)
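With the LOOCV table in hand, the minimum-xerror rule from the question can be applied and the tree pruned. The following is only a sketch of that step, not a recommendation of one selection rule over another; the names best_row, best_cp and pruned are illustrative, while prune() is the pruning function provided by rpart:

```r
library(rpart)

# Refit with leave-one-out cross-validation (one fold per observation).
rpart_1 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 20,
                                         xval = nrow(kyphosis),
                                         cp = 0.01))

# Pick the cptable row with the smallest cross-validated error.
# which.min returns the first such row if there are ties.
best_row <- which.min(rpart_1$cptable[, "xerror"])
best_cp  <- rpart_1$cptable[best_row, "CP"]

# Prune the full tree back to the chosen complexity parameter.
pruned <- prune(rpart_1, cp = best_cp)
pruned$cptable
```

Because which.min takes the first row on ties, a tie between the root and a larger tree still resolves to the root, so it is worth reading the whole table rather than relying on the selected value alone.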