Tree sizes given by CP table in rpart

alopex · Jan 9, 2015

In the R package rpart, what determines the sizes of the trees listed in the CP table for a decision tree? In the example below, the CP table by default lists only trees with 1, 2, and 5 terminal nodes (nsplit = 0, 1, and 4, respectively).

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)
printcp(fit)

Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
method = "class")

Variables actually used in tree construction:
[1] Age   Start

Root node error: 17/81 = 0.20988

n= 81 

        CP nsplit rel error  xerror    xstd
1 0.176471      0   1.00000 1.00000 0.21559
2 0.019608      1   0.82353 0.94118 0.21078
3 0.010000      4   0.76471 0.94118 0.21078

Is there an inherent rule that rpart() uses to determine which tree sizes to present? And is it possible to force printcp() to return cross-validation statistics for all possible tree sizes, i.e. for the example above, to also include rows for trees with 3 and 4 terminal nodes (nsplit = 2, 3)?

Answer

Kevin · Mar 9, 2015

The rpart() function is controlled through rpart.control(). Its parameters include minsplit, which tells the function to attempt a split only when a node contains at least that many observations, and cp, which tells the function to keep a split only if it decreases the overall lack of fit by a factor of cp. If you look at summary(fit) for your example above, it shows statistics for every node and split in the tree. To get these values to print when using printcp(fit), you need to choose appropriate values of cp and minsplit when calling the original rpart() function.
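
As a minimal sketch of that suggestion, the call below passes a control object to rpart(). The values cp = 0 and minsplit = 2 are illustrative assumptions, not settings from the original post, and the name fit_full is made up for this example. Note that the CP table lists the sequence of subtrees obtained by cost-complexity pruning, so even a fully grown tree may not produce a row for every possible nsplit.

```r
library(rpart)

# Illustrative control values (assumptions, not recommendations):
#   cp = 0       keeps every split, however small the improvement in fit
#   minsplit = 2 allows a split to be attempted in nodes with only 2 observations
ctrl <- rpart.control(cp = 0, minsplit = 2)

# Same model as in the question, grown with the looser control settings
fit_full <- rpart(Kyphosis ~ Age + Number + Start,
                  method = "class", data = kyphosis, control = ctrl)

# CP table for the larger tree; compare its nsplit column with the default fit
printcp(fit_full)

# The same table is stored on the fitted object as a matrix
fit_full$cptable
```

The xerror and xstd columns come from random cross-validation folds, so they will vary between runs unless you call set.seed() first.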