Understanding the minbucket parameter in a CART model using R

GBOT · Apr 14, 2015 · Viewed 10.5k times

Assume the training data is "fruit", which I am going to use to make predictions with a CART model in R:

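library(rpart)       # provides rpart()
library(rpart.plot)  # provides prp()

fruit = data.frame(
    color = c("red", "red", "red", "yellow", "red", "yellow",
              "orange", "green", "pink", "red", "red"),
    isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
                FALSE, FALSE, FALSE, FALSE, TRUE),
    stringsAsFactors = TRUE)  # keep color as a factor on R >= 4.0

# Fit a classification tree, then plot it
mod = rpart(isApple ~ color, data = fruit, method = "class", minbucket = 1)

prp(mod)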

Could anyone explain exactly what role minbucket plays in the CART tree for this example if we use minbucket = 2, 3, 4 or 5?

I have two variables, color and isApple. color takes the values green, yellow, pink, orange and red. isApple is either TRUE or FALSE. In the example above, red is mapped to five TRUE and one FALSE, so red appears six times. If I give minbucket = 1, 2 or 3, the tree splits. If I give minbucket = 4 or 5, no split occurs, even though red appears six times.

Answer

Tim Biegeleisen · Apr 14, 2015

From the documentation for the rpart package:

minbucket

the minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.
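The rule in that passage can be checked with rpart.control(), the function that holds these settings. This is a minimal sketch, assuming the rpart package is installed; the names ctl_a and ctl_b are only illustrative.

```r
library(rpart)

# Only minbucket is given, so minsplit is derived as minbucket * 3
ctl_a <- rpart.control(minbucket = 4)
print(ctl_a$minsplit)

# Only minsplit is given, so minbucket is derived as round(minsplit / 3)
ctl_b <- rpart.control(minsplit = 9)
print(ctl_b$minbucket)
```

Since minsplit is the minimum number of observations a node must hold before a split is even attempted, raising minbucket on its own also raises that threshold.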

Setting minbucket to 1 imposes no real constraint, since every leaf node will (by definition) contain at least one observation. If you set it to a higher value, say 3, then every leaf node must contain at least 3 observations.

The smaller the value of minbucket, the more finely your CART model can fit the training data. If you set minbucket to too small a value, such as 1, you run the risk of overfitting your model.
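To see the effect on the fruit data from the question, one option is to refit the tree for each minbucket value and look at the resulting leaves. The sketch below assumes the rpart package is installed; leaf_summary is a hypothetical helper written for this illustration, not part of rpart.

```r
library(rpart)

fruit <- data.frame(
  color   = factor(c("red", "red", "red", "yellow", "red", "yellow",
                     "orange", "green", "pink", "red", "red")),
  isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
              FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Hypothetical helper: fit one tree and summarise its terminal nodes
leaf_summary <- function(mb) {
  fit <- rpart(isApple ~ color, data = fruit, method = "class", minbucket = mb)
  leaves <- fit$frame[fit$frame$var == "<leaf>", ]
  data.frame(minbucket     = mb,
             minsplit      = fit$control$minsplit,  # derived as minbucket * 3
             n_leaves      = nrow(leaves),
             smallest_leaf = min(leaves$n))
}

res <- do.call(rbind, lapply(1:5, leaf_summary))
print(res)
```

With only 11 rows, the derived minsplit is what stops the split in the question: minbucket = 4 implies minsplit = 12, which is more than the 11 observations in the root node, so no split is attempted. With minbucket = 3, minsplit is 9 and the root can still be divided into a red leaf and a leaf for all other colors.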