Assume the training data is "fruit", which I am going to use for prediction with a CART model in R:
> library(rpart)       # provides rpart()
> library(rpart.plot)  # provides prp()
> fruit = data.frame(
    color = c("red", "red", "red", "yellow", "red", "yellow",
              "orange", "green", "pink", "red", "red"),
    isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
                FALSE, FALSE, FALSE, FALSE, TRUE))
> mod = rpart(isApple ~ color, data = fruit, method = "class", minbucket = 1)
> prp(mod)
Could anyone explain exactly what role minbucket plays in the CART tree for this example if we use minbucket = 2, 3, 4, or 5?
I have two variables, color and isApple. The color variable takes the values green, yellow, pink, orange and red. The isApple variable is TRUE or FALSE. In the last example, red has three TRUE and two FALSE values mapped to it, so red appears five times. If I give minbucket = 1, 2 or 3, the tree splits. If I give minbucket = 4 or 5, no split occurs, even though red appears five times.
From the documentation for the rpart package:
minbucket
the minimum number of observations in any terminal node. If only one of minbucket or minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate.
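The defaulting rule quoted above can be checked directly with rpart.control(), the function that assembles these settings. This is only a small sketch; the variable name implied is made up for the illustration.

```r
library(rpart)

# Only minbucket is supplied here, so rpart.control() derives minsplit from it.
implied <- sapply(1:5, function(mb) rpart.control(minbucket = mb)$minsplit)
print(data.frame(minbucket = 1:5, implied_minsplit = implied))

# Going the other way: only minsplit is supplied, so minbucket is derived from it.
derived_minbucket <- rpart.control(minsplit = 9)$minbucket
print(derived_minbucket)
```

Comparing the implied minsplit column with the number of rows in your data is a quick way to tell whether the root node can be split at all.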
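Setting minbucket to 1 places no real constraint on leaf size, since each leaf node will (by definition) have at least one observation in it. If you set it to a higher value, say 3, then every leaf node must contain at least 3 observations. Note the second sentence of the documentation, though: because you specify only minbucket, minsplit (the minimum number of observations a node needs before a split is even attempted) is set to minbucket*3. The fruit data frame as posted has 11 rows, so minbucket = 1, 2, 3 gives minsplit = 3, 6, 9 and the root can be split, while minbucket = 4 or 5 gives minsplit = 12 or 15. That is more than 11, so the root is never considered for splitting, no matter how often red appears.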
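To see the effect on the fruit data, here is a rough sketch that refits the tree for minbucket = 1 to 5 and counts the leaves, then pins minsplit explicitly so that minbucket alone decides. The helper n_leaves and the choice minsplit = 9 are my own for illustration, and stringsAsFactors = TRUE is added only so that color is treated as a factor regardless of R version. The data frame as posted has 11 rows, 6 of them red.

```r
library(rpart)

fruit <- data.frame(
  color = c("red", "red", "red", "yellow", "red", "yellow",
            "orange", "green", "pink", "red", "red"),
  isApple = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE,
              FALSE, FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = TRUE
)

# Hypothetical helper: number of terminal nodes in a fitted rpart tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Only minbucket is given, so minsplit falls back to 3 * minbucket.
leaves <- integer(5)
for (mb in 1:5) {
  fit <- rpart(isApple ~ color, data = fruit, method = "class", minbucket = mb)
  leaves[mb] <- n_leaves(fit)
  cat("minbucket =", mb, " minsplit =", fit$control$minsplit,
      " leaves =", leaves[mb], "\n")
}

# Pin minsplit below the 11 rows so that only minbucket restricts the split.
fit5 <- rpart(isApple ~ color, data = fruit, method = "class",
              minbucket = 5, minsplit = 9)
fit6 <- rpart(isApple ~ color, data = fruit, method = "class",
              minbucket = 6, minsplit = 9)
cat("minbucket = 5, minsplit = 9: leaves =", n_leaves(fit5), "\n")
cat("minbucket = 6, minsplit = 9: leaves =", n_leaves(fit6), "\n")
```

With minsplit pinned, minbucket = 5 should still allow the red versus non-red split (6 and 5 observations), whereas minbucket = 6 should not, since 11 rows cannot be divided into two leaves of at least 6 each.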
The smaller the value of minbucket, the more closely the tree can fit the training data. By setting minbucket to too small a value, such as 1, you run the risk of overfitting your model.