Why does the C4.5 algorithm use pruning to reduce the size of the decision tree, and how does pruning affect the prediction accuracy?

ksm001 · Jun 2, 2012 · Viewed 11.4k times

I have searched on Google about this issue and I can't find anything that explains this algorithm in a simple yet detailed way.

For instance, I know the ID3 algorithm doesn't use pruning at all, so if you have continuous attributes, the prediction success rate will be very low.

So C4.5 uses pruning in order to support continuous attributes, but is this the only reason?

Also, I can't really understand how exactly the confidence factor in the WEKA application affects the quality of the predictions. The smaller the confidence factor, the more pruning the algorithm will do; but what is the relationship between pruning and prediction accuracy? Does more pruning make the predictions better or worse?

Thanks

Answer

Lars Kotthoff · Jun 3, 2012

Pruning is a way of reducing the size of the decision tree. This will reduce the accuracy on the training data, but (in general) increase the accuracy on unseen data. It is used to mitigate overfitting, where you would achieve perfect accuracy on training data, but the model (i.e. the decision tree) you learn is so specific that it doesn't apply to anything but that training data.
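You can see this effect directly in WEKA's Java API by training an unpruned and a pruned J48 tree (WEKA's C4.5 implementation) and comparing their size and their accuracy on a held-out test set. This is a minimal sketch, assuming a WEKA installation on the classpath; `mydata.arff` is a placeholder for your own dataset.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PruningDemo {
    public static void main(String[] args) throws Exception {
        // "mydata.arff" is a placeholder for your own dataset.
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Simple split: first 2/3 for training, last 1/3 for testing.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // One unpruned tree and one pruned tree (default confidence factor 0.25).
        J48 unpruned = new J48();
        unpruned.setUnpruned(true);
        J48 pruned = new J48();
        pruned.setConfidenceFactor(0.25f);

        for (J48 tree : new J48[] { unpruned, pruned }) {
            tree.buildClassifier(train);

            Evaluation onTrain = new Evaluation(train);
            onTrain.evaluateModel(tree, train);
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(tree, test);

            System.out.printf("unpruned=%b size=%.0f train=%.2f%% test=%.2f%%%n",
                    tree.getUnpruned(), tree.measureTreeSize(),
                    onTrain.pctCorrect(), onTest.pctCorrect());
        }
    }
}
```

On most datasets you would expect the pruned tree to be smaller, score a little worse on the training split, and score the same or better on the test split.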

In general, if you increase pruning, the accuracy on the training set will be lower. WEKA does, however, offer ways to estimate the accuracy on unseen data more reliably, namely a training/test split or cross-validation. If you use cross-validation, for example, you'll discover a "sweet spot" for the pruning confidence factor: somewhere it prunes enough to make the learned decision tree sufficiently accurate on test data, but doesn't sacrifice too much accuracy on the training data. Where this sweet spot lies, however, depends on your actual problem, and the only way to determine it reliably is to try (see the sketch below).
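One way to search for that sweet spot programmatically is to sweep the confidence factor and run 10-fold cross-validation for each value. Again a hedged sketch using WEKA's Java API; the file name and the list of candidate confidence factors are just examples.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConfidenceFactorSweep {
    public static void main(String[] args) throws Exception {
        // "mydata.arff" is a placeholder for your own dataset.
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Smaller confidence factors prune more aggressively.
        float[] confidenceFactors = { 0.05f, 0.1f, 0.25f, 0.5f };

        for (float cf : confidenceFactors) {
            J48 tree = new J48();
            tree.setConfidenceFactor(cf);

            // 10-fold cross-validation estimates accuracy on unseen data,
            // which is what you want to maximise when picking the factor.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));

            System.out.printf("confidence factor %.2f -> %.2f%% correct%n",
                    cf, eval.pctCorrect());
        }
    }
}
```

Whichever confidence factor gives the best cross-validated accuracy on your data is the one to use when training the final tree.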