Calculating entropy in decision tree (Machine learning)

code muncher · Jan 16, 2013 · Viewed 7.6k times

I know the formula for calculating entropy:

H(Y) = - ∑ (p(yj) * log2(p(yj)))

In words: select an attribute and, for each of its values, check the target attribute's values, so p(yj) is the fraction of patterns at node N that are in category yj: one fraction for a true target value and one for false.
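
For a binary target that works out to something like this (a minimal sketch in Python; the 9-true/5-false split is just an illustrative example, not from any particular dataset):

```python
import math

def entropy(fractions):
    """Shannon entropy in bits; `fractions` are the class
    proportions p(yj) at node N (they should sum to 1)."""
    return -sum(p * math.log2(p) for p in fractions if p > 0)

# e.g. a node where 9 of 14 patterns are "true" and 5 are "false":
print(entropy([9/14, 5/14]))  # ≈ 0.940 bits
```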

But I have a dataset in which the target attribute is price, i.e. a continuous range. How do I calculate entropy for this kind of dataset?

(Reference: http://decisiontrees.net/decision-trees-tutorial/tutorial-5-exercise-2/)

Answer

Vic Smith · Jan 16, 2013

You first need to discretise the data set in some way, e.g. by sorting it numerically and splitting it into a number of buckets. Many methods for discretisation exist, some supervised (i.e. taking into account the value of your target function) and some not. This paper outlines various techniques in fairly general terms. For more specifics, there are plenty of discretisation algorithms in machine learning libraries like Weka.
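
For example, a minimal sketch with NumPy (the three equal-width buckets and the price values are made up for illustration; a supervised discretiser, like those in Weka, would pick the cut points more cleverly):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical continuous target: prices.
prices = np.array([12.5, 14.0, 13.2, 40.0, 42.5, 41.1, 90.0, 95.5])

# Unsupervised equal-width discretisation into 3 buckets; the
# bucket index then plays the role of the categorical target.
edges = np.linspace(prices.min(), prices.max(), 4)   # 3 buckets
buckets = np.digitize(prices, edges[1:-1])

print(entropy(buckets))  # entropy of the discretised target
```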

The entropy of a continuous distribution is called differential entropy. It can be estimated by assuming your data follows some distribution (a normal distribution, for example), estimating the parameters of the underlying distribution in the usual way, and using these to calculate an entropy value.
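
As a rough sketch of that approach, assuming a normal fit (a Gaussian's differential entropy has the closed form h = 0.5 · log2(2πeσ²) in bits; the price values are again made up):

```python
import math
import numpy as np

prices = np.array([12.5, 14.0, 13.2, 40.0, 42.5, 41.1, 90.0, 95.5])

# Assume the prices are normally distributed: estimate the variance,
# then plug it into the Gaussian differential-entropy formula
#   h = 0.5 * log2(2 * pi * e * sigma^2)
sigma2 = prices.var(ddof=1)  # sample variance
h = 0.5 * math.log2(2 * math.pi * math.e * sigma2)
print(f"differential entropy ≈ {h:.2f} bits")
```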