Scikit Learn - Random Forest: How continuous feature is handled?

Question 1

scikit-learn random-forest discretization

Sachinmm · Sep 19, 2015 · Viewed 8k times · Source

Answer

As far as I understand, you are asking how the threshold is chosen for continuous features. The data is not binned in advance; instead, candidate split points are placed at values where the class changes. For example, consider the following 1D dataset with x as the feature and y as the class variable:

x = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [ 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

Two candidate cuts will be considered: (i) between 2 and 3 (in practice this looks like x < 2.5) and (ii) between 7 and 8 (x < 7.5). Of these two candidates, the second will be chosen since it provides a better separation: its impure side contains only two samples of the minority class, whereas the first cut leaves three. Then the algorithm goes on to the next step.
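To make "better separation" concrete, here is a minimal sketch that scores the cuts by weighted Gini impurity (lower is better). It is plain Python, not scikit-learn's actual code. Scanning every midpoint between adjacent values and using Gini as the criterion are assumptions made for illustration, and the helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of class 1
    return 2 * p * (1 - p)         # binary case of 1 - sum(p_k ** 2)


def weighted_gini(x, y, threshold):
    """Impurity after splitting at x < threshold, weighted by group size."""
    left = [c for v, c in zip(x, y) if v < threshold]
    right = [c for v, c in zip(x, y) if v >= threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

# Assumption: candidate thresholds are midpoints between adjacent sorted values.
candidates = [(a + b) / 2 for a, b in zip(x, x[1:])]
scores = {t: weighted_gini(x, y, t) for t in candidates}
best = min(scores, key=scores.get)

for t in (2.5, 7.5):
    print(f"x < {t}: weighted Gini = {scores[t]:.4f}")
print("best threshold:", best)
```

On this data the cut at 7.5 scores lower than the cut at 2.5, which matches the choice described above.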

Therefore it is not advisable to discretize the data yourself. Consider the data above: if, for example, you discretize it into 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], you miss the best split (since 7 and 8 end up in the same bin).
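The effect of binning can be checked with a small sketch along the same lines. Again this is plain Python under assumptions: weighted Gini impurity as the criterion, midpoints between distinct values as candidates, and hypothetical helper names. The bins are encoded as integer indices 0 to 4.

```python
def weighted_gini(values, labels, threshold):
    """Weighted Gini impurity of the binary split at values < threshold."""
    def gini(group):
        if not group:
            return 0.0
        p = sum(group) / len(group)
        return 2 * p * (1 - p)

    left = [c for v, c in zip(values, labels) if v < threshold]
    right = [c for v, c in zip(values, labels) if v >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(values, labels):
    """Return (threshold, impurity) of the best midpoint split."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, weighted_gini(values, labels, t)) for t in candidates]
    return min(scored, key=lambda pair: pair[1])


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

# The 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], encoded as indices 0..4.
x_binned = [(v - 1) // 2 for v in x]

raw_threshold, raw_impurity = best_split(x, y)
binned_threshold, binned_impurity = best_split(x_binned, y)

print(f"raw:    threshold={raw_threshold}, impurity={raw_impurity:.4f}")
print(f"binned: threshold={binned_threshold}, impurity={binned_impurity:.4f}")
```

After binning, no threshold can separate 7 from 8, so the best impurity reachable on the binned feature is worse than on the raw one.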

Question 2

Random Forest accepts numerical data. Usually, features with text data are converted to numerical categories, and continuous numerical data is fed as is, without discretization. How does RF treat continuous data when creating nodes? Will it bin the continuous numerical data internally, or treat each value as a discrete level?

For example, I want to feed a data set (of course, after categorizing the text features) to RF. How is the continuous data handled by RF? Is it advisable to discretize the continuous data (longitudes and latitudes, in this case) before feeding it in, or would doing so lose information?
