Does R randomForest's rfcv method actually say which features it selected, or not?

tresbot · Aug 10, 2012

I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,

> data(fgl, package = "MASS")
> tst <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)
> tst$error.cv
        9         6         4         3         2         1 
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458

In this case, if I understand the result correctly, it seems that three of the nine variables can be removed (leaving six) without hurting the cross-validated error rate. However,

>     attributes(tst)
$names
[1] "n.var"     "error.cv"  "predicted"

None of these components tells me which three variables can actually be removed from the dataset harmlessly.

Answer

nograpes · Aug 11, 2012

I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
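For instance, you can plot the cross-validated error against the number of variables to see where the curve flattens out (a minimal sketch reusing the `tst` object from your question; `rfcv` uses random cross-validation folds, so the seed is fixed here only for reproducibility):

```r
library(randomForest)

data(fgl, package = "MASS")
set.seed(42)  # rfcv's CV folds are random; fix the seed so results are repeatable
tst <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)

# Cross-validated error rate as a function of the number of variables used
with(tst, plot(n.var, error.cv, log = "x", type = "o",
               xlab = "Number of variables", ylab = "CV error rate"))
```

The elbow of this curve tells you roughly how many variables you need, even though rfcv never tells you which ones they are.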

As you probably found out, this code

library(randomForest)

rf <- randomForest(type ~ ., data = fgl)
importance(rf)

gives you the relative importance of each of the variables.
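Putting the two together, one way to cull variables is to rank them by importance and keep only as many as the `error.cv` table suggests. This is a sketch, not a built-in feature: the cutoff of six comes from your `error.cv` output, and `MeanDecreaseGini` is the default importance measure `randomForest` reports for classification.

```r
library(randomForest)

data(fgl, package = "MASS")
set.seed(42)
rf <- randomForest(type ~ ., data = fgl)

# Rank predictors by mean decrease in Gini (the default importance
# measure for a classification forest)
imp <- importance(rf)
ranked <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)]

# Keep the top six, per the error.cv table, and refit with more trees
keep <- ranked[1:6]
rf.final <- randomForest(type ~ ., data = fgl[, c(keep, "type")], ntree = 2000)
rf.final
```

Note that the six most important variables in a full-forest ranking are not guaranteed to be the exact six-variable subset rfcv evaluated, but in practice this is the usual way to act on its output.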