Does R randomForest's rfcv method actually say which features it selected, or not?

tresbot · Aug 10, 2012

I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,

> data(fgl, package = "MASS")
> tst <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)
> tst$error.cv
        9         6         4         3         2         1 
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458

In this case, if I understand the result correctly, it seems that three of the nine variables can be removed (leaving six) without hurting the cross-validated error rate. However,

>     attributes(tst)
$names
[1] "n.var"     "error.cv"  "predicted"

None of these components tells me which three variables can actually be removed from the dataset harmlessly.

Answer

nograpes · Aug 11, 2012

I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
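For instance, you can plot the cross-validated error against the number of variables to see where the curve flattens out (a minimal sketch reusing the `tst` object from your question; `rfcv` uses random cross-validation folds, so the seed is fixed here only for reproducibility):

```r
library(randomForest)

data(fgl, package = "MASS")
set.seed(42)  # rfcv's CV folds are random; fix the seed so results are repeatable
tst <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)

# Cross-validated error rate as a function of the number of variables used
with(tst, plot(n.var, error.cv, log = "x", type = "o",
               xlab = "Number of variables", ylab = "CV error rate"))
```

The elbow of this curve tells you roughly how many variables you need, even though rfcv never tells you which ones they are.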

As you probably found out, this code

library(randomForest)

rf <- randomForest(type ~ ., data = fgl)
importance(rf)

gives you the relative importance of each of the variables.
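Putting the two together, one way to cull variables is to rank them by importance and keep only as many as the `error.cv` table suggests. This is a sketch, not a built-in feature: the cutoff of six comes from your `error.cv` output, and `MeanDecreaseGini` is the default importance measure `randomForest` reports for classification.

```r
library(randomForest)

data(fgl, package = "MASS")
set.seed(42)
rf <- randomForest(type ~ ., data = fgl)

# Rank predictors by mean decrease in Gini (the default importance
# measure for a classification forest)
imp <- importance(rf)
ranked <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)]

# Keep the top six, per the error.cv table, and refit with more trees
keep <- ranked[1:6]
rf.final <- randomForest(type ~ ., data = fgl[, c(keep, "type")], ntree = 2000)
rf.final
```

Note that the six most important variables in a full-forest ranking are not guaranteed to be the exact six-variable subset rfcv evaluated, but in practice this is the usual way to act on its output.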