How to know if a regression model generated by random forests is good? ( MSE and %Var(y))

John  picture John · May 14, 2013 · Viewed 9.4k times · Source

I tried to use random forests for regression. The original data is a data frame of 218 rows and 9 columns. The first 8 columns are categorical values ( can be either A, B, C, or D), and the last column V9 has numerical values that can go from 10.2 to 999.87.

When I used random forests on a training set, which represents 2/3 of the original data and which is randomly selected, I got the following results.

>r=randomForest(V9~.,data=trainingData,mytree=4,ntree=1000,importance=TRUE,do.trace=100)
       |      Out-of-bag   |
  Tree |      MSE  %Var(y) |
   100 | 6.927e+04    98.98 |
   200 | 6.874e+04    98.22 |
   300 | 6.822e+04    97.48 |
   400 | 6.812e+04    97.34 |
   500 | 6.839e+04    97.73 |
   600 | 6.852e+04    97.92 |
   700 | 6.826e+04    97.54 |
   800 | 6.815e+04    97.39 |
   900 | 6.803e+04    97.21 |
  1000 | 6.796e+04    97.11 |

I do not know if the high variance percentage means that the model is good or not. Also, since MSE is high, I suspect that the regression model is not really good. Any idea about how to read the results above? Do they mean that the model is not good?

Answer

Gorgens picture Gorgens · May 14, 2013

Like @Joran told, %Var is the amount of total variance of Y explained by your random forest model. After the adjust, apply the model to your validation data (1/3 remain):

RFestimated = predict(r, data=ValidationData)

It is interesting also to check the residual:

qqnorm((RFestimated - ValidationData$V9)/sd(RFestimated-ValidationData$V9))

qqline((RFestimated-ValidationData$V9)/sd(RFestimated-ValidationData$V9))

the estimated versus observed values:

plot(ValidationData$V9, RFestimated)

and the RMSE:

RMSE <- (sum((RFestimated-ValidationData$V9)^2)/length(Validation$v9))^(1/2)

I hope this help!