I tried to use random forests for regression. The original data is a data frame of 218 rows and 9 columns. The first 8 columns are categorical values ( can be either A, B, C, or D), and the last column V9 has numerical values that can go from 10.2 to 999.87.
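I split the data into a random 2/3 training set and kept the remaining 1/3 for validation, along these lines (just a sketch; originalData is a placeholder name for my full data frame):
set.seed(1)                                                      # for reproducibility
trainIdx <- sample(nrow(originalData), round(2/3 * nrow(originalData)))
trainingData   <- originalData[trainIdx, ]                       # 2/3 used for training
ValidationData <- originalData[-trainIdx, ]                      # remaining 1/3 kept for validation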
When I ran random forests on the training set (2/3 of the original data, randomly selected), I got the following results.
> library(randomForest)
> r <- randomForest(V9 ~ ., data = trainingData, mtry = 4, ntree = 1000, importance = TRUE, do.trace = 100)
     |        Out-of-bag |
Tree |       MSE %Var(y) |
 100 | 6.927e+04   98.98 |
 200 | 6.874e+04   98.22 |
 300 | 6.822e+04   97.48 |
 400 | 6.812e+04   97.34 |
 500 | 6.839e+04   97.73 |
 600 | 6.852e+04   97.92 |
 700 | 6.826e+04   97.54 |
 800 | 6.815e+04   97.39 |
 900 | 6.803e+04   97.21 |
1000 | 6.796e+04   97.11 |
I do not know whether the high %Var(y) values mean the model is good or not. Also, since the MSE is high, I suspect the regression model is not very good. Any idea how to read the results above? Do they mean the model is poor?
As @Joran said, %Var relates to the share of the total variance of Y explained by your random forest: the %Var(y) column in the trace is the out-of-bag MSE as a percentage of the variance of Y, so the variance explained is roughly 100 minus that value (here only about 3%). After fitting the model, apply it to your validation data (the remaining 1/3):
RFestimated <- predict(r, newdata = ValidationData)   # predictions on the held-out 1/3
It is also interesting to check the residuals:
res <- RFestimated - ValidationData$V9   # residuals on the validation set
qqnorm(res / sd(res))                    # standardized residuals against normal quantiles
qqline(res / sd(res))
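You can also plot the residuals against the predicted values to check for systematic patterns:
plot(RFestimated, RFestimated - ValidationData$V9, xlab = "Predicted V9", ylab = "Residual")   # should scatter around zero
abline(h = 0, lty = 2)                                                                         # reference line at zero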
It also helps to plot the estimated against the observed values:
plot(ValidationData$V9, RFestimated, xlab = "Observed V9", ylab = "Predicted V9")
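Adding a 45-degree reference line makes the comparison easier to read:
abline(0, 1, lty = 2)   # points fall on this line when prediction equals observation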
Finally, compute the RMSE on the validation set:
RMSE <- sqrt(mean((RFestimated - ValidationData$V9)^2))   # root mean squared error
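Since the forest was fitted with importance = TRUE, you can also check which of the eight categorical predictors matter most:
importance(r)    # %IncMSE and IncNodePurity for each predictor
varImpPlot(r)    # the same information as a plot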
I hope this helps!