I used randomForest for a regression problem. I called importance(rf, type=1) to get the %IncMSE for the variables, and one of them has a negative %IncMSE. Does this mean that this variable is bad for the model? I searched the Internet for an answer but didn't find a clear one.
I also found something strange in the model's summary (attached below): it seems that only one tree was used, although I set ntree to 800.
model:
rf <- randomForest(var1 ~ var2 + var3 + .. + var35, data = d7depo, ntree = 800, keep.forest = FALSE, importance = TRUE)
summary(rf)
                Length Class  Mode
call                6  -none- call
type                1  -none- character
predicted       26917  -none- numeric
mse               800  -none- numeric
rsq               800  -none- numeric
oob.times       26917  -none- numeric
importance         70  -none- numeric
importanceSD       35  -none- numeric
localImportance     0  -none- NULL
proximity           0  -none- NULL
ntree               1  -none- numeric
mtry                1  -none- numeric
forest              0  -none- NULL
coefs               0  -none- NULL
y               26917  -none- numeric
test                0  -none- NULL
inbag               0  -none- NULL
terms               3  terms  call
Question 1 - why does ntree show 1?
summary(rf) shows the length of each object stored in your rf variable, not its value. That means rf$ntree has length 1, i.e. it is a single number. If you type rf$ntree in your console, you will see that it shows 800.
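To make the difference between length and value concrete, here is a minimal sketch in base R. It does not fit a forest: rf_like is a hypothetical stand-in list that only mimics two components of the real object, on the assumption that summary() treats a fitted randomForest object like an ordinary list.

```r
# Hypothetical stand-in for the fitted randomForest object: a plain list
# with two of the components that appear in the summary above.
rf_like <- list(
  ntree = 800,          # a single number, so its length is 1
  mse   = numeric(800)  # one entry per tree, so its length is 800
)

summary(rf_like)        # the Length column reports 1 for ntree, 800 for mse

length(rf_like$ntree)   # 1: how many elements ntree holds
rf_like$ntree           # 800: the value itself
```

The same two calls on your own object, length(rf$ntree) and rf$ntree, should show the same pattern.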
Question 2 - does a negative %IncMSE show a "bad" variable?
IncMSE:
First the MSE of the whole model is computed; let's call this MSEmod. Then, for each variable (column in your data set) in turn, the values are randomly shuffled (permuted), so that a "bad" variable is created, and a new MSE is calculated. For example, imagine that one column holds the values 1,2,3,4,5 in rows one to five; after the permutation they might end up as 4,3,1,2,5. All of the other columns remain exactly the same, since we want to examine col1's importance alone. The MSE of the model on the permuted data is then calculated; let's call it MSEcol1 (in the same way you will have MSEcol2, MSEcol3, and so on, but let's keep it simple and only deal with MSEcol1 here). Since the second MSE was computed with a completely random version of the variable, we would expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse the model). Therefore, when we take the difference MSEcol1 - MSEmod, we usually expect a positive number. In your case, the negative number shows that the randomly shuffled variable worked better than the original, which suggests that the variable is probably not predictive enough, i.e. not important.
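Keep in mind that this description is high level. In reality, according to the package documentation, the MSE is computed on the out-of-bag data of each tree, the differences are averaged over all trees, and the average is normalized by the standard deviation of the differences. But the high-level story is this.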
In algorithm form:
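The steps described above can be sketched in base R as follows. This is only an illustration of the permutation idea, not what randomForest does internally: a linear model stands in for the forest, the MSE is computed on the training data rather than on out-of-bag samples, nothing is scaled, and the data and the names d, fit, mse and perm_importance are all made up for the example.

```r
# Illustrative sketch only: a linear model stands in for the forest, and the
# MSE is computed on the training data rather than on out-of-bag samples.
set.seed(42)

n <- 500
d <- data.frame(
  x1 = rnorm(n),  # informative predictor
  x2 = rnorm(n)   # pure noise, unrelated to y
)
d$y <- 3 * d$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)

# Mean squared error of a model's predictions on a data set
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)

# Step 1: baseline error of the fitted model (MSEmod)
mse_mod <- mse(fit, d)

# Steps 2-4: for each predictor, shuffle that column only, recompute the
# MSE, and record the increase over the baseline
perm_importance <- sapply(c("x1", "x2"), function(col) {
  shuffled <- d
  shuffled[[col]] <- sample(shuffled[[col]])  # permute one column
  mse(fit, shuffled) - mse_mod                # MSEcol - MSEmod
})

print(perm_importance)
```

The informative column x1 should show a large increase in MSE, while the noise column x2 should land close to zero and can come out slightly negative purely by chance, which is the situation described in the question.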
Hope it is clear now!