random forests: Does it make any difference if the test-set is also labeled?

user2468261 · Jul 5, 2013 · Viewed 8.4k times

All the examples I can find of making predictions using random forests already have the actual answers (i.e. the test-set has labels). What do you do when you don't have that column?

For example, this tutorial uses the iris data: http://mkseo.pe.kr/stats/?p=220

If we were doing this for real, the test dataset would have columns 1 through 4 and not column 5. If you try to run the tutorial code without column 5, it kicks up an error that the data frames are not the same size, which, of course, they aren't.

How do you make predictions when you don't already have a column of answers?

Edit: clarification using an excerpt from the link above:

Prepare training and test set.

 test = iris[ c(1:10, 51:60, 101:110), ]
 train = iris[ c(11:50, 61:100, 111:150), ]

The test data frame has a complete Species column. I'm trying to predict the species based on the forest I grow from the training set. So the position I am in is after running:

 test <- test[-5] 

I'm now in the position I'd be in if I'd gone out and collected a bunch of plant measurements and wanted to know the species based on the model I've grown from my training data. So, how can I predict the Species column I've just deleted, using the remaining data in the test data frame and the forest grown from the training data frame?
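For instance (a minimal sketch with made-up numbers, just to show the shape of the data I have in mind; new_plants is a hypothetical name), the new, unlabeled measurements would look something like:

 # hypothetical fresh measurements, same four columns, no Species column
 new_plants <- data.frame(
   Sepal.Length = c(5.1, 6.3),
   Sepal.Width  = c(3.4, 2.8),
   Petal.Length = c(1.6, 5.1),
   Petal.Width  = c(0.3, 1.9)
 )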

Answer

flodel · Jul 5, 2013

Although the tutorial you quote keeps the Species column in the test set, the predict function does not need it, just as you guessed:

library(randomForest)
test  <- iris[ c(1:10, 51:60, 101:110), -5]  # removed the Species column here.
train <- iris[ c(11:50, 61:100, 111:150), ]
r <- randomForest(Species ~ ., data = train, importance = TRUE, do.trace = 100)
predict(r, test)  # returns the predicted Species for each row of test; no labels needed
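If you later want to sanity-check the predictions against the labels you removed (they still exist in iris for those rows), or get class probabilities instead of hard labels, something along these lines works (a quick sketch, not required for predict itself):

# Optional: compare predictions with the held-out labels for those rows.
table(predicted = predict(r, test), actual = iris[c(1:10, 51:60, 101:110), 5])

# Optional: class probabilities for each species instead of a single label.
predict(r, test, type = "prob")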