All the examples I can find of making predictions using random forests already have the actual answers (i.e. the test-set has labels). What do you do when you don't have that column?
For example, this tutorial uses the iris data: http://mkseo.pe.kr/stats/?p=220
If we were doing this for real, the test dataset would have columns 1 to 4 but not column 5. If you run the tutorial code without column 5, it throws an error saying the data frames are not the same size, which, obviously, they're not.
How do you make predictions when you don't already have a column of answers?
Edit: clarification, using an excerpt from the link above:
Prepare training and test set.
test = iris[ c(1:10, 51:60, 101:110), ]
train = iris[ c(11:50, 61:100, 111:150), ]
The test data frame has a complete species column. I'm trying to predict the species based on the forest I grow from the training set. So the position I am in is after running:
test <- test[-5]
I'm now in the position I'd be in if I'd gone out and collected a bunch of plant measurements and wanted to know the species based on the tree model I've grown from my training data. So, how can I predict the Species column I've just deleted based on the remaining data in the test dataframe and the forest grown using the training dataframe?
Although the tutorial you quote has the Species column in the test set, the predict function does not need it, as you guessed:
library(randomForest)
test <- iris[ c(1:10, 51:60, 101:110), -5] # removed the Species column here.
train <- iris[ c(11:50, 61:100, 111:150), ]
r <- randomForest(Species ~., data=train, importance=TRUE, do.trace=100)
predict(r, test)
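If you want to keep the predictions next to the measurements rather than just print them, something along these lines should work. This is only a sketch built on the code above: the `set.seed` call and the `probs` variable are my own additions for illustration, and `type = "prob"` is the `predict` option for a classification forest that returns the fraction of tree votes per class instead of a single label.

```r
library(randomForest)

set.seed(42)  # assumption: fixed seed so the forest is reproducible

test  <- iris[c(1:10, 51:60, 101:110), -5]   # measurements only, no Species
train <- iris[c(11:50, 61:100, 111:150), ]   # labeled training rows

r <- randomForest(Species ~ ., data = train)

# Per-class vote fractions, one row per test observation
probs <- predict(r, test, type = "prob")

# Predicted class for each row, stored as a new Species column
test$Species <- predict(r, test)

head(test)
head(probs)
```

The new `Species` column is a factor with the same levels as in the training data, and each row of `probs` sums to 1.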