I'm working on a prediction problem and building a decision tree in R. I have several categorical variables that I'd like to one-hot encode consistently in my training and test sets. I managed to do it on my training data with:
temps <- X_train
tt <- subset(temps, select = -output)
oh <- data.frame(model.matrix(~ . -1, tt), CLASS = temps$output)
But I can't find a way to apply the same encoding to my test set. How can I do that?
I recommend using the dummyVars function in the caret package:
library(caret)

# small example data set with two categorical columns
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf
  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0
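Apply the same procedure to both the training and test sets: build the dummyVars object once from the training data, then call predict() with that same object on each set, so both end up with identical columns.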
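Here is a minimal sketch of reusing one encoding across both sets. The `train` and `test` data frames and their contents are made up for illustration; only the `output` column name and the `CLASS` label come from the question. The test factors are given the training levels explicitly, which I'm assuming is the safest way to keep the columns aligned when a level is missing from the test data.

```r
library(caret)

# Hypothetical training set; 'output' is the target column.
train <- data.frame(
  gender = factor(c("male", "female", "female", "male")),
  mood   = factor(c("happy", "sad", "happy", "sad")),
  output = c(1, 1, 0, 0)
)

# Hypothetical test set; note that mood == "sad" never occurs here,
# so the factor levels are taken from the training set.
test <- data.frame(
  gender = factor(c("female", "male"), levels = levels(train$gender)),
  mood   = factor(c("happy", "happy"), levels = levels(train$mood)),
  output = c(0, 1)
)

# Learn the encoding from the training predictors only.
dmy <- dummyVars(~ gender + mood, data = train)

# Reuse the same fitted object on both sets.
train_oh <- data.frame(predict(dmy, newdata = train), CLASS = train$output)
test_oh  <- data.frame(predict(dmy, newdata = test),  CLASS = test$output)

# Both encoded sets should expose exactly the same columns.
stopifnot(identical(names(train_oh), names(test_oh)))
```

Because `dmy` is built once and only `predict()` is called on the test data, the test set cannot introduce or drop dummy columns on its own.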