How to one-hot encode several categorical variables in R

xeco · Feb 6, 2018

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables that I'd like to one-hot encode consistently in my training and testing sets. I managed to do it on my training data with:

temps <- X_train
tt <- subset(temps, select = -output)
oh <- data.frame(model.matrix(~ . -1, tt), CLASS = temps$output)

But I can't find a way to apply the same encoding to my testing set. How can I do that?

Answer

Esteban PS · Feb 6, 2018

I recommend using the dummyVars function in the caret package:

library(caret)

customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE)  # keep gender and mood as factors (R >= 4.0 defaults to character)
customers
#   id gender  mood outcome
# 1 10   male happy       1
# 2 20 female   sad       1
# 3 30 female happy       0
# 4 40   male   sad       0
# 5 50 female happy       0

# dummify the data: build the encoder, then apply it with predict()
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf
#   id gender.female gender.male mood.happy mood.sad outcome
# 1 10             0           1          1        0       1
# 2 20             1           0          0        1       1
# 3 30             1           0          1        0       0
# 4 40             0           1          0        1       0
# 5 50             1           0          1        0       0


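Apply the same procedure to both the training and testing sets: build the dummyVars object once from the training data, then call predict() with that same object on each set so that both end up with identical columns.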
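A minimal sketch of fitting the encoder on the training data and reusing it on the test data. The train and test data frames below are made up for illustration (train is the customers example, the test rows are hypothetical). The test set deliberately contains no 'happy' rows. As far as I know, predict() reuses the factor levels stored when dummyVars was fitted, so the mood.happy column should still appear in the encoded test set.

```r
library(caret)

# Hypothetical training set (same rows as the customers example)
train <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE)

# Hypothetical test set; note that mood never takes the value 'happy' here
test <- data.frame(
  id = c(60, 70),
  gender = c('female', 'male'),
  mood = c('sad', 'sad'),
  outcome = c(0, 1),
  stringsAsFactors = TRUE)

# Learn the encoding from the training data only
dmy <- dummyVars(" ~ .", data = train)

# Reuse the same fitted object on both sets
train_oh <- data.frame(predict(dmy, newdata = train))
test_oh  <- data.frame(predict(dmy, newdata = test))

# Check that both encoded sets have the same columns in the same order
identical(names(train_oh), names(test_oh))
```

If the test set contains a category that never occurred in the training data, this approach will not have a column for it, so check for unseen levels before predicting.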