I am building a prediction model in Python with separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], as well as string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].
To train the model, I first use pd.get_dummies to create dummy variables from these columns, and then fit the model on the transformed training data.
I apply the same transformation to my test data and predict with the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345'. The reason is that the test data produces fewer dummy variables, since it contains fewer unique 'city' and 'zipcode' values.
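For reference, here is a minimal sketch that reproduces the mismatch (the data is made up for illustration):

import pandas as pd

train = pd.DataFrame({'city': ['Chicago', 'New York', 'Los Angeles']})
test = pd.DataFrame({'city': ['Chicago', 'New York']})

train_d = pd.get_dummies(train)  # columns: city_Chicago, city_Los Angeles, city_New York
test_d = pd.get_dummies(test)    # columns: city_Chicago, city_New York
print(train_d.shape[1], test_d.shape[1])  # 3 vs. 2 -> feature count mismatch at predict time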
How can I solve this problem? For example, OneHotEncoder will only encode numeric categorical variables, and DictVectorizer() will only encode string categorical variables. I searched online and found a few similar questions, but none of them really addresses mine:
Handling categorical features using scikit-learn
https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python
You can also just get the missing columns and add them to the test dataset:
# Get the columns that are present in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set appear in the same order as in the training set
test = test[train.columns]
This code also ensures that columns resulting from categories present in the test dataset but not in the training dataset are removed.
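If you prefer a one-liner, pandas' reindex should achieve the same alignment (a sketch, assuming train and test are the dummy-encoded DataFrames):

# Align test columns to the training columns: add missing ones as 0, drop extras
test = test.reindex(columns=train.columns, fill_value=0)

reindex both fills the training-only columns with 0 and drops the test-only columns, so it replaces the loop and the final column reordering in a single call.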