I am building a prediction model in Python with separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code [91521, 23151, 12355, ...], as well as string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].
To train the model, I first use pd.get_dummies to create dummy variables from these columns, and then fit the model on the transformed training data.
I apply the same transformation to my test data and predict with the trained model. However, I get the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345'. The reason is that the test data produces fewer dummy variables, since it contains fewer unique 'city' and 'zipcode' values.
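For reference, here is a minimal sketch that reproduces the mismatch (the data is made up for illustration):

import pandas as pd

train = pd.DataFrame({'city': ['Chicago', 'New York', 'Los Angeles']})
test = pd.DataFrame({'city': ['Chicago', 'New York']})

train_d = pd.get_dummies(train)  # columns: city_Chicago, city_Los Angeles, city_New York
test_d = pd.get_dummies(test)    # columns: city_Chicago, city_New York
print(train_d.shape[1], test_d.shape[1])  # 3 vs. 2 -> feature count mismatch at predict time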
How can I solve this problem? For example, OneHotEncoder will only encode numeric categorical variables, and DictVectorizer() will only encode string categorical variables. I searched online and found a few similar questions, but none of them really addresses mine:
Handling categorical features using scikit-learn
https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python
You can also just get the missing columns and add them to the test dataset:
# Get the columns that are present in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set appear in the same order as in the training set
test = test[train.columns]
This code also ensures that columns resulting from categories present in the test dataset but not in the training dataset are removed.
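If you prefer a one-liner, pandas' reindex should achieve the same alignment (a sketch, assuming train and test are the dummy-encoded DataFrames):

# Align test columns to the training columns: add missing ones as 0, drop extras
test = test.reindex(columns=train.columns, fill_value=0)

reindex both fills the training-only columns with 0 and drops the test-only columns, so it replaces the loop and the final column reordering in a single call.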