Pipeline OrdinalEncoder ValueError Found unknown categories

Pablo Honey · Feb 22, 2019 · Viewed 8.2k times

Please take it easy on me. I’m switching careers into data science and don’t have a CS or programming background—so I could be doing something profoundly stupid. I've researched for a few hours without success.

Objective: get Pipeline to run with OrdinalEncoder.

Problem: the code does not run with the OrdinalEncoder step. It does run without it. As best I can tell, OrdinalEncoder accepts two arguments, categories and dtype. Neither helps.

I’m passing the public diabetes data set to the model. Is this the issue? IOW, is the passing of high cardinality features to OrdinalEncoder causing a problem between train/test data after model is built, i.e. the test split has a value that the train set does not?

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('ordinal_encoder', OrdinalEncoder()),
    ('classifier', RandomForestClassifier(criterion='gini', n_estimators=100))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Construct model
model = pipe.fit(X_train, y_train)

# Show results
# roc_auc_score expects (y_true, y_score); [:, 1] assumes a binary target
print("Hold-out AUC score: %.3f" % roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Here’s the error I’m getting:

ValueError: Found unknown categories [17.0] in column 0 during transform

What am I doing wrong?

Setup:

The scikit-learn version is 0.20.2.
3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43) 
[Clang 6.0 (clang-600.0.57)]
sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)

Answer

kevh · Dec 19, 2019

I'm late to the game but I landed on this page so I thought I would reply anyway.

You said it in your comment: "diabetes dataset has too many values in many of the features for a given test/train split to both mirror all the values"

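This error is raised when the test set contains a category that the encoder did not see when it was fitted on the training set. In your case, the value 17.0 in column 0 ended up in X_test but not in X_train, so OrdinalEncoder has no integer code for it at transform time.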
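
To make that concrete, here is a minimal sketch on toy data (the arrays are made up for illustration and are not the diabetes dataset). It reproduces the error and shows two possible ways around it: listing the full set of categories up front through the categories argument, or mapping unseen values to a sentinel code. Note the assumption in the second option: handle_unknown="use_encoded_value" and unknown_value were added to OrdinalEncoder in scikit-learn 0.24, so they are not available on the 0.20.2 install from the question.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data: 17.0 appears only in the "test" split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [17.0]])

# 1) Default behaviour: an unseen category raises ValueError at transform time.
strict = OrdinalEncoder()
strict.fit(X_train)
try:
    strict.transform(X_test)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# 2) Option A: declare every possible category yourself (one sorted list per column).
by_list = OrdinalEncoder(categories=[[1.0, 2.0, 3.0, 17.0]])
by_list.fit(X_train)
encoded_by_list = by_list.transform(X_test)

# 3) Option B (assumes scikit-learn >= 0.24): encode unseen values as -1.
by_flag = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
by_flag.fit(X_train)
encoded_by_flag = by_flag.transform(X_test)

print(encoded_by_list)
print(encoded_by_flag)
```

Option A only works if you really know every value a column can take. Option B is more forgiving, but the downstream model then sees all unknown values collapsed into a single code, which may or may not be acceptable for your features.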