Using categorical data as features in sklearn LogisticRegression

Optimesh · Nov 28, 2015

I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand of course I need to encode it.

  1. What I don't understand is how to pass the encoded feature to LogisticRegression so that it is processed as a categorical feature, rather than having the integer value it got during encoding interpreted as an ordinary quantitative feature.

  2. (Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.

I'm especially stuck on the first one!

Answer

Matthew Gunn · Nov 29, 2015

You can create indicator variables for the different categories. For example, in MATLAB-style notation:

animal_names = {'mouse';'cat';'dog'}

Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')

Then we have:

Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
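Since the question is about scikit-learn, here is a rough Python equivalent of the same indicator construction, sketched with pandas.get_dummies (the column name 'animal' is made up for illustration):

import pandas as pd

# Illustrative data; 'animal' is a hypothetical column name
df = pd.DataFrame({'animal': ['mouse', 'cat', 'dog']})

# One 0/1 indicator column per category
indicators = pd.get_dummies(df['animal'], dtype=int)
print(indicators)
#    cat  dog  mouse
# 0    0    0      1
# 1    1    0      0
# 2    0    1      0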

And you can concatenate these onto your original data matrix:

X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
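In Python, the same concatenation could be a numpy column stack; a minimal sketch, assuming X is a 2-D array whose rows line up with the indicator vectors:

import numpy as np

# Assumed original feature matrix (3 samples, 1 numeric feature)
X = np.array([[1.0], [2.0], [3.0]])
indicator_cat = np.array([0, 1, 0])
indicator_dog = np.array([0, 0, 1])

# Append the indicator columns to the original features
X_with_indicator_vars = np.column_stack([X, indicator_cat, indicator_dog])
# array([[1., 0., 0.],
#        [2., 1., 0.],
#        [3., 0., 1.]])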

Remember, though, to leave one category without an indicator if a constant term is included in the data matrix! Otherwise your data matrix won't be full column rank (or, in econometric terms, you have perfect multicollinearity).

[1  1  0  0
 1  0  1  0
 1  0  0  1]

Notice how a constant term, an indicator for mouse, an indicator for cat, and an indicator for dog lead to a matrix that is less than full column rank: the first column is the sum of the last three.
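In modern scikit-learn, the usual way to get this behavior is OneHotEncoder(drop='first'), which leaves out one reference category per feature; a hedged end-to-end sketch with made-up data (note the sparse_output argument was called sparse before scikit-learn 1.2):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Made-up training data: one categorical column and binary labels
animals = np.array([['mouse'], ['cat'], ['dog'], ['cat']])
y = np.array([0, 1, 1, 0])

# drop='first' leaves one reference category without an indicator,
# which avoids perfect multicollinearity with the model's intercept
enc = OneHotEncoder(drop='first', sparse_output=False)
X = enc.fit_transform(animals)

clf = LogisticRegression().fit(X, y)
print(clf.predict(enc.transform([['dog']])))

With LogisticRegression's default L2 penalty the collinearity would not actually crash the fit, but dropping a category still keeps the coefficients interpretable relative to the reference category.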