One hot encoding of string categorical features

Question 1

python encoding scikit-learn one-hot-encoding

hlin117 · Jan 30, 2016 · Viewed 18.5k times · Source

Answer

If you are on sklearn > 0.20.dev0 (that is, release 0.20 or later), OneHotEncoder accepts string columns directly:

In [11]: import numpy as np
    ...: from sklearn.preprocessing import OneHotEncoder
    ...: cat = OneHotEncoder()
    ...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    ...: cat.fit_transform(X).toarray()
    ...: 
Out[11]: array([[1., 0., 0., 1., 0.],
           [0., 1., 0., 0., 1.],
           [1., 0., 0., 1., 0.],
           [0., 0., 1., 0., 1.]])
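
To read the output above, it helps to know which category each column stands for. A minimal sketch, assuming sklearn 0.20 or later, where the fitted encoder exposes a `categories_` attribute with one array per input column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same toy input as above: one string column and one integer column.
X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

cat = OneHotEncoder()
encoded = cat.fit_transform(X).toarray()

# categories_ holds the sorted categories found in each input column.
# The output columns follow this order: 'a', 'b', 'c', then 0, 1.
for column_index, categories in enumerate(cat.categories_):
    print(column_index, list(categories))
```

The first three output columns therefore correspond to 'a', 'b' and 'c', and the last two to the integer values 0 and 1.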

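If you are on sklearn==0.20.dev0, use the CategoricalEncoder that shipped in that development version:

In [30]: from sklearn.preprocessing import CategoricalEncoder
    ...: cat = CategoricalEncoder()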

In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T

In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.,  1.]])

Another way to do it is to use category_encoders.

Here is an example:

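% pip install category_encoders
import numpy as np
import category_encoders as ce
# Parameters as given for the category_encoders version in use at the time;
# newer releases may name or handle these options differently.
le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)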
array([[1, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 0, 1]])

Question 2

I'm trying to perform a one hot encoding of a trivial dataset.

data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]

What's the best way to preprocess this data using Scikit-Learn?

On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.

So then you would use a LabelEncoder, which encodes the strings into integers. But then you have to apply a label encoder to each of the columns and store every one of these label encoders (as well as the columns they were applied to). This feels extremely clunky.
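
The per-column bookkeeping described above might look roughly like the following. This is only an illustrative sketch of the workaround; the `encoders` dictionary and the `encoded` array are names made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'],
                 ['b', 'cat', 'green']])

# One LabelEncoder per column, stored by column index so the same
# mapping can be reapplied to new data later.
encoders = {}
encoded = np.empty(data.shape, dtype=int)
for col in range(data.shape[1]):
    enc = LabelEncoder()
    encoded[:, col] = enc.fit_transform(data[:, col])
    encoders[col] = enc

# The integer codes would still need to be one hot encoded in a second step.
print(encoded)
```

Every column needs its own fitted encoder, and all of them have to be kept around to transform new rows, which is the clunkiness in question.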

So, what's the best way to do it in Scikit-Learn?

Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one hot encodings. However, it's limited in that you can't encode your training and test sets separately.
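
For context, the separate training / test encoding asked for here is what a fitted transformer provides. A minimal sketch, assuming sklearn 0.20 or later (where OneHotEncoder accepts strings); the `train` and `test` arrays are made-up illustration data, and `handle_unknown='ignore'` is one possible choice for categories that appear only in the test set:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['a', 'dog', 'red'],
                  ['b', 'cat', 'green']], dtype=object)
# 'blue' never appears in the training data.
test = np.array([['a', 'cat', 'blue']], dtype=object)

# Learn the categories from the training set only.
encoder = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder.fit_transform(train).toarray()

# Reuse the fitted encoder on the test set: same columns, same order.
# With handle_unknown='ignore', an unseen category becomes all zeros
# in that feature's block of columns.
test_encoded = encoder.transform(test).toarray()

print(train_encoded)
print(test_encoded)
```

Because the column layout is fixed at fit time, the test matrix always has the same width as the training matrix, regardless of which categories the test rows contain.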
