I'm trying to perform a one hot encoding of a trivial dataset.
data = [['a', 'dog', 'red'],
        ['b', 'cat', 'green']]
What's the best way to preprocess this data using Scikit-Learn?
On first instinct, you'd look towards Scikit-Learn's OneHotEncoder. But the one hot encoder doesn't support strings as features; it only discretizes integers.
So then you would use a LabelEncoder, which encodes the strings as integers. But then you have to apply a label encoder to each of the columns and store every one of these label encoders (as well as the columns they were applied to). This feels extremely clunky.
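For concreteness, the per-column workaround described above might look roughly like the sketch below. The names `encoders`, `columns`, and `int_data` are just illustrative, and this is only one way of doing the bookkeeping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array([['a', 'dog', 'red'],
                 ['b', 'cat', 'green']])

# One LabelEncoder per column, kept in a dict so the same mapping
# can be reapplied to new data later (hypothetical bookkeeping).
encoders = {}
columns = []
for col in range(data.shape[1]):
    le = LabelEncoder()
    columns.append(le.fit_transform(data[:, col]))
    encoders[col] = le

# Integer-coded version of the data, one column per original feature.
int_data = np.column_stack(columns)
```

The integer codes would still have to be passed through a one-hot step afterwards, which is what makes the whole thing feel heavy.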
So, what's the best way to do it in Scikit-Learn?
Please don't suggest pandas.get_dummies. That's what I generally use nowadays for one-hot encodings. However, it's limited in that you can't encode your training and test sets separately.
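To illustrate the limitation, here is a small sketch with a made-up `animal` column (not part of the data above): calling get_dummies on each set independently yields different columns whenever the sets contain different categories.

```python
import pandas as pd

# Hypothetical train/test frames with partly different categories.
train = pd.DataFrame({'animal': ['dog', 'cat']})
test = pd.DataFrame({'animal': ['dog', 'bird']})

# Each call only sees the categories present in its own frame.
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)

print(train_cols)  # ['animal_cat', 'animal_dog']
print(test_cols)   # ['animal_bird', 'animal_dog']
```

The two encoded frames no longer line up column for column, so a model trained on one cannot be applied directly to the other.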
If you are on sklearn>0.20.dev0:
In [11]: import numpy as np
...: from sklearn.preprocessing import OneHotEncoder
...: cat = OneHotEncoder()
...: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
...: cat.fit_transform(X).toarray()
...:
Out[11]: array([[1., 0., 0., 1., 0.],
[0., 1., 0., 0., 1.],
[1., 0., 0., 1., 0.],
[0., 0., 1., 0., 1.]])
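Since the question is about encoding training and test data separately, here is a minimal sketch of that workflow using the question's data, assuming a scikit-learn version whose OneHotEncoder accepts strings (0.20 or later). The test row and the variable names are made up for illustration, and `handle_unknown='ignore'` is just one possible policy for unseen categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([['a', 'dog', 'red'],
                  ['b', 'cat', 'green']], dtype=object)
# Hypothetical test row; 'blue' never appears in the training data.
test = np.array([['b', 'dog', 'blue']], dtype=object)

# Learn the categories from the training set only.
enc = OneHotEncoder(handle_unknown='ignore')
train_encoded = enc.fit_transform(train).toarray()

# Reuse the fitted encoder on the test set; the unseen category
# is encoded as all zeros for that feature instead of raising.
test_encoded = enc.transform(test).toarray()

print(enc.categories_)
print(train_encoded)
print(test_encoded)
```

The single fitted encoder keeps track of every column's categories, so there is no need to store one encoder per column.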
If you are on sklearn==0.20.dev0:
In [29]: import numpy as np
...: from sklearn.preprocessing import CategoricalEncoder
In [30]: cat = CategoricalEncoder()
In [31]: X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
In [32]: cat.fit_transform(X).toarray()
Out[32]:
array([[ 1., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 1.],
[ 1., 0., 0., 1., 0.],
[ 0., 0., 1., 0., 1.]])
Another way to do it is to use category_encoders.
Here is an example:
% pip install category_encoders
import numpy as np
import category_encoders as ce
le = ce.OneHotEncoder(return_df=False, impute_missing=False, handle_unknown="ignore")
X = np.array([['a', 'dog', 'red'], ['b', 'cat', 'green']])
le.fit_transform(X)
array([[1, 0, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 1]])