Pandas sklearn one-hot encoding dataframe or numpy?

Georg Heiler · Oct 7, 2016

How can I one-hot encode a pandas data frame with sklearn (yielding a data frame or a numpy array) when some columns do not require encoding?

import pandas as pd

mydf = pd.DataFrame({'Target': [0, 1, 0, 0, 1, 1, 1],
                     'GroupFoo': [1, 1, 2, 2, 3, 1, 2],
                     'GroupBar': [2, 1, 1, 0, 3, 1, 2],
                     'GroupBar2': [2, 1, 1, 0, 3, 1, 2],
                     'SomeOtherShouldBeUnaffected': [2, 1, 1, 0, 3, 1, 2]})
columnsToEncode = ['GroupFoo', 'GroupBar']

This is an already label-encoded data frame, and I would like to encode only the columns listed in columnsToEncode.

My problem is that I am unsure whether a pd.DataFrame or the numpy array representation is better, and how to re-merge the encoded part with the remaining columns.

My attempts so far:

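from sklearn.preprocessing import OneHotEncoder

# `sparse=False` is the argument name as of 2016; scikit-learn 1.2+ calls it `sparse_output`
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(X_train[columnsToEncode])  # fit only on the columns to encode
df = pd.concat([
    X_train.drop(columns=columnsToEncode),  # select all other / numeric columns
    # select categories to one-hot encode; with sparse=False the result is
    # already a dense array, so .toarray() is not needed
    pd.DataFrame(myEncoder.transform(X_train[columnsToEncode]), index=X_train.index)
], axis=1)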

Note: I am aware of pandas get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), but it does not play well with a train / test split where I need such an encoding per fold.
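To make the get_dummies limitation concrete, here is a minimal sketch. The two small frames are made up for illustration and simply stand in for a train fold and a test fold whose category values differ.

```python
import pandas as pd

# Hypothetical folds: the test fold contains a category (3) that is absent
# from train, and lacks a category (2) that train has.
train = pd.DataFrame({'GroupFoo': [1, 1, 2]})
test = pd.DataFrame({'GroupFoo': [1, 3]})

# get_dummies derives the columns from whatever values it sees in each call.
train_dummies = pd.get_dummies(train, columns=['GroupFoo'])
test_dummies = pd.get_dummies(test, columns=['GroupFoo'])

train_cols = list(train_dummies.columns)  # ['GroupFoo_1', 'GroupFoo_2']
test_cols = list(test_dummies.columns)    # ['GroupFoo_1', 'GroupFoo_3']
print(train_cols, test_cols)
```

Because each call builds its columns independently, the two folds end up with different feature layouts, which is why an encoder that is fitted once and then reused per fold is preferable here.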

Answer

Georg Heiler · Oct 8, 2016

This library provides several categorical encoders which make sklearn / numpy play nicely with pandas: https://github.com/wdm0006/categorical_encoding

However, they do not yet support a "handle unknown category" option.

For now I will use:

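from sklearn.preprocessing import OneHotEncoder

# scikit-learn 1.2+ renames `sparse` to `sparse_output`
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(df[columnsToEncode])

# keep df.index on the encoded part so the rows line up in the concat
pd.concat([df.drop(columns=columnsToEncode),
           pd.DataFrame(myEncoder.transform(df[columnsToEncode]), index=df.index)], axis=1)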

This copes with unseen data that contains unknown categories. For now, I will stick with half-pandas, half-numpy because of the nice pandas labels for the numeric columns.
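The approach above can be sketched end to end for a train / test split. This is only an illustration under some assumptions: the helper name one_hot_frame and the two small folds are hypothetical, the labels are built from the encoder's fitted categories_ attribute, and .toarray() is used on the default sparse output instead of passing sparse=False, since the name of that argument differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def one_hot_frame(encoder, frame, columns):
    """Replace `columns` of `frame` by their one-hot encoding (hypothetical helper)."""
    # The default encoder output is a scipy sparse matrix; .toarray() makes it dense.
    encoded = encoder.transform(frame[columns]).toarray()
    # Build readable labels such as 'GroupFoo_1' from the fitted categories.
    names = [f"{col}_{cat}"
             for col, cats in zip(columns, encoder.categories_)
             for cat in cats]
    # Reusing frame.index keeps the rows aligned in the concat.
    encoded_df = pd.DataFrame(encoded, columns=names, index=frame.index)
    return pd.concat([frame.drop(columns=columns), encoded_df], axis=1)


columnsToEncode = ['GroupFoo', 'GroupBar']
# Made-up folds for illustration; GroupFoo == 9 never occurs in train.
train = pd.DataFrame({'Target': [0, 1, 0, 1],
                      'GroupFoo': [1, 1, 2, 3],
                      'GroupBar': [2, 1, 1, 0]})
test = pd.DataFrame({'Target': [1, 0],
                     'GroupFoo': [2, 9],
                     'GroupBar': [0, 1]})

myEncoder = OneHotEncoder(handle_unknown='ignore')
myEncoder.fit(train[columnsToEncode])  # fit on the training fold only

train_encoded = one_hot_frame(myEncoder, train, columnsToEncode)
test_encoded = one_hot_frame(myEncoder, test, columnsToEncode)
print(test_encoded)
```

Both folds get the same columns because the encoder is fitted once. With handle_unknown='ignore', the unseen GroupFoo value in the test fold is encoded as all zeros across the GroupFoo columns rather than raising an error.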