One-hot-encoding multiple columns in sklearn and naming columns

Gideon Blinick · Mar 18, 2019

I have the following code to one-hot-encode two columns.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

What I'm wondering is how to do this in 4 lines while getting properly named columns in the output. That is, I can create a properly one-hot-encoded array by including both column names in fit_transform, but when I try to name the resulting dataframe's columns, I get an error about a mismatch between the shape of the indices:

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For background, both phone and city have 3 values.

    city    phone
0   CityA   iPhone
1   CityB   Android
2   CityB   iPhone
3   CityA   iPhone
4   CityC   Android

Answer

MaximeKan · Mar 19, 2019

You are almost there. As you said, you can pass all the columns you want to encode to fit_transform directly.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_

And then you just need to do the following:

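import numpy as np

# categories_ is a list with one array per encoded column; np.concatenate
# flattens it even when the columns have different numbers of categories
feature_labels = np.concatenate(feature_labels)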

This lets you name your columns the way you wanted:

features = pd.DataFrame(feature_arr, columns=feature_labels)