I have a Pandas DataFrame with 2 categorical variables, an ID variable and a target variable (for classification). I managed to convert the categorical values with OneHotEncoder. This results in a sparse matrix.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
# First I remapped the string values in the categorical variables to integers as OneHotEncoder needs integers as input
... remapping code ...
ohe.fit(df[['col_a', 'col_b']])
ohe.transform(df[['col_a', 'col_b']])
But I have no clue how to use this sparse matrix in a DecisionTreeClassifier, especially when I want to add some other non-categorical variables from my dataframe later on. Thanks!
EDIT: In reply to miraculixx's comment, I also tried the DataFrameMapper from sklearn-pandas:
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([
    ('id_col', None),
    ('target_col', None),
    (['col_a'], OneHotEncoder()),
    (['col_b'], OneHotEncoder())
])
t = mapper.fit_transform(df)
But then I get this error:
TypeError: no supported conversion for types : (dtype('O'), dtype('int64'), dtype('float64'), dtype('float64')).
I see you are already using Pandas, so why not use its get_dummies function?
import pandas as pd
df = pd.DataFrame([['rick','young'],['phil','old'],['john','teenager']],columns=['name','age-group'])
Result:
   name age-group
0  rick     young
1  phil       old
2  john  teenager
Now encode it with get_dummies:
pd.get_dummies(df)
Result:
   name_john  name_phil  name_rick  age-group_old  age-group_teenager  \
0          0          0          1              0                   0
1          0          1          0              1                   0
2          1          0          0              0                   1

   age-group_young
0                1
1                0
2                0
You can then pass the resulting Pandas DataFrame directly to scikit-learn's DecisionTreeClassifier.
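To make that last step concrete, here is a minimal sketch of how it could look with the columns from the question. The data values and the extra numeric column num_col are made up for illustration, and dtype=int is only there so the dummy columns come out as 0/1 regardless of the Pandas version.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: column names follow the question, the values are invented.
df = pd.DataFrame({
    "id_col": [1, 2, 3, 4, 5, 6],
    "col_a": ["x", "y", "x", "z", "y", "z"],
    "col_b": ["u", "u", "v", "v", "u", "v"],
    "num_col": [0.5, 1.2, 3.3, 0.1, 2.8, 1.9],  # assumed extra non-categorical feature
    "target_col": [0, 1, 0, 1, 1, 0],
})

# Drop the ID and target, then encode only the categorical columns.
# Columns not listed in `columns` (here num_col) pass through unchanged.
features = pd.get_dummies(
    df.drop(columns=["id_col", "target_col"]),
    columns=["col_a", "col_b"],
    dtype=int,
)
target = df["target_col"]

# Fit the tree on the encoded DataFrame and predict on the same rows.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, target)
predictions = clf.predict(features)
```

If you would rather keep the OneHotEncoder output, DecisionTreeClassifier.fit also accepts SciPy sparse matrices, and scipy.sparse.hstack can be used to append the other numeric columns to the encoded matrix.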