Hi have a pandas dataframe df
containing categorical variables.
df=pandas.DataFrame(data=[['male','blue'],['female','brown'],
['male','black']],columns=['gender','eyes'])
df
Out[16]:
gender eyes
0 male blue
1 female brown
2 male black
using the function get_dummies I get the following dataframe
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
gender_female gender_male eyes_black eyes_blue eyes_brown
0 0 1 0 1 0
1 1 0 0 0 1
2 0 1 1 0 0
Owever the columns gender_female
and gender_male
contain the same information because the original column could assume a binary value. Is there a (smart) way to keep only one of the 2 columns?
UPDATED
The use of
df_dummies = pandas.get_dummies(df,drop_first=True)
Would give me
df_dummies
Out[21]:
gender_male eyes_blue eyes_brown
0 1 1 0
1 0 0 1
2 1 0 0
but I would like to remove the columns for which originally I had only 2 possibilities
The desired result should be
df_dummies
Out[18]:
gender_male eyes_black eyes_blue eyes_brown
0 1 0 1 0
1 0 0 0 1
2 1 1 0 0
Yes, you can use the argument dropfirst
:
drop_first=True
From the documentation:
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
To have all dummy columns for eyes
, and one for gender
, use this:
df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)
Output:
eyes_black eyes_blue eyes_brown gender_male
0 0 1 0 1
1 0 0 1 0
2 1 0 0 1
More general:
gender eyes heigh
0 male blue tall
1 female brown short
2 male black average
for i in df.columns:
if len(df.groupby([i]).size()) > 2:
df = pd.get_dummies(df, prefix=[i], columns=[i])
df = pd.get_dummies(df, drop_first=True)
Output:
eyes_black eyes_blue eyes_brown heigh_average heigh_short heigh_tall \
0 0 1 0 0 0 1
1 0 0 1 0 1 0
2 1 0 0 1 0 0
gender_male
0 1
1 0
2 1