How to get pandas get_dummies to emit N-1 variables to avoid collinearity?

ihadanny · Jul 19, 2015

pandas.get_dummies emits one dummy variable per categorical value. Is there an automated, easy way to ask it to create only N-1 dummy variables (that is, to drop one arbitrary "baseline" variable)?

I need this to avoid collinearity in our dataset.

Answer

T.C. Proctor · May 26, 2016

pandas 0.18.0 added exactly what you're looking for: the drop_first option of get_dummies, which drops the first level of each encoded variable so that N levels yield N-1 dummy columns. Here's an example:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: u'0.18.1'

In [3]: s = pd.Series(list('abcbacb'))

In [4]: pd.get_dummies(s, drop_first=True)
Out[4]: 
     b    c
0  0.0  0.0
1  1.0  0.0
2  0.0  1.0
3  1.0  0.0
4  0.0  0.0
5  0.0  1.0
6  1.0  0.0
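
To see why dropping one level removes the collinearity, here is a small sketch that compares the full and the reduced encoding and then applies the same option to a DataFrame. It assumes a newer pandas than the session above (the dtype argument is not available in 0.18), and the DataFrame and its column names (color, size) are made up for illustration.

```python
import pandas as pd

s = pd.Series(list("abcbacb"))

# Full encoding: one column per level. Each row has exactly one 1,
# so the columns always add up to a constant (the intercept column).
full = pd.get_dummies(s, dtype=int)
print(full.sum(axis=1).unique())

# Reduced encoding: the first level ("a") becomes the implicit baseline.
reduced = pd.get_dummies(s, drop_first=True, dtype=int)
print(list(reduced.columns))

# Hypothetical DataFrame: only the listed column is encoded; the new
# columns are prefixed with the original column name.
df = pd.DataFrame({
    "color": list("abcbacb"),
    "size": [1, 2, 3, 2, 1, 3, 2],
})
encoded = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
print(list(encoded.columns))
```

Rows where every remaining dummy is 0 belong to the dropped baseline level, so no information is lost.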