Creating dummy variables using pandas or statsmodels for the interaction of two columns

Mehdi · Jul 12, 2017 · Viewed 7.2k times

I have a data frame like this:

Index ID  Industry  years_spend       asset
6646  892         4            4  144.977037
2347  315        10            8  137.749138
7342  985         1            5  104.310217
137    18         5            5  156.593396
2840  381        11            2  229.538828
6579  883        11            1  171.380125
1776  235         4            7  217.734377
2691  361         1            2  148.865341
815   110        15            4  233.309491
2932  393        17            5  187.281724

I want to create dummy variables for Industry × years_spend, which gives len(df.Industry.value_counts()) * len(df.years_spend.value_counts()) variables. For example, d_11_4 = 1 for all rows that have Industry == 11 and years_spend == 4, and d_11_4 = 0 otherwise. I can then use these variables in a regression.

I know I can form the groups I want using df.groupby(['Industry','years_spend']), and I know I can create such variables for a single column using patsy syntax in statsmodels:

import statsmodels.formula.api as smf

mod = smf.ols("income ~   C(Industry)", data=df).fit()

But if I try to do this with two columns, I get an error: IndexError: tuple index out of range

How can I do that with pandas or using some function inside statsmodels?

Answer

Nathaniel J. Smith · Jul 14, 2017

Using patsy syntax it's just:

import statsmodels.formula.api as smf

mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()

The : character means "interaction"; you can also generalize this to interactions of more than two items (C(a):C(b):C(c)), interactions between numerical and categorical values, etc. You might find the patsy docs useful.
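For the pandas side of the question, a minimal sketch (not part of the original answer) is to join the two columns into one label per row and pass that to pd.get_dummies. The sample rows are copied from the question; the names pair, dummies, and df_with_dummies are illustrative only, and the d_ prefix follows the d_11_4 naming used in the question.

```python
import pandas as pd

# Sample rows copied from the question (ID column omitted for brevity).
df = pd.DataFrame({
    "Industry":    [4, 10, 1, 5, 11, 11, 4, 1, 15, 17],
    "years_spend": [4, 8, 5, 5, 2, 1, 7, 2, 4, 5],
    "asset": [144.977037, 137.749138, 104.310217, 156.593396, 229.538828,
              171.380125, 217.734377, 148.865341, 233.309491, 187.281724],
})

# One string label per (Industry, years_spend) pair, e.g. "11_2".
pair = df["Industry"].astype(str) + "_" + df["years_spend"].astype(str)

# One 0/1 column per pair that occurs in the data, named d_<Industry>_<years_spend>.
dummies = pd.get_dummies(pair, prefix="d").astype(int)

# Attach the dummy columns to the original frame.
df_with_dummies = pd.concat([df, dummies], axis=1)
print(dummies.columns.tolist())
```

Note that this creates columns only for the combinations that actually appear in the data, which can be fewer than the full product of the two level counts; a combination that never occurs would be an all-zero column and would carry no information in a regression anyway.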