Difference in Differences in Python + Pandas

pceccon picture pceccon · May 12, 2016 · Viewed 9.9k times · Source

I'm trying to perform a Difference in Differences (with panel data and fixed effects) analysis using Python and Pandas. I have no background in Economics and I'm just trying to filter the data and run the method that I was told to. However, as far as I could learn, I understood that the basic diff-in-diffs model looks like this:

enter image description here

I.e., I am dealing with a multivariable model.

Here it follows a simple example in R:

https://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/

As it can be seen, the regression takes as input one dependent variable and tree sets of observations.

My input data looks like this:

    Name    Permits_13  Score_13    Permits_14  Score_14    Permits_15  Score_15
0   P.S. 015 ROBERTO CLEMENTE   12.0    284 22  279 32  283
1   P.S. 019 ASHER LEVY 18.0    296 51  301 55  308
2   P.S. 020 ANNA SILVER    9.0 294 9   290 10  293
3   P.S. 034 FRANKLIN D. ROOSEVELT  3.0 294 4   292 1   296
4   P.S. 064 ROBERT SIMON   3.0 287 15  288 17  291
5   P.S. 110 FLORENCE NIGHTINGALE   0.0 313 3   306 4   308
6   P.S. 134 HENRIETTA SZOLD    4.0 290 12  292 17  288
7   P.S. 137 JOHN L. BERNSTEIN  4.0 276 12  273 17  274
8   P.S. 140 NATHAN STRAUS  13.0    282 37  284 59  284
9   P.S. 142 AMALIA CASTRO  7.0 290 15  285 25  284
10  P.S. 184M SHUANG WEN    5.0 327 12  327 9   327

Through some research I found that this is the way to use fixed effects and panel data with Pandas:

Fixed effect in Pandas or Statsmodels

I performed some transformations to get a Multi-index data:

rng = pandas.date_range(start=pandas.datetime(2013, 1, 1), periods=3, freq='A')
index = pandas.MultiIndex.from_product([rng, df['Name']], names=['date', 'id'])
d1 = numpy.array(df.ix[:, ['Permits_13', 'Score_13']])
d2 = numpy.array(df.ix[:, ['Permits_14', 'Score_14']])
d3 = numpy.array(df.ix[:, ['Permits_15', 'Score_15']])
data = numpy.concatenate((d1, d2, d3), axis=0)
s = pandas.DataFrame(data, index=index)  
s = s.astype('float')

However, I didn't get how to pass all this variables to the model, such as can be done in R:

reg1 = lm(work ~ post93 + anykids + p93kids.interaction, data = etc)

Here, 13, 14, 15 represents data for 2013, 2014, 2015, which I believe should be used to create a panel. I called the model like this:

reg  = PanelOLS(y=s['y'],x=s[['x']],time_effects=True)

And this is the result:

enter image description here

I was told (by an economist) that this doesn't seem to be running with fixed effects.

--EDIT--

What I want to verify is the effects of the number of permits on the score, given the time. The number of the permits is the treatment, it's an intensive treatment.

A sample of the code can be found here: https://www.dropbox.com/sh/ped312ur604357r/AACQGloHDAy8I2C6HITFzjqza?dl=0.

Answer

etna picture etna · May 18, 2016

It seems that what you need are not difference in differences (DD) regressions. DD regressions are relevant when you can distinguish a control group and a treatment group. A standard simplified example would be the evaluation of a medicine. You split a population of sick people in two groups. Half of them are given nothing: they are the control group. The other half are given a medicine: they are the treatment group. Essentially, the DD regression will capture the fact that the real effect of the medicine is not directly measurable in terms of how many people who were given the medicine got healthy. Intuitively, you want to know if these people did better than the ones who were not given any medicine. This result could be refined by adding yet another category: a placebo one i.e. people who are given something which looks like a medicine but actually isn't... but again this would be a well defined group. Last but not least, for a DD regression to be really appropriate, you need to make sure groups are not heterogeneous in a way that could bias results. A bad situation for your medicine test would be if the treatment group includes only people who are young and super fit (hence more likely to heal in general), while the control group is a bunch of old alcoholics...

In your case, if I'm not mistaken, everybody gets "treated" to some extent... so you are closer to a standard regression framework where the impact of X on Y (e.g. IQ on wage) is to be measured. I understand that you want to measure the impact of the number of permits on the score (or is it the other way? -_-), and you have classical endogeneity to deal with i.e. if Peter is more skilled than Paul, he'll typically obtain more permits AND a higher score. So what you actually want to use is the fact that with the same level of skill over time, Peter (respectively Paul) will be "given" different levels of permits over years... and there you'll really measure the influence of permits on score...

I might not be guessing well, but I want to insist on the fact that there are many ways to obtain biased, hence meaningless results, if you don't put enough efforts to understand/explain what's going on in the data. Regarding technical details, your estimation only have year fixed effects (likely not estimated but taken into account through demeaning, hence not returned in the output), so what you want to do is to add entity_effects = True. If you want to go further... I'm afraid panel data regressions are not well covered in any Python package so far, (including statsmodels which if the reference for econometrics) so if you're not willing to invest... I would rather suggest using R or Stata. Meanwhile, if a Fixed Effect regression is all you need, you can also get it with statsmodels (which also allows to cluster standard errors if needed...):

import statsmodels.formula.api as smf
df = s.reset_index(drop = False)
reg = smf.ols('y ~ x + C(date) + C(id)',
              data = df).fit()
print(reg.summary())
# clustering standard errors at individual level
reg_cl = smf.ols(formula='y ~ x + C(date) + C(id)',
                 data=df).fit(cov_type='cluster',
                              cov_kwds={'groups': df['id']})
print(reg_cl.summary())
# output only coeff and standard error of x
print(u'{:.3f} ({:.3f})'.format(reg.params.ix['x'], reg.bse.ix['x']))
print(u'{:.3f} ({:.3f})'.format(reg_cl.params.ix['x'], reg_cl.bse.ix['x']))

Regarding econometrics, you'll likely get more/better answers on Cross Validated than here.