How to interpret the summary table for Python statsmodels OLS?

J.A.Cado · Jul 16, 2018 · Viewed 10.3k times

I have a continuous dependent variable y and an independent categorical variable x named control_grid. x has two categories: c and g.

Using the Python package statsmodels, I am trying to see whether the independent variable has a significant effect on y, like this:

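import statsmodels.api as sm
import statsmodels.formula.api as smf
model = smf.ols('y ~ C(x)', data=df)  # capital C() marks x as categorical
results = model.fit()
table = sm.stats.anova_lm(results, typ=2)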

Printing the summary gives this output:

     OLS Regression Results                            
==============================================================================
Dep. Variable:          sedimentation   R-squared:                       0.167
Model:                            OLS   Adj. R-squared:                  0.165
Method:                 Least Squares   F-statistic:                     86.84
Date:                Fri, 13 Jul 2018   Prob (F-statistic):           5.99e-19
Time:                        16:15:51   Log-Likelihood:                -2019.2
No. Observations:                 436   AIC:                             4042.
Df Residuals:                     434   BIC:                             4050.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -6.0243      1.734     -3.474      0.001      -9.433      -2.616
control_grid[T.g]    22.2504      2.388      9.319      0.000      17.558      26.943
==============================================================================
Omnibus:                       30.623   Durbin-Watson:                   1.064
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.853
Skew:                          -0.510   Prob(JB):                     1.10e-10
Kurtosis:                       4.218   Cond. No.                         2.69
==============================================================================

In the table where the coefficients are shown, I don't understand how my independent variable is displayed.

It says:

control_grid[T.g]

What is the "T"? And is it only looking at one of the two variables? Only at the effect of "g" and not at "c"?

If you go here, you see that in the summary the categorical variable Region is also shown for all four values "N", "S", "E" and "W".

P.S. My data looks like this:

index         sedimentation control_grid
0             5.0            c
1            10.0            g
2             0.0            c
3           -10.0            c
4             0.0            g
5           -20.0            g
6            30.0            g
7            40.0            g
8           -10.0            c
9            45.0            g
10           45.0            g
11           10.0            c
12           10.0            g
13           10.0            c
14            6.0            g
15           10.0            c
16           29.0            c
17            3.0            g
18           23.0            c
19           34.0            g

Answer

Irbin B. · Aug 29, 2018

I am not an expert, but I'll try to explain it. First, you should know that ANOVA is a form of regression analysis: you are building a model Y ~ X, where X is a categorical variable. In your case Y = sedimentation and X = control_grid (the categorical one), so the model is "sedimentation ~ control_grid".

OLS performs a regression analysis, so it estimates the parameters of a linear model: Y = Bo + B1*X. Because your X is categorical, it is dummy coded, which means X can only be 0 or 1, which is consistent with categorical data. Be aware that the number of coefficients estimated in addition to the intercept equals the number of categories - 1. Your data has only 2 categories (g and c), so only one such coefficient is shown in your OLS report. "T.g" means this coefficient corresponds to the "g" category. Your model is then Y = Bo + T.g*X.
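To make the dummy coding concrete, here is a minimal sketch that builds the design matrix for a two-level categorical column. It assumes patsy is installed (the formula library statsmodels has traditionally used), and the five rows are made up for illustration, not taken from the real data set.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Five hypothetical rows, just enough to show the coding.
df = pd.DataFrame({"control_grid": ["c", "g", "c", "c", "g"]})

# Default treatment coding: the first level in sorted order ("c")
# becomes the reference and gets no column of its own.
design = dmatrix("control_grid", data=df)

# Column labels of the design matrix (intercept plus one dummy column).
print(design.design_info.column_names)

# The dummy column is 1 for rows with "g" and 0 for rows with "c".
print(np.asarray(design))
```

The second column label should be the same control_grid[T.g] string that appears in the OLS report, which is where that row name comes from.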

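Now, the category "c" is the reference level, and its estimate is the intercept Bo. So the fitted model is:

Y = Bo + T.g*X, where X is 0 for "c" and 1 for "g". With your numbers, the predicted sedimentation is -6.0243 for "c" and -6.0243 + 22.2504 = 16.2261 for "g".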
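As a rough check of how the intercept and the T.g coefficient relate to the two categories, the sketch below fits a one-regressor least-squares line by hand on the 20 rows shown in the question (not the full 436 observations, so the numbers will differ from the report). It uses only the standard library; the variable names are my own.

```python
from statistics import mean

# The 20 sample rows from the question, in order.
sedimentation = [5.0, 10.0, 0.0, -10.0, 0.0, -20.0, 30.0, 40.0, -10.0, 45.0,
                 45.0, 10.0, 10.0, 10.0, 6.0, 10.0, 29.0, 3.0, 23.0, 34.0]
control_grid = list("cgcc" "gggg" "cggc" "gcgc" "cgcg")

# Dummy coding: X = 1 for "g", X = 0 for the reference category "c".
x = [1.0 if level == "g" else 0.0 for level in control_grid]

# Closed-form simple OLS: slope = cov(x, y) / var(x), intercept = ybar - slope * xbar.
x_bar, y_bar = mean(x), mean(sedimentation)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, sedimentation))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Plain group means for comparison.
mean_c = mean(y for y, level in zip(sedimentation, control_grid) if level == "c")
mean_g = mean(y for y, level in zip(sedimentation, control_grid) if level == "g")

print("intercept:", intercept, "mean of c:", mean_c)
print("slope (T.g):", slope, "mean of g - mean of c:", mean_g - mean_c)
```

The intercept matches the mean of "c" and the slope matches the difference between the two group means, which is how the two rows of the coefficient table can be read.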

So, you are asking:

1) What is the "T"? The "T" stands for treatment (dummy) coding, and "T.g" indicates that the coefficient shown corresponds to the category "g".

2) And is it only looking at one of the two variables? No, the analysis uses both categories (c and g). The intercept Bo is the estimate for the reference category, which in your data is "c".

3) Only at the effect of "g" and not at "c"? No, the analysis looks at both "g" and "c". The Intercept is the estimated mean for "c", and the T.g coefficient is the estimated difference between "g" and "c". The p-value of T.g tells you whether that difference is significant, that is, whether control_grid has an effect on "sedimentation". The p-value of the Intercept only tells you whether the mean for "c" differs from zero.
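If you would rather see a row for "c" in the report, one option is to change the reference level. The sketch below is only an illustration on the 20 sample rows from the question (so the values will not match the full report); it assumes pandas and statsmodels are installed, and the names default_fit and releveled_fit are my own.

```python
import pandas as pd
import statsmodels.formula.api as smf

# The 20 sample rows from the question, not the full data set.
df = pd.DataFrame({
    "sedimentation": [5.0, 10.0, 0.0, -10.0, 0.0, -20.0, 30.0, 40.0, -10.0, 45.0,
                      45.0, 10.0, 10.0, 10.0, 6.0, 10.0, 29.0, 3.0, 23.0, 34.0],
    "control_grid": list("cgcc" "gggg" "cggc" "gcgc" "cgcg"),
})

# Default coding: "c" is the reference, so the report shows Intercept and T.g.
default_fit = smf.ols("sedimentation ~ control_grid", data=df).fit()
print(default_fit.params)
print(default_fit.pvalues)

# Same model with "g" as the reference: the report now shows Intercept and T.c.
releveled_fit = smf.ols(
    "sedimentation ~ C(control_grid, Treatment(reference='g'))", data=df
).fit()
print(releveled_fit.params)
```

Both fits describe the same model. Only the labelling changes: the intercept becomes the mean of "g", and the T.c coefficient is the same difference with the opposite sign.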

Cheers,