I have a continuous dependent variable y and an independent categorical variable x named control_grid. x contains two categories: c and g.
Using the Python package statsmodels, I am trying to see whether the independent variable has a significant effect on y, like this:
import statsmodels.api as sm
import statsmodels.formula.api as smf
model = smf.ols('y ~ C(x)', data=df)  # C(...) marks x as categorical
results = model.fit()
table = sm.stats.anova_lm(results, typ=2)
Printing results.summary() gives this output:
                            OLS Regression Results
==============================================================================
Dep. Variable:          sedimentation   R-squared:                       0.167
Model:                            OLS   Adj. R-squared:                  0.165
Method:                 Least Squares   F-statistic:                     86.84
Date:                Fri, 13 Jul 2018   Prob (F-statistic):           5.99e-19
Time:                        16:15:51   Log-Likelihood:                -2019.2
No. Observations:                 436   AIC:                             4042.
Df Residuals:                     434   BIC:                             4050.
Df Model:                           1
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -6.0243      1.734     -3.474      0.001      -9.433      -2.616
control_grid[T.g]    22.2504      2.388      9.319      0.000      17.558      26.943
==============================================================================
Omnibus:                       30.623   Durbin-Watson:                   1.064
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.853
Skew:                          -0.510   Prob(JB):                     1.10e-10
Kurtosis:                       4.218   Cond. No.                         2.69
==============================================================================
In the table where the coefficients are shown, I don't understand how my independent variable is depicted.
It says:
control_grid[T.g]
What is the "T"? And is it only looking at one of the two categories? Only at the effect of "g" and not at "c"?
If you go here you see that in the summary the categorical variable Region is also shown for all four levels "N", "S", "E" and "W".
P.S. my data looks as such:
index sedimentation control_grid
0 5.0 c
1 10.0 g
2 0.0 c
3 -10.0 c
4 0.0 g
5 -20.0 g
6 30.0 g
7 40.0 g
8 -10.0 c
9 45.0 g
10 45.0 g
11 10.0 c
12 10.0 g
13 10.0 c
14 6.0 g
15 10.0 c
16 29.0 c
17 3.0 g
18 23.0 c
19 34.0 g
I am not an expert, but I'll try to explain it. First, you should know that ANOVA is a special case of regression analysis: you are building a model Y ~ X, but in ANOVA X is a categorical variable. In your case Y = sedimentation and X = control_grid (the categorical one), so the model is "sedimentation ~ control_grid".
OLS performs a regression analysis, so it estimates the parameters of a linear model: Y = Bo + B1*X. Because your X is categorical, it is dummy coded, which means X can only be 0 or 1, consistent with categorical data. Be aware that when the model has an intercept, the number of parameters estimated for a categorical variable equals the number of categories - 1. Your data has only 2 categories (g and c), so only one parameter for control_grid is shown in your OLS report. "T.g" means this parameter corresponds to the "g" category. So your model is Y = Bo + T.g*X.
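Now, the category "c" is the reference level and its parameter is the intercept Bo, so written out, your model is:
Y = Bo + T.g*X, where X is 0 or 1 depending on whether the row is "c" or "g". For "c" it predicts Bo = -6.0243 (the mean of "c"), and for "g" it predicts Bo + T.g = -6.0243 + 22.2504 = 16.2261 (the mean of "g").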
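To see why the intercept takes over the role of the "c" category, here is a minimal sketch in plain Python. It assumes 0/1 dummy coding with "c" as the reference level and uses only the 20 sample rows from the question, so the numbers will not match the full 436-row report. The variable names are illustrative.

```python
# The 20 sample rows from the question: (sedimentation, control_grid).
rows = [
    (5.0, "c"), (10.0, "g"), (0.0, "c"), (-10.0, "c"), (0.0, "g"),
    (-20.0, "g"), (30.0, "g"), (40.0, "g"), (-10.0, "c"), (45.0, "g"),
    (45.0, "g"), (10.0, "c"), (10.0, "g"), (10.0, "c"), (6.0, "g"),
    (10.0, "c"), (29.0, "c"), (3.0, "g"), (23.0, "c"), (34.0, "g"),
]


def mean(values):
    return sum(values) / len(values)


y = [sed for sed, _ in rows]
# Dummy coding: "c" is the reference level (0), "g" is coded as 1.
x = [1.0 if grid == "g" else 0.0 for _, grid in rows]

# Closed-form simple least squares: slope = cov(x, y) / var(x).
x_bar, y_bar = mean(x), mean(y)
slope = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
intercept = y_bar - slope * x_bar

# Plain group means, for comparison with the fitted parameters.
mean_c = mean([sed for sed, grid in rows if grid == "c"])
mean_g = mean([sed for sed, grid in rows if grid == "g"])

print("Intercept (Bo):", intercept, "| mean of c:", mean_c)
print("Slope (T.g):   ", slope, "| mean of g - mean of c:", mean_g - mean_c)
```

The intercept coincides with the mean of "c" and the slope with the difference between the two group means, which is how the two rows of the coefficient table can be read.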
So, you are asking:
1) What is the "T"? "T" stands for treatment coding, the default dummy coding used by the formula interface. "T.g" indicates that the parameter shown corresponds to the category "g", measured relative to the reference category.
2) And is it only looking at one of the two variables? No, the analysis covers both categories (c and g). The intercept Bo is the estimate for the reference level, in your data "c", and T.g is the difference between "g" and "c".
3) Only at the effect of "g" and not at "c"? No, the analysis looks at both. The p-value of T.g tells you whether "g" differs significantly from "c", which is the effect of control_grid on "sedimentation". The p-value of the Intercept only tells you whether the mean of "c" differs from zero. With two categories, the F-statistic of the model is the square of the t value of T.g (9.319^2 ≈ 86.84), so both tests lead to the same conclusion.
Cheers,