I'm doing logistic regression using pandas 0.11.0 (for data handling) and statsmodels 0.4.3 (for the actual regression), on Mac OS X Lion.
I'm going to be running ~2,900 different logistic regression models and need the results written to a csv file and formatted in a particular way.
Currently, I'm only aware of `print result.summary()`, which prints the results (as follows) to the shell:
                           Logit Regression Results
==============================================================================
Dep. Variable:            death_death   No. Observations:                 9752
Model:                          Logit   Df Residuals:                     9747
Method:                           MLE   Df Model:                            4
Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
Time:                        22:15:05   Log-Likelihood:                -5806.9
converged:                       True   LL-Null:                       -5655.8
                                        LLR p-value:                     1.000
===============================================================================
                  coef    std err          z      P>|z|     [95.0% Conf. Int.]
-------------------------------------------------------------------------------
age_age5064    -0.1999      0.055     -3.619      0.000       -0.308    -0.092
age_age6574    -0.2553      0.053     -4.847      0.000       -0.359    -0.152
sex_female     -0.2515      0.044     -5.765      0.000       -0.337    -0.166
stage_early    -0.1838      0.041     -4.528      0.000       -0.263    -0.104
access         -0.0102      0.001    -16.381      0.000       -0.011    -0.009
===============================================================================
I will also need the odds ratios, which are computed by `print np.exp(result.params)` and printed in the shell like this:
age_age5064 0.818842
age_age6574 0.774648
sex_female 0.777667
stage_early 0.832098
access 0.989859
dtype: float64
What I need is for each of these to be written to a csv file in the form of a very long row like the following (I'm not sure, at this point, whether I will need things like `Log-Likelihood`, but have included it for the sake of thoroughness):
`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`
I think you get the picture - a very long row, with all of these actual values, and a header with all the column designations in a similar format.
I am familiar with the `csv` module in Python, and am becoming more familiar with `pandas`. I'm not sure whether this info could be formatted and stored in a pandas DataFrame and then written to a file with `to_csv` once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing each row as its model completes (using the `csv` module) would also be fine.
UPDATE:
So, I was looking more at statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think using inheritance from this class to create another class, where some of the methods/operators get changed might be the way to go, in order to get the formatting I require. I have very little experience in the ways of doing this, and will need to spend quite a bit of time figuring this out (which is fine). If anybody can help/has more experience that would be awesome!
Here is the site where the classes are laid out: statsmodels results class
There is no premade table of parameters and their result statistics currently available.
Essentially, you need to stack all the results yourself; whether you use a list, a numpy array or a pandas DataFrame depends on what's more convenient for you.
For example, if I want one numpy array per model that holds llf plus the columns of the summary parameter table, then I could use
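import numpy

res_all = []
for res in results:  # results: a list of fitted results instances
    # conf_int() has one row per parameter; transpose to get lower/upper bounds
    low, upp = numpy.asarray(res.conf_int()).T
    res_all.append(numpy.concatenate(([res.llf], res.params, res.tvalues,
                                      res.pvalues, low, upp)))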
But it might be better to align with pandas, depending on what structure you have across models.
You could write a helper function that takes all the results from the results instance and concatenates them in a row.
(I'm not sure what's the most convenient for writing to csv by rows)
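A minimal sketch of such a helper, using only the standard library `csv` module. It reads `llf`, `params`, `bse`, `tvalues` and `pvalues` from the results instance and takes the variable names from `res.model.exog_names`. The function names `flatten_result` and `write_rows`, the column suffixes (`_p` standing in for `p>|z|`), and the `SimpleNamespace` stand-in for a fitted model are assumptions made for illustration; the stand-in numbers are copied from two rows of the summary in the question.

```python
import csv
import io
import math
from types import SimpleNamespace


def flatten_result(res):
    """Flatten one results instance into a single {column_name: value} row."""
    names = list(res.model.exog_names)
    row = {"log_likelihood": float(res.llf)}
    # Per-variable statistics first: coef, std err, z, p-value.
    for name, coef, std_err, z, p in zip(names, res.params, res.bse,
                                         res.tvalues, res.pvalues):
        row[name + "_coef"] = float(coef)
        row[name + "_std_err"] = float(std_err)
        row[name + "_z"] = float(z)
        row[name + "_p"] = float(p)
    # Odds ratios go at the end of the row, as in the layout the question asks for.
    for name, coef in zip(names, res.params):
        row[name + "_odds_ratio"] = math.exp(coef)
    return row


def write_rows(rows, handle):
    """Write a header plus one line per row; all rows must share the same columns."""
    rows = list(rows)
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical stand-in for a fitted statsmodels result (not a real model fit).
fake = SimpleNamespace(
    model=SimpleNamespace(exog_names=["age_age5064", "access"]),
    llf=-5806.9,
    params=[-0.1999, -0.0102],
    bse=[0.055, 0.001],
    tvalues=[-3.619, -16.381],
    pvalues=[0.000, 0.000],
)

buffer = io.StringIO()  # swap for open("results.csv", "w", newline="") to write a file
write_rows([flatten_result(fake)], buffer)
csv_text = buffer.getvalue()
```

With real fits you would pass each results instance to `flatten_result` inside the loop over models. `csv.DictWriter` raises a `ValueError` if a later row has a column that is not in the header, so this layout only works as-is when every model uses the same variables.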
edit:
Here is an example that stores the regression results in a DataFrame:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21
The loop is on line 159.
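Independently of that sandbox code, here is a rough sketch of collecting one row per model in a DataFrame and writing everything with `to_csv` at the end. It assumes the result attributes (`params`, `bse`, `tvalues`, `pvalues`) are pandas Series labelled by variable name, as the odds-ratio output in the question suggests. `result_to_row`, `fake_result`, the column suffixes and the model labels are hypothetical names; the stand-in numbers are copied from the summary in the question, and splitting them across two "models" is purely illustrative.

```python
import io
from types import SimpleNamespace

import numpy as np
import pandas as pd


def result_to_row(res):
    """Build one {column: value} dict from a result with labelled Series attributes."""
    row = {"log_likelihood": res.llf}
    for name in res.params.index:
        row[name + "_coef"] = res.params[name]
        row[name + "_std_err"] = res.bse[name]
        row[name + "_z"] = res.tvalues[name]
        row[name + "_p"] = res.pvalues[name]
        row[name + "_odds_ratio"] = np.exp(res.params[name])
    return row


def fake_result(llf, names, coef, std_err, z, p):
    """Hypothetical stand-in for a fitted statsmodels result (not a real fit)."""
    return SimpleNamespace(
        llf=llf,
        params=pd.Series(coef, index=names),
        bse=pd.Series(std_err, index=names),
        tvalues=pd.Series(z, index=names),
        pvalues=pd.Series(p, index=names),
    )


# Two stand-in models with different variables, to show the alignment.
models = {
    "model_a": fake_result(-5806.9, ["age_age5064", "access"],
                           [-0.1999, -0.0102], [0.055, 0.001],
                           [-3.619, -16.381], [0.000, 0.000]),
    "model_b": fake_result(-5806.9, ["sex_female", "access"],
                           [-0.2515, -0.0102], [0.044, 0.001],
                           [-5.765, -16.381], [0.000, 0.000]),
}

# One row per model; columns a model does not have are filled with NaN.
frame = pd.DataFrame([result_to_row(res) for res in models.values()],
                     index=list(models))

buffer = io.StringIO()  # pass a file path to to_csv instead to write to disk
frame.to_csv(buffer, index_label="model")
csv_text = buffer.getvalue()
```

The main reason to go through a DataFrame rather than writing rows directly is the alignment: models with different sets of variables still end up under one common header, with empty cells where a variable is absent.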
`summary()`, and similar code outside of statsmodels for combining several results (for example http://johnbeieler.org/py_apsrtable/), is oriented towards printing, not towards storing the values.