I am trying to fit a Poisson distribution to my data using statsmodels but I am confused by the results that I am getting and how to use the library.
My real data will be a series of numbers that I think that I should be able to describe as having a poisson distribution plus some outliers so eventually I would like to do a robust fit to the data.
However for testing purposes, I just create a dataset using scipy.stats.poisson
samp = scipy.stats.poisson.rvs(4,size=200)
So to fit this using statsmodels I think that I just need to have a constant 'endog'
res = sm.Poisson(samp,np.ones_like(samp)).fit()
print res.summary()
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 200
Model: Poisson Df Residuals: 199
Method: MLE Df Model: 0
Date: Fri, 27 Jun 2014 Pseudo R-squ.: 0.000
Time: 14:28:29 Log-Likelihood: -404.37
converged: True LL-Null: -404.37
LLR p-value: nan
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.3938 0.035 39.569 0.000 1.325 1.463
==============================================================================
Ok, that doesn't look right, But if I do
res.predict()
I get an array of 4.03 (which was the mean for this test sample). So basically, firstly I very confused how to interpret this result from statsmodel and secondly I should probably being doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends but how should I go about doing that?
Edit I should really have given more detail in order to answer the second part of my question.
I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply find the mean time. I suppose that I could exclude them manually but I thought that I could find something more exacting.
Edit On further reflection, I will be considering other distributions instead of sticking with a Poissonion and the details of my issue are probably a distraction from the original question but I've left them here anyway.
The Poisson model, as most other models in generalized linear model families or for other discrete data, assumes that we have a transformation that bounds the prediction in the appropriate range.
Poisson works for nonnegative numbers and the transformation is exp
, so the model that is estimated assumes that the expected value of an observation, conditional on the explanatory variables is
E(y | x) = exp(X dot params)
To get the lambda parameter of the poisson distribution, we need to use exp, i.e.
>>> np.exp(1.3938)
4.0301355071650118
predict
does this by default, but you can request just the linear part (X dot params)
with a keyword argument.
BTW: statsmodels' controversial terminology endog is y exog is x (has x in it) (http://statsmodels.sourceforge.net/devel/endog_exog.html )
Outlier Robust Estimation
The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.
For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial Regression. For outliers in Poisson we would have to use R/Rpy or do manual trimming of outliers. Outlier identification could be based on one of the standardized residuals.
It will not be available in statsmodels for some time, unless someone is contributing this.