I'm trying to run a logistic regression in MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (i.e. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only meaningful when compared to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
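A quick way to test for this in the one-dimensional case is to compare the ranges of the two classes. The snippet below is only a sketch: the data values are made up for illustration, and you would substitute your own column vectors X and Y.

```matlab
% Hypothetical example data; replace with your own column vectors X and Y.
X = [0.10; 0.25; 0.40; 0.60; 0.75; 0.90];
Y = [0; 0; 0; 1; 1; 1];

x0 = X(Y == 0);   % observations labelled 0
x1 = X(Y == 1);   % observations labelled 1

% Separable if every point of one class lies strictly below every point
% of the other class.
isSeparable = (max(x0) < min(x1)) || (max(x1) < min(x0));

if isSeparable
    % Any threshold inside the gap works; take the midpoint of the gap.
    if max(x0) < min(x1)
        xDiv = (max(x0) + min(x1)) / 2;
    else
        xDiv = (max(x1) + min(x0)) / 2;
    end
    fprintf('Separable; one possible threshold is xDiv = %.3f\n', xDiv);
else
    fprintf('Classes overlap; no single threshold separates them.\n');
end
```

If this reports that the classes are separable, the extreme coefficients and huge standard errors from glmfit are the expected symptom.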
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and MATLAB's LR function attempts to find a maximum likelihood fit for the data, the likelihood keeps improving as the weights grow, so you will get extreme weight values.
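To see why, here is a short sketch of the argument, assuming class 1 lies entirely above the threshold xDiv (written $x_{\mathrm{div}}$ below) and no point sits exactly on it. Write $\sigma(z) = 1/(1 + e^{-z})$ for the logistic function and restrict attention to weights of the form $b_0 = -c\,x_{\mathrm{div}}$, $b_1 = c$ with $c > 0$, where $c$ is just a scale factor introduced for this illustration. The log-likelihood is then

$$\ell(c) = \sum_{i:\,y_i = 1} \log \sigma\big(c\,(x_i - x_{\mathrm{div}})\big) + \sum_{i:\,y_i = 0} \log\Big(1 - \sigma\big(c\,(x_i - x_{\mathrm{div}})\big)\Big) = -\sum_{i} \log\Big(1 + e^{-c\,\lvert x_i - x_{\mathrm{div}} \rvert}\Big).$$

Every term increases strictly with $c$, and $\ell(c) \to 0$ (its upper bound) only as $c \to \infty$. No finite pair of weights attains the maximum, so an iterative fitter keeps increasing the weights until it hits its iteration limit.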
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
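As a rough illustration of that experiment, the sketch below fits glmfit to synthetic, perfectly separable data, flips one label, and refits. The data, the flipped index t = 15, and the variable names are all assumptions made for this example, not your actual data. It also shows one way to get predicted probabilities with confidence bounds by passing the stats output to glmval, plus approximate Wald intervals for the coefficients from stats.se.

```matlab
% Synthetic, perfectly separable data (hypothetical; not the asker's data).
X = linspace(0, 1, 40)';
Y = double(X > 0.5);

% Fit on the separable data. glmfit may issue warnings here, and the
% coefficients are expected to be very large in magnitude.
bSep = glmfit(X, Y, 'binomial', 'link', 'logit');

% Flip one label so the classes overlap, then refit.
Yflip = Y;
t = 15;                 % arbitrary index with Y(t) == 0
Yflip(t) = 1;
[bFlip, dev, stats] = glmfit(X, Yflip, 'binomial', 'link', 'logit');

% Predicted probabilities with 95% confidence bounds on a grid of inputs.
% glmval returns the half-widths dLo and dHi, so the interval is
% [pHat - dLo, pHat + dHi].
Xfit = linspace(0, 1, 101)';
[pHat, dLo, dHi] = glmval(bFlip, Xfit, 'logit', stats);

% Approximate 95% Wald intervals for the intercept and slope.
ciB = [bFlip - 1.96 * stats.se, bFlip + 1.96 * stats.se];
```

Comparing bSep with bFlip should show the effect described above. Flipping a label changes the data, so treat this as a diagnostic rather than a fix.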