I was thinking about posting my question in Cross-Validated, but decided to come here. I am using the multinom() function from the nnet package to estimate the odds of becoming employed, unemployed, or out of labor force conditioned on age and education. I need some help with the interpretation.
I have the following dataset of one dependent categorical variable employment status(EmpSt) and two independent categorical variables: age (Age) and education level (Education).
>head(df)
EmpSt Age Education
1 Employed 61+ Less than a high school diploma
2 Employed 50-60 High school graduates, no college
3 Not in labor force 50-60 Less than a high school diploma
4 Employed 30-39 Bachelor's degree or higher
5 Employed 20-29 Some college or associate degree
6 Employed 20-29 Some college or associate degree
Here is the summary with the levels:
>summary(df)
EmpSt Age Education
Not in universe : 0 16-19: 6530 Less than a high school diploma :14686
Employed :61478 20-29:16031 High school graduates, no college:30716
Unemployed : 3940 30-39:16520 Some college or associate degree :28525
Not in labor force:38508 40-49:17403 Bachelor's degree or higher :29999
50-60:20779
61+ :26663
I want to determine what is the estimation equation(model) for the call
df$EmpSt<-relevel(df$EmpSt,ref="Employed")
multinom(EmpSt ~ Age + Education,data=df)
so I can write it down in my research paper. In my understanding the Employed is the base level and the logit model for this call is:
where i and n are the categories of the variables age and education respectively (sorry for confusing notation). Please, correct me if my understanding of the logistic model produced by multinom() is incorrect. I am not going to include the summary of the test because it is a lot of output, so below I just include the the output for call >test
:
> test
Call:
multinom(formula = EmpSt ~ Age + Education, data = ml)
Coefficients:
(Intercept) Age20-29 Age30-39 Age40-49 Age50-60 Age61+
Unemployed -1.334734 -0.3395987 -0.7104361 -0.8848517 -0.9358338 -0.9319822
Not in labor force 1.180028 -1.2531405 -1.6711616 -1.6579095 -1.2579600 0.8197373
EducationHigh school graduates, no college EducationSome college or associate degree
Unemployed -0.4255369 -0.781474
Not in labor force -0.8125016 -1.004423
EducationBachelor's degree or higher
Unemployed -1.351119
Not in labor force -1.580418
Residual Deviance: 137662.6
AIC: 137698.6
Given that my understanding of the logit model produced by the multinom() is correct the coefficients are the logged odds where the base level is Employed. To get the actual odds I antilog by the call exp(coef(test))
which gives me the actual odds:
> exp(coef(test))
(Intercept) Age20-29 Age30-39 Age40-49 Age50-60 Age61+
Unemployed 0.2632281 0.7120560 0.4914298 0.4127754 0.3922587 0.3937724
Not in labor force 3.2544655 0.2856064 0.1880285 0.1905369 0.2842333 2.2699035
EducationHigh school graduates, no college EducationSome college or associate degree
Unemployed 0.6534189 0.4577308
Not in labor force 0.4437466 0.3662560
EducationBachelor's degree or higher
Unemployed 0.2589504
Not in labor force 0.2058891
which brings me to my next question.
I wonder if there is a way to get the actual probabilities of being unemployed vs employed based on the combination of age and education,e.g what is the probability of being unemployed if I am 22 and have a high school diploma. Sorry for the lengthy question. Thanks for your help. Let me know if additional clarification is needed.
About your first question, I'm also having some doubts about multinom
with categorical variables (here is my question: Multinom with Matrix of Counts as Response).
From what a user replied in that question and the output of >test
you posted, I guess that the math you wrote is partially right: indeed, a multinomial model should work only if the predictor variables are continuous or dichotomous (i.e., with values only 0 or 1), and it seems that when multinom
gets categorical variables as predictors, like in your example, R
automatically converts them to dummy varibales (only 0 or 1).
With reference to your example, considering only the Age
predictor, we should have ln(\frac{Pr(unemployed)}{Pr(employed}) = \beta_0 + \beta_1*Age20-29 + \beta_2*Age30-39 + ...
and an analogous formula for Pr(not in labor force)
, but with different \beta
coefficients.
About your second question: yes, there is a way. Use predict(test, newdata, "probs")
, where newdata
is an array with Age20-29
and High school graduates, no college
as entries (given your example).