Multinomial regression using multinom function in R

Koba picture Koba · Mar 10, 2014 · Viewed 10.8k times · Source

I was thinking about posting my question in Cross-Validated, but decided to come here. I am using the multinom() function from the nnet package to estimate the odds of becoming employed, unemployed, or out of labor force conditioned on age and education. I need some help with the interpretation.

I have the following dataset of one dependent categorical variable employment status(EmpSt) and two independent categorical variables: age (Age) and education level (Education).

>head(df)
               EmpSt   Age                         Education
1           Employed   61+   Less than a high school diploma
2           Employed 50-60 High school graduates, no college
3 Not in labor force 50-60   Less than a high school diploma
4           Employed 30-39       Bachelor's degree or higher
5           Employed 20-29  Some college or associate degree
6           Employed 20-29  Some college or associate degree

Here is the summary with the levels:

>summary(df)
                EmpSt          Age                                    Education    
 Not in universe   :    0   16-19: 6530   Less than a high school diploma  :14686  
 Employed          :61478   20-29:16031   High school graduates, no college:30716  
 Unemployed        : 3940   30-39:16520   Some college or associate degree :28525  
 Not in labor force:38508   40-49:17403   Bachelor's degree or higher      :29999  
                            50-60:20779                                            
                            61+  :26663                                    
  • First,what is the estimation equation(model)

I want to determine what is the estimation equation(model) for the call

df$EmpSt<-relevel(df$EmpSt,ref="Employed") multinom(EmpSt ~ Age + Education,data=df)

so I can write it down in my research paper. In my understanding the Employed is the base level and the logit model for this call is:

enter image description here enter image description here

where i and n are the categories of the variables age and education respectively (sorry for confusing notation). Please, correct me if my understanding of the logistic model produced by multinom() is incorrect. I am not going to include the summary of the test because it is a lot of output, so below I just include the the output for call >test:

> test
Call:
multinom(formula = EmpSt ~ Age + Education, data = ml)

Coefficients:
                   (Intercept)   Age20-29   Age30-39   Age40-49   Age50-60     Age61+
Unemployed           -1.334734 -0.3395987 -0.7104361 -0.8848517 -0.9358338 -0.9319822
Not in labor force    1.180028 -1.2531405 -1.6711616 -1.6579095 -1.2579600  0.8197373
                   EducationHigh school graduates, no college EducationSome college or associate degree
Unemployed                                         -0.4255369                                 -0.781474
Not in labor force                                 -0.8125016                                 -1.004423
                   EducationBachelor's degree or higher
Unemployed                                    -1.351119
Not in labor force                            -1.580418

Residual Deviance: 137662.6 
AIC: 137698.6 

Given that my understanding of the logit model produced by the multinom() is correct the coefficients are the logged odds where the base level is Employed. To get the actual odds I antilog by the call exp(coef(test)) which gives me the actual odds:

> exp(coef(test))
                   (Intercept)  Age20-29  Age30-39  Age40-49  Age50-60    Age61+
Unemployed           0.2632281 0.7120560 0.4914298 0.4127754 0.3922587 0.3937724
Not in labor force   3.2544655 0.2856064 0.1880285 0.1905369 0.2842333 2.2699035
                   EducationHigh school graduates, no college EducationSome college or associate degree
Unemployed                                          0.6534189                                 0.4577308
Not in labor force                                  0.4437466                                 0.3662560
                   EducationBachelor's degree or higher
Unemployed                                    0.2589504
Not in labor force                            0.2058891

which brings me to my next question.

  • Second, the probabilities

I wonder if there is a way to get the actual probabilities of being unemployed vs employed based on the combination of age and education,e.g what is the probability of being unemployed if I am 22 and have a high school diploma. Sorry for the lengthy question. Thanks for your help. Let me know if additional clarification is needed.

Answer

Pippo picture Pippo · Mar 10, 2014

About your first question, I'm also having some doubts about multinom with categorical variables (here is my question: Multinom with Matrix of Counts as Response).

From what a user replied in that question and the output of >test you posted, I guess that the math you wrote is partially right: indeed, a multinomial model should work only if the predictor variables are continuous or dichotomous (i.e., with values only 0 or 1), and it seems that when multinom gets categorical variables as predictors, like in your example, R automatically converts them to dummy varibales (only 0 or 1).

With reference to your example, considering only the Age predictor, we should have ln(\frac{Pr(unemployed)}{Pr(employed}) = \beta_0 + \beta_1*Age20-29 + \beta_2*Age30-39 + ... and an analogous formula for Pr(not in labor force), but with different \beta coefficients.

About your second question: yes, there is a way. Use predict(test, newdata, "probs"), where newdata is an array with Age20-29 and High school graduates, no college as entries (given your example).