An Application On Multinomial Logistic Regression Model
Abdalla M. EL-HABIL
Head of the Department of Applied Statistics
Faculty of Economics and Administrative Sciences
Al-Azhar University, Gaza - Palestine
abdalla20022002@yahoo.com
Abstract
This study presents an application of the multinomial logistic regression model, one
of the important methods for categorical data analysis. The model deals with a single
response variable, nominal or ordinal, that has more than two categories. It has been
applied to data analysis in many areas, for example health, social, behavioral, and
educational research. To illustrate the model in practice, we used real data on physical
violence against children from the Youth Survey 2003 conducted by the Palestinian Central
Bureau of Statistics (PCBS). The segment of the population of children aged 10-14 years
residing in Gaza governorate, of size 66,935, was selected, and the response variable
consisted of four categories. Eighteen explanatory variables were used to build the primary
multinomial logistic regression model. The model was subjected to a set of statistical tests
to ensure its appropriateness for the data. It was also tested by randomly selecting two
observations from the data and predicting the group into which each would be classified,
given the values of its explanatory variables. We conclude that the multinomial logistic
regression model allows us to define accurately the relationship between the group of
explanatory variables and the response variable, to identify the effect of each variable,
and to predict the classification of any individual case.
I. Introduction
In recent years, specialized statistical methods for analyzing categorical data
have multiplied, particularly for applications in the biomedical and social sciences.
Regression analysis is one of the statistical tools that exploit the relationship
between two or more variables. Regression models can be divided into two
groups: linear models and nonlinear models. Linear models are satisfactory for
most regression applications; nonlinear models are used when a linear model is
not suitable. Many statisticians regard the logistic regression model as one of the
most important models for analyzing categorical data; it is a special case of the
generalized linear model (GLM).
The multinomial logistic regression (MLR) model is generally used where the
response variable has more than two levels or categories. The basic concept
generalizes binary logistic regression. Continuous variables are not used as the
response variable in logistic regression, and only one
response variable can be used. The MLR model can be used to predict a
response variable on the basis of continuous and/or categorical explanatory
variables to determine the percent of variance in the response variable explained
by the explanatory variables, to rank the relative importance of independents, to
This study focuses on the MLR model, which we believe is important and
useful for analyzing categorical data. The problem is therefore:
Using real data, how can we apply a new statistical method (the multinomial
logistic regression model) to the analysis of categorical data?
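The odds are $\pi(x)/(1-\pi(x)) = \exp(\alpha + \beta x)$, and the logarithm of the odds is called the logit, so
$$\operatorname{logit}[\pi(x)] = \log\frac{\pi(x)}{1-\pi(x)} = \log[\exp(\alpha + \beta x)] = \alpha + \beta x .$$
The logit, that is, the logarithm of the odds, is therefore a linear function of x.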
The parameter β determines the rate of increase or decrease of the S-shaped
curve of π(x). The sign of β indicates whether the curve ascends (β > 0) or
descends (β < 0), and the rate of change increases as |β| increases.
The parameter $\beta_i$ refers to the effect of $x_i$ on the log odds that Y = 1, controlling
for the other $x_j$; for instance, $\exp(\beta_i)$ is the multiplicative effect on the odds of a
one-unit increase in $x_i$, at fixed levels of the other $x_j$.
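The relationships just described (probability, odds, logit, and the multiplicative effect exp(β)) can be checked numerically. The following is a minimal Python sketch; the values of `alpha` and `beta` are illustrative assumptions, not estimates from this study, and the helper names `inverse_logit` and `odds` are ours.

```python
import math

def inverse_logit(alpha, beta, x):
    """Probability pi(x) implied by logit[pi(x)] = alpha + beta * x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(p):
    """Odds p / (1 - p) of an event with probability p."""
    return p / (1.0 - p)

# Illustrative values only; they are not estimates from the paper.
alpha, beta = -1.0, 0.5

p_at_2 = inverse_logit(alpha, beta, 2.0)   # alpha + beta*2 = 0, so pi = 0.5
p_at_3 = inverse_logit(alpha, beta, 3.0)   # one unit higher in x

log_odds_at_2 = math.log(odds(p_at_2))     # recovers alpha + beta*2
odds_ratio = odds(p_at_3) / odds(p_at_2)   # one-unit effect on the odds

print(round(p_at_2, 4), round(odds_ratio, 4), round(math.exp(beta), 4))
```

The last two printed numbers agree, which is the statement that a one-unit increase in x multiplies the odds by exp(β).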
$$\log\frac{\pi_j(x_i)}{\pi_k(x_i)} = \alpha_{0i} + \beta_{1j}x_{1i} + \beta_{2j}x_{2i} + \dots + \beta_{pj}x_{pi},$$
Our data source was the file "Youthfile.sav" from the Youth Survey 2003
conducted by PCBS, together with the user guide, the survey questionnaire, and
the methodology book. According to this survey, the total number of persons aged
10-24 years in the Palestinian National Authority in 2003 was 1,189,282: 51.0%
male and 49.0% female, with 62.3% in the West Bank and 37.7% in the Gaza Strip.
Response variable
In reviewing the survey questionnaire, two questions directly related to physical
violence drew our attention. The first was "Have you been subjected to physical
violence (beating, burning, biting, pushing, etc.) during the last month?", with two
levels (yes/no). The second was "Who practiced physical violence against you?",
with 10 levels, that is, 10 kinds of people who exercised physical violence against
the youth: father, mother, sibling, wife, other relatives, teacher, employer, peer
(schoolmate, neighbor, etc.), Israeli forces, and others.
The aim of this analysis is not primarily to study the phenomenon of violence,
which is very important and needs a special study, but to apply our statistical
model, the MLR model, to categorical data. For this reason, we chose the available
data on physical violence from the Youth Survey 2003 according to our criteria
for sample size and frame mentioned earlier. Merging the two variables gave a
response variable with 11 levels. As the target population was children aged
10-14 years living in Gaza, wife and Israeli forces were excluded because these
levels had no frequencies. We focused on physical violence practiced on children
by the family (father, mother, sibling) and by the child's very close environment
(schoolmates, neighbors, and others). Because some of these levels had small
counts, and because of skewness in the response variable, we combined the
levels other relatives, teacher, employer, and other into one category, following
Takagi et al. (2007).
The skewness was 2.908 before merging and 1.887 after merging, with a standard
error of 0.009 in both cases. The response variable became "Have you been
subjected to physical violence during the last month, and who practiced physical
violence against you?", with four categories: 0-had not been, 1-father/mother,
2-sibling, 3-peer & other. We called the response variable "response physical
violence by". The frequencies of the response variable by category are shown in
Table 3.1.
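The recoding of the two survey questions into the four-category response can be sketched as a small mapping. This is a minimal Python illustration under stated assumptions: the string labels and the function name `recode_response` are hypothetical (the survey file uses its own numeric codes), and levels excluded from the analysis, such as wife and Israeli forces, are returned as missing.

```python
# Hypothetical text labels for the "who practiced violence" question;
# the actual PCBS file stores numeric codes that are not reproduced here.
PERPETRATOR_TO_CATEGORY = {
    "father": 1, "mother": 1,                      # 1-father/mother
    "sibling": 2,                                  # 2-sibling
    "peer": 3, "other relatives": 3,               # 3-peer & other
    "teacher": 3, "employer": 3, "others": 3,
}

def recode_response(subjected, perpetrator=None):
    """Merge the yes/no question and the perpetrator question into one
    response: 0-had not been, 1-father/mother, 2-sibling, 3-peer & other.
    Returns None for levels excluded from the analysis."""
    if subjected == "no":
        return 0
    return PERPETRATOR_TO_CATEGORY.get(perpetrator)

print(recode_response("no"),
      recode_response("yes", "mother"),
      recode_response("yes", "teacher"),
      recode_response("yes", "wife"))
```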
One set of questions asked "In your opinion, do the following behaviors exist
among youth in the locality where you live?":
Hh04-a "Alcohol consumption" with two categories (1-no/little, 2-yes
widely).
Hh04-b "Smoking" with two categories (1-no/little, 2-yes widely).
Hh04-c "Reckless driving" with two categories (1-no/little, 2-yes
widely).
Hh04-d "Drug abuse" with two categories (1-no/little, 2-yes
widely).
Hh04-e "Verbal violence" (e.g., harassment, swearing), with two
categories (1-no/little, 2-yes widely).
Hh04-f "Begging" with two categories (1-no/little, 2-yes widely).
Hh04-g "Assault on properties" (stealing, pillage, plundering), with
two categories (1-no/little, 2-yes widely).
Hh04-h "Physical violence" (beating, rape, etc.), with two
categories (1-no/little, 2-yes widely).
Note: In this set of variables we treated the answer "don't know" as system
missing, since this answer does not express an opinion.
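Treating "don't know" as missing, and then keeping only complete cases, can be sketched as follows. This is a minimal Python illustration, not the SPSS procedure used in the study: the numeric code `DONT_KNOW = 3`, the toy records, and the helper names are assumptions made for the example.

```python
DONT_KNOW = 3  # hypothetical code for "don't know"; the real code may differ
OPINION_VARS = ("hh04_a", "hh04_b", "hh04_d")  # a subset, for illustration

def to_missing(value):
    """Map the "don't know" code to None (missing); keep other answers."""
    return None if value == DONT_KNOW else value

def clean(record):
    """Return a copy of the record with "don't know" answers set to missing."""
    cleaned = dict(record)
    for var in OPINION_VARS:
        cleaned[var] = to_missing(cleaned[var])
    return cleaned

def is_complete(record):
    """A case is valid only if none of its values is missing."""
    return all(value is not None for value in record.values())

# Toy records, invented for the example.
records = [
    {"hh04_a": 1, "hh04_b": 2, "hh04_d": 1},
    {"hh04_a": 3, "hh04_b": 1, "hh04_d": 1},   # "don't know" on hh04_a
    {"hh04_a": 2, "hh04_b": 2, "hh04_d": 3},   # "don't know" on hh04_d
]
valid = [r for r in map(clean, records) if is_complete(r)]
print(len(valid), "valid of", len(records))
```

A single "don't know" answer removes the whole case from a complete-case analysis, which is how a handful of such variables can produce a large share of missing cases.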
Another set of questions concerned the child himself:
Hh01-a "How do you evaluate your physical health status" with two
categories (1-good, 2-moderate/poor).
Hh01-b "How do you evaluate your mental health status" with two
categories (1-good, 2-moderate/poor).
Hr04 "Sex" with two categories (1-male, 2-female).
Hh02 "Do you want your current weight to"; we merged the categories into
three only (1-remain as it is, 2-to decrease, 3-to increase).
Hr08 "Enrolled in education status", with two categories (1-currently
enrolled, 2-not enrolled now).
S01 "Free time you have" with three categories (1-little, 2-enough, 3-too
much).
Another set of questions concerned the family circumstances:
Loctype "Locality type" with two categories (1-urban, 2-camps).
Ir04 "Total number of household members" (numeric variable).
Hr07 "Refugee status" with two categories (1-refugee, 2-not refugee).
H05 "Current status of parents"; the variable had 7 categories (1-living
together, 2-divorced, 3-father is dead, 4-mother is dead, 5-both are dead,
6-one of them works abroad, 7-others), some of them without frequencies,
so we merged the categories into three: 1-living together, 2-one of
them dead, 3-divorced & others.
We used the same variable names and codes as the PCBS survey. Full
details of the explanatory variables are summarized in Table 3.2, which was
prepared using the Frequencies command of SPSS (Statistical Package for the
Social Sciences).
Table 4.1 is a portion of a large table containing all variables, the response
variable and the explanatory variables. We focus on the response variable: as we
see in Table 4.3, the number of valid observations used in our model is 33,423,
distributed among the four categories. The marginal percentage column lists the
proportion of valid observations found in each of the response variable's groups:
75.7% of the valid cases had not been subjected to physical violence, 8.6% had
been subjected to it by father/mother, 9.7% by a sibling, and 6.0% by peer and
other.
Subpopulation
Subpopulation indicates the number of subpopulations contained in the data. A
subpopulation consists of one combination of values of the explanatory
variables specified for the model. The SPSS footnote to Table 4.1 reports how
many of these combinations consist of records that all have the same value of
the response variable. In our model, 126 combinations appear in the data, and
125 of them are composed of records with the same response category.
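The two counts in that footnote can be computed directly from the data: group the records by their combination of explanatory values, then count the groups and the groups whose records all share one response value. The sketch below is a minimal Python illustration on invented toy records; the function name `subpopulation_summary` and the variable names are ours, not SPSS output.

```python
from collections import defaultdict

def subpopulation_summary(records, predictors, response):
    """Return (number of distinct predictor combinations,
    number of combinations whose records all share one response value)."""
    responses_by_combo = defaultdict(set)
    for rec in records:
        combo = tuple(rec[name] for name in predictors)
        responses_by_combo[combo].add(rec[response])
    n_subpopulations = len(responses_by_combo)
    n_single_response = sum(
        1 for seen in responses_by_combo.values() if len(seen) == 1
    )
    return n_subpopulations, n_single_response

# Toy records, invented for the example.
toy = [
    {"hr04": 1, "s01": 1, "y": 0},
    {"hr04": 1, "s01": 1, "y": 2},   # same combination, different response
    {"hr04": 2, "s01": 1, "y": 0},
    {"hr04": 2, "s01": 3, "y": 1},
    {"hr04": 2, "s01": 3, "y": 1},   # same combination, same response
]
print(subpopulation_summary(toy, ["hr04", "s01"], "y"))
```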
Missing
Missing indicates the number of cases in the dataset in which the response
variable or any of the explanatory variables is missing. In the primary model,
almost 50% of the cases were missing. Brannon et al. (2007) suggest that scales
with missing items can be calculated if at least two thirds of the items were
completed, with the others dropped. This model is still under checking, but for
some explanatory variables, such as alcohol consumption and smoking, the
category "I don't know" is not a valid answer in this situation. It was treated as
system missing, as this procedure will not affect the final result (Moorman and
Carr, 2008).
Table 4.2: Parameter estimates with standard errors greater than 2

Response physical violence by   B         Std. Error   Wald    df   Sig.   Exp(B)
1-father/mother
  H05=1                         14.510    160.461      0.008   1    .928   2001804.550
  H05=2                         -.652     190.610      0.0     1    .997   .521
  Hh04-b=1                      -14.034   46.734       0.090   1    .764   8.04E-007
2-sibling
  H05=1                         14.050    152.557      0.008   1    .927   1264463.302
  H05=2                         13.386    152.557      0.008   1    .930   650756.508
3-peer & other
  H05=1                         12.302    103.476      0.014   1    .905   220089.142
  H05=2                         -5.735    135.662      0.002   1    .966   .003
  Hh01-a=1                      35.119    46.309       0.575   1    .488   1786163…
  Hh02=2                        -32.769   34.640       0.895   1    .344   5.87E-015
  Hr08=1                        15.417    319.464      0.002   1    .962   4960497.108
Table 4.2 shows that five explanatory variables cause a numerical problem:
H05 "current status of parents", hh04-b "smoking among youth", hh01-a
"evaluation of physical health status", hh02 "current weight", and hr08 "enrolling
in education". Their parameter estimates give unreasonable results for a
one-unit change in the explanatory variable.
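The pattern behind this numerical problem is visible in the arithmetic: a very large standard error makes the Wald statistic, computed here as (B / Std. Error)², close to zero, while Exp(B) explodes or vanishes. The Python sketch below recomputes these quantities for three rows of Table 4.2 and, for contrast, one row of the final model; the cutoff of 2 is the rule described in the text, and the row selection and names are ours.

```python
import math

SE_CUTOFF = 2.0  # standard errors above this value are treated as a warning sign

# (label, B, standard error) copied from the tables in the text.
rows = [
    ("H05=1 (father/mother)", 14.510, 160.461),
    ("Hh04-b=1 (father/mother)", -14.034, 46.734),
    ("Hr08=1 (peer & other)", 15.417, 319.464),
    ("hr04=1 (father/mother, final model)", -1.163, 0.046),
]

def wald(b, se):
    """Wald chi-square statistic with 1 degree of freedom: (B / SE)^2."""
    return (b / se) ** 2

flagged = []
for label, b, se in rows:
    if se > SE_CUTOFF:
        flagged.append(label)
    print(f"{label:38s} Wald={wald(b, se):9.3f}  Exp(B)={math.exp(b):.3e}")
```

For the first three rows the recomputed Wald values round to 0.008, 0.090, and 0.002, matching the table; small differences for other rows are expected because B and the standard errors are printed to three decimals.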
A comparison of the four models showed that Model (2) fits the data best. It has
the highest overall classification percentage and includes 10 independent
variables. It also increased the number of valid cases and reduced the number of
missing cases. The R-square measure, which is usually influenced by the number
of variables, gave values comparable to those of the other models. For these
reasons, we selected Model (2) with the following 10 explanatory variables:
Hh01-b "Evaluation of mental health status", Hh04-a "Alcohol consumption",
Hh04-d "Drug abuse", Hh04-g "Assault on properties", Hh04-h "Physical
violence", Hr04 "Sex", Hh04-f "Begging", Ir04 "Total number of household
members", S01 "Free time you have", Hh04-e "Verbal violence".
Pseudo R-square
SPSS can calculate three pseudo R-square values for logistic regression
(Table 4.5). Pseudo R-square is not equivalent to R² in OLS regression (the
coefficient of determination). R² summarizes the proportion of variance in the
response variable associated with the explanatory variables; pseudo R-square
does not carry that meaning, but it can be used as an indicator in different areas
of application. The model with the largest pseudo R-square statistic is best
according to these measures; however,
Absence of Multicollinearity
Multicollinearity can occur in logistic regression: as the correlation among the
independent variables increases, the standard errors of the logit parameters
become inflated. Multicollinearity does not change the estimates of the
parameters, only their reliability (Garson, 2009). We first checked the standard
errors; variables with standard errors greater than 2 were dropped. We then
checked the asymptotic correlation matrix, which is the matrix of correlations
between parameter estimates. In this matrix, the majority of correlation
coefficients were less than 0.10, another four were between 0.20 and 0.28, and
only one was 0.54, which means that we do not have a serious multicollinearity
problem among the explanatory variables used in the model. The correlation
between total number of household members (ir04) and evaluation of mental
health status (hh01-b) is 0.221; between alcohol consumption among youth in
the locality (hh04-a) and drug abuse among youth in the locality (hh04-d), 0.208;
between drug abuse and assault on properties (hh04-g), 0.54; between assault
on properties and physical violence (hh04-h), 0.276; and between physical
violence and begging (hh04-f), 0.265.
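The screening of the reported correlations can be written as a short check. In this Python sketch the five coefficients are those quoted above; the cutoff of 0.8 is an assumption chosen only for illustration, since the text does not state a numeric threshold for a "serious" problem.

```python
# Correlations between parameter estimates, as reported in the text.
reported = {
    ("ir04", "hh01_b"): 0.221,
    ("hh04_a", "hh04_d"): 0.208,
    ("hh04_d", "hh04_g"): 0.54,
    ("hh04_g", "hh04_h"): 0.276,
    ("hh04_h", "hh04_f"): 0.265,
}

CUTOFF = 0.8  # hypothetical threshold; the paper does not give one

# Pairs at or above the cutoff would deserve a closer look.
flagged = [pair for pair, r in reported.items() if abs(r) >= CUTOFF]
largest_pair = max(reported, key=lambda pair: abs(reported[pair]))

print("flagged:", flagged)
print("largest:", largest_pair, reported[largest_pair])
```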
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) judge
a model by how close its fitted values tend to be to the true expected values, as
summarized by a certain expected distance between the two. The optimal model
is the one whose fitted values tend to be closest to the true outcome
probabilities. In our model, AIC, BIC, and -2 log likelihood are very close.
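One way to see why the three quantities can be close is to write out the standard definitions, AIC = -2 log L + 2k and BIC = -2 log L + k ln(n), and look at the size of the penalty terms. In the Python sketch below, n = 33,423 is the number of valid cases reported earlier and k = 36 is our count of non-redundant parameters in the selected model (12 per logit equation, 3 equations); the value of `neg2ll` is purely hypothetical because the fitted -2 log likelihood is not quoted in this section.

```python
import math

n_cases = 33423            # valid observations reported in the text
params_per_logit = 12      # intercept + ir04 + 8 binary indicators + 2 for s01
n_logits = 3               # four response categories, one is the baseline
k = params_per_logit * n_logits

neg2ll = 40000.0           # HYPOTHETICAL -2 log likelihood, for illustration only

aic = neg2ll + 2 * k
bic = neg2ll + k * math.log(n_cases)

print("k =", k)
print("AIC penalty:", aic - neg2ll)
print("BIC penalty:", round(bic - neg2ll, 2))
```

With these counts the penalties are 72 for AIC and roughly 375 for BIC, so whenever -2 log likelihood is in the tens of thousands the three figures differ by only a small fraction.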
Table (4.8): Likelihood ratio tests of the selected model
In Table 4.8, we tested each of the explanatory variables used to build the
model separately. The results supported the existence of a relationship between
each explanatory variable and the response variable.
We denote the probability that the child had not faced physical violence (baseline
category) by $\pi_0$ and its estimate by $\hat{\pi}_0$; the probability of physical violence by
father/mother by $\pi_1$ and its estimate by $\hat{\pi}_1$; the probability of physical violence by
a sibling by $\pi_2$ and its estimate by $\hat{\pi}_2$; and the probability of physical violence by
peer and other by $\pi_3$ and its estimate by $\hat{\pi}_3$. The response probabilities satisfy
$$\sum_{j=0}^{3} \pi_j = 1 ,$$
and our baseline category is (had not been).
First, we calculate $\log(\hat{\pi}_1/\hat{\pi}_0)$, $\log(\hat{\pi}_2/\hat{\pi}_0)$, and $\log(\hat{\pi}_3/\hat{\pi}_0)$. As the response
variable has four categories (J = 4), there are 3 equations, as follows. Let
$$y_1 = \log\frac{\hat{\pi}_1}{\hat{\pi}_0}, \qquad y_2 = \log\frac{\hat{\pi}_2}{\hat{\pi}_0}, \qquad y_3 = \log\frac{\hat{\pi}_3}{\hat{\pi}_0} .$$
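Second, we calculate $\hat{\pi}_1$, $\hat{\pi}_2$, $\hat{\pi}_3$, and $\hat{\pi}_0$ as follows, where exp denotes the
exponential function and e = 2.71828 is the base of the natural logarithms:
$$\hat{\pi}_1 = \frac{\exp(y_1)}{1 + \exp(y_1) + \exp(y_2) + \exp(y_3)} \tag{4.4}$$
$$\hat{\pi}_2 = \frac{\exp(y_2)}{1 + \exp(y_1) + \exp(y_2) + \exp(y_3)} \tag{4.5}$$
$$\hat{\pi}_3 = \frac{\exp(y_3)}{1 + \exp(y_1) + \exp(y_2) + \exp(y_3)} \tag{4.6}$$
$$\hat{\pi}_0 = \frac{1}{1 + \exp(y_1) + \exp(y_2) + \exp(y_3)} \tag{4.7}$$
where the term 1 in each denominator and in the numerator of $\hat{\pi}_0$ represents
$\exp(\hat{\alpha}_0 + \hat{\beta}_0 x)$ for $\hat{\alpha}_0 = \hat{\beta}_0 = 0$ (Agresti, 2007).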
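Equations (4.4) to (4.7) translate directly into a few lines of code. The following is a minimal Python sketch; the function name `category_probabilities` and the dictionary keys are our own choices, and the inputs in the usage line are illustrative rather than taken from the data.

```python
import math

def category_probabilities(y1, y2, y3):
    """Baseline-category probabilities from the three log-odds
    y_j = log(pi_j / pi_0), following equations (4.4)-(4.7)."""
    e1, e2, e3 = math.exp(y1), math.exp(y2), math.exp(y3)
    denom = 1.0 + e1 + e2 + e3      # the 1 is exp(0) for the baseline category
    return {
        "pi0": 1.0 / denom,         # had not been (baseline), eq. (4.7)
        "pi1": e1 / denom,          # father/mother, eq. (4.4)
        "pi2": e2 / denom,          # sibling, eq. (4.5)
        "pi3": e3 / denom,          # peer & other, eq. (4.6)
    }

# Illustrative input: all log-odds equal to zero gives four equal probabilities.
equal = category_probabilities(0.0, 0.0, 0.0)
print(equal)
```

Whatever the inputs, the four values are positive and sum to one, and the ratio of any category to the baseline returns exp(y_j).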
Response physical
violence by (a)    Parameter     B        Std. Error   Wald       df   Sig.   Exp(B)
1 Father/Mother    Intercept     1.381    .133         108.121    1    .000
                   ir04          -.116    .010         123.252    1    .000   .891
                   [hh01_b=1]    -.162    .049         10.789     1    .001   .850
                   [hh01_b=2]    0 (b)    .            .          0    .      .
                   [hh04_a=1]    -.244    .064         14.439     1    .000   .784
                   [hh04_a=2]    0 (b)    .            .          0    .      .
                   [hh04_d=1]    -1.895   .068         780.216    1    .000   .150
                   [hh04_d=2]    0 (b)    .            .          0    .      .
                   [hh04_g=1]    .465     .077         36.714     1    .000   1.592
                   [hh04_g=2]    0 (b)    .            .          0    .      .
                   [hh04_h=1]    -.445    .054         67.278     1    .000   .641
                   [hh04_h=2]    0 (b)    .            .          0    .      .
                   [hr04=1]      -1.163   .046         645.567    1    .000   .312
                   [hr04=2]      0 (b)    .            .          0    .      .
                   [hh04_f=1]    -.599    .053         129.126    1    .000   .549
                   [hh04_f=2]    0 (b)    .            .          0    .      .
                   [s01=1]       .677     .066         106.005    1    .000   1.967
                   [s01=2]       1.291    .060         455.276    1    .000   3.635
                   [s01=3]       0 (b)    .            .          0    .      .
                   [hh04_e=1]    -1.501   .050         888.151    1    .000   .223
                   [hh04_e=2]    0 (b)    .            .          0    .      .
2 Sibling          Intercept     -1.954   .174         126.245    1    .000
                   ir04          -.044    .010         17.415     1    .000   .957
                   [hh01_b=1]    1.069    .060         314.681    1    .000   2.913
                   [hh01_b=2]    0 (b)    .            .          0    .      .
                   [hh04_a=1]    .940     .090         110.122    1    .000   2.561
                   [hh04_a=2]    0 (b)    .            .          0    .      .
                   [hh04_d=1]    1.908    .096         396.509    1    .000   6.737
                   [hh04_d=2]    0 (b)    .            .          0    .      .
                   [hh04_g=1]    -2.326   .061         1468.420   1    .000   .098
                   [hh04_g=2]    0 (b)    .            .          0    .      .
                   [hh04_h=1]    -1.620   .046         1215.457   1    .000   .198
                   [hh04_h=2]    0 (b)    .            .          0    .      .
                   [hr04=1]      -1.964   .050         1556.703   1    .000   .140
                   [hr04=2]      0 (b)    .            .          0    .      .
                   [hh04_f=1]    .843     .057         218.119    1    .000   2.323
                   [hh04_f=2]    0 (b)    .            .          0    .      .
                   [s01=1]       -.020    .049         .169       1    .681   .980
                   [s01=2]       -.539    .052         108.999    1    .000   .583
                   [s01=3]       0 (b)    .            .          0    .      .
                   [hh04_e=1]    -.069    .045         2.300      1    .129   .933
                   [hh04_e=2]    0 (b)    .            .          0    .      .
3 Peer & other     Intercept     .145     .146         .980       1    .322
                   ir04          .082     .011         59.833     1    .000   1.085
                   [hh01_b=1]    -1.753   .058         899.007    1    .000   .173
                   [hh01_b=2]    0 (b)    .            .          0    .      .
                   [hh04_a=1]    -2.417   .080         923.704    1    .000   .089
                   [hh04_a=2]    0 (b)    .            .          0    .      .
                   [hh04_d=1]    -1.192   .078         232.146    1    .000   .304
                   [hh04_d=2]    0 (b)    .            .          0    .      .
                   [hh04_g=1]    -.862    .077         126.518    1    .000   .422
                   [hh04_g=2]    0 (b)    .            .          0    .      .
                   [hh04_h=1]    1.983    .100         395.992    1    .000   7.262
                   [hh04_h=2]    0 (b)    .            .          0    .      .
                   [hr04=1]      1.195    .064         347.519    1    .000   3.305
                   [hr04=2]      0 (b)    .            .          0    .      .
                   [hh04_f=1]    -.536    .064         70.418     1    .000   .585
                   [hh04_f=2]    0 (b)    .            .          0    .      .
                   [s01=1]       -.886    .063         199.821    1    .000   .412
                   [s01=2]       -1.853   .068         733.513    1    .000   .157
                   [s01=3]       0 (b)    .            .          0    .      .
                   [hh04_e=1]    .839     .061         192.266    1    .000   2.315
                   [hh04_e=2]    0 (b)    .            .          0    .      .
a. The reference category is: 0 Had not been.
b. This parameter is set to zero because it is redundant.
Using equations (4.4), (4.5), (4.6), and (4.7), we can calculate the estimated
probability of falling in each category as follows:
$$\hat{\pi}_1 = \frac{\exp(-0.395)}{1 + \exp(-0.395) + \exp(0.542) + \exp(-4.806)} = 0.1980$$
$$\hat{\pi}_2 = \frac{\exp(0.542)}{1 + \exp(-0.395) + \exp(0.542) + \exp(-4.806)} = 0.5055$$
$$\hat{\pi}_3 = \frac{\exp(-4.806)}{1 + \exp(-0.395) + \exp(0.542) + \exp(-4.806)} = 0.0024$$
$$\hat{\pi}_0 = \frac{1}{1 + \exp(-0.395) + \exp(0.542) + \exp(-4.806)} = 0.2940$$
These probabilities show that case number 12049 has a probability of 0.198 of
falling in the category of children who faced physical violence by father/mother,
0.5055 by a sibling, 0.0024 by peer and other, and 0.2940 of not having faced
physical violence. The conclusion is that physical violence by a sibling has the
largest probability among the four categories for this child.
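The figures for case 12049 can be reproduced numerically. This Python sketch assumes the three log-odds are y1 = -0.395, y2 = 0.542, and y3 = -4.806, which are the signs that reproduce the reported probabilities; the variable names are ours.

```python
import math

# Log-odds against the baseline for case 12049 (signs chosen so that the
# result matches the probabilities reported in the text).
y = {"father/mother": -0.395, "sibling": 0.542, "peer & other": -4.806}

denom = 1.0 + sum(math.exp(v) for v in y.values())
probs = {name: math.exp(v) / denom for name, v in y.items()}
probs["had not been"] = 1.0 / denom          # baseline category

for name, p in probs.items():
    print(f"{name:14s} {p:.4f}")

# The predicted group is the category with the largest probability.
predicted = max(probs, key=probs.get)
print("predicted:", predicted)
```

The results agree with the reported values to within rounding in the fourth decimal, and the largest probability belongs to the sibling category.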
$$y_3 = \log\frac{\hat{\pi}_3}{\hat{\pi}_0} = 0.082(4) - 1.753(0) - 2.417(1) - 1.192(0) - 0.862(0) + 1.983(0) + 1.195(0) - 0.536(0) - 0.886(1) - 1.853(0) + 0.839(0) = -2.975$$
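The linear predictor above is a plain sum of coefficient-times-value terms, where each categorical variable contributes its coefficient only when its indicator equals 1. The Python sketch below adds the slope terms exactly as they are printed in the equation (coefficients from the peer & other panel, values for this case); the list layout and names are ours.

```python
# (coefficient, value) pairs in the order printed in the equation for y3.
terms = [
    (0.082, 4),    # ir04: total number of household members
    (-1.753, 0),   # [hh01_b=1]
    (-2.417, 1),   # [hh04_a=1]
    (-1.192, 0),   # [hh04_d=1]
    (-0.862, 0),   # [hh04_g=1]
    (1.983, 0),    # [hh04_h=1]
    (1.195, 0),    # [hr04=1]
    (-0.536, 0),   # [hh04_f=1]
    (-0.886, 1),   # [s01=1]
    (-1.853, 0),   # [s01=2]
    (0.839, 0),    # [hh04_e=1]
]

y3 = sum(coef * value for coef, value in terms)
print(round(y3, 3))
```

Only three terms are non-zero here (0.328, -2.417, and -0.886), and their sum matches the value printed in the text.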
Using equations (4.4), (4.5), (4.6), and (4.7), we can calculate the estimated
probability of falling in each category as follows:
$$\hat{\pi}_1 = \frac{\exp(1.35)}{1 + \exp(1.35) + \exp(-1.729) + \exp(-2.975)} = 0.7584$$
$$\hat{\pi}_2 = \frac{\exp(-1.729)}{1 + \exp(1.35) + \exp(-1.729) + \exp(-2.975)} = 0.0349$$
$$\hat{\pi}_3 = \frac{\exp(-2.975)}{1 + \exp(1.35) + \exp(-1.729) + \exp(-2.975)} = 0.0100$$
$$\hat{\pi}_0 = \frac{1}{1 + \exp(1.35) + \exp(-1.729) + \exp(-2.975)} = 0.1966$$
These probabilities show that case number 13102 has a probability of 0.7584 of
falling in the category of children who faced physical violence by father/mother,
0.0349 by a sibling, 0.0100 by peer and other, and 0.1966 of not having faced
physical violence. The conclusion is that physical violence by father/mother has
the largest probability among the four categories for this child.
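As a consistency check in the opposite direction, the log-odds can be recovered from the reported probabilities, since y_j = log(π̂_j / π̂_0). The Python sketch below does this for case 13102 using the four probabilities quoted above; because those are rounded to four decimals, the recovered log-odds agree with 1.35, -1.729, and -2.975 only approximately.

```python
import math

# Estimated probabilities for case 13102, as reported in the text.
reported = {
    "had not been": 0.1966,      # baseline category
    "father/mother": 0.7584,
    "sibling": 0.0349,
    "peer & other": 0.0100,
}

baseline = reported["had not been"]
log_odds = {
    name: math.log(p / baseline)
    for name, p in reported.items()
    if name != "had not been"
}

for name, value in log_odds.items():
    print(f"{name:14s} {value:+.3f}")

# Classification rule: assign the case to the most probable category.
predicted = max(reported, key=reported.get)
print("predicted:", predicted)
```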
V. Conclusion
We have reviewed the results of the model and carried out several tests to make
sure that the model fits the data in statistical terms. We have also reviewed the
parameter estimates and interpreted them on the odds ratio scale. Likelihood
ratio tests showed that all explanatory variables were significant, but the effect
and contribution of each variable were not the same, so the variables were
ranked according to their effect on the model. "Sex" was the most significant
variable, followed by "Spread of physical violence among young people",
"Spread of drug abuse", "Free time the children had", "Assault on properties
among the youth", "Mental status", "Spread of verbal violence", "Spread of
alcohol among youth", "Spread of begging", and finally "Total number of
household members". The predictive ability of the model was checked by
choosing two cases from the data at random and using the model to predict the
group of the response variable into which each case would be classified. The
model was successful in one classification.
The main conclusions can be summarized in several points:
1. The MLR model gives us the opportunity to deal with a categorical
response variable with more than two levels and a variety of
explanatory variables.
2. MLR indicates the effect of each explanatory variable, as well as the
additive effect of the variables when they are used in the analysis
simultaneously, which is what we aimed for in studying this model.
3. MLR enables building a statistical model that shows complex and
interrelated relationships, particularly when we are dealing with a qualitative
response variable that has more than two categories. The fitted equations
measure the effect of each explanatory variable, and variables that were not
statistically significant were excluded.
4. The MLR model has also proved its ability to predict, reaching a
classification accuracy of 80.7% in our model.
5. The model will help researchers who study the subject of physical
violence by giving them an idea of the importance and effects of the
variables; the effects calculated from different models can also be
compared if similar variables are used.
6. The logistic regression model is suitable for many types of data
when the response variable has more than two categories. MLR places no
restrictions on the explanatory variables, and it is among the most
common models in categorical data analysis. MLR can be used in many areas,
including social, educational, health, and behavioral research and even
scientific experiments.
References
1. Agresti, A. (2007). An Introduction to Categorical Data Analysis. John
Wiley & Sons, Inc.
2. Agresti, A. (2002). Categorical Data Analysis. John Wiley & Sons, Inc.
3. Brannon, D., Barry, T., Kemper, P., Schreiner, A., and Vasey, J. (2007).
Job Perception and Intent to Leave Among Direct Care Workers: Evidence
from the Better Jobs Better Care Demonstrations. The Gerontologist
47:820-829, The Gerontological Society of America.
4. Chatterjee, S., and Hadi, A. (2006). Regression Analysis by Example.
John Wiley & Sons.
5. Garson, D. (2009). Logistic Regression with SPSS. North Carolina State
University, Public Administration Program.
6. Moorman, S.M., and Carr, D. (2008). Spouses' Effectiveness as End-of-Life
Health Care Surrogates: Accuracy, Uncertainty, and Errors of
Overtreatment or Undertreatment. The Gerontologist 48:811, The
Gerontological Society of America.
7. PCBS (2003). User guide, Youth Survey 2003. Palestinian Central Bureau
of Statistics.
8. Schafer, J.L. (2006). Multinomial logistic regression models. STAT 544 -
Lecture 19.
9. Schwab, J. (2007). Multinomial Logistic Regression Basic
Relationships. www.utexas.edu/courses/schwab/sw388r7/SolvingProblems
/Analyzi.
10. Takagi, E., Silverstein, M., and Crimmins, E. (2007). Intergenerational
Coresidence of Older Adults in Japan: Conditions for Cultural Plasticity.
The Journals of Gerontology Series B: Psychological Sciences and Social
Sciences 62:S330-S339, The Gerontological Society of America.