Binary Logistic Regression Concept
Submitted by
Siddhanta Subedi
Submitted to
INTRODUCTION
Binary logistic regression is a type of regression analysis that is used to estimate the
relationship between a dichotomous dependent variable and dichotomous-, interval-, and
ratio-level independent variables. Many different variables of interest are dichotomous – e.g.,
whether or not someone voted in the last election, whether or not someone is a smoker,
whether or not one has a child, whether or not one is unemployed, etc. These types of
variables are often referred to as discrete or qualitative. Many discrete or qualitative variables
can be thought of as events. Dichotomous or dummy variables are usually coded 1, indicating
“success” or “yes,” and 0, indicating “failure” or “no.” The mean of a dichotomous variable
coded 1 and 0 is equal to the proportion of cases coded as 1, which can also be interpreted as
a probability.
For example: deciding whether or not to offer a loan to a bank customer (outcome = yes or
no), evaluating the risk of cancer (outcome = high or low), and predicting a team's win in a
football match (outcome = yes or no).
Logistic regression analyzes the relationship between one or more independent variables and
classifies data into discrete classes. It is extensively used in predictive modeling, where the
model estimates the mathematical probability of whether an instance belongs to a specific
category or not.
The logit, $g(x) = \ln\left[\frac{\pi(x)}{1-\pi(x)}\right] = \beta_0 + \beta_1 x$, is linear in its
parameters, may be continuous, and may range from $-\infty$ to $+\infty$, depending on the range of x.
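For an observation $(x_i, y_i)$, where $\pi(x_i)$ is the conditional probability that $y_i = 1$
given $x_i$, the contribution to the likelihood function is
$$\pi(x_i)^{y_i}\,[1-\pi(x_i)]^{1-y_i}$$
Since the observations are assumed to be independent, the likelihood function is expressed as:
$$l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1-\pi(x_i)]^{1-y_i}$$
$$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(x_i) + (1-y_i) \ln[1-\pi(x_i)] \right\}$$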
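The log-likelihood above can be evaluated numerically. The following is a minimal sketch, assuming a single predictor and the logistic form for the probability; the function names and the five toy records are hypothetical and are not the study data.

```python
import math

def logistic(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta0, beta1, xs, ys):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p) over independent observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(beta0 + beta1 * x)  # P(y = 1 | x) under the model
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Hypothetical toy records (illustration only, not the study data).
toy_ages = [25, 35, 45, 55, 65]
toy_chd = [0, 0, 1, 0, 1]

# With both coefficients at zero every probability is 0.5,
# so the log-likelihood is 5 * ln(0.5).
ll_at_zero = log_likelihood(0.0, 0.0, toy_ages, toy_chd)
print(ll_at_zero)
```

Maximum likelihood estimation searches for the coefficient values that make this quantity as large as possible.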
TEST OF SIGNIFICANCE
Under the null hypothesis H0 that $\beta_1$ is equal to zero, the G statistic follows a chi-square
distribution with 1 degree of freedom, where the G statistic is given by:
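$$G = -2 \ln\left[\frac{\text{likelihood without the variable}}{\text{likelihood with the variable}}\right]$$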
W=
exp
And exp
$$P(Z) = \frac{1}{1+e^{-Z}}$$
where $Z = \beta_0 + \beta_1 x$.
CHD      Frequency   Percentage
No              57         57.0
Yes             43         43.0
Total          100        100.0
From the above table, out of 100 participants, 43 have CHD and 57 do not.
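As a quick check of the point that the mean of a 0/1 coded variable equals the proportion coded 1, the frequencies above can be expanded into a coded variable. This is only an illustrative sketch; the variable names are assumptions.

```python
# CHD status coded 1 = yes, 0 = no, built from the frequency table counts.
chd = [1] * 43 + [0] * 57

mean_chd = sum(chd) / len(chd)            # arithmetic mean of the 0/1 codes
proportion_yes = chd.count(1) / len(chd)  # share of participants coded 1

# Both quantities are the same number, 43/100.
print(mean_chd, proportion_yes)
```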
For the independent variable, the minimum age of the participants is 20 years and the maximum
age is 69 years. The median age of the participants was found to be 44 years, with a
mean age of 44.38 years.
Figure: Scatter plot between age and CHD (x-axis: Age; y-axis: CHD)
Here, we are interested in the association between the age of the participants and the presence
or absence of CHD in the study population. So, we draw a scatter plot with the outcome
variable CHD versus age as the independent variable. The scatter plot of the data is
shown in the figure above.
In the above scatter plot, all points fall on two parallel lines representing the absence of CHD
(y=0) and the presence of CHD (y=1). The plot clearly shows the dichotomous
nature of the outcome variable, but it does not give a clear
picture of the relationship between CHD and age.
Thus, the outcome variable, i.e., the status of CHD, is dichotomous (binary), and there is no
linear relationship between the predictor variable (age) and the response variable (outcome). So,
we choose to use the logistic regression model. The conditional mean of the dichotomous
variable and the range of the logistic function both lie between 0 and 1.
Since the scatter plot does not explain much about the relationship
between CHD and age (the predictor variable), we create intervals for the independent
variable age and calculate the frequency and the proportion of participants having CHD in each age
group.
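The grouping step described above can be sketched as follows. The age intervals are the ones used in the report; the helper names and the eight toy records are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict

# Age intervals (inclusive bounds) used for grouping in the report.
AGE_GROUPS = [(20, 29), (30, 34), (35, 39), (40, 44),
              (45, 49), (50, 54), (55, 59), (60, 69)]

def age_group(age):
    """Return the label of the interval that contains the given age."""
    for low, high in AGE_GROUPS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the defined groups")

def chd_proportion_by_group(ages, chd):
    """Map each group label to (n, number with CHD, proportion with CHD)."""
    counts = defaultdict(lambda: [0, 0])  # label -> [n, number with CHD]
    for age, y in zip(ages, chd):
        label = age_group(age)
        counts[label][0] += 1
        counts[label][1] += y
    return {label: (n, k, k / n) for label, (n, k) in counts.items()}

# Hypothetical toy records (illustration only, not the study data).
toy_ages = [22, 27, 31, 33, 47, 48, 62, 65]
toy_chd = [0, 0, 0, 1, 1, 0, 1, 1]

group_table = chd_proportion_by_group(toy_ages, toy_chd)
for label, (n, k, prop) in group_table.items():
    print(label, n, k, prop)
```

Each group's proportion is the mean of the 0/1 outcome within that group, which is why plotting it against the age group gives an estimate of the conditional mean of CHD given age.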
Figure: Proportion of having CHD by age group (x-axis: Age group, 20-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-69; y-axis: Proportion of having CHD)
From the graph, we can infer a reasonably clear functional
relationship between the proportion of CHD and age. With a dichotomous outcome variable,
the conditional mean must be greater than or equal to zero and less than or equal to one,
which can be seen in the figure above. The curve is said to be S-shaped and resembles a plot of
the cumulative distribution of a continuous random variable. The model we use is based on the
logistic distribution. Hence the logistic model is given mathematically by
$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
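From the table we have the estimated values of the coefficients $\hat\beta_0$ and $\hat\beta_1$. Also, the Wald statistic for
age (W) is 21.254, which is significant at the 5% level of significance (the p-value is reported as 0.000, i.e. p < 0.001 < 0.05). The fitted
values are given by the equation
$$\hat\pi(x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x}}{1 + e^{\hat\beta_0 + \hat\beta_1 x}}$$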
The log-likelihood, computed using the estimates given in the table, is -53.677.
Also, the log-likelihood for the model containing only the constant term is:
$$n_1 \ln(n_1) + n_0 \ln(n_0) - n \ln(n) = 43 \ln(43) + 57 \ln(57) - 100 \ln(100) = -68.331$$
where $n_0 = 57$, $n_1 = 43$ and $n = 100$.
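The two log-likelihoods can be combined into the G statistic from the test of significance section. The sketch below is an illustration under stated assumptions: the fitted model's log-likelihood is taken as -53.677 (a log-likelihood for binary data cannot be positive), and the 1-degree-of-freedom chi-square tail probability is obtained from the standard normal relation using `math.erfc`. The variable names are hypothetical.

```python
import math

n0, n1 = 57, 43   # participants without and with CHD
n = n0 + n1

# Log-likelihood of the model containing only the constant term.
ll_constant_only = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)

# Log-likelihood of the fitted model with age (assumed sign: negative).
ll_with_age = -53.677

# G = -2 ln(likelihood without the variable / likelihood with the variable)
G = -2.0 * (ll_constant_only - ll_with_age)

# For 1 degree of freedom, P(chi-square > G) = erfc(sqrt(G / 2)).
p_value = math.erfc(math.sqrt(G / 2.0))

print(round(ll_constant_only, 3), round(G, 2), p_value)
```

G comes out at roughly 29.31, far in the upper tail of a chi-square distribution with 1 degree of freedom, which agrees with the Wald test in pointing to age as a significant predictor.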
Chi-square   p-value
0.890        0.999
Here, the Hosmer-Lemeshow statistic is 0.890 with a p-value of 0.999 > 0.05, so at the 5% level
of significance we fail to reject the null hypothesis. This means that the model fits the
given data well.
Odds Ratio:
$$OR = e^{\hat\beta_1}$$
From the calculation table we have an odds ratio of 1.117, which means that the odds of having
coronary heart disease (CHD) are multiplied by 1.117 (an increase of about 11.7%) for each one-unit
increase in the independent variable (in this case, one year of age).
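Because the odds ratio is the exponential of the age coefficient, it can be rescaled to age differences other than one year. A minimal sketch, assuming only the reported odds ratio of 1.117 per year; the function name is hypothetical.

```python
import math

odds_ratio_per_year = 1.117             # reported odds ratio for age
beta1 = math.log(odds_ratio_per_year)   # implied coefficient of age

def odds_ratio_for_difference(years):
    """Odds ratio comparing two ages that differ by the given number of years."""
    return math.exp(beta1 * years)

or_10_years = odds_ratio_for_difference(10)
print(round(beta1, 3), round(or_10_years, 2))
```

Under this model, a 10-year difference in age multiplies the odds of CHD by about 3, since the per-year odds ratios compound multiplicatively rather than add.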
Fitting Logistic Regression model with AGE groups:
Logistic regression is performed between the status of CHD and the age groups of the
patients, taking the age group 20-29 as the reference category. The following results are
obtained.
Since the Nagelkerke R2 is found to be 0.335, about 33.5% of the variation in the status of
CHD in patients is explained by the age of the patients, as measured by this pseudo R2.
Also, the Hosmer-Lemeshow test is performed to assess how well the model fits the
observed data. The chi-square statistic is 0.000 with a p-value of
1.00 (> 0.05), so we fail to reject the null hypothesis and conclude that the model
fits the data well.
Fitting logistic regression between status of CHD and Age groups with reference age group
20-29
From the SPSS results shown in the above table, we can fit the logistic model as:
$$\ln\left[\frac{\pi}{1-\pi}\right] = -2.197 + 0.325\,\text{age(30-34)} + 1.099\,\text{age(35-39)} + 1.504\,\text{age(40-44)} + 2.043\,\text{age(45-49)} + 2.078\,\text{age(50-54)} + 3.376\,\text{age(55-59)} + 3.584\,\text{age(60-69)}$$
The table also shows that the estimated regression coefficients of the age groups 50-54 years, 55-59
years and 60-69 years are significant at the 5% level of significance, with p-values of 0.035, 0.005 and 0.007
respectively. The standard errors and the values of the Wald test are also given in the table.
Furthermore, the odds ratios are given in the Exp(B) column, and all of them exceed 1,
which suggests that the odds of developing CHD in each age group are higher than in the
reference age group (20-29 years) by a factor equal to the corresponding odds ratio.
For example, the odds ratio for the age group 30-34 years is 1.385, which suggests that the odds of
developing CHD in the age group 30-34 years are 1.385 times the odds in the reference group of 20-29 years,
and so on. The highest odds of developing CHD are in the age group 60-69 years,
at about 36 times the odds in the reference age group.
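The odds ratios discussed above are the exponentials of the fitted coefficients, and the same coefficients give a predicted probability for each age group. The sketch below uses the intercept and coefficients from the fitted equation; the dictionary and function names are hypothetical, and small differences from the SPSS Exp(B) column can arise because the coefficients are rounded to three decimals.

```python
import math

INTERCEPT = -2.197  # logit for the reference age group 20-29

# Fitted coefficients for each age group relative to 20-29.
COEFFICIENTS = {
    "30-34": 0.325, "35-39": 1.099, "40-44": 1.504, "45-49": 2.043,
    "50-54": 2.078, "55-59": 3.376, "60-69": 3.584,
}

# Odds ratio of each group versus the reference group: exp(coefficient).
odds_ratios = {group: math.exp(b) for group, b in COEFFICIENTS.items()}

def predicted_probability(group):
    """Predicted probability of CHD; the reference group has coefficient 0."""
    logit = INTERCEPT + COEFFICIENTS.get(group, 0.0)
    return 1.0 / (1.0 + math.exp(-logit))

for group in ["20-29"] + list(COEFFICIENTS):
    ratio = odds_ratios.get(group, 1.0)
    print(group, round(ratio, 3), round(predicted_probability(group), 3))
```

Note that an odds ratio of about 36 for the 60-69 group compares odds, not probabilities: the predicted probability rises from about 0.10 in the reference group to about 0.80, which is an eightfold increase in probability.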
Now, the predicted model with age as the independent variable and CHD as the response variable is
given by $\hat\pi = e^{\hat g}/(1 + e^{\hat g})$, where $\hat g$ is the fitted logit above.
CONCLUSION
From the above calculations and results, we fitted the logistic model to the given data with
the status of CHD as the dependent variable and age as the independent variable, and found
that age is a significant variable in predicting CHD.