Module02 LogisticRegression
Source: An Introduction to Generalized Linear Models, A.J. Dobson, A.G. Barnett (Ch 5)
Inference
• Specify a model 𝑀0 as the null hypothesis 𝐻0 and a more general model
(with more terms) 𝑀1 as 𝐻1
• Fit 𝑀0 and calculate a goodness-of-fit statistic 𝐺0 (e.g. the
maximized likelihood); do the same for 𝑀1 to get 𝐺1
• Compute the improvement in fit, 𝐺1 − 𝐺0 or 𝐺1/𝐺0, and compare it with
the corresponding sampling distribution
• Test the null hypothesis that 𝑀1 fits no better than 𝑀0 (𝐺1 = 𝐺0); if it is
not rejected, 𝑀0 is the preferred model, otherwise 𝑀1 is the preferred model
Log-Likelihood Ratio Statistic
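• A model with the maximum possible number of parameters is called a saturated
model.
• If there are $N$ observations $Y_1, \ldots, Y_N$, all with potentially different values for the linear
component $\mathbf{x}_i^T \boldsymbol{\beta}$, then a saturated model can be specified with $N$ parameters. This is
also called the maximal or full model.
• Let $\boldsymbol{\beta}_m$ be the parameter vector for the full model and let $\mathbf{b}_m$ be the maximum
likelihood estimator of $\boldsymbol{\beta}_m$
• Let $\mathbf{b}$ be the maximum likelihood estimate of the parameter vector for the model of interest. The likelihood ratio is
given by
$$\lambda = \frac{L(\mathbf{b}_m, \mathbf{y})}{L(\mathbf{b}, \mathbf{y})}$$
• $\log\lambda = \log L(\mathbf{b}_m, \mathbf{y}) - \log L(\mathbf{b}, \mathbf{y}) = l(\mathbf{b}_m, \mathbf{y}) - l(\mathbf{b}, \mathbf{y})$
• $2\log\lambda$ is (approximately) chi-squared distributed, so it is the commonly used statistic, called the
Deviance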
Deviance
• The log-likelihood ratio $\log\lambda$ can be used for model comparison and
hypothesis testing because $2\log\lambda$ is (approximately) chi-squared distributed
• The Deviance, or log-likelihood ratio statistic, is given by
$$D = 2\,[\,l(\mathbf{b}_m, \mathbf{y}) - l(\mathbf{b}, \mathbf{y})\,] \sim \chi^2(m - p, \nu)$$
where $m$ is the number of parameters in the full model and $p$ is the
number of parameters in the model of interest.
• The constant $\nu$ is the non-centrality parameter, which is close to 0 if
the model of interest fits the data almost as well as the full model.
• So, to test whether the model of interest is adequate, compare the Deviance
with the central $\chi^2(m - p)$ distribution; a value far in the upper tail indicates a poor fit.
• The Deviance forms the basis of hypothesis tests for most GLMs.
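The definition above can be checked numerically. The following is a minimal R sketch on simulated Poisson data (the data, seed, and variable names are illustrative assumptions, not part of the course material). It builds the saturated log-likelihood by setting each fitted mean equal to its observation, then compares the resulting deviance with the value R reports.

```r
# Simulated data: purely illustrative, not from the course
set.seed(1)
n <- 50
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.2 * x))

# Model of interest with p = 2 parameters
fit <- glm(y ~ x, family = poisson)

# l(b, y): maximized log-likelihood of the model of interest
l_model <- as.numeric(logLik(fit))

# l(b_m, y): saturated model, one parameter per observation, so mu_i = y_i
l_sat <- sum(dpois(y, lambda = y, log = TRUE))

# Deviance D = 2 * [l(b_m, y) - l(b, y)]
D <- 2 * (l_sat - l_model)

# Should agree with R's residual deviance
same_as_r <- isTRUE(all.equal(D, deviance(fit)))

# Upper-tail probability under the central chi-squared with m - p = N - 2 df
p_value <- pchisq(D, df = n - length(coef(fit)), lower.tail = FALSE)
```

The chi-squared reference distribution is an approximation; for ungrouped binary responses it is generally considered unreliable as an absolute goodness-of-fit test, which is one reason deviance differences between nested models are used more often.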
Deviance of a Normal Model
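• $E(Y_i) = \mu_i = \mathbf{x}_i^T \boldsymbol{\beta}$ where $Y_i \sim N(\mu_i, \sigma^2)$; $i = 1, 2, \ldots, N$
• Log likelihood
$$l(\boldsymbol{\beta}, \mathbf{y}) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mu_i)^2 - \frac{1}{2} N \log(2\pi\sigma^2)$$
• For a saturated model, all $\mu_i$ are different, so $\boldsymbol{\beta}$ has $N$ elements
• $\frac{\partial l}{\partial \mu_i} = 0$ will result in $\hat{\mu}_i = y_i$
• So the saturated model log likelihood is
$$l(\mathbf{b}_m, \mathbf{y}) = -\frac{1}{2} N \log(2\pi\sigma^2)$$
• For any other model with $p$ parameters, $p < N$, $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
are the estimated values
• $\hat{y}_i = \mathbf{x}_i^T \hat{\boldsymbol{\beta}}$
• $$l(\hat{\boldsymbol{\beta}}, \mathbf{y}) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}})^2 - \frac{1}{2} N \log(2\pi\sigma^2)$$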
Deviance of a Normal Model
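• Deviance
$$D_p = 2\,[\,l(\mathbf{b}_m, \mathbf{y}) - l(\hat{\boldsymbol{\beta}}, \mathbf{y})\,] = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}})^2$$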
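• In the case where there is only one parameter $\mu$, $E(Y_i) = \mu$ for all $i$
• $\mathbf{X}$ is a vector of $N$ ones; $\hat{\mu} = \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$
• Deviance of the null model
$$D_0 = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \bar{y})^2$$
• This statistic is closely related to the sample variance
$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2 = \frac{\sigma^2 D_0}{N-1}$$
• $D_p \sim \chi^2(N - p)$: deviance of a model with $p$ parameters
• $D_0 \sim \chi^2(N - 1)$: deviance of the null model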
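The normal-model results can be verified with a short R sketch. It assumes simulated data and treats $\sigma$ as known (both are assumptions made only for illustration). The sketch computes the deviance from the two log-likelihoods, checks it against the scaled residual sum of squares, and confirms the link between the null deviance and the sample variance.

```r
# Simulated data with a known sigma: illustrative assumptions only
set.seed(2)
N <- 40
sigma <- 1.5
x <- rnorm(N)
y <- 2 + 0.8 * x + rnorm(N, sd = sigma)

# Least-squares / maximum likelihood estimate b = (X'X)^{-1} X'y
X <- cbind(1, x)
b <- solve(t(X) %*% X, t(X) %*% y)
y_hat <- as.vector(X %*% b)

# Normal log-likelihood for a given vector of fitted means (sigma known)
loglik <- function(mu) sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))

# Saturated model has mu_i = y_i; model of interest has mu_i = y_hat_i
D_p <- 2 * (loglik(y) - loglik(y_hat))

# Closed form: residual sum of squares divided by sigma^2
D_p_formula <- sum((y - y_hat)^2) / sigma^2

# Null model (intercept only) and its relation to the sample variance
D_0 <- sum((y - mean(y))^2) / sigma^2
S2 <- sigma^2 * D_0 / (N - 1)
```

Here `D_p` and `D_p_formula` should match up to floating-point error, and `S2` should equal `var(y)`.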
Nested or Hierarchical Models
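• Consider a model $M_0$ with a smaller number $q$ of parameters and a more general
model $M_1$ with $p$ parameters, where $q < p < N$ and $N$ is the number of
observations.
• $H_0: \boldsymbol{\beta} = \boldsymbol{\beta}_0 = (\beta_1, \beta_2, \ldots, \beta_q)^T$; $H_1: \boldsymbol{\beta} = \boldsymbol{\beta}_1 = (\beta_1, \beta_2, \ldots, \beta_p)^T$
• $H_0$ against $H_1$ can be tested using the difference in deviance statistics
$$\Delta D = D_0 - D_1 = 2\,[\,l(\mathbf{b}_1, \mathbf{y}) - l(\mathbf{b}_0, \mathbf{y})\,] \sim \chi^2(p - q)$$
• If $\Delta D$ is consistent with $\chi^2(p - q)$, generally choose the smaller model $M_0$ because it is simpler;
if $\Delta D$ lies far in the upper tail, $M_1$ fits significantly better. Judgement should still be used in
deciding which factors are important.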
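As an illustration of the deviance-difference test, here is a hedged R sketch with two nested logistic regression models fitted to simulated data (the predictors `x1`, `x2` and all coefficients are made up for the example). It computes $\Delta D$ by hand and compares it with the table from `anova()`.

```r
# Simulated binary response: x2 has no true effect (illustrative assumption)
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.0 * x1))

m0 <- glm(y ~ x1, family = binomial)        # smaller model, q = 2
m1 <- glm(y ~ x1 + x2, family = binomial)   # more general model, p = 3

# Difference in deviance and its degrees of freedom p - q
delta_D <- deviance(m0) - deviance(m1)
df_diff <- length(coef(m1)) - length(coef(m0))

# Upper-tail probability under chi-squared(p - q)
p_value <- pchisq(delta_D, df = df_diff, lower.tail = FALSE)

# The same test as reported by R's analysis-of-deviance table
tab <- anova(m0, m1, test = "Chisq")
```

A large `p_value` means $\Delta D$ is consistent with $\chi^2(p - q)$, so the smaller model would usually be retained.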
Nested or Hierarchical Models
• For normal linear models, the deviance expression contains $\sigma^2$, which
is usually not known.
• Hence an $F$ statistic is used, in which $\sigma^2$ cancels:
$$F = \frac{(D_0 - D_1)/(p - q)}{D_1/(N - p)} \sim F(p - q, N - p)$$
Please review the likelihood ratio notes for the linear regression model; the
expression is the same.
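Because $\sigma^2 D$ is just the residual sum of squares, the $F$ statistic can be computed without knowing $\sigma^2$. The R sketch below uses simulated data and hypothetical variable names to compute it by hand and compare it with `anova()` for two nested linear models.

```r
# Simulated data for two nested normal linear models (illustrative only)
set.seed(4)
N <- 60
x1 <- rnorm(N)
x2 <- rnorm(N)
y <- 1 + 0.5 * x1 + 0.7 * x2 + rnorm(N)

m0 <- lm(y ~ x1)         # q = 2 parameters
m1 <- lm(y ~ x1 + x2)    # p = 3 parameters

# Residual sums of squares equal sigma^2 * D0 and sigma^2 * D1
rss0 <- sum(resid(m0)^2)
rss1 <- sum(resid(m1)^2)
q_par <- length(coef(m0))
p_par <- length(coef(m1))

# sigma^2 cancels in the ratio, so RSS can replace the deviances
F_stat <- ((rss0 - rss1) / (p_par - q_par)) / (rss1 / (N - p_par))
p_value <- pf(F_stat, p_par - q_par, N - p_par, lower.tail = FALSE)

# R's built-in comparison of the two nested models
tab <- anova(m0, m1)
```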
AIC and BIC (Chapter 7.5, Dobson)
• AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion,
also called the Schwarz criterion) are goodness-of-fit statistics based on the
log-likelihood, with an adjustment for the number of parameters
• $AIC = -2\,l(\hat{\pi}, y) + 2p$ where $p$ is the number of parameters
• $BIC = -2\,l(\hat{\pi}, y) + p \ln(N)$ where $N$ is the number of observations
• BIC imposes a greater penalty on the number of parameters whenever $\ln(N) > 2$, i.e. $N \geq 8$.
• A smaller value of these statistics indicates a better balance of fit and parsimony;
they have no associated p-value and are only meaningful for comparing models fitted to the same data
• They are most useful for comparing models that are not nested; for nested models
the deviance difference test is usually preferred.
• The above formulae are used in R; other software may use other
expressions similar to these.
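A quick way to see how R computes these criteria is to rebuild them from the log-likelihood. This is a minimal sketch on simulated data (the model and data are assumptions for illustration); R's `AIC()` uses a penalty of $2p$ and `BIC()` a penalty of $p \ln(N)$.

```r
# Simulated logistic regression data (illustrative only)
set.seed(5)
n <- 150
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.2 + 0.9 * x))

fit <- glm(y ~ x, family = binomial)

ll <- as.numeric(logLik(fit))   # maximized log-likelihood
p <- length(coef(fit))          # number of estimated parameters

# Criteria rebuilt by hand from the log-likelihood
aic_manual <- -2 * ll + 2 * p
bic_manual <- -2 * ll + p * log(n)

# Values reported by R
aic_r <- AIC(fit)
bic_r <- BIC(fit)
```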
Example: Stock Market Data
Matrix Plot:
Example: Stock Market Data
Correlation Analysis
library(ISLR)  # assumed source of Smarket; column 9 (Direction) is a factor, so it is dropped
corrplot::corrplot(cor(Smarket[, -9]))
Fitting the Logistic Regression Model
Prediction on Training Data