SAS - Logistic Regression
羅偉成(Nicholas)
Question?
Does the risk of coronary heart disease (CHD) increase with age?

AGE     Total   Non-CHD   CHD   Proportion with CHD
20-29   10      9         1     0.10
30-34   15      13        2     0.13
35-39   12      9         3     0.25
40-44   15      10        5     0.33
45-49   13      7         6     0.46
50-54   8       3         5     0.63
55-59   17      4         13    0.76
60-69   10      2         8     0.80
Total   100     57        43    0.43

[Figure: scatter plot of CHD by AGE]
Linear Regression Analysis
$$\mathrm{CHD} = \beta_0 + \beta_{\mathrm{Age}} \times \mathrm{Age} + \varepsilon$$
where
– β0 is the intercept term (the mean value of CHD when Age = 0),
– βAge is the regression slope (the change in mean CHD for a one-year change in Age),
– ε is an error term (usually assumed to have a normal distribution).
The null hypothesis is that there is no linear relationship between CHD and Age (i.e. βAge = 0).
Question?
Does the risk of coronary heart disease (CHD) increase with age?
[Figure: proportion with CHD plotted by age group, using the same age-group table as above]
This 'S' or sigmoid shape in the conditional means is common when plotting the proportion responding against a predictive factor.
Logistic function
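• Conditional mean
– Denoted $E(Y \mid x)$
• Expressed as a linear regression, the equation is
– $E(Y \mid x) = \beta_0 + \beta_1 x$
• For a binary outcome, $E(Y \mid x)$ lies between 0 and 1: a value close to 0 means the chance that Y is a success is small, and a value close to 1 means the chance is large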
Logistic regression
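• The relationship between the odds (and log odds) of the predicted probability and the predictor is:
$$\frac{E(Y \mid x)}{1 - E(Y \mid x)} = e^{\beta_0 + \beta_1 x_1}$$
$$\ln\left(\frac{E(Y \mid x)}{1 - E(Y \mid x)}\right) = \beta_0 + \beta_1 x_1$$
$$\operatorname{logit}(P) = \log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
Odds are defined as the ratio of the number of times an event occurs to the number of times it does not occur.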
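As a worked illustration of the odds and the logit, take the 55-59 row of the age table (13 of 17 with CHD). The symbol $\hat{P}$ for the observed proportion is notation introduced here, not taken from the slides.

```latex
\hat{P} = \frac{13}{17} \approx 0.76, \qquad
\text{odds} = \frac{\hat{P}}{1-\hat{P}} = \frac{13}{4} = 3.25, \qquad
\operatorname{logit}(\hat{P}) = \ln(3.25) \approx 1.18
```

A proportion above 0.5 gives odds above 1 and a positive log odds; a proportion below 0.5 gives odds below 1 and a negative log odds.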
Odds ratio (OR)
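In a case-control study, the OR is the ratio of the odds of exposure among cases to the odds of exposure among controls.
In a cohort study, the OR is the ratio of the odds of disease among exposed individuals to the odds of disease among unexposed individuals.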
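Both definitions lead to the same cross-product. As a sketch, assume a 2×2 table with cell labels a, b, c, d (labels introduced here for illustration): a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease.

```latex
\text{Cohort view:}\quad
\mathrm{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}
\qquad
\text{Case-control view:}\quad
\mathrm{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}
```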
Logistic regression
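• Interpreting the regression coefficient (β) in logistic regression:
$$\operatorname{logit}(P) = \log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
– In logistic regression, taking age as the example:
• The odds of CHD at age 35 are
$$\frac{P_1(Y=1 \mid 35\ \text{years})}{1 - P_1(Y=1 \mid 35\ \text{years})} = \exp(\beta_0 + \beta_1 \times 35)$$
• The odds of CHD at age 36 are
$$\frac{P_2(Y=1 \mid 36\ \text{years})}{1 - P_2(Y=1 \mid 36\ \text{years})} = \exp(\beta_0 + \beta_1 \times 36)$$
• Therefore exp(β1) is the OR of disease for a 36-year-old relative to a 35-year-old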
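The step from the two odds to exp(β1) is a single division; written out, with k as a symbol introduced here for an arbitrary age difference in years:

```latex
\mathrm{OR}_{36\text{ vs }35}
  = \frac{\exp(\beta_0 + \beta_1 \times 36)}{\exp(\beta_0 + \beta_1 \times 35)}
  = \exp(\beta_1),
\qquad
\mathrm{OR}_{k\text{-year difference}} = \exp(k\,\beta_1)
```

The intercept cancels, so under this model the OR for a one-year increase is the same at every age.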
Logistic regression
• The logistic regression model is fitted using the principle of maximum likelihood.
• This requires finding the values of the unknown parameters (β0 and β1) that maximize the likelihood of the observed data.
• SAS finds the values of β0 and β1 that solve the two likelihood equations. These values are called maximum likelihood estimates (MLEs) and are often denoted by putting 'hats' above the symbols (β̂0 and β̂1).
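The two equations themselves were not recovered from the slide. In the standard textbook treatment of a single-predictor logistic model they take the form below. The symbols $\pi(x_i)$ for the fitted probability and $n$ for the sample size are notation assumed here.

```latex
\pi(x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}},
\qquad
L(\beta_0, \beta_1) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,\bigl[1 - \pi(x_i)\bigr]^{1 - y_i}

\sum_{i=1}^{n} \bigl[y_i - \pi(x_i)\bigr] = 0,
\qquad
\sum_{i=1}^{n} x_i \bigl[y_i - \pi(x_i)\bigr] = 0
```

These equations have no closed-form solution, so they are solved iteratively.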
PROC LOGISTIC
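A minimal sketch of how such a model could be fitted with PROC LOGISTIC, using the grouped counts from the age table. The dataset name chd_grouped, the variable names, and in particular the age-group midpoints (agemid) are assumptions made here for illustration. The slides' own PROC LOGISTIC code and the individual ages were not recovered, so estimates from this grouped version will differ somewhat from a fit to ungrouped data.

```sas
/* Grouped CHD data from the age table.
   agemid is an ASSUMED midpoint for each age group (not in the slides). */
data chd_grouped;
    input agegrp $ agemid total chd;
    datalines;
20-29 24.5 10 1
30-34 32.0 15 2
35-39 37.0 12 3
40-44 42.0 15 5
45-49 47.0 13 6
50-54 52.0 8 5
55-59 57.0 17 13
60-69 64.5 10 8
;
run;

/* events/trials syntax: chd events out of total subjects per row,
   so the model is for the probability of CHD. */
proc logistic data=chd_grouped;
    model chd/total = agemid;
run;
```

In the output, the odds ratio estimate for agemid is exp(β̂1), the estimated OR for a one-year increase in age.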
Dummy variable
Create dummy variables
Reference group
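The slide content on dummy coding was not recovered, so the following is only a sketch of one common approach. It collapses the age table into three groups and creates two 0/1 dummy variables, with the youngest group as the reference. The cut points (40 and 55), the midpoints, and all variable names are assumptions chosen for illustration.

```sas
/* Grouped CHD data from the age table; agemid is an ASSUMED midpoint. */
data chd_dummy;
    input agegrp $ agemid total chd;
    /* Reference group: under 40 (both dummies equal 0). */
    age40_54 = (40 <= agemid < 55);   /* 1 if in the 40-54 group, else 0 */
    age55up  = (agemid >= 55);        /* 1 if in the 55+ group, else 0   */
    datalines;
20-29 24.5 10 1
30-34 32.0 15 2
35-39 37.0 12 3
40-44 42.0 15 5
45-49 47.0 13 6
50-54 52.0 8 5
55-59 57.0 17 13
60-69 64.5 10 8
;
run;

/* Each coefficient compares one group with the reference group. */
proc logistic data=chd_dummy;
    model chd/total = age40_54 age55up;
run;
```

Because this model has one parameter per group, the estimated ORs should agree with the ones computed by hand from the pooled counts: (16×31)/(20×6) ≈ 4.13 for ages 40-54 and (21×31)/(6×6) ≈ 18.08 for ages 55+, each relative to the under-40 group.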
PROC LOGISTIC
Multiple logistic regression
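• Includes several continuous and several categorical independent variables, in order to examine their effects on a categorical dependent variable Y
$$\operatorname{logit}(P) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
• The β coefficients are estimated by the maximum likelihood estimation (MLE) method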
Model Selection
✓ Forward selection
✓ Backward selection
✓ Stepwise selection
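A sketch of how these three strategies map onto the SELECTION= option of the MODEL statement in PROC LOGISTIC. The data are simulated purely so the example can run. The variable names (age, sbp, noise), the true coefficients, and the 0.05 entry/stay thresholds are all assumptions, not values from the slides.

```sas
/* Simulated data for illustration only: chd depends on age and sbp,
   while noise is unrelated to the outcome. */
data sim;
    call streaminit(2024);
    do id = 1 to 500;
        age   = 20 + 50*rand('uniform');
        sbp   = rand('normal', 130, 15);
        noise = rand('normal');
        p     = 1 / (1 + exp(-(-5 + 0.08*age + 0.01*sbp)));
        chd   = rand('bernoulli', p);
        output;
    end;
run;

/* Stepwise selection: SLENTRY is the significance level to enter,
   SLSTAY the level to remain in the model. */
proc logistic data=sim;
    model chd(event='1') = age sbp noise
          / selection=stepwise slentry=0.05 slstay=0.05;
run;
```

Replacing selection=stepwise with selection=forward or selection=backward requests the other two strategies. Forward selection uses only SLENTRY, and backward elimination uses only SLSTAY.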
Confounding Effect
Confounding bias
[Diagram: nodes labeled "Unknown" and "Smoking"; the rest of the figure was not recovered]
What is a confounder?
Formal Definition
– Confounding occurs when a relationship between an exposure (risk factor) and
an outcome (disease) is misrepresented because each of these variables is also
related to a third variable, known as the confounder.
A confounder must:
1. Be associated with the determinant
2. Be independently associated with the outcome
3. Not be an intermediary in the determinant-outcome pathway
Confounding example 1
• Stark & Mantel showed a striking trend in the prevalence of Down's syndrome with birth order
[Figure: bar chart of the prevalence of Down's syndrome per 1,000 births, 0.6/1000 for 1st births and 1.7/1000 for 5th births]
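Confounding example 2
[Diagram label: Smoking]
Crude association between coffee drinking and pancreatic cancer:

Cups of coffee / day   Pancreatic cancer   Control   Total
≥2                     122                 1978      2100
<2                     26                  1074      1100
Total                  148                 3052      3200

$$\mathrm{OR} = \frac{122 \times 1074}{26 \times 1978} = 2.55$$

Prevalence ratio (PR) of smoking, comparing those who drink ≥2 cups per day (2000 smokers out of 2100) with those who drink <2 cups per day (100 smokers out of 1100):
$$\mathrm{PR} = \frac{2000/2100}{100/1100} = 10.48$$

Stratified by smoking status:
Smokers:       $\mathrm{OR} = \frac{120 \times 94}{6 \times 1880} = 1.0$
Non-smokers:   $\mathrm{OR} = \frac{2 \times 980}{20 \times 98} = 1.0$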
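The crude and stratum-specific ORs above can be reproduced with PROC FREQ. This is a sketch: the cell counts are read off the OR formulas (they add up to the crude table), while the dataset name, variable names, and level labels are assumptions introduced here.

```sas
/* One row per cell: smoking stratum, coffee level, case/control, count. */
data coffee;
    input smoker $ coffee $ status $ count;
    datalines;
Yes 2+ Case 120
Yes 2+ Ctrl 1880
Yes <2 Case 6
Yes <2 Ctrl 94
No  2+ Case 2
No  2+ Ctrl 98
No  <2 Case 20
No  <2 Ctrl 980
;
run;

/* ORDER=DATA keeps "2+" as row 1 and "Case" as column 1, so the odds
   ratio is (2+ cases x <2 controls) / (2+ controls x <2 cases). */
proc freq data=coffee order=data;
    weight count;
    tables coffee*status / relrisk;             /* crude 2x2 table       */
    tables smoker*coffee*status / relrisk cmh;  /* one table per stratum */
run;
```

The crude table should give an odds ratio of about 2.55 and each smoking stratum an odds ratio of 1.0. The CMH option adds a Mantel-Haenszel summary odds ratio adjusted for smoking, which is also 1.0 for these counts.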
Methods to control for confounding
• Design stage
– Randomization
– Restriction
– Matching
• Analysis stage
– Stratification
– Multivariate analysis
Evaluation of stratified analyses
[Figure: crude table shown next to the age strata, one stratum labeled "50+ years"; the stratum tables were not recovered]
Stratified Analyses
Crude table:

Exposure   Case   Control
Yes        500    600
No         1500   3400

Crude OR = 1.9
• The stratum-specific ORs (1.0) differ from the crude OR (1.9) by more than 10%
• This difference indicates that there is confounding by age
Stratified Analyses
Crude table:

Exposure   Case   Control
Yes        200    50
No         800    950

Crude OR = 4.75
Interaction effect
Interaction
– When the incidence rate of disease in the presence of two or more risk factors
differs from the incidence rate expected to result from their individual effects
– Occurs when a measure of association is not constant across levels of another
variable
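– Example:
• If the strength of the association between smoking and coronary atherosclerotic heart disease is the same in women and in men, there is no interaction between sex and smoking in that study
• Conversely, if the strength of the association differs between women and men, there is an interaction between sex and smoking in that study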
Interaction effect
• Is there an association?
• If so, is it due to confounding?
• Is the association equally strong in strata formed on the basis of a third variable?
– No: interaction present
– Yes: interaction not present
Interaction term
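$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2)$$
The interaction term is the product X1X2.
Interpretation: the coefficient β3 is the amount by which the effect of X1 on the mean of Y changes for each one-unit increase in X2 (and vice versa), over and above the main effects of X1 and X2.
[Figure: two straight lines with different slopes]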
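To see why an interaction term produces two lines with different slopes, suppose for illustration that x2 is a 0/1 indicator (an assumption made here; the slide does not specify the coding). Substituting each value into the interaction model gives:

```latex
x_2 = 0:\quad y = \beta_0 + \beta_1 x_1
\qquad\qquad
x_2 = 1:\quad y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,x_1
```

The two lines differ in intercept by β2 and in slope by β3, so β3 = 0 corresponds to parallel lines, that is, no interaction.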
Interaction term in SAS
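/* Dataset names (study, study2) are placeholders; the slide does not name the input data. */
DATA study2;
    SET study;
    interaction = gender*age1;   /* product term; gender must be numeric, e.g. coded 0/1 */
RUN;
PROC LOGISTIC DATA=study2 DESCENDING;   /* DESCENDING models the higher value of ht as the event */
    MODEL ht = gender age1 interaction;
RUN;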
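As an alternative sketch, PROC LOGISTIC also accepts the product directly in the MODEL statement as gender*age1, so no separate variable has to be created. The data below are simulated only to make the example runnable. The coefficients, the sample size, and the 0/1 coding of gender are assumptions, not values from the slides.

```sas
/* Simulated data for illustration only. */
data sim_ht;
    call streaminit(2024);
    do id = 1 to 500;
        gender = rand('bernoulli', 0.5);      /* 0/1 indicator      */
        age1   = 30 + 40*rand('uniform');     /* age in years, 30-70 */
        /* The true log odds include a gender-by-age product term. */
        xb = -5 + 0.06*age1 + 0.3*gender + 0.04*gender*age1;
        ht = rand('bernoulli', 1 / (1 + exp(-xb)));
        output;
    end;
run;

/* gender*age1 in the MODEL statement adds the interaction term;
   EVENT='1' models the probability that ht = 1. */
proc logistic data=sim_ht;
    model ht(event='1') = gender age1 gender*age1;
run;
```

The test of the gender*age1 parameter is the test of interaction: it asks whether the age slope on the log-odds scale differs between the two gender groups.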
Effect modifier
[Diagram label: Sex]
A variable can be ….
Please save your assignment as a Word file named with your student ID and name, and upload it to the I'm@TMU assignment area.
Thanks for your attention