SAS - Logistic Regression


Logistic regression

羅偉成(Nicholas)
Question
Does the risk of coronary heart disease (CHD) increase with age?

AGE     Total   Non-CHD   CHD   CHD proportion
20-29   10      9         1     0.10
30-34   15      13        2     0.13
35-39   12      9         3     0.25
40-44   15      10        5     0.33
45-49   13      7         6     0.46
50-54   8       3         5     0.63
55-59   17      4         13    0.76
60-69   10      2         8     0.80
Total   100     57        43    0.43

[Figure: Scatter plot of CHD by AGE]
Linear Regression Analysis

• One analysis that we might want to do with these data is a linear regression.
• The usual linear regression model for these data would be

$$ \mathrm{CHD} = \beta_0 + \beta_{Age} \times \mathrm{Age} + \varepsilon $$

where
– β0 is the intercept term (the mean value of CHD when Age = 0),
– βAge is the regression slope (the change in mean CHD with a 1-year change in Age),
– ε is an error term (usually assumed to have a normal distribution).

The null hypothesis is that there is no linear relationship between CHD and Age (i.e. βAge = 0).
Linear Regression Analysis

[Figure, two panels: "Scatter plot of CHD by AGE" and "Linear regression predicted CHD by AGE"]

However, when we plot the predicted values, we find a problem.
What is the problem?

✓ If we use linear regression when the outcome is binary, the predicted values may take on impossible values for a percentage or probability:
– The line fitted by ordinary least squares (OLS) may describe the data poorly.
– The predicted values can go below 0 or above 1.
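A minimal sketch of this problem, written in Python rather than SAS. The slides give only age groups, so every subject is assumed to sit at the midpoint of his or her group; that placement is an illustrative assumption, not part of the original data. Under it, a least-squares line fitted to the 0/1 CHD outcome yields predictions outside the range 0 to 1.

```python
# Grouped CHD data from the age table: (assumed age midpoint, total, CHD cases).
# The midpoints are an illustrative assumption; the slides give only age groups.
groups = [
    (24.5, 10, 1), (32.0, 15, 2), (37.0, 12, 3), (42.0, 15, 5),
    (47.0, 13, 6), (52.0, 8, 5), (57.0, 17, 13), (64.5, 10, 8),
]

n = sum(total for _, total, _ in groups)
mean_x = sum(x * total for x, total, _ in groups) / n
mean_y = sum(chd for _, _, chd in groups) / n

# Least-squares slope and intercept for the 0/1 outcome regressed on age.
sxy = sum(x * chd for x, _, chd in groups) - n * mean_x * mean_y
sxx = sum(x * x * total for x, total, _ in groups) - n * mean_x ** 2
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict(age):
    """Predicted 'probability' of CHD from the straight-line fit."""
    return intercept + slope * age


for age in (20, 45, 75):
    print(age, round(predict(age), 3))
```

Under these assumptions the fitted line is already negative at age 20 and exceeds 1 at age 75, which is the impossible behaviour described above.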
Plotted against age group, the CHD proportions in the table above trace an ‘S’ or sigmoid shape. This shape in the conditional means is common when plotting percent response by a predictive factor.
Logistic function
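• Conditional mean
– Written as $E(Y \mid x)$

• Expressed as a linear regression, the equation is
– $E(Y \mid x) = \beta_0 + \beta_1 x$

• Sigmoid function (logistic function)
– $\pi(x) = \dfrac{e^{f(x)}}{1 + e^{f(x)}}$
– A cumulative probability distribution whose graph is S-shaped in this way is called the logistic distribution.

• Probability that event Y succeeds, and probability that event Y fails:
– $E(Y \mid x) = \dfrac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}}$ and $1 - E(Y \mid x) = \dfrac{1}{1 + e^{\beta_0 + \beta_1 x_1}}$

• $E(Y \mid x)$ lies between 0 and 1. A value close to 0 means the chance that Y succeeds is small; a value close to 1 means the chance of success is large.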
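The failure probability follows from the success probability in one step. The intermediate algebra, which the slide does not spell out, can be written as:

```latex
\begin{aligned}
1 - E(Y \mid x)
  &= 1 - \frac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}} \\
  &= \frac{\left(1 + e^{\beta_0 + \beta_1 x_1}\right) - e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}} \\
  &= \frac{1}{1 + e^{\beta_0 + \beta_1 x_1}}
\end{aligned}
```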
Logistic regression
• The relationship between the odds (and the log odds) of the predicted probabilities and the predictor is:

$$ \frac{E(Y \mid x)}{1 - E(Y \mid x)} = e^{\beta_0 + \beta_1 x_1} $$

$$ \ln\left(\frac{E(Y \mid x)}{1 - E(Y \mid x)}\right) = \beta_0 + \beta_1 x_1 $$

• The log odds of a probability is called the logit of the probability.
• In logistic regression, the logit has a linear relationship with x. The predicted values, E(Y | x), have a sigmoid shape.
Logistic regression

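• The dependent variable (Y) is essentially a binary (Bernoulli) trial
– Y = 1: the event occurs; Y = 0: the event does not occur
– Odds

$$ \mathrm{Odds} = \frac{P}{1 - P} \qquad (\text{range on the odds scale: } 0 \ldots \infty) $$

$$ \ln(\mathrm{odds}) = \ln\frac{P}{1 - P} = \mathrm{logit}(P) = \eta \quad (\text{range: } -\infty, +\infty), \qquad P = \frac{1}{1 + \exp[-\eta]} $$

$$ \mathrm{logit}(P) = \log\frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p $$

Odds can be defined as the ratio of the number of times an event occurs to the number of times it does not occur.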
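The three conversions above (probability to odds, odds to logit, logit back to probability) can be sketched in a few lines of Python. The function names are chosen here for illustration; the example value 0.43 is the overall CHD proportion from the age table.

```python
import math


def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log odds: maps the interval (0, 1) onto the whole real line."""
    return math.log(odds(p))


def inv_logit(eta):
    """Inverse of logit: P = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


p = 0.43  # overall CHD proportion in the age table (43 of 100)
eta = logit(p)
print(odds(p), eta, inv_logit(eta))
```

Applying `inv_logit` to the logit returns the original probability, which is why a model that is linear on the logit scale always produces probabilities between 0 and 1.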
Odds ratio (OR)
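In a case-control study, the OR is defined as the ratio of the odds of exposure among the cases to the odds of exposure among the controls.

In a cohort study, the OR is defined as the ratio of the odds of disease among exposed individuals to the odds of disease among unexposed individuals.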
Logistic regression
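• Interpreting the regression coefficients (β) in logistic regression

$$ \mathrm{logit}(P) = \log\frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p $$

– In logistic regression, taking age as an example:

• The odds of CHD at age 35 are

$$ \frac{P_1(Y = 1 \mid 35\ \text{years})}{1 - P_1(Y = 1 \mid 35\ \text{years})} = \exp(\beta_0 + \beta_1 \times 35) $$

• The odds of CHD at age 36 are

$$ \frac{P_2(Y = 1 \mid 36\ \text{years})}{1 - P_2(Y = 1 \mid 36\ \text{years})} = \exp(\beta_0 + \beta_1 \times 36) $$

• The log odds ratio (LOR) of disease for age 36 relative to age 35 is

$$ \mathrm{LOR} = \log\frac{P_2 / (1 - P_2)}{P_1 / (1 - P_1)} = \log\frac{P_2}{1 - P_2} - \log\frac{P_1}{1 - P_1} = (\beta_0 + \beta_1 \times 36) - (\beta_0 + \beta_1 \times 35) = \beta_1 $$

• Therefore exp(β1) is the OR of disease for age 36 relative to age 35.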
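A small numerical check of this result, as a sketch in Python. The coefficients `b0` and `b1` are hypothetical values chosen only for illustration; they are not estimates from the CHD data.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not SAS output).
b0, b1 = -5.0, 0.1


def odds_at(age):
    """Odds of CHD at a given age under logit(P) = b0 + b1 * age."""
    return math.exp(b0 + b1 * age)


or_one_year = odds_at(36) / odds_at(35)   # equals exp(b1)
or_ten_years = odds_at(45) / odds_at(35)  # equals exp(10 * b1)
print(or_one_year, math.exp(b1))
print(or_ten_years, math.exp(10 * b1))
```

The intercept cancels in the ratio, so the one-year OR is the same whichever pair of adjacent ages is compared, and a ten-year difference multiplies the odds by exp(10 × β1).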
Logistic regression
• The logistic regression model is fit using the principle of maximum likelihood.
• This requires finding values of the unknown parameters (β0 and β1) that maximize the likelihood of the observed data.
• SAS determines the values of β0 and β1 whose predicted values solve the two likelihood equations. These are called maximum likelihood estimates (MLEs), and are often denoted by putting ‘hats’ above the symbols.
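The two equations themselves are not reproduced in the text of the slide. Assuming they are the usual likelihood equations for a one-predictor logistic model, Σ(y − π) = 0 and Σ x(y − π) = 0, the sketch below solves them by Newton-Raphson in plain Python. It is not SAS's implementation, and it places each subject at the midpoint of his or her age group, which is an illustrative assumption.

```python
import math

# (assumed age midpoint, total, CHD cases); midpoints are an illustrative
# assumption because the slides report age groups, not individual ages.
groups = [
    (24.5, 10, 1), (32.0, 15, 2), (37.0, 12, 3), (42.0, 15, 5),
    (47.0, 13, 6), (52.0, 8, 5), (57.0, 17, 13), (64.5, 10, 8),
]


def pi(b0, b1, x):
    """Predicted probability under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def score(b0, b1):
    """The two likelihood equations: both sums are zero at the MLE."""
    u0 = sum(chd - n * pi(b0, b1, x) for x, n, chd in groups)
    u1 = sum(x * (chd - n * pi(b0, b1, x)) for x, n, chd in groups)
    return u0, u1


def fit(iterations=25):
    """Newton-Raphson iterations starting from b0 = b1 = 0."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        u0, u1 = score(b0, b1)
        # Information matrix entries, with weights n * pi * (1 - pi).
        w = [(x, n * pi(b0, b1, x) * (1 - pi(b0, b1, x))) for x, n, _ in groups]
        i00 = sum(wi for _, wi in w)
        i01 = sum(x * wi for x, wi in w)
        i11 = sum(x * x * wi for x, wi in w)
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


b0_hat, b1_hat = fit()
print(b0_hat, b1_hat, math.exp(b1_hat))
```

At the returned values both sums in `score` are numerically zero, which is what "solving the two equations" means here; `math.exp(b1_hat)` is then the estimated OR per year of age under the midpoint assumption.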
PROC LOGISTIC

Dummy variable

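• Dummy variable
– In regression analysis, a categorical explanatory variable must first be converted into dummy variables before it can be analyzed and interpreted properly.
– An explanatory variable with k categories requires k-1 dummy variables.
– In data processing, a series of IF…THEN…ELSE IF statements is used to create the dummy variables.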
Create dummy variables

• Suppose we have 4 groups; then we need to create 3 dummy variables.

/* g1, g2, g3 are the 3 dummy variables; group=1 is the reference group */
g1=0; g2=0; g3=0;
IF group=2 THEN g1=1;
ELSE IF group=3 THEN g2=1;
ELSE IF group=4 THEN g3=1;
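The same k-1 coding can be sketched outside SAS. The Python function below is a hypothetical helper written for illustration; it returns all zeros for the reference group and a single 1 for every other group, mirroring the IF…THEN…ELSE IF logic.

```python
def dummy_code(group, levels=(1, 2, 3, 4), reference=1):
    """Return k-1 indicator values for a k-level categorical variable.

    The reference level gets all zeros; every other level gets a single 1.
    """
    non_reference = [level for level in levels if level != reference]
    return [1 if group == level else 0 for level in non_reference]


for g in (1, 2, 3, 4):
    g1, g2, g3 = dummy_code(g)
    print(g, g1, g2, g3)
```

Choosing a different `reference` changes which group every odds ratio is compared against, but not the fit of the model.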
PROC LOGISTIC

Group   AGE     Dummy       OR (95% CI)             P-value
1       20-34   Reference   1.0 (ref.)
2       35-44   g1          3.09 (0.72-13.32)       0.1307
3       45-54   g2          8.07 (1.84-35.41)       0.0057
4       55+     g3          25.67 (5.67-116.12)     <0.0001
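These odds ratios can be checked by hand. The sketch below assumes the four groups are formed by collapsing the eight rows of the age table at the start of the deck (for example, 20-29 and 30-34 combine to 3 CHD and 22 non-CHD), and it uses a Wald interval on the log odds ratio. Both choices are assumptions made for illustration, since the SAS output itself is not reproduced in the text.

```python
import math

# Collapsed counts (CHD, non-CHD) per age group, assuming the four groups
# are built from the 8-row age table shown earlier in the deck.
counts = {
    "20-34": (3, 22),   # reference group
    "35-44": (8, 19),
    "45-54": (11, 10),
    "55+":   (21, 6),
}


def odds_ratio_ci(group, reference="20-34", z=1.96):
    """Odds ratio vs. the reference group with a Wald 95% interval."""
    a, b = counts[group]       # CHD, non-CHD in the group of interest
    c, d = counts[reference]   # CHD, non-CHD in the reference group
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


for group in ("35-44", "45-54", "55+"):
    or_, lo, hi = odds_ratio_ci(group)
    print(f"{group}: OR={or_:.2f} ({lo:.2f}-{hi:.2f})")
```

With a single categorical predictor the logistic model reproduces the observed proportions in each group, so the cross-product ratios from the collapsed table agree with the model-based odds ratios in the table above.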
Multiple logistic regression
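• The model includes several continuous and several categorical independent variables, in order to examine their effects on the categorical dependent variable Y.

$$ \mathrm{logit}(P) = \log\frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p $$

• The β coefficients are estimated by the maximum likelihood estimation (MLE) method.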
Model Selection

✓ Forward selection

✓ Backward selection

✓ Stepwise selection

Confounding Effect

Confounding bias

[Diagram: Smoking (or an unknown factor) is linked to both Betel nut and Lung cancer; the direct link from Betel nut to Lung cancer is marked with an X]

What is a confounder?

• Formal definition
– Confounding occurs when a relationship between an exposure (risk factor) and an outcome (disease) is misrepresented because each of these variables is also related to a third variable, known as the confounder.
• A confounder must:
1. Be associated with the determinant (the exposure)
2. Be independently associated with the outcome
3. Not be an intermediary in the determinant-outcome pathway

Confounding example 1
• Stark & Mantel showed a striking trend in the prevalence of Down’s Syndrome with birth order.

[Figure: Prevalence of Down's Syndrome by birth order, rising from 0.6/1000 at 1st births to 1.7/1000 at 5th births]

Stark CR & Mantel N. J Natl Cancer Inst 1966


Confounding example 1
• Birth order and maternal age are related; women giving birth to their 5th baby tend to be older than women giving birth to their 1st baby.

[Figure: Prevalence of Down's Syndrome by maternal age, rising from 0.2/1000 in the youngest mothers to 8.5/1000 in the oldest]

• Maternal age was a confounder.

Stark CR & Mantel N. J Natl Cancer Inst 1966
From Rothman K. Epidemiology, An Introduction 2002
Confounding example 2

[Diagram: Smoking is linked to both Coffee consumption and Pancreatic cancer]

• Smoking is a known risk factor for pancreatic cancer


• Smoking is associated with coffee consumption, but is not a
result of coffee drinking
Confounding example 2

Cups of coffee / day   Pancreatic cancer   Control   Total
≥2                     122                 1978      2100
<2                     26                  1074      1100
Total                  148                 3052      3200

$$ OR = \frac{122 \times 1074}{26 \times 1978} = 2.55 $$
Confounding example 2

• Smoking is associated with coffee consumption


                        Cigarette smoking
                        Yes     No
≥2 cups of coffee/day   2000    100
Total                   2100    1100

$$ PR = \frac{2000 / 2100}{100 / 1100} = 10.48 $$
Confounding example 2

Smokers
Cups of coffee / day   Pancreatic cancer   Control
≥2                     120                 1880
<2                     6                   94
Total                  126                 1974

$$ OR = \frac{120 \times 94}{6 \times 1880} = 1.0 $$

Non-Smokers
Cups of coffee / day   Pancreatic cancer   Control
≥2                     2                   98
<2                     20                  980
Total                  22                  1078

$$ OR = \frac{2 \times 980}{20 \times 98} = 1.0 $$
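The arithmetic in this example is easy to reproduce. The sketch below, in Python, recomputes the crude odds ratio, the two smoking-specific odds ratios, and the ratio labelled PR from the counts in the tables; the function name and argument names are chosen here for illustration.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio for a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


crude = odds_ratio(122, 1978, 26, 1074)     # all subjects
smokers = odds_ratio(120, 1880, 6, 94)      # stratum: smokers
non_smokers = odds_ratio(2, 98, 20, 980)    # stratum: non-smokers

# Share of >=2 cups/day drinkers among smokers vs. among non-smokers.
pr = (2000 / 2100) / (100 / 1100)

print(round(crude, 2), smokers, non_smokers, round(pr, 2))
```

The crude association between coffee and pancreatic cancer disappears within each smoking stratum, which is the pattern expected when smoking confounds the relationship.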
Methods to control for confounding

• Design stage
– Randomization
– Restriction
– Matching

• Analysis stage
– Stratification
– Multivariate analysis

Evaluation of stratified analyses

• Compare the adjusted estimate with the crude estimate. If:
– Crude result = stratified result → no confounding
– Crude result ≠ stratified result → confounding

[Figure: crude estimate shown alongside the stratified estimates for <50 years and 50+ years]
Stratified Analyses
Crude
Exposure   Case   Control
Yes        500    600
No         1500   3400
Crude OR = 1.9

Age < 50
Exposure   Case   Control
Yes        50     300
No         450    2700
Stratum-specific OR = 1.0

Age >= 50
Exposure   Case   Control
Yes        450    300
No         1050   700
Stratum-specific OR = 1.0

• The stratum-specific ORs (1.0) differ from the crude OR (1.9) by more than 10%.
• This difference indicates that there is confounding by age.
Stratified Analyses
Crude
Exposure   Case   Control
Yes        200    50
No         800    950
Crude OR = 4.75

Age < 50
Exposure   Case   Control
Yes        194    21
No         706    79
Stratum-specific OR = 1.03

Age >= 50
Exposure   Case   Control
Yes        6      29
No         94     871
Stratum-specific OR = 1.92

• The stratum-specific ORs differ from the crude OR and from each other → confounding and effect measure modification by age.
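The comparisons in the two stratified examples can be sketched as a rough screen in Python. The 10% tolerance is a common rule of thumb, adopted here as an assumption, and the `assess` helper is hypothetical; it is not a statistical test and does not replace a formal test of homogeneity.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed cases/controls; c, d = unexposed."""
    return (a * d) / (b * c)


def differs(x, y, tolerance=0.10):
    """True when x and y differ by more than `tolerance` relative to y.

    The 10% default is a common rule of thumb, used here as an assumption.
    """
    return abs(x - y) / y > tolerance


def assess(crude, strata):
    """Rough screen: compare the crude OR with each stratum-specific OR."""
    crude_or = odds_ratio(*crude)
    stratum_ors = [odds_ratio(*s) for s in strata]
    return {
        "crude": crude_or,
        "strata": stratum_ors,
        "confounding": all(differs(crude_or, s) for s in stratum_ors),
        "effect_modification": differs(max(stratum_ors), min(stratum_ors)),
    }


first = assess((500, 600, 1500, 3400), [(50, 300, 450, 2700), (450, 300, 1050, 700)])
second = assess((200, 50, 800, 950), [(194, 21, 706, 79), (6, 29, 94, 871)])
print(first)
print(second)
```

For the first example the screen flags confounding only; for the second it flags both confounding and effect modification, matching the conclusions stated on the slides.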
Interaction

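Interaction effect

• Interaction
– When the incidence rate of disease in the presence of two or more risk factors differs from the incidence rate expected to result from their individual effects
– Occurs when a measure of association is not constant across levels of another variable
– Example:
• If the strength of the association between smoking and coronary atherosclerotic heart disease is the same in women and in men, there is no interaction between sex and smoking in that study.
• Conversely, if the strength of the association differs between women and men, there is an interaction between sex and smoking in that study.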
Interaction effect
• Is there an association?
• If so, is it due to confounding?
• Is the association equally strong in strata formed on the basis of a third variable?
– NO → interaction present
– YES → interaction not present
Interaction term

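$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) $$

The interaction term is the product X1X2.
Interpretation: after controlling for the main effects of X1 and X2 on Y, the coefficient β3 is the change in the mean of Y per unit change in the interaction term. Equivalently, β3 is the amount by which the slope for X1 changes for each one-unit increase in X2.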
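On the logit scale the same equation gives the interaction coefficient a concrete reading: it is the log of the ratio between the odds ratio for X1 when X2 = 1 and the odds ratio for X1 when X2 = 0. The Python sketch below illustrates this with hypothetical coefficients; applying the equation to the logit rather than to the mean of Y is an assumption made to match the logistic setting of this deck.

```python
import math

# Hypothetical coefficients for logit(P) = b0 + b1*x1 + b2*x2 + b3*x1*x2,
# chosen only to illustrate the algebra (not estimates from any data).
b0, b1, b2, b3 = -2.0, 0.7, 0.4, 0.5


def log_odds(x1, x2):
    """Linear predictor including the product (interaction) term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def or_for_x1(x2):
    """Odds ratio for x1 = 1 vs. x1 = 0 at a fixed value of x2."""
    return math.exp(log_odds(1, x2) - log_odds(0, x2))


print(or_for_x1(0), math.exp(b1))                  # effect of x1 when x2 = 0
print(or_for_x1(1), math.exp(b1 + b3))             # effect of x1 when x2 = 1
print(or_for_x1(1) / or_for_x1(0), math.exp(b3))   # ratio of the two ORs
```

When b3 is 0 the two odds ratios coincide, which corresponds to no interaction on the odds ratio scale.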
[Figure: two straight lines with different slopes]

Interaction term in SAS

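DATA a;
  SET work.test;
  /* Code sex as a 0/1 indicator */
  IF sex='F' THEN gender=0;
  ELSE IF sex='M' THEN gender=1;
  /* Dichotomize age at 50 years (note: SAS treats a missing age as less than 50) */
  IF age<50 THEN age1=0;
  ELSE IF age>=50 THEN age1=1;
  /* Interaction term: product of the two indicators */
  interaction=gender*age1;
RUN;

PROC LOGISTIC DATA=a DESCENDING;
  MODEL ht=gender age1 interaction;
RUN;

Result: AGE and GENDER show no significant interaction effect on HT.
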
Effect modifier

[Diagram: Sex modifies the association between Smoking and Lung cancer]

A variable can be:

1. Both a confounder and a modifier
2. A confounder but not a modifier
3. A modifier but not a confounder
4. Neither a confounder nor a modifier

     With variable X   Without variable X   Total   Modifier   Confounding
A    4.0               2.0                  1       Y          Y
B    4.0               4.0                  1       N          Y
C    4.0               2.0                  2.8     Y          N
D    4.0               4.0                  4.0     N          N

From CJ Chen, 2000
Practice and Homework
PRACTICE TIME
• Import the Epi_test.xls file into SAS, then
– Estimate the odds ratio of hypertension (HT) by cigarette smoking (SMK) status, and interpret the result.
– Estimate the odds ratio of hypertension (HT) across age groups (AGE), and interpret the result.
• Age groups: <45, 45-54, 55-64, 65+

Save your homework as a WORD file named with your student ID + name, and upload it to the I'm@TMU assignment area.
Thanks for your attention
