GEA1000 Chapter 2 Review
Chapter 2 Review
david.chew@nus.edu.sg
WHERE WE WERE...
• PROBLEM
Research Question?
• PLAN
• DATA/ANALYSIS
Exploratory Data Analysis (EDA)
Summary statistics
WHERE WE ARE NOW...
In this tutorial...
ANALYSIS OF CATEGORICAL DATA
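1. Bar graphs
2. Rates (marginal, conditional and joint)
3. Association between categorical variables
4. Symmetry rule on rates
5. Simpson’s Paradox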
1 BAR GRAPHS
However, beware of making your plot "too busy"!
Here is another set of plots that I have come across quite often recently, but am not too happy with.
Do you have ideas to make them better?
2 RATES (MARGINAL, CONDITIONAL AND JOINT)
Consider the following contingency table (2-by-2 table) that shows the smoking status and outcome (heart disease (HD) or not) for 390 people after 20 years.

              HD    No HD   Total
Smoker        100   100     200
Non-smoker    110   80      190
Total         210   180     390

• Marginal rate
rate(HD) = 210/390 = 53.8%
• Conditional rate
rate(HD | Smoker) = 100/200 = 50.0%
• Joint rate
rate(HD and Smoker) = 100/390 = 25.6%
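The three rates can be checked with a few lines of Python. This is only a minimal sketch: the dictionary layout and the function names (`marginal_rate`, `conditional_rate`, `joint_rate`) are illustrative choices, not part of the course material.

```python
# Counts from the 2-by-2 table: keys are (smoking status, outcome).
counts = {
    ("Smoker", "HD"): 100,
    ("Smoker", "No HD"): 100,
    ("Non-smoker", "HD"): 110,
    ("Non-smoker", "No HD"): 80,
}

total = sum(counts.values())  # 390 people in all


def marginal_rate(outcome):
    """rate(outcome): share of ALL people who have this outcome."""
    return sum(n for (_, o), n in counts.items() if o == outcome) / total


def conditional_rate(outcome, status):
    """rate(outcome | status): share of people WITH this status who have the outcome."""
    row_total = sum(n for (s, _), n in counts.items() if s == status)
    return counts[(status, outcome)] / row_total


def joint_rate(outcome, status):
    """rate(outcome and status): share of ALL people who fall in this one cell."""
    return counts[(status, outcome)] / total


print(f"rate(HD)            = {marginal_rate('HD'):.1%}")
print(f"rate(HD | Smoker)   = {conditional_rate('HD', 'Smoker'):.1%}")
print(f"rate(HD and Smoker) = {joint_rate('HD', 'Smoker'):.1%}")
```

The only difference between the three rates is the denominator: the grand total for marginal and joint rates, and the row total of the conditioning group for a conditional rate.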
3 ASSOCIATION BETWEEN CATEGORICAL VARIABLES
              HD    No HD   Total
Smoker        100   100     200
Non-smoker    110   80      190
Total         210   180     390

• rate(HD | Smoker) = 100/200 = 50.0%
• rate(HD | non-smoker) = 110/190 = 57.9%
We say that HD and smoking are negatively associated, since rate(HD | Smoker) < rate(HD | non-smoker).
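The comparison behind this conclusion can be sketched in Python as follows. The helper name `association` and its three labels are hypothetical, chosen only for illustration.

```python
# Conditional rates of heart disease (HD), taken from the rows of the table.
rate_hd_given_smoker = 100 / 200      # 50.0%
rate_hd_given_non_smoker = 110 / 190  # about 57.9%


def association(rate_given_b, rate_given_not_b):
    """Classify the association between A and B by comparing
    rate(A | B) with rate(A | not B)."""
    if rate_given_b > rate_given_not_b:
        return "positive"
    if rate_given_b < rate_given_not_b:
        return "negative"
    return "none"


result = association(rate_hd_given_smoker, rate_hd_given_non_smoker)
print(f"HD and smoking: {result} association")
```

Note that this only describes the direction of the association in this data set; it says nothing about whether smoking causes or prevents heart disease.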
4 SYMMETRY RULE ON RATES
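A minimal sketch, assuming the symmetry rule here is the usual statement that rate(A | B) > rate(A | not B) exactly when rate(B | A) > rate(B | not A) (and likewise for "<" and "="). The check below applies it to the smoking/HD table, with A = HD and B = Smoker; the variable names are illustrative.

```python
# Cell counts from the smoking / heart disease (HD) table.
hd_smoker, no_hd_smoker = 100, 100
hd_non_smoker, no_hd_non_smoker = 110, 80

# Condition on smoking status (row totals as denominators).
rate_hd_given_smoker = hd_smoker / (hd_smoker + no_hd_smoker)
rate_hd_given_non_smoker = hd_non_smoker / (hd_non_smoker + no_hd_non_smoker)

# Condition on outcome instead (column totals as denominators).
rate_smoker_given_hd = hd_smoker / (hd_smoker + hd_non_smoker)
rate_smoker_given_no_hd = no_hd_smoker / (no_hd_smoker + no_hd_non_smoker)

# Under the assumed rule, both comparisons must point the same way.
rows_say_lower = rate_hd_given_smoker < rate_hd_given_non_smoker
columns_say_lower = rate_smoker_given_hd < rate_smoker_given_no_hd

print(f"rate(HD | Smoker) = {rate_hd_given_smoker:.1%}, "
      f"rate(HD | Non-smoker) = {rate_hd_given_non_smoker:.1%}")
print(f"rate(Smoker | HD) = {rate_smoker_given_hd:.1%}, "
      f"rate(Smoker | No HD) = {rate_smoker_given_no_hd:.1%}")
print("Same direction:", rows_say_lower == columns_say_lower)
```

In this table both comparisons come out as "<", so the negative association is found whichever variable is conditioned on.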
BASIC RULE ON RATES
Now let us look at the same data, but "sliced" by gender.

                  Female           Male
              HD    No HD      HD    No HD    Total
Smoker        60    20         40    80       200
Non-smoker    100   50         10    30       190
Total         160   70         50    110      390

• For Female
rate(HD | Smoker) = 60/80 = 75.0%
• For Male
rate(HD | Smoker) = 40/120 = 33.3%
• For All
rate(HD | Smoker) = 100/200 = 50.0%
• Note that, as stated by Rule 1, the overall rate (50.0%) lies between the two subgroup rates (33.3% and 75.0%).
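One way to see why the overall rate must sit between the subgroup rates is that it is a weighted average of them, with weights given by the subgroup sizes. The sketch below checks this for smokers; it assumes "Rule 1" refers to the basic rule on rates, and the variable names are illustrative.

```python
# Smokers only, split by gender: (HD count, group size) from the sliced table.
smokers = {"Female": (60, 80), "Male": (40, 120)}

total_smokers = sum(size for _, size in smokers.values())  # 200

# rate(HD | Smoker) within each gender.
subgroup_rates = {g: hd / size for g, (hd, size) in smokers.items()}

# Share of all smokers that each gender makes up.
weights = {g: size / total_smokers for g, (_, size) in smokers.items()}

# The overall rate is the size-weighted average of the subgroup rates.
overall_rate = sum(weights[g] * subgroup_rates[g] for g in smokers)

lies_between = (
    min(subgroup_rates.values()) <= overall_rate <= max(subgroup_rates.values())
)

for g in smokers:
    print(f"{g}: rate = {subgroup_rates[g]:.1%}, weight = {weights[g]:.0%}")
print(f"Overall: {overall_rate:.1%}, between subgroup rates: {lies_between}")
```

Because males make up 60% of smokers, the overall rate is pulled closer to the male rate than to the female rate.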
5 SIMPSON’S PARADOX
                   Female                 Male                  All
              HD   Total  Rate      HD   Total  Rate      HD   Total  Rate
Smoker        60   80     75.0%     40   120    33.3%     100  200    50.0%
Non-smoker    100  150    66.7%     10   40     25.0%     110  190    57.9%
• Note that the relationship between the percentages in the subgroups is reversed when the subgroups are combined (75.0% > 66.7% and 33.3% > 25.0%, but 50.0% < 57.9%).
An instance of the Yule-Simpson paradox!
• This is due to the confounder "gender", which is associated with both HD and smoking.
• So this answers our earlier question: "Does this mean that it is better to smoke?"
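The reversal can be verified directly from the counts. The following is a minimal sketch using the (HD, Total) pairs from the table; the nested-dictionary layout and the names `subgroup_higher`, `combined` and `paradox` are illustrative choices.

```python
# (HD count, group size) for each gender and smoking status.
table = {
    "Female": {"Smoker": (60, 80), "Non-smoker": (100, 150)},
    "Male": {"Smoker": (40, 120), "Non-smoker": (10, 40)},
}


def rate(hd, size):
    return hd / size


# Within each gender: do smokers have the HIGHER rate of HD?
subgroup_higher = {
    gender: rate(*rows["Smoker"]) > rate(*rows["Non-smoker"])
    for gender, rows in table.items()
}

# Combine the genders by adding up counts (not by averaging the rates).
combined = {
    status: (
        sum(table[g][status][0] for g in table),
        sum(table[g][status][1] for g in table),
    )
    for status in ("Smoker", "Non-smoker")
}
combined_higher = rate(*combined["Smoker"]) > rate(*combined["Non-smoker"])

# Simpson's Paradox here: same direction in every subgroup, reversed overall.
paradox = all(subgroup_higher.values()) and not combined_higher

print("Smokers higher within each gender:", subgroup_higher)
print("Smokers higher overall:", combined_higher)
print("Simpson's Paradox observed:", paradox)

# The confounder at work: most smokers are male, most non-smokers are female.
male_share_smokers = table["Male"]["Smoker"][1] / combined["Smoker"][1]
male_share_non_smokers = table["Male"]["Non-smoker"][1] / combined["Non-smoker"][1]
print(f"Male share of smokers: {male_share_smokers:.0%}, "
      f"of non-smokers: {male_share_non_smokers:.0%}")
```

Males have much lower HD rates than females in this data and make up 60% of smokers but only about 21% of non-smokers, which is what drags the combined smoker rate below the combined non-smoker rate.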
• The presence of a confounder in the relationship between 2 variables does not mean that Simpson’s Paradox will always occur.
• Slicing the data into male and female subgroups and studying the association between smoking status and outcome within each subgroup controls for the confounder gender.
• Randomised assignment should be used whenever possible in experiments, to lower the possibility of having to deal with confounders.
• In observational studies, collecting data on possible variables that may be confounders is a good idea. But there may be too many of them!
Hence only association and not causation may be determined.