
GEA1000 Chapter 2 Review: david.chew@nus.edu.sg

This document provides a summary of key concepts from a chapter on analyzing categorical data, including: 1) Bar graphs are commonly used to visualize categorical data but should avoid being too cluttered. 2) Rates like marginal, conditional, and joint rates can be calculated from contingency tables to analyze relationships between categorical variables. 3) Variables can be positively or negatively associated by comparing conditional rates. 4) Simpson's Paradox can cause reversal of relationships when data is combined or stratified without controlling for confounding variables.

Uploaded by

Huang Zhanyi

GEA1000

Chapter 2 Review

david.chew@nus.edu.sg

WHERE WE WERE . . .

• PROBLEM
  – Research question?

• PLAN
  – What and how to measure? Variable types
  – Study design? Experiments vs observational studies
  – Collecting? Census or sampling?

• DATA/ANALYSIS
  – Exploratory Data Analysis (EDA)
  – Summary statistics

WHERE WE ARE NOW . . .

In this tutorial . . .
ANALYSIS OF CATEGORICAL DATA

1. Bar graphs

2. Rates (marginal, conditional and joint)

3. Association between categorical variables

4. Symmetry, basic rule on rates

5. Simpson’s Paradox

1 BAR GRAPHS

The most common approach to visualizing amounts (i.e., numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged.

However, beware of making your plot "too busy"!

Some people contend that

    . . . you can always write every single figure that your chart represents on top of your bars, lines, or pie segments; but then, what is the point of designing the chart in the first place? A good graphic should let you visualize trends and patterns without having to read all the numbers.

    How Charts Lie by Alberto Cairo

(Figure: Straits Times, 17 Jun 2021.)


Here is my attempt to make it read better. This is known as a slope chart, first introduced by Edward Tufte.

Here is another set of plots that I have come across quite often recently, but am not too happy about.
Do you have ideas to make them better?

An app written using R to make plots: https://david-chew.shinyapps.io/esquisse/

2 RATES (MARGINAL, CONDITIONAL AND JOINT)

Consider the following contingency table (2-by-2 table) that shows the smoking status and outcome (heart disease (HD) or not) for 390 people after 20 years.

              HD    No HD   Total
Smoker        100    100     200
Non-smoker    110     80     190
Total         210    180     390

• Marginal rate
  rate(HD) = 210/390 = 53.8%

• Conditional rate
  rate(HD | Smoker) = 100/200 = 50.0%

• Joint rate
  rate(HD and Smoker) = 100/390 = 25.6%
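
The three rates can be reproduced with a few lines of Python. This is a minimal sketch that is not part of the original slides; the dictionary layout and the helper names (`grand_total`, `marginal_rate`, `conditional_rate`, `joint_rate`) are illustrative assumptions. Note that the conditional rate divides by the number of smokers (the row total, 200), while the marginal and joint rates divide by all 390 people.

```python
# Counts from the 2-by-2 table: rows are smoking status, columns are outcome.
table = {
    "Smoker": {"HD": 100, "No HD": 100},
    "Non-smoker": {"HD": 110, "No HD": 80},
}

def grand_total(t):
    """Total number of people in the table."""
    return sum(sum(row.values()) for row in t.values())

def marginal_rate(t, outcome):
    """rate(outcome): column total divided by the grand total."""
    return sum(row[outcome] for row in t.values()) / grand_total(t)

def conditional_rate(t, outcome, group):
    """rate(outcome | group): cell count divided by the row total."""
    return t[group][outcome] / sum(t[group].values())

def joint_rate(t, outcome, group):
    """rate(outcome and group): cell count divided by the grand total."""
    return t[group][outcome] / grand_total(t)

print(f"rate(HD)            = {marginal_rate(table, 'HD'):.1%}")
print(f"rate(HD | Smoker)   = {conditional_rate(table, 'HD', 'Smoker'):.1%}")
print(f"rate(HD and Smoker) = {joint_rate(table, 'HD', 'Smoker'):.1%}")
```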

3 ASSOCIATION BETWEEN CATEGORICAL VARIABLES

Continuing . . .

              HD    No HD   Total
Smoker        100    100     200
Non-smoker    110     80     190
Total         210    180     390

We compare the conditional rates

• rate(HD | Smoker) = 100/200 = 50.0%
• rate(HD | Non-smoker) = 110/190 = 57.9%

We say that HD and smoking are negatively associated, since

rate(HD | Smoker) < rate(HD | Non-smoker)

Does this mean that it is better to smoke?
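
The comparison of conditional rates can be written as a small helper. This is a sketch under assumptions: the function name `association` and its return strings are hypothetical, chosen only for illustration. Exact fractions are used so that equal rates compare as equal.

```python
from fractions import Fraction

def association(a_and_b, b_total, a_and_nb, nb_total):
    """Classify the association between A and B by comparing
    rate(A | B) with rate(A | not B), each computed from counts."""
    rate_a_given_b = Fraction(a_and_b, b_total)
    rate_a_given_nb = Fraction(a_and_nb, nb_total)
    if rate_a_given_b > rate_a_given_nb:
        return "positive"
    if rate_a_given_b < rate_a_given_nb:
        return "negative"
    return "none"

# A = heart disease, B = smoker:
# 100 of 200 smokers and 110 of 190 non-smokers have HD.
print(association(100, 200, 110, 190))
```

A negative association here only describes the rates in this table; it says nothing yet about whether smoking causes the difference.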

4 SYMMETRY RULE ON RATES

It can be shown that (writing NA for "not A" and NB for "not B")

(I) rate(A|B) > rate(A|NB) ⇐⇒ rate(B|A) > rate(B|NA)

(II) rate(A|B) < rate(A|NB) ⇐⇒ rate(B|A) < rate(B|NA)

(III) rate(A|B) = rate(A|NB) ⇐⇒ rate(B|A) = rate(B|NA)

• When we have (I), A and B are said to be positively associated;

• When we have (II), A and B are said to be negatively associated;

• When we have (III), A and B are said to be not associated.
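
The symmetry rule can be checked numerically on the smoking table from Section 3, taking A = HD and B = Smoker. This is a sketch only: the function names and the argument order (counts for A and B, A and NB, NA and B, NA and NB) are assumptions made for the illustration.

```python
from fractions import Fraction

def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def direction_given_b(n_ab, n_a_nb, n_na_b, n_na_nb):
    """Sign of rate(A|B) - rate(A|NB)."""
    return sign(Fraction(n_ab, n_ab + n_na_b) - Fraction(n_a_nb, n_a_nb + n_na_nb))

def direction_given_a(n_ab, n_a_nb, n_na_b, n_na_nb):
    """Sign of rate(B|A) - rate(B|NA)."""
    return sign(Fraction(n_ab, n_ab + n_a_nb) - Fraction(n_na_b, n_na_b + n_na_nb))

# A = HD, B = Smoker: HD & smoker, HD & non-smoker, no HD & smoker, no HD & non-smoker.
cells = (100, 110, 100, 80)

# The symmetry rule says these two signs always agree.
print(direction_given_b(*cells), direction_given_a(*cells))
```

Both signs come out negative for this table: rate(HD | Smoker) < rate(HD | Non-smoker), and likewise rate(Smoker | HD) = 100/210 is less than rate(Smoker | No HD) = 100/180.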

BASIC RULE ON RATES

In a population, let A and B be characteristics. Denote the overall rate of A by rate(A), similarly for rate(B).

1. rate(A) always lies between rate(A|B) and rate(A|NB). Important!

2. The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B). Important!

3. If rate(A|B) = rate(A|NB), then rate(A) = rate(A|B) = rate(A|NB).

4. If rate(B) = rate(NB) = 50%, then rate(A) = [rate(A|B) + rate(A|NB)] / 2.
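
All four rules follow from writing the overall rate as a weighted average of the two conditional rates. The identity below is the standard law of total probability expressed in terms of rates; it is added here as a supporting sketch and does not appear on the original slide.

```latex
\[
\mathrm{rate}(A)
  = \mathrm{rate}(B)\,\mathrm{rate}(A \mid B)
  + \mathrm{rate}(NB)\,\mathrm{rate}(A \mid NB),
\qquad
\mathrm{rate}(B) + \mathrm{rate}(NB) = 1 .
\]
```

Because the two weights are non-negative and sum to 1, rate(A) must lie between the two conditional rates (Rule 1). As rate(B) approaches 100%, almost all the weight falls on rate(A|B) (Rule 2). If the two conditional rates are equal, the average equals their common value (Rule 3), and equal weights of 50% give the simple mean (Rule 4).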

Now let us look at the same data, but "sliced" by gender.

                  Female            Male
              HD    No HD       HD    No HD     Total
Smoker         60     20         40     80       200
Non-smoker    100     50         10     30       190
Total         160     70         50    110       390

• For Female
  rate(HD | Smoker) = 60/80 = 75.0%

• For Male
  rate(HD | Smoker) = 40/120 = 33.3%

• For All
  rate(HD | Smoker) = 100/200 = 50.0%

• Note that, as stated by Rule 1, 33.3% ≤ 50.0% ≤ 75.0%.

• Further, note that the overall HD rate among smokers is closer to the HD rate among male smokers, since there are more male smokers than female smokers (Rule 2).
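
Rules 1 and 2 can be verified on the smoker row of the sliced table. In this sketch A is HD and B is "female", within the population of smokers. It uses the weighted-average form of the overall rate, rate(A) = rate(B) × rate(A|B) + rate(NB) × rate(A|NB), which is a standard identity rather than something stated on the slide. The variable names are illustrative.

```python
# Smokers only, split by gender: number with HD and group size.
female_hd, female_n = 60, 80
male_hd, male_n = 40, 120

rate_female = female_hd / female_n              # rate(HD | female smoker)
rate_male = male_hd / male_n                    # rate(HD | male smoker)
share_female = female_n / (female_n + male_n)   # proportion of smokers who are female
share_male = 1 - share_female

# Overall rate among smokers as a weighted average of the subgroup rates.
overall = share_female * rate_female + share_male * rate_male

# Rule 1: the overall rate lies between the two subgroup rates.
rule1_holds = min(rate_female, rate_male) <= overall <= max(rate_female, rate_male)

# Rule 2: the overall rate sits closer to the rate of the larger subgroup (males).
closer_to_male = abs(overall - rate_male) < abs(overall - rate_female)

print(f"female {rate_female:.1%}, male {rate_male:.1%}, overall {overall:.1%}")
print("Rule 1 holds:", rule1_holds)
print("Overall is closer to the male rate:", closer_to_male)
```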

5 SIMPSON’S PARADOX

                    Female                   Male                     All
              HD    Total   Rate       HD    Total   Rate       HD    Total   Rate
Smoker         60     80    75.0%       40    120    33.3%      100    200    50.0%
Non-smoker    100    150    66.7%       10     40    25.0%      110    190    57.9%

• Note that the relationship between the percentages in the subgroups is reversed when the subgroups are combined (75.0% > 66.7% and 33.3% > 25.0%, but 50.0% < 57.9%). An instance of the Yule-Simpson paradox!

• This is due to the confounder "gender", which is associated with both HD and smoking.

• How to resolve? We control by slicing on gender.

• So this answers our earlier question: "Does this mean that it is better to smoke?"

  It is in fact not good to smoke: within each gender, smokers have the higher HD rate.
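
The reversal can be reproduced directly from the counts in the table. This is a minimal sketch; the data structure and the helpers `hd_rate` and `combined` are illustrative assumptions, not part of the original slides.

```python
# (HD count, group total) for each smoking status within each gender.
counts = {
    "Female": {"Smoker": (60, 80), "Non-smoker": (100, 150)},
    "Male": {"Smoker": (40, 120), "Non-smoker": (10, 40)},
}

def hd_rate(hd, total):
    """Proportion of a group with heart disease."""
    return hd / total

def combined(status):
    """Pool the genders: add up HD counts and totals for one smoking status."""
    hd = sum(counts[g][status][0] for g in counts)
    total = sum(counts[g][status][1] for g in counts)
    return hd, total

# Within each gender, compare smokers with non-smokers.
for gender, rows in counts.items():
    s = hd_rate(*rows["Smoker"])
    n = hd_rate(*rows["Non-smoker"])
    print(f"{gender}: smoker {s:.1%} vs non-smoker {n:.1%}")

# After pooling the genders, make the same comparison.
s_all = hd_rate(*combined("Smoker"))
n_all = hd_rate(*combined("Non-smoker"))
print(f"All: smoker {s_all:.1%} vs non-smoker {n_all:.1%}")

# The two smoking groups have very different gender mixes.
female_share_smokers = counts["Female"]["Smoker"][1] / combined("Smoker")[1]
female_share_nonsmokers = counts["Female"]["Non-smoker"][1] / combined("Non-smoker")[1]
print(f"Female share: smokers {female_share_smokers:.1%}, "
      f"non-smokers {female_share_nonsmokers:.1%}")
```

The last two lines show where the reversal comes from: in this data set women have higher HD rates than men, and women make up a much larger share of the non-smokers than of the smokers, which pulls the pooled non-smoker rate up.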

• A confounder in the relationship between two variables does not mean that Simpson’s Paradox will necessarily occur.

• Slicing the data into male and female subgroups, and studying the association between smoking status and outcome within each subgroup, controls for the confounder gender.

• Randomised assignment should be used whenever possible in experiments, to lower the possibility of having to deal with confounders.

• In observational studies, collecting data on possible variables that may be confounders is a good idea. But there may be too many of them! Hence only association, and not causation, may be determined.
