Chapter 7 Class Notes Comparison of Two Independent Samples: Introduction To Biostatistics
Chapter 7 Class Notes
Comparison of Two Independent Samples
In this chapter, we’ll compare means from two independently
sampled groups using HTs (hypothesis tests). As noted in Chapter 6,
there are two paradigms/settings:
(1) Sample from two existing populations: Group I (e.g., males) and Group II (e.g., females);
(2) Randomize some subjects to treatment A and some to treatment B.
In §7.1 (pp. 218‐222), the authors motivate hypothesis testing with a
randomization test applied to Example 7.1.1 concerning flexibility of 7
women and two “methods”, aerobics class (A) and dance (D). The
flexibilities and methods are as follows:
The important topic considered in this chapter is that of Hypothesis
Tests, introduced and illustrated in Section 7.2 (p.223) – and we use
this method for the rest of the text. Hypothesis testing uses a ‘proof
by contradiction’ argument and places the burden of proof on the
researcher seeking to show that the status quo is no longer the case.
For the Hypnosis example – where μ1 is the average respiratory rate
of all hypnotized male volunteers and μ2 is the average respiratory
rate of all non-hypnotized male volunteers – we may wonder if the
population means are the same: μ1 = μ2? If a priori (ahead of time)
we have no idea of which would be larger than the other, we may
wonder if the means differ: μ1 ≠ μ2? This is called a directionless
alternative since no direction is specified; we’ll take up directed
alternatives later in Section 7.5.
Here, the Null Hypothesis (H0) and Alternative Hypothesis (HA) are:
H0: μ1 = μ2 or equivalently: μ1 - μ2 = 0
HA: μ1 ≠ μ2 or equivalently: μ1 - μ2 ≠ 0
In Statistical Test of Hypotheses (denoted HT), the procedure for
assessing the compatibility of the data with H0 (“the Null”) uses a test
statistic (denoted TS); HT’s therefore answer the question of whether
the observed difference (between the sample means here) is real or
just due to sampling error. The TS here measures by how many SE’s
ȳ1 and ȳ2 differ, and thus is

$$t_s = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{SE_{(\bar{y}_1 - \bar{y}_2)}}$$
The reason for explicitly writing the “– 0” in the numerator of the
above TS is that the Null proposes that the difference of
the population means μ1 - μ2 is equal to zero.
For the Hypnosis example from Chapter 6, our data summary is:
Y = respiratory (ventilation) rate
Population I: all male volunteers who could be hypnotized
Population II: all male volunteers to whom the non-hypnotized
(control) treatment could be applied
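                 Experimental     Control
Sample size      n1 = 8           n2 = 8
Sample mean      ȳ1 = 6.169       ȳ2 = 5.291
Sample SD        s1 = 0.621       s2 = 0.652

We also found

$$SE_{(\bar{y}_1 - \bar{y}_2)} = \sqrt{\frac{0.621^2}{8} + \frac{0.652^2}{8}} = 0.31834,$$

so here,

$$t_s = \frac{(6.169 - 5.291) - 0}{0.31834} = 2.76.$$

This means that ȳ1 and ȳ2 differ by 2.76 SEs. But is that large?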
[Figure: Student’s t density (x-axis: standardized t; y-axis: Density), with the two tail areas marked: A1 to the left of -2.76 and A2 to the right of 2.76.]
Is our data consistent with H0? Subject to the needed assumptions
and requirements (and not assuming equal variances), if H0 is true,
then the sampling distribution of the TS ts is approximately a t‐
distribution with df (degrees of freedom) given by Equation 6.7.1; df =
13 here. We wonder if ts is ‘far out’ in the tail. The yardstick is the P‐
value (denoted ‘p’ and not to be confused with the Binomial ‘p’). For
the Hypnosis example, ts = 2.76 appears to be far out in the tail.
For this directionless alternative case, the P‐value of the test is the
area under the Student’s t curve in the two tails to the left of ‐ts and
the right of ts; in the above graph it is the sum of the areas A1 and A2.
The p‐value is a measure of compatibility between the data and H0.
The P‐VALUE is the probability of observing a test
statistic as extreme as or more extreme than the one
actually observed (where ‘extreme’ is defined in HA).
For the Hypnosis example, since Pr{t > 2.76} = 0.0081, the p-value is
p = 2(0.0081) = 0.0162 (obtained from the computer).
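The Hypnosis calculation can be reproduced from the summary statistics alone. What follows is a minimal sketch, not the text's own procedure: it assumes that Equation 6.7.1 is the usual Welch-Satterthwaite df formula (which does give a value that rounds down to 13 here), and it approximates the tail area by numerically integrating the Student's t density. The function names are illustrative only.

```python
import math

def welch_summary(n1, ybar1, s1, n2, ybar2, s2):
    """Return (SE, test statistic, df) for two independent samples, unequal variances."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = math.sqrt(v1 + v2)
    ts = ((ybar1 - ybar2) - 0) / se            # the "- 0" is the null difference
    # Assumed form of Equation 6.7.1 (Welch-Satterthwaite approximation)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, ts, df

def t_upper_tail(t, df, steps=2000):
    """Pr{T > t} for t >= 0, using Simpson's rule on the t density over [0, t]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def density(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps                               # steps must be even for Simpson's rule
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 0.5 - total * h / 3                  # half the area lies to the right of 0

se, ts, df = welch_summary(8, 6.169, 0.621, 8, 5.291, 0.652)
df_used = math.floor(df)                        # round df down, as in the notes
p_two_tailed = 2 * t_upper_tail(abs(ts), df_used)   # areas A1 + A2
print(f"SE = {se:.5f}, ts = {ts:.2f}, df = {df_used}, p = {p_two_tailed:.4f}")
```

Doubling the upper-tail area works because the t density is symmetric about zero, so A1 = A2.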
Making Decisions using HT (see p.227):
1. Fix α (the significance level) a priori; usually α = 5%
2. If (p-value) p < α, then H0 is rejected at the α significance level
3. If p > α, fail to reject H0 – i.e., retain H0
For the Hypnosis example, suppose we had picked α = 5% before
viewing the data. Here, the p-value (0.0162) is less than 0.05 (= α), so
we reject H0: μ1 = μ2 at the 5% significance level. Conclusion: At the
5% level, there appears to be a significant difference between the
average respiratory rate of hypnotized and non‐hypnotized males.
Note: the above p‐value was obtained using the computer; had we
used Table 4, the best we could say is that 0.01 < p < 0.02, but we still
reach the same conclusion as above since p < 0.05.
D. Bacteria Example (p.233, ex. 7.2.16). Y = bacteria colony count
μ1 = average bacterial colony count of all Petri dishes to which
sterile water (control) has been or can be added
μ2 = average bacterial colony count of all Petri dishes to which
soap has been or can be added

                 Control          Soap
Sample size      n1 = 8           n2 = 7
Sample mean      ȳ1 = 41.8        ȳ2 = 32.4
Sample SD        s1 = 15.6        s2 = 22.8

At the 5% significance level, let’s test if the means differ.
H0: μ1 = μ2 (i.e., μ1 - μ2 = 0)
HA: μ1 ≠ μ2 (i.e., μ1 - μ2 ≠ 0)
Since it turns out here that the SE of (ȳ1 - ȳ2) is 10.23, the test statistic is

$$t_s = \frac{(41.8 - 32.4) - 0}{10.23} \approx 0.91.$$

Also, the RHS of Equation 6.7.1 is 10.42 so
we use df = 10. Using Table 4 (t-table), let’s approximate the p-value:
from the table, we can bracket our TS (ts = 0.91) between 0.879 and
1.372, so the RH area is between 0.10 and 0.20, and so the p-value is
between 0.20 and 0.40 (and close to the 0.40). Regardless of the
exact p-value, we do know that p > α.
We do not “accept H0” – rather, we “fail to reject H0” or “retain H0”.
We can never “prove the null hypothesis is true”. Again quoting Carl
Sagan, “Absence of evidence is not evidence of absence”; ‘absence of
evidence’ here refers to the lack of proof of a difference in the
population means (based on the nearness of our sample means).
Conclusion: The data do not provide sufficient evidence at the 5%
level of significance to conclude that ordinary soap and sterile water
differ in terms of the average number of bacteria.
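The table-bracketing argument for the Bacteria example can be sketched in a few lines. This is an illustration under stated assumptions: the df formula is again assumed to be the Welch-Satterthwaite form of Equation 6.7.1, and the two critical values are the df = 10 entries quoted above. Working from the rounded summary statistics gives a test statistic near 0.92 rather than the 0.91 reported in the notes; the bracketing and the decision are unaffected.

```python
import math

# Summary statistics from the Bacteria example (control vs. soap)
n1, ybar1, s1 = 8, 41.8, 15.6
n2, ybar2, s2 = 7, 32.4, 22.8
alpha = 0.05

v1, v2 = s1**2 / n1, s2**2 / n2
se = math.sqrt(v1 + v2)
ts = ((ybar1 - ybar2) - 0) / se
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))   # assumed Eq. 6.7.1

# Table 4 row for df = 10, as quoted in the notes:
# upper-tail area 0.20 at t = 0.879 and upper-tail area 0.10 at t = 1.372
lower_crit, upper_crit = 0.879, 1.372
bracketed = lower_crit < ts < upper_crit

# Two-tailed p-value is twice the upper-tail area, so it lies in (0.20, 0.40)
p_low, p_high = 2 * 0.10, 2 * 0.20
decision = "retain H0" if (bracketed and p_low > alpha) else "re-check the table row"

print(f"SE = {se:.2f}, ts = {ts:.2f}, df = {df:.2f}")
print(f"{p_low:.2f} < p < {p_high:.2f}: {decision}")
```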
Section 7.3: There’s an important connection between CI’s and HT’s.
For the Hypnosis example, the 95% CI for (μ1 - μ2) was (0.190 , 1.566).
Also, since p < α = 5%, we reject H0: μ1 - μ2 = 0 and accept HA: μ1 - μ2 ≠
0. For the Bacteria example, the 95% CI for (μ1 - μ2) is (-13.5 , 32.2),
and since p > α, we retain (fail to reject) H0: μ1 - μ2 = 0.
Thus, whenever the levels match and the HT is non-directional, then
(as the text proves on p.234-5):
Retain or Fail to reject H0: μ1 - μ2 = 0 at the α level IFF the
(1-α)100% CI for (μ1 - μ2) contains zero;
Reject H0: μ1 - μ2 = 0 at the α level IFF the (1-α)100% CI for
(μ1 - μ2) does not contain zero.
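The CI-HT connection can be checked numerically for both examples. The sketch below is illustrative: the critical values 2.160 (df = 13) and 2.228 (df = 10) are the standard two-tailed 5% t-table entries and are not quoted in these notes, and the helper name is hypothetical. Using the rounded summary statistics, the Bacteria interval comes out as roughly (-13.4, 32.2), a small rounding difference from the interval quoted above.

```python
def ci_and_test(diff, se, t_crit):
    """Two-sided CI for (mu1 - mu2) and the matching two-tailed test decision."""
    half_width = t_crit * se
    ci = (diff - half_width, diff + half_width)
    reject = abs((diff - 0) / se) > t_crit      # two-tailed test at the matching level
    contains_zero = ci[0] <= 0 <= ci[1]
    return ci, reject, contains_zero

# (difference of sample means, SE, assumed t-table critical value)
hypnosis = ci_and_test(6.169 - 5.291, 0.31834, 2.160)   # df = 13
bacteria = ci_and_test(41.8 - 32.4, 10.23, 2.228)       # df = 10

for name, (ci, reject, contains_zero) in [("Hypnosis", hypnosis), ("Bacteria", bacteria)]:
    print(f"{name}: CI = ({ci[0]:.3f}, {ci[1]:.3f}), "
          f"reject H0 = {reject}, CI contains 0 = {contains_zero}")
```

In both cases exactly one of `reject` and `contains_zero` is true, since |ts| > t_crit is algebraically the same condition as zero lying outside the interval.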
With regard to interpreting α (p.236), note that α = Pr{reject H0 given
that H0 is true}. It is best illustrated using a simulation study or a
‘meta-experiment’ as on p.237. Let μ1 be the average IQ of all
Females, μ2 be the average IQ of all Males, and suppose we know
that μ1 = μ2 (i.e., that H0 is true); say μ1 = μ2 = μ. Then, repeatedly: (1)
take samples of size n1 from the Female N(μ, σ1) population and find
ȳ1, (2) take samples of size n2 from the Male N(μ, σ2) population and
find ȳ2, (3) find the corresponding test statistics and decisions. Even
though we know that H0 is true (μ1 = μ2), α*100% (e.g., 5%) of the
time we’ll make the wrong decision and conclude μ1 ≠ μ2.
In general, if we reject H0, then either
H0 is in fact false (i.e., we made a good decision), or
H0 is true and we were among the unlucky 5% who rejected H0
anyway
The probability of this latter event (α) is controlled in that it is set
before the study has started and the data are obtained.
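The meta-experiment described above is easy to simulate. The following is a rough sketch with hypothetical settings that do not come from the text: a common IQ mean of 100 and SD of 15 in both populations, 31 subjects per group, and 4000 repetitions. With equal group sizes and equal SDs the test statistic follows a t distribution with df = 60 under H0, so the standard table value 2.000 is used as the two-tailed 5% cutoff.

```python
import math
import random

random.seed(1)                 # fixed seed so the run is repeatable
MU, SIGMA = 100.0, 15.0        # hypothetical common mean and SD (H0 is true)
N = 31                         # hypothetical per-group sample size
T_CRIT = 2.000                 # two-tailed 5% critical value, df = 60 (assumed)
REPS = 4000                    # number of repeated experiments

def mean_and_var(xs):
    """Sample mean and sample variance (n - 1 in the denominator)."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rejections = 0
for _ in range(REPS):
    females = [random.gauss(MU, SIGMA) for _ in range(N)]
    males = [random.gauss(MU, SIGMA) for _ in range(N)]
    m1, v1 = mean_and_var(females)
    m2, v2 = mean_and_var(males)
    ts = ((m1 - m2) - 0) / math.sqrt(v1 / N + v2 / N)
    if abs(ts) > T_CRIT:       # a Type I error, since H0 is true by construction
        rejections += 1

type1_rate = rejections / REPS
print(f"Rejected a true H0 in {type1_rate:.1%} of {REPS} experiments")
```

The observed rejection rate should land close to 5%, which is what α = 5% promises before any data are collected.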
Making Mistakes and Good Decisions in HT:
There are two types of errors we can make – called Type I and Type II
errors, and displayed in the following table:
                                   H0 is True       H0 is False
Our Decision   Do not reject H0    Correct ☺        Type II error
               Reject H0           Type I error     Correct ☺

α = Pr{Type I error} – controlled at the outset
β = Pr{Type II error} – usually not controlled (but can be reduced
by increasing the sample size)
POWER = 1 - β = Pr{Reject H0 given that H0 is False} – “the ability
to see a difference when there really is a difference”.
As discussed in Chapter 1, the Smoking and Birthweight example on
pp.243‐7 illustrates an observational study. Here:
Response variable: baby’s birthweight
Explanatory variable: whether mother smoked during pregnancy
Extraneous variable(s): age, income, education, diet …
Observational unit (OU): a mother‐child pair
As we encounter observational studies, we need to look for and
consider possible sources of bias and then to imagine how these
would impact the study and findings. We also need to ask ourselves
whether the sample was truly random or simply a ‘sample of
convenience’; an example of the latter is Ex. 7.4.4 on p.245.
In the Smoking study, it could well be the case that diet is confounded
with whether or not a person smokes – so too might alcohol
consumption (AC) be confounded with smoking status.
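[Diagram: two possible explanations. Either smoking (S) itself leads to low birthweight (S → low birthweight, marked “???”), or S is merely associated with alcohol consumption (AC), and it is AC that leads to low birthweight.]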
Recent smoking studies – and all epidemiological studies in general –
measure many variables (e.g., amount of coffee consumed, age at
onset of menstruation, age, weight, height, blood type, religion,
education, income…), and the correct statistical analysis controls for
these additional variables. These studies do find a link between
smoking and baby birthweight.
But with all observational studies, remember
“Association is not causation.”
Why then perform observational studies? First, they may be the only way to
study something like the effects of smoking (consider the alternative).
Second, observational studies can be pooled or can give suggestions
for subsequent experiments and research.
Spurious association (pp. 247‐8): babies exposed to ultrasound in the
womb were significantly lighter than babies not exposed to
ultrasound. Does ultrasound cause reduced birthweight? Of course
not – the lurking variable in this study was whether or not a mother
was experiencing problems during pregnancy.
In Section 7.5, we address Directional Alternatives or so‐called One‐
Tailed t‐tests. Previously, we have only considered non‐directional
alternatives, HA: μ1 ≠ μ2, which lead to the so-called 2-tailed test.
Here, we consider either of these cases:
H0: μ1 = μ2              H0: μ1 = μ2
HA: μ1 < μ2      OR      HA: μ1 > μ2
Both are called one‐tailed or one‐sided t‐tests for two independent
samples. (Note in passing: regardless of the case, the equal sign
always appears in the null hypothesis.) It is important to realize that
it is legitimate to use a directional alternative only if HA is formulated
before seeing the data. As we shall see, the difference is in
calculating the associated p‐value.
Here is the One‐Tailed Alternative Procedure:
Step 1. Check the directionality of the data: see if the data deviate in
the direction specified by HA. If not, then the p-value exceeds
50% and we stop since p > α. If it does, proceed to step 2.
Step 2. The p‐value of the data is the ONE TAILED area beyond ts.
Step 3. Again, reject the null and accept HA if p < α, and retain the null
(fail to reject) if p > α.
Thus, if HA: μ1 < μ2, then we only look to the left of ts; if HA: μ1 > μ2,
then we only look to the right of ts. Also, since all our CI’s are
two-sided, the CI-HT connection is meaningless for one-tailed tests.
Do we reject the Null in each of the following four cases?
p. 258, ex. 7.5.3: H0: μ1 = μ2 and HA: μ1 > μ2
(a) ts = 3.75, df = 19, α = 0.01, decision? _________
(d) ts = 1.8, df = 7, α = 0.05, decision? _________
p. 258, ex. 7.5.4: H0: μ1 = μ2 and HA: μ1 < μ2
(c) ts = 0.4, df = 16, α = 0.10, decision? _________
(d) ts = -2.8, df = 27, α = 0.01, decision? _________
p. 259, ex. 7.5.9: Wounded Tomato Plants: Does wounding a tomato
plant improve the plant’s defense against subsequent insect attack?
The researcher’s a priori guess was “yes”, and she wanted to perform
the relevant HT at the 5% level. Here, Y = weight (in mg) of insect
(tobacco hornworm) larvae after 7 days of attack.
μ1 = average weight of all larvae on wounded plants
μ2 = average weight of all larvae on non-wounded plants
Here, we wish to test using α = 0.05:
H0: μ1 = μ2
HA: μ1 < μ2
                 Wounded          Control
Sample size      n1 = 16          n2 = 18
Sample mean      ȳ1 = 28.66       ȳ2 = 37.96
Sample SD        s1 = 9.02        s2 = 11.14

The RHS of Equation 6.7.1 equals 31.8 so df = 31. Making no
assumptions about equality of the variances we have

$$SE_{(\bar{y}_1 - \bar{y}_2)} = \sqrt{\frac{9.02^2}{16} + \frac{11.14^2}{18}} = 3.46,$$

and the test statistic is

$$t_s = \frac{(28.66 - 37.96) - 0}{3.46} = -2.69.$$

From Table 4 (t-table), 0.005 < p < 0.01; since p < α, we reject
the null. (Computer confirms that p = 0.0057 here.)
Conclusion: assuming that the respective larvae weight populations
are Normally distributed, there is significant evidence to conclude
that wounding a tomato plant does diminish average larvae growth.
[Figure: Student’s t density (x-axis: standardized t; y-axis: Density), with the one-tailed area to the left of -2.69 marked.]
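The three-step one-tailed procedure can be written as a small function and applied to the tomato data. This is a sketch under stated assumptions: the function name and the "less"/"greater" labels are hypothetical, and the cutoff 1.697 is the standard one-tailed 5% t-table entry for df = 30, used here as the nearest table row at or below df = 31. Comparing |ts| with that cutoff is equivalent to comparing the one-tailed area with α.

```python
import math

def one_tailed_decision(ts, t_crit, alternative):
    """alternative is "less" (HA: mu1 < mu2) or "greater" (HA: mu1 > mu2)."""
    # Step 1: do the data deviate in the direction specified by HA?
    in_direction = ts < 0 if alternative == "less" else ts > 0
    if not in_direction:
        return "retain H0"                  # the p-value exceeds 50%
    # Steps 2-3: the one-tailed area beyond ts is below alpha
    # exactly when |ts| exceeds the one-tailed critical value
    return "reject H0" if abs(ts) > t_crit else "retain H0"

# Wounded (group 1) vs. control (group 2) summary statistics
se = math.sqrt(9.02**2 / 16 + 11.14**2 / 18)
ts = ((28.66 - 37.96) - 0) / se
T_CRIT = 1.697                              # assumed table value, df = 30, alpha = 0.05

decision = one_tailed_decision(ts, T_CRIT, "less")
print(f"SE = {se:.2f}, ts = {ts:.2f}, decision: {decision}")
```

Had the researcher instead specified HA: μ1 > μ2, the same data would fail the Step 1 check and H0 would be retained, which is why HA must be fixed before seeing the data.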
Section 7.6 addresses the differences between significance and
importance. We tend to use phrases like: “the effect of the drug was
highly significant,” or “the wheat yields did not differ significantly
between the two fertilizers,” or “no significant toxicity was found”
(meaning the null hypothesis of no difference of the means was not
rejected). But it is the researcher (consumer) who decides on
importance and relevance – not the statistical test. For example, a
study involving 10000’s of volunteers in each of the two treatment
arms may find the drug A cancer rate of 7.2% and the drug B cancer
rate of 7.4% and the difference may be statistically significant, but
consumers (the public, and perhaps MD’s) need to decide for
themselves if the difference is important enough to switch from the
generic drug (A) to the expensive one (B).
For two treatment groups, if we can assume σ1 = σ2 = σ, then we
define the Effect Size to be

$$\frac{|\mu_1 - \mu_2|}{\sigma},$$

and we estimate this by the sample estimate

$$\frac{|\bar{y}_1 - \bar{y}_2|}{s}, \quad s = \max(s_1, s_2).$$

It is a signal-to-noise ratio.
Planning for Adequate Power is the topic of §7.7. Recall that power is
the probability of rejecting H0 when H0 is indeed false (so HA is true).
If we have H0: μ1 = μ2 vs. HA: μ1 ≠ μ2, then power is the ability to see
the difference between the means when there really is one. Thus,
power depends upon α, σ, n (total sample size) and (μ1 - μ2) or the
effect size. Taken one at a time (i.e., all other things equal) we have:
(a) Dependence on α: If for example instead of α = 5%, we choose
α = 1%, we’d reject H0 less often, so the power would drop;
(b) Dependence on σ: A larger σ means less precision and so a
drop in power – to increase power, control outside factors
(so-called “noise”) as much as possible;
(c) Dependence on n: As n increases, precision increases since SE’s
(σ/√n) drop, so power increases;
(d) Dependence on (μ1 - μ2): A large shift in the means gives an
increase in power; the investigator needs to ask him- or herself
about the magnitude of the difference s/he is looking for, and
this is captured in the effect size.
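The four dependencies in (a)-(d) can be seen numerically with a rough power calculation. The sketch below is not the method behind the text's tables: it assumes equal group sizes, treats σ as known, uses a Normal approximation in place of the t distribution, and ignores the negligible chance of rejecting in the wrong tail. The baseline values (difference 1, σ = 1, 20 per group) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()   # standard Normal distribution

def approx_power(diff, sigma, n_per_group, alpha=0.05):
    """Approximate power of the two-tailed, two-sample test (Normal approximation)."""
    se = sigma * sqrt(2 / n_per_group)          # SE of the difference of two means
    z_crit = Z.inv_cdf(1 - alpha / 2)           # two-tailed cutoff
    return Z.cdf(abs(diff) / se - z_crit)

base = approx_power(diff=1.0, sigma=1.0, n_per_group=20)
smaller_alpha = approx_power(diff=1.0, sigma=1.0, n_per_group=20, alpha=0.01)
larger_sigma = approx_power(diff=1.0, sigma=2.0, n_per_group=20)
larger_n = approx_power(diff=1.0, sigma=1.0, n_per_group=40)
larger_diff = approx_power(diff=2.0, sigma=1.0, n_per_group=20)

print(f"baseline            {base:.3f}")
print(f"(a) alpha = 1%      {smaller_alpha:.3f}  (drops)")
print(f"(b) sigma doubled   {larger_sigma:.3f}  (drops)")
print(f"(c) n doubled       {larger_n:.3f}  (rises)")
print(f"(d) shift doubled   {larger_diff:.3f}  (rises)")
```

Only the ratio diff/sigma enters the formula, which is why the effect size is the natural way to summarize (b) and (d) together.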
p.270, ex.7.7.2 – Male and Female Heights. From a previous study,
we have ȳ1 = 69.1 (males), ȳ2 = 64.1 (females), s = 2.5, so our estimate
of the effect size is (69.1 - 64.1) / 2.5 = 2.0. In a future study, we wish to test
H0: μ1 = μ2 versus HA: μ1 ≠ μ2 at the α = 5% level. How many Males
and Females do we need to sample so the power is 99%?
Using Table 5 on pp. 619‐620, we see that we need nM = 11 and nF = 11
(so 22 total subjects). For additional examples work through ex. 7.7.3
on pp.270‐1 and pp.271‐2 ex. 7.7.2 (homework exercise).
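As a rough cross-check on the table lookup, the per-group sample size can be approximated from the Normal distribution. This sketch is an approximation and not the calculation behind Table 5; the function name is hypothetical. It returns 10 per group, one fewer than the table's 11, presumably because the table is based on the t distribution, which needs slightly larger samples when n is small.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power, alpha=0.05):
    """Normal-approximation sample size per group for a two-tailed test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)      # two-tailed cutoff for alpha
    z_power = z.inv_cdf(power)              # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

effect_size = abs(69.1 - 64.1) / 2.5        # estimated effect size from the notes
n = n_per_group(effect_size, power=0.99)
print(f"effect size = {effect_size:.1f}, approximate n per group = {n}")
```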
§7.8 (Summary) reminds us of our assumptions/requirements:
Design – random samples, independence, representative;
Distribution – Normality or large n; no outliers.
If another design is used, use another statistical analysis (e.g.,
paired t-test in Chapter 8).
Violations could result in
(1) α is really higher than the specified 5%, and/or
(2) chosen t‐test is not very powerful – i.e., another more
powerful test exists and should be used.
Alternatives to the above t‐test on the raw data include:
(a) performing the t‐test on transformed data (see the text
example on pp. 274‐6), or
(b) performing a nonparametric test (see below and p.282).
§7.9 recaps hypothesis‐testing strategies. Often – but not always –
the null hypothesis is the “status quo” and the alternative hypothesis
is “what the researcher wants to show.” Recall, the “=” sign always
occurs in H0 in this course (although not so in bioequivalence studies).
Read carefully the discussion on p.279: the probability that H0 is true
cannot be calculated – this is not what the p‐value is equal to.
The Wilcoxon‐Mann‐Whitney (WMW) Test – is a two‐independent‐
sample nonparametric test discussed in §7.10 (pp.282‐289). Here,
“nonparametric” means no assumption of a distribution such as the
Normal distribution. This is an unfortunate choice of words since the
WMW test is equivalent to a test of equality of the two population
medians (which are indeed parameters).
Sample 1 measurements come from population 1 – denoted Y1
Sample 2 measurements come from population 2 – denoted Y2
H0: the populations of Y1 and Y2 are the same
HA: depends on the exercise – sometimes looking for a
difference, sometimes for a Right or Left shift
Soil Respiration and Microbial Activity Example (p.282 ex.7.10.1).
Here, Y is amount of CO2 given off from core samples (in mol/g‐
soil/hour) at two locations in a forest: (1) under an opening in the
forest canopy (“gap” locations), or (2) at a nearby area under heavy
tree growth (“growth” locations). For this example, we have:
H0: Gap and Growth areas do not differ w/r/to soil respiration.
HA: Soil respiration rates tend to be DIFFERENT in the growth
area and the gap area.
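The data are repeated in the two middle columns of the following table;
the outer columns give, for each observation, the number of observations
in the other sample that are smaller (a tie counts as 1/2):

# of GAP values    Y1: GROWTH    Y2: GAP    # of GROWTH values
smaller            data          data       smaller
5                  17            6          0
6                  20            13         0
6.5                22            14         0
8                  64            15         0
8                  170           16         0
8                  190           18         1
8                  315           22         2.5
                                 29         3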
Now add up the counts in the two “# smaller” columns: K1 = 49.5 and
K2 = 6.5 (check that the sum of these, 56 here, is equal to the product
of the above sample sizes n1 * n2 = 7 * 8 = 56 for this exercise). Then the
Wilcoxon-Mann-Whitney test statistic is
Whitney test statistic is
Us = max{K1 , K2}
Here, Us = 49.5. Since 48 < Us < 50, from Table 6 on p.621, we have
0.0093 < p‐value < 0.021
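The counting that produces K1, K2 and Us can be reproduced directly from the data. This is a minimal sketch with a hypothetical helper name; it computes only the test statistic, and the p-value still has to be bracketed from Table 6 as above.

```python
growth = [17, 20, 22, 64, 170, 190, 315]     # Y1: growth locations (n1 = 7)
gap = [6, 13, 14, 15, 16, 18, 22, 29]        # Y2: gap locations (n2 = 8)

def count_smaller(sample, other):
    """For each value in sample, count values in other that are smaller (ties count 1/2)."""
    return [sum(1.0 if o < y else 0.5 if o == y else 0.0 for o in other)
            for y in sample]

k1 = sum(count_smaller(growth, gap))         # total of the "# of GAP smaller" column
k2 = sum(count_smaller(gap, growth))         # total of the "# of GROWTH smaller" column
us = max(k1, k2)                             # Wilcoxon-Mann-Whitney test statistic

# Every (growth, gap) pair contributes exactly 1 in total, so K1 + K2 = n1 * n2
assert k1 + k2 == len(growth) * len(gap)
print(f"K1 = {k1}, K2 = {k2}, Us = {us}")
```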
The following Box‐Plot and MTB output confirm these results.
[Figure: Boxplot of Soil Respiration vs Location Type (gap, growth).]