Statistical Inference
Serik Sagitov
[Front-page diagram: the relationship between STATISTICAL INFERENCE and PROBABILITY THEORY, where probability theory predicts data patterns for a given parametric model.]
1 Parametric models
1.1 Normal distribution
1.2 One-way and two-way layout models
1.3 Sample mean, sample variance, and t-distributions
1.4 Gamma, exponential, and chi-squared distributions
1.5 Bernoulli, binomial, and multinomial distributions
1.6 Poisson, geometric, and hypergeometric distributions
1.7 Exercises
2 Random sampling
2.1 Point estimation
2.2 Approximate confidence intervals
2.3 Simple random sample
2.4 Dichotomous data
2.5 Stratified random sampling
2.6 Exact confidence intervals
2.7 Exercises
3 Parameter estimation
3.1 Method of moments
3.2 Maximum likelihood estimation
3.3 Maximum likelihood estimation for the gamma distribution
3.4 Exercises
4 Hypothesis testing
4.1 Statistical significance
4.2 Large-sample test for the proportion
4.3 P-value of the test
4.4 Likelihood-ratio test
4.5 Chi-squared test of goodness of fit
4.6 Exercises
5 Bayesian inference
5.1 Conjugate priors
5.2 Bayesian estimation
5.3 Credibility interval
5.4 Bayesian hypothesis testing
5.5 Exercises
6 Summarising data
6.1 Empirical probability distribution
6.2 Quantiles and QQ-plots
6.3 Density estimation
6.4 Skewness and kurtosis
6.5 Inference on the population median
6.6 Measures of dispersion
6.7 Exercises
7 Comparing two samples
7.1 Two independent samples
7.2 Two independent samples: comparing population proportions
7.3 Paired samples
7.4 Paired samples: comparing population proportions
7.5 List of statistical tests
7.6 Exercises
8 Analysis of variance
8.1 One-way layout ANOVA
8.2 Simultaneous confidence interval
8.3 Kruskal-Wallis test
8.4 Two-way layout ANOVA
8.5 Additive model
8.6 Exercises
10 Multiple regression
10.1 Simple linear regression model
10.2 Confidence intervals and hypothesis testing
10.3 Multiple linear regression
10.4 Coefficients of multiple determination and model utility tests
10.5 Exercises
Introduction
Statistical analysis consists of three parts: collection of data, summarising data, and making inferences. The main
focus of this text is on the key tools of statistical inference: parameter estimation and hypothesis testing based upon
properly collected, relatively small data sets. Special attention, therefore, is paid to the basic principles of experimental
design: randomisation, blocking, and replication. The reader will get a deeper understanding of some traditional topics
in mathematical statistics such as methods based on likelihood, aspects of experimental design, non-parametric testing,
analysis of variance, introduction to Bayesian inference, chi-squared tests, and multiple regression.
It is expected that the reader is familiar with the basics of probability theory. The diagram on the front page
illustrates the relationship between probability theory and statistical inference. An appreciation of these two different
perspectives is provided by the following statements.
probability theory. Previous studies showed that the drug was 80% effective. We can anticipate that
for a study on 100 patients, on average 80 will be cured and at least 65 will be cured with probability 0.9999.
statistical inference. It was observed that 78 out of 100 patients were cured. We are 95% confi-
dent that for other similar studies, the drug will be effective on between 69.9% and 86.1% of patients.
Acknowledgements
This text has grown from the lecture notes for the undergraduate course "Statistical Inference", given at the Chalmers
University of Technology and the University of Gothenburg. The author is grateful for constructive feedback from the
students who have taken this course. Special thanks go to Professor Aila Särkkä for many insightful comments and
corrections. The course material was originally based on the second edition of the book "Mathematical statistics and
data analysis" by John Rice. A great number of examples and exercises included in this compendium are borrowed
from Rice’s textbook.
Sufficient statistics for population parameters.
Exact confidence intervals for the mean and variance. Chi-squared distribution, t-score and t-distribution.
Statistical hypotheses, simple and composite hypotheses, null hypothesis and alternative hypothesis.
Test statistic, rejection region. Two types of error.
Significance level, test power, effect size.
P-value of the test, one-sided and two-sided p-values.
Large-sample tests for the mean and for the proportion. One-sample t-test. The binomial test.
Nested hypotheses, generalised likelihood ratio test.
Chi-squared test of goodness of fit, its approximate nature. Multinomial distribution.
Bayes formulas for probabilities and densities. Prior and posterior distributions.
Conjugate priors. Normal-normal model.
Beta and Dirichlet distributions. Beta-binomial model and Dirichlet-multinomial model.
Bayesian estimation, zero-one loss function and squared error loss. Posterior risk.
Maximum a posteriori estimate (MAP) and posterior mean estimate (PME). Credibility interval.
Posterior odds. Bayesian hypothesis testing.
Categorical data.
Dichotomous data. Cross classification.
Chi-squared tests of homogeneity and independence.
Prospective and retrospective studies. Matched-pairs design, McNemar’s test. Odds ratio.
Chapter 1
Parametric models
Notationally, we distinguish between random variables X, Y, Z and their realisations x, y, z by consistently using either
capital or small letters. For a random variable X, we usually denote its mean value and variance by
µ = E(X),   σ² = Var(X).
The mean value is computed as µ = ∫ x f(x) dx or µ = Σᵢ xᵢ pᵢ, depending on whether the distribution of X is continuous, with
the probability density function f(x), or discrete, with the probability mass function pᵢ = P(X = xᵢ). The difference (X − µ) is called the deviation from the mean, and the
variance of X is defined by
Var(X) = E((X − µ)²).
The square root of the variance, σ, is called the standard deviation of X.
By X ∼ F(µ, σ) we will mean that X has a distribution with mean µ and standard deviation σ. The
symbol F(µ, σ) will often denote the so-called population distribution.
If (Z1 , Z2 ) are the standardised versions of (X1 , X2 ), then the correlation coefficient for (X1 , X2 ) is given by ρ = Cov(Z1 , Z2 ).
The correlation coefficient is used as a measure of linear dependence between a pair of random variables.
In general, the correlation coefficient belongs to the interval −1 ≤ ρ ≤ 1. In particular, if X1 and X2 are independent
random variables, then ρ = 0, and if X2 = a + bX1 , then ρ = ±1 depending on the sign of the non-zero slope b.
1.1 Normal distribution
The normal distribution N(µ, σ) has the probability density function
f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞.
The corresponding z-score Z = (X − µ)/σ has the so-called standard normal distribution N(0, 1). The cumulative distribution function of N(0, 1) is assigned a special notation
Φ(x) = P(Z ≤ x),   −∞ < x < ∞.
The values of Φ(x) for x ≥ 0 may be found using the table in Section 11.1. Importantly, if X ∼ N(µ, σ), then
P(X ≤ x) = P((X − µ)/σ ≤ (x − µ)/σ) = P(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ).
The next figure presents four probability density functions N(0, 0.5), N(0, 2), N(−2, 1), and N(2, 0.5). All normal
distribution curves have the same shape and differ only in their location parameter µ and their scale parameter σ.
The key role of the normal model in statistical inference is due to the central limit theorem. Let (X1 , . . . , Xn )
be independent and identically distributed random variables with mean µ and variance σ². The arithmetical average
X̄ = (X1 + . . . + Xn)/n
represents the sample mean. According to the central limit theorem, for a large sample size n,
X̄ ≈ N(µ, σ/√n),
meaning that the random variable X̄ is asymptotically normally distributed with mean µ and variance σ²/n.
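The central limit theorem is easy to check by simulation. The following sketch (Python with numpy, which is our assumption rather than part of the text; the Exp(1) population is an arbitrary choice) compares the mean and standard deviation of simulated sample means with µ and σ/√n:

import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                     # mean and sd of the Exp(1) population
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
print(xbar.mean())                       # close to mu = 1
print(xbar.std(), sigma / np.sqrt(n))    # both close to 1/sqrt(50) = 0.141

A histogram of the simulated values of X̄ would likewise be close to the N(µ, σ/√n) density curve.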
1.2 One-way and two-way layout models
More generally, suppose that we are given k ≥ 2 normally distributed random variables
X1 ∼ N(µ1 , σ1 ), . . . , Xk ∼ N(µk , σk ).
Define the response variable Y as Xi with a random index i taking one of the values 1, . . . , k with probabilities
w1 , . . . , wk , so that
w1 + . . . + wk = 1.
This yields the following expressions for the mean µ = E(Y ) and variance σ² = Var(Y ):
µ = w1 µ1 + . . . + wk µk ,
σ² = Σ_{j=1}^k wj (µj − µ)² + Σ_{j=1}^k wj σj².
The above expression for σ² is due to the law of total variance, see Wikipedia, which recognises two sources of variation:
variation between the strata, Σ_{j=1}^k wj (µj − µ)²,
variation within the strata, Σ_{j=1}^k wj σj².
µi = µ + αi ,   αi = µi − µ,
and
Σ_{i=1}^I δij = 0, j = 1, . . . , J,   Σ_{j=1}^J δij = 0, i = 1, . . . , I.
In general, at different combinations of levels (i, j), the two factors may interact either negatively, δij < 0, or
positively, δij > 0.
Additive model
In the special case, with δij = 0 for all (i, j), the model claims that there is no interaction and the main factors
contribute additively:
µij = µ + αi + βj .
1.3 Sample mean, sample variance, and t-distributions
Suppose we are given a vector (X1 , . . . , Xn ) of independent random variables having the same distribution F(µ, σ).
A realisation of this vector (x1 , . . . , xn ) will be called (with a slight abuse of the established terminology) a random
sample drawn from the population distribution F(µ, σ) with population mean µ and population standard deviation
σ. For the given sample (x1 , . . . , xn ) define the sample mean, sample variance, and sample standard deviation by
x̄ = (x1 + . . . + xn)/n,   s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,   s = √(((x1 − x̄)² + . . . + (xn − x̄)²)/(n − 1)).
An alternative formula for the sample variance,
s² = (n/(n − 1)) (\overline{x²} − x̄²),   where \overline{x²} = (x1² + . . . + xn²)/n,
is often more convenient to use for pen and paper calculations. The sample mean x̄ and sample variance s² are
realisations of the random variables
X̄ = (X1 + . . . + Xn)/n,   S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²,
which have the following means and variances:
E(X̄) = µ,   Var(X̄) = σ²/n,   E(S²) = σ²,   Var(S²) = (σ⁴/n) (E[((X − µ)/σ)⁴] − (n − 3)/(n − 1)).
The connection between the t-distribution and the standard normal distribution can be described in the following
way: if Z, Z1 , . . . , Zk are independent random variables with N(0, 1)-distribution, then
Z/√((Z1² + . . . + Zk²)/k) ∼ tk .
1.4 Gamma, exponential, and chi-squared distributions
The gamma distribution Gam(α, λ) is a continuous distribution described by two parameters: the shape parameter
α > 0 and the inverse scale (or the rate) parameter λ > 0. Its probability density function has the form
f(x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx},   x > 0.
The next figure depicts
on the left panel, the gamma densities with λ = 1 and α = 0.9, 1, 2, 3, 4,
on the middle panel, the gamma densities with α = 1 and λ = 0.5, 1, 2, 3,
on the right panel, the gamma densities with λ = 1 and α = 10, 15, 20, 25, 30.
The gamma distribution model, despite being restricted to positive values, is more flexible than the normal distri-
bution model since for the different values of α, the density curves have different shapes. In particular, if α = 1, then
we obtain the exponential distribution (see the middle panel above)
Gam(1, λ) = Exp(λ).
Moreover, if Xi ∼ Exp(λ), i = 1, . . . , k are independent, then
X1 + . . . + Xk ∼ Gam(k, λ), k = 1, 2, . . .
The mean and variance of the gamma distribution are
µ = α/λ,   σ² = α/λ².
For large values of the shape parameter, there is a useful normal approximation for the gamma distribution:
Gam(α, λ) ≈ N(α/λ, √α/λ),   α ≫ 1.
The chi-squared distribution with k degrees of freedom is the gamma distribution with α = k/2, λ = 1/2. The figure
below depicts the chi-squared distribution densities with k = 2, 3, 4, 5, 6 degrees of freedom.
The chi-squared distribution is connected to the standard normal distribution as follows: if Z1 , . . . , Zk are independent
random variables with N(0, 1)-distribution, then
Z1² + . . . + Zk² ∼ χ²_k .
Importantly, if (X1 , . . . , Xn ) are independent random variables each having the N(µ, σ) distribution, then
Σ_{i=1}^n (Xi − X̄)²/σ² ∼ χ²_{n−1}.
Here, the number of degrees of freedom is n − 1 instead of n, because one degree of freedom is consumed when µ is
replaced by X̄.
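As a quick numerical check of the identification χ²_k = Gam(k/2, 1/2), the following sketch uses scipy (our assumption, not part of the text; note that scipy parametrises the gamma distribution by the shape α and the scale 1/λ):

from scipy.stats import chi2, gamma

k, x = 5, 3.7                          # degrees of freedom and an arbitrary point
print(chi2.pdf(x, df=k))               # chi-squared density with k degrees of freedom
print(gamma.pdf(x, a=k/2, scale=2))    # Gam(k/2, 1/2) density, since scale = 1/λ = 2

Both lines print the same value.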
1.5 Bernoulli, binomial, and multinomial distributions
Let X be the outcome of a Bernoulli trial with probability of success p, meaning that
P(X = 1) = p, P(X = 0) = 1 − p.
In this case, we write X ∼ Bin(1, p) and say that X has the Bernoulli distribution with parameter p ∈ [0, 1]. The
Bernoulli model is used for describing dichotomous data, when observations have two possible outcomes: female or
male, heads or tails, passed or failed. Under such a dichotomy, one outcome is usually called a success and the other
failure, and the value x = 1 is assigned to the successful outcome. The mean and variance of X ∼ Bin(1, p) are
µ = p,   σ² = p(1 − p).
If X is the sum of outcomes of n independent Bernoulli trials with probability p of success, then its distribution is
called the binomial distribution Bin(n, p). This is a discrete distribution with the probability mass function
P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, . . . , n,
yielding the following formulas for the mean and variance
µ = np,   σ² = np(1 − p).
The normal approximation for the binomial distribution,
Bin(n, p) ≈ N(np, √(np(1 − p))),
is an instrumental example of the central limit theorem. The rule of thumb says that this normal approximation is
good enough if both np ≥ 5 and n(1 − p) ≥ 5, so that
P(X ≤ x) ≈ Φ((x − np)/√(np(1 − p))).
Continuity correction
For smaller values of n, this approximation is improved with help of the continuity correction trick, which in view of
the equality
P(X ≤ x) = P(X < x + 1),
suggests replacing x by x + 1/2 on the right hand side of the approximation formula
P(X ≤ x) ≈ Φ((x − np)/√(np(1 − p))),
resulting in
P(X ≤ x) ≈ Φ((x + 1/2 − np)/√(np(1 − p))),
or similarly,
P(X < x) ≈ Φ((x − 1/2 − np)/√(np(1 − p))).
Example
To illustrate, we plot the Bin(10, 1/3) distribution together with its normal approximation: on the left panel without
the continuity correction and on the right panel with the continuity correction. Observe that n = 10 is so small that
with p = 1/3 we have np = 3.33, which is smaller than the recommended lower bound 5. Still, the normal approximation
with the continuity correction is quite close.
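The numbers behind this example are easy to reproduce; a minimal sketch assuming scipy is available:

import numpy as np
from scipy.stats import binom, norm

n, p = 10, 1/3
mu, sd = n * p, np.sqrt(n * p * (1 - p))
for x in range(n + 1):
    exact = binom.cdf(x, n, p)                     # exact P(X <= x)
    plain = norm.cdf((x - mu) / sd)                # normal approximation
    corrected = norm.cdf((x + 0.5 - mu) / sd)      # with continuity correction
    print(x, round(exact, 3), round(plain, 3), round(corrected, 3))

The corrected column tracks the exact probabilities noticeably better than the plain one.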
The multinomial distribution (X1 , . . . , Xr ) ∼ Mn(n; p1 , . . . , pr ) is defined by
P(X1 = x1 , . . . , Xr = xr ) = \binom{n}{x1 , . . . , xr} p1^{x1} · · · pr^{xr},
where
xi = 0, . . . , n,   i = 1, . . . , r,
and (p1 , . . . , pr ) is a vector of probabilities such that
p1 + . . . + pr = 1.
This is an extension of the binomial distribution Bin(n, p) = Mn(n; p, 1 − p). The Mn(n; p1 , . . . , pr ) distribution
describes the outcome of n independent trials with r possible outcomes labeled by i = 1, . . . , r. If each trial outcome
has distribution (p1 , . . . , pr ) over the set of possible labels {1, . . . , r}, then Xi should be treated as the number of trials
with the outcome labeled by i. We have
X1 + . . . + Xr = n,
the marginal distribution of Xi is binomial Xi ∼ Bin(n, pi ), and the different counts (Xi , Xj ) are negatively correlated:
Cov(Xi , Xj ) = −n pi pj ,   i ≠ j.
Geometric distribution
Consider a sequence of independent Bernoulli trials with probability p of success. The geometric distribution is either
one of two discrete probability distributions:
the distribution of the number X of trials needed to get one success,
the distribution of the number Y = X − 1 of failures before the first success.
Which of these is called the geometric distribution is a matter of convention and convenience. Often, the name shifted
geometric distribution is adopted for the distribution of X supported on the set {1, 2, . . .}. To avoid ambiguity, in this
text, we say that X has the geometric distribution with parameter p and write X ∼ Geom(p) if
P(X = x) = (1 − p)^{x−1} p,   x = 1, 2, . . . ,
yielding the mean and variance formulas
µ = 1/p,   σ² = (1 − p)/p².
Like its continuous analogue, the exponential distribution, the geometric distribution is memoryless: the die one throws
or the coin one tosses does not have a memory of how many failures have been observed so far.
Hypergeometric distribution
The hypergeometric distribution X ∼ Hg(N, n, p) describes the number x of black balls among n balls drawn without
replacement from a box with N balls, of which
B = N p balls are black and
W = N (1 − p) balls are white.
In this case, X is the number of successes in n Bernoulli trials which depend on each other. The distribution of X is
given by the formula
P(X = x) = \binom{B}{x} \binom{W}{n − x} / \binom{N}{n},
with the mean and variance
µ = np,   σ² = np(1 − p)(1 − (n − 1)/(N − 1)),
where the factor 1 − (n − 1)/(N − 1) is called the finite population correction factor. With a small fraction value n/N , the finite population correction
is close to 1 and Hg(N, n, p) ≈ Bin(n, p). One may say that the binomial distribution is a version of the hypergeometric
distribution with the infinite population size.
Despite the dependence between the drawings without replacement, there is a normal approximation also for the
hypergeometric distribution:
Hg(N, n, p) ≈ N(µ, σ),   µ = np,   σ = √(np(1 − p)) √(1 − (n − 1)/(N − 1)),
which is recommended to be applied provided np ≥ 5 and n(1 − p) ≥ 5. In the figure below, the Hg(100, 10, 0.5)
distribution is compared to its normal approximation with the continuity correction. Again, with the continuity
correction the normal approximation works well even for smaller values of np and n(1 − p).
1.7 Exercises
Problem 1
For any pair of random variables (X1 , X2 ) with means (µ1 , µ2 ), show that
Var(Xi ) = E(Xi²) − µi²,   Cov(X1 , X2 ) = E(X1 X2 ) − µ1 µ2 .
Problem 2
Let
Z1 = (X1 − µ1)/σ1 ,   Z2 = (X2 − µ2)/σ2
be the standardised versions of X1 and X2 . Verify that E(Zi ) = 0 and Var(Zi ) = 1. Show that the correlation
coefficient for X1 and X2 is given by
ρ = E (Z1 Z2 ) ,
and explain in what sense it is a dimensionless quantity.
Problem 3
For (X1 , . . . , Xr ) ∼ Mn(n; p1 , . . . , pr ), what is the distribution of the sum Xi + Xj assuming i ≠ j?
Problem 4
Let X ∼ Gam(α, λ). To see that the parameter λ influences only the scale of the gamma distribution, show that the
scaled random variable Y = λX has the gamma distribution Gam(α, 1).
Problem 5
Show that
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ).
Problem 6
The average number of goals in a World Cup soccer match is approximately 2.5 and the Poisson model is appropriate.
Compute the probabilities of k goals in a match for k = 0, 1, 2.
Problem 7
Show that for large N ,
Hg(N, n, p) ≈ Bin(n, p).
Problem 8
Consider a truncated version of the geometric distribution Geom(n, p) such that for X ∼ Geom(n, p),
P(X = x) = (1 − p)^x p,   x = 0, 1, . . . , n − 1,   P(X = n) = 1 − Σ_{x=0}^{n−1} P(X = x).
Chapter 2
Random sampling
Statistical inference is the use of data analysis for inferring relevant statistical patterns in a large population with help
of a random sample drawn from the population in question.
Picking at random one element from the population produces a realisation x of
a random variable X ∼ F(µ, σ) having the population distribution.
In many situations, studying the population distribution by enumeration is either very expensive or even impossible.
Luckily, a good guess is available by studying a sample of n observations (x1 , . . . , xn ) drawn independently from the
population distribution F(µ, σ). Such a random sample is a single realisation of the vector (X1 , . . . , Xn ) of independent
and identically distributed random variables. If the sampling experiment is repeated, the new realisation (x′1 , . . . , x′n )
will differ from (x1 , . . . , xn ).
Randomisation in sampling protects against the investigator's biases, even unconscious ones.
Any function g(x1 , . . . , xn ) of the sample data is called a statistic. The most important examples of statistics are
the sample mean and sample variance
x̄ = (x1 + . . . + xn)/n,   s² = ((x1 − x̄)² + . . . + (xn − x̄)²)/(n − 1).
2.1 Point estimation
Suppose the population distribution is modelled parametrically, for example by a gamma distribution Gam(α, λ)
with unknown parameters. Given a random sample (x1 , . . . , xn ) drawn from the Gam(α, λ) distribution, one can try to
estimate the shape parameter α using a relevant statistic g(x1 , . . . , xn ).
More generally, to estimate a population parameter θ based on a given random sample (x1 , . . . , xn ), we need a
sensible point estimate θ̂ = g(x1 , . . . , xn ). Observe that, in the same way as (x1 , . . . , xn ) is a realisation of a random
vector (X1 , . . . , Xn ), the point estimate θ̂ is a realisation of a random variable
Θ̂ = g(X1 , . . . , Xn ),
which we will call a point estimator of θ. The distribution of the random variable Θ̂ is called the sampling distribution
of the point estimator. The quality of the point estimator Θ̂ is measured by the mean square error
E((Θ̂ − θ)²) = Var(Θ̂) + (E(Θ̂) − θ)².
If the mean square error vanishes, E((Θ̂ − θ)²) → 0 as n → ∞, the point estimate θ̂ is called consistent. If E(Θ̂) = θ,
the estimate is called unbiased. The standard deviation σ_Θ̂ = √Var(Θ̂) of the estimator Θ̂ is called the standard error
of the point estimate θ̂.
Remark on terminology
Consistency is a feature of an estimator. Sometimes we write "consistent estimate", meaning that the estimate is a value
of a consistent estimator. The same holds for unbiased estimators and estimates.
Since E(X̄) = µ and Var(X̄) = σ²/n → 0 as n → ∞, we conclude that the sample mean x̄ is an unbiased and consistent estimate of the population mean θ = µ. Furthermore,
the sample variance s² is an unbiased and consistent estimate of the population variance θ = σ².
The estimated standard error for the sample mean x̄ is given by s_x̄ = s/√n.
Notice that the sample standard deviation s systematically underestimates the population standard deviation σ since
E(S) < σ, provided Var(S) > 0. To see this, observe that (E(S))² < E(S²), where E(S²) = σ². However, s is an
asymptotically unbiased and consistent estimate of σ.
2.2 Approximate confidence intervals
By the central limit theorem, for a large sample size n,
X̄ ≈ N(µ, σ/√n),   (X̄ − µ)/(S/√n) ≈ N(0, 1).
This yields
P(X̄ − zS/√n < µ < X̄ + zS/√n) ≈ P(−zσ/√n < X̄ − µ < zσ/√n) ≈ 2Φ(z) − 1,
giving the following formula of an approximate 100(1 − α)% two-sided confidence interval for the population mean µ:
Iµ ≈ x̄ ± z(α/2) · s_x̄ ,
or equivalently,
Iµ ≈ x̄ ± z(α/2) · s/√n.
Here z(α) is obtained from the normal distribution table using the relation Φ(z(α)) = 1 − α.
The higher the confidence level 100(1–α)%, the wider is the confidence interval Iµ .
On the other hand, the larger the sample size n, the narrower is Iµ .
The exact meaning of the confidence level is a bit tricky. It is important to realise that the source of randomness
in the expression x̄ ± z(α/2) · s_x̄ is in the sampling procedure. For a given sample (x1 , . . . , xn ), the concrete interval Iµ
is a realisation of a random interval Iµ such that P(µ ∈ Iµ ) ≈ 1 − α.
For example, out of a hundred 95% confidence intervals Iµ computed for 100 samples, on average 95 intervals are
expected to cover the true value of µ. Notice that in this case, the random number of the confidence intervals covering
µ has distribution Bin(100, 0.95) which is approximately normal N(95, 2.18).
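This frequency interpretation can be verified by simulation; a minimal sketch (numpy assumed; the N(10, 2) population and the sample size 30 are illustrative choices, not from the text):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 30, 10_000
covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    se = x.std(ddof=1) / np.sqrt(n)        # estimated standard error s/sqrt(n)
    covered += abs(x.mean() - mu) < 1.96 * se
print(covered / reps)                      # close to 0.95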
2.3 Simple random sample
Two sampling designs should be distinguished:
- sampling with replacement produces what we call a random sample consisting of independent and identically
distributed observations,
- sampling without replacement produces a so-called simple random sample having identically distributed but
dependent observations.
Notice that if the ratio n/N is small, then the two approaches are almost indistinguishable.
In this section we consider a simple random sample (x1 , . . . , xn ) with dependent observations. In this case the
sample mean x̄ is again an unbiased and consistent estimate for the population mean, such that
E(X̄) = µ,   Var(X̄) = (σ²/n)(1 − (n − 1)/(N − 1)).
Here N is the finite population size and the finite population correction
1 − (n − 1)/(N − 1) = (N − n)/(N − 1)
reflects the negative correlation between observations due to sampling without replacement.
Observe that the sample variance s² becomes a biased estimate of σ². Indeed, since
E(S²) = (n/(n − 1)) E((X1² + . . . + Xn²)/n − X̄²) = (n/(n − 1)) (E(X²) − E(X̄²))
= (n/(n − 1)) (σ² + µ² − (σ²/n)(1 − (n − 1)/(N − 1)) − µ²) = (n/(n − 1)) σ² (1 − (1/n)(1 − (n − 1)/(N − 1))),
we find that
E(S²) = σ² N/(N − 1).
Replacing σ² by s² (N − 1)/N in the formula Var(X̄) = (σ²/n)(1 − (n − 1)/(N − 1)), we obtain the following unbiased estimate of Var(X̄):
s²_x̄ = (s²/n) ((N − 1)/N)(1 − (n − 1)/(N − 1)) = (s²/n)(1 − n/N).
Thus, for the sampling without replacement, the formula for the estimated standard error of the sample mean x̄ takes
the form
s_x̄ = (s/√n) √(1 − n/N).
With this new formula for the estimated standard error sx̄ , the formula of an approximate 100(1–α)% confidence
interval
Iµ ≈ x̄ ± z(α/2)sx̄ ,
remains valid even for sampling without replacement, due to the central limit theorem under weak dependence.
The figure below plots the sampling distribution of the sample mean and its normal approximation. Here the
sampling is performed without replacement from a finite population.
2.4 Dichotomous data
Consider the important case when the population distribution is a Bernoulli distribution
F(µ, σ) = Bin(1, p)
with the unknown parameter p. A random sample drawn from the Bernoulli distribution consists of 0 or 1 values
xi ∈ {0, 1}, therefore, the data (x1 , . . . , xn ) will be called dichotomous. This model can be used for categorical data,
where the observed values are of non-numerical form, like male or female, after converting the non-numerical values
to the 0-1 format: female = 1, male = 0.
The defining parameter of the Bernoulli model,
p = P(X = 1),
will be called the population proportion. The population proportion p defines both the population mean and
variance by
µ = p,   σ² = p(1 − p),
and the sample mean turns into a sample proportion p̂ = x̄. The sample proportion is an unbiased and consistent
estimate of p. Since xi² = xi for xi ∈ {0, 1}, we find that the sample variance takes the form
s² = ((x1 − p̂)² + . . . + (xn − p̂)²)/(n − 1)
= (x1² − 2x1 p̂ + p̂² + . . . + xn² − 2xn p̂ + p̂²)/(n − 1)
= (x1 − 2x1 p̂ + p̂² + . . . + xn − 2xn p̂ + p̂²)/(n − 1)
= (np̂ − 2np̂² + np̂²)/(n − 1) = np̂(1 − p̂)/(n − 1).
The estimated standard error for the sample proportion p̂ is s_p̂ = √(p̂(1 − p̂)/(n − 1)).
Indeed, the standard error of the sample proportion p̂ with n = 2000 and N = 7000000 equals
s_p̂ = √(p̂(1 − p̂)/(n − 1)) √(1 − n/N) = 0.0224 √(p̂(1 − p̂)).
2.5 Stratified random sampling
Suppose the population consists of k strata, with the j-th stratum having the relative size wj and the stratum mean and standard deviation
(µj , σj ),   j = 1, . . . , k.
Then we get the following expressions for the population mean and variance:
µ = w1 µ1 + . . . + wk µk ,
σ² = Σ_{j=1}^k wj σj² + Σ_{j=1}^k wj (µj − µ)².
In view of
w1 + . . . + wk = 1,
the above formulas for µ and σ² are obtained using the law of total expectation and the law of total variance, see Wikipedia.
The expression for σ² is due to the law of total variance, with the within-strata variation Σ_{j=1}^k wj σj² and the between-strata variation Σ_{j=1}^k wj (µj − µ)².
The stratified random sampling procedure consists of taking k independent random samples from each stratum
with sample sizes (n1 , . . . , nk ) and sample means x̄1 , . . . , x̄k .
The stratified sample mean is the weighted average of k sample means x̄s = w1 x̄1 + . . . + wk x̄k .
where the total sample size is n = n1 + . . . + nk . The variance of the stratified sample mean, Var(X̄s ) = w1²σ1²/n1 + . . . + wk²σk²/nk , is estimated by
s²_x̄s = w1² s²_x̄1 + . . . + wk² s²_x̄k = w1²s1²/n1 + . . . + wk²sk²/nk ,
where sj is the sample standard deviation corresponding to the sample mean x̄j .
A confidence interval for the mean based on a stratified sample: Iµ ≈ x̄s ± z(α/2) · s_x̄s .
Suppose we are allowed to collect n observations from the stratified population of size N and assume that n ≪ N .
What would be the optimal allocation (n1 , . . . , nk ) of n observations among different strata, minimising the sampling
error s_x̄s of x̄s ? The solution of this optimisation problem is given by the formula nj = n wj σj /σ̄, j = 1, . . . , k, where σ̄
is the average standard deviation
σ̄ = w1 σ1 + . . . + wk σk .
The stratified sample mean X̄so with the optimal allocation nj = n wj σj /σ̄ has the
smallest variance, Var(X̄so ) = σ̄²/n, among all allocations of n observations.
The optimal allocation assigns more observations to larger strata and strata with larger variation. The major drawback
of the optimal allocation formula is that it requires the knowledge of the standard deviations σj .
If σj are unknown, which is often the case, then a sensible choice is to allocate the observations proportionally to
the strata sizes, so that nj = nwj . Observe that with the proportional allocation, the stratified sample mean is equal
to the usual sample mean:
x̄sp = w1 x̄1 + . . . + wk x̄k = (n1 /n) x̄1 + . . . + (nk /n) x̄k = (x1 + . . . + xn)/n = x̄.
However, this is not the mean of a random sample with independently allocated observations, since the n observations
are deterministically allocated among the k strata proportionally to the strata sizes. For a truly random sample, the sample
sizes n1 , . . . , nk are the outcome of a random allocation of n observations among k strata following the multinomial
distribution Mn(n; w1 , . . . , wk ).
The stratified sample mean X̄sp for the proportional allocation nj = nwj has the variance Var(X̄sp ) = (w1 σ1² + . . . + wk σk²)/n.
Comparing the three unbiased estimates of the population mean (x̄so , x̄sp , x̄), we find that their variances are
ordered in the following way
Var(X̄so ) ≤ Var(X̄sp ) ≤ Var(X̄),
since
(w1 σ1 + . . . + wk σk )²/n ≤ (w1 σ1² + . . . + wk σk²)/n ≤ σ²/n.
Variability of σj across the different strata makes the optimal allocation more effective than the proportional one:
Var(X̄sp ) − Var(X̄so ) = (1/n)(w1 σ1² + . . . + wk σk² − σ̄²) = (1/n) Σ_{j=1}^k wj (σj − σ̄)².
Variability of µj across the strata makes the proportional allocation more effective than the sample mean produced
by a random sample:
Var(X̄) − Var(X̄sp ) = (1/n)(σ² − w1 σ1² − . . . − wk σk²) = (1/n) Σ_{j=1}^k wj (µj − µ)².
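The three variance formulas can be compared numerically; a sketch with made-up strata (numpy assumed; the weights, means and standard deviations below are illustrative, not from the text):

import numpy as np

w = np.array([0.5, 0.3, 0.2])            # strata weights
mus = np.array([10.0, 20.0, 40.0])       # strata means
sds = np.array([2.0, 4.0, 8.0])          # strata standard deviations
n = 100

mu = (w * mus).sum()
sigma2 = (w * sds**2).sum() + (w * (mus - mu)**2).sum()   # law of total variance
sd_bar = (w * sds).sum()                                   # average standard deviation

var_opt = sd_bar**2 / n                  # optimal allocation
var_prop = (w * sds**2).sum() / n        # proportional allocation
var_srs = sigma2 / n                     # simple random sampling
print(var_opt, var_prop, var_srs)        # printed in increasing order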
2.6 Exact confidence intervals
Previously we obtained the approximate confidence interval formula
Iµ ≈ x̄ ± z(α/2) · s_x̄ ,
which is based on the central limit theorem for the t-score of a random sample with a sufficiently large size n:
(X̄ − µ)/(S/√n) ≈ N(0, 1).
The approximate confidence interval formula does not require that the population distribution is normal, but it works
only for large sample sizes. In this section, we state two confidence interval formulas based on the probability theory
facts mentioned in Sections 1.3 and 1.4:
(X̄ − µ)/(S/√n) ∼ t_{n−1},   (n − 1)S²/σ² ∼ χ²_{n−1}.
These formulas are valid for small sample sizes under the assumption that the population distribution is normal.
Assume that a random sample (x1 , . . . , xn ) is taken from the normal distribution
F(µ, σ) = N(µ, σ)
with unspecified parameters µ and σ. Replacing the normal approximation by the exact distribution for the t-score
(X̄ − µ)/(S/√n) ∼ t_{n−1},
we arrive at the exact 100(1 − α)% confidence interval formula for the mean
Iµ = x̄ ± t_{n−1}(α/2) · s_x̄ ,
which is valid even for the small values of n ≥ 2. Remember that this formula requires that the population distribution
is normal.
For example, with α = 0.05 and n = 10, 16, 25, 30, we get four 95% confidence intervals for the mean,
Iµ = x̄ ± 2.26 s_x̄ ,   x̄ ± 2.13 s_x̄ ,   x̄ ± 2.06 s_x̄ ,   x̄ ± 2.05 s_x̄ ,
where the critical values
t9 (0.025) = 2.26,   t15 (0.025) = 2.13,   t24 (0.025) = 2.06,   t29 (0.025) = 2.05
are obtained from the table in Section 11.2 using the column 0.025 and the rows k = 9, 15, 24, 29. These intervals are
getting narrower for larger values of n asymptotically approaching the familiar approximate 95% confidence interval
Iµ ≈ x̄ ± 1.96 · sx̄ .
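The tabulated critical values t_{n−1}(α/2) can be reproduced with scipy (a sketch, assuming scipy is available):

from scipy.stats import t

for k in (9, 15, 24, 29):
    print(k, round(t.ppf(0.975, df=k), 2))   # t_k(0.025): 2.26, 2.13, 2.06, 2.05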
The second exact confidence interval formula of this section is aimed at the population variance σ². We want to
have a formula for a random interval Iσ² such that
P(σ² ∈ Iσ² ) = 1 − α.
Under the normality assumption we have Var(S²) = 2σ⁴/(n − 1), so that an estimated standard error for the
unbiased estimate s² of σ² is
s_s² = s² √(2/(n − 1)).
However, we cannot use the same kind of formula Iσ² = s² ± t_{n−1}(α/2) · s_s² as for the mean.
The correct exact 100(1 − α)% confidence interval formula for σ² is based on the non-symmetric chi-squared
distribution in the above mentioned relation
(n − 1)S²/σ² ∼ χ²_{n−1},
and has the following non-symmetric form:
Iσ² = ((n − 1)s²/x_{n−1}(α/2) ; (n − 1)s²/x_{n−1}(1 − α/2)).
The numbers xk (α) are obtained from the table of Section 11.3 giving the critical values of the chi-squared distributions.
Example
To give an example of a 95% confidence interval for σ², take n = 30 and α = 0.05. The χ²_{29}-distribution table gives the critical values
x29 (0.025) = 45.722,   x29 (0.975) = 16.047,
yielding
(n − 1)/x_{n−1}(α/2) = 29/45.722 = 0.63,   (n − 1)/x_{n−1}(1 − α/2) = 29/16.047 = 1.81.
We conclude that for n = 30,
Iσ² = ((n − 1)s²/x_{n−1}(α/2) ; (n − 1)s²/x_{n−1}(1 − α/2)) = (0.63 s², 1.81 s²).
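The same computation in scipy (an illustrative sketch) reads:

from scipy.stats import chi2

n, alpha = 30, 0.05
lo = (n - 1) / chi2.ppf(1 - alpha / 2, df=n - 1)   # 29/45.722 = 0.63
hi = (n - 1) / chi2.ppf(alpha / 2, df=n - 1)       # 29/16.047 = 1.81
print(lo, hi)                                       # interval (0.63 s^2, 1.81 s^2)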
2.7 Exercises
Problem 1
Consider a population consisting of five values
1, 2, 2, 4, 8.
Find the population mean and variance. Calculate the sampling distribution of the mean of a random sample of size
2 by generating all possible such samples (x1 , x2 ). Then find the mean and variance of the sampling distribution, and
verify the formulas given in this section
E(X̄) = µ,   Var(X̄) = σ²/n.
Problem 2
In a simple random sample of 1500 voters, 55% said they planned to vote for a particular proposition, and 45% said
they planned to vote against it. The estimated margin of victory for the proposition is thus 10%. What is the standard
error of this estimated margin? What is an approximate 95% confidence interval for the margin of victory?
Problem 3
This problem introduces the concept of a one-sided confidence interval. Using the central limit theorem, how should
the constant k1 be chosen so that the interval
(−∞, x̄ + k1 sx̄ )
is an approximate 90% confidence interval for µ? How should k2 be chosen so that
(x̄ − k2 s_x̄ , ∞)
is an approximate 90% confidence interval for µ?
Problem 4
Verify the formula for the mean square error
E((Θ̂ − θ)²) = Var(Θ̂) + (E(Θ̂) − θ)².
Problem 5
A simple random sample from a population of size 2000 yields the following 25 values:
104 109 11 109 87
86 80 119 88 122
91 103 99 108 96
104 98 98 83 107
79 87 94 92 97
(c) Give an approximate 95% confidence interval for the population mean.
Problem 6
For a simple random sample, take x̄² as a point estimate of µ². (This is an example of the method of moments estimate
introduced in the next chapter.) Compute the bias of this point estimate, if any.
Problem 7
The following table (Cochran 1977) shows the stratification of all farms in a county by farm size and the mean and
standard deviation of the number of acres of corn in each stratum.
Farm size 0-40 41-80 81-120 121-160 161-200 201-240 241+
Number of farms Nj 394 461 391 334 169 113 148
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2
(b) For a sample size of 100 farms, compute the sample sizes from each stratum for proportional and optimal
allocation, and compare them.
(c) Calculate the variances of three sample means with different allocations of 100 observations: (1) proportional
allocation, (2) optimal allocation, (3) random sampling.
(d) Suppose that ten farms are sampled per stratum. What is Var(X̄s )? How large a simple random sample would
have to be taken to attain the same variance? Ignore the finite population correction.
Problem 8
How might stratification be used in each of the following sampling problems?
(b) A survey to examine the lead concentration in the soil in a large plot of land.
(c) A survey to estimate the number of people who use elevators in a large building with a single bank of elevators.
Problem 9
Consider stratifying the population of Problem 1 into two strata (1,2,2) and (4,8). Assuming that one observation is
taken from each stratum, find the sampling distribution of the estimate of the population mean and the mean and
standard deviation of the sampling distribution. Check the formulas of Section 2.5.
Problem 10
The following 16 numbers were generated from a normal distribution N(µ, σ)
5.3299 4.2537 3.1502 3.7032
1.6070 6.3923 3.1181 6.5941
3.5281 4.7433 0.1077 1.5977
5.4920 1.7220 4.1547 2.2799
(b) Give 90%, 95%, and 99% confidence intervals for µ and σ 2 .
(c) Give 90%, 95%, and 99% confidence intervals for σ.
(d) How much larger a sample would you need to halve the length of the confidence interval for µ?
Chapter 3
Parameter estimation
Given a parametric model determined by a vector of unknown parameters θ = (θ1 , . . . , θk ), we wish to estimate θ from
a given random sample (x1 , . . . , xn ). There are two basic methods of finding good point estimates: (1) the method
of moments and (2) the maximum likelihood method. The method of moments is based on k summary statistics
called sample moments, while the maximum likelihood method uses the full information on the joint distribution of
(X1 , . . . , Xn ).
3.1 Method of moments
Suppose the parametric model has two unknown parameters (θ1 , θ2 ) which determine the first two population moments,
E(X) = f(θ1 , θ2 ) and E(X²) = g(θ1 , θ2 ). By the law of large numbers, the sample moments approach the population moments:
(X1^j + . . . + Xn^j)/n → E(X^j),   n → ∞,
implying that (x̄, \overline{x²}) are consistent estimates of (E(X), E(X²)). After replacing the population moments with the
corresponding sample moments, we arrive at the equations
x̄ = f(θ̃1 , θ̃2 ),   \overline{x²} = g(θ̃1 , θ̃2 ),
whose solution (θ̃1 , θ̃2 ) will be called the method of moments estimates of (θ1 , θ2 ).
Geometric model
A researcher has observed n = 130 birds and counted the number of hops that each bird does between flights. As a
result, she obtained a random sample (x1 , . . . , xn ), where
xi = number of hops that the i-th bird does between flights.
The observed range of the number of hops was between 1 and 12. The next table summarises the dataset (x1 , . . . , xn )
number of hops x 1 2 3 4 5 6 7 8 9 10 11 12 total
observed counts ox 48 31 20 9 6 5 4 2 1 1 2 1 130
in terms of the observed counts ox , defined as the number of observed birds who hopped x times. The observed counts
are computed from (x1 , . . . , xn ) as
ox = 1{x1 =x} + . . . + 1{xn =x} ,
where 1A is the indicator function of the relation A, which equals 1 if A is true and 0 otherwise.
The data produces the following summary statistics:
x̄ = (total number of hops)/(number of birds) = 363/130 = 2.79,
\overline{x²} = 1² · (48/130) + 2² · (31/130) + . . . + 11² · (2/130) + 12² · (1/130) = 13.20,
s² = (130/129)(\overline{x²} − x̄²) = 5.47,
s_x̄ = √(5.47/130) = 0.205.
An approximate 95% confidence interval for µ, the mean number of hops per bird, is given by
Iµ ≈ x̄ ± 1.96 · s_x̄ = 2.79 ± 0.40.
The data plot below exhibits the observed frequencies decreasing by a roughly constant factor.
This suggests a geometric model for the number of jumps X ∼ Geom(p) for a random bird:
P(X = x) = (1 − p)^{x−1} p,   x = 1, 2, . . .
(such a geometric model implies that a bird "does not remember" the number of hops made so far, and the next move
of the bird is to jump with probability 1 − p or to fly away with probability p).
The method of moment estimate for the parameter θ = p of the geometric model requires a single equation arising
from the expression for the first population moment,
µ = 1/p,
giving p̃ = 1/x̄ = 0.36.
In this case, we can even compute an approximate 95% confidence interval for p using the above mentioned Iµ :
Ip ≈ (1/(2.79 + 0.40), 1/(2.79 − 0.40)) = (0.31, 0.42).
It is useful to compare the observed frequencies (counts) to the frequencies expected from the geometric distribution
with parameter p̃:
x 1 2 3 4 5 6 7+
cx 48 31 20 9 6 5 11
Ex 46.8 30.0 19.2 12.3 7.9 5.0 8.8
The expected counts Ex are computed from the fitted geometric model as Ex = 130 (1 − p̃)^{x−1} p̃ for x = 1, . . . , 6, and E7 = 130 (1 − p̃)⁶ for the pooled cell 7+.
An appropriate measure of discrepancy between the observed and expected counts is given by the following so-called
chi-squared test statistic
χ² = Σ_{x=1}^{7} (cx − Ex )²/Ex = 1.86.
x=1
As it will be explained later on, the obtained small value 1.86 of the chi-squared test statistic allows us to conclude
that the geometric model fits the bird data very well.
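The whole fit can be reproduced with a short script (numpy assumed; the counts are those of the table above, and the result agrees with the reported 1.86 up to the rounding of the expected counts):

import numpy as np

counts = np.array([48, 31, 20, 9, 6, 5, 4, 2, 1, 1, 2, 1])   # hops x = 1, ..., 12
n = counts.sum()                                              # 130 birds
xbar = (np.arange(1, 13) * counts).sum() / n                  # 2.79 hops per bird
p = 1 / xbar                                                  # method of moments estimate, 0.36

obs = np.append(counts[:6], counts[6:].sum())                 # pool the cells 7, ..., 12 into "7+"
exp = n * (1 - p) ** np.arange(6) * p                         # E_x = n(1-p)^(x-1)p for x = 1, ..., 6
exp = np.append(exp, n * (1 - p) ** 6)                        # tail cell 7+, since P(X >= 7) = (1-p)^6
print(((obs - exp) ** 2 / exp).sum())                         # about 1.87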
3.2 Maximum likelihood estimation
In a parametric setting with population density function f(x|θ), the observed sample (x1 , . . . , xn ) is a realisation of
the random vector (X1 , . . . , Xn ) having the joint probability distribution
f(y1 |θ) · · · f(yn |θ)
over the possible sample vectors (y1 , . . . , yn ), obtained as the product of the marginal densities due to the assumption
of independence. Fixing the sample values (y1 , . . . , yn ) = (x1 , . . . , xn ) and allowing the parameter value θ to vary, we
obtain the so-called likelihood function
L(θ) = f(x1 |θ) · · · f(xn |θ).
The maximum likelihood estimate θ̂ of θ is the value of θ that maximises the likelihood function L(θ).
Observe that it is often more convenient to find the maximum likelihood estimate θ̂ by maximizing the log-likelihood
function
l(θ) = ln L(θ) = ln f (x1 |θ) + . . . + ln f (xn |θ).
Sufficiency
Suppose there is a summary statistic t = g(x1 , . . . , xn ) such that the likelihood function factorises as
L(θ) = c(x1 , . . . , xn ) h(t, θ),   that is, L(θ) ∝ h(t, θ),
where the sign ∝ means "proportional to". Here the coefficient of proportionality c(x1 , . . . , xn ) does not explicitly
depend on θ. In this case, the maximum likelihood estimate θ̂ depends on the data (x1 , . . . , xn ) only through the
statistic t. Given such a factorisation property, we call t a sufficient statistic, as no other statistic that can be calculated
from the same sample provides any additional information on the value of the maximum likelihood estimate θ̂.
For the normal model
F(µ, σ) = N(µ, σ),
the likelihood function depends on the data only through the pair of statistics t1 = x1 + . . . + xn and t2 = x1² + . . . + xn².
Thus two samples having the same values for (t1 , t2 ) will produce the same maximum likelihood estimates for (µ, σ).
Notice the match between the number of sufficient statistics and the number of parameters of the model.
Suppose we are given a random sample (x1 , . . . , xn ) of zeros and ones drawn from a population distribution
F(µ, σ) = Bin(1, p)
with unknown p. Since µ = p, the method of moment estimate of p is computed as the sample proportion p̃ = x̄.
The likelihood function,
L(p) = Π_{i=1}^n p^{xi} (1 − p)^{1−xi} = p^t (1 − p)^{n−t} ,
where Π stands for the product, is fully determined by the number of successes
t = x1 + . . . + xn = nx̄,
providing another example of a sufficient statistic. To maximise the log-likelihood function
l(p) = t ln p + (n − t) ln(1 − p),
we set its derivative t/p − (n − t)/(1 − p) equal to zero, arriving at the maximum likelihood estimate p̂ = t/n = x̄.
For large sample sizes, there is a general normal approximation
Θ̂ ≈ N(θ, σθ /√n)
for the maximum likelihood estimator. Here σθ² is the inverse of the so-called Fisher information in a single observation
(if interested, see Wikipedia). It follows that the maximum likelihood estimators are asymptotically unbiased and
consistent. Moreover, they are asymptotically efficient estimates in the sense of the following Cramer-Rao inequality.
Cramer-Rao inequality: if θ∗ is an unbiased estimate of θ, then Var(Θ∗ ) ≥ σθ²/n.
Example: Exp(θ) model
Consider n = 5 observed battery lifetimes, measured in hours, with the total lifetime x1 + . . . + x5 = 28.5, drawn from the exponential population distribution
F(µ, σ) = Exp(θ).
In this case, the mean lifetime is µ = 1/θ, and θ can be viewed as the battery death rate per hour. The likelihood
function
L(θ) = θe^{−θx1} θe^{−θx2} θe^{−θx3} θe^{−θx4} θe^{−θx5} = θ^n e^{−θ(x1 +...+xn )} = θ⁵ e^{−θ·28.5}
first grows from 0 to its maximum value and then falls towards zero. The likelihood maximum is reached at θ̂ = 0.175.
For the exponential model, t = x1 + . . . + xn is a sufficient statistic, and the maximum likelihood estimate
θ̂ = n/t = 1/x̄
is a biased estimate of θ, but asymptotically unbiased due to the Law of Large Numbers saying that X̄ ≈ µ for large sample sizes.
3.3 Maximum likelihood estimation for the gamma distribution
Suppose a random sample (x1 , . . . , xn ) is drawn from the gamma distribution
F(µ, σ) = Gam(α, λ)
with unknown parameters (α, λ). Introduce the two statistics
t1 = x1 + . . . + xn ,   t2 = x1 · · · xn ,
and observe that the corresponding likelihood function takes the form
L(α, λ) = Π_{i=1}^n (1/Γ(α)) λ^α xi^{α−1} e^{−λxi} = (λ^{nα}/Γ^n(α)) (x1 · · · xn)^{α−1} e^{−λ(x1 +...+xn )} = (λ^{nα}/Γ^n(α)) t2^{α−1} e^{−λt1}.
We see that (t1 , t2 ) is a pair of sufficient statistics containing all information from the data needed to compute the
likelihood function. To maximise the log-likelihood function
l(α, λ) = nα ln λ − n ln Γ(α) + (α − 1) ln t2 − λt1 ,
we set its partial derivatives with respect to α and λ equal to zero. The maximum likelihood estimates of the parameters (α, λ) can be obtained from the following two
equations:
λ̂ = α̂/x̄,
ln(α̂/x̄) = −(1/n) ln t2 + Γ′(α̂)/Γ(α̂).
A numerical solution of the second equation would benefit from using the method of moments estimate α̃ as the initial
guess.
Example: Gam(α, λ) model
To illustrate, consider the following sample of n = 24 observations:
170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199.
Assuming that these numbers are generated by the Gam(α, λ) distribution, we would like to estimate the parameters
λ and α.
To apply the method of moments, we use the formulas for the first and second population moments
E(X) = α/λ,   E(X²) = Var(X) + (E(X))² = α/λ² + α²/λ² = α(1 + α)/λ²,
and the two sample moments computed from the data
x̄ = 181.46,   \overline{x²} = 32964.2.
Solving the two moment equations yields the method of moments estimates α̃ = 907.3 and λ̃ = α̃/x̄ = 5.0. To find the maximum likelihood estimate of α, we solve numerically the equation
ln(α̂/x̄) = −(1/n) ln t2 + Γ′(α̂)/Γ(α̂)
using the method of moment estimate α̃ = 907.3 as the initial guess. The Mathematica command
FindRoot[Log[a] == 0.00055 + Gamma'[a]/Gamma[a], {a, 907.3}]
gives
α̂ = 908.76.
Then the maximum likelihood estimate of the scale parameter λ is obtained from the equation λ̂ = α̂/x̄ yielding
λ̂ = 5.01.
Notice that the obtained maximum likelihood estimates are close to the method of moment estimates.
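The same numerical solution can be obtained in Python; a sketch with scipy mirroring the FindRoot call above (the constant 0.00055 is the value of ln x̄ − (1/n) ln t2 reported in the text, and digamma is Γ′/Γ):

import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

# solve log(a) = 0.00055 + digamma(a), cf. the equation for alpha_hat above
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - 0.00055, 1.0, 1e7)
print(alpha_hat)               # about 909, cf. 908.76 above
print(alpha_hat / 181.46)      # lambda_hat = alpha_hat/xbar, about 5.01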
Example: Gam(α, 1) model
A random sample (1.23, 0.62, 2.22, 2.55, 1.42) was generated by the gamma distribution with the shape parameter
α = 2 and the scale parameter λ = 1. Treating θ = α as the unknown parameter, and assuming that λ = 1 is known,
we arrive at the Gam(α, 1) model and wish to estimate α using the given sample of size n = 5.
Since the population mean is µ = α/λ = α, the method of moments estimate is α̃ = x̄ = 1.61. This estimate
should be compared to the true value α = 2. The corresponding likelihood function has the form
L(α) = Π_{i=1}^n (1/Γ(α)) xi^{α−1} e^{−xi} = Γ^{−n}(α) t^{α−1} e^{−8.04},
where
t = x1 · · · xn = 6.13
is a sufficient statistic in this case. The graph of the likelihood function L(α) in the figure below shows that the area
under the curve is much smaller than 1. A closer look at the figure reveals that the maximum likelihood estimate is
slightly smaller than the true value α = 2.
To find the maximum likelihood estimate, we turn to the log-likelihood function
l(α) = (α − 1) ln t − n ln Γ(α) − 8.04,
take its derivative and put it equal to zero. This results in the equation
0 = ln t − nΓ′(α)/Γ(α).
Solving this equation numerically gives α̂ = 1.90989, to be compared with the true value α = 2. Here the method of moments
estimate 1.61 is used as the initial guess for the algorithm solving the equation. In this example, the maximum
likelihood estimate brings a drastic improvement compared to the method of moments estimate.
3.4 Exercises
Problem 1
The Poisson distribution has been used by traffic engineers as a model for light traffic. The following table shows the
number of right turns during 300 three-min intervals at a specific intersection.
x frequency
0 14
1 30
2 36
3 68
4 43
5 43
6 30
7 14
8 10
9 6
10 4
11 1
12 1
13+ 0
Fit a Poisson distribution. Comment on the fit by comparing observed and expected counts. It is useful to know that
the 300 intervals were distributed over various hours of the day and various days of the week.
Problem 2
Let (x1 , . . . , xn ) be a random sample from a geometric distribution
P(X = x) = (1 − p)^{x−1} p,   x = 1, 2, . . .
(a) Write down the likelihood function based on this random sample and suggest a simple sufficient statistic.
(b) Find the maximum likelihood estimate of p.
(c) Is this estimate consistent? Explain.
Problem 3
Suppose that X is a discrete random variable with
P(X = 0) = (2/3)θ,
P(X = 1) = (1/3)θ,
P(X = 2) = (2/3)(1 − θ),
P(X = 3) = (1/3)(1 − θ),
where the parameter θ ∈ [0, 1] is unknown. The following 10 independent observations were drawn from this distribu-
tion:
(3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
(a) Find the method of moments estimate θ̃ of θ.
(b) Estimate the standard error of θ̃.
Problem 4
Suppose that x is generated by a Bin(n, p) distribution with unknown p.
(a) Show that the maximum likelihood estimate of p is p̂ = x/n.
Problem 5
A company has manufactured certain objects and has printed a serial number on each object. The serial numbers
start at 1 and end at N , where N is the number of objects that have been manufactured. One of these objects is
selected at random, and the serial number of that object is 888.
(a) What is the method of moments estimate of N ?
(b) What is the maximum likelihood estimate of N ?
Problem 6
To estimate the number N of fish living in a lake, a master's degree student has applied the capture-recapture method
according to the two-step procedure:
1. capture and tag n = 100 fish, then release them in the lake,
2. capture and release k = 50 fish one by one, and count the number of the tagged fish among those captured on
the second stage.
Suppose x = 20 fish were tagged among the k = 50 fish captured on the second stage. Find a maximum likelihood
estimate of N after suggesting a simple parametric model.
Problem 7
The following 16 numbers came from the normal random number generator on a computer:
(a) Write down the likelihood function based on this sample. (Hint: to avoid tedious calculations on your calculator
use the numbers in the next subquestion.)
(b) In what sense the sum of the sample values (which is close to 58), and the sum of their squares (which is close
to 260) are sufficient statistics in this case?
(c) Turning to the log-likelihood function compute the maximum likelihood estimates for the mean and variance. Is
the obtained variance estimate unbiased?
Problem 8
Let (x1 , . . . , xn ) be a random sample generated by the continuous uniform distribution over the interval [0, θ] with
unknown θ.
(a) Find the method of moments estimate of θ and its mean and variance.
(b) Find the maximum likelihood estimate of θ.
(c) Find the probability density of the maximum likelihood estimator and calculate its mean and variance. Compare
the variance, the bias, and the mean square error to those of the method of moments estimate.
(d) Find a modification of the maximum likelihood estimate that renders it unbiased.
Problem 9
For two factors, starchy-or-sugary and green-or-white base leaf, the following counts for the progeny of self-fertilized
heterozygotes were observed (Fisher 1958)
Type Count
Starchy green c1 = 1997
Starchy white c2 = 906
Sugary green c3 = 904
Sugary white c4 = 32
According to the genetic theory the cell probabilities are
p1 = (2 + θ)/4,   p2 = (1 − θ)/4,   p3 = (1 − θ)/4,   p4 = θ/4,
where 0 < θ < 1. In particular, if θ = 0.25, then the genes are unlinked and the genotype frequencies are
Green White Total
Starchy 9/16 3/16 3/4
Sugary 3/16 1/16 1/4
Total 3/4 1/4 1
Problem 10
The method of randomised response deals with surveys asking sensitive questions. Suppose we want to estimate the
proportion q of the fishermen who during the last 12 months have gone fishing without a valid permit. We are interested
in the population as a whole, not in punishing particular individuals. Suppose n randomly chosen fishermen have
responded yes/no to a randomised statement according to the instructions in the figure below. Suggest a probability
model for this experiment, and find a method of moments estimate for q. What is the standard error of the estimated
proportion?
Chapter 4
Hypothesis testing
Hypothesis testing is a crucial part of the scientific method. In this chapter we learn about statistical hypothesis
testing.
Suppose, for example, that a subject in an extrasensory perception experiment guesses the suits of 100 cards and gets x = 30 correct answers.
One particular way of reasoning in such a case is the topic of this chapter. Under the hypothesis of pure guessing,
the number of correct answers is X ∼ Bin(100, 1/4). Then the observed outcome x = 30 deviates from the mean
100 · 1/4 = 25 by about one standard deviation. Since such an outcome is not unusual, a sensible conclusion is that
the data does not contradict the hypothesis of pure guessing.
Negative decision: do not reject H0 Positive decision: reject H0 in favour of H1
If H0 is true True negative outcome False positive outcome, type I error
If H1 is true False negative outcome, type II error True positive outcome
H0 : θ = θ0 ,   H1 : θ = θ1 ,
where θ1 > θ0 , so that the alternative hypothesis claims a positive effect size θ1 − θ0 . The figure depicts two sampling
distributions of the test statistic T : the null distribution in green and the alternative distribution in blue. The rejection
region consists of the values of t to the right of the red line.
For a given test statistic, if the sample size is fixed, one cannot make both α and β smaller by changing R.
A significance test resolves the conflict between the two types of errors, by controlling the significance level α.
Given the value of α, say 5%, the rejection region R is found from the equation
α = P(T ∈ R|H0 )
using the null distribution of the test statistic T . After the rejection region is determined, the size of type II error is
computed by
β = P(T ∉ R|H1 ).
To summarise, the significance testing calculations follow the next flow chart
Returning to the extrasensory perception example, take the number of correct answers Y ∼ Bin(100, p) as the test statistic. For the one-sided alternative of better-than-random guessing, the rejection region has the form
R = {y : y ≥ y(0.05)},
where the critical value y(0.05) is determined by the equation
0.05 = P(Y ≥ y(0.05)|p = 0.25).
Using the normal approximation for the z-score
Z0 = (Y − 25)/4.33,
whose null distribution is approximately N(0, 1), we find that
y(0.05) ≈ 25 + 4.33 · 1.645 ≈ 32.
Since the observed test statistic yobs = 30 is below 32, we conclude that the experimental result is not significant, and
we do not reject H0 at the 5% significance level.
4.2 Large-sample test for the proportion
For a large sample of dichotomous data, the hypothesis H0 : p = p0 concerning the population proportion is tested using the z-score
z0 = (p̂ − p0)/√(p0(1 − p0)/n).
By the central limit theorem, the null distribution of the z-score is approximately normal: Z0 ≈ N(0, 1). Depending
on the problem in hand, there might arise three different alternative hypotheses:
H1 : p > p0 ,   H1 : p < p0 ,   H1 : p ≠ p0 .
The corresponding rejection region depends on the exact form of the alternative hypothesis:
H1 Rejection region
one-sided p > p0 R = {z ≥ z(α)}
one-sided p < p0 R = {z ≤ −z(α)}
two-sided p ≠ p0 R = {z ≤ −z(α/2)} ∪ {z ≥ z(α/2)}
where z(α) is found from Φ(z(α)) = 1 − α using the normal distribution table.
Planning the sample size
In the setting
H0 : p = p0 against H1 : p = p1 ,
the formula
√n ≈ [z(α)·√(p0(1 − p0)) + z(β)·√(p1(1 − p1))] / |p1 − p0|
gives the sample size n corresponding to given values of α and β. If the alternatives are very close to each other, the
denominator goes to zero and the required sample size becomes very large. This is very intuitive as it becomes more
difficult to distinguish between two close parameter values. On the other hand, if we decrease the levels α and β, the
values z(α) and z(β) from the normal distribution table become larger and the corresponding sample size n = n(α, β)
will be larger as well, meaning that for both types of errors to be small, you have to collect more data.
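The formula is easy to wrap in a small function; the sketch below (with hypothetical argument names) reproduces the worked value n ≈ 675 that appears later in this chapter for p0 = 0.25, p1 = 0.30, α = 5%, β = 10%.

```python
# Sketch: sample size for testing H0: p = p0 against H1: p = p1
# with prescribed error probabilities alpha and beta.
from math import sqrt, ceil
from scipy.stats import norm

def sample_size(p0, p1, alpha, beta):
    z_a = norm.ppf(1 - alpha)   # z(alpha)
    z_b = norm.ppf(1 - beta)    # z(beta)
    root_n = (z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1))) / abs(p1 - p0)
    return ceil(root_n ** 2)

print(sample_size(0.25, 0.30, 0.05, 0.10))  # about 675
```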
Next, we derive this formula for p1 > p0, leaving the other case p1 < p0 as an exercise. As shown before, for p1 > p0,
1 − β = Pw(p1) ≈ 1 − Φ( [z(α)·√(p0(1 − p0)) + √n·(p0 − p1)] / √(p1(1 − p1)) ).
Setting the argument of Φ equal to −z(β), that is requiring Φ(·) = β, and solving for √n yields the stated formula.
Binomial test
The binomial test is a test for the proportion when the sample size n is small. Consider a random sample (x1 , . . . , xn ) drawn
from a Bernoulli distribution Bin(1, p) with the unknown population proportion p. Let c = x1 + . . . + xn count the
number of successes, so that the corresponding random variable has a binomial distribution
C ∼ Bin(n, p).
Then the rejection region of the binomial test is determined according to the next table
H1                    Rejection region
one-sided p > p0      R = {c ≥ cα}
one-sided p < p0      R = {c ≤ bα}
two-sided p ≠ p0      R = {c ≤ bα/2} ∪ {c ≥ cα/2}
where the critical values are found from the null distribution C ∼ Bin(n, p0): cα is the smallest value with P(C ≥ cα | H0) ≤ α, and bα is the largest value with P(C ≤ bα | H0) ≤ α.
To illustrate the binomial test, consider the extrasensory perception test, where the subject is asked to guess the
suits of n = 20 cards. The number of cards guessed correctly is C ∼ Bin(20, p) and the null hypothesis of interest is
H0 : p = 0.25. The null distribution Bin(20, 0.25) of the test statistic c gives the following probabilities
c 8 9 10 11
P(C ≥ c) 0.101 0.041 0.014 0.004
We conclude that, in particular, for the one-sided alternative H1 : p > 0.25 and α = 4.1%, the rejection region of the
binomial test is R = {c ≥ 9}. The corresponding power function is Pw(p) = P(C ≥ 9), where C ∼ Bin(20, p).
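The table of tail probabilities P(C ≥ c) is a one-liner with scipy; a minimal sketch:

```python
# Sketch: tail probabilities of the null distribution Bin(20, 0.25)
# used to calibrate the one-sided binomial test.
from scipy.stats import binom

for c in (8, 9, 10, 11):
    print(c, round(binom.sf(c - 1, 20, 0.25), 3))  # binom.sf(c - 1, ...) = P(C >= c)
```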
4.3 P-value of the test
The p-value of a test based on the test statistic t is the probability p of obtaining a test statistic value as extreme as, or
more extreme than, the observed value tobs, given that H0 is true.
For a given significance level α, we reject H0 , if p ≤ α, and do not reject H0 otherwise.
The p-value depends on the observed data tobs and therefore, is a realisation of a random variable P. The source of
randomness is in the sampling procedure: if you take another sample, you obtain a different p-value. To illustrate,
suppose we are testing H0 : θ = θ0 versus H1 : θ > θ0 with the help of a test statistic Z whose null distribution is N(0, 1).
In this case, the p-value P = 1 − Φ(Zobs) satisfies
P(P ≤ p|H0) = P(1 − Φ(Zobs) ≤ p|H0) = P(Φ(Zobs) ≥ 1 − p|H0) = P(Zobs ≥ Φ⁻¹(1 − p)|H0) = 1 − Φ(Φ⁻¹(1 − p)) = p,
showing that under H0 the p-value is uniformly distributed over (0, 1).
Example: card guessing continued. Applying the large-sample test for the proportion, we find the rejection rule at the 5% significance level to be
R = {(p̂ − 0.25)/0.0433 ≥ 1.645} = {p̂ ≥ 0.32} = {y ≥ 32},
where y is the number of suits guessed correctly. With a simple alternative H1 : p = 0.30, the power of the test is
1 − Φ((1.645 · 0.433 − 0.5)/0.458) = 32%.
To raise the power to 90%, so that β = 10% and z(β) = 1.28, the sample size formula gives
n = ((1.645 · 0.433 + 1.28 · 0.458)/0.05)² ≈ 675.
If the observed sample count is yobs = 30, then the observed z-score z0 = (p̂ − p0)/√(p0(1 − p0)/n) takes the value
zobs = (0.3 − 0.25)/0.0433 = 1.15,
and the one-sided p-value is computed as 1 − Φ(1.15) = 0.125.
Since the p-value is larger than 10%, the result is not significant, and we do not reject H0.
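A minimal sketch of the p-value computation just performed, assuming scipy:

```python
# Sketch: one-sided p-value of the large-sample test for the proportion.
from math import sqrt
from scipy.stats import norm

n, p0, y_obs = 100, 0.25, 30
z_obs = (y_obs / n - p0) / sqrt(p0 * (1 - p0) / n)  # about 1.15
print(round(norm.sf(z_obs), 3))                     # one-sided p-value, about 0.12
```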
One-sample t-test
For small n, under the assumption that the population distribution is normal, the test statistic t = (x̄ − µ0)/(s/√n) has the exact null distribution tn−1, the t-distribution with n − 1 degrees of freedom.
4.4 Likelihood-ratio test
A general method of finding asymptotically optimal tests (having the largest power for a given α) uses the likelihood
ratio as the test statistic. Consider first the case of two simple hypotheses. For testing
H0 : θ = θ0 against H1 : θ = θ1 ,
use the likelihood ratio L(θ0)/L(θ1) as the test statistic. A larger value of the likelihood ratio would suggest that the
parameter value θ0 explains the data set better than θ1, while a smaller value of L(θ0)/L(θ1) would indicate that the
alternative hypothesis H1 explains the data set better than H0.
By the Neyman-Pearson lemma, see Wikipedia, the likelihood-ratio test is the optimal test in the case of two simple
hypotheses.
Nested hypotheses
The general case of two composite alternatives can be stated in term of a pair of nested parameter sets Ω0 ⊂ Ω
H0 : θ ∈ Ω0 against H1 : θ ∈ Ω \ Ω0 .
It will be more convenient to recast this setting in terms of two nested hypotheses
H0 : θ ∈ Ω0,  H : θ ∈ Ω.
The likelihood ratio test statistic is w = L(θ̂0)/L(θ̂), where θ̂0 maximises the likelihood over Ω0 and θ̂ maximises the likelihood over the full set Ω, so that
− ln w = ln L(θ̂) − ln L(θ̂0).
Observe that 0 < w ≤ 1, so that − ln w ≥ 0. By Wilks’ theorem, see Wikipedia, the test statistic (−2 ln w) has a nice
approximation for its null distribution
−2 ln W ≈ χ²(df) under H0, where df = dim(Ω) − dim(Ω0).
4.5 Chi-squared test of goodness of fit
Suppose the data form J observed cell counts (c1, . . . , cJ) summing to n, with unknown cell probabilities (p1, . . . , pJ). The full parameter space
Ω = {(p1, . . . , pJ) : p1 + . . . + pJ = 1, p1 ≥ 0, . . . , pJ ≥ 0}
has dimension
dim(Ω) = J − 1.
Consider a parametric model for the data
H0 : (p1 , . . . , pJ ) ∈ Ω0 ,
where
Ω0 = {(p1 , . . . , pJ ) ∈ Ω : (p1 , . . . , pJ ) = (p1 (λ), . . . , pJ (λ)), λ ∈ Λ},
with
dim(Ω0 ) = r, 0 ≤ r < J − 1.
To see if the proposed model fits the data, compute λ̂, the maximum likelihood estimate of λ, and then the expected
cell counts
Ej = n · pj (λ̂),
where "expected" means expected under the null hypothesis model. In this special setting, the above mentioned
likelihood-ratio test statistic −2 ln w is approximated by the so-called chi-squared test statistic
x² = Σⱼ₌₁ᴶ (cj − Ej)²/Ej,
whose approximate null distribution is χ²(df) with
df = dim(Ω) − dim(Ω0) = (J − 1) − r = J − 1 − r.
The chi-squared test is approximate, and it is recommended that all expected counts be at least 5. If not, you
should combine small cells into larger cells and recalculate the number of degrees of freedom df. A generic computation is sketched below.
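The sketch uses hypothetical counts and fitted probabilities purely as placeholders; only the structure of the computation matters.

```python
# Sketch: chi-squared goodness-of-fit test with r parameters estimated under H0.
import numpy as np
from scipy.stats import chi2

def chi_squared_gof(observed, fitted_probs, r):
    observed = np.asarray(observed, dtype=float)
    expected = observed.sum() * np.asarray(fitted_probs, dtype=float)
    x2 = ((observed - expected) ** 2 / expected).sum()
    df = len(observed) - 1 - r
    return x2, df, chi2.sf(x2, df)   # statistic, degrees of freedom, p-value

# Hypothetical counts and model probabilities with r = 1 estimated parameter:
print(chi_squared_gof([28, 42, 18, 12], [0.3, 0.4, 0.2, 0.1], 1))
```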
Example: hops of a bird. For the data on the number of hops that a bird does between flights (the data appear in Problem 2 of Section 5.5), consider
H0 : the number of hops that a bird does between flights has a geometric distribution Geom(p).
Using p̂ = 0.358 and J = 7 cells (after combining the cells with small expected counts), we obtain x² = 1.86. Using the chi-squared distribution table with df = 7 − 1 − 1 = 5, we
find that the p-value is close to 90%. This implies that the geometric distribution model fits the data very well.
Example: boys in families with 12 children. A classical dataset records the number of boys x among n = 6115 families with 12 children each, giving J = 13 cells.
The corresponding parameter space Ω has dimension dim(Ω) = 12. The data is given in the table below in the form
of thirteen observed counts.
Number of boys x   Observed count cx   Model 1: Ex   (cx − Ex)²/Ex   Model 2: Ex   (cx − Ex)²/Ex
0 7 1.5 20.2 2.3 9.6
1 45 17.9 41.0 26.1 13.7
2 181 98.5 69.1 132.8 17.5
3 478 328.4 68.1 410.0 11.3
4 829 739.0 11.0 854.2 0.7
5 1112 1182.4 4.2 1265.6 18.6
6 1343 1379.5 1.0 1367.3 0.4
7 1033 1182.4 18.9 1085.2 2.5
8 670 739.0 6.4 628.1 2.8
9 286 328.4 5.5 258.5 2.9
10 104 98.5 0.3 71.8 14.4
11 24 17.9 2.1 12.1 11.7
12 3 1.5 1.5 0.9 4.9
Total 6115 6115 x2 = 249.2 6115 x2 = 110.5
Model 1
A simple way for describing the vector of parameters (p0 , . . . , p12 ) is to use the symmetric binomial distribution
X ∼ Bin(12, 0.5). This leads to the setting
H0 : (p0, . . . , p12) = (p0⁽⁰⁾, . . . , p12⁽⁰⁾), where px⁽⁰⁾ = C(12, x) · 2⁻¹², x = 0, 1, . . . , 12,
with C(12, x) = 12!/(x!(12 − x)!) denoting the binomial coefficient,
with the set Ω0 consisting of a single point
Ω0 = {(p0⁽⁰⁾, . . . , p12⁽⁰⁾)},
so that dim(Ω0 ) = 0. Under this H0 , the expected counts are computed as
Ex = n · px⁽⁰⁾ = 6115 · C(12, x) · 2⁻¹²,  x = 0, 1, . . . , 12,
see the table below. The observed chi-squared test statistic is x² = 249.2 with df = 12. Since χ²₁₂(0.005) = 28.3, we can
reject H0 at the 0.5% level.
Model 2
Consider a more flexible model X ∼ Bin(12, λ) with an unspecified probability of a boy λ. The corresponding null
hypothesis takes the form
H0 : (p0, . . . , p12) = (p0(λ), . . . , p12(λ)), where px(λ) = C(12, x) · λˣ(1 − λ)¹²⁻ˣ, x = 0, . . . , 12, 0 ≤ λ ≤ 1.
Clearly, the corresponding parameter space Ω0 has dimension dim(Ω0 ) = 1. The expected cell counts
Ex = 6115 · C(12, x) · λ̂ˣ · (1 − λ̂)¹²⁻ˣ
are computed using the maximum likelihood estimate of the proportion of boys λ
λ̂ = (total number of boys)/(total number of children) = (1 · 45 + 2 · 181 + . . . + 12 · 3)/(6115 · 12) = 0.481.
The observed chi-squared test statistic x² = 110.5 is much smaller than that of Model 1. However, since
χ²₁₁(0.005) = 26.76, even Model 2 should be rejected at the 0.5% significance level.
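Both fits can be reproduced from the table of observed counts; a sketch assuming scipy:

```python
# Sketch: chi-squared statistics for Model 1 (fair binomial) and Model 2
# (binomial with estimated probability of a boy) on the 6115-family data.
import numpy as np
from scipy.stats import binom, chi2

x = np.arange(13)
c = np.array([7, 45, 181, 478, 829, 1112, 1343, 1033, 670, 286, 104, 24, 3])
n = c.sum()                          # 6115 families

lam = (x * c).sum() / (12 * n)       # MLE of the probability of a boy, about 0.481

for probs, df in [(binom.pmf(x, 12, 0.5), 12), (binom.pmf(x, 12, lam), 11)]:
    E = n * probs
    x2 = ((c - E) ** 2 / E).sum()    # about 249.2 and about 110.5
    print(round(x2, 1), "df =", df, "p-value =", chi2.sf(x2, df))
```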
Case summary
The figure below compares the observed counts (black histogram) with the Model 1 expected counts (red line) and the
Model 2 expected counts (blue line). The blue line has a better fit to the data; however, it underestimates the variation
of the observed cell counts. To make the model even more flexible, one should allow the probability of a boy λ
to differ from family to family.
4.6 Exercises
Problem 1
Suppose that X ∼ Bin(100, p). Consider a test
H0 : p = 1/2 against H1 : p ≠ 1/2,
which rejects H0 in favour of H1 for |x − 50| > 10. Use the normal approximation to the binomial distribution to
respond to the following items:
(a) What is α?
(b) Graph the power as a function of p.
Problem 2
This problem introduces the case of one-sided null hypothesis. A random sample (x1 , . . . , xn ) is drawn from a normal
population distribution N(µ, 1). Consider two alternative composite hypotheses
H0 : µ ≤ µ0,  H1 : µ > µ0.
(c) Show that the corresponding likelihood ratio has the form
w = 1 if x̄ < µ0,  and  w = e^{−(n/2)(x̄−µ0)²} if x̄ ≥ µ0.
(d) Explain why the rejection region of the likelihood ratio test can be expressed as
R = {x̄ > µ0 + cα },
Problem 3
Let (x1 , . . . , xn ) be a sample from a Poisson distribution Pois(µ). Find the likelihood ratio for testing H0 : µ = µ0
against H1 : µ ≠ µ0 . Use the fact that the sum of independent Poisson random variables follows a Poisson distribution
to explain how to determine a rejection region for this test at the significance level α.
Problem 4
Let (x1 , . . . , x25 ) be a sample from a normal distribution having a variance of 100.
(a) Find the rejection region for a test at level α = 0.1 of H0 : µ = 0 versus H1 : µ = 1.5.
(b) What is the power of the test?
(c) Repeat (a) and (b) for α = 0.01.
Problem 5
Under H0 , a random variable has a cumulative distribution function
F(x) = x², 0 ≤ x ≤ 1,
and under H1 , it has a cumulative distribution function
F(x) = x³, 0 ≤ x ≤ 1.
Problem 6
A random sample from N(µ, σ) results in Iµ = (−2, 3) as a 99% confidence interval for µ. Test
H0 : µ = −3 against H1 : µ ≠ −3
at α = 0.01.
Problem 7
Let (x1 , . . . , x15 ) be a random sample from a normal distribution N(µ, σ). The sample standard deviation is s = 0.7.
Test H0 : σ = 1 versus H1 : σ < 1 at the significance level α = 0.05.
Problem 8
Consider the binomial model for the data value x:
X ∼ Bin(n, p).
(a) What is the likelihood ratio for testing H0 : p = 0.5 against H1 : p ̸= 0.5?
(b) Show that the corresponding likelihood ratio test should reject for larger values of |x − n/2|.
(c) For the rejection region
R = {|x − n/2| > k},
determine the significance level α as a function of k.
Problem 9
Suppose that a test statistic Z has a standard normal distribution under the null hypothesis.
(a) If the test rejects for larger values of |z|, what is the p-value corresponding to the observed value 1.5 of the test
statistic z?
(b) Answer the same question if the test rejects only for larger values of z.
Problem 10
It has been suggested that H1 : dying people may be able to postpone their death until after an important occasion,
such as a wedding or birthday. Phillips and King (1988) studied the patterns of death surrounding Passover, an
important Jewish holiday.
(a) California data (1966–1984). They compared the number of deaths during the week before Passover to the
number of deaths during the week after Passover for 1919 people who had Jewish surnames. Of these, 922
occurred in the week before and 997 in the week after Passover. Apply a statistical test to see if there is evidence
supporting the claim H1 .
(b) For 852 males of Chinese and Japanese ancestry, 418 died in the week before and 434 died in the week after
Passover. Can we reject H0 : death cannot be postponed, using these numbers?
Problem 11
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur with probabilities
(1 − θ)²,  2θ(1 − θ),  θ²,
for some 0 ≤ θ ≤ 1. Plato et al. (1964) published the following data on haptoglobin type in a sample of 190 people:
Genotype AA Aa aa
Observed count 10 68 112
Test the goodness of fit of the data to the equilibrium model.
Problem 12
Check for seasonal variation in the following data on US suicides in 1970.
Month Number of suicides
Jan 1867
Feb 1789
Mar 1944
Apr 2094
May 2097
Jun 1981
Jul 1887
Aug 2024
Sep 1928
Oct 2032
Nov 1978
Dec 1859
Problem 13
In 1965, a newspaper carried a story about a high school student who reported getting 9207 heads and 8743 tails in
17950 coin tosses.
(a) Is this a significant discrepancy from the null hypothesis H0 : p = 1/2, where p is the probability of heads?
(b) A statistician contacted the student and asked him exactly how he had performed the experiment (Youden
1974). To save time the student had tossed groups of five coins at a time, and a younger brother had recorded
the results, shown in the table:
Are the data consistent with the hypothesis that all the coins were fair (p = 1/2)?
(c) Are the data consistent with the hypothesis that all five coins had the same probability of heads but this
probability was not necessarily 1/2?
Chapter 5
Bayesian inference
The statistical tools introduced in this course so far are based on the so-called frequentist approach. In the parametric
case, the frequentist treats the data x as randomly generated by a distribution f (x|θ) involving the unknown true
population parameter value θ, which may be estimated using the method of maximum likelihood. This section presents
basic concepts of the Bayesian approach relying on the following model for the observed data x:
Prior distribution g(θ)  −→  generates a value θ  −→  data x generated by f(x|θ).
The model assumes that before the data is collected the parameter of interest θ is randomly generated by a prior
distribution g(θ). The computational power of the Bayesian approach stems from the possibility to treat θ as a
realisation of a random variable Θ.
The prior distribution g(θ) brings into the statistical model our knowledge (or lack of knowledge) on θ before the
data x is generated using a conditional distribution f (x|θ), which in this section is called the likelihood function. After
the data x is generated by such a two-step procedure involving the pair g(θ) and f (x|θ), we may update our knowledge
on θ and compute a posterior distribution h(θ|x) using the Bayes formula
f (x|θ)g(θ)
h(θ|x) = .
ϕ(x)
Here the denominator ϕ(x) = ∫ f(x|θ)g(θ) dθ gives the marginal distribution of the random data X. For a fixed realisation x, treating the denominator ϕ(x) as a
constant which does not explicitly involve θ, the Bayes formula can be summarised as
posterior ∝ likelihood × prior, that is, h(θ|x) ∝ f(x|θ)g(θ).
Example: IQ measurement
A randomly chosen individual has an unknown true intelligence quotient value θ. Suppose the IQ test is calibrated in
such a way that θ can be viewed as a realisation of a random variable Θ having the normal prior distribution N(100, 15).
This normal distribution describes the population distribution of people’s IQ with population mean µ0 = 100 and
population standard deviation σ0 = 15. For a person with an IQ value θ, the result x of an IQ measurement is
generated by another normal distribution N(θ, 10), with no systematic error and a random error σ = 10. Since
g(θ) = (1/(σ0√(2π))) · e^{−(θ−µ0)²/(2σ0²)},  f(x|θ) = (1/(σ√(2π))) · e^{−(x−θ)²/(2σ²)},
we find that likelihood times prior equals
g(θ)f(x|θ) = (1/(2πσ0σ)) · e^{−(θ−µ0)²/(2σ0²) − (x−θ)²/(2σ²)},
which is proportional to
exp{−(θ² − 2µ1θ)/(2σ1²)} ∝ exp{−(θ − µ1)²/(2σ1²)},
where
µ1 = γµ0 + (1 − γ)x,  σ1² = γσ0²,  γ = σ²/(σ² + σ0²).
It follows that the posterior distribution is also normal N(µ1 , σ1 ).
The parameter
γ = σ1²/σ0²
is called a shrinkage factor. Since γ ∈ (0, 1), it measures the reduction in variance when the posterior distribution is
compared to the prior distribution. The smaller γ is, the larger the gain from the data.
In particular, if the observed IQ result is x = 130, then the posterior distribution becomes N(120.7, 8.3). The
posterior mean µ1 = 120.7 is obtained as a down-corrected measurement result x = 130 in view of the lower prior
expectation µ0 = 100. The posterior variance σ1² = 69.2 is smaller than the prior variance σ0² = 225 by the
shrinkage factor γ = 0.308, reflecting the fact that the updated knowledge brings much more certainty about the true
IQ value compared to the available prior knowledge.
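The posterior numbers follow directly from the updating formulas; a minimal sketch in Python:

```python
# Sketch: normal prior N(100, 15), normal measurement N(theta, 10), observed x = 130.
from math import sqrt

mu0, sigma0 = 100.0, 15.0    # prior mean and standard deviation
sigma, x = 10.0, 130.0       # measurement error sd and observed IQ score

gamma = sigma**2 / (sigma**2 + sigma0**2)   # shrinkage factor, about 0.308
mu1 = gamma * mu0 + (1 - gamma) * x         # posterior mean, about 120.8
sigma1 = sqrt(gamma * sigma0**2)            # posterior sd, about 8.3
print(round(gamma, 3), round(mu1, 1), round(sigma1, 1))
```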
The figure on the right depicts three probability density curves over the values of the parameter θ representing possible
IQ values. Observe that the posterior curve is close to zero wherever either the prior curve or the likelihood curve is
close to zero. As a result, the posterior curve becomes narrower than the prior curve.
The next table presents five Bayesian models involving conjugate priors. The details of the first three models come
next. Notice that the posterior variance is always smaller than the prior variance. This list also illustrates that the
contribution of the prior distribution to the posterior distribution becomes smaller as the sample size n increases.
Normal-normal model
Suppose a random sample (x1 , . . . , xn ) is drawn from the normal distribution N(µ, σ) with a known standard deviation
σ and the unknown mean θ = µ. Taking the normal prior Θ ∼ N(µ0 , σ0 ) with known (µ0 , σ0 ) results in the normal
posterior N(µ1 , σ1 ) with
µ1 = γn µ0 + (1 − γn)x̄,  σ1² = σ0² γn,
where
γn = σ²/(σ² + nσ0²) = (σ²/n)/((σ²/n) + σ0²)
is the shrinkage factor, which becomes smaller for larger sample sizes n. As a result, for large samples, the
posterior mean µ1 gets close to the maximum likelihood estimate x̄, and the contribution γn µ0 of the prior mean becomes
negligible.
Binomial-beta model
Next, we introduce the beta distribution which serves as a convenient family of conjugate priors for Bayesian inference
for p, in the case when the data x is generated by the Bin(n, p).
Beta distribution
The beta distribution Beta(a, b) is determined by two parameters a > 0, b > 0, which are called pseudo-counts. It is defined
by the probability density function
g(p) = (Γ(a + b)/(Γ(a)Γ(b))) · p^{a−1}(1 − p)^{b−1},  0 < p < 1,
its mean and variance are given by
µ = a/(a + b),  σ² = µ(1 − µ)/(a + b + 1).
The figure below depicts five different beta-distribution curves.
- flat black Beta(1, 1),
- U-shaped brown Beta(0.5, 0.5),
- bell-shaped red Beta(5, 3),
- L-shaped blue Beta(1, 3),
- J-shaped green Beta(3, 0.5).
The beta prior density ∝ p^{a−1}(1 − p)^{b−1} and the binomial
likelihood ∝ pˣ(1 − p)^{n−x}
imply
posterior ∝ prior × likelihood ∝ p^{a+x−1}(1 − p)^{b+n−x−1}.
This entails that the posterior is also a beta distribution Beta(a1 , b1 ) with the updated parameters
a1 = a + x, b1 = b + n − x.
Experiment 2: after n2 = 40 tosses, the observed count of the base landings is x2 = 9. As a new prior distribution
we use the posterior distribution obtained from the first experiment, Beta(3, 9). The new posterior distribution becomes
Beta(12, 40), with mean µ2 = 12/52 = 0.23 and standard deviation σ2 = 0.06.
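The conjugate updating is a two-line computation; a sketch starting from the Beta(3, 9) posterior of the first experiment:

```python
# Sketch: beta-binomial updating for Experiment 2.
from math import sqrt

a, b = 3, 9                        # prior pseudo-counts from Experiment 1
x2, n2 = 9, 40                     # observed base landings in 40 tosses

a1, b1 = a + x2, b + n2 - x2       # posterior Beta(12, 40)
mu = a1 / (a1 + b1)                # about 0.23
sd = sqrt(mu * (1 - mu) / (a1 + b1 + 1))  # about 0.06
print(a1, b1, round(mu, 2), round(sd, 2))
```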
Multinomial-Dirichlet model
The multinomial-Dirichlet model is a multivariate version of the binomial-beta model. For both the binomial-beta
and multinomial-Dirichlet models, the updating rule has the form
the posterior pseudo-counts = the prior pseudo-counts plus the sample counts
Dirichlet distribution
The Dirichlet distribution Dir(α1 , . . . , αr ) is a multivariate extension of the beta distribution. It is a probability
distribution over the vectors (p1 , . . . , pr ) with non-negative components such that
p1 + . . . + pr = 1.
The positive parameters α1 , . . . , αr of the Dirichlet distribution are often called the pseudo-counts. The probability
density function of Dir(α1 , . . . , αr ) is given by
g(p1, . . . , pr) = (Γ(α0)/(Γ(α1) · · · Γ(αr))) · p1^{α1−1} · · · pr^{αr−1},  α0 = α1 + . . . + αr.
The marginal distributions of the random vector (X1 , . . . , Xr ) ∼ Dir(α1 , . . . , αr ) are the beta distributions
Xj ∼ Beta(αj , α0 − αj ), j = 1, . . . , r.
The figure below illustrates four examples of Dir(α1 , α2 , α3 ) distribution. Each triangle contains n = 300 points
generated using different sets of parameters (α1 , α2 , α3 ):
upper left (0.3, 0.3, 0.1), upper right (13, 16, 15), lower left (1, 1, 1), lower right (3, 0.1, 1).
A dot in a triangle gives a realisation (x1 , x2 , x3 ) of the vector (X1 , X2 , X3 ) ∼ Dir(α1 , α2 , α3 ) as the distances to the
bottom edge of the triangle (x1 ), to the right edge of the triangle (x2 ), and to the left edge of the triangle (x3 ).
Example: rolling a die. Suppose a die was rolled n = 18 times with the following outcomes:
2, 1, 1, 4, 5, 3, 3, 2, 4, 1, 4, 2, 3, 4, 3, 5, 1, 5.
The parameter of interest is the vector of six probabilities θ = (p1, . . . , p6). The data can be viewed as generated by a
multinomial distribution Mn(18; p1, . . . , p6). Since we have no idea about the values of the probabilities (p1, . . . , p6), we
will use the uninformative prior distribution Dir(1, 1, 1, 1, 1, 1). Due to the conjugacy property, we obtain the posterior
distribution Dir(5, 4, 5, 5, 4, 1), where the posterior pseudo-counts are computed as the prior pseudo-counts
(1, 1, 1, 1, 1, 1) plus the observed counts (4, 3, 4, 4, 3, 0).
5.2 Bayesian estimation
Given the posterior distribution h(θ|x), we consider two loss functions leading to two different Bayesian estimators,
called the zero-one loss and the squared error loss.
Zero-one loss function: l(θ, a) = 1{θ ≠ a}.  Squared error loss function: l(θ, a) = (θ − a)².
Under the zero-one loss, the posterior risk is E(1{Θ ≠ a}|x) = 1 − h(a|x), so to minimise the risk we have to maximise the posterior probability h(a|x). We define the maximum a posteriori estimate θ̂map as the value
of θ that maximises h(θ|x). Observe that with the uninformative prior, θ̂map = θ̂mle.
Under the squared error loss, the posterior risk is E((Θ − a)²|x) = Var(Θ|x) + (E(Θ|x) − a)². Since the first component is independent of a, we minimise the posterior risk by putting
θ̂pme = E(Θ|x),
the posterior mean estimate.
Notably, the maximum likelihood estimate assigns value zero to p6 , thereby predicting that in the future observations
there will be no sixes.
Considering the two alternative Bayesian estimates based on the posterior distribution Dir(5, 4, 5, 5, 4, 1),
θ̂map = (4/18, 3/18, 4/18, 4/18, 3/18, 0),  θ̂pme = (5/24, 4/24, 5/24, 5/24, 4/24, 1/24),
we see that the former coincides with θ̂mle , while the latter estimate has the advantage of assigning a positive value
to p6 .
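Both estimates follow mechanically from the posterior pseudo-counts; a sketch for the die-rolling data:

```python
# Sketch: Dirichlet-multinomial updating and the two Bayesian estimates.
import numpy as np

counts = np.array([4, 3, 4, 4, 3, 0])   # observed counts of faces 1..6
alpha = 1 + counts                      # posterior Dir(5, 4, 5, 5, 4, 1)

theta_map = (alpha - 1) / (alpha.sum() - len(alpha))  # equals counts/18, the MLE here
theta_pme = alpha / alpha.sum()                       # posterior mean estimate
print(theta_map)
print(theta_pme)
```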
5.3 Credibility interval
Recall that a 100(1 − α)% confidence interval Iθ = (a1(x), a2(x)) is viewed as a realisation of a random interval (a1(X), a2(X)) such that
P(a1(X) < θ < a2(X)) = 1 − α.
This frequentist interpretation of the confidence level 100(1 − α)% is rather cumbersome as it requires mentioning
other samples and potential confidence intervals which as a group cover the true unknown value of θ with probability
1 − α.
In the framework of Bayesian inference we can refer to θ as a realisation of a random variable Θ with a certain
posterior distribution h(θ|x). This allows us to define a 100(1 − α)% credibility interval (or credible interval) as an
interval (a1, a2) computed from the posterior distribution so that P(a1 < Θ < a2 | x) = 1 − α.
The interpretation of the credibility interval is more intuitive as it does not refer to some potential, never observed
data values.
Example: IQ measurement
Given a single IQ value x = 130, we have X ∼ N(θ, 10), and an exact 95% confidence interval for the true IQ value θ
takes the form
Iθ = 130 ± 1.96 · 10 = 130 ± 19.6.
The interval 130 ± 19.6 has 95% confidence level in the sense that if we repeat the IQ measurement for the same
person, then the new IQ result X will produce a random confidence interval which will cover the true IQ of the subject
with probability 0.95.
With the posterior distribution Θ ∼ N(120.7, 8.3) in hand, a 95% credibility interval for θ is computed as
Iθ = 120.7 ± 1.96 · 8.3 = 120.7 ± 16.3.
The proper understanding of the obtained interval 120.7 ± 16.3 is the following. If we choose another person from the
general population and the person’s IQ turns out to be x = 130, then the new person’s IQ Θ belongs to the interval
120.7 ± 16.3 with probability 0.95. On the other hand, if we get a second IQ test result x2 for the same person, then
we can compute a new credibility interval based on x2 using the posterior distribution after the first IQ test result as
the new prior.
5.4 Bayesian hypotheses testing
For testing two simple hypotheses
H0 : θ = θ0 against H1 : θ = θ1,
we wish to choose between H0 and H1 using not only the two likelihood functions f(x|θ0), f(x|θ1) but also the prior
probabilities of the two optional values
P(H0 ) = π0 , P(H1 ) = π1 , π0 + π1 = 1.
In terms of an appropriate rejection region R for the available data x, the Bayesian decision should be taken depending
on a cost function having the following four cost values:
Decision                     H0 true    H1 true
x ∉ R   do not reject H0     0          cost1
x ∈ R   reject H0            cost0      0
where cost0 is the type I error cost and cost1 is the type II error cost. For a given rejection region R, the average cost is the
weighted mean of the two values cost0 and cost1:
cost0 π0 P(X ∈ R|H0) + cost1 π1 P(X ∉ R|H1) = cost1 π1 + ∫_R [cost0 π0 f(x|θ0) − cost1 π1 f(x|θ1)] dx,
where the integrand is negative exactly on the set
R∗ = {x : cost0 π0 f(x|θ0) < cost1 π1 f(x|θ1)}.
It follows that the rejection region minimising the average cost is R = R∗ . Taking R∗ as the rejection region, we
should reject H0 for values of the likelihood ratio which are smaller than a certain critical value:
f(x|θ0)/f(x|θ1) < (cost1 π1)/(cost0 π0),
determined by the prior odds π0 /π1 and the cost ratio cost1 /cost0 . In other terms, we reject H0 for the values of the
posterior odds smaller than the cost ratio
h(θ0|x)/h(θ1|x) < cost1/cost0.
Example: court case. Suppose a defendant N is tried on the basis of three pieces of evidence E1, E2, E3, with the hypotheses
H0: N is innocent, against H1: N is guilty. A prior probability
is suggested, taking into account the number of males who theoretically could have committed the crime without any
evidence taken into account. This yields very high prior odds for H0:
π0/π1 = 200000.
The resulting posterior odds is
P(H0|E1, E2, E3)/P(H1|E1, E2, E3) = (π0/π1) · P(E1, E2, E3|H0)/P(E1, E2, E3|H1)
= (π0/π1) · [P(E1|H0)/P(E1|H1)] · [P(E2|H0)/P(E2|H1)] · [P(E3|H0)/P(E3|H1)] = 0.018,
where the second equality assumes that the three pieces of evidence are conditionally independent.
Conclusion: the defendant N would be deemed guilty if the cost values assigned by the jury are such that
cost1/cost0 = (cost of an unpunished crime)/(cost of punishing an innocent) > 0.018.
5.5 Exercises
Problem 1
This is a continuation of Problem 3 (a–d) from Section 3.4.
(e) Assume the uniform prior for the parameter θ and find the posterior density. Sketch the posterior curve. Find
the MAP estimate of θ.
Problem 2
In an ecological study of the feeding behaviour of birds, the number of hops between flights was counted for several
birds.
Number of hops j 1 2 3 4 5 6 7 8 9 10 11 12 Tot
Observed count cj 48 31 20 9 6 5 4 2 1 1 2 1 130
Assume that the data were generated by a Geom(p) model and take the uniform prior for p. What is the posterior
distribution and what are the posterior mean and standard deviation?
Problem 3
On the Laplace rule of succession.
Laplace claimed that when an event happens n times in a row and never fails to happen,
the probability that the event will occur the next time is (n + 1)/(n + 2).
Can you suggest a rationale for this claim?
Problem 4
It is known that the random variable X has one of the following two distributions
X-values x1 x2 x3 x4
P(x|H0 ) 0.2 0.3 0.3 0.2
P(x|H1 ) 0.1 0.4 0.1 0.4
(a) Compare the likelihood ratios P(x|H0)/P(x|H1) for the different xi and order the xi accordingly.
(b) What is the likelihood ratio test of H0 versus H1 at level α = 0.2? What is the test at level α = 0.5?
(c) If the prior probabilities are P(H0) = P(H1) = 1/2, which outcomes favour H0?
(d) What prior probabilities correspond to the decision rules with α = 0.2 and α = 0.5?
Problem 5
Suppose that under H0 , a measurement X is N(0, σ), and under H1 , the measurement X is N(1, σ). Assume that the
prior probabilities satisfy
P(H0 ) = 2P(H1 ).
The hypothesis H0 will be chosen if P(H0 |x) > P(H1 |x). Answer the following questions referring to the different
choices of σ 2 = 0.1, 0.5, 1.0, 5.0.
(a) For what values of X = x will H0 be chosen?
(b) In the long run, what proportion of the time will H0 be chosen if H0 is true 2/3 of the time?
Problem 6
Under H0 , a random variable X has a cumulative distribution function F (x) = x2 , 0 ≤ x ≤ 1, and under H1 , it has
a cumulative distribution function F (x) = x3 , 0 ≤ x ≤ 1.
If the two hypotheses have equal prior probabilities, for what values of x is the posterior probability of H0 greater
than that of H1 ?
Problem 7
Suppose your prior beliefs about the probability p of success have mean 1/3 and variance 1/32. What is the posterior
mean after having observed 8 successes in 20 trials?
Problem 8
Mice were injected with a bacterial solution, and some of the mice were also given penicillin. The results were
Without penicillin With penicillin
Survived 8 12
Died 48 62
(a) Find a 95% confidence interval for the difference between two probabilities of survival.
(b) Assume that both groups have the probability of survival p. How would you compute an exact credibility interval
for the population proportion p, if you could use a computer? Compute an approximate 95% credibility interval
using a normal approximation.
Problem 9
The gamma distribution Gam(α, λ) is a conjugate prior for the Poisson likelihood with mean θ. If x is a single
observed value randomly sampled from the Pois(θ) distribution, then the parameters (α1 , λ1 ) for the posterior gamma
distribution of Θ are found by the following updating rule:
- the shape parameter α1 = α + x,
- the inverse scale parameter λ1 = λ + 1.
(a) Find θ̂PME, the posterior mean estimate of θ, under the exponential prior Exp(1), given the following
random sample values from the Pois(θ) population distribution
x1 = 2, x2 = 0, x3 = 2, x4 = 5.
(b) What is the updating rule for an arbitrary sample size n? Compare the value of θ̂PME with the maximum
likelihood estimator θ̂MLE as n → ∞. Your conclusions?
Problem 10
Consider the Beta(a, b) distribution. Verify that given a > 1 and b > 1, the maximum of the probability density
function g(p) is attained at
p∗ = (a − 1)/(a + b − 2).
Problem 11
Credible intervals are not unique on a posterior distribution. Methods for defining a suitable credible interval include:
1. choosing the narrowest interval, which for a unimodal distribution will involve choosing those values of highest
probability density including the mode (the maximum a posteriori). This is sometimes called the highest posterior
density interval (HPDI),
2. choosing the interval where the probability of being below the interval is as likely as being above it; this is
sometimes called the equal-tailed interval,
3. the interval for which the posterior mean is the central point.
Respond to the following:
(a) demonstrate that the equal-tailed interval includes the posterior median,
(b) compare these three options for the beta-posterior Beta(5, 5),
(c) compute the 95% HPDI for the normal-posterior N(3, 0.2).
Chapter 6
Summarising data
Consider a random sample (x1 , . . . , xn ) taken from the unknown population distribution F(µ, σ) characterized by the
cumulative distribution function
F (x) = P(X ≤ x).
Even for moderate sample sizes n, a proper preprocessing of the dataset is crucial. A very useful first rearrangement
is to sort the random sample (x1, . . . , xn) in ascending order. As a result, we arrive at the ordered sample values
x(1) ≤ x(2) ≤ . . . ≤ x(n), where in particular,
x(1) = min{x1 , . . . , xn }, x(n) = max{x1 , . . . , xn }.
Doing this, we do not lose any essential information about the data, while making it much more tractable. For example,
the sample mean and sample standard deviation can be computed from the ordered sample (x(1), . . . , x(n)). In this
chapter we present several other basic tools for summarising the data in hand.
Example: heights. For a sample of n = 24 heights (in cm), the ordered sample is
(x(1), . . . , x(24)) =
(170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199).
The graph of the empirical distribution function F̂(x) = n⁻¹ Σᵢ₌₁ⁿ 1{xi ≤ x} gives an idea about the shape of the population distribution
function F(x).
As a function of x, F̂ (x) is a cumulative distribution function for a random variable Y with the discrete uniform
distribution
1
P(Y = xi ) = , i = 1, . . . , n,
n
assuming that all sample values xi are pairwise different. Clearly,
E(Y) = Σᵢ₌₁ⁿ xi/n = x̄,
and since
E(Y²) = Σᵢ₌₁ⁿ xi²/n,
we get
Var(Y) = E(Y²) − (x̄)² = ((n − 1)/n)·s².
It is easy to verify that even if some of the xi coincide, F̂(·) is a cumulative distribution function with mean x̄ and variance
((n − 1)/n)·s². We will call
σ̂² = ((n − 1)/n)·s² = n⁻¹ Σᵢ₌₁ⁿ (xi − x̄)²
the sample variance with denominator n. The empirical survival function is
Ŝ(x) = 1 − F̂(x),
the proportion of the data greater than x. If the life length L has distribution function F (x) = P(L ≤ x), then its
survival function is
S(x) = P(L > x) = 1 − F (x), x ≥ 0.
If f (x) = F ′ (x) is the probability density function for the life length L, then the corresponding hazard function is
defined by
f (x)
h(x) = .
S(x)
The hazard h(x) is the mortality rate at age x, in that h(x)·dx ≈ P(L ≤ x + dx | L > x).
The hazard function can be viewed as the negative slope of the log survival function:
h(x) = f(x)/S(x) = −S′(x)/S(x) = −(d/dx) ln S(x).
If L has the exponential distribution Exp(λ), then the hazard rate h(x) ≡ λ is constant over the ages.
Example: lifetimes of guinea pigs. The following data give the lifetimes (in days) for six groups of guinea pigs in a dose-response experiment: a control group and five groups with increasing doses.
Control lifetimes
18 36 50 52 86 87 89 91 102 105 114 114 115 118 119 120 149 160 165 166 167 167 173 178 189 209 212 216 273
278 279 292 341 355 367 380 382 421 421 432 446 455 463 474 506 515 546 559 576 590 603 607 608 621 634 634 637
638 641 650 663 665 688 725 735
Dose I lifetimes
76 93 97 107 108 113 114 119 136 137 138 139 152 154 154 160 164 164 166 168 178 179 181 181 183 185 194
198 212 213 216 220 225 225 244 253 256 259 265 268 268 270 283 289 291 311 315 326 326 361 373 373 376 397 398
406 452 466 592 598
Dose II lifetimes
72 72 78 83 85 99 99 110 113 113 114 114 118 119 123 124 131 133 135 137 140 142 144 145 154 156 157 162
162 164 165 167 171 176 177 181 182 187 192 196 211 214 216 216 218 228 238 242 248 256 257 262 264 267 267 270
286 303 309 324 326 334 335 358 409 473 550
Dose III lifetimes
10 33 44 56 59 72 74 77 92 93 96 100 100 102 105 107 107 108 108 108 109 112 113 115 116 120 121 122 122
124 130 134 136 139 144 146 153 159 160 163 163 168 171 172 176 183 195 196 197 202 213 215 216 222 230 231 240
245 251 253 254 254 278 293 327 342 347 361 402 432 458 555
Dose IV lifetimes
Dose V lifetimes
12 15 22 24 24 32 32 33 34 38 38 43 44 48 52 53 54 54 55 56 57 58 58 59 60 60 60 60 61 62 63 65 65 67 68 70
70 72 73 75 76 76 81 83 84 85 87 91 95 96 98 99 109 110 121 127 129 131 143 146 146 175 175 211 233 258 258 263
297 341 341 376
It is difficult to compare the six groups just looking at the numbers. The data is illuminated by two graphs: the left
one for the survival functions and the right one for the log-survival functions.
The negative slopes of the curves to the right illustrate the hazard rates for different groups. Notice that the top line
is almost linear. This suggests an exponential model for the life lengths of the guinea pigs in the control group, with
the constant hazard rate λ ≈ 0.001 deaths per day.
For a given population distribution function F and 0 < p < 1, the population p-quantile is defined by
xp = Q(p), where Q = F⁻¹ is the quantile function.
The p-quantile xp cuts off the proportion p of smallest values for the random variable X with P(X ≤ x) = F(x): F(xp) = p.
The ordered sample x(1) < . . . < x(n) gives the strictly ordered n jump points for the empirical distribution function F̂, so that
F̂(x(k)) = k/n,  F̂(x(k) − ε) = (k − 1)/n.
A QQ-plot compares two distributions F1 and F2 by plotting their quantiles Q1(p) and Q2(p) against each other. If such a QQ-plot closely follows the 45 degree line, that is, when we observe almost equal quantiles, we may conclude
that the data support the null hypothesis of equality H0 : F1 ≡ F2.
More generally, if the QQ-plot approximates a straight line y = a + bx, then we take this as evidence for the linear
relation
Y = a + bX in distribution.
Indeed, the latter claim means that for all x,
F1 (x) = F2 (a + bx),
so that putting Q1(p) = x, we get Q2(p) = a + bx, which yields the linear relationship Q2(p) = a + bQ1(p) for the QQ-plot.
The normality hypothesis H0 states that the random sample (x1 , . . . , xn ) is drawn from the normal distribution
with unspecified parameter values. A QQ-plot used for testing this hypothesis is called a normal QQ-plot or normal
probability plot. To define the normal QQ-plot, let
yk = Φ⁻¹((k − 0.5)/n),
where Φ−1 is the quantile function for the N(0, 1) distribution function Φ. The normal QQ-plot is the scatter plot for
the sample quantiles matched with the theoretical quantiles
(x(1) , y1 ), . . . , (x(n) , yn ).
If the normal QQ-plot is close to a straight line y = a + bx, then we take it as evidence supporting the hypothesis
of normality, and we may use the point estimates µ̂ = −a/b, σ̂ = 1/b for the population mean and standard deviation.
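A sketch of the computation, applied to the heights data from this chapter; numpy and scipy are assumed.

```python
# Sketch: normal QQ-plot coordinates and crude estimates of mu and sigma
# from a straight-line fit.
import numpy as np
from scipy.stats import norm

x = np.sort(np.array([170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180,
                      180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199]))
n = len(x)
y = norm.ppf((np.arange(1, n + 1) - 0.5) / n)   # theoretical N(0,1) quantiles

b, a = np.polyfit(x, y, 1)                      # fit the line y = a + b*x
print("mu-hat =", -a / b, "sigma-hat =", 1 / b)
# A scatter plot of (x, y), e.g. with matplotlib, shows how straight the plot is.
```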
The panels below show the normal QQ-plots for three samples of size n = 100 drawn from the t-distribution. Observe
the deviations from the straight lines characteristic of heavy tails: the sample values in the tails are more extreme
than those projected by the straight lines going through the middle part of the QQ-plot.
The next panels show the normal QQ-plots for three samples of size n = 100 drawn from the beta-distribution. The
deviations from the straight lines are typical of light tails: the sample values in the tails are less extreme than
those projected by the straight lines going through the middle part of the QQ-plot.
The choice of a balanced bin width h is important: a smaller h gives a ragged profile, while a larger h obscures the profile.
To produce a smooth version of the scaled histogram, one can use the kernel density estimate with bandwidth h
defined by
fh(x) = (1/(nh)) Σᵢ₌₁ⁿ ϕ((x − xi)/h),  where ϕ(x) = (1/√(2π)) e^{−x²/2}.
For a standardised random variable Z = (X − µ)/σ, consider the moments
µ1 = E(Z) = 0,  µ2 = E(Z²) = 1,  µ3 = E(Z³),  µ4 = E(Z⁴),
where µ3 is called the skewness and µ4 the kurtosis of the distribution.
When µ3 is close to zero, a kurtosis close to 3 can be used as an indication that the curve profile is close to that
of the normal distribution. Otherwise, we distinguish between the leptokurtic distributions with µ4 > 3 (heavy tails)
and platykurtic distributions with µ4 < 3 (light tails). Given a random sample (x1, . . . , xn) with sample mean x̄
and sample variance s², the sample skewness and sample kurtosis are computed by
m3 = (1/(s³n)) Σᵢ₌₁ⁿ (xi − x̄)³,  m4 = (1/(s⁴n)) Σᵢ₌₁ⁿ (xi − x̄)⁴.
The next table gives a list of parametric distributions with their skewness and kurtosis coefficients.
Parametric distribution              Skewness                            Kurtosis
N(µ, σ) normal distribution          µ3 = 0, symmetric                   µ4 = 3
Gam(α, λ) gamma distribution         µ3 = 2/√α, skewed to the right      µ4 = 3 + 6/α, leptokurtic
Beta(a, a) beta distribution         µ3 = 0, symmetric                   µ4 = 3 − 6/(2a + 3), platykurtic
t-distribution with df = k, k ≥ 5    µ3 = 0, symmetric                   µ4 = 3 + 6/(k − 4), leptokurtic
For example, we see that the gamma distribution Gam(α, λ) is positively skewed, having µ3 = 2/√α. As the shape
parameter α gets larger, the skewness gets smaller and the kurtosis becomes close to 3.
Example: t3 and Beta(2, 2)
Returning to the example of the t3 and Beta(2, 2) distributions, we find that the heavy-tailed t-distribution has infinite
kurtosis µ4 = ∞, while the light-tailed beta-distribution has kurtosis µ4 = 2.143 which is smaller than 3. Recalling
the two standardised curves,
observe that the heavy-tailed curve is indeed leptokurtic (from Greek lepto ‘narrow’ + kurtos ‘bulging’) and the
light-tailed curve is platykurtic (from platy ‘broad, flat’).
Returning to the height sample
170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199,
note that it is positively skewed due to the long right tail. Good to know: the distribution of the heights of adult males is positively skewed,
implying m < µ, so that more than half of the heights are below the average.
Confidence interval for the median
Consider a random sample (x1 , . . . , xn ) without assuming any parametric model for the unknown population distri-
bution F(µ, σ). Suppose that the population distribution F(µ, σ) is continuous so that all sample values (x1 , . . . , xn )
are different (no ties). Observe that the number of observations below the true population median m,
y = Σᵢ₌₁ⁿ 1{xi ≤ m},
is a realisation of a random variable Y having the symmetric binomial distribution Y ∼ Bin(n, 0.5). Denote
pk = P(Y < k) = Σᵢ₌₀^{k−1} C(n, i) · 2⁻ⁿ,  k = 1, . . . , n.
Since
{Y ≥ k} = {X(k) < m}, {Y ≤ n − k} = {X(n−k+1) > m},
we obtain for k < n/2,
P(X(k) < m < X(n−k+1) ) = P(k ≤ Y ≤ n − k) = 1 − 2pk .
This yields the following non-parametric formula for an exact confidence interval for the median.
Im = (x(k) , x(n−k+1) ) is a 100 · (1 − 2pk )% confidence interval for the population median m
For example, if n = 25, then from the table below we find that (X(8) , X(18) ) gives a 95.7% confidence interval for the
median.
k 6 7 8 9 10 11 12
100 · (1 − 2pk ) 99.6 98.6 95.7 89.2 77.0 57.6 31.0
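The table of confidence levels is reproduced by a short loop; a sketch assuming scipy:

```python
# Sketch: exact confidence levels 100(1 - 2 p_k)% of the median interval
# (x_(k), x_(n-k+1)) for n = 25.
from scipy.stats import binom

n = 25
for k in range(6, 13):
    p_k = binom.cdf(k - 1, n, 0.5)            # P(Y < k) with Y ~ Bin(25, 0.5)
    print(k, round(100 * (1 - 2 * p_k), 1))   # e.g. k = 8 gives about 95.7
```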
Sign test
The sign test is a non-parametric test of H0 : m = m0 against the two-sided alternative H1 : m ̸= m0 . The sign test
statistic
y0 = Σᵢ₌₁ⁿ 1{xi ≤ m0}
counts the number of observations below the null hypothesis value. The test rejects the null hypothesis for larger
or smaller observed values y0, with reference to the null distribution Y0 ∼ Bin(n, 0.5) under H0. There is a simple connection
between the sign test and the confidence interval for the median: we reject H0 if m0 falls outside the confidence interval
Im = (x(k), x(n−k+1)).
Given the sample median m̂, the median of the absolute values of the deviations, MAD, is defined as the sample median of {|xi − m̂|, i = 1, . . . , n}.
Turning to the standard normal distribution, we get the quartiles ±Φ⁻¹(0.75) = ±0.675,
so that for the general normal distribution N(µ, σ), the theoretical lower and upper quartiles are µ ± 0.675 · σ yielding
IQR = (µ + 0.675 · σ) − (µ − 0.675 · σ) = 1.35 · σ.
On the other hand, since
P(|X − µ| ≤ 0.675 · σ) = (Φ(0.675) − 0.5) · 2 = 0.5,
we obtain MAD = 0.675 · σ. To summarise, for the normal distribution F(µ, σ) = N(µ, σ) model, we have three
estimates of σ:
s,  IQR/1.35,  MAD/0.675.
Boxplots
The boxplot is a convenient graphical tool for comparing the distributions of different samples in terms of the measures
of location and dispersion.
The boxplot is built of the following components: box, whiskers and outliers.
Box
• upper edge of the box = upper quartile (UQ)
• box center = median
• lower edge of the box = lower quartile (LQ)
• box height = interquartile range (IQR)
Whiskers
• upper whisker end = {largest data point ≤ UQ + 1.5 × IQR}
• lower whisker end = {smallest data point ≥ LQ − 1.5 × IQR}
Outliers
• upper dots = {data points > UQ + 1.5 × IQR}
• lower dots = {data points < LQ − 1.5 × IQR}
6.7 Exercises
Problem 1
Suppose that (X1 , . . . , Xn ) are independent uniform U(0, 1) random variables. The corresponding empirical distribu-
tion function
n
1X
F̂ (x) = 1{Xi ≤x}
n i=1
is a random variable with the mean x. One hundred such samples of size n = 16 were generated, and from these
samples the standard errors of F̂ (x) were computed for different values of x. In the following figure the estimated
standard errors of F̂ (x) are depicted as the vertical bars.
The elliptic line on the graph giving the theoretical prediction of the observed profile of the vertical bars follows the
formula
(x − 0.5)2 + 16y 2 = 0.25.
Explain this formula by computing a certain variance.
Problem 2
Using the setting of the previous problem, show, for x < y, that
Cov(F̂(x), F̂(y)) = x(1 − y)/n.
Observe that this positive correlation is quite intuitive: if the empirical distribution function F̂ (x) overshoots its
expected value x, then we would expect that the other value F̂ (y) would also overshoot the respective value y.
Problem 3
The data below are the percentages of hydrocarbons in n = 59 samples of beeswax.
14.27 14.80 12.28 17.09 15.10 12.92 15.56 15.38 15.15 13.98
14.90 15.91 14.52 15.63 13.83 13.66 13.98 14.47 14.65 14.73
15.18 14.49 14.56 15.03 15.40 14.68 13.33 14.41 14.19 15.21
14.75 14.41 14.04 13.68 15.31 14.32 13.64 14.77 14.30 14.62
14.10 15.47 13.73 13.65 15.02 14.01 14.92 15.47 13.75 14.87
15.28 14.43 13.96 14.57 15.49 15.13 14.23 14.44 14.57
(a) For this data, plot the empirical distribution function, a histogram, and the normal QQ-plot. Find the 0.9, 0.75,
0.5, 0.25, and 0.1 quantiles. Does the distribution appear Gaussian?
(b) The average percentage of hydrocarbons in a synthetic wax is 85%. Suppose that beeswax was diluted with 1%
synthetic wax. Could this be detected? What about 3% and 5% dilution?
Problem 4
Calculate the hazard function for the Weibull distribution having the distribution function
F(x) = 1 − e^{−αx^β},  x ≥ 0,
where α and β are two positive parameters. (Waloddi Weibull was a Swedish engineer, scientist, and mathematician.)
Problem 5
Give an example of a distribution with an increasing failure rate (hazard function). Give an example of a distribution
with a decreasing failure rate.
Problem 6
Olson, Simpson, and Eden (1975) discuss the analysis of data obtained from a cloud seeding experiment. The following
data present the rainfall from 26 seeded and 26 control clouds.
Seeded clouds
129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5,
200.7, 274.7, 274.7, 7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1,
118.3, 255, 115.3, 242.5, 32.7, 40.6
Control clouds
26.1, 26.3, 87, 95, 372.4, 0.01, 17.3, 24.4, 11.5, 321.2,
68.5, 81.5, 47.3, 28.6, 830.1, 345.5, 1202.6, 36.6, 4.9, 4.9,
41.1, 29, 163, 244.3, 147.8, 21.7
Make a QQ-plot for seeded rainfall versus control rainfall and log seeded rainfall versus log control rainfall. What do
these plots suggest about the effect, if any, of seeding?
Problem 7
The Laplace distribution with a positive parameter λ is a two-sided exponential distribution. Its density function is
f(x) = (λ/2)·e^{−λ|x|} for x ∈ (−∞, ∞). The variance of this distribution is 2λ⁻² and the kurtosis is 6.
(a) Take λ = √2. Plot carefully the density f(x) together with the standard normal distribution density.
(b) Use the drawn picture to explain the exact meaning of the following citation. "Kurtosis is a measure of the
peakedness of the probability distribution of a real-valued random variable, although some sources are insistent
that heavy tails, and not peakedness, is what is really being measured by kurtosis".
Problem 8
Given a random sample (x1, . . . , xn) with n = 25, we are interested in testing H0 : m = 20 against the two-sided
alternative H1 : m ≠ 20 concerning the population median m. No parametric model is assumed. As a test statistic
we take y = Σᵢ₌₁ⁿ 1{xi ≤ 20}, the number of observations below the null hypothesis value.
(a) Find the exact null distribution of the test statistic. What are your assumptions?
(b) Suggest an approximate confidence interval formula for m.
Problem 9
Miscellaneous questions.
(a) Describe a situation when a stratified sampling is more effective than a simple random sampling for estimating
the population mean. Which characteristics of the strata will influence your sample allocation choice?
(b) Given a dataset how do you compute kurtosis? What is the purpose of this summary statistic? Why is it
important to compute the coefficient of skewness for a proper interpretation of the kurtosis value?
(c) Suppose we are interested in the average height for a population of size 2,000,000. To what extent can a sample
of 200 individuals be representative of the whole population?
Problem 10
Let x1 , . . . , x25 be a random sample drawn from N(0.3, 1). Consider testing
H0 : m = 0 vs H1 : m > 0
(b) the power of the test based on the normal theory, assuming that σ = 1 is known.
Chapter 7
Comparing two samples
Suppose we wish to compare two population distributions F(µ1 , σ1 ) and F(µ2 , σ2 ) based on two random samples
(x1 , . . . , xn ) and (y1 , . . . , ym ) from these two populations. Consider two sample means
x̄ = (x1 + . . . + xn)/n,  ȳ = (y1 + . . . + ym)/m,
two sample variances
s1² = (1/(n − 1)) Σᵢ₌₁ⁿ (xi − x̄)²,  s2² = (1/(m − 1)) Σᵢ₌₁ᵐ (yi − ȳ)²,
and the standard errors of x̄ and ȳ,
sx̄ = s1/√n,  sȳ = s2/√m.
The difference (x̄ − ȳ) is an unbiased estimate of µ1 − µ2 . We are interested in finding the standard error of x̄ − ȳ, a
confidence interval for the difference µ1 − µ2 , as well as testing the null hypothesis of equality
H0 : µ1 = µ2 .
Two main settings will be addressed: (1) two independent samples and (2) paired samples for sampling the differences.
Two-sample t-test
The two–sample t-test is based on the following three assumptions on the population distributions:
1. two random samples (x1 , . . . , xn ) and (y1 , . . . , ym ) are independent of each other,
2. both underlying population distributions are normal, F(µ1 , σ1 ) = N(µ1 , σ1 ) and F(µ2 , σ2 ) = N(µ2 , σ2 ),
3. these two normal distributions have equal variances σ12 = σ22 = σ 2 .
To estimate the common variance σ12 = σ22 = σ 2 , we can use
sp² = [Σᵢ₌₁ⁿ (xi − x̄)² + Σᵢ₌₁ᵐ (yi − ȳ)²] / (n + m − 2),
which is called the pooled sample variance. Using the equality
sp² = ((n − 1)/(n + m − 2))·s1² + ((m − 1)/(n + m − 2))·s2²,
it is easy to see that the pooled sample variance is an unbiased estimate of σ 2 :
E(Sp²) = ((n − 1)/(n + m − 2))·E(S1²) + ((m − 1)/(n + m − 2))·E(S2²) = σ².
In the case of equal variances and independent samples, the variance
Var(X̄ − Ȳ) = σ² · (n + m)/(nm),
has the following unbiased estimate
s²x̄−ȳ = sp² · (n + m)/(nm).
Under the normality assumption with equal variances, the sampling distribution of the two-sample t-score
t = [(x̄ − ȳ) − (µ1 − µ2)]/sp · √(nm/(n + m))
is the t-distribution with n + m − 2 degrees of freedom.
The two-sample t-test for testing H0 : µ1 = µ2 is based on the test statistic
t0 = (x̄ − ȳ)/sp · √(nm/(n + m)),
obtained from the two-sample t-score t by putting µ1 − µ2 = 0. The null distribution of the test statistic is T0 ∼ tn+m−2 under H0.
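The whole test is easy to carry out from summary statistics; the sketch below uses the summary of the log-transformed iron retention data analysed later in this section and reproduces t0 ≈ 0.92.

```python
# Sketch: pooled two-sample t-test from summary statistics.
from math import sqrt
from scipy.stats import t as t_dist

n, xbar, s1 = 18, 2.09, 0.659
m, ybar, s2 = 18, 1.90, 0.574

sp2 = ((n - 1) * s1**2 + (m - 1) * s2**2) / (n + m - 2)
t0 = (xbar - ybar) / sqrt(sp2) * sqrt(n * m / (n + m))   # about 0.92
pval = 2 * t_dist.sf(abs(t0), n + m - 2)                 # about 0.37
print(round(t0, 3), round(pval, 3))
```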
Fe3+ (10.2) Fe3+ (1.2) Fe3+ (0.3) Fe2+ (10.2) Fe2+ (1.2) Fe2+ (0.3)
0.71 2.20 2.25 2.20 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.50 4.27 5.86 9.56
2.60 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.80
3.74 7.90 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.90
4.50 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.40 19.87
8.15 18.3 22.82 11.65 23.89 21.60
8.24 18.59 29.13 12.45 26.39 22.25
Two forms of iron, Fe2+ and Fe3+, were compared in a randomised study on mice. The obtained data is the percentage
of iron retained in mice under three different iron dosage concentrations: 10.2, 1.2, and 0.3 millimolar. The six samples
were collected using 18 mice randomly assigned to each combination of the iron form and concentration. Turning
to the two independent samples at the medium concentration, we get the following summary of the data:
Fe2+ percentage retained: n = 18, x̄ = 9.63, s1 = 6.69, sx̄ = 1.58
Fe3+ percentage retained: m = 18, ȳ = 8.20, s2 = 5.45, sȳ = 1.28
The graphs below show that the population distributions are not normal. Therefore, to test H0 : µ1 = µ2 we turn to
the large sample test. Using the observed value
z0 = (x̄ − ȳ)/√(sx̄² + sȳ²) = 0.7,
and applying the normal distribution table we find an approximate two-sided p-value = 0.48.
Left panel: boxplots for percentages of Fe2+ (left) and Fe3+ (right). Right panel: two normal QQ plots.
After the log transformation
xi° = ln xi,  yi° = ln yi,
the normality assumption becomes more acceptable, as seen from the graphs below. For the transformed data, we get
Fe2+ log-percentage retained: n = 18, x̄◦ = 2.09, s◦1 = 0.659, sx̄◦ = 0.155,
Fe3+ log-percentage retained: m = 18, ȳ ◦ = 1.90, s◦2 = 0.574, sȳ◦ = 0.135.
Assuming that two population variances are equal to the unknown σ 2 , we estimate σ 2 by the pooled sample variance
formula
sp² = ((n − 1)/(n + m − 2))·(s1°)² + ((m − 1)/(n + m − 2))·(s2°)² = 0.5·(0.659² + 0.574²) = 0.382.
The two-sample t-score for the transformed data is
t0 = (x̄° − ȳ°)/(sp·√(2/n)) = (2.09 − 1.90)/(0.618/3) = 0.922.
Applying the two-sample t-test with df = 34, we find the two-sided p-value = 0.366, also resulting in a non-significant
difference. In the next chapter we will analyse the full dataset of six samples using the two-way layout model of the
analysis of variance.
Left panel: boxplots for log-percentages of Fe2+ (left) and Fe3+ (right). Right panel: two normal QQ plots.
Rank sum test
The rank sum test is a nonparametric test for two independent samples, which does not assume normality of population
distributions. Suppose we have two independent samples (x1 , . . . , xn ) and (y1 , . . . , ym ) drawn from two continuous
population distributions F1 and F2 , and we would like to test
H0 : F1 = F2 against H1 : F1 ≠ F2
in the case when either one of the distributions F1 and F2 is not normal, or their variances are not equal. The rank
sum test procedure consists of the following steps:
pool the samples and replace the data values by their ranks 1, 2, . . . , n + m, starting from the smallest
sample value to the largest, and then compute two test statistics r1 = sum of the ranks of x-observations,
and r2 = sum of y-ranks.
Clearly, these two test statistics have a fixed sum
r1 + r2 = 1 + 2 + . . . + (n + m) = (n + m)(n + m + 1)/2.
The exact null distributions for R1 and R2 depend only on the sample sizes n and m, and the rank sum test
is included in all statistical packages. For n ≥ 10, m ≥ 10, we may apply the normal approximation for the null
distributions of R1 and R2 with
E(R1) = n(n + m + 1)/2,  E(R2) = m(n + m + 1)/2,  Var(R1) = Var(R2) = mn(n + m + 1)/12.
Note that the variances of R1 and R2 must be equal since their sum is a constant.
Example. For two independent samples of sizes n = 6 and m = 5, pooling and ranking the observations gave
r1 = 1 + 2 + 5 + 6 + 8 + 9 = 31,  r2 = 3 + 4 + 7 + 10 + 11 = 35.
Although the sample sizes are small, we will use the normal approximation for the null distribution of R2, which has
the mean and variance
µ = m(n + m + 1)/2 = 30,  σ² = mn(n + m + 1)/12 = 30.
The resulting one-sided p-value is larger than 10%,
1 − Φ((35.5 − 30)/√30) = 1 − Φ(1.00) = 0.16,
implying that we cannot reject the null hypothesis of equality judging from these small samples.
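The continuity-corrected normal approximation for the rank sum statistic, as a small sketch:

```python
# Sketch: normal approximation for the rank sum statistic R2
# with n = 6, m = 5 and the observed value r2 = 35.
from math import sqrt
from scipy.stats import norm

n, m, r2 = 6, 5, 35
mu = m * (n + m + 1) / 2                                # 30
var = m * n * (n + m + 1) / 12                          # 30
print(round(norm.sf((r2 + 0.5 - mu) / sqrt(var)), 2))   # one-sided p, about 0.16
```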
Suppose next that the two independent samples are dichotomous, drawn from the Bernoulli distributions Bin(1, p1) and Bin(1, p2). For testing the hypothesis of equal population proportions,
H0 : p1 = p2,
the two-sample t-test is not valid, and we would like to modify the large sample test for the mean difference. For the
Bernoulli model, the sample means turn into the sample proportions
x̄ = p̂1 , ȳ = p̂2 ,
Large sample test for two proportions
If the sample sizes n and m are large, then an approximate confidence interval for the difference p1 − p2 is given by
I_{p1−p2} ≈ p̂1 − p̂2 ± z(α/2)·√( p̂1(1 − p̂1)/(n − 1) + p̂2(1 − p̂2)/(m − 1) ).
If this interval does not cover 0, we reject the null hypothesis of equality H0 : p1 = p2 in favour of the two-sided
alternative H1 : p1 ≠ p2 at the significance level α.
Exact Fisher test
Fisher's exact test is used for testing the hypothesis of equal population proportions,
H0 : p1 = p2,
when the sample sizes n and m are not sufficiently large for applying normal approximations for the distributions of
the sample means. We summarise the data of two independent samples as a 2 × 2 table of sample counts
Sample 1 Sample 2 Total
Number of successes c11 c12 c11 + c12
Number of failures c01 c02 c01 + c02
Sample sizes n m N =n+m
where
c11 = x1 + . . . + xn , c01 = n − c11 , c12 = y1 + . . . + ym , c02 = m − c12 .
The key idea of Fisher's test is to treat the observed count c11 as a random outcome of drawing n balls without replacement
from an urn with N balls. Assuming that the urn contains c11 + c12 black balls and c01 + c02 white balls,
c11 is the number of black balls among the n sampled balls. Putting N = n + m and freezing the proportion
p = (c11 + c12)/N,
we find that under the null hypothesis of equality, the sample count c11 is generated by the hypergeometric distribution
C11 ∼ Hg(N, n, p) under H0.
This null distribution should be used for determining the rejection rule of the exact Fisher test in terms of the sample
counts (c11 , c12 ) and the sample sizes (n, m).
Example: gender bias
The following data were collected after 48 copies of the same file with 24 files labeled as “male” and the other 24
labeled as “female” were sent to 48 experts. Each expert decision had two possible outcomes: promote or hold file.
We wish to test
H0 : p1 = p2 no gender bias,
against
H1 : p1 > p2 males are favoured.
Given the data
Male Female Total
Promote 21 14 35
Hold file 3 10 13
Total 24 24 48
Fisher's test would reject H0 in favour of the one-sided alternative H1 if the observed value c11 = 21 is deemed
significantly large under the hypergeometric distribution Hg(N, n, p) with N = 48, n = 24, and p = 35/48:
P(C11 = k) = C(35, k)·C(13, 24 − k)/C(48, 24),  11 ≤ k ≤ 24.
We conclude that the one-sided p-value = 0.0245 and the two-sided p-value = 0.049. According to this dataset, there
is significant evidence of gender bias in favour of male candidates.
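The one-sided p-value can be checked with scipy's hypergeometric distribution; a sketch:

```python
# Sketch: one-sided p-value of Fisher's exact test for the promotion data.
from scipy.stats import hypergeom

N, n, K = 48, 24, 35        # population size, draws, number of "promote" decisions
c11 = 21                    # observed promoted "male" files
# hypergeom.sf(k, M, n, N) takes (total, successes, draws); P(C11 >= 21):
print(hypergeom.sf(c11 - 1, N, K, n))   # about 0.0245
```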
7.3 Paired samples
Suppose now that the data come as n pairs
(x1, y1), . . . , (xn, yn).
As before, we have two samples (x1, . . . , xn) and (y1, . . . , yn), taken from two populations with possibly different means and
variances
E(X) = µ1,  E(Y) = µ2,  Var(X) = σ1²,  Var(Y) = σ2²,
and our main question is again whether the difference µ1 − µ2 is statistically significant.
In contrast to the two independent samples case, in the paired sampling setting, there is a positive dependence in
the joint distribution of (X, Y ):
ρ = Cov(X, Y)/(σ1σ2) > 0.
Even with paired samples, the difference x̄ − ȳ gives an unbiased estimate of the population mean difference µ1 − µ2 .
This is an unbiased estimate whose variance takes into account the dependence between X and Y. Observe that
Cov(X̄, Ȳ) = σ1σ2ρ/n, implying
Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) − 2Cov(X̄, Ȳ) = (1/n)·(σ1² + σ2² − 2σ1σ2ρ).
In particular, if the two samples are independent and have equal sizes, then ρ = 0 and
Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = (1/n)·(σ1² + σ2²).
It follows that with ρ > 0, paired sampling ensures a smaller standard error for the estimate x̄ − ȳ than
independent sampling.
The key idea in handling the paired samples is to reduce the two-sample case to the single random sample case by
taking the differences
(d1 , . . . , dn ), di = xi − yi .
For the sample of the differences, we have
d¯ = x̄ − ȳ,
with
E(D̄) = µ1 − µ2,  Var(D̄) = (1/n)(σ1² + σ2² − 2σ1σ2ρ).
The first two columns of the data table show the paired measurements (xi, yi) for n = 11 individuals, taken before
and after smoking. Using the data we estimate the correlation coefficient ρ for the measurement results before and
after smoking by computing the sample correlation coefficient
r = [(x1 − x̄)(y1 − ȳ) + ... + (xn − x̄)(yn − ȳ)] / ((n − 1)s1s2) = 0.90.
The obtained high value indicates a strong positive correlation between the first two columns, while the third column
reduces the two-sample data to a single random sample of the differences di = xi − yi.
Assuming that the population distribution for differences D ∼ N(µ, σ) is normal with µ = µ1 − µ2 , we may apply
the one-sample t-test for
H0 : µ = 0 against H1 : µ ̸= 0.
The observed test statistic value
t0 = d̄/sd̄ = 10.27/2.40 = 4.28,
referred to the t-distribution with 10 degrees of freedom, produces the two-sided p-value 0.0016. We conclude that
smoking has a significant health effect.
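The quoted p-value can be verified from the summary statistics alone; a minimal sketch:

    from scipy.stats import t

    t0, df = 4.28, 10
    p_two_sided = 2 * t.sf(t0, df)   # two-sided p-value of the one-sample t-test
    print(p_two_sided)               # approx 0.0016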
Without the assumption of normality on the distribution of differences, we can apply the non-parametric sign test
for a pair of hypotheses concerning the population median m of the difference D = X − Y :
H0 : m = 0 against H1 : m ̸= 0.
We may use
y0 = 1{d1<0} + ... + 1{dn<0} = 1
as the test statistic, which has the null distribution
Y0 ∼ Bin(n, 0.5) under H0.
Since the observed value y0 = 1 gives the two-sided p-value 2·P(Y0 ≤ 1) = 2·12/2¹¹ ≈ 0.012, we reject the null hypothesis of no difference also with the help of the non-parametric sign test.
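A sketch of the sign test computation behind this conclusion, assuming n = 11 differences with y0 = 1 of them negative:

    from scipy.stats import binom

    n, y0 = 11, 1                        # number of differences, negative ones
    p_two_sided = 2 * binom.cdf(y0, n, 0.5)
    print(p_two_sided)                   # approx 0.0117, so H0 is rejected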
Signed rank test
The sign test disregards a lot of information in the data taking into account only the sign of the differences. The
signed rank test pays attention to the sizes of positive and negative differences. This non-parametric test requires the
following assumption: the population distribution of D = X − Y is symmetric about its median m. The test statistic for
H0 : m = 0 against H1 : m ≠ 0
is computed in terms of the ranks of the absolute values of the differences
ri = rank(|di |), i = 1, . . . , n.
Here ri = 1 if di has the smallest absolute value and ri = n if di has the largest absolute value among the differences
irrespective of the sign of the difference di . The signed ranks used for computing the signed rank test statistic w are
defined as follows: if di > 0, then ri is called a positive rank, and if di < 0, then ri is called a negative rank. There
are two possible ways to compute the signed rank test statistic w: either as the sum of positive ranks
w = Σi=1..n ri · 1{di>0},
or as the sum of negative ranks. Irrespective of which of the two sums is used, the null distribution of the test
statistic is the same. The exact null distribution of W is included in all statistical packages. For n ≥ 20, one can
use the normal approximation N(µ, σ) of the null distribution of W, with
µ = n(n + 1)/4,  σ = √(n(n + 1)(2n + 1)/24).
The signed rank test uses more of the data information than the sign test,
but requires a symmetric distribution of the differences.
It gives the two-sided p-value = 0.002. Given the small p-value, before we conclude that there is a significant effect,
we have to verify the symmetry assumption on the distribution of differences D. Judging from the plot below, the
vector of observed differences (2, 4, 10, 15, 16, 12, 9, 4, 27, −1) is symmetric enough around its median.
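A sketch of the signed rank computation on these differences; because of the tied values, scipy may switch to a normal approximation, so the resulting p-value should only be expected to be of the same small order as the one quoted above.

    from scipy.stats import wilcoxon

    d = [2, 4, 10, 15, 16, 12, 9, 4, 27, -1]
    stat, p = wilcoxon(d)   # two-sided signed rank test
    print(stat, p)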
7.4 Paired samples: comparing population proportions
Suppose we have two dependent Bernoulli random variables X ∼ Bin(1, p1) and Y ∼ Bin(1, p2). The vector (X, Y) has
four possible values (0, 0), (0, 1), (1, 0), (1, 1) with probabilities denoted as π00, π01, π10, π11. With n independent paired
observations of the vector (X, Y), we get four counts (c00, c01, c10, c11) for the different outcomes. The corresponding
joint distribution is multinomial:
(C00, C01, C10, C11) ∼ Mn(n, π00, π01, π10, π11).
Since p1 = π11 + π10 and p2 = π11 + π01, the difference of interest is p1 − p2 = π10 − π01, estimated by p̂1 − p̂2 = (c10 − c01)/n. Observe that
Var(C10 − C01) = nπ10(1 − π10) + nπ01(1 − π01) + 2nπ10π01 = n(π10 + π01 − (π10 − π01)²).
Referring to the central limit theorem we arrive at the following approximate 100(1 − α)% confidence interval
formula for the difference p1 − p2
Ip1−p2 ≈ p̂1 − p̂2 ± z(α/2) · sp̂1−p̂2.
McNemar’s test
A significant difference between p1 and p2 is established when the confidence interval Ip1 −p2 does not cover zero, that
is when
|p̂1 − p̂2| > z(α/2) · sp̂1−p̂2,
or in other words, the rejection region for
H0 : p1 = p2 against H1 : p1 ̸= p2
or equivalently
H0 : π10 = π01 against H1 : π10 ̸= π01
has the form
R = { |π̂10 − π̂01| / √( (π̂10 + π̂01 − (π̂10 − π̂01)²) / (n − 1) ) > z(α/2) }.
Now notice that for large sample sizes, the squared left-hand side approximately equals (c10 − c01)²/(c10 + c01), which under H0 is an observation from the χ²1 distribution; this ratio is known as McNemar's test statistic.
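A minimal sketch of McNemar's test from the two discordant counts c10 and c01 (the pair 7 and 15 below is borrowed from a study revisited in Chapter 9, purely as an illustration):

    from scipy.stats import chi2

    def mcnemar(c10, c01):
        """McNemar chi-squared statistic and its approximate p-value."""
        x2 = (c10 - c01) ** 2 / (c10 + c01)
        return x2, chi2.sf(x2, df=1)

    print(mcnemar(7, 15))   # statistic approx 2.91, p-value approx 0.09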
One-sample tests
• One sample t-test: normal population distribution
• Large sample test for mean
• Large sample test for proportion: categorical data
• Small sample test for proportion: categorical data
• Chi-squared test of goodness of fit: categorical data, large sample
• Chi-squared test of independence: categorical data, large sample
• Model utility test: linear model, several explanatory variables, normal noise, homoscedasticity
Two-sample tests
7.6 Exercises
Problem 1
Four random numbers were generated from a normal distribution N(µ1 , σ 2 )
along with five random numbers with the same variance σ 2 but perhaps a different mean µ2
(a) What do you think the means of the random normal number generators were? What do you think the difference
of the means was?
(b) Estimate σ 2 .
(c) What is the estimated standard error of your estimate of the difference of the means?
(d) Form a 90% confidence interval for the difference of the means.
(e) In this situation, is it more appropriate to use a one-sided test or a two-sided test of the equality of the means?
(f) What is the p-value of a two-sided test of the null hypothesis of equal means?
(g) Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the significance
level α = 0.1?
(h) Suppose you know that the variance of the normal distribution was σ 2 = 1. How would your answers to the
preceding questions change?
Problem 2
In the "two independent samples" setting we have two ways of estimating the variance of X̄ − Ȳ :
(a) s²p(1/n + 1/m), if σ1 = σ2 = σ,
(b) s²1/n + s²2/m, without the assumption of equal variances.
Show that if m = n, then these two estimates are identical.
Problem 3
An experiment of the efficacy of a drug for reducing high blood pressure is performed using four subjects in the
following way:
two of the subjects are chosen at random for the control group and two for the treatment group.
During the course of a treatment with the drug, the blood pressure of each of the subjects in the treatment group is
measured for ten consecutive days as is the blood pressure of each of the subjects in the control group.
(a) In order to test whether the treatment has an effect, do you think it is appropriate to use the two-sample t-test
with n = m = 20?
(b) Do you think it is appropriate to use the rank sum test?
Problem 4
This is an example of the so-called Simpson’s paradox.
Hospital A has a higher overall death rate than hospital B:
Hospital: A B
Died 63 16
Survived 2037 784
Total 2100 800
Death Rate 0.03 0.02
However, if we split the data in two parts, patients in good (+) and bad (−) conditions, for both parts hospital A
performs better:
Hospital: A+ A− B+ B−
Died 6 57 8 8
Survived 594 1443 592 192
Total 600 1500 600 200
Death Rate 0.010 0.038 0.013 0.040
Resolve this paradox.
Problem 5
Suppose that n measurements are to be taken under a treatment condition and another n measurements are to be
taken independently under a control condition. It is thought that the standard deviation of a single observation is
about 10 under both conditions. How large should n be so that a 95% confidence interval for the mean difference has
a width of 2? Use the normal distribution rather than the t-distribution, since n will turn out to be quite large.
Problem 6
Two types of engine bearings are compared.
The data in the table give the millions of cycles until failure.
Type I Type II
3.03 3.19
5.53 4.26
5.60 4.47
9.30 4.53
9.92 4.67
12.51 4.69
12.95 6.79
15.21 9.37
16.04 12.75
16.84 12.78
(a) Use a normal theory test for the null hypothesis of no difference against the two-sided alternative
H0 : µ1 = µ2 , H1 : µ1 ̸= µ2 .
(b) Test the hypothesis of no difference between the two types of bearing using a non-parametric method.
(c) Which of the methods (a) or (b) do you think is better in this case? (Hint: refer to the normal QQ-plot
constructed for the deviations of the 20 observations from their sample means.)
(d) Estimate π, the probability that a type I bearing will outlast a type II bearing.
Problem 7
Find the exact null distribution for the test statistic of the signed rank test with n = 4.
Problem 8
Turn to the signed rank test and denote by W a random variable having the null distribution of its test statistic. For
n = 10, 20, 25 and α = 0.05, 0.01, the next table gives the critical values wα such that 2P(W ≤ wα ) is closest to α.
n = 10 n = 20 n = 25
α = 0.05 8 52 89
α = 0.01 3 38 68
Compare these critical values with their counterparts wα◦ obtained using the normal approximation for W .
Problem 9
From two population distributions with the same σ = 10, two samples (x1 , . . . , xn ) and (y1 , . . . , yn ) of sizes n = 25
can be taken in two ways
(a) paired sample with Cov(Xi , Yi ) = 50, i = 1, . . . , 25,
(b) two independent random samples.
Compare the power curves for testing
H0 : µ1 = µ2 , H1 : µ1 > µ2 , α = 0.05,
for the settings (a) and (b).
Problem 10
The table below reports fifteen pairs (xi, yi) of measurements.
Sample 1  Sample 2
97.2      97.2
105.8     97.8
99.5      96.2
100       101.8
93.8      88
79.2      74
72        75
72        67.5
69.5      65.8
20.5      21.2
95.2      94.8
90.8      95.8
96.2      98
96.2      99
91        100.2
What are x̄ − ȳ and sx̄−ȳ? If the pairing had been erroneously ignored and it had been assumed that the two samples
were independent, what would have been the estimate of the standard deviation of X̄ − Ȳ? Analyse the data to
determine if there is a systematic difference between µ1 and µ2.
Problem 11
The media often present short reports of the results of experiments. To the critical reader, such reports often raise
more questions than they answer. Comment on the following pitfalls in the interpretation of each of the following.
(a) It is reported that patients whose hospital rooms have a window recover faster than those whose rooms do not.
(b) Nonsmoking wives whose husbands smoke have a cancer rate twice that of wives whose husbands do not smoke.
(c) A two-year study in North Carolina found that 75% of all industrial accidents in the state happened to workers
who had skipped breakfast.
(d) A 15-year study of more than 45 000 Swedish soldiers revealed that heavy users of marijuana were six times
more likely than nonusers to develop schizophrenia.
(e) A study of nearly 4000 elderly North Carolinians has found that those who attended religious services every week
were 46% less likely to die over a six-year period than people who attended less often or not at all.
Problem 12
The following table shows admission rates for the six most popular majors at the graduate school at the University of
California at Berkeley. The numbers in the table are the number of applicants and the percentage admitted.
Men Women
Major A 825 (62%) 108 (82%)
Major B 560 (63%) 25 (68%)
Major C 325 (37%) 593 (34%)
Major D 417 (33%) 375 (35%)
Major E 191 (28%) 393 (34%)
Major F 373 (6%) 341 (7%)
(a) If the percentages admitted are compared, women do not seem to be unfavourably treated. But when the combined
admission rates for all six majors are calculated, it is found that 44% of the men and only 30% of the women
were admitted. How is this paradox resolved?
(b) This is an example of an observational study. Suggest a controlled experiment testing relevant statistical
hypotheses.
Problem 13
You have got a grant to measure the average weight of the hippopotamus at birth. You have seen in a previous
publication by Stanley and Livingstone that for male calves the distribution of weights has a mean of roughly 70 kg
and a standard deviation of 10 kg, while these numbers are 60 kg and 5 kg for females, but you are interested in a
better remeasurement of the overall average.
The experimental procedure is simple: you wait for the herd of hippopotami to be sleeping, you approach a
newborn, you put it quickly on the scales, and you pray for the mother not to wake up. You managed to weigh 13
female and 23 male newborns with the following results:
Female Male
Sample mean 62.8 69.7
Sample standard deviation 6.8 11.7
(a) Test the null hypothesis of the equal sex ratio for the newborn hippopotami (meaning that the ratio of males to
females at birth is 1 to 1).
(b) Assuming the ratio of males to females at birth is 1 to 1, suggest two different unbiased point estimates for the
overall average weight of the hippopotamus at birth: the random sample mean and the stratified sample mean.
(c) Compute the standard errors for the stratified sample mean.
Problem 14
For each of nine horses, a veterinary anatomist measured the density of nerve cells at specified sites in the intestine:
Animal Site I Site II
1 50.6 38.0
2 39.2 18.6
3 35.2 23.2
4 17.0 19.0
5 11.2 6.6
6 14.2 16.4
7 24.2 14.4
8 37.4 37.6
9 35.2 24.4
The null hypothesis of interest is that in the population of all horses there is no difference between the two sites.
(a) Which of the two non-parametric tests is appropriate here: the rank sum test or the signed rank test? Explain
your choice.
(b) On the basis of the data, would you reject the null hypothesis? Use one of the tests named in item (a).
(c) Explain the following extract from the course text book:
More precisely, with the signed rank test, H0 states that the distribution of the differences is symmetric
about zero. This will be true if the members of pairs of experimental units are assigned randomly to
treatment and control conditions, and the treatment has no effect at all.
Problem 15
A study is conducted of the association between the rate at which words are spoken and the ability of a “talking
computer” to recognise commands that it is programmed to accept. A random sample of 50 commands is spoken
first at a rate under 60 words per minute, and then the same commands are repeated at a rate over 60 words per
minute. In the first case the computer recognised 42 out of 50 commands while in the second case it recognised only 35
commands. Is the observed difference statistically significant? Assume that if the "talking computer" recognised a
command correctly in the fast regime, then it also recognised the same command correctly in the slow regime.
Chapter 8
Analysis of variance
The two-sample setting from the previous chapter is the case where the response is explained by a single main factor
having two levels. In this chapter, devoted to the analysis of variance, or ANOVA for short, we extend the setting to
one or two main factors having arbitrarily many levels. The one-way layout setting extends the two independent
samples case, while the two-way layout generalises the paired sample setting of the previous chapter.
The table below lists 70 measurements of chlorpheniramine maleate in tablets with a nominal dosage of 4 mg made
by seven different labs. This is an example of a one-way layout data consisting of I = 7 independent samples each of
size n = 10.
Lab 1 Lab 2 Lab 3 Lab 4 Lab 5 Lab 6 Lab 7
4.13 3.86 4.00 3.88 4.02 4.02 4.00
4.07 3.85 4.02 3.88 3.95 3.86 4.02
4.04 4.08 4.01 3.91 4.02 3.96 4.03
4.07 4.11 4.01 3.95 3.89 3.97 4.04
4.05 4.08 4.04 3.92 3.91 4.00 4.10
4.04 4.01 3.99 3.97 4.01 3.82 3.81
4.02 4.02 4.03 3.92 3.89 3.98 3.91
4.06 4.04 3.97 3.9 3.89 3.99 3.96
4.10 3.97 3.98 3.97 3.99 4.02 4.05
4.04 3.95 3.98 3.90 4.00 3.93 4.06
The null hypothesis of interest H0 states that there is no significant difference between the output of the seven
laboratories. We would reject H0 if the discrepancy between the ordered sample means
Lab i 1 3 7 2 5 6 4
Mean µi 4.062 4.003 3.998 3.997 3.957 3.955 3.920
can not be explained by the noise factors.
The one-way layout model treats the data (yik) as observed values of
Yik = µ + αi + σZik,  i = 1, ..., I,  k = 1, ..., n,
where the main effects satisfy
α1 + ... + αI = 0.
Here the noise components σZik are assumed to have the same standard deviation σ, with independent Zik ∼ N(0, 1).
The corresponding maximum likelihood estimates of µ, µi = µ + αi, and αi are
µ̂ = ȳ..,  µ̂i = ȳi.,  α̂i = ȳi. − ȳ..,
and the fitted values give rise to
êik = yik − ȳi.,
where êik are the so-called residuals giving the discrepancy between the observed response and the corresponding
sample mean.
The ANOVA test is built around the following decomposition:
ssT = ssA + ssE,
where
ssT = Σi Σk (yik − ȳ..)² is the total sum of squares for the pooled sample,
ssA = n Σi (ȳi. − ȳ..)² is the treatment (between-group) sum of squares,
ssE = Σi Σk ê²ik is the error sum of squares.
Each of these sums of squares (ssT , ssA , ssE ) is characterised by its number of degrees of freedom:
df T = N − 1, df A = I − 1, df E = I · (n − 1).
Notice that
df T = df A + df E .
This decomposition says that the total variation in the response values is the sum of the between-group variation and
the within-group variation, making clear why the data analysis based on this decomposition is called the analysis of
variance.
By normalising the sums of squares with the help of the numbers of degrees of freedom, we obtain the so-called mean
sums of squares
msA = ssA/df A,  msE = ssE/df E.
The underlying random variables have the mean values
E(MSA) = σ² + (n/(I − 1)) Σi αi²,  E(MSE) = σ².
Notice that msE gives an unbiased estimate of the noise variance σ². It can be viewed as the pooled sample variance
s²p = msE = Σi=1..I Σk=1..n (yik − ȳi.)² / (I(n − 1)).
One-way F -test
Under the null hypothesis H0 : α1 = . . . = αI = 0, we expect
E(MSA ) = E(MSE ) = σ 2 .
On the other hand, if H0 is violated, then E(MSA) > E(MSE), since given some αi ≠ 0, we have
σ² + (n/(I − 1)) Σi αi² > σ².
This leads to the one-way F-test with the test statistic f = msA/msE, large values of which speak against H0. The
rejection region at significance level α takes the form
R = {f ≥ fα},
with the critical value fα determined by the null distribution F ∼ Fdf A,df E,
where Fk1 ,k2 is the so-called F-distribution with degrees of freedom (k1 , k2 ). The F-distribution is the distribution for
the ratio
(X1/k1) / (X2/k2) ∼ Fk1,k2,
where X1 ∼ χ2k1 and X2 ∼ χ2k2 are two independent random variables having chi-squared distributions. The critical
values of the F-distributions for different significance levels α are given in Section 11.4.
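As a computational sketch, the one-way F-test for the seven-labs data tabulated above can be reproduced with scipy:

    from scipy.stats import f_oneway

    labs = [
        [4.13, 4.07, 4.04, 4.07, 4.05, 4.04, 4.02, 4.06, 4.10, 4.04],  # Lab 1
        [3.86, 3.85, 4.08, 4.11, 4.08, 4.01, 4.02, 4.04, 3.97, 3.95],  # Lab 2
        [4.00, 4.02, 4.01, 4.01, 4.04, 3.99, 4.03, 3.97, 3.98, 3.98],  # Lab 3
        [3.88, 3.88, 3.91, 3.95, 3.92, 3.97, 3.92, 3.90, 3.97, 3.90],  # Lab 4
        [4.02, 3.95, 4.02, 3.89, 3.91, 4.01, 3.89, 3.89, 3.99, 4.00],  # Lab 5
        [4.02, 3.86, 3.96, 3.97, 4.00, 3.82, 3.98, 3.99, 4.02, 3.93],  # Lab 6
        [4.00, 4.02, 4.03, 4.04, 4.10, 3.81, 3.91, 3.96, 4.05, 4.06],  # Lab 7
    ]
    f, p = f_oneway(*labs)
    print(f, p)   # compare f with the critical value of F(6, 63)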
For the seven-labs data, the observed f-value turns out to be highly significant. The table below lists the 10 largest
pairwise differences between the sample means.
(i, j)      (1, 4) (1, 6) (1, 5) (3, 4) (7, 4) (2, 4) (1, 2) (1, 7) (1, 3) (5, 4)
ȳi. − ȳj.  0.142  0.107  0.105  0.083  0.078  0.077  0.065  0.064  0.059  0.047
Clearly, the difference (µ1 − µ4) corresponding to the largest sample mean difference must be significant at the 5%
significance level. The important follow-up question is: which of the 21 pairwise differences are significant and which
are not at the 5% significance level?
A naive answer to this question would rely on the 95% confidence interval for a single pair of independent samples
(µi − µj ) from the previous chapter
Iµi−µj = (ȳi. − ȳj.) ± t63(0.025) · sp/√5 = (ȳi. − ȳj.) ± 0.055,
where t63 (0.025) = 2.00. Notice that here we use the t-distribution with 63 degrees of freedom, as the corresponding
pooled sample variance s2p = 0.0037 is based on I = 7 samples each of size n = 10. According to this confidence
interval formula we have 9 significant differences: (1, 4), (1, 6), (1, 5), (3, 4), (7, 4), (2, 4), (1, 2), (1, 7), (1, 3), since the
next ordered difference 0.047 for the pair (5, 4) is smaller than 0.055. The inequality 0.047 < 0.055 ensures that the
95% confidence interval
Iµ5 −µ4 = 0.047 ± 0.055
contains 0, implying that the difference (µ5 − µ4 ) is not significant at the 5% significance level.
The problem with the suggested solution is that the use of the confidence interval formula aimed at a single difference
may produce false discoveries. This is an example of the famous multiple comparison problem nicely illustrated by
the illuminating cartoon "Jelly beans cause acne", see https://xkcd.com/882/. What we need here is a simultaneous
confidence interval formula addressing all c = 21 pairwise comparisons. Next, we introduce two formulas for the
simultaneous confidence interval: the Bonferroni formula Bµi−µj, and the Tukey formula Tµi−µj.
Bonferroni method
Think of a statistical test of a null hypothesis H0, performed at the significance level αc, that is repeatedly applied
to c independently sampled datasets. The overall result is deemed positive if we get at least one positive result among
these c tests. Then the overall significance level α is computed as
α = P(Xc ≥ 1|H0 ),
where Xc is the number of positive results among these c tests. Due to the independence of the tests, we have
Xc ∼ Bin(c, αc) under H0,
yielding
P(Xc ≥ 1|H0) = 1 − (1 − αc)^c.
Assuming αc is small, we find that
α = 1 − (1 − αc)^c ≈ cαc.
This reasoning leads to the following important conclusion: to obtain the overall significance level α for a multiple
testing procedure involving c independent tests, each individual test should be performed at the level αc = α/c.
Applying this idea to the current setting of c = I(I − 1)/2 pairwise differences (µi − µj), we arrive at the Bonferroni
formula of the 100(1 − α)% simultaneous confidence interval:
Bµi−µj = ȳi. − ȳj. ± tI(n−1)(α/(I(I−1))) · sp√(2/n),  1 ≤ i, j ≤ I.
This formula is obtained from the confidence interval for a single difference
Iµi−µj = ȳi. − ȳj. ± tI(n−1)(α/2) · sp√(2/n),  1 ≤ i, j ≤ I,
by replacing the level α/2 with αc/2 = α/(I(I − 1)).
Observe that the pairwise differences µi − µj are not independent as required by the Bonferroni method: for example,
knowing the differences (µ1 − µ2 ) and (µ2 − µ3 ) we may exactly predict the third difference µ1 − µ3 using the equality
µ1 − µ2 + µ2 − µ3 = µ1 − µ3 .
As a result the Bonferroni method gives slightly wider intervals compared to the Tukey method introduced below.
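A sketch of the Bonferroni half-width for the seven-labs setting (I = 7, n = 10, s²p = 0.0037 as quoted above):

    from math import sqrt
    from scipy.stats import t

    I, n, alpha = 7, 10, 0.05
    sp = sqrt(0.0037)
    half = t.ppf(1 - alpha / (I * (I - 1)), I * (n - 1)) * sp * sqrt(2 / n)
    print(half)   # compare each |y_i. - y_j.| with this half-width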
Tukey method
Under the normality assumption with equal variances, the centred sample means
Xi = Ȳi. − µi,  i = 1, ..., I,
are independent normal variables. Consider the range of all pairwise differences Xi − Xj:
max{X1, ..., XI} − min{X1, ..., XI},
giving the largest pairwise difference between the components of the vector (X1, ..., XI). The corresponding normalised
range has a distribution that is free from the parameter σ:
(max{X1, ..., XI} − min{X1, ..., XI}) / (Sp/√n) ∼ SRk1,k2,  k1 = I, k2 = I(n − 1).
Here, the so-called studentised range distribution SRk1 ,k2 is controlled by two parameters: the number of independent
samples k1 and the number of degrees of freedom k2 used in the variance estimate s2p .
Tukey's 100(1 − α)% simultaneous confidence interval:  Tµi−µj = ȳi. − ȳj. ± qI,I(n−1)(α) · sp/√n.
The Tukey interval is built using the factor qk1,k2(α) determined by the tail condition P(SRk1,k2 ≥ qk1,k2(α)) = α.
In contrast to Bonferroni, Tukey takes into account the dependences between the differences (µi − µj).
Applied to the seven-labs data, the Tukey formula produces a narrower simultaneous confidence interval,
recognising four significant pairwise differences: (1, 4), (1, 5), (1, 6), (3, 4).
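A corresponding sketch of the Tukey half-width, using scipy's studentised range distribution (available in scipy 1.7 and later):

    from math import sqrt
    from scipy.stats import studentized_range

    I, n, alpha = 7, 10, 0.05
    sp = sqrt(0.0037)
    q = studentized_range.ppf(1 - alpha, I, I * (n - 1))   # q_{I, I(n-1)}(alpha)
    print(q * sp / sqrt(n))   # Tukey half-width for all pairwise comparisons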
Kruskal-Wallis test
Extending the idea of the rank-sum test, turn to the pooled sample of size N = I · n. Let rik be the pooled ranks of
the sample values yik, so that
Σi Σk rik = 1 + 2 + ... + N = N(N + 1)/2.
With r̄i. denoting the average pooled rank of sample i, the Kruskal-Wallis test statistic
w = (12/(N(N + 1))) · n Σi (r̄i. − (N + 1)/2)²
measures the discrepancy between the samples; its approximate null distribution is χ²I−1.
Example: seven labs
In the table below the actual measurements are replaced by their ranks 1 to 70. For the observed Kruskal-Wallis test
statistic w = 28.17, using the χ²6-distribution table, we see that the p-value of the Kruskal-Wallis test is smaller than
0.005.
Labs 1 2 3 4 5 6 7
70 4 35 6 46 48 38
63 3 45 7 21 5 50
53 65 40 13 47 22 52
64 69 41 20 8 28 58
59 66 57 16 14 37 68
54 39 32 26 42 2 1
43 44 51 17 9 31 15
61 56 25 11 10 34 23
67 24 29 27 33 49 60
55 19 30 12 36 18 62
Means 58.9 38.9 38.5 15.5 26.6 27.4 42.7
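The observed Kruskal-Wallis statistic can be checked with scipy on the same seven samples (reusing the list labs from the earlier F-test sketch):

    from scipy.stats import kruskal

    # labs: the list of seven samples defined in the f_oneway sketch above
    stat, p = kruskal(*labs)
    print(stat, p)   # statistic close to 28.17, p-value below 0.005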
In the two-way layout, the data
{yijk, i = 1, ..., I, j = 1, ..., J, k = 1, ..., n}
are treated as observed values of
Yijk = µ + αi + βj + δij + σZijk,
where Zijk are independent random variables having the standard normal distribution N(0, 1). We start by assuming
that n ≥ 2, so that for each combination of levels (i, j) of the main factors we have at least two replications, which
allows us to measure the variation in the response variable within each cell (i, j). The important case n = 1 will
be addressed separately.
The maximum likelihood estimates under the assumption of normality take the form
µ̂ = ȳ... = (1/(IJn)) Σi Σj Σk yijk,
α̂i = ȳi.. − ȳ...,  where ȳi.. = (1/(Jn)) Σj Σk yijk,
β̂j = ȳ.j. − ȳ...,  where ȳ.j. = (1/(In)) Σi Σk yijk,
δ̂ij = ȳij. − ȳ... − α̂i − β̂j = ȳij. − ȳi.. − ȳ.j. + ȳ...,  where ȳij. = (1/n) Σk yijk.
Fe3+ (10.2) Fe3+ (1.2) Fe3+ (0.3) Fe2+ (10.2) Fe2+ (1.2) Fe2+ (0.3)
0.71 2.20 2.25 2.20 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.50 4.27 5.86 9.56
2.60 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.80
3.74 7.90 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.90
4.50 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.40 19.87
8.15 18.3 22.82 11.65 23.89 21.60
8.24 18.59 29.13 12.45 26.39 22.25
The six boxplots for the retention data (see the next figure, left panel) show that all samples come from population
distributions skewed to the right, and that there is a clear discrepancy among different sample standard deviations.
With the data skewed to the right, taking the logarithms might serve as a simple remedy. For this reason, we will
focus on the transformed data (yijk ) obtained by taking the natural logarithms of the percentage of iron retention.
The right panel of the next figure shows that the boxplots for the transformed data (yijk ) look more symmetric and
exhibit less discrepancy between the within-sample variations.
The next table lists the sample means (ȳij. , ȳi.. , ȳ.j. , ȳ... ) for the transformed data (yijk ):
            10.2   1.2    0.3    Level mean
Fe3+        1.16   1.90   2.28   1.78
Fe2+        1.68   2.09   2.40   2.06
Level mean  1.42   2.00   2.34   1.92
The resulting maximum likelihood estimates are
µ̂ = 1.92, α̂1 = −0.14, α̂2 = 0.14, β̂1 = −0.50, β̂2 = 0.08, β̂3 = 0.42,
and
(δ̂ij) = ( −0.12   0.04   0.08
            0.12  −0.04  −0.08 ).
It is useful to graphically compare the two profiles shown on the figure below, where the red line connects the three
sample means for the iron form Fe3+, see the first row of the previous table, and the blue line does the same for
Fe2+ . In particular, the fact that the two rows are not parallel may indicate a possible interaction between the two
main factors.
Observe that the estimated effect size for the log-retention is α̂2 − α̂1 = 0.28. This yields the multiplicative effect of
e^0.28 = 1.32 on the original scale, implying that the retention percentage of Fe2+ is higher than that of Fe3+ by a
factor of 1.32.
The two-way ANOVA is built on the following sums of squares and degrees of freedom:
ssA = Jn Σi α̂i²,  df A = I − 1,
ssB = In Σj β̂j²,  df B = J − 1,
ssAB = n Σi Σj δ̂ij²,  df AB = (I − 1)(J − 1),
ssE = Σi Σj Σk êijk²,  df E = IJ(n − 1).
The corresponding mean sums of squares and their expected values are
msA = ssA/df A,  E(MSA) = σ² + (Jn/(I − 1)) Σi αi²,
msB = ssB/df B,  E(MSB) = σ² + (In/(J − 1)) Σj βj²,
msAB = ssAB/df AB,  E(MSAB) = σ² + (n/((I − 1)(J − 1))) Σi Σj δij²,
msE = ssE/df E,  E(MSE) = σ².
The error mean sum of squares msE, an unbiased estimate of σ², is in fact the pooled sample variance
s²p = msE = Σi Σj Σk (yijk − ȳij.)² / (IJ(n − 1)).
In the two-way ANOVA setting, there are three different null hypotheses of interest:
HA : α1 = ... = αI = 0 (no effect of the first main factor),
HB : β1 = ... = βJ = 0 (no effect of the second main factor),
HAB : δij = 0 for all (i, j) (no interaction).
Following the idea of the F-test in the one-way setting, we arrive at three F-tests addressing the three null hypotheses
using three different F-test statistics:
FA = MSA/MSE ∼ Fdf A,df E under HA,
FB = MSB/MSE ∼ Fdf B,df E under HB,
FAB = MSAB/MSE ∼ Fdf AB,df E under HAB.
Again, we reject the null hypothesis for values of the F-test statistic larger than the critical value determined by
the corresponding F-distribution.
Example: iron retention
The two-way ANOVA table for the transformed iron retention data takes the form
Source df ss ms f p
Iron form 1 2.074 2.074 5.99 0.017
Dosage 2 15.588 7.794 22.53 0.000
Interaction 2 0.810 0.405 1.17 0.315
Error 102 35.296 0.346
Total 107 53.768
Referring to the three p-values in the rightmost column, obtained using the null distributions F1,102, F2,102, and F2,102,
we conclude that both main effects are significant while the interaction is not. Finally, after inspecting the normal
QQ-plot for the residuals (êijk), we accept the assumptions of normality and equal variance for the transformed data.
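A hedged sketch of how such a table can be produced with statsmodels; the data frame below is a synthetic stand-in for the 108 log-retention values, and the column names y, form, dosage are assumptions made for the illustration.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Synthetic stand-in for the transformed iron retention data (108 rows):
    rng = np.random.default_rng(0)
    data = pd.DataFrame({
        'form':   np.repeat(['Fe3+', 'Fe2+'], 54),
        'dosage': np.tile(np.repeat(['10.2', '1.2', '0.3'], 18), 2),
        'y':      rng.normal(2.0, 0.6, size=108),
    })
    model = ols('y ~ C(form) * C(dosage)', data=data).fit()
    print(sm.stats.anova_lm(model, typ=2))   # ss, df, f and p for each source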
Blocking is used to remove the effects of the most important nuisance variable.
Randomisation is then used to reduce the contaminating effects of
the remaining nuisance variables. Applied to the ANOVA two-way layout
setting, this principle leads to the following experimental design:
randomly assign I treatments within each of J blocks. The first of the next three examples is illustrated by the figure
above.
Block                                               Treatments                                                Observation
A homogeneous plot of land divided into I subplots  I fertilisers, each applied to a randomly chosen subplot  the yield on subplot (i, j)
A four-wheel car                                    4 types of tires tested on the same car                   the tire's life-length
A litter of I animals                               I diets randomly assigned to I siblings                   the weight gain
For the given data (yij), corresponding to the model Yij = µ + αi + βj + σZij, we find the maximum likelihood
estimates and the residuals using
µ̂ = ȳ..,  α̂i = ȳi. − ȳ..,  β̂j = ȳ.j − ȳ..,  êij = yij − ȳi. − ȳ.j + ȳ..,
with
ssT = Σi Σj (yij − ȳ..)²,  df T = IJ − 1,
ssA = J Σi α̂i²,  df A = I − 1,
ssB = I Σj β̂j²,  df B = J − 1,
ssE = Σi Σj êij²,  df E = (I − 1)(J − 1).
Using
msA = ssA/df A,  E(MSA) = σ² + (J/(I − 1)) Σi αi²,
msB = ssB/df B,  E(MSB) = σ² + (I/(J − 1)) Σj βj²,
msE = ssE/df E,  E(MSE) = σ²,
we can apply two F-tests for two different null hypotheses
HA : α1 = ... = αI = 0,  FA = MSA/MSE ∼ Fdf A,df E under HA,
HB : β1 = ... = βJ = 0,  FB = MSB/MSE ∼ Fdf B,df E under HB.
Friedman test
Here we introduce a non-parametric test for the two-way layout of the ANOVA setting with n = 1, which does
not require that the noise terms are normally distributed. The Friedman test addresses the null hypothesis H0 of no
treatment effect, basing the inference on within-block ranking. Let (r1j, ..., rIj) be the ranks of the responses within
the j-th block, so that
r1j + ... + rIj = 1 + 2 + ... + I = I(I + 1)/2,
and the average ranks are
r̄i. = (ri1 + ... + riJ)/J,  r̄.. = (r̄1. + ... + r̄I.)/I = (I + 1)/2.
Friedman test statistic q = (12J/(I(I + 1))) Σi=1..I (r̄i. − (I + 1)/2)² has an approximate null distribution Q ≈ χ²I−1 under H0.
The Friedman test statistic q is a measure of agreement between J rankings, so we reject H0 for large values of q.
Using the chi-squared distribution table with k = 6 degrees of freedom we find that the p-value of the test is approxi-
mately 3%. The Friedman test also rejects the null hypothesis and suggests the following top 3 list in the effectiveness
against itching:
(1) papaverine, (2) aminophylline, (3) morphine.
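A sketch of the Friedman computation with scipy; the three treatment rows below are hypothetical stand-ins, one sequence per treatment measured over the same five blocks (friedmanchisquare requires at least three treatments):

    from scipy.stats import friedmanchisquare

    # Hypothetical responses of 3 treatments over 5 blocks, for illustration:
    treatment_1 = [0.28, 0.51, 1.00, 0.39, 0.29]
    treatment_2 = [0.30, 0.39, 0.63, 0.68, 0.38]
    treatment_3 = [1.07, 1.35, 0.69, 0.28, 1.24]
    q, p = friedmanchisquare(treatment_1, treatment_2, treatment_3)
    print(q, p)   # compare q with the chi-squared table, df = I - 1 = 2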
8.6 Exercises
Problem 1
For the one-way analysis of variance with two treatment groups, show that the F -test statistic equals t2 , where t is
the test statistic for the two-sample t-test.
Problem 2
A study on the tensile strength of aluminium rods is conducted. Forty identical rods are randomly divided into four
groups, each of size 10. Each group is subjected to a different heat treatment, and the tensile strength, in thousands
of pounds per square inch, of each rod is determined. The following table presents the results of the measurements.
Consider the null hypothesis of equality between the four treatment means of tensile strength
H0 : µ1 = µ2 = µ3 = µ4 .
Treatment   1      2      3      4      Combined data
            18.9   18.3   21.3   15.9
            20.0   19.2   21.5   16.0
            20.5   17.8   19.9   17.2
            20.6   18.4   20.2   17.5
            19.3   18.8   21.9   17.9
            19.5   18.6   21.8   16.8
            21.0   19.9   23.0   17.7
            22.1   17.5   22.5   18.1
            20.8   16.9   21.7   17.4
            20.7   18.0   21.9   19.0
mean        20.34  18.34  21.57  17.35  19.40
variance    0.88   0.74   0.88   0.89   3.58
skewness    0.16   0.14   −0.49  −0.08  0.05
kurtosis    2.51   2.59   2.58   2.46   1.98
(a) Test the null hypothesis applying an ANOVA test. Show clearly how all the sums of squares are computed using
the sample means and variances given in the table.
(b) What are the assumptions of the ANOVA model you used? Do they appear fulfilled?
(c) The Bonferroni method gives the following formula for computing simultaneous 95% confidence intervals for six
pairwise differences between four treatment means
Bµi−µj = (ȳi. − ȳj.) ± t36(0.025/6) · 0.4472 · sp.
Explain this formula and using it check which of the pairs of treatments have significantly different means.
Problem 3
Consider the likelihood ratio test for the null hypothesis of the one-way layout under the normality assumption, and
show that it is equivalent to the F-test.
Problem 4
Suppose in a one-way layout there are 10 treatments and seven observations under each treatment. What is the
ratio of the length of a simultaneous confidence interval for the difference of two means formed by Tukey’s method to
that of one formed by the Bonferroni method? How do both of these compare in length to an interval based on the
t-distribution that does not take account of multiple comparisons?
Problem 5
During each of four experiments on the use of carbon tetrachloride as a worm killer, ten rats were infested with larvae
(Armitage 1983). Eight days later, five rats were treated with carbon tetrachloride; the other five were kept as controls.
After two more days, all the rats were killed and the numbers of worms were counted. The table below gives the counts
of worms for the four control groups.
Group I Group II Group III Group IV
279 378 172 381
338 275 335 346
334 412 335 340
198 265 282 471
303 286 250 318
Significant differences among the control groups, although not expected, might be attributable to changes in the
experimental conditions. A finding of significant differences could result in more carefully controlled experimentation
and thus greater precision in later work.
Use both graphical techniques and the F-test to test whether there are significant differences among the four groups.
Use a nonparametric technique as well.
Problem 6
The concentrations (in nanograms per milliliter) of plasma epinephrine were measured for 10 dogs under isofluorane,
halothane, and cyclopropane anesthesia. The measurements are given in the following table (Perry et al. 1974).
Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10
Isofluorane 0.28 0.51 1.00 0.39 0.29 0.36 0.32 0.69 0.17 0.33
Halothane 0.30 0.39 0.63 0.68 0.38 0.21 0.88 0.39 0.51 0.32
Cyclopropane 1.07 1.35 0.69 0.28 1.24 1.53 0.49 0.56 1.02 0.30
Is there a difference in treatment effects? Use a parametric and a nonparametric analysis.
Problem 7
The following table gives the survival times (in hours) for animals in an experiment whose design consisted of three
poisons, four treatments, and four observations per cell.
Treatment A Treatment B Treatment C Treatment D
Poison I 3.1 4.5 8.2 11.0 4.3 4.5 4.5 7.1
4.6 4.3 8.8 7.2 6.3 7.6 6.6 6.2
Poison II 3.6 2.9 9.2 6.1 4.4 3.5 5.6 10.0
4.0 2.3 4.9 12.4 3.1 4.0 7.1 3.8
Poison III 2.2 2.1 3.0 3.7 2.3 2.5 3.0 3.6
1.8 2.3 3.8 2.9 2.4 2.2 3.1 3.3
(a) Conduct a two-way analysis of variance to test the effects of the two main factors and their interaction.
(b) Box and Cox (1964) analysed the reciprocals of the data, pointing out that the reciprocal of a survival time can
be interpreted as the rate of death. Conduct a two-way analysis of variance, and compare to the results of part
(a). Comment on how well the standard two-way ANOVA model fits and on the interaction in both analyses.
Problem 8
Officials of a small transit system with only five buses want to evaluate four types of tires with respect to wear.
Applying a randomized block design, they decided to put one tire of each type on each of the five buses. The tires are
run for 15,000 miles, after which the tread wear, in millimeters, is measured.
Bus Tire 1 Tire 2 Tire 3 Tire 4 Mean
1 9.1 17.1 20.8 11.8 14.7
2 13.4 20.3 28.3 16.0 19.5
3 15.6 24.6 23.7 16.2 20.0
4 11.0 18.2 21.4 14.1 16.2
5 12.7 19.8 25.1 15.8 18.4
Mean 12.4 20.0 23.9 14.8 17.8
(a) State the most appropriate null hypothesis by referring to a suitable parametric model. What are the main
assumptions of the parametric model?
(b) Using a non-parametric procedure test the null hypothesis of no difference between the four types of tires.
(c) What kind of external effects are controlled by the suggested randomised block design? How should the wheel
positions for the different tire types be assigned on each of the five buses?
Problem 9
The accompanying data resulted from an experiment carried out to investigate whether yield from a certain chemical
process depended either on the formulation of a particular input or on mixer speed.
Speed
60 70 80 Means
1 189.7 185.1 189.0
1 188.6 179.4 193.0 187.03
1 190.1 177.3 191.1
Formulation
2 165.1 161.7 163.3
2 165.9 159.8 166.6 164.66
2 167.6 161.6 170.3
Means 177.83 170.82 178.88 175.84
A statistical computer package gave
ssA = 2253.44, ssB = 230.81, ssAB = 18.58, ssE = 71.87.
(a) Calculate estimates of the main effects.
(b) Does there appear to be interaction between Formulation and Speed? In which ways could interaction between
these two factors manifest itself? Illustrate with graphs.
(c) Does yield appear to depend either on formulation or on speed?
(d) Why is it important to inspect the normal QQ-plot of the 18 residuals?
Problem 10
Three different varieties of tomato (Harvester, Pusa Early Dwarf, and Ife No. 1) and four different plant densities (10,
20, 30, and 40 thousands plants per hectare) are being considered for planting in a particular region. To see whether
either variety or plant density affects yield, each combination of variety and plant density is used in three different
plots, resulting in the following data on yields:
Variety Density 10,000 Density 20,000 Density 30,000 Density 40,000 mean
Har 10.5, 9.2, 7.9 12.8, 11.2, 13.3 12.1, 12.6, 14.0 10.8, 9.1, 12.5 11.33
Ife 8.1, 8.6, 10.1 12.7, 13.7, 11.5 14.4, 15.4, 13.7 11.3, 12.5, 14.5 12.21
PED 16.1, 15.3, 17.5 16.6, 19.2, 18.5 20.8, 18.0, 21.0 18.4, 18.9, 17.2 18.13
mean 11.48 14.39 15.78 13.91 13.89
Source of variation ss df ms f
Varieties
Density
Interaction 8.03
Errors 38.04
(a) Complete the ANOVA table.
(b) Clearly state the three pairs of hypotheses of interest. Test them using the normal theory approach.
(c) Estimate the noise size σ.
Problem 11
A population with mean µ consists of three subpopulations with means µ1 , µ2 , µ3 and the same variance σ 2 . Three
independent random samples, each of size n = 13, from the three subpopulation distributions gave the following sample
means and standard deviations:
Sample 1 Sample 2 Sample 3
Mean 6.3 5.6 6.0
SD 2.14 2.47 3.27
(a) Compute a stratified sample mean, assuming that the three subpopulation sizes have the ratios N1 : N2 : N3 =
0.3 : 0.2 : 0.5. Prove that this is an unbiased estimate for the population mean µ.
(b) Assume that all three subpopulation distributions are normal. Write down a simultaneous confidence interval
for the differences µ1 − µ2 , µ1 − µ3 , and µ2 − µ3 .
(c) Would you reject the null hypothesis of equality µ1 = µ2 = µ3 in this case?
Problem 12
In an experimental study two volunteer male subjects aged 23 and 25 underwent three treatments to compare a new
drug against no drug and placebo. Each volunteer had one treatment per day and the time order of these three
treatments was randomised.
(a) Comment on the details of the experimental design.
(b) Find the exact null distribution for the test statistic of an appropriate non-parametric test.
Chapter 9
Categorical data analysis
Categorical data appear in the form of a contingency table containing the sample counts for several, say k, mutually
exclusive categories. The corresponding population distribution can be modelled as Mn(1, π1, ..., πk), an extension of
the Bernoulli distribution with k possible outcomes having probabilities (π1, ..., πk) such that
π1 + . . . + πk = 1.
In this case, the sample counts for a random sample of size n have the multinomial distribution
(C1 , . . . , Ck ) ∼ Mn(n, π1 , . . . , πk ).
Consider a cross-classification for a pair of categorical factors A and B. If factor A has I levels and factor B has
J levels, then the population distribution of a single cross classification event has the form
b1 b2 ... bJ total
a1 π11 π12 ... π1J π1·
a2 π21 π22 ... π2J π2·
... ... ... ... ... ...
aI πI1 πI2 ... πIJ πI·
total π·1 π·2 ... π·J 1
Here
πij = P(A = ai , B = bj )
are the joint probabilities, and
πi· = P(A = ai ), π·j = P(B = bj )
are the marginal probabilities. The null hypothesis of independence claims that there is no relationship between factors
A and B
H0 : πij = πi· π·j for all pairs (i, j).
The conditional probabilities
πi|j = P(A = ai | B = bj) = πij/π·j
are summarised in the next table
b1 b2 ... bJ
a1 π1|1 π1|2 ... π1|J
a2 π2|1 π2|2 ... π2|J
... ... ... ... ...
aI πI|1 πI|2 ... πI|J
total 1 1 ... 1
The null hypothesis of homogeneity states the equality of J population distributions
H0 : πi|j = πi· for all pairs (i, j).
This model is described by J multinomial distributions
(C1j, ..., CIj) ∼ Mn(nj, π1|j, ..., πI|j),  j = 1, ..., J.
Notice that the total number of degrees of freedom for J independent samples from I-dimensional multinomial distri-
butions is J(I − 1).
Under the hypothesis of homogeneity
H0 : πi|j = πi for all (i, j),
the maximum likelihood estimates of πi are the pooled sample proportions
π̂i = ci /n, i = 1, . . . , I.
These estimates consume (I − 1) degrees of freedom, since their sum is 1. Using these maximum likelihood estimates
we compute the expected cell counts
Eij = nj · π̂i = ci nj /n
and the chi-squared test statistic takes the form
x² = Σi=1..I Σj=1..J (cij − Eij)²/Eij = Σi=1..I Σj=1..J (cij − cinj/n)²/(cinj/n).
We reject H0 for larger values of x² using the approximate null distribution X² ≈ χ²df with df = (I − 1)(J − 1). In an
example cross-classifying three groups of drivers by their opinion of small cars, the observed statistic was
x² = 27.24 with df = (3 − 1) · (3 − 1) = 4.
Since 27.24 is larger than the table value χ²4(0.005) = 14.86, we reject the hypothesis of homogeneity at the 0.5%
significance level. Conclusion: the persons who saw themselves as cautious are more likely to express a favourable
opinion of small cars.
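The tail probability behind this rejection can be sketched directly from the quoted statistic:

    from scipy.stats import chi2

    x2, df = 27.24, 4
    print(chi2.sf(x2, df))   # p-value, far below 0.005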
In contrast to the previous setting of J independent samples, the total counts (n1, n2, ..., nJ) may themselves be
random outcomes of the cross-classification of a single sample of size n. Such data are modelled by the multinomial
joint distribution
(C11, ..., CIJ) ∼ Mn(n, π11, ..., πIJ),
characterised by IJ − 1 degrees of freedom. Using such data we would like to test the null hypothesis of independence.
The chi-squared tests of homogeneity and independence have the same test rejection rule.
We test H0 : no relationship between the marital status and the education level, by applying the chi-squared test of
independence. Using the expected counts given in the brackets, we find the chi-squared test statistic to be x2 = 16.01.
With df = 1 we can use the normal distribution table, since Z ∼ N(0, 1) is equivalent to Z² ∼ χ²1. Thus
P(X² ≥ 16.01 | H0) ≈ 2(1 − Φ(√16.01)) ≈ 0.00006.
It follows that the p-value of the test is less than 0.1%, and we reject the null hypothesis of independence.
Conclusion: the college educated women, once they marry, are less likely to divorce.
For studying a possible link between the exposure E (tonsillectomy) and the affection A (Hodgkin's disease), two designs can be contrasted:
• a prospective study: take an E-sample of size c1 and a control Ē-sample of size c2, then watch who gets affected;
because of the very low incidence of Hodgkin's disease, this would give a table of counts close to
   A  Ā
E  0  c1
Ē  0  c2
• a retrospective study: take an A-sample of size n1 and a control Ā-sample of size n2, then find out who of them
in the past had been exposed to tonsillectomy; this would give informative counts.
Two retrospective case-control studies produced opposite results. Study (a) by Vianna, Greenwald, and Davis
(1971) gave the next counts.
A Ā total
E 67 43 110
Ē 34 64 98
total 101 107 208
The chi-squared test of homogeneity was applied, which resulted in the test statistic x2a = 14.29. With df = 1, the
p-value was found to be very small
P(X²a ≥ 14.29 | H0) ≈ 2(1 − Φ(√14.29)) = 0.0002.
Study (b) by Johnson and Johnson (1972) was summarised by the table
A Ā total
E 41 33 74
Ē 44 52 96
total 85 85 170
and the chi-squared test of homogeneity was applied. With x²b = 1.53 and df = 1, the p-value was found to be
strikingly different:
P(X²b ≥ 1.53 | H0) ≈ 2(1 − Φ(√1.53)) = 0.215.
It turned out that the study (b) was based on a matched-pairs design violating the assumptions of the chi-squared
test of homogeneity. The sample consisted of 85 sibling pairs of the same sex and close age: one of the siblings was
affected, the other not. A proper summary of the study (b) sample distinguishes among four groups of sibling pairs
(the four cell counts below are recovered from the marginal totals together with the discordant counts 15 and 7 used
in McNemar's statistic):
                      Control: E   Control: Ē   total
Affected sibling: E   26           15           41
Affected sibling: Ē   7            37           44
total                 33           52           85
Notice that this contingency table contains the information of the previous one in terms of the marginal counts.
An appropriate test in this setting is McNemar’s test introduced in Section 7.4. For the data of study (b), the
McNemar test statistic is
x² = (7 − 15)²/(7 + 15) = 2.91,
giving the p-value of
P(X² ≥ 2.91 | H0) ≈ 2(1 − Φ(√2.91)) = 0.09.
The correct p-value of 0.09 is much smaller than that of 0.215 computed using the test of homogeneity. Since there
are only 7 + 15 = 22 informative observations, more data is required.
The strength of association in such 2 × 2 tables is conveniently measured in terms of odds. The odds of an event A are defined by
odds(A) = P(A)/P(Ac) = P(A)/(1 − P(A)),
taking non-negative values. Clearly,
P(A) = odds(A)/(1 + odds(A)),
and for small P(A), the values odds(A) and P(A) are close to each other. The conditional odds for A given B are
defined similarly:
odds(A|B) = P(A|B)/P(Ac|B) = P(AB)/P(AcB).
Defining the odds ratio for a pair of events (A, B) by
∆AB = odds(A|B)/odds(A|Bc),
we find that
∆AB = P(AB)P(AcBc) / (P(AcB)P(ABc)),
implying
∆AB = ∆BA,  ∆ABc = 1/∆AB.
The odds ratio is a measure of dependence between a pair of random events. It has the following properties
if ∆AB = 1, then events A and B are independent,
if ∆AB > 1, then P(A|B) > P(A|B c ) so that B increases the probability of A,
if ∆AB < 1, then P(A|B) < P(A|B c ) so that B decreases the probability of A.
Returning to study (a), the estimated conditional distributions of exposure given the affection status are
       A        Ā
E      67/101   43/107
Ē      34/101   64/107
total  1        1
Thus the odds ratio
∆AE = (P(E|A)/P(Ē|A)) · (P(Ē|Ā)/P(E|Ā))
is consistently estimated by
∆̂AE = (67/101)/(34/101) · (64/107)/(43/107) = (67 · 64)/(34 · 43) = 2.93.
9.5 Exercises
Problem 1
Adult-onset diabetes is known to be highly genetically determined. A study was done comparing frequencies of a
particular allele in a sample of such diabetics and a sample of non-diabetics.
diabetic normal total
Bb or bb 12 4 16
BB 39 49 88
total 51 53 104
Is there a relationship between the allele frequencies and the adult-onset diabetes?
Problem 2
Overfield and Klauber (1980) published the following data on the incidence of tuberculosis in relation to blood groups
in a sample of Eskimos. Is there any association of the disease and blood group within the ABO system or within the
MN system?
ABO system:
              O    A    AB   B
moderate      7    5    3    13
minimal       27   32   8    18
not present   55   50   7    24
MN system:
              MM   MN   NN
moderate      21   6    1
minimal       54   27   5
not present   74   51   11
Problem 3
It is conventional wisdom in military squadron that pilots tend to father more girls than boys. Snyder (1961) gathered
data for military fighter pilots. The sex of the pilots’ offspring were tabulated for three kinds of flight duty during the
month of conception, as shown in the following table.
girl boy
flying fighter 51 38
flying transport 14 16
not flying 38 46
Problem 4
A randomised double-blind experiment compared the effectiveness of several drugs in ameliorating postoperative
nausea. All patients were anesthetized with nitrous oxide and ether. The following table shows the incidence of
nausea during the first four hours for each of several drugs and a placebo (Beecher 1959).
number of patients incidence of nausea
placebo 165 95
chlorpromazine 152 52
dimenhydrinate 85 52
pentobarbital (100 mg) 67 35
pentobarbital (150 mg) 85 37
Compare the drugs to each other and to the placebo.
Problem 5
In a study of the relation of blood type to various diseases, the following data were gathered in London and Manchester
(Woolf 1955).
London Control Peptic Ulcer Manchester Control Peptic Ulcer
Group A 4219 579 Group A 3775 246
Group O 4578 911 Group O 4532 361
First, consider the two tables separately. Is there a relationship between blood type and propensity to peptic ulcer?
If so, evaluate the strength of the relationship. Are the data from London and Manchester comparable?
Problem 6
Records of 317 patients at least 48 years old who were diagnosed as having endometrial carcinoma were obtained from
two hospitals (Smith et al. 1975). Matched controls for each case were obtained from the two institutions: the controls
had cervical cancer, ovarian cancer, or carcinoma of the vulva. Each control was matched by age at diagnosis (within
four years) and year of diagnosis (within two years) to a corresponding case of endometrial carcinoma.
The following table gives the number of cases and controls who had taken estrogen for at least 6 months prior to
the diagnosis of cancer.
(a) Is there a significant relationship between estrogen use and endometrial cancer?
(b) This sort of design, called a retrospective case-control study, is frequently used in medical investigations where
a randomised experiment is not possible. Do you see any possible weak points in a retrospective case-control
design?
Problem 7
A psychological experiment was done to investigate the effect of anxiety on a person’s desire to be alone or in company
(Lehman 1975). A group of 30 subjects was randomly divided into two groups of sizes 13 and 17. The subjects were
told that they would be subjected to some electric shocks, but one group (high-anxiety) was told that the shocks
would be quite painful and the other group (low-anxiety) was told that they would be mild and painless. Both groups
were told that there would be a 10-min wait before the experiment began, and each subject was given the choice of
waiting alone or with the other subjects. The following are the results:
Wait Together Wait Alone Total
High-Anxiety 12 5 17
Low-Anxiety 4 9 13
Total 16 14 30
Use Fisher’s exact test to test whether there is a significant difference between the high- and low-anxiety groups. What
is a reasonable one-sided alternative, if any?
Problem 8
Hill and Barton (2005) asked the question: red against blue outfits - does it matter in combat sports? Although other
colors are also present in animal displays, it is specifically the presence and intensity of red coloration that correlates
with male dominance and testosterone levels. Increased redness during aggressive interactions may reflect relative
dominance.
In the 2004 Olympic Games, contestants in four combat sports were randomly assigned red and blue outfits. The
winner counts in different sports
Is there any evidence that wearing red is more favourable in some of the sports than others?
Problem 9
Suppose that a company wishes to examine the relationship of gender to job satisfaction, grouping job satisfaction into
four categories: very satisfied, somewhat satisfied, somewhat dissatisfied, and very dissatisfied. The company plans to
ask the opinion of 100 employees. Should you, the company’s statistician, carry out a chi-square test of independence
or a test of homogeneity?
Problem 10
Questions concerning hypotheses testing methodology. Try to give detailed answers.
(a) Consider a hypothetical study of the effects of birth control pills. In such a case, it would be impossible to
assign women to a treatment or a placebo at random. However, a non-randomized study might be conducted
by carefully matching control to treatments on such factors as age and medical history.
The two groups might be followed up on for some time, with several variables being recorded for each subject
such as blood pressure, psychological measures, and incidences of various problems. After termination of the
study, the two groups might be compared on each of these many variables, and it might be found, say, that there
was a "significant difference" in the incidence of melanoma.
What is a common problem with such "significant findings"?
(b) You analyse cross-classification data summarised in a two-by-two contingency table. You wanted to apply the
chi-squared test, but it showed that one of the expected counts was below 5. What alternative statistical test
might you try applying?
(c) Why are tests like the rank sum test, the Friedman test, and the Kruskal-Wallis test often called distribution-free tests?
Problem 11
A public policy polling group is investigating whether people living in the same household tend to make independent
political choices. They select 200 homes where exactly three voters live. The residents are asked separately for their
opinion ("yes" or "no") on a city charter amendment. The results of the survey are summarized in the table:
Number of saying "yes" 0 1 2 3
Frequency 30 56 73 41
Based on these data can we claim that opinions are formed independently?
Problem 12
Verify that the hypothesis of homogeneity
Problem 13
A study of the relationship between facility conditions at gasoline stations and aggressiveness in the pricing of gasoline
is based on n = 441 stations.
Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 15 17 56
Standard condition 52 73 80 205
Modern condition 58 86 36 180
Total 134 174 133 441
(a) Suggest a parametric model for the data and write down the corresponding likelihood function.
Problem 14
In a controlled clinical trial which began in 1982 and ended in 1987, more than 22000 physicians participated. The
participants were randomly assigned to two groups: Aspirin and Placebo. The aspirin group had been taking 325 mg
aspirin every second day. At the end of the trial, the number of participants who suffered from myocardial infarctions
was assessed.
MyoInf No MyoInf Total
Aspirin 104 10933 11037
Placebo 189 10845 11034
The popular measure in assessing the results in clinical trials is the risk ratio
RR = (104/11037)/(189/11034) ≈ 0.55.
(a) How would you interpret the obtained value of the risk ratio? What ratio of conditional probabilities is estimated
by RR?
(b) Is the observed value of RR significantly different from 1?
Chapter 10
Multiple regression
Karl Pearson collected the heights of 1,078 fathers and their full-grown sons in England, circa 1900. The following
scatter diagram illustrates the data, with one dot for each father-son pair. The straight line on the left panel is the
so-called regression line, which indicates the average son-height for the various father-heights. Using this line one can
predict the average son-height for a given father-height. In this sense the father-height x serves as a predictor variable
for the response variable y, the son-height.
The right panel adds several important insights on this dataset. It shows that the regression line goes through the
middle point (x̄, ȳ) of the scatter plot, where x̄ is the sample mean of the father-heights and ȳ is the sample mean of
the son-heights. The cross going through the central point (x̄, ȳ) divides the scatter plot into four quadrants. We see
that the majority of the dots lie in the first and the third quadrants, reflecting the fact that the heights of the father
and the son are positively correlated.
Consider now only the dots to the right of the vertical line, that is when the fathers are taller than average. For
this part of the dots, we see that on average the son is shorter than his father. To the left of the vertical line the
relation is opposite: on average the son is taller than his father. Francis Galton called this remarkable phenomenon
regression to mediocrity.
The simple linear regression model assumes that the response
y = β0 + β1x + σz,  z ∼ N(0, 1),
is a linear function of the predictor variable x plus a normally distributed noise of size σ. The key assumption of
homoscedasticity requires that the noise size is independent of the x-value. Whenever this assumption is violated, the
situation is described by the term heteroscedasticity.
Under this model, the data consist of n pairs
(x1 , y1 ), . . . , (xn , yn ),
where
yi = β0 + β1 xi + ei , i = 1, . . . , n,
involves a vector (e1 , . . . , en ) of independent realisations of a random variable with distribution N(0, σ).
The likelihood function of the parameter vector θ = (β0, β1, σ) is
L(θ) = C σ⁻ⁿ exp( −(1/(2σ²)) Σi=1..n ei² ),
where C = (2π)^(−n/2). This implies the following expression for the log-likelihood function l(θ) = ln L(θ):
l(θ) = ln C − n ln σ − (1/(2σ²)) Σi=1..n ei².
Observe that
ei = yi − β0 − β1xi,
implying
n⁻¹ Σi ei² = β0² + 2β0β1x̄ − 2β0ȳ − 2β1(n⁻¹Σi xiyi) + β1²(n⁻¹Σi xi²) + n⁻¹Σi yi².
Take the partial derivatives
∂l/∂β0 = −(n/σ²)(β0 + β1x̄ − ȳ),
∂l/∂β1 = −(n/σ²)(β0x̄ − n⁻¹Σi xiyi + β1·n⁻¹Σi xi²),
∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σi ei²,
and set them all equal to zero. As a result we get a system of three equations
b0 + b1x̄ = ȳ,  b0x̄ + b1·n⁻¹Σi xi² = n⁻¹Σi xiyi,  σ̂² = ssE/n,
where
ssE = ê1² + ... + ên²
is the error sum of squares defined in terms of the residuals
êi = yi − b0 − b1xi.
Solving the first two equations we obtain the maximum likelihood estimates for the slope and intercept
b1 = (n⁻¹Σi xiyi − x̄ȳ) / (n⁻¹Σi xi² − x̄²),  b0 = ȳ − b1x̄.
Notice that the estimates (b0, b1) of the parameters (β0, β1) are obtained by minimising the sum of squares:
ssE = min over (β0, β1) of Σi=1..n (yi − β0 − β1xi)²,
and for this reason, (b0, b1) are called the least squares estimates of (β0, β1).
The maximum likelihood estimate σ̂² = ssE/n is a biased but asymptotically unbiased estimate of σ². An unbiased
estimate of σ² is given by
s² = ssE/(n − 2).
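A minimal sketch of the least squares computation on synthetic data (the true parameter values below are arbitrary choices for the illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=30)
    y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=30)   # beta0 = 2, beta1 = 0.5

    b1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - b0 - b1 * x
    s2 = np.sum(residuals ** 2) / (len(x) - 2)        # unbiased estimate of sigma^2
    print(b0, b1, s2)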
Residuals
For each predictor value xi define the predicted response by
ŷi = b0 + b1 xi ,
so that the corresponding residual is given by the difference êi = yi − ŷi . The residuals (ê1 , . . . , ên ) are linearly
connected via
ê1 + . . . + ên = 0, x1 ê1 + . . . + xn ên = 0, ŷ1 ê1 + . . . + ŷn ên = 0,
implying that the êi are uncorrelated with the xi and with the ŷi. Thus under the simple linear regression
model, the scatter plot of the residuals êi versus xi should look like a horizontal blur. If the linear model is not valid,
it will show up as a somewhat bowed shape of the (xi, êi) scatter plot. In some cases, such a non-linearity problem
can be fixed by a log-transformation of the data.
The residuals êi are realisations of random variables Êi having normal distributions with zero means, whose
variances and covariances are given by
Var(Êi) = σ²(1 − cii/(n − 1)), Cov(Êi, Êj) = −σ²·cij/(n − 1),
where
cij = (Σ_{k=1}^n (xi − xk)(xj − xk))/(n·sx²).
For larger n, the random variables Êi can be viewed as independent having the same N(0, σ) distribution.
Let sxy denote the sample covariance of the pairs (xi, yi). Then the sample correlation coefficient between the predictor and response variables is defined as
r = sxy/(sx sy).
Noticing that
b1 = r·sy/sx,
we find that the fitted regression line y = b0 + b1x can be written in the form
y = ȳ + r·(sy/sx)·(x − x̄),
or equivalently
(y − ȳ)/sy = r·(x − x̄)/sx.
The last formula is the most appealing to the intuition: it says that the standardised versions of the predictor and the
response are connected through the factor r.
Coefficient of determination
Using the formula for the predicted responses
ŷi = ȳ + b1(xi − x̄), b1 = r·sy/sx,
one can show that ssR = r²·ssT, where
ssR = Σi (ŷi − ȳ)²
is the regression sum of squares and
ssT = Σi (yi − ȳ)² = (n − 1)sy²
is the total sum of squares. The obtained relations yield
ssR/ssT = r², ssE/ssT = 1 − r².
These equalities explain why the squared sample correlation coefficient r² is called the coefficient of determination:
r² is the proportion of variation in the response variable explained by the variation of the predictor variable. In this
respect the coefficient of determination r² has a more intuitive meaning than the sample correlation coefficient r itself.
The least squares estimates (b0, b1) are realisations of random variables (B0, B1) with the following exact sampling distributions:
B0 ∼ N(β0, σ0), σ0² = σ²·Σxi²/(n(n − 1)sx²), sb0² = s²·Σxi²/(n(n − 1)sx²), (B0 − β0)/SB0 ∼ tn−2,
B1 ∼ N(β1, σ1), σ1² = σ²/((n − 1)sx²), sb1² = s²/((n − 1)sx²), (B1 − β1)/SB1 ∼ tn−2.
For β0 and β1 , we get two exact 100(1 − α)% confidence intervals Iβi = bi ± tn−2 ( α2 ) · sbi
For i = 0 or i = 1 and a given value β∗, one could be interested in testing the null hypothesis H0 : βi = β∗.
Use the test statistic
t = (bi − β∗)/sbi,
which is a realisation of a random variable T that has the exact null distribution
T ∼ tn−2 .
Two null hypotheses are of special interest.
1. The model utility test aims at
H0 : β1 = 0,
stating that there is no relationship between the predictor variable x and the response y. The corresponding
test statistic, often called the t-value,
t = b1/sb1,
has tn−2 as the null distribution.
2. The zero-intercept test aims at
H0 : β0 = 0.
Its t-value t = b0/sb0 has the same null distribution tn−2.
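A minimal Python sketch of the two t-tests, reusing the standard error formulas above (the data arrays are again invented; scipy is assumed available):

import numpy as np
from scipy.stats import t as t_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

b1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)   # unbiased noise variance
sx2 = x.var(ddof=1)                              # sample variance of predictor

s_b1 = np.sqrt(s2 / ((n - 1) * sx2))
s_b0 = np.sqrt(s2 * np.sum(x**2) / (n * (n - 1) * sx2))

t1, t0 = b1 / s_b1, b0 / s_b0                    # t-values with df = n - 2
p1 = 2 * t_dist.sf(abs(t1), df=n - 2)            # two-sided p-value, H0: beta1 = 0
print(t1, t0, p1)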
Intervals for individual observations
Suppose a sample of size n resulted in the least squares estimates (b0 , b1 ), the noise size estimate s, the sample mean x̄
and the sample standard deviation sx for the predictor values in the sample. We are going to make a new measurement
with the predictor value x = xp and wish to say something on the response value
yp = β0 + β1 xp + σzp ,
where zp is generated by the N(0, 1) distribution independently of the available sample (x1 , y1 ), . . . , (xn , yn ). The
expected value of Yp ,
µp = β0 + β1 xp
is estimated by
µ̂p = b0 + b1 xp .
The standard error σp of µ̂p is computed as the square root of
Var(B0 + B1xp) = Var(B0) + xp²Var(B1) + 2xp·Cov(B0, B1) = σ²/n + (σ²/(n − 1))·((xp − x̄)/sx)².
This leads to the following exact 100(1 − α)% confidence interval for µp
Iµp = b0 + b1xp ± tn−2(α/2) · s·√(1/n + (1/(n − 1))·((xp − x̄)/sx)²).
The new feature of this section is the so-called prediction interval of the response value yp, which takes into account
the noise factors accompanying the new measurement.
Exact 100(1 − α)% prediction interval: Iyp = b0 + b1xp ± tn−2(α/2) · s·√(1 + 1/n + (1/(n − 1))·((xp − x̄)/sx)²).
This prediction interval has wider limits since it reflects the uncertainty due to the noise factors. The larger factor
appearing under the square root comes from the variance formula
Var(Yp − µ̂p) = Var(µp + σZp − µ̂p) = σ² + Var(µ̂p) = σ²(1 + 1/n + (1/(n − 1))·((xp − x̄)/sx)²).
The next figure illustrates the relationship between the 95% confidence and prediction intervals (Iµp, Iyp). Two
features are important to be aware of. Firstly, the prediction interval is broader: as n → ∞, the half-width
of the confidence interval goes to zero, while the half-width of the 95% prediction interval converges to 1.96σ. Secondly,
both intervals get wider as the value xp moves further from the sample mean x̄ of the predictor values.
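The two intervals are easy to compute numerically; here is a minimal Python sketch under the same invented data:

import numpy as np
from scipy.stats import t as t_dist

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

b1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
b0 = y.mean() - b1 * x.mean()
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
sx = x.std(ddof=1)

xp = 3.5                                   # new predictor value (arbitrary)
mu_hat = b0 + b1 * xp
tq = t_dist.ppf(0.975, df=n - 2)
half_ci = tq * s * np.sqrt(1 / n + ((xp - x.mean()) / sx) ** 2 / (n - 1))
half_pi = tq * s * np.sqrt(1 + 1 / n + ((xp - x.mean()) / sx) ** 2 / (n - 1))
print(mu_hat - half_ci, mu_hat + half_ci)  # interval for the mean response
print(mu_hat - half_pi, mu_hat + half_pi)  # wider prediction interval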
The multiple regression model assumes that each response value is a linear function of p − 1 predictor values plus noise,
yi = β0 + β1xi,1 + ... + βp−1xi,p−1 + ei, i = 1, ..., n,
where e1, ..., en are independently generated by the N(0, σ) distribution. In terms of the column vectors
y = (y1, ..., yn)⊺, β = (β0, ..., βp−1)⊺, e = (e1, ..., en)⊺,
the multiple regression model is compactly described by the relation
y = Xβ + e,
where X is the so-called design matrix, assumed to have rank p, with rows
(1, xi,1, . . . , xi,p−1), i = 1, . . . , n.
The machinery developed for the simple linear regression model works well for the multiple regression. The least
squares estimates b = (b0 , . . . , bp−1 )⊺ are obtained as
b = (X⊺ X)−1 X⊺ y,
and yield the next formula for the predicted responses ŷ = Xb:
ŷ = Py, where P = X(X⊺ X)−1 X⊺ .
Turning to the random vector B behind the least squares estimates b, we find that
E(B) = β.
Furthermore, the covariance matrix, the p × p matrix with elements Cov(Bi , Bj ), is given by
E{(B − β)(B − β)⊺ } = σ 2 (X⊺ X)−1 .
The vector of residuals
ê = y − ŷ = (I − P)y
has a zero mean vector and a covariance matrix σ 2 (I − P).
Denote by d0², . . . , dp−1² the diagonal elements of the matrix (X⊺X)⁻¹. Then the standard error of bj is computed as
sbj = s·dj, entailing the exact sampling distributions for the least squares estimates
(Bj − βj)/SBj ∼ tn−p, j = 0, 1, . . . , p − 1.
Special case: p = 2
For the simple linear regression case, the design matrix X has the dimensions n × 2, so that
X⊺ = (1 ... 1; x1 ... xn) (matrix rows separated by semicolons), so that
X⊺X = (n, x1 + ... + xn; x1 + ... + xn, x1² + ... + xn²) = n·(1, x̄; x̄, \overline{x²}).
It follows
(X⊺X)⁻¹ = (1/(n(\overline{x²} − x̄²)))·(\overline{x²}, −x̄; −x̄, 1), X⊺y = n·(ȳ; \overline{xy}),
and the formulas for b0 and b1 are recovered in the matrix form
b = (X⊺X)⁻¹X⊺y = (1/(\overline{x²} − x̄²))·(\overline{x²}, −x̄; −x̄, 1)·(ȳ; \overline{xy}).
For this example the response variable y is the catheter length, and the two predictors are x1, the height of a patient,
and x2, the same patient's weight. In this case p = 3, with the response vector and the design matrix given by

y = (20.0, 30.5, 38.5, 33.5, 34.5, 28.0, 36.0, 37.0, 37.0, 43.0, 47.0, 49.5)⊺,   (10.1)

and X with rows
1 22.5 8.5
1 23.5 9.5
1 33.0 21.0
1 37.0 33.0
1 37.5 35.5
1 38.5 17.0
1 39.5 30.0
1 42.8 40.0
1 43.0 38.5
1 45.5 52.0
1 58.0 79.0
1 63.5 93.5
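As a hedged illustration, the matrix formulas can be applied to the data (10.1) in a few lines of Python; the code simply implements b = (X⊺X)⁻¹X⊺y and sbj = s·dj, and no numerical results are asserted here:

import numpy as np

y = np.array([20.0, 30.5, 38.5, 33.5, 34.5, 28.0,
              36.0, 37.0, 37.0, 43.0, 47.0, 49.5])
height = np.array([22.5, 23.5, 33.0, 37.0, 37.5, 38.5,
                   39.5, 42.8, 43.0, 45.5, 58.0, 63.5])
weight = np.array([8.5, 9.5, 21.0, 33.0, 35.5, 17.0,
                   30.0, 40.0, 38.5, 52.0, 79.0, 93.5])
X = np.column_stack([np.ones(len(y)), height, weight])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                 # least squares estimates (b0, b1, b2)
resid = y - X @ b
s2 = resid @ resid / (n - p)          # unbiased noise variance, df = n - p
se = np.sqrt(s2 * np.diag(XtX_inv))   # standard errors s * d_j
print(b, se, b / se)                  # estimates, standard errors, t-values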
Rewriting the adjusted coefficient of determination as
Ra² = 1 − s²/sy²,
we arrive at the following transparent interpretation. Since s²/sy² estimates the proportion of the variance in y
explained by the noise, the adjusted Ra² estimates the proportion of the variance in y explained by the predictor
variables.
Polynomial regression is handled as a special case of multiple regression: the quadratic model
y = β0 + β1x + β2x²
becomes a multiple regression model with x1 = x and x2 = x². The results of such an analysis are presented in the table:
Coefficient Estimate Standard Error t value
β0 1.68 1.06 1.52
β1 −10.86 4.52 −2.40
β2 23.54 4.27 5.51
The residuals show no sign of systematic misfit. The test statistic t = 5.51 of the utility test for the crucial null
hypothesis H0 : β2 = 0 is larger than the critical value t7(0.0005) = 5.4079. We conclude that the quadratic term in
the model is statistically significant at the 0.1% significance level.
10.5 Exercises
Problem 1
Suppose we are given a two-dimensional random sample
(x1 , y1 ), . . . , (xn , yn ).
Verify that the sample covariance is an unbiased estimate of the population covariance.
Problem 2
Draw by hand a scatter plot for ten pairs of measurements
x 0.34 1.38 -0.65 0.68 1.40 -0.88 -0.30 -1.18 0.50 -1.75
y 0.27 1.34 -0.53 0.35 1.28 -0.98 -0.72 -0.81 0.64 -1.59
(a) Fit a straight line y = a + bx by the method of least squares, and sketch it on the plot.
(b) Fit a straight line x = c + dy by the method of least squares, and sketch it on the plot.
(c) Why are the lines in (a) and (b) not the same?
Problem 3
Each student receives two consecutive grade point averages
x = the high school GPA and y = the freshman GPA.
Starting from two coupled models, one for female students
y = β0′ + β1x + σz, Z ∼ N(0, 1),
and one for male students
y = β0′′ + β1x + σz, Z ∼ N(0, 1),
suggest a joint multiple regression model involving an extra predictor f which equals 1 if the student is female and 0
otherwise. Give the form of the design matrix for such a model.
Problem 4
Check that P = X(X⊺ X)−1 X⊺ is a projection matrix such that P2 = P.
Problem 5
The sample (x1 , y1 ), . . . , (xn , yn ) was collected for n students who took a statistical course, with
x giving the midterm grade and y giving the final grade.
The data produced the following five summary statistics
r = 0.5, x̄ = ȳ = 75, sx = sy = 10.
(a) Given a student’s midterm score x = 95, predict the final score of the student.
(b) Given the final score y = 85 and not knowing the midterm score of a student, predict the student’s midterm
score.
Problem 6
Let X ∼ N(0, 1) and Z ∼ N(0, 1) be two independent random variables and consider a third one
Y = X + βZ.
(a) Show that the correlation coefficient for X and Y is
ρ = 1/√(1 + β²).
(b) Using the result of part (a) suggest a way of generating bivariate vectors (x, y) with a specified positive population
correlation coefficient.
Problem 7
The stopping distance of an automobile on a certain road was studied as a function of velocity (Brownlee, 1960)
velocity of a car x (mi/h) 20.5 20.5 30.5 40.5 48.8 57.8
stopping distance y (ft) 15.4 13.3 33.9 73.1 113.0 142.6
Fit y and √y as linear functions of the velocity, and examine the residuals in each case. Which fit is better? Can you
suggest any physical reason that explains why?
Problem 8
An excerpt from Wikipedia:
"The American Psychological Association’s 1995 report "Intelligence: Knowns and Unknowns" stated
that the correlation between IQ and crime was −0.2. It was −0.19 between IQ scores and number of
juvenile offences in a large Danish sample; with social class controlled, the correlation dropped to −0.17.
A correlation of 0.20 means that the explained variance is less than 4%."
Explain how the 4% appearing in the last sentence was computed.
Problem 9
Verify that the test statistic for the model utility test can be expressed in terms of the sample correlation r as follows
t = b1/sb1 = r√(n − 2)/√(1 − r²).
Problem 10
The article "Effects of gamma radiation on juvenile and mature cuttings of quaking aspen" (Forest science, 1967)
reports the following data on exposure time to radiation x and dry weight of roots y:
x 0 2 4 6 8
y 110 123 119 86 62
Problem 11
A sports statistician studied the relation between the time (y in seconds) for a particular competitive swimming event
and the swimmer's age (x in years) for 20 swimmers with age ranging from 8 to 18. She employed a quadratic regression
model and obtained the following result.
The standard error for the curvature effect coefficient was estimated as sb2 = 0.1157.
(a) Plot the estimated regression function. Would it be reasonable to use this regression function when the swimmer’s
age is 40?
(b) Construct a 99 percent confidence interval for the curvature effect coefficient. Interpret your interval estimate.
(c) Test whether or not the curvature effect can be dropped from the quadratic regression model, controlling the α
risk at 0.01. State the alternatives, the decision rule, the value of the test statistic, and the conclusion. What is
the p-value of the test?
Problem 12
Suppose that the grades of 10 students on the midterm and final exams have a correlation coefficient of 0.5, and both
exams have an average score of 75 and a standard deviation of 10.
(a) Sketch a scatterplot illustrating performance on two exams for this group of 10 students.
(b) If Carl’s score on the midterm is 90, what would you predict his score on the final to be? How uncertain is this
prediction?
(c) If Maria scored 80 on the final, what would you guess that her score on the midterm was?
(d) Exactly what assumptions do you make to make your calculations in (b) and (c)?
Chapter 11
Statistical tables
11.1 Normal distribution table
If Z has the standard normal distribution N(0, 1), the table gives the values of Φ(z) = P(Z ≤ z). For example, the
number in the row 1.9 and column 0.06 gives Φ(1.96) = 0.9750.
11.2 Critical values of the t-distribution
If T has a tk -distribution with k degrees of freedom, then for a chosen α, the following table gives the value of tk (α)
such that P(T > tk (α)) = α. For example, if k = 5 and α = 0.05, then t5 (0.05) = 2.015.
11.3 Critical values of the chi-squared distribution
If X has a χ2k -distribution with k degrees of freedom, then for a chosen α, the following table gives the value of xk (α)
such that P(X > xk (α)) = α. For example, if k = 5 and α = 0.05, then x5 (0.05) = 11.070.
11.4 Critical values of the F-distribution
If F has a Fdf,k -distribution, then the tables below give the critical value Fdf,k (α) such that P(F > Fdf,k (α)) = α.
Different k are shown as columns and different α are shown as rows.
For example, if df = 2, k = 6, and α = 0.025, then F2,6 (0.025) = 7.2599.
F5,k -distribution table
Chapter 12
Solutions to exercises
Solutions to Section 1.7
Solution 2
With
Z1 = (X1 − µ1)/σ1, Z2 = (X2 − µ2)/σ2,
we get
E(Zi) = (1/σi)·E(Xi − µi) = (1/σi)·(µi − µi) = 0,
and
Var(Zi) = E(((Xi − µi)/σi)²) = (1/σi²)·E((Xi − µi)²) = σi²/σi² = 1.
By the definition of the correlation coefficient,
ρ = (1/(σ1σ2))·E((X1 − µ1)(X2 − µ2)) = E(Z1Z2).
This number is a dimensionless quantity in the interval [−1, 1] by the Cauchy–Schwarz inequality, see Wikipedia. To
illustrate, let X1 and X2 be the height and the weight of a randomly chosen person. The standardised height Z1
and weight Z2 would not depend on whether the numbers (X1, X2) are given in (centimetres, kilograms) or (inches,
pounds).
Solution 3
Given (X1, . . . , Xr) ∼ Mn(n; p1, . . . , pr), the sum X = Xi + Xj can be treated as the number of trials among n
independent trials with r possible outcomes in which either outcome i or outcome j is observed. Since X is the number
of successes among n trials with the probability pi + pj of a success, we conclude that
Xi + Xj ∼ Bin(n, pi + pj ).
Solution 4
If X ∼ Gam(α, λ), then the probability density function of X is
f(x) = (1/Γ(α))·λ^α x^{α−1} e^{−λx}, x > 0.
The distribution function of the scaled random variable Y = λX is
P(Y ≤ x) = P(X ≤ λ−1 x) = F (λ−1 x).
Taking the derivative over x on both sides, we find the probability density function of Y to be
λ⁻¹f(λ⁻¹x) = λ⁻¹·(1/Γ(α))·λ^α (λ⁻¹x)^{α−1} e^{−x} = (1/Γ(α))·x^{α−1} e^{−x},
indeed the density of the gamma distribution Gam(α, 1).
Solution 5
Solution 6
Because the average event rate is 2.5 goals per match, the Poisson formula with µ = 2.5 gives
P(x goals in a match) = 2.5ˣe^{−2.5}/x!, x = 0, 1, ...
Therefore, we get
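A minimal Python sketch of these Poisson probabilities (the range of x values shown is an arbitrary choice):

from math import exp, factorial

mu = 2.5
for x in range(8):
    # P(x goals) = mu^x * e^{-mu} / x!
    print(x, round(mu**x * exp(-mu) / factorial(x), 4))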
Solution 7
If XN has the hypergeometric distribution Hg(N, n, p), then
P(XN = x) = (Np choose x)·(N(1 − p) choose n − x) / (N choose n).
As N → ∞, we have
(N choose n) ∼ Nⁿ/n!, (Np choose x) ∼ (Np)ˣ/x!, (N(1 − p) choose n − x) ∼ (N(1 − p))^{n−x}/(n − x)!,
so that
P(XN = x) → n!pˣ(1 − p)^{n−x}/(x!(n − x)!) = (n choose x)·pˣ(1 − p)^{n−x}, N → ∞.
Solution 8
We have
P(X > n) = 1 − p·Σ_{x=0}^{n−1} (1 − p)ˣ = (1 − p)ⁿ.
Solutions to Section 2.7
Solution 1
The next table lists the possible values of the sample mean x̄ = (x1 + x2)/2 together with their probabilities given in
brackets, obtained by multiplication of the marginal probabilities.
x2 = 1 x2 = 2 x2 = 4 x2 = 8 marginal prob.
x1 = 1 1.0 (1/25) 1.5 (2/25) 2.5 (1/25) 4.5 (1/25) 1/5
x1 = 2 1.5 (2/25) 2.0 (4/25) 3.0 (2/25) 5.0 (2/25) 2/5
x1 = 4 2.5 (1/25) 3.0 (2/25) 4.0 (1/25) 6.0 (1/25) 1/5
x1 = 8 4.5 (1/25) 5.0 (2/25) 6.0 (1/25) 8.0 (1/25) 1/5
marginal prob. 1/5 2/5 1/5 1/5 total prob. = 1
From this sampling distribution table, using the same three steps as for the computing of µ and σ 2 , we find
E(X̄) = 1·(1/25) + 1.5·(4/25) + 2·(4/25) + 2.5·(2/25) + 3·(4/25) + 4·(1/25) + 4.5·(2/25) + 5·(4/25) + 6·(2/25) + 8·(1/25) = 3.4,
E(X̄²) = 1·(1/25) + (1.5)²·(4/25) + 4·(4/25) + (2.5)²·(2/25) + 9·(4/25) + 16·(1/25) + (4.5)²·(2/25) + 25·(4/25) + 36·(2/25) + 64·(1/25) = 14.68,
Var(X̄) = 14.68 − (3.4)² = 3.12.
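The same numbers can be checked by a small enumeration in Python, using the marginal probabilities from the table above:

from itertools import product

prob = {1: 1/5, 2: 2/5, 4: 1/5, 8: 1/5}   # population values and probabilities

mean = sum((a + b) / 2 * prob[a] * prob[b] for a, b in product(prob, prob))
second = sum(((a + b) / 2) ** 2 * prob[a] * prob[b] for a, b in product(prob, prob))
print(mean, second - mean**2)             # 3.4 and 3.12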
Solution 2
The question is about a simple random sample, with unspecified population size N . We assume that N is much larger
than the sample size n = 1500 and use the formulas for a random sample with independent observations.
For the given dichotomous data, we have
n = 1500, p̂ = 0.55, 1 − p̂ = 0.45, sp̂ = √(p̂(1 − p̂)/(n − 1)) = √(0.55 × 0.45/1499) = 0.013,
v = p − (1 − p) = 2p − 1.
(a) Since
Var(V̂ ) = Var(2P̂ − 1) = Var(2P̂ ) = 4Var(P̂ ),
the standard error of v̂ is twice the standard error of p̂
√Var(V̂) = 2√Var(P̂).
Solution 3
By the central limit theorem the t-score (X̄ − µ)/SX̄ is asymptotically N(0, 1)-distributed. Therefore, referring to the normal
distribution table we get
0.90 ≈ P((X̄ − µ)/SX̄ > −1.28) = P(−∞ < µ < X̄ + 1.28·SX̄),
0.95 ≈ P((X̄ − µ)/SX̄ < 1.645) = P(X̄ − 1.645·SX̄ < µ < ∞),
implying
k1 = 1.28, k2 = 1.645.
Solution 4
It is enough to verify that for any random variable X and constant θ,
E((X − θ)²) = Var(X) + (µ − θ)²,
where µ is the expected value of X. This is done by a simple transformation of the left-hand side and opening the
brackets:
E((X − θ)2 ) = E((X − µ + µ − θ)2 ) = Var(X) + 2E((X − µ)(µ − θ)) + (µ − θ)2 = Var(X) + (µ − θ)2 .
Solution 5
Based on the data summary
N = 2000, n = 25, Σxi = 2351, Σxi² = 231305,
Solution 6
The bias size is computed as follows:
E(X̄²) − µ² = E(X̄²) − (E X̄)² = Var(X̄) = (σ²/n)·(1 − (n − 1)/(N − 1)).
For large sample sizes n, the bias is small.
Solution 7
The problem deals with a stratified population of size N = 2010 having k = 7 strata.
(a) With n = 100, we get the following answers using the relevant formulas
Stratum number j: 1 2 3 4 5 6 7 | Weighted mean
Stratum proportion wj: 0.196 0.229 0.195 0.166 0.084 0.056 0.074
Stratum mean µj: 5.4 16.3 24.3 34.5 42.1 50.1 63.8 | µ = 26.311
Stratum standard deviation σj: 8.3 13.3 15.1 19.8 24.5 26.0 35.2 | σ̄ = 17.018
Optimal allocation n·wjσj/σ̄: 10 18 17 19 12 9 15
Proportional allocation n·wj: 20 23 19 17 8 6 7
Var(X̄so) = σ̄²/n = 2.896, Var(X̄sp) = (Σ wjσj²)/n = 3.433, Var(X̄) = σ²/n = 6.20,
where σ 2 is computed in the next item.
Therefore
σ 2 = 343.28 + 276.89 = 620.17, σ = 24.90.
(d) If n1 = . . . = n7 = 10 and n = 70, then
Var(X̄s) = w1²σ1²/n1 + ... + wk²σk²/nk = 4.45.
The requested sample size x is found from the equation
Var(X̄) = σ²/x = 620.17/x = 4.45, x = 620.17/4.45 = 139.364.
After rounding up, we find the answer x = 140, which is twice the size n = 70 of the stratified sample.
(e) The proportional allocation of the sample size n = 70 gives the variance of the stratified sample mean
Var(X̄sp) = 4.90. Notice that it is larger than Var(X̄s) = 4.45 of item (d). Solving the equation
Var(X̄) = σ²/x = 620.17/x = 4.90, x = 620.17/4.90 = 126.57,
we find that the corresponding random sample size is x = 127.
Solution 9
We are dealing with a stratified population having
N = 5, k = 2, w1 = 0.6, w2 = 0.4, µ1 = 1.67, µ2 = 6, σ12 = 0.21, σ22 = 4.
Given n1 = n2 = 1 and n = 2, the stratified sample mean x̄s = 0.6x1 + 0.4x2 can take four possible values listed in
the table below (with probabilities in the brackets).
x1 = 1 x1 = 2 marginal prob.
x2 = 4 2.2 (1/6) 2.8 (2/6) 1/2
x2 = 8 3.8 (1/6) 4.4 (2/6) 1/2
marginal prob. 1/3 2/3 1
Using this table we find that
E(X̄s) = 2.2·(1/6) + 2.8·(2/6) + 3.8·(1/6) + 4.4·(2/6) = 3.4,
(E(X̄s))² = 11.56,
E(X̄s²) = (2.2)²·(1/6) + (2.8)²·(2/6) + (3.8)²·(1/6) + (4.4)²·(2/6) = 12.28,
Var(X̄s) = 12.28 − 11.56 = 0.72.
These results are in agreement with the formulas of Section 2.5:
E(X̄s) = µ, Var(X̄s) = w1²σ1²/n1 + ... + wk²σk²/nk = 0.36σ1² + 0.16σ2².
Solution 10
Data: a random sample of size n = 16 taken from a normal distribution.
(b), (c) The following exact confidence intervals are computed using the formulas of Section 2.6
90% 95% 99%
Iµ 3.61 ± 0.81 3.61 ± 0.98 3.61 ± 1.36
Iσ 2 (2.05; 7.06) (1.87; 8.19) (1.56; 11.15)
Iσ (1.43; 2.66) (1.37; 2.86) (1.25; 3.34)
(d) To find the sample size x that halves the confidence interval length, we set up an equation using the exact
confidence interval formula for the mean:
t15(α/2)·s/√16 = 2·tx−1(α/2)·s′/√x,
where s′ is the sample standard deviation for the sample of size x. Treating s′ ≈ s and the two t-values as
approximately equal, this simplifies to the equation 1/4 = 2/√x, implying x ≈ (2·4)² = 64. Further adjustment
for a 95% confidence interval is obtained using
t15(α/2) = 2.13, tx−1(α/2) ≈ 2,
yielding x ≈ (2·4·2/2.13)² = 56.4. We conclude that going from a sample of size 16 to a sample of size 56 would halve
the length of the confidence interval for µ.
Solutions to Section 3.4
Solution 1
A method of moments estimate of the parameter µ for the Poisson distribution model is given by the sample mean
µ̃ = 3.9. Using this value we compute the expected counts, see the table. Comparing the observed and expected counts
by the naked eye, we see that the Poisson model does not fit well. The sample variance is close to 5, which shows that
there is over-dispersion (the estimated variance is larger than the estimated mean µ̃).
This extra variation in the data can be explained by the fact that the 300 intervals were distributed over various
hours of the day and various days of the week.
x observed counts expected counts
0 14 6.1
1 30 23.8
2 36 46.3
3 68 60.1
4 43 58.5
5 43 45.6
6 30 29.6
7 14 16.4
8 10 8.0
9 6 3.5
10 4 1.3
11 1 0.5
12 1 0.2
13+ 0 0.1
Solution 2
The likelihood function has the form
L(p) = Π_{i=1}^n (1 − p)^{xi−1} p = (1 − p)^{t−n} pⁿ,
where
t = x1 + . . . + xn
is a sufficient statistic. The log-likelihood function
l(p) = (t − n) ln(1 − p) + n ln p
Solution 3
The population distribution is discrete: the random variable X takes the values x = 0, 1, 2, 3 with probabilities
p0 = (2/3)·θ, p1 = (1/3)·θ, p2 = (2/3)·(1 − θ), p3 = (1/3)·(1 − θ),
so that
p0 + p1 = θ, p2 + p3 = 1 − θ.
We are given a random sample with
n = 10, x̄ = 1.5, s = 1.08,
and observed counts
x 0 1 2 3 Total
cx 2 3 3 2 10
(a) Method of moments. The expression for the first population moment,
µ = (1/3)·θ + 2·(2/3)·(1 − θ) + 3·(1/3)·(1 − θ) = 7/3 − 2θ,
leads to the equation
x̄ = 7/3 − 2θ̃.
It gives an unbiased estimate
θ̃ = 7/6 − x̄/2 = 7/6 − 3/4 = 0.417.
(b) To find sθ̃, observe that
Var(Θ̃) = (1/4)·Var(X̄) = σ²/40.
Thus we need to find sθ̃, which estimates σθ̃ = σ/6.325. Next we estimate σ using two methods.
Method 1. From
σ² = E(X²) − µ² = (1/3)·θ + 4·(2/3)·(1 − θ) + 9·(1/3)·(1 − θ) − (7/3 − 2θ)² = 2/9 + 4θ − 4θ²,
we can estimate σ after replacing θ with θ̃:
σ̃ = √(2/9 + 4θ̃ − 4θ̃²) = 1.093.
This gives
sθ̃ = 1.093/6.325 = 0.173.
Method 2. Using the sample standard deviation s we get
sθ̃ = s/6.325 = 1.08/6.325 = 0.171.
(c) Using the multinomial distribution (C0, C1, C2, C3) ∼ Mn(n, p0, p1, p2, p3) we obtain the likelihood function
L(θ) = ((2/3)θ)^{c0}·((1/3)θ)^{c1}·((2/3)(1 − θ))^{c2}·((1/3)(1 − θ))^{c3} = const · θᵗ(1 − θ)^{n−t},
where t = c0 + c1 is a sufficient statistic. Notice that the underlying random variable T = C0 + C1 has the binomial
distribution Bin(n, θ). The log-likelihood function and its derivative take the form
l(θ) = const + t ln θ + (n − t) ln(1 − θ),
l′(θ) = t/θ − (n − t)/(1 − θ).
Setting the last expression to zero, we find
t/θ̂ = (n − t)/(1 − θ̂) ⇒ θ̂ = t/n = (2 + 3)/10 = 1/2.
Thus the maximum likelihood estimate is the sample proportion, which is an unbiased estimate of the population
proportion θ.
(d) We find sθ̂ using the formula for the standard error of the sample proportion
sθ̂ = √(θ̂(1 − θ̂)/(n − 1)) = 0.167.
Solution 4
For a given n and X = x the likelihood function is
L(p) = (n choose x)·pˣ(1 − p)^{n−x} ∝ pˣ(1 − p)^{n−x}.
(a) We find p̂ maximising L(p) by maximising
ln(px (1 − p)n−x ) = x ln p + (n − x) ln(1 − p).
Since
∂/∂p (x ln p + (n − x) ln(1 − p)) = x/p − (n − x)/(1 − p),
we have to solve the familiar equation
x/p = (n − x)/(1 − p),
which brings the maximum likelihood estimate formula p̂ = x/n.
(b) The graph of the likelihood function L(p) = 252p5 (1 − p)5 is symmetric around the middle point p̂ = 0.5.
Solution 5
The observed serial number x = 888 can be modelled by the discrete uniform distribution
P(X = x) = 1/N, x = 1, . . . , N.
Solution 6
We will treat x as the number of black balls obtained by sampling k = 50 balls with replacement from an urn with N
balls of which n = 100 balls are black. Then x = 20 is a realisation of the binomial distribution Bin(50, 100/N). Thus
the likelihood function is
L(N) = P(X = 20) = (50 choose 20)·(100/N)²⁰·((N − 100)/N)³⁰ = const · (N − 100)³⁰/N⁵⁰.
The log-likelihood function takes the form
Solution 7
(a) Given Σxi = 58 and Σxi² = 260, the likelihood function is
L(µ, σ²) = Π_{i=1}^n (1/(√(2π)σ))·e^{−(xi−µ)²/(2σ²)} = (2π)^{−n/2}σ^{−n}·e^{−(Σxi² − 2µΣxi + nµ²)/(2σ²)} = (2π)⁻⁸σ⁻¹⁶·e^{−(260 − 116µ + 16µ²)/(2σ²)}.
(b) It is sufficient to know the values of Σxi and Σxi² to compute the likelihood function.
Since
E(n⁻¹ΣXi² − X̄²) = σ² − σ²/n = ((n − 1)/n)·σ²,
σ̂² is a biased estimate of σ².
Solution 8
Consider a random sample (x1 , . . . , xn ) drawn from the uniform distribution U(0, θ) with density
f(x|θ) = (1/θ)·1{0 ≤ x ≤ θ}.
(a) The method of moments estimate θ̃ = 2x̄ is unbiased, and its sampling distribution has the following mean and
variance:
E(Θ̃) = θ, Var(Θ̃) = 4σ²/n = θ²/(3n).
(b) In terms of x(n) = max(x1 , . . . , xn ), the likelihood function takes the form
L(θ) = f(x1|θ)···f(xn|θ) = θ⁻ⁿ·1{θ ≥ x1}···1{θ ≥ xn} = θ⁻ⁿ·1{θ ≥ x(n)},
so that x(n) is a sufficient statistic. The maximum of this function is achieved at θ̂ = x(n) .
(c) The sampling distribution of the maximum likelihood estimate θ̂ = x(n) is computed as
P(X(n) ≤ x) = P(X1 ≤ x, . . . , Xn ≤ x) = P(X1 ≤ x)···P(Xn ≤ x) = (x/θ)ⁿ,
where we used independence between different Xi. Taking the derivative, we find the probability density function to
be
fΘ̂(x) = n·x^{n−1}/θⁿ, 0 ≤ x ≤ θ.
Using this density function we can compute the mean and variance of Θ̂:
E(Θ̂) = (n/θⁿ)·∫₀^θ xⁿ dx = (n/(n + 1))·θ, E(Θ̂²) = (n/(n + 2))·θ², Var(Θ̂) = nθ²/((n + 1)²(n + 2)).
The maximum likelihood estimate is biased, but asymptotically unbiased. Notice the unusual asymptotics
Var(Θ̂) ≈ θ²/n², n → ∞,
indicating that the conditions on the parametric model implying Θ̂ ≈ N(θ, σθ/√n) are violated.
Comparing the two mean square errors:
MSE(Θ̂) = E(Θ̂ − θ)² = ((n/(n + 1))·θ − θ)² + nθ²/((n + 1)²(n + 2)) = 2θ²/((n + 1)(n + 2)),
MSE(Θ̃) = θ²/(3n),
we conclude that for sufficiently large n, the maximum likelihood estimate θ̂ is more efficient than the method of
moments estimate θ̃.
(d) The corrected maximum likelihood estimate
θ̂c = ((n + 1)/n)·x(n)
becomes unbiased, E(Θ̂c) = θ, with Var(Θ̂c) = θ²/(n(n + 2)).
Solution 9
The data at hand is summarised by four observed counts (c1, c2, c3, c4),
obtained from a random sample of size n = 3839 drawn from the discrete distribution
p1 = P(X = 1) = (2 + θ)/4, p2 = P(X = 2) = (1 − θ)/4, p3 = P(X = 3) = (1 − θ)/4, p4 = P(X = 4) = θ/4.
Using the multinomial distribution (C1, C2, C3, C4) ∼ Mn(n, p1, p2, p3, p4) for the given realisation
(c1, c2, c3, c4) with c1 + c2 + c3 + c4 = n,
with n = 3839, we compute the likelihood function as
L(θ) = (n choose c1, c2, c3, c4)·p1^{c1} p2^{c2} p3^{c3} p4^{c4} ∝ (2 + θ)^{c1}(1 − θ)^{c2+c3}θ^{c4}·4⁻ⁿ ∝ (2 + θ)^{c1}θ^{c4}(1 − θ)^{n−c1−c4},
where ∝ means that we drop the factors depending only on (n, c1 , c2 , c3 , c4 ) and not involving θ. The last expression
reveals that we have a case of two sufficient statistics (c1 , c4 ). Putting
(d/dθ) ln L(θ) = c1/(2 + θ) + c4/θ − (n − c1 − c4)/(1 − θ)
equal to zero, we arrive at the equation
c1/(2 + θ) + c4/θ = (n − c1 − c4)/(1 − θ),
or equivalently
nθ² + θu − 2c4 = 0,
where
u = 2c2 + 2c3 + c4 − c1 = 2n − c4 − 3c1 .
We find the maximum likelihood estimate to be
θ̂ = (−u + √(u² + 8nc4))/(2n) = 0.0357.
Solution 10
By the law of total probability, the probability of the "yes" answer to the randomly generated question equals
p = P(a "yes" answer) = 1/6 + (4/6)·q = (1 + 4q)/6,
where 1/6 is the probability of the number 1 on the top face of the die, implying the answer "yes", 4/6 is the probability
of the numbers 2, 3, 4, or 5, and q is the probability of the honest answer "yes". For a random sample of size n, put
X = the number of "yes" answers.
Under the independence assumptions, we have X ∼ Bin(n, p). Given X = x, we estimate the unknown parameter p
by the sample proportion p̂ = x/n. The underlying random variable P̂ = X/n has the mean value p and the variance
Var(P̂) = Var(X)/n² = p(1 − p)/n.
After computing the sample proportion p̂, we may apply the method of moments by turning to the linear equation
p̂ = (1 + 4q̃)/6.
Its solution gives us the following estimate of the population proportion q:
q̃ = (6p̂ − 1)/4.
This estimate is unbiased because p̂ is an unbiased estimate of p:
E(Q̃) = (6p − 1)/4 = q.
The variance of the sampling distribution equals
Var(Q̃) = (9/4)·Var(P̂) = (9/4)·p(1 − p)/n = (1 + 4q)(5 − 4q)/(16n).
To illustrate, take n = 100 and x = 20, so that the sample proportion of "yes" answers is p̂ = 0.2. In this case,
q̃ = (6p̂ − 1)/4 = 0.05,
implying that the proportion of interest q is around 5%. The corresponding estimated standard error is 6%:
sq̃ = √((1 + 4q̃)(5 − 4q̃)/(16n)) = 0.06,
indicating that it is desirable to increase the sample size if we want a more precise estimate of q.
Solutions to Section 4.6
Solution 1
The z-score
Z = (X − 100p)/(10√(p(1 − p)))
has a distribution that is approximated by N(0, 1).
(b) The power of the test is a function of the population proportion p (here for simplicity computed without
continuity correction)
Pw(p) = P(|X − 50| > 10) = P(X < 40) + P(X > 60)
= P(Z < (40 − 100p)/(10√(p(1 − p)))) + P(Z > (60 − 100p)/(10√(p(1 − p))))
≈ Φ((4 − 10p)/√(p(1 − p))) + Φ((10p − 6)/√(p(1 − p))).
Solution 2
(a) The two composite nested hypotheses have the form
H0 : µ ≤ µ0 , H : −∞ < µ < ∞.
Observe that in the last step, the term exp{−n\overline{x²}/2} is treated as a constant of proportionality, as it does not involve
the parameter of interest µ and will not influence the maximum likelihood estimates.
(c) The maximum likelihood estimate under H is µ̂ = x̄. The maximum likelihood estimate under H0 is µ̂0 = x̄ if
x̄ < µ0 , and µ̂0 = µ0 if x̄ ≥ µ0 . Therefore, the likelihood-ratio equals
w = L(µ̂0)/L(µ̂) = exp{−(n/2)(µ̂0 − x̄)²}/exp{−(n/2)(µ̂ − x̄)²} = exp{−(n/2)(µ̂0 − x̄)²},
which equals 1 if x̄ < µ0, and equals e^{−(n/2)(x̄ − µ0)²} if x̄ ≥ µ0.
(d) The likelihood-ratio test rejects H0 for small values of w or equivalently, for large positive values of (x̄ − µ0 ).
In particular, the 5% rejection region with n = 25 takes the form
R = {x̄ − µ0 ≥ 1.645/5} = {x̄ ≥ µ0 + 0.33}.
Here we used X̄ ∼ N(µ0 , 51 ) as the null distribution, since the other eligible choices of µ ∈ H0 result in a smaller area
under the curve to the right of the critical value.
Solution 3
In this case the likelihood function takes the form
L(µ) = Π_{i=1}^n (µ^{xi} e^{−µ}/xi!) = e^{−µn}·µ^y·Π_{i=1}^n (1/xi!),
where
y = x1 + . . . + xn
is a sufficient statistic.
Case 1: two simple hypotheses
H0 : µ = µ0 , H1 : µ = µ1 .
Reject H0 for small values of the likelihood-ratio
L(µ0)/L(µ1) = e^{−n(µ0 − µ1)}·(µ0/µ1)^y.
If µ1 > µ0 , then we reject H0 for large values of y. If µ1 < µ0 , then we reject H0 for small values of y. Test statistic
Y has null distribution Pois(nµ0 ).
Case 2: two-sided alternative hypothesis
H0 : µ = µ0 , H1 : µ ̸= µ0 .
Reject H0 for the larger values of |y − nµ0|. The test statistic Y has the null distribution Pois(nµ0).
Solution 4
We have a random sample of size n = 25 drawn from the population distribution N(µ, 10) and would like to test two
simple hypotheses
H0 : µ = 0, H1 : µ = 1.5
An appropriate test statistic and its exact sampling distribution are
X̄ ∼ N(µ, 2), with X̄ ∼ N(0, 2) under H0 and X̄ ∼ N(1.5, 2) under H1.
(a) The rejection region at α = 0.1 is {x̄ > x}, where x is the solution of the equation
P(X̄ > x | H0) = 1 − Φ(x/2) = 0.1.
From the normal distribution table we find x/2 = 1.28, so that x = 2.56 and the rejection region is
R = {x̄ > 2.56}.
The corresponding confidence interval method is based on the one-sided 90% confidence interval for the mean,
Iµ = (x̄ − 2.56, ∞).
We reject H0 if the interval does not cover µ0 = 0, that is when x̄ − 2.56 > 0.
Solution 5
We have a pair of alternative distribution densities (which are two beta-distribution densities)
f(x|H0)/f(x|H1) = 2/(3x), 0 ≤ x ≤ 1,
that is,
1 − x²crit = α.
We conclude that
R = {x : x > √(1 − α)}.
(c) The power of the test is given by
P(X > √(1 − α) | H1) = 1 − (1 − α)^{3/2}.
Solution 6
Using the confidence interval method of hypotheses testing we reject H0 in favour of the two-sided alternative, since
the value µ = −3 is not covered by the two-sided confidence interval (−2, 3).
Solution 7
Under the normality assumption the random variable 14S²/σ² has the χ²14-distribution, implying
P(14S²/σ² ≤ 6.571) = 0.05.
The rejection region is therefore
R = {s² ≤ 6.571/14} = {s ≤ 0.685}.
Solution 8
The following analysis is the basis of the sign test.
where
y = |x − n/2|.
It remains to observe that the function a(y) is monotone over y ∈ [0, n/2], since
a′(y) = ln((n/2 + y)/(n/2 − y)) > 0,
and therefore the test rejects the null hypothesis for the large values of y.
(c) The significance level for the rejection region |x − n/2| > k is computed by
α = P(|X − n/2| > k | H0) = 2·Σ_{i < n/2 − k} (n choose i)·2⁻ⁿ.
(e) Using the normal approximation for n = 100 and k = 10, we find
α = P(|X − n/2| > k | H0) ≈ 2(1 − Φ(k/√(n/4))) = 2(1 − Φ(2)) = 0.046.
Solution 9
(a) The two-sided p-value = 0.134.
Solution 10
We are supposed to test
H0 : death cannot be postponed,
H1 : death can be postponed until after an important date.
(a) Jewish data: n = 1919 death dates
x = 922 deaths during the week before Passover,
n − x = 997 deaths during the week after Passover.
Under the binomial model X ∼ Bin(n, p), we would like to test
H0 : p = 0.5 against H1 : p < 0.5.
We apply the large-sample test for proportion. Since the observed test statistic value is
z0 = (x − np0)/√(np0(1 − p0)) = (922 − 1919·0.5)/√(1919·0.5·0.5) = −1.712,
the corresponding one-sided p-value of the test equals
Φ(−1.712) = 1 − Φ(1.712) = 1 − 0.9564 = 0.044.
Conclusion: we reject H0 in favour of the one-sided H1 at the significance level 5%.
(b) To control for the seasonal effect the Chinese and Japanese data were studied
n = 852, x = 418, n − x = 434, z = −0.548.
The one-sided p-value for this data is 29%, showing no significant effect.
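A minimal Python check of the Passover computation (scipy assumed):

from math import sqrt
from scipy.stats import norm

n, x, p0 = 1919, 922, 0.5
z0 = (x - n * p0) / sqrt(n * p0 * (1 - p0))
print(z0, norm.cdf(z0))   # about -1.712 and one-sided p-value 0.044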
Solution 11
We apply the multinomial model
(C1 , C2 , C3 ) ∼ Mn(190, p1 , p2 , p3 )
for testing the composite null hypothesis of the Hardy-Weinberg equilibrium:
H0 : p1 = (1 − θ)2 , p2 = 2θ(1 − θ), p3 = θ 2 .
It yields the following likelihood function and the maximum likelihood estimate:
L(θ) = (190 choose 10, 68, 112)·2⁶⁸·θ²⁹²(1 − θ)⁸⁸, θ̂ = 292/380 = 0.768.
The Pearson chi-squared test based on the table
x 1 2 3 Total
observed counts cx 10 68 112 190
expected counts Ex 10.23 67.71 112.07 190
results in the chi-squared test statistic x² = 0.0065. With df = 1, we find the p-value to be 2(1 − Φ(√0.0065)) = 0.94.
Conclusion: the Hardy-Weinberg equilibrium model fits well to the haptoglobin data.
Solution 12
month cj number of days Ej cj − E j
Jan 1867 31 1994 −127
Feb 1789 28 1801 −12
Mar 1944 31 1994 −50
Apr 2094 30 1930 164
May 2097 31 1994 103
Jun 1981 30 1930 51
Jul 1887 31 1994 -107
Aug 2024 31 1994 30
Sep 1928 30 1930 -2
Oct 2032 31 1994 38
Nov 1978 30 1930 48
Dec 1859 31 1994 -135
Here we deal with the simple null hypothesis
H0 : p1 = p3 = p5 = p7 = p8 = p10 = p12 = 31/365, p2 = 28/365, p4 = p6 = p9 = p11 = 30/365.
The total number of suicides is n = 23480, so that the expected counts presented in the table are computed as
Ej = n·pj⁽⁰⁾, j = 1, . . . , 12.
Using the table values we find the chi-squared test statistic to be
x² = Σj (cj − Ej)²/Ej = 47.4.
Since df = 12 − 1 = 11, and χ211 (0.005) = 26.8, we reject H0 of no seasonal variation. Merry Christmas!
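A minimal Python check of the chi-squared statistic (scipy assumed only for the p-value):

import numpy as np
from scipy.stats import chi2

counts = np.array([1867, 1789, 1944, 2094, 2097, 1981,
                   1887, 2024, 1928, 2032, 1978, 1859])
days = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
expected = counts.sum() * days / 365        # n * p_j under H0
x2 = np.sum((counts - expected) ** 2 / expected)
print(x2, chi2.sf(x2, df=11))               # about 47.4 and a tiny p-value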
Solution 13
We model the number of heads by
Y ∼ Bin(n, p), n = 17950.
(a) For H0 : p = 0.5 the observed z-score is
z0 = (y − np0)/√(np0(1 − p0)) = 3.46.
number of heads 0 1 2 3 4 5 Total
observed counts 100 524 1080 1126 655 105 3590
expected counts 98.4 518.3 1091.5 1149.3 605.1 127.4 3590
we get x2 = 8.74. Here df = 6 − 1 − 1 = 4, yielding p-value = 0.07. We do not reject this H0 at the 5% significance
level.
Solutions to Section 5.5
Solution 2
The data in the table summarises a random sample
(d) Using the uniform prior U(0, 1), we find the posterior to be proportional to the likelihood
Solution 3
Let us use the binomial model X ∼ Bin(n, p) for the number x of successes in n independent trials. Given x = n,
we would like to estimate p using the Bayesian approach. Applying the uniform prior Beta(1, 1), we find that the
posterior distribution for the parameter p is Beta(n + 1, 1). Since the posterior mean is (n + 1)/(n + 2), we get
p̂pme = (n + 1)/(n + 2).
Solution 4
Data: one observation of X = x. Likelihood ratio test: reject for small values of the likelihood-ratio P(x|H0)/P(x|H1).
(b) The null distribution of the likelihood-ratio is discrete. Under H0 the test statistic w = P(x|H0)/P(x|H1) takes values
0.5, 0.75, 2, 3 with probabilities 0.2, 0.3, 0.2, 0.3. Therefore
P(W ≤ 0.5|H0 ) = 0.2, P(W ≤ 0.75|H0 ) = 0.5, P(W ≤ 2|H0 ) = 0.7, P(W ≤ 3|H0 ) = 1.
Since H0 is rejected for the small values of w, at α = 0.2 we reject H0 only if w = 0.5, that is when x = x4 . Similarly,
at α = 0.5 we reject H0 for w ≤ 0.75, that is when x is either x4 or x2 .
We conclude that the outcomes x1 and x3 favour H0, since for these outcomes the likelihood-ratio is larger than 1.
The two rejection regions are
R = {x = x4} = {w = 0.5},
R = {x = x4} ∪ {x = x2} = {w = 0.5} ∪ {w = 0.75}.
To be able to reject for the value w = 0.75, we have to put the restriction
π0 ≤ 1/(1 + 0.75) = 4/7.
Solution 5
For a single observation X ∼ N(µ, σ), where σ is known, we would like to test H0 : µ = 0 against H1 : µ = 1 using the
prior probabilities
P(H0) = 2/3, P(H1) = 1/3.
(a) Since the likelihood-ratio equals
f(x|0)/f(x|1) = e^{−x²/(2σ²)}/e^{−(x−1)²/(2σ²)} = e^{(1−2x)/(2σ²)},
and the prior odds is P(H0)/P(H1) = 2, the posterior odds takes the form
2e^{(1−2x)/(2σ²)}.
The posterior odds is greater than 1, so that H0 is preferred, exactly when x < 1/2 + σ²·ln 2. By the law of total probability,
P(X < 1/2 + σ²·ln 2) = (2/3)·P(X < 1/2 + σ²·ln 2 | H0) + (1/3)·P(X < 1/2 + σ²·ln 2 | H1) = (2/3)·Φ(σ·ln 2 + 1/(2σ)) + (1/3)·Φ(σ·ln 2 − 1/(2σ)).
Solution 6
We are given a pair of beta-densities
Therefore, the posterior probability of H0 is greater than that of H1 for x satisfying 2 > 3x, that is when x < 2/3.
Solution 7
The parameters (a, b) of the prior beta distribution are found from the equations
a/(a + b) = 1/3, ((1/3)(1 − 1/3))/(a + b + 1) = 1/32.
The prior pseudo-counts are well approximated by a = 2 and b = 4. Thus the posterior beta distribution has
parameters (10, 16) giving the posterior mean estimate p̂pme = 0.38.
Solution 8
(a) We are given two independent samples:
1. sample one has the sample size n = 56 and the sample proportion p̂1 = 8/56 = 0.143,
2. sample two has the sample size m = 74 and the sample proportion p̂2 = 12/74 = 0.162.
An asymptotic 95% confidence interval for the population difference is given by
Ip1−p2 ≈ p̂1 − p̂2 ± 1.96·√(p̂1(1 − p̂1)/(n − 1) + p̂2(1 − p̂2)/(m − 1)) = −0.019 ± 0.125 = [−0.144, 0.106].
(b) To find a credibility interval for the parameter p, we can use the uniform prior Beta(1, 1). Adding the pseudo-counts
(1, 1) to the total counts (8 + 12, 48 + 62) we get Beta(21, 111) as the posterior distribution. Using the posterior
mean µ = 21/132 = 0.16 and standard deviation σ = √(0.16(1 − 0.16)/132) = 0.03, we arrive at the normal approximation of
the posterior distribution with mean 0.16 and standard deviation 0.03. This yields an approximate 95% credibility
interval
Jp ≈ 0.16 ± 1.96·0.03 = [0.10, 0.22].
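For comparison, the exact posterior quantiles can be computed directly; a minimal Python sketch (scipy assumed):

from scipy.stats import beta

post = beta(21, 111)                      # posterior Beta(21, 111)
print(post.mean(), post.std())            # about 0.16 and 0.03
print(post.ppf(0.025), post.ppf(0.975))   # exact equal-tailed 95% interval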
Solution 9
(a) The exponential prior with parameter 1 is Gam(1, 1). Applying the suggested updating rule consecutively four
times:
(1, 1) → (3, 2) → (3, 3) → (5, 4) → (10, 5),
we find the posterior distribution to be Gam(10, 5). Therefore, θ̂PME = 10/5 = 2.
(b) The general updating rule for the random sample (x1, . . . , xn) taken from a Poisson distribution becomes
(α, λ) → (α + nx̄, λ + n), giving the posterior mean estimate θ̂PME = (α + nx̄)/(λ + n).
Comparing this to the maximum likelihood estimate θ̂MLE = x̄, we see that
θ̂PME − θ̂MLE = (α + nx̄)/(λ + n) − x̄ = (α − λx̄)/(λ + n) → 0,
as n → ∞. This demonstrates that the role of the prior is less important for large sample sizes.
Solution 10
The Beta(a, b) probability density function g(p) is proportional to pa−1 (1−p)b−1 . Taking the derivative of this product
and setting it equal to zero we arrive at the equation
Solution 11
(a) Take the 95% equal-tailed credibility interval. By definition it is formed by the 2.5% and 97.5% quantiles of
the posterior distribution. It follows that the interval must include the posterior median.
(b) All three options produce the same interval since the beta-posterior Beta(5, 5) is symmetric about its mean.
(c) Due to symmetry of the N(3, 0.2) distribution, the 95% HPDI is computed as 3 ± 1.96·0.2.
Solution 1
y² = σp̂² = p(1 − p)/n = x(1 − x)/n, x ∈ [0, 1].
Setting n = 16, we get the requested formula y = √(x(1 − x))/4.
Solution 2
Let 0 ≤ x ≤ y ≤ 1. We have
F̂(x) = (1{X1 ≤ x} + ... + 1{Xn ≤ x})/n, E(F̂(x)) = F(x) = x, Var(F̂(x)) = x(1 − x)/n,
F̂(y) = (1{X1 ≤ y} + ... + 1{Xn ≤ y})/n, E(F̂(y)) = F(y) = y, Var(F̂(y)) = y(1 − y)/n.
Using
F̂(x)·F̂(y) = (1/n²)·[Σ_{i=1}^n 1{Xi ≤ x}1{Xi ≤ y} + Σ_{i=1}^n Σ_{j≠i} 1{Xi ≤ x}1{Xj ≤ y}],
and that
1{Xi ≤ x}1{Xi ≤ y} = 1{Xi ≤ x},
we obtain
E(F̂(x)·F̂(y)) = (1/n²)·[Σ_{i=1}^n x + Σ_{i=1}^n Σ_{j≠i} xy] = (x + (n − 1)xy)/n,
implying
Cov(F̂(x), F̂(y)) = E(F̂(x)·F̂(y)) − E(F̂(x))·E(F̂(y)) = (x + (n − 1)xy)/n − xy = x(1 − y)/n.
Solution 3
Take a look at the ordered sample x(1), . . . , x(n).
(a) The figure below shows the empirical distribution function and the normal QQ-plot.
The distribution appears to be close to normal. The 10% sample quantile equals
(x(6) + x(7))/2 = (13.66 + 13.68)/2 = 13.67.
(b) The expected percentages under different dilution levels are computed as follows
1% dilution µ1 = 14.58 · 0.99 + 85 · 0.01 = 15.28 can not be detected,
3% dilution µ3 = 14.58 · 0.97 + 85 · 0.03 = 16.69 can be detected,
5% dilution µ5 = 14.58 · 0.95 + 85 · 0.05 = 18.10 can be detected.
We see that the value 15.28 can not be detected as an outlier, since it coincides with the 82% sample quantile. There
is only one sample value larger than 16.69, therefore 3% dilution would be easier to detect. Obviously, 5% dilution
resulting in 18.10 is very easy to detect.
Solution 4
Taking the negative derivative of the survival function
1 − F(x) = e^{−αx^β},
we get the probability density function f(x) = αβx^{β−1}e^{−αx^β}, so that the hazard rate equals
h(x) = f(x)/(1 − F(x)) = αβx^{β−1}.
Solution 5
Take the Weibull distribution with parameters α and β.
• If β = 1, then h(x) = α is constant and the distribution is memoryless of the current age x.
• If β > 1, say β = 2, then h(x) = 2αx increases with x, implying that the older individuals die more often than
the younger.
• If 0 < β < 1, say β = 0.5, then h(x) = 0.5αx−0.5 decreases with x, predicting that the longer you live the
healthier you become.
Solution 6
The left panel shows a QQ-plot with x for the seeded clouds and y for the control clouds. The QQ-plot is approximated
by the line x = 2.5y, claiming 2.5 times more rainfall from the seeded clouds.
On the right panel, the QQ-plot (constructed after the log-transformation of the data) is approximated by the line
ln x = 2 + 0.8 ln y
Solution 7
(a) The Laplace curve is symmetric. Its shape is formed by two exponentially declining curves: one for positive x
and the other for negative x.
(b) For λ = √2 the mean is 0, the standard deviation is 1, the skewness is 0, and the kurtosis is 6. Compared to
the normal curve with the same mean and standard deviation but smaller kurtosis (=3), the Laplace distribution has
heavier tails. Moreover, since the variances are equal, the two curves should cross 4 times as they cover the same area.
This implies that the Laplace curve must also have higher peakedness.
Solution 8
(a) The null distribution of Y is Bin(25, 1/2) as each observation is smaller than the true median (assuming that the
distribution is continuous) with probability 0.5.
(b) A non-parametric confidence interval for the median M is given by (x(k), x(n−k+1)), where k is such that
Solution 9
(a) When the population under investigation has a clear structure it is more effective to use stratified sampling for
estimating the overall population mean as it has a smaller standard error. In accordance with the optimal allocation
formula:
ni = n·wiσi/σ̄,
the allocation of observations should follow the next two key rules: put more observations in the larger strata, and
put more observations in the strata with higher variation.
(b) If the sample coefficient of skewness is close to zero and m4 ≈ 3, then we get an indication that the shape of the
population distribution curve is close to the normal distribution.
(c) The standard error for the sample mean is sx̄ = s/√200. Roughly: the range of heights 160–200 in centimetres
covers 95% of the population distribution. Treating this interval as the mean plus-minus two standard deviations, we
find s ≈ 10 cm and sx̄ is something like 0.7 cm. We conclude that a random sample of size 200 may give a decent
estimate of the population mean height.
Solution 10
(a) The sign test statistic y = number of positive xi has the null distribution
Y ∼ Bin(25, 1/2) ≈ N(25/2, 5/2) under H0.
At 5% significance level we would reject H0 in favour of the one-sided alternative for {y ≥ k}, where k is found from
0.05 = P(Y ≥ k | H0) = P(Y > k − 1 | H0) ≈ 1 − Φ((k − 0.5 − 12.5)/(5/2)) = 1 − Φ((k − 13)/2.5),
which gives the rejection region
R = {y ≥ 17}.
We know that the true population distribution is X ∼ N(0.3, 1), therefore the true probability of a positive observation
is
P(X > 0|X ∼ N(0.3, 1)) = 1 − Φ(−0.3) = Φ(0.3) = 0.62.
It follows that for computing the power of the sign test we should use the alternative distribution
Y ∼ Bin(25, 0.62).
The power of the sign test is
P(Y ≥ 17 | Y ∼ Bin(25, 0.62)) ≈ 1 − Φ((17 − 0.5 − 15.5)/2.43) = 1 − Φ(0.41) = 0.34.
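A minimal Python check of this power calculation, together with the exact binomial tail (scipy assumed):

from math import sqrt
from scipy.stats import binom, norm

n, p1, k = 25, 0.62, 17
approx = norm.sf((k - 0.5 - n * p1) / sqrt(n * p1 * (1 - p1)))
exact = binom.sf(k - 1, n, p1)     # exact P(Y >= 17)
print(approx, exact)               # both close to 0.34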
(b) Under the normal distribution model X ∼ N(µ, 1) with n = 25, we have X̄ ∼ N(µ, 1/5), and we reject H0 : µ = 0
for
5x̄ > 1.645, that is for x̄ > 0.33.
The power of this test,
P(X̄ > 0.33 | X̄ ∼ N(0.3, 1/5)) = 1 − Φ((0.33 − 0.3)/(1/5)) = 1 − Φ(0.15) = 0.44,
is higher than the power of the sign test found in (a).
Solution 1
(b) Using two sample variances s²x = 0.216, s²y = 1.180, we compute the pooled sample variance s²p = 0.767, which
gives us an unbiased estimate of σ 2 .
(c) The standard error of ȳ − x̄ = 1.069 is quite large sȳ−x̄ = 0.587 due to the small sample sizes.
(d) Based on the t7 -distribution, an exact 90% confidence interval for the mean difference is
(e) It is more appropriate to use the two-sided alternative since we do not have access to any prior information
about the relationship between µ1 and µ2 .
(f) From the observed test statistic value t0 = 1.821, we find the two-sided p-value to be larger than 10% using the
t-distribution table for df = 7.
(b∗ ) σ 2 = 1,
(c∗ ) sȳ−x̄ = 0.6708,
(d∗ ) Iµy −µx = 1.069 ± 1.104,
(f∗ ) using the observed z-score z0 = 1.594, we find the two-sided p-value = 0.11 with the help of the normal
distribution table.
Solution 2
If m = n, then by the definition of the pooled sample variance,
sp²·(1/n + 1/m) = ((Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^n (yi − ȳ)²)/(2n − 2))·(2/n) = (s1² + s2²)/n = sx²/n + sy²/m = sx̄² + sȳ².
Solution 3
We would like to test the null hypothesis of no drug effect
H0 : µ1 = µ2 , the drug is not effective for reducing high blood pressure.
The suggested measurement design: during the same n = 10 days take blood pressure measurements on 4 people, two
on the treatment
x1,1 , . . . , x1,n ,
x2,1 , . . . , x2,n ,
and two controls
y1,1 , . . . , y1,n ,
y2,1 , . . . , y2,n .
Dependencies across the subjects make both the two-sample t-test and the rank sum test inappropriate, since we can
treat neither the pooled treatment measurements nor the pooled control measurements as random samples with
independent observations.
Solution 4
The paradox arises from the way the data was collected. This is an example of an observational study, not a properly
planned randomised experiment, and the two samples are not random samples. A large proportion of patients in a bad
condition were allocated to hospital A, which resulted in a smaller overall survival rate for hospital A.
Solution 5
Two independent samples
x1, . . . , xn, y1, . . . , yn,
are taken from two population distributions with equal standard deviation σ = 10. An approximate 95% confidence
interval is
Iµ1−µ2 ≈ x̄ − ȳ ± 1.96·10·√(2/n) = x̄ − ȳ ± 27.72/√n.
For the confidence interval to have width 2 requires the equality
27.72/√n = 1,
implying n ≈ 768.
Solution 6
We are dealing with two independent samples:
Rank Type I Type II Rank
1 3.03 3.19 2
8 5.53 4.26 3
9 5.60 4.47 4
11 9.30 4.53 5
13 9.92 4.67 6
14 12.51 4.69 7
17 12.95 6.79 10
18 15.21 9.37 12
19 16.04 12.75 15
20 16.84 12.78 16
Rank sum 130 80
(a) Assume equal variances and apply the two-sample t-test. Using the summary statistics
x̄ = 10.693, ȳ = 6.750, sx² = 23.226, sy² = 12.978, sx̄−ȳ = √(sx̄² + sȳ²) = 1.903,
(b) The rank sum test statistics are r1 = 130, r2 = 80. Using the null distribution normal approximation R1 ≈ N(µ, σ) with
µ = n(2n + 1)/2 = 105, σ = √(n²(2n + 1)/12) = 13.23,
we find the two-sided p-value to be
greater than 5%. Again, we can not reject the null hypothesis of equality.
(c) The non-parametric test in (b) is more relevant, since the QQ-plot for the pooled deviations shows non-normality.
(d) To estimate the probability π that a type I bearing will outlast a type II bearing, we turn to the ordered
pooled sample
X-YYYYYY-XX-Y-X-Y-XX-YY-XXXX.
Picking a pair (xi, yj) at random from the table, we first estimate the probability that a type II bearing will outlast a
type I bearing:
P(X ≤ Y) = (number of pairs with xi ≤ yj)/(total number of pairs (xi, yj)) = (10 + 4 + 4 + 3 + 2 + 2)/100 = 0.25.
This implies a point estimate π̂ = 0.75.
Solution 7
Consider a random sample of the differences d1 , . . . , dn taken from a continuous population distribution which is
symmetric around the unknown median m. The signed rank test statistic
w = Σ_{i=1}^n ri·1{di > 0}
Solution 8
The critical values wα° are based on the normal approximation
W ≈ N(n(n + 1)/4, √(n(n + 1)(2n + 1)/24)).
Solution 9
(a) Since the variance of the difference D = X − Y equals
Var(X) + Var(Y) − 2Cov(X, Y) = 100 + 100 − 100 = 100,
we have
D ≈ N(0, 10) under H0.
It follows that the sample mean of the differences with n = 25 satisfies
D̄ = X̄ − Ȳ ≈ N(0, 10/√25) = N(0, 2) under H0.
Under the alternative µ1 − µ2 = δ, we have D̄ ≈ N(δ, 2), and the power function is computed as follows:
Pw(δ) ≈ P(D̄ > 3.29 | N(δ, 2)) = 1 − Φ((3.29 − δ)/2) = Φ((δ − 3.29)/2).
(b) With two independent samples
D̄ ≈ N(0, √(100/25 + 100/25)) = N(0, 2.83) under H0.
Conclusion. The two power functions are compared graphically on the next figure, where the x-axis represents the
effect size δ = µ1 − µ2 . Clearly, the power of the test for the paired samples is higher.
Solution 10
Summary statistics for n = 15 pairs (xi , yi ):
If the pairing had been erroneously ignored, then the two independent samples formula would give a 6 times larger
standard error,
sx̄−ȳ = √(sx̄² + sȳ²) = 7.81.
To test H0 : µ1 = µ2 against H1 : µ1 ̸= µ2 assume D ∼ N(µ1 − µ2 , σ) and apply the one-sample t-test based on
the test statistic
t0 = d̄/sd̄ = 0.368.
With df = 14, the two-sided p-value is much larger than 20%, so that we can not reject H0 .
Solution 11
Possible explanations are given below. These are observational studies, based on data which was not properly
randomised.
(a) Rich patients get rooms with a window and a higher quality of health care.
(b) A smoking husband may make his wife's life miserable, increasing her risk of cancer.
(c) A more stressed condition of a person may result in skipping breakfast as well as in accidents.
(e) Being part of a community can have a positive effect on mental health and emotional well-being.
Solution 12
(a) This is another example of Simpson's paradox. It is resolved by referring to the key factor: the difficulty of
entering different programmes. Men tend to apply for programmes that are easy to enter, while women more often
apply for programmes with low admission rates.
(b) A simple hypothetical experimental study could be based on two independent random samples. Focus on one
major program, say major F. Take n randomly chosen female candidates and n randomly chosen male candidates.
Ask all of them to apply for major F. Compare two sample proportions of the admitted applicants. (Of course this
experiment is impossible to perform in practice.)
Solution 13
(a) We use the binomial model for the number of females Y ∼ Bin(36, p). Given the observed count y = 13,
we have to test H0 : p = 0.5 against the two-sided alternative H1 : p ̸= 0.5. The approximate null distribution is
Y ∼ N(18, 3), therefore, an approximate two-sided p-value becomes
2(1 − Φ((18 − 13)/3)) = 2(1 − Φ(1.67)) = 2 × 0.048 = 9.6%.
With such a high p-value we can not reject the null hypothesis of equal sex ratio.
Solution 14
(a) This is an example of a paired sample, therefore the signed rank test is more appropriate for testing the null
hypothesis of no difference.
(b) We use the signed rank test. The observed test statistics are w = 39 as the sum of the positive ranks and
w = 6 as the sum of the negative ranks.
The corresponding two-sided p-value is larger than 5%. Therefore, we do not reject the null hypothesis of equality in favour of the two-sided alternative.
(c) The extract from the course textbook reminds us that the null hypothesis for the signed rank test, besides equality
of two population distributions, assumes a symmetric distribution for the differences. It also explains why such an
assumption is reasonable.
Solution 15
The data is collected using the matched pairs design for 50 independent trials with four possible outcomes (slow
correct, fast correct), (slow correct, fast wrong), (slow wrong, fast correct), (slow wrong, fast wrong). We are given
fast correct fast wrong total
slow correct 42
slow wrong 0 8
total 35 15 50
which allows us to fill in the table as follows.
fast correct fast wrong total
slow correct 35 7 42
slow wrong 0 8 8
total 35 15 50
We apply the McNemar test having the test statistic
x² = (7 − 0)²/(7 + 0) = 7.
With only 7 informative counts, the null distribution is roughly approximated by the χ21 −distribution. Since the
square root of 7 is 2.65, the standard normal distribution gives the (two-sided) p-value 0.8%. We conclude that the
observed difference is statistically significant.
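A minimal Python sketch of the same McNemar computation (scipy assumed only for the p-value):

from scipy.stats import chi2

b, c = 7, 0                    # discordant counts: (slow correct, fast wrong), (slow wrong, fast correct)
x2 = (b - c) ** 2 / (b + c)
print(x2, chi2.sf(x2, df=1))   # 7.0 and a p-value of about 0.008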
Solution 1
For I = 2, put
ȳ1· = x̄, ȳ2· = ȳ, ȳ·· = (x̄ + ȳ)/2.
In this two-sample setting, the F-test statistic becomes
f = 2n(n − 1)·((x̄ − (x̄ + ȳ)/2)² + (ȳ − (x̄ + ȳ)/2)²)/(Σ_{j=1}^n (xj − x̄)² + Σ_{j=1}^n (yj − ȳ)²) = 2n·((x̄ − ȳ)/(2sp))² = t²,
where t = (x̄ − ȳ)/(sp√(2/n)) is the two-sample t-test statistic.
Solution 2
(a) Using the table with the data and the available summary statistics, we compute three sums of squares:
ssA = 10((20.34 − 19.40)2 + (18.34 − 19.40)2 + (21.57 − 19.40)2 + (17.35 − 19.40)2 ) = 109.2,
ssE = 9(0.88 + 0.74 + 0.88 + 0.89) = 30.5,
ssT = 3.58 · 39 = 139.7 = 109.2 + 30.5.
Comparing the observed test statistic 42.9 with the critical value F3,36(0.001) = 6.7436, see Section 11.4, we
see that the p-value of the F-test is much smaller than 0.001, so that we can reject the null hypothesis of no difference.
(b) The normality assumption is supported by the four skewness and kurtosis values, with the former being close
to zero and the latter close to 3. On the other hand, the four sample variances are close to each other, making the
assumption of equal variances realistic.
(c) Since sp = √msE = 0.92 and the t-distribution table gives approximately
Solution 3
Assume that under the null hypothesis
H0 : µ1 = . . . = µI = µ
all the data
{yij , i = 1, . . . , I, j = 1, . . . , n}
come from a single normal distribution N(µ, σ). Then the corresponding parameter space Ω0 has dimension
dim Ω0 = 2. On the other hand, the dimension of the general parameter space Ω is
dim Ω = I + 1,
since the setting beyond H0 is described by the parameters µ1 , . . . , µI and σ. The likelihood-ratio
w = L0(µ̂, σ̂0)/L(µ̂1, . . . , µ̂I, σ̂),
is expressed in terms of two likelihood functions
L(µ1, . . . , µI, σ) = Π_{i=1}^I Π_{j=1}^n (1/(√(2π)σ))·exp{−(yij − µi)²/(2σ²)} ∝ σ^{−N}·exp{−Σi Σj (yij − µi)²/(2σ²)},
L0(µ, σ) = L(µ, . . . , µ, σ) ∝ σ^{−N}·exp{−Σi Σj (yij − µ)²/(2σ²)},
where N = In is the total number of observations. We find the maximum likelihood estimates to be
µ̂ = ȳ··, σ̂0² = ssT/N, µ̂i = ȳi·, σ̂² = ssE/N,
so that
w = (σ̂0⁻ᴺ·exp{−Σi Σj (yij − µ̂)²/(2σ̂0²)})/(σ̂⁻ᴺ·exp{−Σi Σj (yij − µ̂i)²/(2σ̂²)}) = (σ̂²/σ̂0²)^{N/2}·(exp{−ssT/(2ssT/N)}/exp{−ssE/(2ssE/N)}) = (σ̂²/σ̂0²)^{N/2},
since the two exponential factors both equal e^{−N/2} and cancel.
The likelihood ratio test rejects the null hypothesis for the smaller values of w, or equivalently for the larger values
of the ratio
σ̂0²/σ̂² = ssT/ssE = 1 + ssA/ssE = 1 + (I − 1)msA/(I(n − 1)msE) = 1 + ((I − 1)/(I(n − 1)))·f,
that is for the larger values of the F-test statistic f.
Recall that
−2 ln w ≈ χ²k under H0, where k = dim(Ω) − dim(Ω0) = I − 1.
To see that the exact null distribution of the F-test statistic agrees with the asymptotic null distribution of the
likelihood-ratio test, observe that for large n,
−2 ln w = In·ln(1 + ((I − 1)/(I(n − 1)))·f) ≈ In·((I − 1)/(I(n − 1)))·f ≈ (I − 1)f = ssA/msE.
Solution 4
Consider the one-way layout setting with I = 10, n = 7, and a dataset generated by independent random variables
(a) In this case, the 95% confidence interval for a single difference µi − µj takes the form
Iµi−µj = ȳi· − ȳj· ± t60(0.025)·sp√(2/n) = ȳi· − ȳj· ± 2.82·sp/√n.
Solution 5
For I = 4 control groups of n = 5 mice each, we would like to test H0 : no systematic differences between groups.
Applying the one-way F-test we get the following ANOVA table.
Source ss df ms f p
Columns 27230 3 9078 2.271 0.12
Error 63950 16 3997
Total 91190 19
The p-value of 12% is obtained using the F3,16-distribution. We do not reject H0 at the 10% significance level.
After inspecting the boxplots of the four samples, we see that the assumptions of normality and equal variances are
clearly violated. The largest difference is between the third and the fourth boxplots. A control question: explain why
the third boxplot has no upper whisker.
Turning to the non-parametric Kruskal-Wallis test, consider the pooled sample ranks and their group means:
Group I 2 6 9 11 14 r̄1. = 8.4
Group II 4 5 8 17 19 r̄2. = 10.6
Group III 1 3 7 12.5 12.5 r̄3. = 7.2
Group IV 10 15 16 18 20 r̄4. = 15.8
The observed Kruskal-Wallis test statistic takes the value

    w = (12/(20 · 21)) · 5 · [(8.4 − 10.5)² + (10.6 − 10.5)² + (7.2 − 10.5)² + (15.8 − 10.5)²] = 6.20.
This is close to the critical value x3 (0.1) = 6.25 from the table of Section 11.3, implying that the p-value of the
Kruskal-Wallis test is close to 10%. We do not reject the null hypothesis of no difference between the four control
groups.
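The statistic can be recomputed directly from the tabulated ranks; a minimal Python sketch:

    import numpy as np

    # Pooled-sample ranks of the four control groups, as tabulated above.
    ranks = [
        [2, 6, 9, 11, 14],        # Group I
        [4, 5, 8, 17, 19],        # Group II
        [1, 3, 7, 12.5, 12.5],    # Group III
        [10, 15, 16, 18, 20],     # Group IV
    ]
    N = 20
    rbar = (N + 1) / 2            # overall mean rank 10.5

    # Kruskal-Wallis statistic 12/(N(N+1)) * sum_i n_i (rbar_i - rbar)^2.
    w = 12 / (N * (N + 1)) * sum(len(g) * (np.mean(g) - rbar) ** 2
                                 for g in ranks)
    print(w)  # 6.20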
Solution 6
The two-way ANOVA table presents the results of two F-tests in the case of I = 3 treatments applied to J = 10
subjects, with n = 1 replication per cell.
Source SS df MS F P
Columns (blocks) 0.517 9 0.0574 0.4683 0.8772
Rows (treatments) 1.081 2 0.5404 4.406 0.0277
Error 2.208 18 0.1227
Total 3.806 29
We reject
H0 : no treatment effects
at 5% significance level. Notably, the differences among the subjects are not significant.
The non-parametric Friedman test is based on the ranking within the blocks:
Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10 r̄i.
Isof 1 2 3 2 1 2 1 3 1 3 1.9
Halo 2 1 1 3 2 1 3 1 2 2 1.8
Cycl 3 3 2 1 3 3 2 2 3 1 2.3
The observed value of the Friedman test statistic

    q = (12 · 10/(3 · 4)) · [(1.8 − 2)² + (1.9 − 2)² + (2.3 − 2)²] = 1.4
is smaller than the critical value x2 (0.1) = 4.61 for the chi-squared distribution with 2 degrees of freedom. Thus the
p-value of the Friedman test is larger than 10% and we do not reject the null hypothesis of no difference between the
three treatments.
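Since the Friedman statistic depends only on the within-block ranks, it can be checked with scipy directly from the table above (a sketch):

    from scipy.stats import friedmanchisquare

    # Within-block ranks of the three treatments over the ten dogs.
    isof = [1, 2, 3, 2, 1, 2, 1, 3, 1, 3]
    halo = [2, 1, 1, 3, 2, 1, 3, 1, 2, 2]
    cycl = [3, 3, 2, 1, 3, 3, 2, 2, 3, 1]

    q, p = friedmanchisquare(isof, halo, cycl)
    print(q, p)  # 1.4 and a p-value well above 10%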
Solution 7
The data consists of 48 survival times obtained for I = 3 poisons, J = 4 treatments, and n = 4 observations per cell.
The cell means for the survival times are listed in the next table.
A B C D
I 4.125 8.800 5.675 6.100
II 3.200 8.150 3.750 6.625
III 2.100 3.350 2.350 3.250
If you draw by hand the three profiles connecting the cell means for the three poisons, you will find that profiles I and II
cross each other, and that profile III is flatter than the other two. The non-parallel profiles indicate
that there might be an interaction between the two main factors, poisons and treatments. We have three null
hypotheses of interest
HA : no poison effect,
HB : no treatment effect,
HAB : no interaction.
(a) Referring to the two-way ANOVA table
Source SS df MS F P
Columns (treatments) 91.9 3 30.63 14.01 0.0000
Rows (poisons) 103 2 51.52 23.57 0.0000
Interaction 24.75 6 4.124 1.887 0.1100
Error 78.69 36 2.186
Total 298.4 47
we reject HA and HB at the 1% significance level. We cannot reject HAB even at the 10% significance level. Conclusions:
3 poisons act differently,
4 treatments act differently,
some indication of interaction.
The analysis of the residuals shows the following results:
the normal QQ-plot reveals non-normality (see the left panel of the figure below),
the coefficient of skewness = 0.59 is positive,
the coefficient of kurtosis = 4.1 is larger than 3.
(b) After transforming the data by death rate = 1/survival time, we obtain new cell means for the death rates
given by the next table.
A B C D
I 0.249 0.116 0.186 0.169
II 0.327 0.139 0.271 0.171
III 0.480 0.303 0.427 0.309
If you now draw three profiles, you will see that they look more parallel. Using the two-way ANOVA results for the
transformed data
Source SS df MS F P
Columns (treatments) 0.204 3 0.068 28.41 0.0000
Rows (poisons) 0.349 2 0.174 72.84 0.0000
Interaction 0.01157 6 0.0026 1.091 0.3864
Error 0.086 36 0.0024
Total 0.6544 47
we reject HA and HB at the 1% significance level, and do not reject HAB. Conclusions:
3 poisons act differently,
4 treatments act differently,
no interaction.
The normal QQ-plot of the residuals for the transformed data supports the normality assumption behind the
three performed F-tests, see the right panel of the figure above.
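For completeness, a hedged sketch of how such a two-way analysis could be run in Python with statsmodels; the survival data are not reproduced here, so a synthetic stand-in is used and the column names are illustrative:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Synthetic stand-in for the 3 x 4 x 4 layout (the real data differ).
    rng = np.random.default_rng(0)
    rows = [{"poison": p, "treatment": t, "time": rng.gamma(4.0, 1.0)}
            for p in ("I", "II", "III") for t in "ABCD" for _ in range(4)]
    df = pd.DataFrame(rows)
    df["rate"] = 1.0 / df["time"]   # the reciprocal transformation

    # Two-way ANOVA with interaction on the transformed response.
    model = smf.ols("rate ~ C(poison) * C(treatment)", data=df).fit()
    print(anova_lm(model))          # main effects, interaction, residual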
Solution 8
(a) The assumptions of the normal theory for the two-way ANOVA require that the 20 measurements (yij) of this
example are generated by the normal distributions N(µij, σ), where

    µij = µ + αj + βi,   i = 1, 2, 3, 4, 5,   j = 1, 2, 3, 4,

satisfying

    Σj αj = Σi βi = 0.
Under this model, the most interesting null hypothesis is that of no difference among the different types of tires,
H0 : α1 = α2 = α3 = α4 = 0.
Solution 9
(a) In terms of the two-way ANOVA model
Yijk = µ + αi + βj + δij + σZijk , Zijk ∼ N(0, 1)
(grand mean + main effects + interaction + noise), we estimate the main effects as
α̂1 = 11.9, α̂2 = −11.8, β̂1 = 1.99, β̂2 = −5.02, β̂3 = 3.04.
(Notice the effect of rounding errors.)
[Figure: two panels plotting yield (160–190) against speed (60, 70, 80).]
One possible interaction effect could have the form shown in the right panel. In this case formulation 2 interacts with
the speed factor in such a way that the yield becomes largest at speed 70.
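The main-effect estimates come from the row, column, and grand means of the cell means; a sketch with a hypothetical 2 × 3 table (the actual yields are not reproduced here):

    import numpy as np

    # Hypothetical cell means ybar_ij: rows = formulations 1, 2;
    # columns = speeds 60, 70, 80. Made-up numbers for illustration.
    cell = np.array([[189.0, 185.0, 182.0],
                     [163.0, 168.0, 159.0]])

    grand = cell.mean()                  # estimate of mu
    alpha = cell.mean(axis=1) - grand    # formulation main effects
    beta = cell.mean(axis=0) - grand     # speed main effects
    print(grand, alpha, beta)            # alpha and beta each sum to zero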
Solution 10
(a)
Source of variation ss df ms f
Varieties 328.24 2 164.12 103.55
Density 86.68 3 28.89 18.23
Interaction 8.03 6 1.34 0.84
Errors 38.04 24 1.59
(b) Using the critical values
F2,24 (0.001) = 9.3394, F3,24 (0.001) = 7.5545, F6,24 (0.100) = 2.0351,
we reject both null hypotheses on the main factors and do not reject the null hypothesis on interaction.
(c) s = √1.59 = 1.26.
Solution 11
(a) The stratified sample mean is
(b) We are in the one-way ANOVA setting with I = 3 and J = 13. The 95% Bonferroni simultaneous confidence
interval for the differences (µi − µj) has the form

    Bµi−µj = x̄i − x̄j ± t36(0.05/6) sp √(2/13),
Solution 12
(a) This is an example of a randomised block design with I = 3 treatments, J = 2 blocks, and n = 1 observation for each of
the 6 combinations of levels. The subjects (blocks) were matched by gender and mail to reduce the influence of external
factors.
(b) Consider the Friedman test for I = 3 treatments and J = 2 blocks. The test statistic

    q = 2 [(r̄1· − 2)² + (r̄2· − 2)² + (r̄3· − 2)²]

is obtained from the ranks (rij) given by the two subjects to the three treatments. There are 6² = 36 possible rank
combinations:

             1 1    1 1    1 2          3 1    3 1    3 3
    (rij) =  2 2 ,  2 3 ,  2 1 , ... ,  2 2 ,  2 3 ,  2 2
             3 3    3 2    3 3          1 3    1 2    1 1
Under the null hypothesis, these 36 combinations are equally likely. The corresponding vector of rank averages
(r̄1. , r̄2. , r̄3. ) may take 5 different values (up to permutations)
A1 = (1, 2, 3), A2 = (1, 2.5, 2.5), A3 = (1.5, 1.5, 3), A4 = (1.5, 2, 2.5), A5 = (2, 2, 2)
1, 2, 3 1, 3, 2 2, 1, 3 2, 3, 1 3, 1, 2 3, 2, 1
1, 2, 3 A1 A2 A3 A4 A4 A5
1, 3, 2 A2 A1 A4 A3 A5 A4
2, 1, 3 A3 A4 A1 A5 A2 A4
2, 3, 1 A4 A3 A5 A1 A4 A2
3, 1, 2 A4 A5 A2 A4 A1 A3
3, 2, 1 A5 A4 A4 A2 A3 A1
From this table, we find the possible outcomes of interest and their probabilities under random sampling, assuming
that the null hypothesis is true.
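One way to tabulate this null distribution is by direct enumeration of the 36 equally likely rank pairs (a minimal Python sketch):

    from itertools import permutations
    from collections import Counter

    # Enumerate the 6 * 6 rank combinations and the statistic
    # q = 2 * sum_i (rbar_i - 2)^2.
    dist = Counter()
    for r1 in permutations((1, 2, 3)):
        for r2 in permutations((1, 2, 3)):
            q = 2 * sum(((a + b) / 2 - 2) ** 2 for a, b in zip(r1, r2))
            dist[round(q, 2)] += 1

    for q in sorted(dist):
        print(q, dist[q], dist[q] / 36)
    # q = 0, 1, 3, 4 with probabilities 6/36, 12/36, 12/36, 6/36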
Solutions to Section 9.5
Solution 1
We will test the null hypothesis of homogeneity
using the chi-squared test. The table below gives the expected counts along with the observed counts:
From here the chi-squared test statistic is computed to be x² = 5.10. With df = 1, the p-value of the test, 0.024, is
found using the normal distribution table. We reject H0 and conclude that the diabetics have the genotype BB less
often than the normals.
Solution 2
Here we twice apply the chi-squared test of independence.
(a) We test H0 : no association of the disease and the ABO blood group, using the observed and expected counts:
O A AB B Total
Moderate 7 (10.0) 5 (9.8) 3 (2.0) 13 (6.2) 28
Minimal 27 (30.4) 32 (29.7) 8 (6.1) 18 (18.8) 85
Not present 55 (48.6) 50 (47.5) 7 (9.8) 24 (30.0) 136
Total 89 87 18 55 249
With x² = 15.37 and df = 6, the chi-squared distribution table shows that the p-value of the test is less than 2.5%. We
reject H0.
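The same test in Python (a sketch; scipy reproduces both the statistic and the expected counts):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Observed counts: disease severity (rows) by ABO blood group (columns).
    obs = np.array([[ 7,  5, 3, 13],
                    [27, 32, 8, 18],
                    [55, 50, 7, 24]])

    x2, p, df, expected = chi2_contingency(obs)
    print(x2, df, p)           # about 15.37, df = 6, p below 2.5%
    print(expected.round(1))   # matches the expected counts above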
(b) We test H0 : no association of the disease and the MN blood group, using the observed and expected counts:
MM MN NN Total
Moderate 21 (16.7) 6 (9.4) 1 (1.9) 28
Minimal 54 (51.3) 27 (28.9) 5 (5.8) 86
Not present 74 (81.1) 51 (45.7) 11 (9.2) 136
Total 149 84 17 250
With x² = 4.73 and df = 4, the chi-squared distribution table shows that the p-value of the test is larger than 10%. We
cannot reject H0.
Solution 3
(a) Applying the chi-squared test of homogeneity:
girl boy total
flying fighter 51 (45.16) 38 (43.84) 89
flying transport 14 (15.22) 16 (14.78) 30
not flying 38 (42.62) 46 (41.38) 84
total 103 100 203
we get x² = 2.75 and df = 2, giving a p-value larger than 10%. Conclusion: the data do not confirm a difference in
the proportion of boys fathered by the different groups of professionals.
(b) We apply the goodness of fit chi-squared test for the null hypothesis
girl boy total
flying fighter 51 (43.34) 38 (45.66) 89
flying transport 14 (14.61) 16 (15.39) 30
not flying 38 (40.91) 46 (43.09) 84
Total 103 100 203
we get x² = 3.09 and df = 3, giving a p-value larger than 10%. Conclusion: we cannot reject H0.
Why we use df = 3 here is explained next. The general model is described by three independent parameters
(π12, π22, π32), which gives us

    dim Ω = 3

degrees of freedom to start with. Since the null hypothesis model is simple, we get

    dim Ω0 = 0,   df = 3 − 0 = 3.
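A quick check with scipy, where ddof is chosen so that the reference distribution has the df = 3 argued above (a sketch):

    from scipy.stats import chisquare

    # Observed and expected counts, flattened row by row from the table.
    observed = [51, 38, 14, 16, 38, 46]
    expected = [43.34, 45.66, 14.61, 15.39, 40.91, 43.09]

    # Six cells; ddof = 2 gives df = 6 - 1 - 2 = 3.
    x2, p = chisquare(observed, expected, ddof=2)
    print(x2, p)  # about 3.09 and a p-value larger than 10%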
Solution 4
We use the chi-squared test for homogeneity involving five samples.
no nausea incidence of nausea total
placebo 70 (84) 95 (81) 165
chlorpromazine 100 (78) 52 (74) 152
dimenhydrinate 33 (43) 52 (42) 85
pentobarbital (100 mg) 32 (34) 35 (33) 67
pentobarbital (150 mg) 48 (43) 37 (42) 85
total 283 271 554
The observed test statistic x² = 35.8, according to the table of the χ²-distribution with df = 4, gives a p-value smaller than 0.005.
Comparing the observed and expected counts, we conclude that chlorpromazine is the most effective in ameliorating
postoperative nausea.
Solution 5
(a) We test H0 : no relation between blood group and disease in London, using the chi-squared test of homogeneity.
The test statistic x² = 42.40 with df = 1, according to the normal distribution table, gives a very small p-value (0.000).
We reject H0. The odds ratio

    ∆̂ = (4219 · 911)/(4578 · 579) = 1.45

says that blood group O increases the odds of peptic ulcer by a factor of 1.45 compared to group A.
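A sketch of the same computation in Python; the 2 × 2 arrangement below is an assumption, chosen only to be consistent with the odds-ratio formula above:

    import numpy as np
    from scipy.stats import chi2_contingency

    # One 2 x 2 arrangement consistent with OR = (4219*911)/(4578*579).
    table = np.array([[4219, 4578],
                      [ 579,  911]])

    odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
    x2, p, df, _ = chi2_contingency(table, correction=False)
    print(odds_ratio, x2, p)   # about 1.45, 42.4, and a tiny p-value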
(b) We test H0 : no relation between blood group and disease in Manchester, using the chi-squared test of homogeneity.
The test statistic x² = 5.52 with df = 1, according to the normal distribution table, gives the p-value 0.019. We
reject H0. The odds ratio for the Manchester data is ∆̂ = 1.22.
(c) We test H0 : London group A and Manchester group A have the same propensity to peptic ulcer, using the
chi-squared test of homogeneity.
The test statistic x² = 91.3, with df = 1, gives a very small p-value (0.000). We reject H0.
We next test H0 : London group O and Manchester group O have the same propensity to peptic ulcer, using the
chi-squared test of homogeneity.
The test statistic x² = 204.5, with df = 1, gives a very small p-value (0.000). We reject H0.
Solution 6
Denote the two main factors of the study by
D = endometrial carcinoma,
Solution 7
The exact Fisher test uses Hg(30, 17, 16/30) as the null distribution of the test statistic, whose observed value is x = 12.
We will test

    H0 : no difference between high-anxiety and low-anxiety groups

against the two-sided alternative.
The two-sided p-value is approximately 9%, and we do not reject the null hypothesis.
It is interesting to compare this result with what the chi-squared test yields: in this case the test statistic is
x² = 4.69. With df = 1, the two-sided p-value of the chi-squared test

    2(1 − Φ(√4.69)) = 2(1 − Φ(2.16)) = 0.03

is smaller than 5%, demonstrating that the chi-squared test is a poor substitute for the exact Fisher test when the sample size is small.
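A sketch of the exact tail computation with scipy's hypergeometric distribution; note that conventions for two-sided exact p-values differ, so the doubled tail below need not match the quoted value exactly:

    from scipy.stats import hypergeom

    # Null distribution Hg(30, 17, 16/30): population 30 with 16
    # "successes", sample of size 17; observed value x = 12.
    # scipy parametrisation: hypergeom(M=population, n=successes, N=draws).
    rv = hypergeom(M=30, n=16, N=17)

    p_one = rv.sf(11)        # P(X >= 12)
    print(p_one, 2 * p_one)  # doubling the tail is one common convention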
Solution 8
Denote
π1 = probability that red wins in boxing,
π2 = probability that red wins in freestyle wrestling,
π3 = probability that red wins in Greco-Roman wrestling,
π4 = probability that red wins in Tae Kwon Do.
Is there evidence that wearing red is more favourable in some of the sports than in others? Testing

    H0 : π1 = π2 = π3 = π4   vs   H1 : πi ≠ πj for some i ≠ j,

we find that the test statistic x² = 0.3 with df = 5 is not significant. We cannot reject H0. We also see that the red
outfit gives an advantage over the blue one, as π̂ = 0.55.
Solution 9
After asking the opinion of 50 female employees and 50 male employees, the company's statistician should carry out
a chi-squared test of homogeneity.
Solution 10
(a) Multiple testing problem.
(c) Nonparametric tests do not assume a particular form of the population distribution, such as the normal distribution.
Solution 11
The null hypothesis is that everybody votes independently. Let p be the population proportion for ’yes’. Then the
number of ’yes’ for three voters in a household has the binomial distribution model X ∼ Bin(3, p) with an unspecified
parameter p. Thus the null hypothesis can be expressed in the following form
We apply the Pearson chi-squared test with the expected counts based on the maximum likelihood estimate of p, the
sample proportion p̂ = 0.5417:

    E0 = n(1 − p̂)³ = 19,   E1 = 3np̂(1 − p̂)² = 68,   E2 = 3np̂²(1 − p̂) = 81,   E3 = np̂³ = 32.
The observed chi-squared test statistic is x² = 11.8, which has a p-value less than 0.5% according to the approximate
null distribution χ²df with df = 4 − 1 − 1 = 2. Conclusion: we reject the null hypothesis of independent voting.
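The whole pipeline (estimate p, form expected counts, compare with χ² on df = k − 1 − 1) fits in a few lines; the observed counts below are hypothetical, chosen only to reproduce p̂ = 0.5417 (the actual data give x² = 11.8):

    import numpy as np
    from math import comb
    from scipy.stats import chi2

    # Hypothetical household counts with 0, 1, 2, 3 'yes' votes.
    obs = np.array([25, 65, 70, 40])
    n = obs.sum()                                  # 200 households

    p_hat = (obs * np.arange(4)).sum() / (3 * n)   # MLE of p: 0.5417
    expected = n * np.array([comb(3, j) * p_hat**j * (1 - p_hat)**(3 - j)
                             for j in range(4)])   # about (19, 68, 81, 32)

    x2 = ((obs - expected) ** 2 / expected).sum()
    print(p_hat, x2, chi2.sf(x2, df=4 - 1 - 1))    # df = 2 as argued above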
Solution 12
The equivalence of
H0 : πi|j = πi·
and
H0 : πij = πi· π·j
follows from the formula for the conditional probabilities

    πi|j = πij/π·j.
Solution 13
(a) This is a single sample of size n = 441. Each of the n observations falls into one of 9 groups. The multinomial
distribution model
(C11 , C12 , C13 , C21 , C22 , C23 , C31 , C32 , C33 ) ∼ Mn(n, p11 , p12 , p13 , p21 , p22 , p23 , p31 , p32 , p33 )
gives the likelihood function
(c) The chi-squared test statistic x² = 22.5, computed from the observed and expected counts, should be compared
with the critical values of the χ²-distribution with df = 4. Since it is larger than 14.86, we conclude that the p-value of the test is
smaller than 0.5%. We reject the null hypothesis of independence and infer that there is a relationship between the
facility conditions at gasoline stations and aggressiveness in the pricing of gasoline.
Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 (17) 15 (22) 17 (17) 56
Standard condition 52 (62.3) 73 (80.9) 80 (61.8) 205
Modern condition 58 (54.7) 86 (71) 36 (54.3) 180
Total 134 174 133 441
It looks like the standard conditions are coupled with the least aggressive pricing strategy.
Solution 14
(a) The risk ratio compares the chances to suffer from myocardial infarction under the aspirin treatment versus
the chances to suffer from myocardial infarction under the placebo treatment:

    RR = P(MyoInf | Aspirin)/P(MyoInf | Placebo).
(b) The null hypothesis of RR = 1 is equivalent to the hypothesis of homogeneity.
MyoInf No MyoInf Total
Aspirin 104 (146.5) 10933 (10887.5) 11037
Placebo 189 (146.5) 10845 (10887.5) 11034
Total 293 21778 22071
The corresponding chi-squared test statistic is

    x² = 42.5²/146.5 + 42.5²/146.5 + 42.5²/10887.5 + 42.5²/10887.5 = 25.
Since df = 1, we can use the normal distribution table. The square root of 25 is 5, making the result highly significant.
Aspirin works!
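The same numbers in Python (a sketch):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: aspirin, placebo; columns: MyoInf, no MyoInf.
    obs = np.array([[104, 10933],
                    [189, 10845]])

    x2, p, df, expected = chi2_contingency(obs, correction=False)
    print(x2, p)               # about 25 and a p-value near 5e-7
    print(expected.round(1))   # matches the expected counts above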
and

    n²x̄ȳ = (Σi xi)(Σj yj) = Σi xiyi + ΣΣi≠j xiyj,

so that

    SXY = (1/n) Σi XiYi − (1/(n(n − 1))) ΣΣi≠j XiYj.
Solution 2
After ordering over the x values we get
x −1.75 −1.18 −0.88 −0.65 −0.30 0.34 0.50 0.68 1.38 1.40
y −1.59 −0.81 −0.98 −0.53 −0.72 0.27 0.64 0.35 1.34 1.28
From this sample we compute the following five sufficient statistics
x̄ = −0.046, ȳ = −0.075, sx = 1.076, sy = 0.996, r = 0.98.
(a) If x is the predictor and y is the response, then the regression line takes the form

    y − ȳ = r · (sy/sx)(x − x̄),

yielding

    y = −0.033 + 0.904 · x.

The estimated noise size is

    s = √(((n − 1)/(n − 2)) sy²(1 − r²)) = 0.22.

(b) If y is the predictor and x is the response, then the regression line takes the form

    x − x̄ = r · (sx/sy)(y − ȳ),

yielding

    x = 0.033 + 1.055 · y.

The estimated noise size is

    s = √(((n − 1)/(n − 2)) sx²(1 − r²)) = 0.24.
(c) The first fitted line y = −0.033 + 0.904 · x is different from the second line which, solving x = 0.033 + 1.055 · y
for y, reads y = −0.031 + 0.948 · x. The reason is that in (a) we minimise the vertical residuals, while in (b) we
minimise the horizontal residuals.
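Both fits are two calls to np.polyfit (a sketch using the data listed above):

    import numpy as np

    x = np.array([-1.75, -1.18, -0.88, -0.65, -0.30,
                   0.34, 0.50, 0.68, 1.38, 1.40])
    y = np.array([-1.59, -0.81, -0.98, -0.53, -0.72,
                   0.27, 0.64, 0.35, 1.34, 1.28])

    # (a) y on x minimises vertical residuals; (b) x on y horizontal ones.
    b1_yx, b0_yx = np.polyfit(x, y, 1)   # about 0.904 and -0.033
    b1_xy, b0_xy = np.polyfit(y, x, 1)   # about 1.055 and  0.033
    print(b0_yx, b1_yx, b0_xy, b1_xy)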
Solution 3
Using an extra explanatory variable f, which equals 1 for females and 0 for males, we combine the two simple linear
regression models into the multiple regression

    y = β0 + β1x + β2f + σZ,

so that

    β0 = β0″,   β2 = β0′ − β0″.

Here p = 3 and the design matrix is

    X =
      1  x1  f1
      .  ..  ..
      1  xn  fn
After β0 , β1 , β2 are estimated, we can compare the original intercepts β0′ and β0′′ using the relations
β0′′ = β0 , β0′ = β0 + β2 .
We may also test the key null hypothesis β2 = 0 saying that the grades of the female and male students satisfy the
same simple linear regression model.
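A sketch of this dummy-variable fit with made-up grades (the variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 40
    x = rng.uniform(0, 100, n)                 # made-up explanatory scores
    f = rng.integers(0, 2, n).astype(float)    # 1 = female, 0 = male
    y = 10 + 0.8 * x + 5.0 * f + rng.normal(0, 4, n)

    # Design matrix with columns 1, x, f, exactly as above.
    X = np.column_stack([np.ones(n), x, f])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)   # (b0, b1, b2); b0 + b2 estimates the female intercept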
Solution 4
Using P = X(X⊺ X)−1 X⊺ , we get
Solution 5
(a) Given x = 95, we predict the final score by
Regression to mediocrity in action: since the predictor x = 95 is larger than the average 75, the predicted response
value 85 is smaller than 95.
Solution 6
(a) First, we find the correlation coefficient ρ between X and Y . Since EX = 0, we have
(b) To generate pairs (x, y) with a given positive correlation coefficient ρ, one can proceed as follows. From

    ρ = 1/√(1 + β²),

find

    β = √(ρ⁻² − 1).

After generating a pair (x, z) of N(0, 1) numbers, the pair (x, y) with

    y = x + z√(ρ⁻² − 1)

has the desired correlation coefficient ρ.
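The recipe in code (a sketch; the target ρ is arbitrary):

    import numpy as np

    rho = 0.7
    beta = np.sqrt(rho ** -2 - 1)

    rng = np.random.default_rng(4)
    x = rng.standard_normal(100_000)
    z = rng.standard_normal(100_000)
    y = x + beta * z                 # Corr(x, y) = 1/sqrt(1 + beta^2) = rho

    print(np.corrcoef(x, y)[0, 1])   # close to 0.7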
Solution 7
Two regression models
produce two residual plots: the blue circles for the first model and the red stars for the second. Clearly, the second
model has a much better fit to the data.
The kinetic energy formula explains why the second model is better.
Solution 8
The coefficient of determination is the squared sample correlation coefficient:

    r² = (0.2)² = 0.04.
Solution 9
Using the formulas for b1 and sb1, we obtain

    t = b1/sb1 = (r · sy/sx) / √(s²/((n − 1)sx²)) = r · sy/√(sy²(1 − r²)/(n − 2)) = r√(n − 2)/√(1 − r²).
Solution 10
(a) The corresponding multiple regression model takes the form yi = β0 + β1xi + β2xi² + ei, where ei, i = 1, . . . , 5,
are independent realisations of a normal random variable with distribution N(0, σ). The corresponding design matrix
is

    X =
      1  0   0
      1  2   4
      1  4  16
      1  6  36
      1  8  64
(b) Using the formula ŷi = 111.8857 + 8.0643xi − 1.8393xi², we get
xi 0 2 4 6 8
yi 110 123 119 86 62
ŷi 111.8857 120.6571 114.7143 94.0571 58.6857
implying

    s² = Σi (yi − ŷi)²/(n − p) = 103.3/2 = 51.65.
(c) The coefficient of multiple determination equals

    R² = 1 − ssE/ssT = 1 − 103.3/2630 = 0.961.
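All three quantities follow from one least-squares fit (a sketch using the data above):

    import numpy as np

    x = np.array([0.0, 2, 4, 6, 8])
    y = np.array([110.0, 123, 119, 86, 62])

    X = np.column_stack([np.ones_like(x), x, x ** 2])  # design matrix above
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # about (111.89, 8.06, -1.84)

    y_hat = X @ beta
    ssE = ((y - y_hat) ** 2).sum()                     # about 103.3
    print(ssE / (len(x) - 3))                          # s^2, about 51.65
    print(1 - ssE / ((y - y.mean()) ** 2).sum())       # R^2, about 0.961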
Solution 11
(a) The underlying parabola makes the unrealistic prediction ŷ40 = 139 sec, compared to ŷ10 = 63 sec and ŷ20 = 34
sec. One should be careful about extending the range of the explanatory variable beyond that used in the data.
(b) Using t17(0.005) = 2.898, we get the exact confidence interval (under the assumption of normality and
homoscedasticity of the noise component)
(c) Since the confidence interval from part (b) covers zero, we do not reject the null hypothesis H0 : β2 = 0 at the 1%
significance level. The observed t-test statistic is 0.2730/0.1157 = 2.36, and according to the t17-distribution table the
two-sided p-value lies between 2% and 5%.
Solution 12
(b) The fitted regression line for the final score y as a function of the midterm score x is y = 37.5 + 0.5x. Given
x = 90, we get the point prediction y = 82.5. The estimate of σ² is

    s² = ((n − 1)/(n − 2)) sy²(1 − r²) = 84.4.

Thus the 95% prediction interval for Carl's final score is

    I = 82.5 ± t8(0.025) · s · √(1 + 1/10 + (1/9)(15/10)²) = 82.5 ± 24.6.
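A numerical check of the interval, assuming n = 10 (so that df = 8) with x̄ = 75 and sx = 10, consistent with the (15/10)² term above:

    import numpy as np
    from scipy.stats import t

    n, s2 = 10, 84.4                 # n assumed from df = 8; s2 from above
    y_hat = 37.5 + 0.5 * 90          # 82.5
    half = t.ppf(0.975, n - 2) * np.sqrt(s2) * np.sqrt(
        1 + 1 / n + (1 / (n - 1)) * ((90 - 75) / 10) ** 2)
    print(y_hat, half)               # 82.5 and about 24.6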
(c) The fitted regression line for the midterm score x as a function of the final score y is x = 37.5 + 0.5y. Given
y = 80 we get a point prediction x = 77.5.