Statistical Inference


To Elin ♢ Camilla ♣ Diana ♡ Nelli ♠

Statistical Inference

Serik Sagitov

Chalmers University of Technology and the University of Gothenburg

[Cover diagram: STATISTICAL INFERENCE fits a statistical model to a given dataset (parameter estimation, hypothesis testing); PROBABILITY THEORY predicts data patterns for a given parametric model; the arrows link the REAL WORLD (data) with STATISTICAL MODELS (model parameters).]
Gothenburg, Sweden, January 28, 2023


Contents

1 Parametric models 5
1.1 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 One-way and two-way layout models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Sample mean, sample variance, and t-distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Gamma, exponential, and chi-squared distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Bernoulli, binomial, and multinomial distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Poisson, geometric, and hypergeometric distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Random sampling 14
2.1 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Approximate confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Simple random sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Dichotomous data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Stratified random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Exact confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Parameter estimation 24
3.1 Method of moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Maximum likelihood estimation for the gamma distribution . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Hypothesis testing 33
4.1 Statistical significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Large-sample test for the proportion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 P-value of the test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Likelihood-ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Chi-squared test of goodness of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5 Bayesian inference 44
5.1 Conjugate priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Bayesian estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Credibility interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Bayesian hypotheses testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6 Summarising data 53
6.1 Empirical probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Quantiles and QQ-plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.3 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.4 Skewness and kurtosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.5 Inference on the population median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.6 Measures of dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

7 Comparing two samples 64
7.1 Two independent samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2 Two independent samples: comparing population proportions . . . . . . . . . . . . . . . . . . . . . . . 67
7.3 Paired samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.4 Paired samples: comparing population proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.5 List of statistical tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

8 Analysis of variance 77
8.1 One-way layout ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2 Simultaneous confidence interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.3 Kruskal-Wallis test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.4 Two-way layout ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.5 Additive model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

9 Categorical data analysis 91


9.1 Chi-squared test of homogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.2 Chi-squared test of independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.3 Matched-pairs designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.4 Odds ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

10 Multiple regression 99
10.1 Simple linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.2 Confidence intervals and hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
10.3 Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
10.4 Coefficients of multiple determination and model utility tests . . . . . . . . . . . . . . . . . . . . . . . 105
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

11 Statistical distribution tables 109


11.1 Normal distribution table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.2 Critical values of the t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11.3 Critical values of the chi-squared distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.4 Critical values of the F-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

12 Solutions to exercises 114


Solutions to Section 1.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Solutions to Section 2.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Solutions to Section 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Solutions to Section 4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Solutions to Section 5.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Solutions to Section 6.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Solutions to Section 7.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Solutions to Section 8.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Solutions to Section 9.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Solutions to Section 10.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

Introduction

Statistical analysis consists of three parts: collection of data, summarising data, and making inferences. The main
focus of this text is on the key tools of statistical inference: parameter estimation and hypothesis testing based upon
properly collected, relatively small data sets. Special attention, therefore, is paid to the basic principles of experimental
design: randomisation, blocking, and replication. The reader will get a deeper understanding of some traditional topics
in mathematical statistics such as methods based on likelihood, aspects of experimental design, non-parametric testing,
analysis of variance, introduction to Bayesian inference, chi-squared tests, and multiple regression.
It is expected that the reader is familiar with the basics of probability theory. The diagram on the front page
illustrates the relationship between probability theory and statistical inference. An appreciation of these two different
perspectives is provided by the following statements.
Probability theory. Previous studies showed that the drug was 80% effective. We can anticipate that
in a study of 100 patients, on average 80 will be cured and at least 65 will be cured with probability 0.9999.

Statistical inference. It was observed that 78 out of 100 patients were cured. We are 95% confi-
dent that in other similar studies, the drug will be effective on between 69.9% and 86.1% of patients.

About the author


Serik Sagitov is a full professor of Mathematical Statistics at Chalmers University of Technology. His research papers are devoted to the theory and applications of branching and coalescent processes. Please use serik@chalmers.se if you wish to send the author any comments or corrections concerning the text.

Acknowledgements
This text has grown from the lecture notes for the undergraduate course "Statistical Inference", given at the Chalmers
University of Technology and the University of Gothenburg. The author is grateful for the constructive feedback from the
students who have taken this course. Special thanks go to Professor Aila Särkkä for many insightful comments and
corrections. The course material was originally based on the second edition of the book "Mathematical statistics and
data analysis" by John Rice. A great number of examples and exercises included in this compendium are borrowed
from Rice’s textbook.

List of the course topics


Statistical inference and probability theory. Statistical models. Random variable, z-score. Correlation coefficient.
Population distribution. Population mean and standard deviation, population proportion.
Randomisation. Random sample, sampling with replacement.
Simple random sample, sampling without replacement.

Statistic, point estimate, sampling distribution.


Mean square error, systematic error and random (sampling) error.
Unbiased point estimate, consistent point estimate.
Sample mean, sample variance, sample standard deviation, sample proportion.
Finite population correction.
Standard error of the sample mean and sample proportion.
Approximate confidence intervals for the mean and the proportion.
Stratified random sampling. Optimal allocation of observations, proportional allocation.

Parametric models, population parameters.


Bernoulli, binomial, geometric, Poisson, and discrete uniform distributions.
Continuous uniform, exponential, gamma distributions.
Normal distribution, central limit theorem, continuity correction.
Method of moments for point estimation.
Maximum likelihood estimate (MLE). Likelihood function.
Normal approximation for the sampling distribution of MLE.

Sufficient statistics for population parameters.
Exact confidence intervals for the mean and variance. Chi-squared distribution, t-score and t-distribution.

Statistical hypotheses, simple and composite hypotheses, null hypothesis and alternative hypothesis.
Test statistic, rejection region. Two types of error.
Significance level, test power, effect size.
P-value of the test, one-sided and two-sided p-values.
Large-sample tests for the mean and for the proportion. One-sample t-test. The binomial test.
Nested hypotheses, generalised likelihood ratio test.
Chi-squared test of goodness of fit, its approximate nature. Multinomial distribution.

Bayes formulas for probabilities and densities. Prior and posterior distributions.
Conjugate priors. Normal-normal model.
Beta and Dirichlet distributions. Beta-binomial model and Dirichlet-multinomial model.
Bayesian estimation, zero-one loss function and squared error loss. Posterior risk.
Maximum a posteriori estimate (MAP) and posterior mean estimate (PME). Credibility interval.
Posterior odds. Bayesian hypotheses testing.

Empirical cumulative distribution function. Empirical variance.


Survival function and hazard function. Weibull distribution. Empirical survival function.
Histograms and kernel density estimates.
Population quantiles. Ordered sample and empirical quantiles.
QQ plots, normal QQ plot.
Coefficient of skewness and kurtosis. Light tails and heavy tails of probability distributions.
Leptokurtic and platykurtic distributions.

Population mean, mode, and median. Sample median, outliers.


Sign test and non-parametric confidence interval for the median.
Sample range, quartiles, interquartile range (IQR) and median of the absolute deviations (MAD). Boxplots.

Two independent samples versus paired samples.


Approximate confidence interval and large sample test for the mean difference.
Two-sample t-test, pooled sample variance.
Exact confidence interval for the mean difference. Transformation of variables.
Ranks versus exact measurements. Rank sum test. Signed rank test.
Approximate confidence interval for the difference between two proportions.
Large sample test for two proportions.
Fisher’s exact test.
Simpson’s paradox.

One-way ANOVA, sums of squares and mean squares.


Normal theory model, F-test, F-distribution, normal QQ plots for the residuals.
The problem of multiple comparison and multiple testing.
Simultaneous confidence intervals, Bonferroni’s method and Tukey’s method.
Two-way ANOVA, main effects and interaction. The noise and the residuals. Three F-tests.
Additive model. Randomised block design.
Non-parametric ANOVA: Kruskal-Wallis test, Friedman’s test.

Categorical data.
Dichotomous data. Cross classification.
Chi-squared tests of homogeneity and independence.
Prospective and retrospective studies. Matched-pairs design, McNemar’s test. Odds ratio.

Simple linear regression model. Normal equations. Least squares estimates.


Sample covariance, sample correlation coefficient.
Bias-corrected MLE of the noise variance. Coefficient of determination.
Confidence interval and hypotheses testing for the intercept and slope. Model utility test, t-value.
Prediction interval for a new observation.
Multiple regression. Design matrix.
Coefficient of multiple determination. Adjusted coefficient of multiple determination.
Collinearity problem.

Chapter 1

Parametric models

Notationally, we distinguish between random variables X, Y, Z and their realisations x, y, z by consistently using either
capital or small letters. For a random variable X, we usually denote its mean value and variance by

µ = E(X),   σ^2 = Var(X).

Recall that the mean (expected) value of X is computed as


E(X) = ∫_{−∞}^{∞} x f(x) dx    or    E(X) = Σ_{i=1}^{∞} xi pi,

depending on whether the distribution of X is continuous, with the probability density function f (x), or discrete, with
the probability mass function pi = P(X = xi ). The difference (X − µ) is called the deviation from the mean, and the
variance of X is defined by
Var(X) = E((X − µ)^2).
The square root of the variance, σ, is called the standard deviation of X.

By X ∼ F(µ, σ) we will mean that X has a distribution with mean µ and standard deviation σ. The
symbol F(µ, σ) will often denote the so-called population distribution.

The standardized version of X,

Z = (X − µ)/σ,

often called a z-score, is the result of a linear transformation of X such that E(Z) = 0 and Var(Z) = 1. Notice that
an arbitrary linear transformation Y = a + bX with b > 0, having mean µy = a + bµ and standard deviation σy = bσ, yields the same z-score:

(Y − µy)/σy = (X − µ)/σ = Z.

It follows that the z-score is a unit-less quantity (for example, one's standardized height is the same number irrespective of the length unit, be it centimetres or inches).
The covariance of two random variables (X1 , X2 ), with means (µ1 , µ2 ), is defined by

Cov(X1 , X2 ) = E((X1 − µ1 )(X2 − µ2 )).

If (Z1 , Z2 ) are standardised (X1 , X2 ), then the correlation coefficient for (X1 , X2 ) is given by ρ = Cov(Z1 , Z2 ).

The correlation coefficient is used as a measure of linear dependence between a pair of random variables.

In general, the correlation coefficient belongs to the interval −1 ≤ ρ ≤ 1. In particular, if X1 and X2 are independent
random variables, then ρ = 0, and if X2 = a + bX1 , then ρ = ±1 depending on the sign of the non-zero slope b.
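
As a numerical illustration of this definition, the following Python sketch simulates a pair of linearly related variables and checks that the mean of the product of their z-scores agrees with the usual sample correlation coefficient; the sample size, slope and noise level are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
n = 100_000                                          # large sample, so estimates are close to true values
x1 = rng.normal(0.0, 2.0, size=n)
x2 = 3.0 + 0.5 * x1 + rng.normal(0.0, 1.0, size=n)   # linear signal plus independent noise

# standardise each variable: z = (x - mean) / standard deviation
z1 = (x1 - x1.mean()) / x1.std()
z2 = (x2 - x2.mean()) / x2.std()

rho_via_z = np.mean(z1 * z2)               # rho = Cov(Z1, Z2) = E(Z1 Z2)
rho_builtin = np.corrcoef(x1, x2)[0, 1]    # numpy's sample correlation coefficient
print(rho_via_z, rho_builtin)              # both close to 0.5*2/sqrt(0.25*4 + 1) = 0.707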

1.1 Normal distribution


A key parametric statistical model is the normal distribution which will be denoted by N(µ, σ). A normally distributed
random variable
X ∼ N(µ, σ)
has mean E(X) = µ, variance Var(X) = σ^2, and the probability density function

f(x) = 1/(σ√(2π)) · e^{−(x−µ)^2/(2σ^2)},   −∞ < x < ∞.

The corresponding z-score Z = (X − µ)/σ has the so-called standard normal distribution N(0, 1). The cumulative distribution function of N(0, 1) is assigned a special notation

Φ(x) = P(Z ≤ x),   −∞ < x < ∞.

The values of Φ(x) for x ≥ 0 may be found using the table in Section 11.1. Importantly, if X ∼ N(µ, σ), then

P(X ≤ x) = P((X − µ)/σ ≤ (x − µ)/σ) = P(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ).

The next figure presents four probability density functions N(0, 0.5), N(0, 2), N(−2, 1), and N(2, 0.5). All normal
distribution curves have the same shape and differ only in their location parameter µ and their scale parameter σ.

The key role of the normal model in statistical inference is due to the central limit theorem. Let (X1, . . . , Xn) be independent and identically distributed random variables with mean µ and variance σ^2. The arithmetical average

X̄ = (X1 + . . . + Xn)/n

represents the sample mean. According to the central limit theorem, for a large sample size n,

X̄ ≈ N(µ, σ/√n),

meaning that the random variable X̄ is asymptotically normally distributed with mean µ and variance σ^2/n.
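
The central limit theorem is easy to observe in a small simulation; the sketch below draws repeated samples from a (deliberately skewed) exponential distribution and compares the mean and standard deviation of the sample means with µ and σ/√n. The distribution, sample size and number of repetitions are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 50, 20_000
mu, sigma = 1/lam, 1/lam                  # for Exp(lambda): mean = sd = 1/lambda

# draw `reps` samples of size n and record each sample mean
xbar = rng.exponential(scale=1/lam, size=(reps, n)).mean(axis=1)

print("simulated mean of the sample means:", xbar.mean(), " theory:", mu)
print("simulated sd of the sample means:  ", xbar.std(), " theory:", sigma / np.sqrt(n))
# a histogram of `xbar` is close to the N(mu, sigma/sqrt(n)) density even though
# the underlying exponential distribution is strongly skewed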

Signal and noise


Suppose we want to measure the unknown signal value µ and each measurement Y has a random measurement error.
Let
Y = µ + σZ, Z ∼ N(0, 1),
where σZ is the measurement error or, in other words, the noise component. Then, the response variable Y ∼ N(µ, σ).
In view of the central limit theorem, it is natural to model the random noise using the N(0, σ) model, since the noise is an accumulation of many external factors, none of which has a dominating effect on the response variable. In such a setting, we refer to σ as the size of the noise.

Mixtures of normal distribution


A motivating example for the mixture model is the height of people in a population consisting of women and men. Let
N(µ1, σ1) be the distribution of women's heights and N(µ2, σ2) be the distribution of men's heights. Then the mixed
population distribution describes the outcome of a two-step random experiment: first toss a coin to choose the index i
to be either 1 or 2, then generate a value using N(µi, σi). The resulting density function has the shape of a "camel
curve" as illustrated below (red line).

More generally, suppose that we are given k ≥ 2 normally distributed random variables

X1 ∼ N(µ1 , σ1 ), . . . , Xk ∼ N(µk , σk ).

Define the response variable Y as Xi with a random index i taking one of the values 1, . . . , k with probabilities
w1 , . . . , wk , so that
w1 + . . . + wk = 1.
This yields the following expressions for the mean µ = E(Y) and variance σ^2 = Var(Y):

µ = w1µ1 + . . . + wkµk,
σ^2 = Σ_{j=1}^{k} wj(µj − µ)^2 + Σ_{j=1}^{k} wjσj^2.

The above expression for σ^2 is due to the law of total variance (see Wikipedia), which recognises two sources of variation:

variation between the strata: Σ_{j=1}^{k} wj(µj − µ)^2,
variation within the strata: Σ_{j=1}^{k} wjσj^2.
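
The two-step experiment and the law of total variance can be checked numerically; the sketch below draws from a two-component normal mixture and compares the empirical mean and variance with the formulas above (the weights and component parameters are arbitrary illustrative values).

import numpy as np

rng = np.random.default_rng(2)
w  = np.array([0.5, 0.5])          # mixture weights w1, w2
mu = np.array([167.0, 181.0])      # component means (say, women's and men's heights in cm)
sd = np.array([6.0, 7.0])          # component standard deviations

# two-step experiment: pick a component index, then draw from that normal distribution
n = 200_000
idx = rng.choice(len(w), size=n, p=w)
y = rng.normal(mu[idx], sd[idx])

mix_mean = np.sum(w * mu)
mix_var = np.sum(w * (mu - mix_mean)**2) + np.sum(w * sd**2)   # between + within strata
print(y.mean(), mix_mean)   # both close to 174
print(y.var(), mix_var)     # both close to 49 + 42.5 = 91.5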

1.2 One-way and two-way layout models


Suppose the expectation µi of the response variable is a function of the level i for a single main factor A having I
different levels:
Yi ∼ N(µi , σ), i = 1, . . . , I.
It is helpful to represent the population means µi as the sum

µi = µ + αi , αi = µi − µ,

of the so-called grand mean

µ = (µ1 + . . . + µI)/I

and the effect αi of the main factor A at the level i. Observe that Σ_{i=1}^{I} αi = 0, meaning that the total effect of the factor A is zero.
In the case of two main (categorical) factors, with factor A having I different levels, and factor B having J different
levels, assume that the response variable depends on the combination of the levels of the two main factors in the
following way
Yij ∼ N(µij , σ), i = 1, . . . , I, j = 1, . . . , J,
where
µij = µ + αi + βj + δij ,
is the sum of the grand mean µ, the main effect αi of the factor A at the level i, the main effect βj of the factor B at
the level j, and the interaction δij of the two main factors. Here it is assumed that

Σ_{i=1}^{I} αi = Σ_{j=1}^{J} βj = 0,

and

Σ_{i=1}^{I} δij = 0 for j = 1, . . . , J,   Σ_{j=1}^{J} δij = 0 for i = 1, . . . , I.

In general, at a given combination of levels (i, j), the two factors may interact either negatively, δij < 0, or positively, δij > 0.

Additive model
In the special case, with δij = 0 for all (i, j), the model claims that there is no interaction and the main factors
contribute additively:
µij = µ + αi + βj .

Example: pay gap


Let the response variable be the salary of a person chosen from a large population. Factor A is person’s sex having
I = 2 levels: i = 1 for a female and i = 2 for a male. Factor B is person’s profession having say J = 20 levels, where
j = 1 is a farmer, j = 2 is a police officer, j = 3 is a doctor, and so on. In this example, the difference α1 − α2
represents the pay gap between women and men.

1.3 Sample mean, sample variance, and t-distributions
Suppose we are given a vector (X1 , . . . , Xn ) of independent random variables having the same distribution F(µ, σ).
A realisation of this vector (x1 , . . . , xn ) will be called (with a slight abuse of the established terminology) a random
sample drawn from the population distribution F(µ, σ) with population mean µ and population standard deviation
σ. For the given sample (x1 , . . . , xn ) define the sample mean, sample variance, and sample standard deviation by
x̄ = (x1 + . . . + xn)/n,   s^2 = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)^2,   s = √( ((x1 − x̄)^2 + . . . + (xn − x̄)^2)/(n − 1) ).

An alternative formula for the sample variance,

s^2 = (n/(n − 1)) · ( (x1^2 + . . . + xn^2)/n − x̄^2 ),

is often more convenient to use for pen and paper calculations. The sample mean x̄ and sample variance s^2 are realisations of the random variables

X̄ = (X1 + . . . + Xn)/n,   S^2 = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)^2,

which have the following means and variances:

E(X̄) = µ,   Var(X̄) = σ^2/n,   E(S^2) = σ^2,   Var(S^2) = (σ^4/n) · ( E(((X − µ)/σ)^4) − (n − 3)/(n − 1) ).

If the population distribution is normal


F(µ, σ) = N(µ, σ),
then by Cochran's theorem (see Wikipedia), the so-called t-score of the random sample

(X̄ − µ)/(S/√n) ∼ t_{n−1}

has the t-distribution with n − 1 degrees of freedom. The density function of the t-distribution with k ≥ 1 degrees of freedom,

f(x) = Γ((k+1)/2) / (√(kπ) Γ(k/2)) · (1 + x^2/k)^{−(k+1)/2},   −∞ < x < ∞,

involves the gamma function

Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx,

which is an extension of the factorial function to the non-integer positive numbers a, in that

Γ(k) = (k − 1)!,   k = 1, 2, . . .

The t-distribution curve with k ≥ 3 degrees of freedom looks similar to the N(0, 1)-curve, being symmetric around zero and having standard deviation √(k/(k − 2)), which is larger than 1. The figure below depicts three t-distribution curves together with the N(0, 1)-curve (in red). The degrees of freedom used in the figure are k = 1, k = 2 (in blue), and k = 6. The t-distribution with k = 1 degree of freedom has an undefined mean value and infinite variance. The t-distribution with k = 2 degrees of freedom has zero mean and infinite variance.

The connection between the t-distribution and the standard normal distribution can be described in the following way: if Z, Z1, . . . , Zk are independent random variables with the N(0, 1)-distribution, then

Z / √((Z1^2 + . . . + Zk^2)/k) ∼ t_k.
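
As an illustrative check of this fact, the sketch below simulates the t-score of many small normal samples and compares one of its tail probabilities with the t-distribution with n − 1 degrees of freedom (the parameter values are arbitrary).

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
mu, sigma, n, reps = 10.0, 2.0, 5, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                     # sample standard deviation with the n-1 divisor
t_score = (xbar - mu) / (s / np.sqrt(n))

# an upper-tail probability of the simulated t-scores versus the t(n-1) distribution
print(np.mean(t_score > 2.0), 1 - t.cdf(2.0, df=n - 1))   # both close to 0.06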

1.4 Gamma, exponential, and chi-squared distributions
The gamma distribution Gam(α, λ) is a continuous distribution described by two parameters: the shape parameter
α > 0 and the inverse scale (or the rate) parameter λ > 0. Its probability density function has the form
f(x) = (1/Γ(α)) λ^α x^{α−1} e^{−λx},   x > 0.
The next figure depicts
on the left panel, the gamma densities with λ = 1 and α = 0.9, 1, 2, 3, 4,
on the middle panel, the gamma densities with α = 1 and λ = 0.5, 1, 2, 3,
on the right panel, the gamma densities with λ = 1 and α = 10, 15, 20, 25, 30.

The gamma distribution model, despite being restricted to positive values, is more flexible than the normal distri-
bution model since for the different values of α, the density curves have different shapes. In particular, if α = 1, then
we obtain the exponential distribution (see the middle panel above)
Gam(1, λ) = Exp(λ).
Moreover, if Xi ∼ Exp(λ), i = 1, . . . , k are independent, then
X1 + . . . + Xk ∼ Gam(k, λ), k = 1, 2, . . .
The mean and variance of the gamma distribution are

µ = α/λ,   σ^2 = α/λ^2.

For large values of the shape parameter, there is a useful normal approximation for the gamma distribution:

Gam(α, λ) ≈ N(α/λ, √α/λ),   α ≫ 1.

The chi-squared distribution with k degrees of freedom is the gamma distribution with α = k/2, λ = 1/2. The figure below depicts the chi-squared distribution densities with k = 2, 3, 4, 5, 6 degrees of freedom.

The chi-squared distribution is connected to the standard normal distribution as follows: if Z1, . . . , Zk are independent random variables with the N(0, 1)-distribution, then

Z1^2 + . . . + Zk^2 ∼ χ²_k.

Importantly, if (X1, . . . , Xn) are independent random variables each having the N(µ, σ) distribution, then

Σ_{i=1}^{n} (Xi − X̄)^2 / σ^2 ∼ χ²_{n−1}.

Here, the number of degrees of freedom is n − 1 instead of n, because one degree of freedom is consumed when µ is replaced by X̄.
1.5 Bernoulli, binomial, and multinomial distributions
Let X be the outcome of a Bernoulli trial with probability of success p, meaning that
P(X = 1) = p, P(X = 0) = 1 − p.
In this case, we write X ∼ Bin(1, p) and say that X has the Bernoulli distribution with parameter p ∈ [0, 1]. The
Bernoulli model is used for describing dichotomous data, when observations have two possible outcomes: female or
male, heads or tails, passed or failed. Under such a dichotomy, one outcome is usually called a success and the other
failure, and the value x = 1 is assigned to the successful outcome. The mean and variance of X ∼ Bin(1, p) are
µ = p,   σ^2 = p(1 − p).

If X is the sum of outcomes of n independent Bernoulli trials with probability p of success, then its distribution is called the binomial distribution Bin(n, p). This is a discrete distribution with the probability mass function

P(X = x) = (n choose x) · p^x (1 − p)^{n−x},   x = 0, . . . , n,

yielding the following formulas for the mean and variance:

µ = np,   σ^2 = np(1 − p).

The normal approximation for the binomial distribution,

Bin(n, p) ≈ N(np, √(np(1 − p))),

is an instrumental example of the central limit theorem. The rule of thumb says that this normal approximation is good enough if both np ≥ 5 and n(1 − p) ≥ 5, so that

P(X ≤ x) ≈ Φ((x − np)/√(np(1 − p))).

Continuity correction
For smaller values of n, this approximation is improved with the help of the continuity correction trick, which, in view of the equality

P(X ≤ x) = P(X < x + 1),

suggests replacing x by x + 1/2 on the right-hand side of the approximation formula

P(X ≤ x) ≈ Φ((x − np)/√(np(1 − p))),

resulting in

P(X ≤ x) ≈ Φ((x + 1/2 − np)/√(np(1 − p))),

or similarly,

P(X < x) ≈ Φ((x − 1/2 − np)/√(np(1 − p))).

Example
To illustrate, we plot the Bin(10, 1/3) distribution together with its normal approximation: on the left panel without
the continuity correction and on the right panel with the continuity correction. Observe that n = 10 is so small that
with p = 1/3 we have np = 3.33, which is smaller than the recommended lower bound 5. Still, the normal approximation
with the continuity correction is quite close.
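
The comparison in the figure can be reproduced numerically; the short sketch below evaluates P(X ≤ x) for Bin(10, 1/3) exactly and via the normal approximation with and without the continuity correction.

import numpy as np
from scipy.stats import binom, norm

n, p = 10, 1/3
mu, sd = n * p, np.sqrt(n * p * (1 - p))

for x in range(n + 1):
    exact = binom.cdf(x, n, p)
    plain = norm.cdf((x - mu) / sd)             # without the continuity correction
    corrected = norm.cdf((x + 0.5 - mu) / sd)   # with the continuity correction
    print(f"x={x:2d}  exact={exact:.4f}  normal={plain:.4f}  corrected={corrected:.4f}")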

The multinomial distribution (X1, . . . , Xr) ∼ Mn(n; p1, . . . , pr) is defined by

P(X1 = x1, . . . , Xr = xr) = (n! / (x1! · · · xr!)) · p1^{x1} · · · pr^{xr},
where
xi = 0, . . . , n, i = 1, . . . , r,
and (p1 , . . . , pr ) is a vector of probabilities such that
p1 + . . . + pr = 1.
This is an extension of the binomial distribution Bin(n, p) = Mn(n; p, 1 − p). The Mn(n; p1 , . . . , pr ) distribution
describes the outcome of n independent trials with r possible outcomes labeled by i = 1, . . . , r. If each trial outcome
has distribution (p1 , . . . , pr ) over the set of possible labels {1, . . . , r}, then Xi should be treated as the number of trials
with the outcome labeled by i. We have
X1 + . . . + Xr = n,
the marginal distribution of Xi is binomial Xi ∼ Bin(n, pi ), and the different counts (Xi , Xj ) are negatively correlated:
Cov(Xi , Xj ) = −npi pj , i ̸= j.

1.6 Poisson, geometric, and hypergeometric distributions


The Poisson distribution X ∼ Pois(µ) is a discrete distribution with

P(X = x) = (µ^x / x!) e^{−µ},   x = 0, 1, . . . ,   E(X) = µ,   Var(X) = µ.

The Poisson distribution is obtained as an approximation for the Bin(n, p) distribution in the case

n → ∞, p → 0, and np → µ.

It is used to describe the number of rare events (like accidents) observed during a given time interval. The next figure depicts
on the left panel, the Poisson distribution with µ = 1,
on the middle panel, the Poisson distribution with µ = 8,
on the right panel, the Poisson distribution with µ = 3 in red is compared to the Bin(100, 0.03) distribution.

Geometric distribution
Consider a sequence of independent Bernoulli trials with probability p of success. The geometric distribution is either
one of two discrete probability distributions:
the distribution of the number X of trials needed to get one success,
the distribution of the number Y = X − 1 of failures before the first success.
Which of these is called the geometric distribution is a matter of convention and convenience. Often, the name shifted
geometric distribution is adopted for the distribution of X supported on the set {1, 2, . . .}. To avoid ambiguity, in this
text, we say that X has the geometric distribution with parameter p and write X ∼ Geom(p) if
P(X = x) = (1 − p)^{x−1} p,   x = 1, 2, . . . ,

yielding the mean and variance formulas

µ = 1/p,   σ^2 = (1 − p)/p^2.
Like its continuous analogue, the exponential distribution, the geometric distribution is memoryless: the die one throws
or the coin one tosses does not have a memory of how many failures have been observed so far.

Hypergeometric distribution
The hypergeometric distribution X ∼ Hg(N, n, p) describes the number x of black balls among n balls drawn without
replacement from a box with N balls, of which
B = N p balls are black and
W = N (1 − p) balls are white.
In this case, X is the number of successes in n Bernoulli trials which depend on each other. The distribution of X is
given by the formula
P(X = x) = (B choose x) · (W choose n − x) / (N choose n),

for the integer numbers x satisfying

max(0, n − W) ≤ x ≤ min(n, B).

The mean and variance of X are

µ = np,   σ^2 = np(1 − p) · (N − n)/(N − 1).

Compared to the variance of the Bin(n, p) distribution, the last formula contains the factor

(N − n)/(N − 1) = 1 − (n − 1)/(N − 1),

which is called the finite population correction factor. With a small fraction value n/N , the finite population correction
is close to 1 and Hg(N, n, p) ≈ Bin(n, p). One may say that the binomial distribution is a version of the hypergeometric
distribution with the infinite population size.
Despite the dependence between the drawings without replacement, there is a normal approximation also for the hypergeometric distribution:

Hg(N, n, p) ≈ N(µ, σ),   µ = np,   σ = √( np(1 − p)(1 − (n − 1)/(N − 1)) ),

which is recommended to be applied provided np ≥ 5 and n(1 − p) ≥ 5. In the figure below the Hg(100, 10, 0.5) distribution is compared to its normal approximation with the continuity correction. Again, with the continuity correction the normal approximation works well even for smaller values of np and n(1 − p).
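
A numerical version of this comparison is sketched below. Note that scipy's hypergeom uses the parameterisation (population size, number of black balls, number of draws), so Hg(N, n, p) corresponds to hypergeom(N, Np, n); the code compares Hg(100, 10, 0.5) with Bin(10, 0.5) and with the normal approximation using the continuity correction.

import numpy as np
from scipy.stats import hypergeom, binom, norm

N, n, p = 100, 10, 0.5
B = int(N * p)                                              # number of black balls
mu = n * p
sd = np.sqrt(n * p * (1 - p) * (1 - (n - 1) / (N - 1)))     # with the finite population correction

for x in range(n + 1):
    hg = hypergeom.cdf(x, N, B, n)            # exact Hg(100, 10, 0.5)
    bi = binom.cdf(x, n, p)                   # binomial (infinite population) approximation
    na = norm.cdf((x + 0.5 - mu) / sd)        # normal approximation with the continuity correction
    print(f"x={x:2d}  hypergeom={hg:.4f}  binom={bi:.4f}  normal={na:.4f}")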

1.7 Exercises
Problem 1
For any pair of random variables (X1 , X2 ) with means (µ1 , µ2 ), show that
Var(Xi) = E(Xi^2) − µi^2,   Cov(X1, X2) = E(X1X2) − µ1µ2.

Problem 2
Let
Z1 = (X1 − µ1)/σ1,   Z2 = (X2 − µ2)/σ2,
be the standardised versions of X1 and X2 . Verify that E(Zi ) = 0 and Var(Zi ) = 1. Show that the correlation
coefficient for X1 and X2 is given by
ρ = E (Z1 Z2 ) ,
and explain in what sense it is a dimensionless quantity.

Problem 3
For (X1 , . . . , Xr ) ∼ Mn(n; p1 , . . . , pr ), what is the distribution of the sum Xi + Xj assuming i ̸= j?

Problem 4
Let X ∼ Gam(α, λ). To see that the parameter λ influences only the scale of the gamma distribution, show that the
scaled random variable Y = λX has the gamma distribution Gam(α, 1).

Problem 5
Show that
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ).

Problem 6
The average number of goals in a World Cup soccer match is approximately 2.5 and the Poisson model is appropriate.
Compute the probabilities of k goals in a match for k = 0, 1, 2.

Problem 7
Show that for large N ,
Hg(N, n, p) ≈ Bin(n, p).

Problem 8
Consider a truncated version of the geometric distribution Geom(n, p) such that for X ∼ Geom(n, p),
P(X = x) = (1 − p)^x p,   x = 0, 1, . . . , n − 1,   P(X = n) = 1 − Σ_{x=0}^{n−1} P(X = x),

and compute P(X = n).

Chapter 2

Random sampling

Statistical inference is the use of data analysis for inferring relevant statistical patterns in a large population with the help of a random sample drawn from the population in question.
Picking one element at random from the population produces a realisation x of
a random variable X ∼ F(µ, σ) having the population distribution.
In many situations, studying the population distribution by enumeration is either very expensive or even impossible.
Luckily, a good guess is available by studying a sample of n observations (x1 , . . . , xn ) drawn independently from the
population distribution F(µ, σ). Such a random sample is a single realisation of the vector (X1 , . . . , Xn ) of independent
and identically distributed random variables. If the sampling experiment is repeated, the new realization (x′1 , . . . , x′n )
will differ from (x1 , . . . , xn ).
Randomisation in sampling protects against the investigator's biases, even unconscious ones.
Any function g(x1 , . . . , xn ) of the sample data is called a statistic. The most important examples of statistics are
the sample mean and sample variance
x̄ = (x1 + . . . + xn)/n,   s^2 = ((x1 − x̄)^2 + . . . + (xn − x̄)^2)/(n − 1).

Example: in class experiment


The figure below presents the heights and genders of the students attending the course Statistical Inference
in a certain year. The collected height values form a sample drawn from the population distribution of the heights of
Gothenburg students for that year.
- Can this dataset be viewed as a random sample, in the sense that the students are drawn independently
at random from the population of Gothenburg students?
- How would you estimate the population mean and variance of the heights using the collected data?
- How would you estimate the population proportion of women?

2.1 Point estimation


Suppose the population distribution of interest is a gamma distribution
F(µ, σ) = Gam(α, λ)

with unknown parameters. Given a random sample (x1, . . . , xn) drawn from the Gam(α, λ) distribution, one can try to
estimate the shape parameter α using a relevant statistic g(x1, . . . , xn).
More generally, to estimate a population parameter θ based on a given random sample (x1 , . . . , xn ), we need a
sensible point estimate θ̂ = g(x1, . . . , xn). Observe that, in the same way as (x1, . . . , xn) is a realisation of a random vector (X1, . . . , Xn), the point estimate θ̂ is a realisation of a random variable

Θ̂ = g(X1, . . . , Xn),

which we will call a point estimator of θ. The distribution of the random variable Θ̂ is called the sampling distribution of the point estimator. The quality of the point estimator Θ̂ is measured by the mean square error

E((Θ̂ − θ)^2) = Var(Θ̂) + (E(Θ̂) − θ)^2,

which is the sum of two components involving

the bias E(Θ̂) − θ, measuring the lack of accuracy (systematic error),
Var(Θ̂), measuring the lack of precision (random error).

If the mean square error vanishes, E((Θ̂ − θ)^2) → 0 as n → ∞, the point estimate θ̂ is called consistent. If E(Θ̂) = θ, the estimate is called unbiased. The standard deviation σΘ̂ = √Var(Θ̂) of the estimator Θ̂ is called the standard error of the point estimate θ̂.

The estimated standard error sθ̂ of the point estimate θ̂ is a point estimate of σΘ̂ computed from the data.

Remark on terminology
Consistency is a feature of an estimator. Sometimes we write "consistent estimate" and mean that the estimate is a value of a consistent estimator. The same holds for unbiased estimators and estimates.

Sample mean, sample variance, and sample standard deviation


In view of

E(X̄) = µ,   Var(X̄) = σ^2/n,   E(S^2) = σ^2,   Var(S^2) = (σ^4/n) · ( E(((X − µ)/σ)^4) − (n − 3)/(n − 1) ),

we conclude that the sample mean x̄ is an unbiased and consistent estimate of the population mean θ = µ. Furthermore, the sample variance s^2 is an unbiased and consistent estimate of the population variance θ = σ^2.

The estimated standard error for the sample mean x̄ is given by sx̄ = s/√n.

Notice that the sample standard deviation s systematically underestimates the population standard deviation σ, since E(S) < σ provided Var(S) > 0. To see this, observe that (E(S))^2 < E(S^2), where E(S^2) = σ^2. However, s is an asymptotically unbiased and consistent estimate of σ.
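
These properties can be observed in a simulation such as the sketch below: for many samples it records x̄, s^2 and s, and compares their averages with µ, σ^2 and σ. A normal population and a small sample size are used for concreteness; any population with a finite fourth moment would do.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 10.0, 3.0, 8, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)          # sample variance with the n-1 divisor
s = np.sqrt(s2)

print("average of x-bar:", xbar.mean(), " (mu =", mu, ")")
print("average of s^2:  ", s2.mean(), " (sigma^2 =", sigma**2, ")")
print("average of s:    ", s.mean(), " (sigma =", sigma, ", note the average is below sigma)")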

2.2 Approximate confidence intervals


By the central limit theorem, for large sample sizes n,

X̄ ≈ N(µ, σ/√n),

or in terms of the z-score,

(X̄ − µ)/(σ/√n) ≈ N(0, 1).

Due to the consistency property S ≈ σ, it follows that in terms of the t-score we have

(X̄ − µ)/(S/√n) ≈ N(0, 1).

This yields

P(X̄ − zS/√n < µ < X̄ + zS/√n) ≈ P(−zσ/√n < X̄ − µ < zσ/√n) ≈ 2Φ(z) − 1,

giving the following formula for an approximate 100(1 − α)% two-sided confidence interval for the population mean µ:

Iµ ≈ x̄ ± z(α/2) · sx̄,

or equivalently,

Iµ ≈ x̄ ± z(α/2) · s/√n.

Here z(α) is obtained from the normal distribution table using the relation

Φ(z(α)) = 1 − α, α ∈ (0, 1),

so that for example,


100(1 − α)%:   68%    80%    90%    95%    99%    99.7%
z(α/2):        1.00   1.28   1.64   1.96   2.58   3.00

The higher the confidence level 100(1–α)%, the wider is the confidence interval Iµ .
On the other hand, the larger the sample size n, the narrower is Iµ .

The exact meaning of the confidence level is a bit tricky. It is important to realise that the source of randomness
in the expression x̄ ± z(α/2) · sx̄ is in the sampling procedure. For a given sample (x1, . . . , xn), the concrete interval Iµ
is a realisation of a random interval Iµ , such that

P(random interval Iµ covers the point µ) ≈ 1 − α.

For example, out of a hundred 95% confidence intervals Iµ computed for 100 samples, on average 95 intervals are
expected to cover the true value of µ. Notice that in this case, the random number of the confidence intervals covering
µ has distribution Bin(100, 0.95) which is approximately normal N(95, 2.18).

Example: 50% confidence level


The figure below presents a case for the 50% confidence interval. Here 20 samples of size n = 8 were drawn from a
population with mean µ. The 20 samples produce different sample means and sample variances. The resulting 20
confidence intervals for the mean µ have different middle points and widths:
some of the intervals cover the mean value (blue intervals),
some of the intervals fail to capture the mean value (red intervals).
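
A coverage experiment of this kind is easy to run; the sketch below builds the approximate interval x̄ ± z(α/2) · s/√n for many independent samples and reports the fraction of intervals that cover µ (the exponential population, sample size and confidence level are arbitrary choices).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
mu, n, reps, alpha = 1.0, 100, 10_000, 0.05
z = norm.ppf(1 - alpha / 2)                    # z(alpha/2), about 1.96 for a 95% interval

x = rng.exponential(scale=mu, size=(reps, n))  # exponential population with mean mu
xbar = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)        # estimated standard error s / sqrt(n)

covered = (xbar - z * se < mu) & (mu < xbar + z * se)
print("empirical coverage:", covered.mean())   # close to the nominal 0.95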

2.3 Simple random sample


A finite population of size N can be viewed as a set of N elements characterised by numerical values x ∈ {a1 , a2 , . . . , aN }.
The corresponding population distribution F(µ, σ) is then given by the probability mass function
P(X = x) = Nx/N,
where Nx is the number of elements labeled by ai = x. There are two basic ways of drawing a random sample of size
n from a population of size N :

- sampling with replacement produces what we call a random sample consisting of independent and identically
distributed observations,

- sampling without replacement produces a so called simple random sample having identically distributed but
dependent observations.
Notice that if the ratio n/N is small, then the two approaches are almost indistinguishable.
In this section we consider a simple random sample (x1 , . . . , xn ) with dependent observations. In this case the
sample mean x̄ is again an unbiased and consistent estimate for the population mean, such that
E(X̄) = µ,   Var(X̄) = (σ^2/n) · (1 − (n − 1)/(N − 1)).

Here N is the finite population size and the finite population correction

1 − (n − 1)/(N − 1) = (N − n)/(N − 1)

reflects the negative correlation between observations due to sampling without replacement.
Observe that the sample variance s^2 becomes a biased estimate of σ^2. Indeed, since

E(S^2) = (n/(n − 1)) · E( (X1^2 + . . . + Xn^2)/n − X̄^2 ) = (n/(n − 1)) · (E(X^2) − E(X̄^2))
       = (n/(n − 1)) · ( σ^2 + µ^2 − (σ^2/n)(1 − (n − 1)/(N − 1)) − µ^2 )
       = (n/(n − 1)) · ( σ^2 − (σ^2/n)(1 − (n − 1)/(N − 1)) ),

we find that

E(S^2) = σ^2 · N/(N − 1).

Replacing σ^2 by s^2 · (N − 1)/N in the formula Var(X̄) = (σ^2/n)(1 − (n − 1)/(N − 1)), we obtain the following unbiased estimate of Var(X̄):

s_x̄^2 = (s^2/n) · ((N − 1)/N) · (1 − (n − 1)/(N − 1)) = (s^2/n)(1 − n/N).

Thus, for sampling without replacement, the formula for the estimated standard error of the sample mean x̄ takes the form

sx̄ = (s/√n) · √(1 − n/N).

With this new formula for the estimated standard error sx̄ , the formula of an approximate 100(1–α)% confidence
interval

Iµ ≈ x̄ ± z(α/2)sx̄ ,

remains valid even for sampling without replacement, due to the central limit theorem under weak dependence.
The figure below plots the sampling distribution of the sample mean and its normal approximation. Here the
sampling is performed without replacement from the population

{a1 , a2 , . . . , aN } = {1, 2, . . . , 100}

using the sample size n = 10.

2.4 Dichotomous data
Consider the important case when the population distribution is a Bernoulli distribution
F(µ, σ) = Bin(1, p)
with the unknown parameter p. A random sample drawn from the Bernoulli distribution consists of 0 or 1 values
xi ∈ {0, 1}, therefore, the data (x1 , . . . , xn ) will be called dichotomous. This model can be used for categorical data,
where the observed values are of non-numerical form, like male or female, after converting the non-numerical values
to the 0-1 format: female = 1, male = 0.
The defining parameter of the Bernoulli model,

p = P(X = 1),

will be called the population proportion. The population proportion p defines both the population mean and variance by

µ = p,   σ^2 = p(1 − p),

and the sample mean turns into the sample proportion p̂ = x̄. The sample proportion is an unbiased and consistent estimate of p. Since xi^2 = xi for xi ∈ {0, 1}, we find that the sample variance takes the form

s^2 = ((x1 − p̂)^2 + . . . + (xn − p̂)^2)/(n − 1)
    = (x1^2 − 2x1p̂ + p̂^2 + . . . + xn^2 − 2xnp̂ + p̂^2)/(n − 1)
    = (x1 − 2x1p̂ + p̂^2 + . . . + xn − 2xnp̂ + p̂^2)/(n − 1)
    = (np̂ − 2np̂^2 + np̂^2)/(n − 1)
    = np̂(1 − p̂)/(n − 1).

The estimated standard error for the sample proportion p̂ is sp̂ = √( p̂(1 − p̂)/(n − 1) ).

Simple random sample


If sampling from a dichotomous population of size N is performed without replacement, the formula for the estimated standard error of p̂ becomes

sp̂ = √( p̂(1 − p̂)/(n − 1) ) · √(1 − n/N).

The following confidence interval formula is valid both for sampling with replacement and for sampling without replacement.

An approximate 100(1 − α)% two-sided confidence interval for p is given by Ip ≈ p̂ ± z(α/2) · sp̂.

Example: opinion polls


The Swedish population of eligible voters has a size N of 7 million. You are interested in the current attitude among voters towards a certain Swedish political party, quantified by a relevant population proportion p. How informative would the result of a survey involving n = 2000 people be? An intuitive response is that asking 2000 people cannot accurately reflect the overall opinion of 7 000 000 people. However, if the sample is truly random, the error in p̂ is quite small.

Indeed, the standard error of the sample proportion p̂ with n = 2000 and N = 7 000 000 equals

sp̂ = √( p̂(1 − p̂)/(n − 1) ) · √(1 − n/N) = 0.0224 · √(p̂(1 − p̂)),

which is less than or equal to 1.12%, since p̂(1 − p̂) ≤ 0.25.
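
The same calculation can be scripted; the sketch below computes the estimated standard error with the finite population correction and an approximate 95% confidence interval for a hypothetical observed proportion p̂ = 0.30 (the value 0.30 is only an illustrative choice).

import numpy as np
from scipy.stats import norm

N, n = 7_000_000, 2000
p_hat = 0.30                                   # hypothetical survey outcome

se = np.sqrt(p_hat * (1 - p_hat) / (n - 1)) * np.sqrt(1 - n / N)
z = norm.ppf(0.975)                            # about 1.96
ci = (p_hat - z * se, p_hat + z * se)

print("estimated standard error:", se)         # about 0.0102, below the worst case 0.0112
print("approximate 95% CI for p:", ci)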

2.5 Stratified random sampling


Given additional information on the population structure, one can reduce the sampling error using the method of
stratified sampling. Assume that a population of size N consists of k strata of sizes N1 , . . . , Nk , so that N =
N1 + . . . + Nk . Suppose the strata fractions wj = Nj /N are known. A simple example of a stratified population is
the Swedish population divided in two strata: the subpopulation of females and the subpopulation of males, so that
k = 2 and w1 = w2 = 0.5.
In terms of the unknown strata means and standard deviations

(µj , σj ), j = 1, . . . , k,

we get the following expressions for the population mean and variance:

µ = w1µ1 + . . . + wkµk,
σ^2 = (w1σ1^2 + . . . + wkσk^2) + w1(µ1 − µ)^2 + . . . + wk(µk − µ)^2.

In view of

w1 + . . . + wk = 1,

the above formulas for µ and σ^2 are obtained using the law of total expectation and the law of total variance (see Wikipedia). The expression for σ^2 is due to the law of total variance, with

w1σ1^2 + . . . + wkσk^2

being the average variance, see the end of Section 1.1.

The stratified random sampling procedure consists of taking k independent random samples from each stratum
with sample sizes (n1 , . . . , nk ) and sample means x̄1 , . . . , x̄k .

The stratified sample mean is the weighted average of k sample means x̄s = w1 x̄1 + . . . + wk x̄k .

Observe that for any allocation (n1 , . . . , nk ) of

n = n1 + . . . + nk

observations, the stratified sample mean is an unbiased estimate of µ since

E(X̄s ) = w1 E(X̄1 ) + . . . + wk E(X̄k ) = w1 µ1 + . . . + wk µk = µ.

The variance of X̄s is given by the formula

Var(X̄s) = w1^2 Var(X̄1) + . . . + wk^2 Var(X̄k) = w1^2σ1^2/n1 + . . . + wk^2σk^2/nk.

It is estimated by

s_x̄s^2 = w1^2 s_x̄1^2 + . . . + wk^2 s_x̄k^2 = w1^2s1^2/n1 + . . . + wk^2sk^2/nk,

where sj is the sample standard deviation corresponding to the sample mean x̄j.

A confidence interval for the mean based on a stratified sample: Iµ ≈ x̄s ± z(α/2) · sx̄s.

Suppose we are allowed to collect n observations from the stratified population of size N and assume that n ≪ N. What would be the optimal allocation (n1, . . . , nk) of n observations among the different strata, minimising the sampling error sx̄s of x̄s? The solution of this optimisation problem is given by the formula nj = n·wjσj/σ̄, j = 1, . . . , k, where σ̄ is the average standard deviation

σ̄ = w1σ1 + . . . + wkσk.

The stratified sample mean X̄so with the optimal allocation nj = n·wjσj/σ̄ has the smallest variance, Var(X̄so) = (σ̄)^2/n, among all allocations of n observations.

The optimal allocation assigns more observations to larger strata and strata with larger variation. The major drawback
of the optimal allocation formula is that it requires the knowledge of the standard deviations σj .
If σj are unknown, which is often the case, then a sensible choice is to allocate the observations proportionally to
the strata sizes, so that ni = nwi . Observe that with the proportional allocation, the stratified sample mean is equal
to the usual sample mean

n1 nk x1 + . . . + xn
x̄sp = w1 x̄1 + . . . + wn x̄k = n x̄1 + ... + n x̄k = = x̄.
n
However, this is not the mean of a random sample with independently allocated observations, since the n observations
are forcefully allocated among the k strata proportionally to the strata sizes. For the truly random sample, the sample
sizes n1 , . . . , nk are the outcome of a random allocation of n observations among k strata following the multinomial
distribution Mn(n, w1 , . . . , wk ).

The stratified sample mean X̄sp for the proportional allocation nj = nwj has the variance Var(X̄sp ) = n .
σ2

Comparing the three unbiased estimates of the population mean (x̄so, x̄sp, x̄), we find that their variances are ordered in the following way:

Var(X̄so) ≤ Var(X̄sp) ≤ Var(X̄),

since

(σ̄)^2/n ≤ (w1σ1^2 + . . . + wkσk^2)/n ≤ σ^2/n.

Variability of σj across the different strata makes the optimal allocation more effective than the proportional one:

Var(X̄sp) − Var(X̄so) = (1/n)(w1σ1^2 + . . . + wkσk^2 − (σ̄)^2) = (1/n) Σ_{j=1}^{k} wj(σj − σ̄)^2.

Variability of µj across the strata makes the proportional allocation more effective than the sample mean produced by a random sample:

Var(X̄) − Var(X̄sp) = (1/n)(σ^2 − w1σ1^2 − . . . − wkσk^2) = (1/n) Σ_{j=1}^{k} wj(µj − µ)^2.
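
The ordering of the three variances can be seen in a small numerical example; the sketch below computes Var(X̄so), Var(X̄sp) and Var(X̄) for a hypothetical two-strata population and n = 100 observations (all strata parameters are invented for illustration).

import numpy as np

w = np.array([0.6, 0.4])        # strata fractions w_j (hypothetical)
mu_j = np.array([10.0, 30.0])   # strata means
sd_j = np.array([2.0, 8.0])     # strata standard deviations
n = 100

mu = np.sum(w * mu_j)
avg_var = np.sum(w * sd_j**2)                       # average (within-strata) variance
sigma2 = avg_var + np.sum(w * (mu_j - mu)**2)       # population variance (law of total variance)
sd_bar = np.sum(w * sd_j)                           # average standard deviation

var_optimal = sd_bar**2 / n                         # optimal allocation n_j = n w_j sd_j / sd_bar
var_proportional = avg_var / n                      # proportional allocation n_j = n w_j
var_random = sigma2 / n                             # plain random sample of size n
print(var_optimal, var_proportional, var_random)    # printed in increasing order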

2.6 Exact confidence intervals


Recall the approximate confidence interval formula for the mean,

Iµ ≈ x̄ ± z(α/2) · sx̄,

which is based on the central limit theorem for the t-score of a random sample with a sufficiently large size n:

(X̄ − µ)/(S/√n) ≈ N(0, 1).

The approximate confidence interval formula does not require that the population distribution is normal, but it works only for large sample sizes. In this section, we state two confidence interval formulas based on the probability theory facts mentioned in Sections 1.3 and 1.4:

(X̄ − µ)/(S/√n) ∼ t_{n−1},   (n − 1)S^2/σ^2 ∼ χ²_{n−1}.

These formulas are valid for small sample sizes under the assumption that the population distribution is normal.
Assume that a random sample (x1 , . . . , xn ) is taken from the normal distribution

F(µ, σ) = N(µ, σ)

with unspecified parameters µ and σ. Replacing the normal approximation by the exact distribution for the t-score,

(X̄ − µ)/(S/√n) ∼ t_{n−1},

we arrive at the exact 100(1 − α)% confidence interval formula for the mean,

Iµ = x̄ ± t_{n−1}(α/2) · sx̄,

which is valid even for small values of n ≥ 2. Remember that this formula requires that the population distribution is normal.
For example, with α = 0.05 and n = 10, 16, 25, 30 we get the following four 95% confidence intervals for the mean

Iµ = x̄ ± 2.26 · sx̄ for n = 10,


Iµ = x̄ ± 2.13 · sx̄ for n = 16,
Iµ = x̄ ± 2.06 · sx̄ for n = 25,
Iµ = x̄ ± 2.05 · sx̄ for n = 30.

Here the critical values for the t-distribution

t9 (0.025) = 2.26, t15 (0.025) = 2.13, t24 (0.025) = 2.06, t29 (0.025) = 2.05

are obtained from the table in Section 11.2 using the column 0.025 and the rows k = 9, 15, 24, 29. These intervals get narrower for larger values of n, asymptotically approaching the familiar approximate 95% confidence interval

Iµ ≈ x̄ ± 1.96 · sx̄ .

The second exact confidence interval formula of this section is aimed at the population variance σ^2. We want to have a formula for a random interval Iσ², such that

P(σ^2 ∈ Iσ²) = 1 − α.

Under the normality assumption we have Var(S^2) = 2σ^4/(n − 1), so that an estimated standard error for the unbiased estimate s^2 of σ^2 is

ss2 = s^2 · √(2/(n − 1)).

However, we cannot use the same kind of formula Iσ² = s^2 ± t_{n−1}(α/2) · ss2 as for the mean.
The correct exact 100(1 − α)% confidence interval formula for σ^2 is based on the non-symmetric chi-squared distribution in the above-mentioned relation

(n − 1)S^2/σ^2 ∼ χ²_{n−1},

and has the following non-symmetric form:

Iσ² = ( (n − 1)s^2 / x_{n−1}(α/2) ;  (n − 1)s^2 / x_{n−1}(1 − α/2) ).

The numbers xk (α) are obtained from the table of Section 11.3 giving the critical values of the chi-squared distributions.

Example
Examples of the 95% confidence intervals for σ^2 are

Iσ² = (0.47s^2, 3.33s^2) for n = 10,
Iσ² = (0.55s^2, 2.40s^2) for n = 16,
Iσ² = (0.61s^2, 1.94s^2) for n = 25,
Iσ² = (0.63s^2, 1.81s^2) for n = 30.

To clarify, turn to the last formula dealing with n = 30. The χ²_{29}-distribution table gives the critical values

x_{29}(0.025) = 45.722,   x_{29}(0.975) = 16.047,

yielding

(n − 1)/x_{n−1}(α/2) = 29/45.722 = 0.63,   (n − 1)/x_{n−1}(1 − α/2) = 29/16.047 = 1.81.

We conclude that for n = 30,

Iσ² = ( (n − 1)s^2 / x_{n−1}(α/2) ;  (n − 1)s^2 / x_{n−1}(1 − α/2) ) = (0.63s^2, 1.81s^2).
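
The critical values used above can be reproduced with the quantile functions of the t- and chi-squared distributions; the sketch below recomputes the n = 30 factors.

from scipy.stats import t, chi2

n, alpha = 30, 0.05

# t critical value for the exact confidence interval for the mean
t_crit = t.ppf(1 - alpha / 2, df=n - 1)
print("t_29(0.025) =", round(t_crit, 2))                    # 2.05

# chi-squared critical values for the exact confidence interval for the variance
upper = chi2.ppf(1 - alpha / 2, df=n - 1)                   # x_29(0.025) = 45.72
lower = chi2.ppf(alpha / 2, df=n - 1)                       # x_29(0.975) = 16.05
print("interval factors:", round((n - 1) / upper, 2), round((n - 1) / lower, 2))   # 0.63 and 1.81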

2.7 Exercises
Problem 1
Consider a population consisting of five values
1, 2, 2, 4, 8.
Find the population mean and variance. Calculate the sampling distribution of the mean of a random sample of size
2 by generating all possible such samples (x1 , x2 ). Then find the mean and variance of the sampling distribution, and
verify the formulas given in this section
E(X̄) = µ,   Var(X̄) = σ^2/n.

Problem 2
In a simple random sample of 1500 voters, 55% said they planned to vote for a particular proposition, and 45% said
they planned to vote against it. The estimated margin of victory for the proposition is thus 10%. What is the standard
error of this estimated margin? What is an approximate 95% confidence interval for the margin of victory?

Problem 3
This problem introduces the concept of a one-sided confidence interval. Using the central limit theorem, how should
the constant k1 be chosen so that the interval
(−∞, x̄ + k1 sx̄ )
is an approximate 90% confidence interval for µ? How should k2 be chosen so that

(x̄ − k2 sx̄ , ∞)

is an approximate 95% confidence interval for µ?

Problem 4
Verify the formula for the mean square error
E((Θ̂ − θ)^2) = Var(Θ̂) + (E(Θ̂) − θ)^2.

Problem 5
A simple random sample from a population of size 2000 yields the following 25 values:
104 109 11 109 87
86 80 119 88 122
91 103 99 108 96
104 98 98 83 107
79 87 94 92 97

(a) Calculate an unbiased estimate of the population mean.


(b) Calculate unbiased estimates of the population variance and of Var(X̄).

(c) Give an approximate 95% confidence interval for the population mean.

Problem 6
For a simple random sample, take x̄^2 as a point estimate of µ^2. (This is an example of the method of moments estimate introduced in the next chapter.) Compute the bias of this point estimate, if any.

Problem 7
The following table (Cochran 1977) shows the stratification of all farms in a county by farm size and the mean and
standard deviation of the number of acres of corn in each stratum.
Farm size 0-40 41-80 81-120 121-160 161-200 201-240 241+
Number of farms Nj 394 461 391 334 169 113 148
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2

(a) What are the population mean and variance?

(b) For a sample size of 100 farms, compute the sample sizes from each stratum for proportional and optimal
allocation, and compare them.

(c) Calculate the variances of three sample means with different allocations of 100 observations: (1) proportional
allocation, (2) optimal allocation, (3) random sampling.
(d) Suppose that ten farms are sampled per stratum. What is Var(X̄s )? How large a simple random sample would
have to be taken to attain the same variance? Ignore the finite population correction.

(e) Repeat part (d) using proportional allocation of n = 70 observations.

Problem 8
How might stratification be used in each of the following sampling problems?

(a) A survey of household expenses in a city.

(b) A survey to examine the lead concentration in the soil in a large plot of land.
(c) A survey to estimate the number of people who use elevators in a large building with a single bank of elevators.

Problem 9
Consider stratifying the population of Problem 1 into two strata (1,2,2) and (4,8). Assuming that one observation is
taken from each stratum, find the sampling distribution of the estimate of the population mean and the mean and
standard deviation of the sampling distribution. Check the formulas of Section 2.5.

Problem 10
The following 16 numbers were generated from a normal distribution N(µ, σ)
5.3299 4.2537 3.1502 3.7032
1.6070 6.3923 3.1181 6.5941
3.5281 4.7433 0.1077 1.5977
5.4920 1.7220 4.1547 2.2799

(a) Give unbiased estimates of µ and σ 2 .

(b) Give 90%, 95%, and 99% confidence intervals for µ and σ 2 .
(c) Give 90%, 95%, and 99% confidence intervals for σ.
(d) How much larger a sample would you need in order to halve the length of the confidence interval for µ?

Chapter 3

Parameter estimation

Given a parametric model determined by a vector of unknown parameters θ = (θ1 , . . . , θk ), we wish to estimate θ from
a given random sample (x1 , . . . , xn ). There are two basic methods of finding good point estimates: (1) the method
of moments and (2) the maximum likelihood method. The method of moments is based on k summary statistics
called sample moments, while the maximum likelihood method uses the full information on the joint distribution of
(X1 , . . . , Xn ).

3.1 Method of moments


Consider the case of a parametric model characterised by a pair of parameters (θ1 , θ2 ). Suppose we have explicit
expressions for the first and second population moments (E(X), E(X 2 )) in terms of (θ1 , θ2 ):
E(X) = f (θ1 , θ2 ), E(X 2 ) = g(θ1 , θ2 ).
Given a random sample (x1 , . . . , xn ) drawn from this parametric model, we define the first and second sample moments
by
x̄ = (x1 + . . . + xn)/n,    x2 = (x1² + . . . + xn²)/n.
By the Law of Large Numbers, for any j ≥ 1,

(X1^j + . . . + Xn^j)/n → E(X^j) as n → ∞,
implying that (x̄, x2 ) are consistent estimates of (E(X), E(X 2 )). After replacing the population moments with the
corresponding sample moments, we arrive at the equations
x̄ = f (θ̃1 , θ̃2 ), x2 = g(θ̃1 , θ̃2 ),
whose solution (θ̃1 , θ̃2 ) will be called the method of moments estimates of (θ1 , θ2 ).

Geometric model
A researcher has observed n = 130 birds and counted the number of hops that each bird does between flights. As a
result, she obtained a random sample (x1 , . . . , xn ), where
xi = number of hops that the i-th bird does between flights.
The observed range of the number of hops j was between 1 and 12. The next table summarizes the dataset (x1 , . . . , xn )
number of hops x 1 2 3 4 5 6 7 8 9 10 11 12 total
observed counts ox 48 31 20 9 6 5 4 2 1 1 2 1 130
in terms of the observed counts ox , defined as the number of observed birds, who hopped x times. The observed counts
are computed from (x1 , . . . , xn ) as
ox = 1{x1 =x} + . . . + 1{xn =x} ,
where 1A is the indicator function of the relation A, which equals 1 if A is true and 0 otherwise.
The data produces the following summary statistics:

x̄ = (total number of hops)/(number of birds) = 363/130 = 2.79,
x2 = (1² · 48 + 2² · 31 + . . . + 11² · 2 + 12² · 1)/130 = 13.20,
s² = (130/129)(x2 − x̄²) = 5.47,
s_x̄ = √(5.47/130) = 0.205.

An approximate 95% confidence interval for µ, the mean number of hops per bird is given by

Iµ ≈ x̄ ± z(0.025) · sx̄ = 2.79 ± 1.96 · 0.205 = 2.79 ± 0.40.

The data plot below exhibits frequencies that decrease by roughly a constant factor from one value of x to the next.

This suggests a geometric model for the number of jumps X ∼ Geom(p) for a random bird:

P(X = x) = (1 − p)x−1 p, x = 1, 2, . . .

(such a geometric model implies that a bird "does not remember" the number of hops made so far, and the next move
of the bird is to jump with probability 1 − p or to fly away with probability p).
The method of moments estimate for the parameter θ = p of the geometric model requires a single equation arising from the expression for the first population moment

µ = 1/p.

This expression leads to the equation x̄ = 1/p̃, which gives the method of moments estimate

p̃ = 1/x̄ = 0.36.

In this case, we can even compute an approximate 95% confidence interval for p using the above mentioned Iµ :
Ip ≈ (1/(2.79 + 0.40), 1/(2.79 − 0.40)) = (0.31, 0.42).

It is useful to compare the observed frequencies (counts) to the frequencies expected from the geometric distribution
with parameter p̃:
x 1 2 3 4 5 6 7+
cx 48 31 20 9 6 5 11
Ex 46.8 30.0 19.2 12.3 7.9 5.0 8.8

The expected counts Ex are computed in terms of independent geometric random variables (X1 , . . . , Xn )

Ex = E(Cx) = E(1{X1=x} + . . . + 1{Xn=x}) = nP(X = x) = n(1 − p̃)^{x−1} p̃ = 130 · (0.64)^{x−1} · 0.36,    x = 1, . . . , 6,
E7 = n − E1 − . . . − E6.

An appropriate measure of discrepancy between the observed and expected counts is given by the following so-called
chi-squared test statistic
x² = Σ_{x=1}^{7} (cx − Ex)²/Ex = 1.86.

As it will be explained later on, the obtained small value 1.86 of the chi-squared test statistic allows us to conclude
that the geometric model fits the bird data very well.
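For the bird data, the whole chain of calculations, from the sample moments to the chi-squared statistic, can be reproduced with a short Python sketch (numpy only; the counts are those of the tables above, and the pooled cell 7+ is formed as in the text):

import numpy as np

hops = np.arange(1, 13)
counts = np.array([48, 31, 20, 9, 6, 5, 4, 2, 1, 1, 2, 1])
n = counts.sum()                                    # 130 birds

xbar = (hops * counts).sum() / n                    # 363/130 = 2.79
x2 = (hops**2 * counts).sum() / n                   # 13.20
s2 = n / (n - 1) * (x2 - xbar**2)                   # about 5.47
se = np.sqrt(s2 / n)                                # about 0.205
p_mom = 1 / xbar                                    # method of moments estimate, about 0.36

expected = n * (1 - p_mom)**(hops[:6] - 1) * p_mom  # expected counts for x = 1, ..., 6
expected = np.append(expected, n - expected.sum())  # pooled cell 7+
observed = np.append(counts[:6], counts[6:].sum())
chi2_stat = ((observed - expected)**2 / expected).sum()   # close to 1.86
print(xbar, se, p_mom, chi2_stat)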

3.2 Maximum likelihood estimation
In a parametric setting with population density function f (x|θ), the observed sample (x1 , . . . , xn ) is a realization of
the random vector (X1 , . . . , Xn ) having the joint probability distribution

f (y1 , . . . yn |θ) = f (y1 |θ) · · · f (yn |θ)

over the possible sample vectors (y1 , . . . , yn ), obtained as the product of the marginal densities due to the assumption
of independence. Fixing the sample values (y1 , . . . , yn ) = (x1 , . . . , xn ) and allowing the parameter value θ to vary, we
obtain the so-called likelihood function

L(θ) = f (x1 , . . . xn |θ) = f (x1 |θ) · · · f (xn |θ).

Observe that the likelihood function is not a density function over θ.


Clearly, if L(θ1 ) > L(θ2 ), then the parametric model f (x|θ1 ) has a better support for the given data (x1 , . . . , xn )
compared to the model f (x|θ2 ).

The maximum likelihood estimate θ̂ of θ is the value of θ that maximises the likelihood function L(θ).

Observe that it is often more convenient to find the maximum likelihood estimate θ̂ by maximizing the log-likelihood
function
l(θ) = ln L(θ) = ln f (x1 |θ) + . . . + ln f (xn |θ).

Sufficiency
Suppose there is a summary statistic t = g(x1 , . . . , xn ) such that

L(θ) = f (x1 , . . . , xn |θ) = h(t, θ)c(x1 , . . . , xn ) ∝ h(t, θ),

where the sign ∝ means "proportional to". Here the coefficient of proportionality c(x1 , . . . , xn ) does not explicitly
depend on θ. In this case, the maximum likelihood estimate θ̂ depends on the data (x1 , . . . , xn ) only through the
statistic t. Given such a factorisation property, we call t a sufficient statistic, as no other statistic that can be calculated
from the same sample provides any additional information on the value of the maximum likelihood estimate θ̂.

Example: normal distribution model


The two-parameter normal distribution model

F(µ, σ) = N(µ, σ)

has a two-dimensional sufficient statistic t = (t1, t2), where

t1 = x1 + . . . + xn,    t2 = x1² + . . . + xn²,

which follows from

L(µ, σ) = ∏_{i=1}^{n} (σ√(2π))^{−1} e^{−(xi − µ)²/(2σ²)} = σ^{−n}(2π)^{−n/2} e^{−Σ_{i=1}^{n}(xi − µ)²/(2σ²)} = σ^{−n}(2π)^{−n/2} e^{−(t2 − 2µt1 + nµ²)/(2σ²)}.

Thus two samples having the same values for (t1 , t2 ) will produce the same maximum likelihood estimates for (µ, σ).
Notice the match between the number of sufficient statistics and the number of parameters of the model.

Example: Bernoulli distribution model


Consider the case of the Bernoulli model X ∼ Bin(1, p) described by the probability mass function

f (x|p) = P(X = x) = px (1 − p)1−x , x ∈ {0, 1}.

Suppose we are given a random sample (x1 , . . . , xn ) of zeros and ones drawn from a population distribution

F(µ, σ) = Bin(1, p)

with unknown p. Since µ = p, the method of moment estimate of p is computed as the sample proportion p̃ = x̄.
The likelihood function,

L(p) = ∏_{i=1}^{n} p^{xi}(1 − p)^{1−xi} = p^t(1 − p)^{n−t},

where ∏ stands for the product, is fully determined by the number of successes

t = x1 + . . . + xn = nx̄,

providing another example of a sufficient statistic. To maximise the log-likelihood function

l(p) = ln L(p) = t ln p + (n − t) ln(1 − p),

take its derivative

l′(p) = t/p − (n − t)/(1 − p)

and set it equal to zero. As a result, we find that the maximum likelihood estimate of p is again the sample proportion p̂ = t/n.

Large sample properties of the maximum likelihood estimates


For a random sample (x1 , . . . , xn ) drawn from a parametric population distribution f (x|θ), the log-likelihood function
is the sum
l(θ) = ln f (x1 |θ) + . . . + ln f (xn |θ)
due to independence between n observations. This implies that the log-likelihood function can be treated as a reali-
sation of a sum of independent and identically distributed random variables Yi = ln f (Xi |θ). Using the central limit
theorem argument one can derive the normal approximation
Θ̂ ≈ N(θ, σθ/√n)

for the maximum likelihood estimator. Here σθ2 is the inverse of the so-called Fisher information in a single observation
(if interested, see Wikipedia). It follows that the maximum likelihood estimators are asymptotically unbiased and
consistent. Moreover, they are asymptotically efficient estimates in the sense of the following Cramer-Rao inequality.

Cramér-Rao inequality: if θ∗ is an unbiased estimator of θ, then Var(Θ∗) ≥ σθ²/n.

Example: exponential model


The lifetimes of five batteries measured in hours

x1 = 0.5, x2 = 14.6, x3 = 5.0, x4 = 7.2, x5 = 1.2,

are assumed to be generated by the exponential population distribution

F(µ, σ) = Exp(θ).

In this case, the mean lifetime is µ = 1/θ, and θ can be viewed as the battery death rate per hour. The likelihood
function
L(θ) = θe−θx1 θe−θx2 θe−θx3 θe−θx4 θe−θx5 = θn e−θ(x1 +...+xn ) = θ5 e−θ·28.5
first grows from 0 to 2.2 · 10−7 and then falls down towards zero. The likelihood maximum is reached at θ̂ = 0.175.
For the exponential model, t = x1 + . . . + xn is a sufficient statistic, and the maximum likelihood estimate

θ̂ = n/t = 1/x̄

coincides with the method of moments estimate. It is a biased estimate of θ, since

E(Θ̂) = E(1/X̄) ≠ 1/E(X̄) = 1/µ = θ,

but it is asymptotically unbiased due to the Law of Large Numbers, which gives X̄ ≈ µ for large sample sizes.
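A short numerical check of the battery example in Python:

import numpy as np

x = np.array([0.5, 14.6, 5.0, 7.2, 1.2])   # lifetimes in hours
t = x.sum()                                # sufficient statistic, 28.5
theta_hat = len(x) / t                     # maximum likelihood estimate, about 0.175
print(t, theta_hat)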

3.3 Maximum likelihood estimation for the gamma distribution


Let (x1 , · · · , xn ) be a random sample from the gamma distribution

F(µ, σ) = Gam(α, λ)

with the two-dimensional unknown parameter θ = (α, λ). Put

t1 = x1 + . . . + xn,    t2 = x1 · · · xn,
and observe that the corresponding likelihood function takes the form
L(α, λ) = ∏_{i=1}^{n} (1/Γ(α)) λ^α xi^{α−1} e^{−λxi} = (λ^{nα}/Γ^n(α)) (x1 · · · xn)^{α−1} e^{−λ(x1+...+xn)} = (λ^{nα}/Γ^n(α)) t2^{α−1} e^{−λt1}.

We see that (t1 , t2 ) is a pair of sufficient statistics containing all information from the data needed to compute the
likelihood function. To maximise the log-likelihood function

l(α, λ) = ln L(α, λ),

set the two derivatives




∂l(α, λ)/∂α = n ln(λ) − n Γ′(α)/Γ(α) + ln t2,
∂l(α, λ)/∂λ = nα/λ − t1,

equal to zero. The maximum likelihood estimates of the parameters (α, λ) can be obtained from the following two
equations

λ̂ = α̂/x̄,
ln(α̂/x̄) = −(1/n) ln t2 + Γ′(α̂)/Γ(α̂).

A numerical solution of the second equation would benefit from using the method of moments estimate α̃ as the initial
guess.

Example: male heights


Consider a random sample of n = 24 male heights (cm) given in ascending order:

170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199.

Assuming that these numbers are generated by the Gam(α, λ) distribution, we would like to estimate the parameters
λ and α.
To apply the method of moments, we use the formulas for the first and second population moments

E(X) = α/λ,    E(X²) = Var(X) + (E(X))² = α/λ² + α²/λ² = α(1 + α)/λ²,
and the two sample moments computed from the data

x̄ = 181.46, x2 = 32964.2.

From the equations


x̄ = α̃/λ̃,    x2 = α̃(1 + α̃)/λ̃²,

we get

α̃/λ̃ = 181.46,    (1 + α̃)/λ̃ = 32964.2/181.46 = 181.66,
yielding the method of moments estimates
λ̃ = 5.00, α̃ = 907.3.
The maximum likelihood estimate of the shape parameter α is obtained from the equation

ln(α̂/x̄) = −(1/n) ln t2 + Γ′(α̂)/Γ(α̂)

using the method of moment estimate α̃ = 907.3 as the initial guess. The Mathematica command
FindRoot[Log[a] == 0.00055 + Gamma'[a]/Gamma[a], {a, 907.3}]
gives
α̂ = 908.76.
Then the maximum likelihood estimate of the scale parameter λ is obtained from the equation λ̂ = α̂/x̄ yielding

λ̂ = 5.01.

Notice that the obtained maximum likelihood estimates are close to the method of moment estimates.
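The same two-step calculation can be carried out in Python; the following sketch uses scipy.special.digamma for Γ′/Γ and scipy.optimize.brentq for the root finding (the bracketing interval is a pragmatic choice, not prescribed by the text):

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

x = np.array([170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180,
              180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199], dtype=float)
n, xbar, x2 = len(x), x.mean(), (x**2).mean()

# method of moments estimates, used to bracket the root
alpha_mom = xbar**2 / (x2 - xbar**2)
lam_mom = alpha_mom / xbar

# maximum likelihood: solve ln(alpha/xbar) = -(1/n) ln t2 + digamma(alpha)
def score(alpha):
    return np.log(alpha / xbar) + np.log(x).mean() - digamma(alpha)

alpha_mle = brentq(score, 1.0, 10 * alpha_mom)
lam_mle = alpha_mle / xbar
print(alpha_mom, lam_mom, alpha_mle, lam_mle)   # all four values are close to those in the text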

Example: Gam(α, 1) model
A random sample (1.23, 0.62, 2.22, 2.55, 1.42) was generated by the gamma distribution with the shape parameter
α = 2 and the scale parameter λ = 1. Treating θ = α as the unknown parameter, and assuming that λ = 1 is known,
we arrive at the Gam(α, 1) model and wish to estimate α using the given sample of size n = 5.
Since, the population mean is µ = α/λ = α, the method of moments estimate is α̃ = x̄ = 1.61. This estimate
should be compared to the true value α = 2. The corresponding likelihood function has the form
L(α) = ∏_{i=1}^{n} (1/Γ(α)) xi^{α−1} e^{−xi} = Γ^{−n}(α) t^{α−1} e^{−8.04},

where the constant 8.04 is obtained as x1 + . . . + xn and

t = x1 · · · xn = 6.13

is a sufficient statistic in this case. The graph of the likelihood function L(α) in the figure below shows that the area
under the curve is much smaller than 1. A closer look at the figure reveals that the maximum likelihood estimate is
slightly smaller than the true value α = 2.

To maximise the log-likelihood function

l(α) = −n ln(Γ(α)) + (α − 1) ln t − 8.04,

take its derivative and put it equal to zero. This results in the equation

0 = ln t − nΓ′ (α)/Γ(α).

The Mathematica command


FindRoot[Log[6.13] == 5*Gamma'[a]/Gamma[a], {a, 1.61}]

gives a numerical solution α̂ = 1.90989, to be compared with the true value α = 2. Here the method of moments
estimate 1.61 is used as the initial guess for the algorithm solving the equation. In this example, the maximum
likelihood estimate brings a drastic improvement compared to the method of moments estimate.

3.4 Exercises
Problem 1
The Poisson distribution has been used by traffic engineers as a model for light traffic. The following table shows the
number of right turns during 300 three-min intervals at a specific intersection.

x frequency
0 14
1 30
2 36
3 68
4 43
5 43
6 30
7 14
8 10
9 6
10 4
11 1
12 1
13+ 0
Fit a Poisson distribution. Comment on the fit by comparing observed and expected counts. It is useful to know that
the 300 intervals were distributed over various hours of the day and various days of the week.

Problem 2
Let (x1 , . . . , xn ) be a random sample from a geometric distribution

P(X = x) = (1 − p)x−1 p, x = 1, 2, . . .

(a) Write down the likelihood function based on this random sample and suggest a simple sufficient statistic.
(b) Find the maximum likelihood estimate of p.
(c) Is this estimate consistent? Explain.

Problem 3
Suppose that X is a discrete random variable with

P(X = 0) = (2/3)θ,
P(X = 1) = (1/3)θ,
P(X = 2) = (2/3)(1 − θ),
P(X = 3) = (1/3)(1 − θ),

where the parameter θ ∈ [0, 1] is unknown. The following 10 independent observations were drawn from this distribu-
tion:
(3, 0, 2, 1, 3, 2, 1, 0, 2, 1).
(a) Find the method of moments estimate θ̃ of θ.
(b) Estimate the standard error of θ̃.

(c) What is the maximum likelihood estimate θ̂ of θ?

(d) Estimate the standard error of θ̂.

Problem 4
Suppose that x is generated by a Bin(n, p) distribution with unknown p.
(a) Show that the maximum likelihood estimate of p is p̂ = x/n.

(b) Given n = 10 and x = 5, sketch the graph of the likelihood function.

Problem 5
A company has manufactured certain objects and has printed a serial number on each object. The serial numbers
start at 1 and end at N , where N is the number of objects that have been manufactured. One of these objects is
selected at random, and the serial number of that object is 888.
(a) What is the method of moments estimate of N ?
(b) What is the maximum likelihood estimate of N ?

Problem 6
To estimate the number N of fish living in a lake, a master's student has applied the capture-recapture method
according to the two-step procedure:
1. capture and tag n = 100 fish, then release them in the lake,
2. capture and release k = 50 fish one by one, and count the number of the tagged fish among those captured on
the second stage.
Suppose x = 20 fish were tagged among the k = 50 fish captured on the second stage. Find a maximum likelihood
estimate of N after suggesting a simple parametric model.

Problem 7
The following 16 numbers came from the normal random number generator on a computer:

5.33 4.25 3.15 3.70


1.61 6.39 3.12 6.59
3.53 4.74 0.11 1.60
5.49 1.72 4.15 2.28

(a) Write down the likelihood function based on this sample. (Hint: to avoid tedious calculations on your calculator
use the numbers in the next subquestion.)
(b) In what sense are the sum of the sample values (which is close to 58) and the sum of their squares (which is close to 260) sufficient statistics in this case?
(c) Turning to the log-likelihood function compute the maximum likelihood estimates for the mean and variance. Is
the obtained variance estimate unbiased?

Problem 8
Let (x1 , . . . , xn ) be a random sample generated by the continuous uniform distribution over the interval [0, θ] with
unknown θ.
(a) Find the method of moments estimate of θ and its mean and variance.
(b) Find the maximum likelihood estimate of θ.
(c) Find the probability density of the maximum likelihood estimator and calculate its mean and variance. Compare
the variance, the bias, and the mean square error to those of the method of moments estimate.
(d) Find a modification of the maximum likelihood estimate that renders it unbiased.

Problem 9
For two factors, starchy-or-sugary and green-or-white base leaf, the following counts for the progeny of self-fertilized
heterozygotes were observed (Fisher 1958)
Type Count
Starchy green c1 = 1997
Starchy white c2 = 906
Sugary green c3 = 904
Sugary white c4 = 32
According to the genetic theory the cell probabilities are
p1 = (2 + θ)/4,    p2 = (1 − θ)/4,    p3 = (1 − θ)/4,    p4 = θ/4,
where 0 < θ < 1. In particular, if θ = 0.25, then the genes are unlinked and the genotype frequencies are
Green White Total
Starchy 9/16 3/16 3/4

Sugary 3/16 1/16 1/4

Total 3/4 1/4 1

(a) Find the likelihood function of θ. Specify a sufficient statistic.


(b) Find the maximum likelihood estimate of θ.

Problem 10
The method of randomised response deals with surveys asking sensitive questions. Suppose we want to estimate the
proportion q of the fishermen who during the last 12 months have gone fishing without a valid permit. We are interested
in the population as a whole - not in punishing particular individuals. Suppose n randomly chosen fishermen have
responded yes/no to a randomised statement according to the instructions in the figure below. Suggest a probability
model for this experiment, and find a method of moments estimate for q. What is the standard error of the estimated
proportion?

Chapter 4

Hypothesis testing

Hypothesis testing is a crucial part of the scientific method. In this chapter we learn about statistical hypothesis testing.

Example: extrasensory perception


Your friend claims that he has extrasensory perception ability. To test this claim you arrange an experiment where
he is asked to guess the suits of n = 100 playing cards. As a result, the friend guessed correctly the suits of 30 cards out
of 100. What would be your conclusion?

One particular way of reasoning in such a case is the topic of this chapter. Under the hypothesis of pure guessing
the number of correct answers is X ∼ Bin(100, 1/4). Then the observed outcome x = 30 deviates from the mean
100 · 1/4 = 25 by about one standard deviation. Since such an outcome is not unusual, a sensible conclusion is that
the data does not contradict the hypothesis of pure guessing.

4.1 Statistical significance


Often we need a rule based on data for choosing between two mutually exclusive hypotheses

H0 : the effect of interest is zero,


H1 : the effect of interest is not zero.
Here the null hypothesis H0 represents an established theory that must be discredited in order to demonstrate a
phenomenon contradicting the established theory stated in the form of the alternative hypothesis H1 . The decision
rule for hypothesis testing is based on a test statistic t = t(x1 , . . . , xn ), a function of the data with distinct typical
values under H0 and H1 . The task is to find an appropriately chosen rejection region R so that

we reject H0 in favour of H1 if and only if t ∈ R.

Making such a decision, we face four possible outcomes:

Negative decision: do not reject H0 Positive decision: reject H0 in favour of H1
If H0 is true True negative outcome False positive outcome, type I error
If H1 is true False negative outcome, type II error True positive outcome

The figure below illustrates the case of two simple hypotheses

H0 : θ = θ0 , H 1 : θ = θ1 ,

where θ1 > θ0 , so that the alternative hypothesis claims a positive effect size θ1 − θ0 . The figure depicts two sampling
distributions of the test statistic T : the null distribution in green and the alternative distribution in blue. The rejection
region consists of the values of t to the right of the red line.

This figure illustrates the following four important conditional probabilities:


α = P(T ∈ R|H0): the conditional probability of type I error, called the significance level of the test; it is given by the area below the green line to the right of the red line.
1 − α = P(T ∉ R|H0): the specificity of the test; it is given by the area below the green line to the left of the red line.
β = P(T ∉ R|H1): the conditional probability of type II error; it is given by the area below the blue line to the left of the red line.
1 − β = P(T ∈ R|H1): the sensitivity, or power, of the test; it is given by the area below the blue line to the right of the red line.
The larger the power of the test 1 − β, the better the ability of the test to recognise the effect of size θ1 − θ0 . It
is desirable to place the red line in such a way that both α and β are minimised, however, according to the figure,
moving the red line to the right or to the left would decrease one of the error sizes and increase the other.

For a given test statistic, if the sample size is fixed, one cannot make both α and β smaller by changing R.

A significance test resolves the conflict between the two types of errors, by controlling the significance level α.
Given the value of α, say 5%, the rejection region R is found from the equation

α = P(T ∈ R|H0 )

using the null distribution of the test statistic T . After the rejection region is determined, the size of type II error is
computed by
β = P(T ∉ R|H1).
To summarise, the significance testing calculations follow the next flow chart

choose α → find R → compute β → compute the power of the test 1 − β.

Example: extrasensory perception


Your friend guessed correctly the suits of y = 30 cards out of n = 100. To analyse this data we use a binomial
model, assuming that the number of cards guessed correctly is generated by the Y ∼ Bin(n, p) distribution, where
the unknown p is the probability of your friend guessing correctly the suit of a playing card. The null hypothesis,
H0 : p = 0.25, posits pure guessing, and the alternative hypothesis of interest, H1 : p > 0.25, suggests the presence of
the extrasensory perception ability. Taking y as a test statistic and putting α = 0.05, define the rejection region by

R = {y : y ≥ y(0.05)},

where the critical value y(0.05) is determined by the equation
0.05 = P(Y ≥ y(0.05)|p = 0.25).
Using the normal approximation for the z-score
Z0 = (Y − 25)/4.33 ≈ N(0, 1) under H0,
we find that
y(0.05) ≈ 25 + 4.33 · 1.645 ≈ 32.
Since the observed test statistic yobs = 30 is below 32, we conclude that the experimental result is not significant, and
we do not reject H0 at the 5% significance level.
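The critical value follows from the normal approximation; a minimal Python check using scipy.stats:

from scipy.stats import norm

n, p0, alpha = 100, 0.25, 0.05
mean, sd = n * p0, (n * p0 * (1 - p0)) ** 0.5   # 25 and 4.33
y_crit = mean + sd * norm.ppf(1 - alpha)        # 25 + 4.33 * 1.645, about 32
print(y_crit)                                   # the observed y = 30 falls below it, so H0 is not rejected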

4.2 Large-sample test for the proportion


The last example is an illustration of the large-sample test for the proportion described next. Consider a random
sample of size n drawn from a Bernoulli distribution
F(µ, σ) = Bin(1, p)
with the unknown population proportion p. Recall that the corresponding sample proportion p̂ gives the maximum
likelihood estimate of p.

Given the null hypothesis H0 : p = p0, use the test statistic z0 = (p̂ − p0)/√(p0(1 − p0)/n).

By the central limit theorem, the null distribution of the z-score is approximately normal: Z0 ≈ N(0, 1) under H0. Depending
on the problem in hand, there might arise three different alternative hypotheses:
H1 : p > p0 , H1 : p < p0 , H1 : p ̸= p0 .
The corresponding rejection region depends on the exact form of the alternative hypothesis
H1 Rejection region
one-sided p > p0 R = {z ≥ z(α)}
one-sided p < p0 R = {z ≤ −z(α)}
two-sided p ̸= p0 R = {z ≤ −z( α2 )} ∪ {z ≥ z( α2 )}
where z(α) is found from Φ(z(α)) = 1 − α using the normal distribution table.

Confidence interval method of hypotheses testing


Consider testing H0 : p = p0 against the two-sided alternative H1 : p ̸= p0 . Observe that at the significance level α,
the rejection rule can be expressed as
R = {p0 ∉ Ip},
in terms of a 100(1-α)% confidence interval for the proportion. The corresponding decision rule is simple: reject the
null hypothesis stating p = p0 , if the confidence interval Ip does not cover the value p0 .

Power of the test


Consider the one-sided setting
H0 : p = p0 against H1 : p > p0 .
The power function Pw(p) of the one-sided test can be computed using the normal approximation Z ≈ N(0, 1) under
H1 , for the sampling distribution of the z-score
z = (p̂ − p)/√(p(1 − p)/n).

To this end observe that

z√(p(1 − p)/n) = z0√(p0(1 − p0)/n) + p0 − p.
We have
Pw(p) = P(Z0 ≥ z(α)|H1) = P(Z ≥ [z(α)√(p0(1 − p0)) + √n(p0 − p)]/√(p(1 − p)) | H1)
       ≈ 1 − Φ([z(α)√(p0(1 − p0)) + √n(p0 − p)]/√(p(1 − p))),    p > p0.
Observe that with p = p0, we get Pw(p) = α. The larger the effect size p − p0, the larger the power Pw(p) of the test.

Planning the sample size
In the setting
H0 : p = p0 against H1 : p = p1 ,
the formula

√n ≈ [z(α)√(p0(1 − p0)) + z(β)√(p1(1 − p1))]/|p1 − p0|
gives the sample size n corresponding to given values of α and β. If the alternatives are very close to each other, the
denominator goes to zero and the required sample size becomes very large. This is very intuitive as it becomes more
difficult to distinguish between two close parameter values. On the other hand, if we decrease the levels α and β, the
values z(α) and z(β) from the normal distribution table become larger and the corresponding sample size n = n(α, β)
will be larger as well, meaning that for both types of errors to be small, you have to collect more data.
Next, we derive this formula for p1 > p0 leaving the other case p1 < p0 as an exercise. As shown before, for p1 > p0 ,
1 − β = Pw(p1) ≈ 1 − Φ([z(α)√(p0(1 − p0)) + √n(p0 − p1)]/√(p1(1 − p1)))

yielding the equation

β ≈ Φ([z(α)√(p0(1 − p0)) + √n(p0 − p1)]/√(p1(1 − p1))).

Combining this with

β = Φ(−z(β)),

we arrive at the relation

[z(α)√(p0(1 − p0)) + √n(p0 − p1)]/√(p1(1 − p1)) ≈ −z(β),
which brings the desired formula.
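The sample-size formula is easy to wrap into a small helper function; a Python sketch (the function name and arguments are illustrative):

from scipy.stats import norm

def sample_size(p0, p1, alpha, beta):
    # approximate n for testing H0: p = p0 against H1: p = p1
    # with significance level alpha and type II error probability beta
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    num = z_a * (p0 * (1 - p0)) ** 0.5 + z_b * (p1 * (1 - p1)) ** 0.5
    return (num / abs(p1 - p0)) ** 2

print(sample_size(0.25, 0.30, alpha=0.05, beta=0.10))   # about 675, as in the card-guessing example below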

Binomial test
The binomial test is a test for proportion for a small sample size n. Consider a random sample (x1 , . . . , xn ) drawn
from a Bernoulli distribution Bin(1, p) with the unknown population proportion p. Let c = x1 + . . . + xn count the
number of successes, so that the corresponding random variable has a binomial distribution

C ∼ Bin(n, p).

Under H0 : p = p0 , we have C ∼ Bin(n, p0 ). Using this distribution define bα and cα by

P(C ≤ bα |H0 ) = α, P(C ≥ cα |H0 ) = α.

Then the rejection region of the binomial test is determined according to the next table
H1 Rejection region
one-sided p > p0 R = {c ≥ cα }
one-sided p < p0 R = {c ≤ bα }
two-sided p ̸= p0 R = {c ≤ bα/2 } ∪ {c ≥ cα/2 }
To illustrate the binomial test, consider the extrasensory perception test, where the subject is asked to guess the
suits of n = 20 cards. The number of cards guessed correctly is C ∼ Bin(20, p) and the null hypothesis of interest is
H0 : p = 0.25. The null distribution Bin(20, 0.25) of the test statistic c gives the following probabilities

c 8 9 10 11
P(C ≥ c) 0.101 0.041 0.014 0.004

We conclude that in particular, for the one-sided alternative H1 : p > 0.25 and α = 4.1%, the rejection region of the
binomial test is R = {c ≥ 9}. The corresponding power function

Pw(p) = P(C ≥ 9|C ∼ Bin(20, p))

takes the following values


p 0.27 0.30 0.40 0.50 0.60 0.70
Pw(p) 0.064 0.113 0.404 0.748 0.943 0.995
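Both the tail probabilities and the power values can be read off the Bin(20, p) distribution; a short Python sketch using scipy.stats.binom:

from scipy.stats import binom

n, p0 = 20, 0.25
for c in range(8, 12):
    print(c, binom.sf(c - 1, n, p0))     # P(C >= c | H0): 0.101, 0.041, 0.014, 0.004

for p in [0.27, 0.30, 0.40, 0.50, 0.60, 0.70]:
    print(p, binom.sf(8, n, p))          # power P(C >= 9 | p), matching the table above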

4.3 P-value of the test
The p-value of a test based on the test statistic t is the probability p of obtaining a test statistic value as extreme as or more extreme than the observed value tobs, given that H0 is true.
For a given significance level α, we reject H0 , if p ≤ α, and do not reject H0 otherwise.
The p-value depends on the observed data tobs and therefore, is a realisation of a random variable P. The source of
randomness is in the sampling procedure: if you take another sample, you obtain a different p-value. To illustrate,
suppose we are testing H0 : θ = θ0 versus H1 : θ > θ0 with help of a test statistic Z whose null distribution is N(0,1).
In this case, the p-value is computed as

p = P(Z > zobs |H0 ) = 1 − Φ(zobs ),

and in terms of the underlying random variables

P = P(Z > Zobs |H0 ) = 1 − Φ(Zobs ).


Since Zobs ∼ N(0, 1) under H0, we conclude that the p-value has a uniform null distribution:

P(P ≤ p|H0 ) = P(1 − Φ(Zobs ) ≤ p|H0 ) = P(Φ(Zobs ) ≥ 1 − p|H0 ) = P(Zobs ≥ Φ−1 (1 − p)|H0 ) = 1 − Φ(Φ−1 (1 − p)) = p.

Example: extrasensory perception


A subject is asked to guess the suits of n = 100 cards, and we want to test

H0 : p = 0.25 (pure guessing), against H1 : p > 0.25 (extrasensory perception ability),

Applying the large sample test for proportion we find the rejection rule at 5% significance level to be

R = {(p̂ − 0.25)/0.0433 ≥ 1.645} = {p̂ ≥ 0.32} = {y ≥ 32},

where y is the number of suits guessed correctly. With a simple alternative H1 : p = 0.30 the power of the test is

1 − Φ((1.645 · 0.433 − 0.5)/0.458) = 32%.

The sample size required for the 90% power is

n = ((1.645 · 0.433 + 1.28 · 0.458)/0.05)² = 675.

If the observed sample count is yobs = 30, then the observed z-score z0 = (p̂ − p0)/√(p0(1 − p0)/n) takes the value

zobs = (0.3 − 0.25)/0.0433 = 1.15,
and the one-sided p-value is computed as

P(Z0 ≥ 1.15|H0 ) = 12.5%.

Since the p-value is larger than 10%, the result is not significant, and we do not reject H0.

Large-sample test for mean


The large-sample test for the mean deals with H0 : µ = µ0 against either the two-sided or a one-sided alternative for
continuous or discrete data. The corresponding test statistic is the t-score
t0 = (x̄ − µ0)/s_x̄.
This test works well even if the population distribution F(µ, σ) is not necessarily normal, provided the sample size n
is sufficiently large. The rejection region is computed using the normal approximation of the null distribution
T0 ≈ N(0, 1) under H0.

One-sample t-test
For small n, under the assumption that the population distribution is normal

F(µ, σ) = N(µ, σ),

the t-test of H0 : µ = µ0 is based on the exact null distribution


T0 ∼ t_{n−1} under H0.

4.4 Likelihood-ratio test
A general method of finding asymptotically optimal tests (having the largest power for a given α) uses the likelihood
ratio as the test statistic. Consider first the case of two simple hypotheses. For testing

H0 : θ = θ0 against H1 : θ = θ1 ,
use the likelihood ratio L(θ0)/L(θ1) as the test statistic. A larger value of the likelihood ratio would suggest that the parameter value θ0 explains the data set better than θ1, while a smaller value of L(θ0)/L(θ1) would indicate that the alternative hypothesis H1 explains the data set better than H0.

The likelihood-ratio test rejects H0 for small values of the likelihood-ratio.

By the Neyman-Pearson lemma, see Wikipedia, the likelihood-ratio test is the optimal test in the case of two simple
hypotheses.

Nested hypotheses
The general case of two composite alternatives can be stated in term of a pair of nested parameter sets Ω0 ⊂ Ω

H0 : θ ∈ Ω0 against H1 : θ ∈ Ω \ Ω0 .

It will be more convenient to recast this setting in terms of two nested hypotheses

H 0 : θ ∈ Ω0 , H : θ ∈ Ω,

leading to two maximum likelihood estimates


θ̂0 = the value of θ maximising the likelihood function L(θ) over θ ∈ Ω0,
θ̂ = the value of θ maximising the likelihood function L(θ) over θ ∈ Ω.
In this case, the likelihood-ratio is defined by
w = L(θ̂0)/L(θ̂),
and the likelihood-ratio test rejects H0 for smaller values of w, or equivalently for larger values of

− ln w = ln L(θ̂) − ln L(θ̂0 ).

Observe that 0 < w ≤ 1, so that − ln w ≥ 0. By Wilks’ theorem, see Wikipedia, the test statistic (−2 ln w) has a nice
approximation for its null distribution
−2 ln W ≈ χ²_df under H0,    where df = dim(Ω) − dim(Ω0).

The approximation is valid for sufficiently large sample sizes n.

4.5 Chi-squared test of goodness of fit


Suppose that the random sample consists of n independent observations, with each observation belonging to one of
J classes with probabilities (p1 , . . . , pJ ). Such data are summarised as the vector of observed counts whose joint
distribution is multinomial
(C1, . . . , CJ) ∼ Mn(n; p1, . . . , pJ),    P(C1 = c1, . . . , CJ = cJ) = (n!/(c1! · · · cJ!)) p1^{c1} · · · pJ^{cJ}.
The general parameter space

Ω = {(p1 , . . . , pJ ) : p1 + . . . + pJ = 1, p1 ≥ 0, . . . , pJ ≥ 0}

has dimension
dim(Ω) = J − 1.
Consider a parametric model for the data
H0 : (p1 , . . . , pJ ) ∈ Ω0 ,
where
Ω0 = {(p1 , . . . , pJ ) ∈ Ω : (p1 , . . . , pJ ) = (p1 (λ), . . . , pJ (λ)), λ ∈ Λ},
with
dim(Ω0 ) = r, 0 ≤ r < J − 1.

To see if the proposed model fits the data, compute λ̂, the maximum likelihood estimate of λ, and then the expected
cell counts
Ej = n · pj (λ̂),
where "expected" means expected under the null hypothesis model. In this special setting, the above mentioned
likelihood-ratio test statistic
−2 ln w ≈ x2
is approximated by the so-called chi-squared test statistic
x² = Σ_{j=1}^{J} (cj − Ej)²/Ej,

and the approximate null distribution of the chi-squared test statistic is


X² ≈ χ²_{J−1−r} under H0,

where the number of degrees of freedom is computed as the difference

df = dim(Ω) − dim(Ω0 ) = (J − 1) − r = J − 1 − r.

The chi-squared test is approximate; it is recommended that all expected counts be at least 5. If some expected counts are smaller, combine small cells into larger cells and recalculate the number of degrees of freedom df.

Example: geometric model


Returning to the data on the number of hops for birds, consider

H0 : number of hops that a bird does between flights has a geometric distribution Geom(p).

Using p̂ = 0.358 and J = 7 we obtain x2 = 1.86. Using the chi-squared distribution table with df =7 − 1 − 1 = 5, we
find that the p-value is close to 90%. This implies that the geometric distribution model fits very well to the data.
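The quoted p-value is the upper tail of the χ² distribution with 5 degrees of freedom at 1.86; for instance, in Python:

from scipy.stats import chi2
print(chi2.sf(1.86, 5))   # about 0.87, so the geometric model is not rejected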

Case study: sex ratio in German families


An 1889 study made in Germany recorded the numbers of boys (x1, . . . , xn) for n = 6115 families with 12 children each.
Each xi is an independent realisation of the random variable X having a discrete distribution

px = P(X = x), x = 0, 1, . . . , 12.

The corresponding parameter space Ω has dimension dim(Ω) = 12. The data is given in the table below in the form
of thirteen observed counts.
Number of boys x    Observed count cx    Model 1: Ex and (cx − Ex)²/Ex    Model 2: Ex and (cx − Ex)²/Ex
0 7 1.5 20.2 2.3 9.6
1 45 17.9 41.0 26.1 13.7
2 181 98.5 69.1 132.8 17.5
3 478 328.4 68.1 410.0 11.3
4 829 739.0 11.0 854.2 0.7
5 1112 1182.4 4.2 1265.6 18.6
6 1343 1379.5 1.0 1367.3 0.4
7 1033 1182.4 18.9 1085.2 2.5
8 670 739.0 6.4 628.1 2.8
9 286 328.4 5.5 258.5 2.9
10 104 98.5 0.3 71.8 14.4
11 24 17.9 2.1 12.1 11.7
12 3 1.5 1.5 0.9 4.9
Total 6115 6115 x2 = 249.2 6115 x2 = 110.5

Model 1
A simple way for describing the vector of parameters (p0 , . . . , p12 ) is to use the symmetric binomial distribution
X ∼ Bin(12, 0.5). This leads to the setting
 
H0 : (p0, . . . , p12) = (p0^(0), . . . , p12^(0)),    where px^(0) = (12 choose x) · 2^{−12},    x = 0, 1, . . . , 12,

with the set Ω0 consisting of a single point
Ω0 = {(p0^(0), . . . , p12^(0))},
so that dim(Ω0 ) = 0. Under this H0 , the expected counts are computed as
 
Ex = n·px^(0) = 6115 · (12 choose x) · 2^{−12},    x = 0, 1, . . . , 12,
see the table above. The observed chi-squared test statistic is x² = 249.2 with df = 12. Since x_{12}(0.005) = 28.3, we can reject H0 at the 0.5% level.

Model 2
Consider a more flexible model X ∼ Bin(12, λ) with an unspecified probability of a boy λ. The corresponding null
hypothesis takes the form
 
H0 : (p0, . . . , p12) = (p0(λ), . . . , p12(λ)),    where px(λ) = (12 choose x) · λ^x (1 − λ)^{12−x},    x = 0, . . . , 12,    0 ≤ λ ≤ 1.
Clearly, the corresponding parameter space Ω0 has dimension dim(Ω0 ) = 1. The expected cell counts
 
Ex = 6115 · (12 choose x) · λ̂^x · (1 − λ̂)^{12−x}
are computed using the maximum likelihood estimate of the proportion of boys λ
λ̂ = (total number of boys)/(total number of children) = (1 · 45 + 2 · 181 + . . . + 12 · 3)/(6115 · 12) = 0.481.
The observed chi-squared test statistic x² = 110.5 is much smaller than the one for Model 1. However, since x_{11}(0.005) = 26.76, even Model 2 should be rejected at the 0.5% significance level.
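A compact way to reproduce the Model 2 computation is sketched below in Python (scipy.stats.binom gives the Bin(12, λ̂) probabilities; the observed counts are those of the table above):

import numpy as np
from scipy.stats import binom, chi2

counts = np.array([7, 45, 181, 478, 829, 1112, 1343, 1033, 670, 286, 104, 24, 3])
n, boys = counts.sum(), np.arange(13)              # 6115 families, 0..12 boys

lam_hat = (boys * counts).sum() / (12 * n)         # 0.481, the estimated probability of a boy
expected = n * binom.pmf(boys, 12, lam_hat)
x2 = ((counts - expected) ** 2 / expected).sum()   # about 110
df = 13 - 1 - 1                                    # J - 1 - r with J = 13 cells and r = 1
print(lam_hat, x2, chi2.sf(x2, df))                # the p-value is essentially zero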

Case summary
The figure below compares the observed counts (black histogram) with the Model 1 expected counts (red line) and the
Model 2 expected counts (blue line). The blue line has a better fit to the data; however, it underestimates the variation of the observed cell counts. To make the model even more flexible, one should allow the probability of a boy λ
to differ from family to family.

4.6 Exercises
Problem 1
Suppose that X ∼ Bin(100, p). Consider a test
H0 : p = 1/2, H1 : p ̸= 1/2.
that rejects H0 in favour of H1 for |x − 50| > 10. Use the normal approximation to the binomial distribution to
respond to the following items:

(a) What is α?
(b) Graph the power as a function of p.

Problem 2
This problem introduces the case of a one-sided null hypothesis. A random sample (x1, . . . , xn) is drawn from a normal
population distribution N(µ, 1). Consider two alternative composite hypotheses

H0 : µ ≤ µ0 , H1 : µ > µ 0 .

(a) Rewrite H0 and H1 as a pair of nested hypotheses H0 and H.

(b) Demonstrate that the likelihood function satisfies

L(µ) ∝ exp{−(n/2)(µ − x̄)²}.

(c) Show that the corresponding likelihood ratio has the form
w = 1 if x̄ < µ0,    w = e^{−(n/2)(x̄ − µ0)²} if x̄ ≥ µ0.

(d) Explain why the rejection region of the likelihood ratio test can be expressed as

R = {x̄ > µ0 + cα },

where cα is determined by the equation


α = max_{µ∈H0} P(X̄ > µ0 + cα | µ).

In particular, for α = 0.05 and n = 25, show that

R = {x̄ > µ0 + 0.33}.

Problem 3
Let (x1 , . . . , xn ) be a sample from a Poisson distribution Pois(µ). Find the likelihood ratio for testing H0 : µ = µ0
against H1 : µ ̸= µ0 . Use the fact that the sum of independent Poisson random variables follows a Poisson distribution
to explain how to determine a rejection region for this test at the significance level α.

Problem 4
Let (x1 , . . . , x25 ) be a sample from a normal distribution having a variance of 100.
(a) Find the rejection region for a test at level α = 0.1 of H0 : µ = 0 versus H1 : µ = 1.5.
(b) What is the power of the test?
(c) Repeat (a) and (b) for α = 0.01.

Problem 5
Under H0 , a random variable has a cumulative distribution function

F (x) = x2 , 0 ≤ x ≤ 1,

and under H1 , it has a cumulative distribution function

F (x) = x3 , 0 ≤ x ≤ 1.

(a) What is the form of the likelihood ratio test of H0 versus H1 ?


(b) What is the rejection region of this test for the level α?
(c) What is the power of the test?

Problem 6
A random sample from N(µ, σ) results in Iµ = (−2, 3) as a 99% confidence interval for µ. Test

H0 : µ = −3 against H1 : µ ̸= −3

at α = 0.01.

Problem 7
Let (x1 , . . . , x15 ) be a random sample from a normal distribution N(µ, σ). The sample standard deviation is s = 0.7.
Test H0 : σ = 1 versus H1 : σ < 1 at the significance level α = 0.05.

Problem 8
Consider the binomial model for the data value x:

X ∼ Bin(n, p).

(a) What is the likelihood ratio for testing H0 : p = 0.5 against H1 : p ̸= 0.5?
(b) Show that the corresponding likelihood ratio test should reject for larger values of |x − n/2|.
(c) For the rejection region
R = {|x − n/2| > k}
determine the significance level α as a function of k.

(d) If n = 10 and k = 2, what is the significance level of the test?


(e) Use the normal approximation to the binomial distribution to find the significance level given n = 100 and
k = 10.

Problem 9
Suppose that a test statistic Z has a standard normal distribution under the null hypothesis.
(a) If the test rejects for larger values of |z|, what is the p-value corresponding to the observed value 1.5 of the test
statistic z?

(b) Answer the same question if the test rejects only for larger values of z.

Problem 10
It has been suggested that H1 : dying people may be able to postpone their death until after an important occasion,
such as a wedding or birthday. Phillips and King (1988) studied the patterns of death surrounding Passover, an
important Jewish holiday.
(a) California data (1966–1984). They compared the number of deaths during the week before Passover to the
number of deaths during the week after Passover for 1919 people who had Jewish surnames. Of these, 922
occurred in the week before and 997 in the week after Passover. Apply a statistical test to see if there is evidence
supporting the claim H1 .
(b) For 852 males of Chinese and Japanese ancestry, 418 died in the week before and 434 died in the week after
Passover. Can we reject H0 : death cannot be postponed, using these numbers?

Problem 11
If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur with probabilities

p1 = (1 − θ)²,    p2 = 2θ(1 − θ),    p3 = θ²,

for some 0 ≤ θ ≤ 1. Plato et al. (1964) published the following data on haptoglobin type in a sample of 190 people
Genotype AA Aa aa
Observed count 10 68 112
Test the goodness of fit of the data to the equilibrium model.

Problem 12
Check for the seasonal variation in the following data on the US suicides in 1970.

Month Number of suicides
Jan 1867
Feb 1789
Mar 1944
Apr 2094
May 2097
Jun 1981
Jul 1887
Aug 2024
Sep 1928
Oct 2032
Nov 1978
Dec 1859

Problem 13
In 1965, a newspaper carried a story about a high school student who reported getting 9207 heads and 8743 tails in
17950 coin tosses.

(a) Is this a significant discrepancy from the null hypothesis H0 : p = 1/2, where p is the probability of heads?
(b) A statistician contacted the student and asked him exactly how he had performed the experiment (Youden
1974). To save time the student had tossed groups of five coins at a time, and a younger brother had recorded
the results, shown in the table:

number of heads 0 1 2 3 4 5 Total


observed 100 524 1080 1126 655 105 3590

Are the data consistent with the hypothesis that all the coins were fair (p = 1/2)?
(c) Are the data consistent with the hypothesis that all five coins had the same probability of heads but this
probability was not necessarily 1/2?

Chapter 5

Bayesian inference

The statistical tools introduced in this course so far are based on the so-called frequentist approach. In the parametric
case, the frequentist treats the data x as randomly generated by a distribution f (x|θ) involving the unknown true
population parameter value θ, which may be estimated using the method of maximum likelihood. This section presents
basic concepts of the Bayesian approach relying on the following model for the observed data x:
prior distribution g(θ) −→ generates a value θ −→ the likelihood f(x|θ) generates the data x.

The model assumes that before the data is collected the parameter of interest θ is randomly generated by a prior
distribution g(θ). The computational power of the Bayesian approach stems from the possibility to treat θ as a
realisation of a random variable Θ.
The prior distribution g(θ) brings into the statistical model our knowledge (or lack of knowledge) on θ before the
data x is generated using a conditional distribution f (x|θ), which in this section is called the likelihood function. After
the data x is generated by such a two-step procedure involving the pair g(θ) and f (x|θ), we may update our knowledge
on θ and compute a posterior distribution h(θ|x) using the Bayes formula

h(θ|x) = f(x|θ)g(θ)/ϕ(x).

The denominator, depending on whether the distribution is continuous or discrete,


ϕ(x) = ∫ f(x|θ)g(θ) dθ    or    ϕ(x) = Σ_θ f(x|θ)g(θ)

gives the marginal distribution of the random data X. For a fixed realization x, treating the denominator ϕ(x) as a
constant which does not explicitly involve θ, the Bayes formula can be summarized as

posterior ∝ likelihood × prior

where the sign ∝ means proportional.


If we have no prior knowledge on θ, the prior distribution is often modelled by the uniform distribution. In this
case of uninformative prior, with g(θ) being a constant over a certain interval, we have h(θ|x) ∝ f (x|θ), implying that
the posterior knowledge comes solely from the likelihood function.

Example: IQ measurement
A randomly chosen individual has an unknown true intelligence quotient value θ. Suppose the IQ test is calibrated in
such a way that θ can be viewed as a realisation of a random variable Θ having the normal prior distribution N(100, 15).
This normal distribution describes the population distribution of people’s IQ with population mean µ0 = 100 and
population standard deviation σ0 = 15. For a person with an IQ value θ, the result x of an IQ measurement is
generated by another normal distribution N(θ, 10), with no systematic error and a random error σ = 10. Since
g(θ) = (1/(√(2π)σ0)) e^{−(θ−µ0)²/(2σ0²)},    f(x|θ) = (1/(√(2π)σ)) e^{−(x−θ)²/(2σ²)},
we find that likelihood times prior equals
g(θ)f(x|θ) = (1/(2πσ0σ)) e^{−(θ−µ0)²/(2σ0²) − (x−θ)²/(2σ²)},

which is proportional to

exp{−(θ² − 2µ1θ)/(2σ1²)} ∝ exp{−(θ − µ1)²/(2σ1²)},

where
µ1 = γµ0 + (1 − γ)x,    σ1² = γσ0²,    γ = σ²/(σ² + σ0²).
It follows that the posterior distribution is also normal N(µ1 , σ1 ).
The parameter

γ = σ1²/σ0²

is called a shrinkage factor. Lying in (0, 1), it measures the reduction in variance when the posterior distribution is compared to the prior distribution. The smaller γ is, the larger the gain from the data.
In particular, if the observed IQ result is x = 130, then the posterior distribution becomes N(120.7, 8.3). The
posterior mean µ1 = 120.7 is obtained as a down-corrected measurement result x = 130 in view of the lower prior
expectation µ0 = 100. The posterior variance σ1² = 69.2 is smaller than the prior variance σ0² = 225 by the
shrinkage factor γ = 0.308, reflecting the fact that the updated knowledge brings much more certainty about the true
IQ value compared to the available prior knowledge.
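The posterior parameters of the IQ example follow directly from these formulas; a few lines of Python:

mu0, sigma0 = 100.0, 15.0     # prior N(100, 15)
sigma = 10.0                  # measurement error
x = 130.0                     # observed IQ score

gamma = sigma**2 / (sigma**2 + sigma0**2)   # shrinkage factor, about 0.308
mu1 = gamma * mu0 + (1 - gamma) * x         # posterior mean, about 120.7
sigma1 = (gamma * sigma0**2) ** 0.5         # posterior standard deviation, about 8.3
print(gamma, mu1, sigma1)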

The figure on the right depicts three probability density curves over the values of the parameter θ representing possible IQ values. Observe that the posterior curve is close to zero not only for those θ where the prior curve is close to zero, but also for those θ where the likelihood curve is close to zero. As a result, the posterior curve becomes narrower than the prior curve.

5.1 Conjugate priors


Suppose the data x is generated by a parametric model having the likelihood function f (x|θ). Consider a parametric
family of the prior distributions G.

G is called a family of conjugate priors for the likelihood function f (x|θ)


if for any prior g(θ) ∈ G, the corresponding posterior distribution h(θ|x) ∈ G

The next table presents six Bayesian models involving conjugate priors. The details of the first three models come
next. Notice that the posterior variance is always smaller than the prior variance. This list also illustrates that the
contribution of the prior distribution to the posterior distribution becomes smaller as the sample size n increases.

Parametric model for the data Unknown θ Prior Posterior distribution



X1, . . . , Xn ∼ N(µ, σ)    θ = µ    N(µ0, σ0)    N(γnµ0 + (1 − γn)x̄; σ0√γn)
X ∼ Bin(n, p) θ=p Beta(a, b) Beta(a + x, b + n − x)
(X1 , . . . , Xr ) ∼ Mn(n; p1 , . . . , pr ) θ = (p1 , . . . , pr ) Dir(α1 , . . . , αr ) Dir(α1 + x1 , . . . , αr + xr )
X1 , . . . , Xn ∼ Geom(p) θ=p Beta(a, b) Beta(a + n, b + nx̄ − n)
X1 , . . . , Xn ∼ Pois(µ) θ=µ Gam(α0 , λ0 ) Gam(α0 + nx̄, λ0 + n)
X1 , . . . , Xn ∼ Gam(α, λ) θ=λ Gam(α0 , λ0 ) Gam(α0 + αn, λ0 + nx̄)

Normal-normal model
Suppose a random sample (x1 , . . . , xn ) is drawn from the normal distribution N(µ, σ) with a known standard deviation
σ and the unknown mean θ = µ. Taking the normal prior Θ ∼ N(µ0 , σ0 ) with known (µ0 , σ0 ) results in the normal
posterior N(µ1 , σ1 ) with
µ1 = γn µ0 + (1 − γn )x̄, σ12 = σ02 γn ,
where
γn = σ²/(σ² + nσ0²) = (σ²/n)/(σ²/n + σ0²)
is the shrinkage factor which becomes smaller for the larger sample sizes n. As a result for the large samples, the
posterior mean µ1 gets close to the maximum likelihood estimate and the input γn µ0 involving the prior mean becomes
negligible.

Binomial-beta model
Next, we introduce the beta distribution which serves as a convenient family of conjugate priors for Bayesian inference
for p, in the case when the data x is generated by the Bin(n, p).

Beta distribution
Beta distribution Beta(a, b) is determined by two parameters a > 0, b > 0 which are called pseudo-counts. It is defined
by the probability density function
g(p) = (Γ(a + b)/(Γ(a)Γ(b))) p^{a−1}(1 − p)^{b−1},    0 < p < 1,

its mean and variance are given by

µ = a/(a + b),    σ² = µ(1 − µ)/(a + b + 1).
The figure below depicts five different beta-distribution curves.
- flat black Beta(1, 1),
- U-shaped brown Beta(0.5, 0.5),
- bell-shaped red Beta(5, 3),
- L-shaped blue Beta(1, 3),
- J-shaped green Beta(3, 0.5).

Proof of the conjugacy property


To demonstrate that the beta distribution is a conjugate prior for the binomial likelihood, observe that

prior ∝ pa−1 (1 − p)b−1 ,

and
likelihood ∝ px (1 − p)n−x ,
imply
posterior ∝ prior × likelihood ∝ pa+x−1 (1 − p)b+n−x−1 .
This entails that the posterior is also a beta distribution Beta(a1 , b1 ) with the updated parameters

a1 = a + x, b1 = b + n − x.

Example: thumbtack landing on its base


Suppose we are interested in the probability p of a thumbtack landing on its base.
Two experiments are performed. An experiment consists of n tosses of the thumbtack
with the number of base landings X ∼ Bin(n, p) being counted.
Experiment 1: after n1 = 10 tosses, the observed count of the base landings is x1 = 2. We apply the uninformative
prior distribution Beta(1, 1) with the mean µ0 = 0.50 and standard deviation σ0 = 0.29. The resulting posterior
distribution is the Beta(3, 9) distribution with the posterior mean µ1 = 3/12 = 0.25 and standard deviation σ1 = 0.12.

Experiment 2: after n2 = 40 tosses, the observed count of the base landings is x2 = 9. As a new prior distribution
we use the posterior distribution obtained from the first experiment Beta(3, 9). The new posterior distribution becomes
Beta(12, 40) with the mean µ2 = 12/52 = 0.23 and standard deviation σ2 = 0.06.
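The two updating steps amount to adding the observed counts to the pseudo-counts; a Python sketch using scipy.stats.beta for the posterior mean and standard deviation:

from scipy.stats import beta

a, b = 1, 1                       # uninformative prior Beta(1, 1)
for n, x in [(10, 2), (40, 9)]:   # the two thumbtack experiments
    a, b = a + x, b + n - x       # posterior pseudo-counts: Beta(3, 9), then Beta(12, 40)
    print(a, b, beta.mean(a, b), beta.std(a, b))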

Multinomial-Dirichlet model
The multinomial-Dirichlet model is a multivariate version of the binomial-beta model. For both the binomial-beta
and multinomial-Dirichlet models, the updating rule has the form

the posterior pseudo-counts = the prior pseudo-counts plus the sample counts

Dirichlet distribution
The Dirichlet distribution Dir(α1 , . . . , αr ) is a multivariate extension of the beta distribution. It is a probability
distribution over the vectors (p1 , . . . , pr ) with non-negative components such that

p1 + . . . + pr = 1.

The positive parameters α1 , . . . , αr of the Dirichlet distribution are often called the pseudo-counts. The probability
density function of Dir(α1 , . . . , αr ) is given by
g(p1, . . . , pr) = (Γ(α0)/(Γ(α1) · · · Γ(αr))) p1^{α1−1} · · · pr^{αr−1},    α0 = α1 + . . . + αr.

The marginal distributions of the random vector (X1 , . . . , Xr ) ∼ Dir(α1 , . . . , αr ) are the beta distributions

Xj ∼ Beta(αj , α0 − αj ), j = 1, . . . , r.

Different components of the vector have negative covariances


Cov(Xi, Xj) = −αiαj/(α0²(α0 + 1)) for i ≠ j.

The figure below illustrates four examples of Dir(α1 , α2 , α3 ) distribution. Each triangle contains n = 300 points
generated using different sets of parameters (α1 , α2 , α3 ):

upper left (0.3, 0.3, 0.1), upper right (13, 16, 15), lower left (1, 1, 1), lower right (3, 0.1, 1).

A dot in a triangle gives a realisation (x1 , x2 , x3 ) of the vector (X1 , X2 , X3 ) ∼ Dir(α1 , α2 , α3 ) as the distances to the
bottom edge of the triangle (x1 ), to the right edge of the triangle (x2 ), and to the left edge of the triangle (x3 ).

Example: loaded die experiment


A possibly loaded die is rolled 18 times, giving 4 ones, 3 twos, 4 threes, 4 fours, 3 fives, and 0 sixes:

2, 1, 1, 4, 5, 3, 3, 2, 4, 1, 4, 2, 3, 4, 3, 5, 1, 5.

The parameter of interest is the vector of six probabilities θ = (p1 , . . . , p6 ). The data can be viewed as generated by a
multinomial distribution Mn(18; p1 , . . . , p6 ). Since we have no idea about the values of the probabilities (p1 , . . . , p6 ) we
will use the uninformative prior distribution Dir(1, 1, 1, 1, 1, 1). Due to the conjugacy property we obtain the posterior
distribution to be Dir(5, 4, 5, 5, 4, 1), where the posterior pseudo-counts are computed as

(5, 4, 5, 5, 4, 1) = (1, 1, 1, 1, 1, 1) + (4, 3, 4, 4, 3, 0).

5.2 Bayesian estimation


In the language of decision theory, finding a point estimate a for the unknown population parameter θ is an action of
assigning the value a to the unknown parameter θ. In the frequentist setting, the optimal a is found by maximising
the likelihood function. In the Bayesian setting, the optimal choice of a is determined by the so-called loss function
l(θ, a). The so-called Bayes action minimises the posterior risk

R(a|x) = E(l(Θ, a) | x),

computed using the posterior distribution


R(a|x) = ∫ l(θ, a)h(θ|x) dθ    or    R(a|x) = Σ_θ l(θ, a)h(θ|x).

We consider two loss functions leading to two different Bayesian estimators. These two loss functions, called the zero-one loss and the squared error loss,

Zero-one loss function: l(θ, a) = 1{θ̸=a} Squared error loss: l(θ, a) = (θ − a)2

are schematically depicted on the figure below.

Zero-one loss function and maximum a posteriori probability


With the zero-one loss function, the posterior risk is equal to the probability of misclassification
R(a|x) = Σ_{θ≠a} h(θ|x) = 1 − h(a|x).

In this case, to minimise the risk we have to maximise the posterior probability h(a|x). We define θ̂map as the value
of θ that maximises h(θ|x). Observe that with the uninformative prior, θ̂map = θ̂mle .

Squared error loss function and posterior mean estimate


Using the squared error loss function we find that the posterior risk is a sum of two components

R(a|x) = E((Θ − a)2 |x) = Var(Θ|x) + (E(Θ|x) − a)2 .

Since the first component is independent of a, we minimise the posterior risk by putting

θ̂pme = E(Θ|x),

resulting in the posterior mean value as the Bayesian point estimate of θ.

Example: loaded die experiment


Turning to the loaded die experiment, observe that the maximum likelihood estimate based on the sample counts is
given by the vector of sample proportions
θ̂mle = (4/18, 3/18, 4/18, 4/18, 3/18, 0).

Notably, the maximum likelihood estimate assigns value zero to p6 , thereby predicting that in the future observations
there will be no sixes.
Considering the two alternative Bayesian estimates based on the posterior distribution Dir(5, 4, 5, 5, 4, 1)
θ̂map = (4/18, 3/18, 4/18, 4/18, 3/18, 0),    θ̂pme = (5/24, 4/24, 5/24, 5/24, 4/24, 1/24),

we see that the former coincides with θ̂mle , while the latter estimate has the advantage of assigning a positive value
to p6 .
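Both Bayesian estimates are simple functions of the posterior pseudo-counts; a Python sketch (the formula (αj − 1)/(α0 − r) for the posterior mode assumes all pseudo-counts are at least 1, as is the case here):

import numpy as np

prior = np.ones(6)                                 # uninformative Dir(1, ..., 1)
counts = np.array([4, 3, 4, 4, 3, 0])              # observed die counts
post = prior + counts                              # posterior pseudo-counts Dir(5, 4, 5, 5, 4, 1)

theta_map = (post - 1) / (post.sum() - len(post))  # posterior mode: (4/18, ..., 0)
theta_pme = post / post.sum()                      # posterior mean: (5/24, ..., 1/24)
print(theta_map, theta_pme)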

5.3 Credibility interval


Given data x coming from a parametric model with the likelihood function f (x|θ), a 100(1 − α)% confidence interval
for the parameter θ,
Iθ = (a1 (x), a2 (x)),

is viewed as a realisation of a random interval (a1 (X), a2 (X)) such that

P(a1 (X) < θ < a2 (X)) = 1 − α.

This frequentist interpretation of the confidence level 100(1 − α)% is rather cumbersome as it requires mentioning
other samples and potential confidence intervals which as a group cover the true unknown value of θ with probability
1 − α.
In the framework of Bayesian inference we can refer to θ as a realisation of a random variable Θ with a certain
posterior distribution h(θ|x). This allows us to define a 100(1 − α)% credibility interval (or credible interval)

Jθ = (b1 (x), b2 (x))

by the relation based on the posterior distribution

P(b1 (x) < Θ < b2 (x)|x) = 1 − α.

The interpretation of the credibility interval is more intuitive as it does not refer to some potential, never observed
data values.

Example: IQ measurement
Given a single IQ value x = 130, we have X ∼ N(θ; 10) and an exact 95% confidence interval for the true IQ value θ
takes the form
Iθ = 130 ± 1.96 · 10 = 130 ± 19.6.
The interval 130 ± 19.6 has 95% confidence level in the sense that if we repeat the IQ measurement for the same
person, then the new IQ result X will produce a random confidence interval which will cover the true IQ of the subject
with probability 0.95.
With the posterior distribution Θ ∼ N(120.7; 8.3) in hand, a 95% credibility interval for θ is computed as

Jθ = 120.7 ± 1.96 · 8.3 = 120.7 ± 16.3.

The proper understanding of the obtained interval 120.7 ± 16.3 is the following. If we choose another person from the
general population and the person’s IQ turns out to be x = 130, then the new person’s IQ Θ belongs to the interval
120.7 ± 16.3 with probability 0.95. On the other hand, if we get a second IQ test result x2 for the same person, then
we can compute a new credibility interval based on x2 using the posterior distribution after the first IQ test result as
the new prior.
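A minimal Python sketch of both computations, assuming scipy is available (the numbers 130, 10, 120.7 and 8.3 are taken from the example above):

from scipy.stats import norm

z = norm.ppf(0.975)                        # 1.96
ci = (130 - z * 10, 130 + z * 10)          # frequentist 95% confidence interval
cred = (120.7 - z * 8.3, 120.7 + z * 8.3)  # Bayesian 95% credibility interval
print(ci)    # roughly (110.4, 149.6)
print(cred)  # roughly (104.4, 137.0)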

5.4 Bayesian hypotheses testing


Considering the case of two simple hypotheses

H0 : θ = θ 0 against H1 : θ = θ1

we wish to choose between H0 and H1 using not only the two likelihood functions f (x|θ0 ), f (x|θ1 ) but also the prior
probabilities of the two optional values

P(H0 ) = π0 , P(H1 ) = π1 , π0 + π1 = 1.

In terms of an appropriate rejection region R for the available data x, the Bayesian decision should be taken depending on a cost function with the following four cost values:
                Decision            H0 true    H1 true
x ∉ R           Do not reject H0    0          cost1
x ∈ R           Reject H0           cost0      0
where cost0 is the error type I cost and cost1 is the error type II cost. For a given set R, the average cost is the
weighted mean of two values cost0 and cost1
cost0 π0 P(X ∈ R|H0) + cost1 π1 P(X ∉ R|H1) = cost1 π1 + ∫_R (cost0 π0 f(x|θ0) − cost1 π1 f(x|θ1)) dx.

Now observe that


∫_R (cost0 π0 f(x|θ0) − cost1 π1 f(x|θ1)) dx ≥ ∫_{R∗} (cost0 π0 f(x|θ0) − cost1 π1 f(x|θ1)) dx,

where
R∗ = {x : cost0 π0 f (x|θ0 ) < cost1 π1 f (x|θ1 )}.

It follows that the rejection region minimising the average cost is R = R∗ . Taking R∗ as the rejection region, we
should reject H0 for values of the likelihood ratio which are smaller than a certain critical value:

f(x|θ0)/f(x|θ1) < (cost1 π1)/(cost0 π0),

determined by the prior odds π0 /π1 and the cost ratio cost1 /cost0 . In other terms, we reject H0 for the values of the
posterior odds smaller than the cost ratio
h(θ0|x)/h(θ1|x) < cost1/cost0.

Example of Bayesian hypothesis testing


The defendant N, charged with rape, is a 37-year-old male living in an area not far from the crime scene. The jury has to choose between two alternative hypotheses, H0 : N is innocent and H1 : N is guilty. There are three conditionally
independent pieces of evidence
E1 : a DNA match,

E2 : defendant N is not recognised by the victim,


E3 : an alibi supported by the N’s girlfriend.
The reliability of these pieces of evidence was quantified as follows:

P(E1|H0) = 1/200,000,000, P(E1|H1) = 1: very strong evidence in favour of H1, with P(E1|H0)/P(E1|H1) = 1/200,000,000,

P(E2|H1) = 0.1, P(E2|H0) = 0.9: strong evidence in favour of H0, with P(E2|H0)/P(E2|H1) = 9,

P(E3|H1) = 0.25, P(E3|H0) = 0.5: evidence in favour of H0, with P(E3|H0)/P(E3|H1) = 2.

For the sake of Bayesian inference the non-informative prior probability


π1 = P(H1) = 1/200,000,

is suggested, taking into account the number of males who theoretically could have committed the crime without any
evidence taken into account. This yields the prior odds for H0 to be very high
π0/π1 = 200,000.
The resulting posterior odds is

P(H0|E1, E2, E3)/P(H1|E1, E2, E3) = (π0/π1) · P(E1, E2, E3|H0)/P(E1, E2, E3|H1) = (π0/π1) · [P(E1|H0)/P(E1|H1)] · [P(E2|H0)/P(E2|H1)] · [P(E3|H0)/P(E3|H1)] = 0.018.

Conclusion: the defendant N would be deemed guilty if the cost values assigned by the jury are such that

cost1/cost0 = (cost of an unpunished crime)/(cost of punishing an innocent person) > 0.018.
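The posterior odds above can be reproduced with a few lines of Python (a sketch; the likelihood ratios are exactly those listed for E1, E2, E3):

prior_odds = 200_000                # pi0 / pi1
lr1 = 1 / 200_000_000               # P(E1|H0) / P(E1|H1), DNA match
lr2 = 0.9 / 0.1                     # P(E2|H0) / P(E2|H1), not recognised by the victim
lr3 = 0.5 / 0.25                    # P(E3|H0) / P(E3|H1), alibi
posterior_odds = prior_odds * lr1 * lr2 * lr3
print(posterior_odds)               # 0.018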

5.5 Exercises
Problem 1
This is a continuation of Problem 3 (a–d) from Section 3.4.

(e) Assume the uniform prior for the parameter θ and find the posterior density. Sketch the posterior curve. Find
the MAP estimate of θ.

Problem 2
In an ecological study of the feeding behaviour of birds, the number of hops between flights was counted for several
birds.
Number of hops j 1 2 3 4 5 6 7 8 9 10 11 12 Tot
Observed count cj 48 31 20 9 6 5 4 2 1 1 2 1 130
Assume that the data were generated by a Geom(p) model and take the uniform prior for p. What is the posterior
distribution and what are the posterior mean and standard deviation?

Problem 3
On the Laplace rule of succession.
Laplace claimed that when an event happens n times in a row and never fails to happen, the probability that the event will occur the next time is (n + 1)/(n + 2). Can you suggest a rationale for this claim?

Problem 4
It is known that the random variable X has one of the following two distributions
X-values x1 x2 x3 x4
P(x|H0 ) 0.2 0.3 0.3 0.2
P(x|H1 ) 0.1 0.4 0.1 0.4
(a) Compare the likelihood ratio, P(x|H0)/P(x|H1), for each xi and order the xi accordingly.

(b) What is the likelihood ratio test of H0 versus H1 at level α = 0.2? What is the test at level α = 0.5?
(c) If the prior probabilities are P(H0 ) = P(H1 ) = 1/2, which outcomes favour H0 ?
(d) What prior probabilities correspond to the decision rules with α = 0.2 and α = 0.5?

Problem 5
Suppose that under H0 , a measurement X is N(0, σ), and under H1 , the measurement X is N(1, σ). Assume that the
prior probabilities satisfy
P(H0 ) = 2P(H1 ).
The hypothesis H0 will be chosen if P(H0 |x) > P(H1 |x). Answer the following questions referring to the different
choices of σ 2 = 0.1, 0.5, 1.0, 5.0.
(a) For what values of X = x will H0 be chosen?
(b) In the long run, what proportion of the time will H0 be chosen if H0 is true 2/3 of the time?

Problem 6
Under H0 , a random variable X has a cumulative distribution function F (x) = x2 , 0 ≤ x ≤ 1, and under H1 , it has
a cumulative distribution function F (x) = x3 , 0 ≤ x ≤ 1.
If the two hypotheses have equal prior probabilities, for what values of x is the posterior probability of H0 greater
than that of H1 ?

Problem 7
Suppose your prior beliefs about the probability p of success have mean 1/3 and variance 1/32. What is the posterior
mean after having observed 8 successes in 20 trials?

Problem 8
Mice were injected with a bacterial solution, some of the mice were also given penicillin. The results were
Without penicillin With penicillin
Survived 8 12
Died 48 62
(a) Find a 95% confidence interval for the difference between two probabilities of survival.
(b) Assume that both groups have the probability of survival p. How would you compute an exact credibility interval
for the population proportion p, if you could use a computer? Compute an approximate 95% credibility interval
using a normal approximation.

Problem 9
The gamma distribution Gam(α, λ) is a conjugate prior for the Poisson likelihood with mean θ. If x is a single
observed value randomly sampled from the Pois(θ) distribution, then the parameters (α1 , λ1 ) for the posterior gamma
distribution of Θ are found by the following updating rule:
- the shape parameter α1 = α + x,
- the inverse scale parameter λ1 = λ + 1.
(a) Find θ̂PME , the posterior mean estimate of θ, under the exponential prior Exp(1), given the following
random sample values from the Pois(θ) population distribution

x1 = 2, x2 = 0, x3 = 2, x4 = 5.

(b) What is the updating rule for an arbitrary sample size n? Compare the value of θ̂PME with the maximum
likelihood estimator θ̂MLE as n → ∞. Your conclusions?

Problem 10
Consider the Beta(a, b) distribution. Verify that given a > 1 and b > 1, the maximum of the probability density
function g(p) is attained at
p∗ = (a − 1)/(a + b − 2).

Problem 11
Credible intervals are not unique on a posterior distribution. Methods for defining a suitable credible interval include:
1. choosing the narrowest interval, which for a unimodal distribution will involve choosing those values of highest
probability density including the mode (the maximum a posteriori). This is sometimes called the highest posterior
density interval (HPDI),

2. choosing the interval where the probability of being below the interval is as likely as being above it, this is
sometimes called the equal-tailed interval,
3. the interval for which the posterior mean is the central point.
Respond to the following:

(a) demonstrate that the equal-tailed interval includes the posterior median,
(b) compare these three options for the beta-posterior Beta(5, 5),
(c) compute the 95% HPDI for the normal-posterior N(3, 0.2).

Chapter 6

Summarising data

Consider a random sample (x1 , . . . , xn ) taken from the unknown population distribution F(µ, σ) characterized by the
cumulative distribution function
F (x) = P(X ≤ x).
Even for moderate sample sizes n, a proper preprocessing of the dataset is crucial. A very useful first rearrangement
is to sort the random sample (x1 , . . . , xn ) in the ascending order. As a result, we arrive at the ordered sample values

x(1) ≤ x(2) ≤ . . . ≤ x(n) ,

where in particular,
x(1) = min{x1 , . . . , xn }, x(n) = max{x1 , . . . , xn }.
Doing this we do not lose any essential information about the data, while making it much more tractable. For example, the sample mean and sample standard deviation can be computed from the ordered sample (x(1) , . . . , x(n) ). In this
chapter we present several other basic tools for summarising the data in hand.

6.1 Empirical probability distribution


Given a random sample (x1 , . . . , xn ), the unknown population distribution function F (x) can be estimated by the
empirical distribution function defined as
F̂(x) = (1/n) Σ_{i=1}^n 1{xi ≤ x}.
Observe that for a fixed x,
F̂ (x) = p̂
is the sample proportion estimating the population proportion p = F (x). This implies that F̂ (x) is an unbiased
consistent estimate of F (x). The next figure illustrates this definition in terms of the earlier mentioned n = 24 male
heights

(x(1) , . . . , x(24) ) =
(170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199).

The graph of the empirical distribution function F̂ (x) gives an idea about the shape of the population distribution
function F (x).
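As an illustration, a minimal Python function (numpy assumed) that evaluates F̂(x) for the height data; for instance F̂(180) = 14/24 ≈ 0.58:

import numpy as np

def ecdf(sample, x):
    """Empirical distribution function: proportion of sample values <= x."""
    return np.mean(np.asarray(sample) <= x)

heights = [170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180,
           180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199]
print(ecdf(heights, 180))   # 14/24 = 0.583...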

As a function of x, F̂ (x) is a cumulative distribution function for a random variable Y with the discrete uniform
distribution
P(Y = xi) = 1/n,    i = 1, . . . , n,
assuming that all sample values xi are pairwise different. Clearly,
E(Y) = Σ_{i=1}^n xi/n = x̄,

and since
E(Y²) = (1/n) Σ_{i=1}^n xi²,

we get

Var(Y) = (1/n) Σ_{i=1}^n xi² − (x̄)² = ((n − 1)/n) · s².
It is easy to verify that even if some of the xi coincide, F̂(·) is a cumulative distribution function with mean x̄ and variance ((n − 1)/n) · s². We will call

σ̂² = ((n − 1)/n) · s² = (1/n) Σ_{i=1}^n (xi − x̄)² = (1/n) Σ_{i=1}^n xi² − (x̄)²

the empirical variance.

Empirical survival function


If the data describes life lengths, then it is more convenient to use the empirical survival function

Ŝ(x) = 1 − F̂ (x),

the proportion of the data greater than x. If the life length L has distribution function F (x) = P(L ≤ x), then its
survival function is
S(x) = P(L > x) = 1 − F (x), x ≥ 0.
If f (x) = F ′ (x) is the probability density function for the life length L, then the corresponding hazard function is
defined by
h(x) = f(x)/S(x).
The hazard h(x) is the mortality rate at age x in that

P(x < L ≤ x + δ | L ≥ x)/δ = P(x < L ≤ x + δ)/(δ P(L ≥ x)) = (F(x + δ) − F(x))/(δ S(x)) → h(x),    δ → 0.

The hazard function can be viewed as the negative slope of the log survival function:

h(x) = f(x)/S(x) = −S′(x)/S(x) = −(d/dx) ln S(x).

If L has the exponential distribution Exp(λ), then the hazard rate h(x) ≡ λ is constant over the ages.

Example: guinea pigs


Guinea pigs were randomly divided into 5 treatment groups of 72 animals each and one control group of 107 animals.
The guinea pigs in the treatment groups were infected with increasing doses of tubercle bacilli (Bjerkdal, 1960). The
survival times were recorded in days (note that not all the animals in the lower-dosage regimens died).

Control lifetimes

18 36 50 52 86 87 89 91 102 105 114 114 115 118 119 120 149 160 165 166 167 167 173 178 189 209 212 216 273
278 279 292 341 355 367 380 382 421 421 432 446 455 463 474 506 515 546 559 576 590 603 607 608 621 634 634 637
638 641 650 663 665 688 725 735

Dose I lifetimes

76 93 97 107 108 113 114 119 136 137 138 139 152 154 154 160 164 164 166 168 178 179 181 181 183 185 194
198 212 213 216 220 225 225 244 253 256 259 265 268 268 270 283 289 291 311 315 326 326 361 373 373 376 397 398

406 452 466 592 598

Dose II lifetimes

72 72 78 83 85 99 99 110 113 113 114 114 118 119 123 124 131 133 135 137 140 142 144 145 154 156 157 162
162 164 165 167 171 176 177 181 182 187 192 196 211 214 216 216 218 228 238 242 248 256 257 262 264 267 267 270
286 303 309 324 326 334 335 358 409 473 550

Dose III lifetimes

10 33 44 56 59 72 74 77 92 93 96 100 100 102 105 107 107 108 108 108 109 112 113 115 116 120 121 122 122
124 130 134 136 139 144 146 153 159 160 163 163 168 171 172 176 183 195 196 197 202 213 215 216 222 230 231 240
245 251 253 254 254 278 293 327 342 347 361 402 432 458 555

Dose IV lifetimes

43 45 53 56 56 57 58 66 67 73 74 79 80 80 81 81 81 82 83 83 84 88 89 91 91 92 92 97 99 99 100 100 101 102


102 102 103 104 107 108 109 113 114 118 121 123 126 128 137 138 139 144 145 147 156 162 174 178 179 184 191 198
211 214 243 249 329 380 403 511 522 598

Dose V lifetimes

12 15 22 24 24 32 32 33 34 38 38 43 44 48 52 53 54 54 55 56 57 58 58 59 60 60 60 60 61 62 63 65 65 67 68 70
70 72 73 75 76 76 81 83 84 85 87 91 95 96 98 99 109 110 121 127 129 131 143 146 146 175 175 211 233 258 258 263
297 341 341 376

It is difficult to compare the six groups just looking at the numbers. The data is illuminated by two graphs: the left
one for the survival functions and the right one for the log-survival functions.

The negative slopes of the curves to the right illustrate the hazard rates for different groups. Notice that the top line
is almost linear. This suggests an exponential model for the life lengths of the guinea pigs in the control group, with
the constant hazard rate λ ≈ 0.001 deaths per day.

6.2 Quantiles and QQ-plots


The inverse of the cumulative distribution function F (x) is called the quantile function

Q(p) = F −1 (p), 0 < p < 1.

For a given population distribution function F and 0 < p < 1, the population p-quantile is defined by

xp = Q(p).

The following three quantiles are of special importance:

median m = x0.5 = Q(0.5),


lower quartile x0.25 = Q(0.25),
upper quartile x0.75 = Q(0.75).

The p-quantile xp cuts off the proportion p of smallest values for the random variable X with P(X ≤ x) = F (x):

P(X ≤ xp ) = F (xp ) = F (Q(p)) = p.


In the continuous distribution case, the ordered random sample

x(1) < x(2) < . . . < x(n) ,

gives the strictly ordered n jump points for the empirical distribution function F̂ , so that
F̂(x(k)) = k/n,    F̂(x(k) − ϵ) = (k − 1)/n.

This observation leads to the following definition of empirical quantiles

x(k) is called the empirical ((k − 0.5)/n)-quantile.

QQ-plots and normal QQ-plots


Suppose we have two independent samples (x1 , . . . , xn ) and (y1 , . . . , yn ) of equal size n which are taken from two
population distributions with distribution functions F1 and F2 . A relevant null hypothesis H0 : F1 ≡ F2 has an
equivalent expression in terms of the quantile functions H0 : Q1 ≡ Q2 . The latter hypothesis can be tested graphically
using a QQ-plot.

QQ-plot is a scatter plot of n dots with coordinates (x(k) , y(k) ).

If such a QQ-plot closely follows the 45 degree line, that is when we observe almost equal quantiles, we may conclude
that the data supports the null hypothesis of equality H0 : F1 ≡ F2 .
More generally, if the QQ-plot approximates a straight line y = a + bx, then we take this as evidence for the linear
relation
Y = a + bX in distribution.
Indeed, the latter claim means that for all x,

F1 (x) = F2 (a + bx),

so that putting Q1 (p) = x, we get Q2 (p) = a + bx, which yields the linear relationship for the QQ-plot

Q2 (p) = a + bQ1 (p), 0 < p < 1.

The normality hypothesis H0 states that the random sample (x1 , . . . , xn ) is drawn from the normal distribution

F(µ, σ) = N(µ, σ),

with unspecified parameter values. A QQ-plot used for testing this hypothesis is called a normal QQ-plot or normal
probability plot. To define the normal QQ-plot, let

yk = Φ−1((k − 0.5)/n),

where Φ−1 is the quantile function for the N(0, 1) distribution function Φ. The normal QQ-plot is the scatter plot for
the sample quantiles matched with the theoretical quantiles

(x(1) , y1 ), . . . , (x(n) , yn ).

If the normal QQ-plot is close to a straight line y = a + bx, then we take it as evidence supporting the hypothesis of normality, and we may use the point estimates µ̂ = −a/b, σ̂ = 1/b for the population mean and standard deviation.
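A sketch in Python (numpy and scipy assumed) that produces the coordinates of a normal QQ-plot and rough estimates of µ and σ from a fitted straight line; here the sample quantiles are regressed on the theoretical ones, so the intercept estimates µ and the slope estimates σ, which is equivalent to the µ̂ = −a/b, σ̂ = 1/b rule above.

import numpy as np
from scipy.stats import norm

def normal_qq(sample):
    """Return (theoretical N(0,1) quantiles, ordered sample) for a normal QQ-plot."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    p = (np.arange(1, n + 1) - 0.5) / n        # (k - 0.5)/n
    return norm.ppf(p), x

y, x = normal_qq(np.random.normal(loc=180, scale=6, size=100))
slope, intercept = np.polyfit(y, x, 1)          # x ≈ intercept + slope * y
print(intercept, slope)                         # roughly 180 and 6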

Example: t3 and Beta(2, 2)


Consider two distributions t3 and Beta(2, 2), symmetric around 0 and 0.5, respectively. In the figure below each of them is plotted after the linear transformation (X − µ)/σ, together with the N(0, 1) curve. This figure is meant
to demonstrate the differences in the tails of these two distributions: the t-distribution has heavy tails and the beta
distribution has light tails.

The panels below show the normal QQ-plots for three samples of size n = 100 drawn from the t-distribution. Observe
the deviations from the straight lines characteristic for the heavy tails: the sample values in the tails are more extreme
than those projected by the straight lines going through the middle part of the QQ-plot.

The next panels show the normal QQ-plots for three samples of size n = 100 drawn from the beta-distribution. The
deviations from the straight lines are typical for the light tails: the sample values in the tails are less extreme than
those projected by the straight lines going through the middle part of the QQ-plot.

6.3 Density estimation


Estimating the probability density function f (x) = F ′ (x) for the population distribution is trickier than estimating
F (x). For example, taking the derivative of the empirical distribution function F̂ (x) doesn’t work. One can use a
solution based on the histogram displaying the observed counts
cj = Σ_{i=1}^n 1{xi ∈ cellj}

over the adjacent cells of a width h. By scaling the observed counts


fh(x) = cj/(nh),    for x ∈ cellj ,

we arrive at the scaled histogram giving us a density estimate since


∫ fh(x) dx = (h/(nh)) Σ_j cj = 1.

The choice of a balanced width h is important: a smaller h gives a ragged profile, while a larger h gives an oversmoothed, obscured profile.

To produce a smooth version of the scaled histogram, one can use the kernel density estimate with bandwidth h
defined by
fh(x) = (1/(nh)) Σ_{i=1}^n ϕ((x − xi)/h),    where ϕ(x) = (1/√(2π)) e^(−x²/2).
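A direct Python translation of the kernel estimate (numpy assumed); the choice of the bandwidth h is left to the user, as discussed above.

import numpy as np

def kernel_density(sample, grid, h):
    """Gaussian kernel density estimate with bandwidth h, evaluated on a grid."""
    sample = np.asarray(sample, dtype=float)[None, :]   # shape (1, n)
    grid = np.asarray(grid, dtype=float)[:, None]       # shape (m, 1)
    z = (grid - sample) / h
    phi = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)      # standard normal kernel
    return phi.mean(axis=1) / h                         # (1/(n h)) * sum of kernels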

Example: male heights


Turning to the example of Section 3.3 with 24 male heights, we get the following plots for three kernel density estimates
with different bandwidths h. With smaller h, the curve is oversensitive to the individual sample values. With larger
h, the curve is almost fully determined by its sample mean and standard deviation.

6.4 Skewness and kurtosis


Another simple way of testing normality relies on the third and fourth central moments of the data distribution: the coefficient of skewness and the kurtosis. Let X ∼ F(µ, σ). In terms of its normalized version Z = (X − µ)/σ, having

µ1 = E(Z) = 0, µ2 = E(Z 2 ) = 1,

the population coefficient of skewness and population kurtosis are defined by

µ3 = E(Z 3 ), µ4 = E(Z 4 ).

Importantly, for the normal distribution, µ3 = 0 and µ4 = 3.


Depending on the sign of the coefficient of skewness, we distinguish between symmetric µ3 = 0, skewed to the right
µ3 > 0, and skewed to the left µ3 < 0 distributions.

When µ3 is close to zero, a kurtosis close to 3 can be taken as an indication that the profile of the density curve is close to that of the normal distribution. Otherwise, we distinguish between leptokurtic distributions with µ4 > 3 (heavy tails) and platykurtic distributions with µ4 < 3 (light tails). Given a random sample (x1 , . . . , xn ) with the sample mean x̄
and sample variance s2 , the sample skewness and sample kurtosis are computed by
m3 = (1/(s³n)) Σ_{i=1}^n (xi − x̄)³,    m4 = (1/(s⁴n)) Σ_{i=1}^n (xi − x̄)⁴.
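These two statistics are easy to compute directly (a Python sketch, numpy assumed; note that s is the sample standard deviation with the n − 1 denominator, while the sums are divided by n, exactly as in the formulas above).

import numpy as np

def sample_skew_kurtosis(sample):
    """Sample skewness m3 and sample kurtosis m4."""
    x = np.asarray(sample, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)       # standardised values using s
    return np.sum(z ** 3) / n, np.sum(z ** 4) / n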

The next table gives a list of parametric distributions with their skewness and kurtosis coefficients.
Parametric distribution                  Skewness                           Kurtosis
N(µ, σ) normal distribution              µ3 = 0, symmetric                  µ4 = 3
Gam(α, λ) gamma distribution             µ3 = 2/√α, skewed to the right     µ4 = 3 + 6/α, leptokurtic
Beta(a, a) beta distribution             µ3 = 0, symmetric                  µ4 = 3 − 6/(2a + 3), platykurtic
t-distribution with df = k and k ≥ 5     µ3 = 0, symmetric                  µ4 = 3 + 6/(k − 4), leptokurtic

For example, we see that the gamma distribution Gam(α, λ) is positively skewed, having µ3 = 2/√α. As the shape parameter α gets larger, the skewness gets smaller and the kurtosis becomes close to 3.

Example: t3 and Beta(2, 2)
Returning to the example of the t3 and Beta(2, 2) distributions, we find that the heavy-tailed t-distribution has infinite
kurtosis µ4 = ∞, while the light-tailed beta-distribution has kurtosis µ4 = 2.143 which is smaller than 3. Recalling
the two standardised curves,

observe that the heavy-tailed curve is indeed leptokurtic (from Greek lepto ‘narrow’ + kurtos ‘bulging’) and the
light-tailed curve is platykurtic (from platy ‘broad, flat’).

Example: male heights


For the random sample of n = 24 male heights given in ascending order

170, 175, 176, 176, 177, 178, 178, 179, 179, 180, 180, 180, 180, 180, 181, 181, 182, 183, 184, 186, 187, 192, 192, 199,

we compute the following summary statistics:

x̄ = 181.46, m̂ = 180, m3 = 1.05, m4 = 4.31,

saying that the distribution is skewed to the right.

Good to know: the distribution of the heights of adult males is positively skewed,
implying m < µ, so that more than half of heights are below the average.

The normal QQ-plot of the data reveals a deviation from normality.

6.5 Inference on the population median


The central point of a distribution can be defined in terms of various measures of location, for example, as the
population mean µ or the median m. The population mean µ is estimated by the sample mean x̄, and the population
median m is estimated by the sample median m̂. Given the ordered sample (x(1) , . . . , x(n) ), the sample median is
defined as
m̂ = x(k)    or    m̂ = (x(k) + x(k+1))/2,
depending on whether n = 2k − 1 is an odd number or an even number n = 2k. The sample mean x̄ is sensitive to
outliers, while the sample median m̂ is not. Therefore, we say that m̂ is an estimator robust to outliers. This explains
why in many cases the data (like personal income in the United States, see Wikipedia) is summarised by the sample
median rather than by the sample mean.

Confidence interval for the median
Consider a random sample (x1 , . . . , xn ) without assuming any parametric model for the unknown population distri-
bution F(µ, σ). Suppose that the population distribution F(µ, σ) is continuous so that all sample values (x1 , . . . , xn )
are different (no ties). Observe that the number of observations below the true population median m,
y = Σ_{i=1}^n 1{xi ≤ m}

is a realisation of a random variable Y having the symmetric binomial distribution Y ∼ Bin(n, 0.5). Denote
pk = P(Y < k) = Σ_{i=0}^{k−1} (n choose i) 2^{−n},    k = 1, . . . , n.

Since
{Y ≥ k} = {X(k) < m}, {Y ≤ n − k} = {X(n−k+1) > m},
we obtain for k < n/2,
P(X(k) < m < X(n−k+1) ) = P(k ≤ Y ≤ n − k) = 1 − 2pk .
This yields the following non-parametric formula for an exact confidence interval for the median.

Im = (x(k) , x(n−k+1) ) is a 100 · (1 − 2pk )% confidence interval for the population median m

For example, if n = 25, then from the table below we find that (X(8) , X(18) ) gives a 95.7% confidence interval for the
median.
k 6 7 8 9 10 11 12
100 · (1 − 2pk ) 99.6 98.6 95.7 89.2 77.0 57.6 31.0
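The coverage 1 − 2pk is a binomial tail probability and can be checked in Python (scipy assumed); for n = 25 and k = 8 this reproduces the 95.7% of the table.

from scipy.stats import binom

def median_ci_level(n, k):
    """Confidence level 1 - 2*p_k of the interval (x_(k), x_(n-k+1)) for the median."""
    p_k = binom.cdf(k - 1, n, 0.5)      # P(Y < k), Y ~ Bin(n, 0.5)
    return 1 - 2 * p_k

print(median_ci_level(25, 8))           # 0.9567...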

Sign test
The sign test is a non-parametric test of H0 : m = m0 against the two-sided alternative H1 : m ̸= m0 . The sign test
statistic
y0 = Σ_{i=1}^n 1{xi ≤ m0}

counts the number of observations below the null hypothesis value. This test rejects the null hypothesis for larger or smaller observed values y0 , with reference to the null distribution Y0 ∼ Bin(n, 0.5) under H0 . There is a simple connection
between the sign test and the confidence interval for the median: we reject H0 if m0 falls outside the confidence interval

Im = (x(k) , x(n−k+1) ),

where k is computed from the given significance level of the test.
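A minimal Python implementation of the two-sided sign test (scipy assumed); it simply doubles the smaller binomial tail of the observed count y0.

from scipy.stats import binom

def sign_test_pvalue(sample, m0):
    """Two-sided sign test of H0: m = m0 using the Bin(n, 0.5) null distribution."""
    n = len(sample)
    y0 = sum(x <= m0 for x in sample)             # observations below m0
    tail = min(binom.cdf(y0, n, 0.5),             # P(Y <= y0)
               binom.sf(y0 - 1, n, 0.5))          # P(Y >= y0)
    return min(1.0, 2 * tail)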

6.6 Measures of dispersion


So far we used the sample variance s2 as the measure of dispersion in the data. A more straightforward measure of
dispersion is the sample range x(n) − x(1) . Both the sample variance and the sample range are sensitive to outliers.
Consider two robust measures of dispersion:
the interquartile range, IQR = x0.75 − x0.25 , is the difference between the upper and lower quartiles,

the median of absolute values of deviations, MAD, is defined as the sample median of {|xi −m̂|, i = 1, . . . , n}.
Turning to the standard normal distribution, we get

Φ(0.675) = 0.75, Φ−1 (0.75) = 0.675,

so that for the general normal distribution N(µ, σ), the theoretical lower and upper quartiles are µ ± 0.675 · σ yielding
IQR = (µ + 0.675 · σ) − (µ − 0.675 · σ) = 1.35 · σ.
On the other hand, since
P(|X − µ| ≤ 0.675 · σ) = (Φ(0.675) − 0.5) · 2 = 0.5,
we obtain MAD = 0.675 · σ. To summarise, for the normal distribution F(µ, σ) = N(µ, σ) model, we have three
estimates of σ:
s,    IQR/1.35,    MAD/0.675.

Boxplots
The boxplot is a convenient graphical tool for comparing the distributions of different samples in terms of the measures
of location and dispersion.

The boxplot is built of the following components: box, whiskers and outliers.
Box
• upper edge of the box = upper quartile (UQ)
• box center = median
• lower edge of the box = lower quartile (LQ)
• box height = interquartile range (IQR)
Whiskers
• upper whisker end = {largest data point ≤ UQ + 1.5 × IQR}
• lower whisker end = {smallest data point ≥ LQ – 1.5 × IQR}
Outliers
• upper dots = {data points > UQ + 1.5 × IQR}
• lower dots = {data points < LQ – 1.5 × IQR}

6.7 Exercises
Problem 1
Suppose that (X1 , . . . , Xn ) are independent uniform U(0, 1) random variables. The corresponding empirical distribu-
tion function
F̂(x) = (1/n) Σ_{i=1}^n 1{Xi ≤ x}
is a random variable with the mean x. One hundred such samples of size n = 16 were generated, and from these
samples the standard errors of F̂ (x) were computed for different values of x. In the following figure the estimated
standard errors of F̂ (x) are depicted as the vertical bars.

The elliptic line on the graph giving the theoretical prediction of the observed profile of the vertical bars follows the
formula
(x − 0.5)2 + 16y 2 = 0.25.
Explain this formula by computing a certain variance.

Problem 2
Using the setting of the previous problem, show for x < y, that

Cov(F̂(x), F̂(y)) = x(1 − y)/n.

Observe that this positive correlation is quite intuitive: if the empirical distribution function F̂ (x) overshoots its
expected value x, then we would expect that the other value F̂ (y) would also overshoot the respective value y.

Problem 3
The data below are the percentages of hydrocarbons in n = 59 samples of beeswax.

14.27 14.80 12.28 17.09 15.10 12.92 15.56 15.38 15.15 13.98
14.90 15.91 14.52 15.63 13.83 13.66 13.98 14.47 14.65 14.73
15.18 14.49 14.56 15.03 15.40 14.68 13.33 14.41 14.19 15.21
14.75 14.41 14.04 13.68 15.31 14.32 13.64 14.77 14.30 14.62
14.10 15.47 13.73 13.65 15.02 14.01 14.92 15.47 13.75 14.87
15.28 14.43 13.96 14.57 15.49 15.13 14.23 14.44 14.57

(a) For this data, plot the empirical distribution function, a histogram, and the normal QQ-plot. Find the 0.9, 0.75,
0.5, 0.25, and 0.1 quantiles. Does the distribution appear Gaussian?
(b) The average percentage of hydrocarbons in a synthetic wax is 85%. Suppose that beeswax was diluted with 1%
synthetic wax. Could this be detected? What about 3% and 5% dilution?

Problem 4
Calculate the hazard function for the Weibull distribution having the distribution function
F(x) = 1 − e^(−αx^β),    x ≥ 0,

where α and β are two positive parameters. (Waloddi Weibull was a Swedish engineer, scientist, and mathematician.)

Problem 5
Give an example of a distribution with an increasing failure rate (hazard function). Give an example of a distribution
with a decreasing failure rate.

Problem 6
Olson, Simpson, and Eden (1975) discuss the analysis of data obtained from a cloud seeding experiment. The following
data present the rainfall from 26 seeded and 26 control clouds.

Seeded clouds
129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5,
200.7, 274.7, 274.7, 7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1,
118.3, 255, 115.3, 242.5, 32.7, 40.6

Control clouds
26.1, 26.3, 87, 95, 372.4, 0.01, 17.3, 24.4, 11.5, 321.2,
68.5, 81.5, 47.3, 28.6, 830.1, 345.5, 1202.6, 36.6, 4.9, 4.9,
41.1, 29, 163, 244.3, 147.8, 21.7

Make a QQ-plot for seeded rainfall versus control rainfall and log seeded rainfall versus log control rainfall. What do
these plots suggest about the effect, if any, of seeding?

Problem 7
The Laplace distribution with a positive parameter λ is a two-sided exponential distribution. Its density function is f(x) = (λ/2) e^(−λ|x|) for x ∈ (−∞, ∞). The variance of this distribution is 2/λ² and the kurtosis is 6.

(a) Take λ = 2. Plot carefully the density f (x) together with the standard normal distribution density.

(b) Use the drawn picture to explain the exact meaning of the following citation. "Kurtosis is a measure of the
peakedness of the probability distribution of a real-valued random variable, although some sources are insistent
that heavy tails, and not peakedness, is what is really being measured by kurtosis".

Problem 8
Given a random sample (x1 , . . . , xn ) with n = 25, we are interested in testing H0 : m = 20 against the two-sided
alternative H1 : m ≠ 20 concerning the population median m. No parametric model is assumed. As a test statistic we take y = Σ_{i=1}^n 1{xi ≤ 20}, the number of observations below the null hypothesis value.

(a) Find the exact null distribution of the test statistic. What are your assumptions?
(b) Suggest an approximate confidence interval formula for m.

Problem 9
Miscellaneous questions.
(a) Describe a situation when a stratified sampling is more effective than a simple random sampling for estimating
the population mean. Which characteristics of the strata will influence your sample allocation choice?
(b) Given a dataset how do you compute kurtosis? What is the purpose of this summary statistic? Why is it
important to compute the coefficient of skewness for a proper interpretation of the kurtosis value?
(c) Suppose we are interested in the average height for a population of size 2,000,000. To what extent can a sample
of 200 individuals be representative for the whole population?

Problem 10
Let x1 , . . . , x25 be a random sample drawn from N(0.3, 1). Consider testing

H0 : m = 0 vs H1 : m > 0

at α = 0.05. Compare the power of two tests:


(a) the power of the sign test,

(b) the power of the test based on the normal theory assuming that σ = 1 is known.

Chapter 7

Comparing two samples

Suppose we wish to compare two population distributions F(µ1 , σ1 ) and F(µ2 , σ2 ) based on two random samples
(x1 , . . . , xn ) and (y1 , . . . , ym ) from these two populations. Consider two sample means
x̄ = (x1 + . . . + xn)/n,    ȳ = (y1 + . . . + ym)/m,

two sample variances

s1² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,    s2² = (1/(m − 1)) Σ_{i=1}^m (yi − ȳ)²,

and the standard errors of x̄ and ȳ

sx̄ = s1/√n,    sȳ = s2/√m.
The difference (x̄ − ȳ) is an unbiased estimate of µ1 − µ2 . We are interested in finding the standard error of x̄ − ȳ, a
confidence interval for the difference µ1 − µ2 , as well as testing the null hypothesis of equality
H0 : µ1 = µ2 .
Two main settings will be addressed: (1) two independent samples and (2) paired samples for sampling the differences.

7.1 Two independent samples


In the case of two independent samples, due to independence between X̄ and Ȳ , we have
Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = σ1²/n + σ2²/m,

implying that

s²x̄−ȳ = s²x̄ + s²ȳ = s1²/n + s2²/m

gives an unbiased estimate of Var(X̄ − Ȳ). Taking the square root, we obtain the following formula for the standard error of the point estimate x̄ − ȳ:

sx̄−ȳ = √(s²x̄ + s²ȳ) = √(s1²/n + s2²/m).

Large sample test for the difference between two means


If n and m are large, we can use the normal approximation
Z = ((X̄ − Ȳ) − (µ1 − µ2)) / √(S²X̄ + S²Ȳ) ≈ N(0, 1).

The hypothesis of equality


H0 : µ1 = µ2
is tested using the test statistic
z0 = (x̄ − ȳ)/√(s²x̄ + s²ȳ),

whose null distribution is approximated by the standard normal N(0, 1).

Approximate 100(1 − α)% confidence interval:   Iµ1−µ2 ≈ x̄ − ȳ ± z(α/2) · √(s²x̄ + s²ȳ).
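A compact Python sketch of this large-sample procedure (numpy and scipy assumed), returning the z statistic, its two-sided p-value and the approximate confidence interval:

import numpy as np
from scipy.stats import norm

def large_sample_two_means(x, y, alpha=0.05):
    """Approximate z-test and CI for mu1 - mu2 from two large independent samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z0 = diff / se
    p_value = 2 * norm.sf(abs(z0))
    half = norm.ppf(1 - alpha / 2) * se
    return z0, p_value, (diff - half, diff + half)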

Two-sample t-test
The two–sample t-test is based on the following three assumptions on the population distributions:
1. two random samples (x1 , . . . , xn ) and (y1 , . . . , ym ) are independent of each other,
2. both underlying population distributions are normal, F(µ1 , σ1 ) = N(µ1 , σ1 ) and F(µ2 , σ2 ) = N(µ2 , σ2 ),
3. these two normal distributions have equal variances σ12 = σ22 = σ 2 .
To estimate the common variance σ12 = σ22 = σ 2 , we can use
s²p = (Σ_{i=1}^n (xi − x̄)² + Σ_{i=1}^m (yi − ȳ)²) / (n + m − 2),

which is called the pooled sample variance. Using the equality

s²p = ((n − 1)/(n + m − 2)) · s1² + ((m − 1)/(n + m − 2)) · s2²,

it is easy to see that the pooled sample variance is an unbiased estimate of σ²:

E(S²p) = ((n − 1)/(n + m − 2)) · E(S1²) + ((m − 1)/(n + m − 2)) · E(S2²) = σ².

In the case of equal variances and independent samples, the variance

Var(X̄ − Ȳ) = σ² · (n + m)/(nm)

has the following unbiased estimate

s²x̄−ȳ = s²p · (n + m)/(nm).
Under the normality assumption with equal variances, the sampling distribution of the two-sample t-score
t = [((x̄ − ȳ) − (µ1 − µ2))/sp] · √(nm/(n + m))

is the t-distribution with (n + m − 2) degrees of freedom.


Exact confidence interval:   Iµ1−µ2 = x̄ − ȳ ± tn+m−2(α/2) · sp · √((n + m)/(nm)).

The two sample t-test for testing H0 : µ1 = µ2 is based on the test statistic
t0 = [(x̄ − ȳ)/sp] · √(nm/(n + m)),

obtained from the two-sample t-score t by putting µ1 − µ2 = 0. The null distribution of the test statistic is T0 ∼ tn+m−2 under H0.
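The corresponding Python sketch for the two-sample t-test with pooled variance (numpy and scipy assumed):

import numpy as np
from scipy.stats import t

def two_sample_t(x, y):
    """Two-sample t-test of H0: mu1 = mu2 (normal populations, equal variances)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    t0 = (x.mean() - y.mean()) / np.sqrt(sp2 * (n + m) / (n * m))
    p_value = 2 * t.sf(abs(t0), df=n + m - 2)
    return t0, p_value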

Case study: iron retention

Fe3+ (10.2) Fe3+ (1.2) Fe3+ (0.3) Fe2+ (10.2) Fe2+ (1.2) Fe2+ (0.3)
0.71 2.20 2.25 2.20 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.50 4.27 5.86 9.56
2.60 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.80
3.74 7.90 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.90
4.50 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.40 19.87
8.15 18.3 22.82 11.65 23.89 21.60
8.24 18.59 29.13 12.45 26.39 22.25

Two forms of iron Fe2+ and Fe3+ were compared in a randomised study on mice. The obtained data is the percentage
of iron retained in mice under three different iron dosage concentrations: 10.2, 1.2, and 0.3 millimolar. The six samples
were collected using n = 18 mice randomly assigned at each combination of the iron form and concentration. Turning
to the two independent samples at the medium concentration, we get the following summary of the data:
Fe2+ percentage retained: n = 18, x̄ = 9.63, s1 = 6.69, sx̄ = 1.58
Fe3+ percentage retained: m = 18, ȳ = 8.20, s2 = 5.45, sȳ = 1.28
The graphs below show that the population distributions are not normal. Therefore, to test H0 : µ1 = µ2 we turn to
the large sample test. Using the observed value
z0 = (x̄ − ȳ)/√(s²x̄ + s²ȳ) = 0.7,

and applying the normal distribution table we find an approximate two-sided p-value = 0.48.

Left panel: boxplots for percentages of Fe2+ (left) and Fe3+ (right). Right panel: two normal QQ plots.

After the log transformation of the data

x◦i = ln xi , yi◦ = ln yi ,

the normality assumption becomes more acceptable, as seen from the graphs below. For the transformed data, we get
Fe2+ log-percentage retained: n = 18, x̄◦ = 2.09, s◦1 = 0.659, sx̄◦ = 0.155,
Fe3+ log-percentage retained: m = 18, ȳ ◦ = 1.90, s◦2 = 0.574, sȳ◦ = 0.135.
Assuming that two population variances are equal to the unknown σ 2 , we estimate σ 2 by the pooled sample variance
formula
s²p = ((n − 1)/(n + m − 2)) · (s◦1)² + ((m − 1)/(n + m − 2)) · (s◦2)² = 0.5 · (0.659² + 0.574²) = 0.382.
The two-sample t-score for the transformed data is
t0 = (x̄◦ − ȳ◦) / (sp · √(2/n)) = (2.09 − 1.90)/(0.618/3) = 0.922.

Applying the two-sample t-test with df = 34, we find the two-sided p-value = 0.366 also resulting in non-significant
difference. In the next chapter we will analyse the full dataset of six samples using the two-way layout model of the
analysis of variance.

Left panel: boxplots for log-percentages of Fe2+ (left) and Fe3+ (right). Right panel: two normal QQ plots.

Rank sum test
The rank sum test is a nonparametric test for two independent samples, which does not assume normality of population
distributions. Suppose we have two independent samples (x1 , . . . , xn ) and (y1 , . . . , ym ) drawn from two continuous
population distributions F1 and F2 , and we would like to test

H0 : F1 = F2 against H1 : F1 ̸= F2

in the case when either one of the distributions F1 and F2 is not normal, or their variances are not equal. The rank
sum test procedure consists of the following steps:
pool the samples and replace the data values by their ranks 1, 2, . . . , n + m, starting from the smallest
sample value to the largest, and then compute two test statistics r1 = sum of the ranks of x-observations,
and r2 = sum of y-ranks.
Clearly, these two test statistics have a fixed sum
r1 + r2 = 1 + 2 + . . . + (n + m) = (n + m)(n + m + 1)/2.

The exact null distributions for R1 and R2 depend only on the sample sizes n and m, and the rank sum test
is included in all statistical packages. For n ≥ 10, m ≥ 10, we may apply the normal approximation for the null
distributions of R1 and R2 with
E(R1) = n(n + m + 1)/2,    E(R2) = m(n + m + 1)/2,    Var(R1) = Var(R2) = mn(n + m + 1)/12.
Note that the variances of R1 and R2 must be equal since their sum is a constant.

Example: comparing heights


The picture below illustrates a sampling experiment for testing the equality H0 : F1 = F2 of the height distributions
for females F1 and males F2 , against H1 : F1 ̸= F2 . According to the left panel we can measure the heights of n = 6
females and m = 5 males.

From the right panel we find the two rank sums

r1 = 1 + 2 + 5 + 6 + 8 + 9 = 31, r2 = 3 + 4 + 7 + 10 + 11 = 35.

Although the sample sizes are small, we will use the normal approximation for the null distribution of R2 , which has mean and variance

µ = m(n + m + 1)/2 = 30,    σ² = mn(n + m + 1)/12 = 30.
The resulting one-sided p-value is larger than 10%,

1 − Φ((35.5 − 30)/√30) = 1 − Φ(1.00) = 0.16,

implying that we cannot reject the null hypothesis of equal distributions judging from these small samples.
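A sketch of the normal approximation for the rank sum statistic in Python (numpy and scipy assumed; ties are ignored here, and no continuity correction is applied, whereas the worked example above adds 0.5 by hand):

import numpy as np
from scipy.stats import norm

def rank_sum_normal_approx(x, y):
    """One-sided rank sum test (large y-ranks) using the normal approximation."""
    n, m = len(x), len(y)
    pooled = np.concatenate([np.asarray(x, float), np.asarray(y, float)])
    ranks = np.argsort(np.argsort(pooled)) + 1       # ranks 1..n+m, assuming no ties
    r2 = ranks[n:].sum()                             # sum of the y-ranks
    mu = m * (n + m + 1) / 2
    sigma = np.sqrt(m * n * (n + m + 1) / 12)
    return r2, norm.sf((r2 - mu) / sigma)            # one-sided p-value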

7.2 Two independent samples: comparing population proportions


Consider the special case of two independent samples (x1 , . . . , xn ) and (y1 , . . . , ym ) drawn from two Bernoulli distri-
butions X ∼ Bin(1, p1 ) and Y ∼ Bin(1, p2 ). The null hypothesis of equality takes the form

H0 : p1 = p2 ,

the two-sample t-test is not valid, and we would like to modify the large sample test for the mean difference. For the
Bernoulli model, the sample means turn into the sample proportions

x̄ = p̂1 , ȳ = p̂2 ,

giving unbiased estimates of p1 , p2 with the standard errors


sp̂1 = √(p̂1(1 − p̂1)/(n − 1)),    sp̂2 = √(p̂2(1 − p̂2)/(m − 1)).

Large sample test for two proportions
If the samples sizes m and n are large, then an approximate confidence interval for the difference p1 − p2 is given by
Ip1−p2 ≈ p̂1 − p̂2 ± z(α/2) · √(p̂1(1 − p̂1)/(n − 1) + p̂2(1 − p̂2)/(m − 1)).
If this interval does not cover 0, we reject the null hypothesis of equality H0 : p1 = p2 in favour of the two-sided
alternative H1 : p1 ̸= p2 at the significance level α.
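A short Python sketch (numpy and scipy assumed) that builds this approximate confidence interval from the two sample proportions and the two sample sizes:

import numpy as np
from scipy.stats import norm

def proportions_ci(p1_hat, n, p2_hat, m, alpha=0.05):
    """Approximate 100(1-alpha)% CI for p1 - p2 from two independent samples."""
    se = np.sqrt(p1_hat * (1 - p1_hat) / (n - 1) + p2_hat * (1 - p2_hat) / (m - 1))
    half = norm.ppf(1 - alpha / 2) * se
    diff = p1_hat - p2_hat
    return diff - half, diff + half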

Example: opinion polls


Consider two consecutive monthly poll results p̂1 and p̂2 with n ≈ m ≈ 5000 interviews. A change in support to
a major political party from p̂1 to p̂2 , with both numbers being close to 40%, is deemed significant if the difference
between these two sample proportions satisfies
|p̂1 − p̂2| > 1.96 · √(2 · 0.4 · 0.6/5000) = 0.019.

This result may be compared with the one-sample hypothesis testing

H0 : p = 0.4 vs H1 : p ≠ 0.4.

The approximate 95% confidence interval for p is


Ip ≈ p̂ ± 1.96 · √(p̂(1 − p̂)/(n − 1)),
and if p̂ is not far from 0.4, then the result of the opinion poll is significant if
|p̂ − 0.4| > 1.96 · √(0.4 · 0.6/5000) = 0.013.

Fisher’s exact test

Fisher’s exact test deals with the null hypothesis

H0 : p1 = p2 ,

when the sample sizes m and n are not sufficiently large for applying normal approximations for the distributions of
the sample means. We summarise the data of two independent samples as a 2 × 2 table of sample counts
Sample 1 Sample 2 Total
Number of successes c11 c12 c11 + c12
Number of failures c01 c02 c01 + c02
Sample sizes n m N =n+m
where
c11 = x1 + . . . + xn , c01 = n − c11 , c12 = y1 + . . . + ym , c02 = m − c12 .
The key idea of Fisher's test is to treat the observed count c11 as a random outcome of drawing without replacement of n balls from an urn with N balls. Assuming that the urn contains c11 + c12 black balls and c01 + c02 white balls, c11 is the number of black balls among the n sampled balls. Putting N = n + m, and freezing the proportion

p = (c11 + c12)/N,

we find that under the null hypothesis of equality, the sample count c11 is generated by the hypergeometric distribution

C11 ∼ Hg(N, n, p) under H0.

This null distribution should be used for determining the rejection rule of the exact Fisher test in terms of the sample
counts (c11 , c12 ) and the sample sizes (n, m).

Example: gender bias
The following data were collected after 48 copies of the same file with 24 files labeled as “male” and the other 24
labeled as “female” were sent to 48 experts. Each expert decision had two possible outcomes: promote or hold file.
We wish to test
H0 : p1 = p2 no gender bias,
against
H1 : p1 > p2 males are favoured.
Given the data
Male Female Total
Promote 21 14 35
Hold file 3 10 13
Total 24 24 48
Fisher’s test would reject H0 in favour of the one-sided alternative H1 if the observed value c11 = 21 is deemed significantly large under the hypergeometric distribution Hg(N, n, p) with N = 48, n = 24, and p = 35/48:
P(C11 = k) = (35 choose k)(13 choose 24 − k) / (48 choose 24) = (35 choose 35 − k)(13 choose k − 11) / (48 choose 24),    11 ≤ k ≤ 24.

This distribution is symmetric around the mean µ = 17.5 with

P(C11 ≤ 14) = P(C11 ≥ 21) = 0.0245.

We conclude that the one-sided p-value = 0.0245, and a two-sided p-value = 0.049. According to this dataset, there is significant evidence of gender bias in favour of male candidates.
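The one-sided p-value above can be reproduced with scipy's hypergeometric distribution (a sketch; scipy parametrises the distribution by the population size, the number of "success" items in the population, and the number of draws):

from scipy.stats import hypergeom

N, promoted, n = 48, 35, 24           # files in total, promotions in total, "male" files
dist = hypergeom(N, promoted, n)      # null distribution of C11
p_one_sided = dist.sf(20)             # P(C11 >= 21)
print(p_one_sided)                    # 0.0245...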

7.3 Paired samples


Paired difference tests discussed in this section are based on a specific type of blocking. Examples of paired observations:
• two different drugs applied to two patients matched by age and sex, with measured effects (x, y),
• a fruit’s weight x before shipment and y after shipment,
• two types of tires tested on the same car, with measured effects (x, y).
A paired sample is a vector of independent and identically distributed two-dimensional random variables

(x1 , y1 ), . . . , (xn , yn ).

As before, we have two samples (x1 , . . . , xn ) and (y1 , . . . , yn ) taken from two populations with possibly different means and variances
E(X) = µ1 , E(Y ) = µ2 , Var(X) = σ12 , Var(Y ) = σ22 ,
and our main question is again whether the difference µ1 − µ2 is statistically significant.
In contrast to the two independent samples case, in the paired sampling setting, there is a positive dependence in
the joint distribution of (X, Y ):
ρ = Cov(X, Y)/(σ1σ2) > 0.
Even with paired samples, the difference x̄ − ȳ gives an unbiased estimate of the population mean difference µ1 − µ2 .
This is an unbiased estimate whose variance takes into account the dependence between X and Y. Observe that

Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) − 2Cov(X̄, Ȳ) = σ1²/n + σ2²/n − (2/n²) Cov(X1 + . . . + Xn , Y1 + . . . + Yn).
Since Xi and Yj are independent for i ̸= j, we get

Cov(X1 + . . . + Xn , Y1 + . . . + Yn ) = Cov(X1 , Y1 ) + . . . + Cov(Xn , Yn ) = nCov(X, Y ) = nσ1 σ2 ρ,

implying
Var(X̄ − Ȳ) = (1/n)(σ1² + σ2² − 2σ1σ2ρ).

In particular, if the two samples are independent and have equal sizes, then ρ = 0 and
Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = (1/n)(σ1² + σ2²).

It follows that with ρ > 0, the paired sampling ensures a smaller standard error for the estimate x̄ − ȳ compared to
the independent X and Y .
The key idea in handling the paired samples is to reduce the two-sample case to the single random sample case by
taking the differences
(d1 , . . . , dn ), di = xi − yi .
For the sample of the differences, we have
d¯ = x̄ − ȳ,
with
E(D̄) = µ1 − µ2 ,    Var(D̄) = (1/n)(σ1² + σ2² − 2σ1σ2ρ).

Smoking and platelet aggregation


To study the effect of cigarette smoking on platelet aggregation, Levine (1973) drew blood samples from 11 individuals
before and after they smoked a cigarette and measured the extent to which the blood platelets aggregated. Platelets are involved in the formation of blood clots, and it is known that smokers suffer more often from disorders involving
blood clots than do nonsmokers. The obtained data are shown in the following table, which gives the maximum
percentage of all the platelets that aggregated after being exposed to a stimulus.
Before smoking yi After smoking xi d i = xi − yi
25 27 2
25 29 4
27 37 10
28 43 15
30 46 16
44 56 12
52 61 9
53 57 4
53 80 27
60 59 –1
67 82 15

The first two columns show the paired measurements (xi , yi ) for n = 11 individuals. Using the data we estimate
the correlation coefficient ρ for the measurement results before and after smoking by computing the sample correlation
coefficient
r = ((x1 − x̄)(y1 − ȳ) + . . . + (xn − x̄)(yn − ȳ)) / ((n − 1)s1s2) = 0.90.
The obtained high value indicates a strong positive correlation between the first two columns, and the third column
reduces the two sample data to the single random sample of the differences di = xi − yi .
Assuming that the population distribution for differences D ∼ N(µ, σ) is normal with µ = µ1 − µ2 , we may apply
the one-sample t-test for
H0 : µ = 0 against H1 : µ ̸= 0.
The observed test statistic value
t0 = d̄/sd̄ = 10.27/2.40 = 4.28

according to the t-distribution with 10 degrees of freedom produces the two-sided p-value 0.0016. We conclude that
smoking has a significant health effect.
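The paired t-test reduces to a one-sample t-test on the differences; a Python sketch reproducing the numbers above (numpy and scipy assumed):

import numpy as np
from scipy.stats import t

before = np.array([25, 25, 27, 28, 30, 44, 52, 53, 53, 60, 67])
after = np.array([27, 29, 37, 43, 46, 56, 61, 57, 80, 59, 82])
d = after - before                                    # differences d_i = x_i - y_i
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))     # 4.28...
p_value = 2 * t.sf(abs(t0), df=len(d) - 1)            # 0.0016...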
Without the assumption of normality on the distribution of differences, we can apply the non-parametric sign test
for a pair of hypotheses concerning the population median m of the difference D = X − Y :

H0 : m = 0 against H1 : m ̸= 0.

We may use
y0 = 1{d1 <0} + . . . + 1{dn <0} = 1
as the test statistic, which has the null distribution Y0 ∼ Bin(n, 0.5) under H0.

The two-sided p-value of the sign test is

2P(Y0 ≤ 1) = 2[(0.5)^11 + 11 · (0.5)^11] = 0.012.

Thus we reject the null hypothesis of no difference also with the help of the non-parametric sign test.

Signed rank test
The sign test disregards a lot of information in the data taking into account only the sign of the differences. The
signed rank test pays attention to the sizes of positive and negative differences. This non-parametric test requires the
following assumption: the population distribution of D = X − Y is symmetric about its median m. The test statistic
for
H0 : m = 0 against H1 : m ̸= 0.
is computed in terms the ranks of the absolute values of the differences

ri = rank(|di |), i = 1, . . . , n.

Here ri = 1 if di has the smallest absolute value and ri = n if di has the largest absolute value among the differences
irrespective of the sign of the difference di . The signed ranks used for computing the signed rank test statistic w are
defined as follows: if di > 0, then ri is called a positive rank, and if di < 0, then ri is called a negative rank. There
are two possible ways to compute the signed rank test statistic w: either as the sum of positive ranks
w = Σ_{i=1}^n ri · 1{di > 0}

or the sum of the negative ranks

w = Σ_{i=1}^n ri · 1{di < 0}.

Irrespective of the sign of the ranks used for computing the test statistic, its null distribution is the same. The exact
null distribution of W is included in all statistical packages. For n ≥ 20, one can use the normal approximation N(µ, σ)
of the null distribution of W , with

µ = n(n + 1)/4,    σ² = n(n + 1)(2n + 1)/24.

The signed rank test uses more data information than the sign test
but requires symmetric distribution of differences.

Example: platelet aggregation


The table below illustrates how the signed ranks are computed for the data on platelet aggregation. Notice how the
problem of ties (equal differences) is resolved for assigning the ranks.
Before smoking yi After smoking xi di = xi − yi |di | Rank of |di | Signed rank
25 27 2 2 2 +2
25 29 4 4 3.5 +3.5
27 37 10 10 6 +6
28 43 15 15 8.5 +8.5
30 46 16 16 10 +10
44 56 12 12 7 +7
52 61 9 9 5 +5
53 57 4 4 3.5 +3.5
53 80 27 27 11 +11
60 59 –1 1 1 –1
67 82 15 15 8.5 +8.5
The observed value of the signed rank test statistic is
w = Σ_{i=1}^n ri · 1{di < 0} = 1.

It gives the two-sided p-value = 0.002. Given the small p-value, before we conclude that there is a significant effect,
we have to verify the symmetry assumption on the distribution of differences D. Judging from the plot below, the
vector of observed differences (2, 4, 10, 15, 16, 12, 9, 4, 27, −1, 15) is symmetric enough around its median.
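For larger samples (n ≥ 20) the normal approximation of the signed rank statistic can be coded directly (a Python sketch, numpy and scipy assumed; ties are not given midranks here, unlike the table above):

import numpy as np
from scipy.stats import norm

def signed_rank_normal_approx(d):
    """Signed rank statistic (sum of negative ranks) and its approximate p-value."""
    d = np.asarray(d, dtype=float)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1    # ranks of |d_i|, assuming no ties
    w = ranks[d < 0].sum()
    n = len(d)
    mu = n * (n + 1) / 4
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w, 2 * norm.cdf((w - mu) / sigma)         # two-sided p-value when w < mu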

7.4 Paired samples: comparing population proportions
Suppose we have two dependent Bernoulli random variables X ∼ Bin(1, p1 ) and Y ∼ Bin(1, p2 ). The vector (X, Y ) has
four possible values (0, 0), (0, 1), (1, 0), (1, 1) with probabilities denoted as π00 , π01 , π10 , π11 . With n independent paired
observations of the vector (X, Y ), we get four counts (c00 , c01 , c10 , c11 ) for the different outcomes. The corresponding
joint distribution is multinomial

(C00 , C01 , C10 , C11 ) ∼ Mn(n, π00 , π01 , π10 , π11 ).

An unbiased estimate for the difference


p1 − p2 = π10 − π01 ,
in the case of the paired dichotomous observations is given by
p̂1 − p̂2 = π̂10 − π̂01 ,    π̂10 = c10/n,    π̂01 = c01/n.
The variance of the sampling distribution of p̂1 − p̂2 is estimated by

s²p̂1−p̂2 = (π̂10 + π̂01 − (π̂10 − π̂01)²)/(n − 1),
in view of

Var(C10 − C01 ) = nπ10 (1 − π10 ) + nπ01 (1 − π01 ) + 2nπ10 π01 = n(π10 + π01 − (π10 − π01 )2 ).

Referring to the central limit theorem we arrive at the following approximate 100(1 − α)% confidence interval
formula for the difference p1 − p2
Ip1−p2 ≈ p̂1 − p̂2 ± z(α/2) · sp̂1−p̂2 .

McNemar’s test
A significant difference between p1 and p2 is established when the confidence interval Ip1 −p2 does not cover zero, that
is when
|p̂1 − p̂2 | > z(α/2) · sp̂1−p̂2 ,
or in other words, the rejection region for

H0 : p1 = p2 against H1 : p1 ̸= p2

or equivalently
H0 : π10 = π01 against H1 : π10 ̸= π01
has the form  
R = { |π̂10 − π̂01| / √((π̂10 + π̂01 − (π̂10 − π̂01)²)/(n − 1)) > z(α/2) }.

Now notice that for the large sample sizes, the squared left hand side approximately equals

n(π̂10 − π̂01)² / (π̂10 + π̂01 − (π̂10 − π̂01)²) ≈ (c10 − c01)²/(c10 + c01).
This observation leads to the McNemar test statistic
x² = (c10 − c01)²/(c10 + c01).
Its null distribution is approximated by the χ21 -distribution. Observe that out of the four sample counts (c00 , c01 , c10 , c11 )
the McNemar test statistic depends only on two of them (c01 , c10 ). Thus the number of informative counts is just
c01 + c10 out of total n. We will return to the McNemar test in Section 9.3.
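The McNemar statistic and its approximate p-value in Python (scipy assumed); only the two discordant counts enter:

from scipy.stats import chi2

def mcnemar(c01, c10):
    """McNemar test statistic and its approximate chi-squared(1) p-value."""
    x2 = (c10 - c01) ** 2 / (c10 + c01)
    return x2, chi2.sf(x2, df=1)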

7.5 List of statistical tests


At this point, the reader might find it useful to get a look at the next list mentioning all the frequentist tests, parametric
and non-parametric, covered in this textbook (so far and later on).

One-sample tests
• One sample t-test: normal population distribution
• Large sample test for mean

• Large sample test for proportion: categorical data
• Small sample test for proportion: categorical data
• Chi-squared test of goodness of fit: categorical data, large sample
• Chi-squared test of independence: categorical data, large sample
• Model utility test: linear model, several explanatory variables, normal noise, homoscedasticity
Two-sample tests

• Two sample t-test: normal populations, equal variances, independent samples


• Fisher’s exact test: categorical data, independent samples
• McNemar: categorical data, matched samples, large samples
Several samples

• ANOVA 1: normal population distributions, equal variances, independent samples


• ANOVA 2: normal population distributions, equal variances, matched samples
• Chi-squared test of homogeneity: categorical data, independent samples, large samples
Non-parametric tests

• Sign test: one sample


• Signed rank test: two matched samples, symmetric distribution of differences
• Rank sum test: two independent samples
• Kruskal-Wallis: several independent samples
• Friedman: several matched samples

7.6 Exercises
Problem 1
Four random numbers were generated from a normal distribution N(µ1 , σ 2 )

x1 = 1.1650, x2 = 0.6268, x3 = 0.0751, x4 = 0.3516,

along with five random numbers with the same variance σ 2 but perhaps a different mean µ2

y1 = 0.3035, y2 = 2.6961, y3 = 1.0591, y4 = 2.7971, y5 = 1.2641.

(a) What do you think the means of the random normal number generators were? What do you think the difference
of the means was?
(b) Estimate σ 2 .

(c) What is the estimated standard error of your estimate of the difference of the means?
(d) Form a 90% confidence interval for the difference of the means.
(e) In this situation, is it more appropriate to use a one-sided test or a two-sided test of the equality of the means?

(f) What is the p-value of a two-sided test of the null hypothesis of equal means?
(g) Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the significance
level α = 0.1?
(h) Suppose you know that the variance of the normal distribution was σ 2 = 1. How would your answers to the
preceding questions change?

Problem 2
In the "two independent samples" setting we have two ways of estimating the variance of X̄ − Ȳ :

(a) s²p · (1/n + 1/m), if σ1 = σ2 = σ,

(b) s1²/n + s2²/m, without the assumption of equal variances.
Show that if m = n, then these two estimates are identical.

Problem 3
An experiment of the efficacy of a drug for reducing high blood pressure is performed using four subjects in the
following way:
two of the subjects are chosen at random for the control group and two for the treatment group.
During the course of a treatment with the drug, the blood pressure of each of the subjects in the treatment group is
measured for ten consecutive days as is the blood pressure of each of the subjects in the control group.
(a) In order to test whether the treatment has an effect, do you think it is appropriate to use the two-sample t-test
with n = m = 20?
(b) Do you think it is appropriate to use the rank sum test?

Problem 4
This is an example of the so-called Simpson’s paradox.
Hospital A has higher overall death rate than hospital B:
Hospital: A B
Died 63 16
Survived 2037 784
Total 2100 800
Death Rate 0.03 0.02
However, if we split the data in two parts, patients in good (+) and bad (−) conditions, for both parts hospital A
performs better:
Hospital: A+ A− B+ B−
Died 6 57 8 8
Survived 594 1443 592 192
Total 600 1500 600 200
Death Rate 0.010 0.038 0.013 0.040
Resolve this paradox.

Problem 5
Suppose that n measurements are to be taken under a treatment condition and another n measurements are to be
taken independently under a control condition. It is thought that the standard deviation of a single observation is
about 10 under both conditions. How large should n be so that a 95% confidence interval for the mean difference has
a width of 2? Use the normal distribution rather than the t-distribution, since n will turn out to be quite large.

Problem 6
Two types of engine bearings are compared.
The data in the table give the millions of cycles until failure.

Type I Type II
3.03 3.19
5.53 4.26
5.60 4.47
9.30 4.53
9.92 4.67
12.51 4.69
12.95 6.79
15.21 9.37
16.04 12.75
16.84 12.78

(a) Use a normal theory test for the null hypothesis of no difference against the two-sided alternative
H0 : µ1 = µ2 , H1 : µ1 ̸= µ2 .

(b) Test the hypothesis of no difference between the two types of bearing using a non-parametric method.
(c) Which of the methods (a) or (b) do you think is better in this case? (Hint: refer to the normal QQ-plot
constructed for the deviations of the 20 observations from their sample means.)
(d) Estimate π, the probability that a type I bearing will outlast a type II bearing.

Problem 7
Find the exact null distribution for the test statistic of the signed rank test with n = 4.

Problem 8
Turn to the signed rank test and denote by W a random variable having the null distribution of its test statistic. For
n = 10, 20, 25 and α = 0.05, 0.01, the next table gives the critical values wα such that 2P(W ≤ wα ) is closest to α.
n = 10 n = 20 n = 25
α = 0.05 8 52 89
α = 0.01 3 38 68
Compare these critical values with their counterparts wα◦ obtained using the normal approximation for W .

Problem 9
From two population distributions with the same σ = 10, two samples (x1 , . . . , xn ) and (y1 , . . . , yn ) of sizes n = 25
can be taken in two ways
(a) paired sample with Cov(Xi , Yi ) = 50, i = 1, . . . , 25,
(b) two independent random samples.
Compare the power curves for testing
H0 : µ1 = µ2 , H1 : µ1 > µ2 , α = 0.05,
for the settings (a) and (b).

Problem 10
The table below reports fifteen pairs (xi , yi ) of measurements.
Sample 1: 97.2 105.8 99.5 100 93.8 79.2 72 72 69.5 20.5 95.2 90.8 96.2 96.2 91
Sample 2: 97.2 97.8 96.2 101.8 88 74 75 67.5 65.8 21.2 94.8 95.8 98 99 100.2
What are x̄ − ȳ and sx̄−ȳ ?
If the pairing had been erroneously ignored and it had been assumed that the two samples were independent, what
would have been the estimate of the standard deviation of X̄ − Ȳ ?
Analyse the data to determine if there is a systematic difference between µ1 and µ2 .
Problem 11
The media often present short reports of the results of experiments. To the critical reader, such reports often raise
more questions than they answer. Comment on the following pitfalls in the interpretation of each of the following.
(a) It is reported that patients whose hospital rooms have a window recover faster than those whose rooms do not.
(b) Nonsmoking wives whose husbands smoke have a cancer rate twice that of wives whose husbands do not smoke.
(c) A two-year study in North Carolina found that 75% of all industrial accidents in the state happened to workers
who had skipped breakfast.
(d) A 15-year study of more than 45 000 Swedish soldiers revealed that heavy users of marijuana were six times
more likely than nonusers to develop schizophrenia.
(e) A study of nearly 4000 elderly North Carolinians has found that those who attended religious services every week
were 46% less likely to die over a six-year period than people who attended less often or not at all.

Problem 12
The following table shows admission rates for the six most popular majors at the graduate school at the University of
California at Berkeley. The numbers in the table are the number of applicants and the percentage admitted.
Men Women
Major A 825 (62%) 108 (82%)
Major B 560 (63%) 25 (68%)
Major C 325 (37%) 593 (34%)
Major D 417 (33%) 375 (35%)
Major E 191 (28%) 393 (34%)
Major F 373 (6%) 341 (7%)

(a) If the percentages admitted are compared, women do not seem to be unfavourably treated. But when the combined
admission rates for all six majors are calculated, it is found that 44% of the men and only 30% of the women
were admitted. How is this paradox resolved?
(b) This is an example of an observational study. Suggest a controlled experiment testing relevant statistical
hypotheses.

Problem 13
You have got a grant to measure the average weight of the hippopotamus at birth. You have seen in a previous
publication by Stanley and Livingstone that for male calves the distribution of weights has a mean of roughly 70 kg
and a standard deviation of 10 kg, while these numbers are 60 kg and 5 kg for females, but you are interested in a
better remeasurement of the overall average.
The experimental procedure is simple: you wait for the herd of hippopotami to be sleeping, you approach a
newborn, you put it quickly on the scales, and you pray for the mother not to wake up. You managed to weigh 13
female and 23 male newborns with the following results:
Female Male
Sample mean 62.8 69.7
Sample standard deviation 6.8 11.7

(a) Test the null hypothesis of the equal sex ratio for the newborn hippopotami (meaning that the ratio of males to
females at birth is 1 to 1).

(b) Assuming the ratio of males to females at birth is 1 to 1, suggest two different unbiased point estimates for the
overall average weight of the hippopotamus at birth: the random sample mean and the stratified sample mean.
(c) Compute the standard errors for the stratified sample mean.

Problem 14
For each of nine horses, a veterinary anatomist measured the density of nerve cells at specified sites in the intestine:
Animal Site I Site II
1 50.6 38.0
2 39.2 18.6
3 35.2 23.2
4 17.0 19.0
5 11.2 6.6
6 14.2 16.4
7 24.2 14.4
8 37.4 37.6
9 35.2 24.4
The null hypothesis of interest is that in the population of all horses there is no difference between the two sites.
(a) Which of the two non-parametric tests is appropriate here: the rank sum test or the signed rank test? Explain
your choice.

(b) On the basis of the data, would you reject the null hypothesis? Use one of the tests named in item (a).
(c) Explain the following extract from the course text book:
More precisely, with the signed rank test, H0 states that the distribution of the differences is symmetric
about zero. This will be true if the members of pairs of experimental units are assigned randomly to
treatment and control conditions, and the treatment has no effect at all.

Problem 15
A study is conducted of the association between the rate at which words are spoken and the ability of a “talking
computer” to recognise commands that it is programmed to accept. A random sample of 50 commands is spoken
first at a rate under 60 words per minute, and then the same commands are repeated at a rate over 60 words per
minute. In the first case the computer recognised 42 out of 50 commands while in the second case it recognised only 35
commands. Is the observed difference statistically significant? Assume that if the "talking computer" gave a correct
answer in the fast regime, then it correctly recognised the same command in the slow regime.

Chapter 8

Analysis of variance

The two-sample setting from the previous chapter is the case where the response is explained by a single main factor
having two levels. In this chapter, devoted to the analysis of variance, or ANOVA for short, we extend these settings to
one or two main factors having arbitrarily many levels. The one-way layout setting extends the two independent
samples case, while the two-way layout generalises the paired sample setting of the previous chapter.

8.1 One-way layout ANOVA


Consider the one-way layout model from Section 1.2. Suppose that for each of the I levels of the main factor A, we
have independently collected a random sample (yi1 , . . . , yin ) of size n. With I independent samples of size n in hand,
we want to test
H0 : µ1 = . . . = µI , against H1 : µi ̸= µj for some pair of indices (i, j).
If the levels of the factor A are I different treatments in a comparison study, then the above null hypothesis claims that
the compared treatments have the same effect, implying that the factor A has no influence on the measured response
and the suggested one-way layout model is not useful.

Example: seven labs

The table below lists 70 measurements of chlorpheniramine maleate in tablets with a nominal dosage of 4 mg made
by seven different labs. This is an example of one-way layout data, consisting of I = 7 independent samples each of
size n = 10.
Lab 1 Lab 2 Lab 3 Lab 4 Lab 5 Lab 6 Lab 7
4.13 3.86 4.00 3.88 4.02 4.02 4.00
4.07 3.85 4.02 3.88 3.95 3.86 4.02
4.04 4.08 4.01 3.91 4.02 3.96 4.03
4.07 4.11 4.01 3.95 3.89 3.97 4.04
4.05 4.08 4.04 3.92 3.91 4.00 4.10
4.04 4.01 3.99 3.97 4.01 3.82 3.81
4.02 4.02 4.03 3.92 3.89 3.98 3.91
4.06 4.04 3.97 3.9 3.89 3.99 3.96
4.10 3.97 3.98 3.97 3.99 4.02 4.05
4.04 3.95 3.98 3.90 4.00 3.93 4.06

The data is graphically presented in the form of seven boxplots.

The null hypothesis of interest H0 states that there is no significant difference between the output of the seven
laboratories. We would reject H0 if the discrepancy between the ordered sample means
Lab i 1 3 7 2 5 6 4
Mean ȳi. 4.062 4.003 3.998 3.997 3.957 3.955 3.920
cannot be explained by the noise factors.

Normal theory model


Assume that N = I · n response variables have the form

Yik = µ + αi + σZik , Zik ∼ N(0, 1), i = 1, . . . , I, k = 1, . . . , n,

where αi = µi − µ are the main effects such that

α1 + . . . + αI = 0.

Here the noise components σZik are assumed to have the same standard deviation σ. The corresponding maximum
likelihood estimates of µ, µi = µ + αi , and αi ,

µ̂ = ȳ.. , µ̂i = ȳi. , α̂i = ȳi. − ȳ.. ,

can be expressed in terms of the sample means


ȳi. = (1/n) Σk yik ,    ȳ.. = (ȳ1. + . . . + ȳI. )/I = (1/N) Σi Σk yik .

It is easy to verify that


α̂1 + . . . + α̂I = 0.
The observed response values can be represented as

yik = µ̂ + α̂i + êik , êik = yik − ȳi. ,

where êik are the so-called residuals giving the discrepancy between the observed response and the corresponding
sample mean.
The ANOVA test is built around the following decomposition

ssT = ssA + ssE ,

where
ssT = Σi Σk (yik − ȳ.. )² is the total sum of squares for the pooled sample,
ssA = n Σi α̂i² is the factor A sum of squares,
ssE = Σi Σk êik² is the error sum of squares.

Each of these sums of squares (ssT , ssA , ssE ) is characterised by its number of degrees of freedom:

df T = N − 1, df A = I − 1, df E = I · (n − 1).
Notice that
df T = df A + df E .
This decomposition says that the total variation in the response values is the sum of the between-group variation and
the within-group variation, making clear why the data analysis based on this decomposition is called the analysis of
variance.
By normalising the sums of squares with the help of the numbers of degrees of freedom, we obtain the so-called mean
sums of squares
msA = ssA /df A ,    msE = ssE /df E .
The underlying random variables have the mean values
E(MSA ) = σ² + (n/(I − 1)) Σi αi² ,    E(MSE ) = σ² .

Notice that msE gives an unbiased estimate of the noise variance σ². It can be viewed as the pooled sample variance
sp² = msE = Σi Σk (yik − ȳi. )² / (I(n − 1)).

One-way F -test
Under the null hypothesis H0 : α1 = . . . = αI = 0, we expect

E(MSA ) = E(MSE ) = σ 2 .

On the other hand, if H0 is violated, then E(MSA ) > E(MSE ), as given some αi ̸= 0, we have
σ² + (n/(I − 1)) Σi αi² > σ² .
Thus taking the ratio between the two mean squares
f = msA /msE
as the test statistic, we should reject H0 using the rejection region
R = {f ≥ fα }.

The critical value fα should be determined by the null distribution
F ∼ Fk1 ,k2 under H0 ,    k1 = I − 1, k2 = I(n − 1),
where Fk1 ,k2 is the so-called F-distribution with degrees of freedom (k1 , k2 ). The F-distribution is the distribution for
the ratio
(X1 /k1 )/(X2 /k2 ) ∼ Fk1 ,k2 ,
where X1 ∼ χ²k1 and X2 ∼ χ²k2 are two independent random variables having chi-squared distributions. The critical
values of the F-distributions for different significance levels α are given in Section 11.4.

Example: seven labs


The one-way ANOVA table below
Source df ss ms f p
Labs 6 0.125 0.0210 5.66 0.0001
Error 63 0.231 0.0037
Total 69 0.356
summarises the F-test applied to the seven labs data. The p-value of 0.01% is computed from the observed
test statistic f = 5.66 using the F6,63 -distribution built into R. The use of the table in Section 11.4 allows
us to conclude that the p-value must be smaller than 0.1% since 5.66 is larger than the closest available table value
F6,63 (0.001) = 4.3395.
The normal QQ-plot of the residuals êik supports the normality assumption. The noise size σ is estimated by
sp = √0.0037 = 0.061.
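The whole table can be reproduced in R. A minimal sketch, assuming the 70 measurements have been put into a data frame called labs with a numeric column amount and a factor column lab (these names are ours, not part of the dataset):
    fit <- aov(amount ~ lab, data = labs)    # one-way layout with I = 7 levels
    summary(fit)                             # df, ss, ms, the F statistic and its p-value
    sp <- sqrt(sum(residuals(fit)^2) / df.residual(fit))   # pooled estimate of the noise size sigma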

8.2 Simultaneous confidence interval


The analysis of the last example leads to the conclusion that at least one of the c = (7 choose 2) = 21 pairwise differences is
significant. The table below lists the 10 largest pairwise differences between the sample means.
(i, j) (1, 4) (1, 6) (1, 5) (3, 4) (7, 4) (2, 4) (1, 2) (1, 7) (1, 3) (5, 4)
ȳi. − ȳj. 0.142 0.107 0.105 0.083 0.078 0.077 0.065 0.064 0.059 0.047
Clearly, the difference (µ1 − µ4 ) corresponding to the largest sample mean difference

ȳ1. − ȳ4. = 0.142

must be significant at the 5% significance level. The important follow-up question is: which of the remaining 20 pairwise
differences are significant and which are not at the 5% significance level?
A naive answer to this question would rely on the 95% confidence interval for a single difference (µi − µj ) between
two independent samples from the previous chapter
Iµi −µj = (ȳi. − ȳj. ) ± t63 (0.025) · sp /√5 = (ȳi. − ȳj. ) ± 0.055,
where t63 (0.025) = 2.00. Notice that here we use the t-distribution with 63 degrees of freedom, as the corresponding
pooled sample variance sp² = 0.0037 is based on I = 7 samples each of size n = 10. According to this confidence

interval formula we have 9 significant differences: (1, 4), (1, 6), (1, 5), (3, 4), (7, 4), (2, 4), (1, 2), (1, 7), (1, 3), since the
next ordered difference 0.047 for the pair (5, 4) is smaller than 0.055. The inequality 0.047 < 0.055 ensures that the
95% confidence interval
Iµ5 −µ4 = 0.047 ± 0.055
contains 0, implying that the difference (µ5 − µ4 ) is not significant at the 5% significance level.
The problem with the suggested solution is that the use of the confidence interval formula aimed at a single difference
may produce false discoveries. This is an example of the famous multiple comparison problem nicely illustrated by
the illuminating cartoon "Jelly beans cause acne", see https://xkcd.com/882/. What we need here is a simultaneous
confidence interval formula addressing all c = 21 pairwise comparisons. Next, we introduce two formulas for the
simultaneous confidence interval: the Bonferroni formula Bµi −µj , and the Tukey formula Tµi −µj .

Bonferroni method
Suppose a statistical test of a null hypothesis H0 at the significance level αc is repeatedly applied to c inde-
pendently sampled datasets. The overall result is deemed to be positive if we get at least one positive result among
these c tests. Then the overall significance level α is computed as
α = P(Xc ≥ 1|H0 ),
where Xc is the number of positive results among these c tests. Due to the independence of the tests, we have
Xc ∼ Bin(c, αc ) under H0 ,
yielding
P(Xc ≥ 1|H0 ) = 1 − (1 − αc )^c .
Assuming αc is small, we find that
α = 1 − (1 − αc )^c ≈ cαc .
This reasoning leads to the following important conclusion: to obtain the overall significance level α for a
multiple testing procedure involving c independent tests, each individual test should be performed at the level α/c.
Applying this idea to the current setting of c = (I choose 2) pairwise differences (µi − µj ), we arrive at the Bonferroni
formula for the 100(1 − α)% simultaneous confidence interval:
Bµi −µj = (ȳi. − ȳj. ) ± tI(n−1) (α/(I(I − 1))) · sp √(2/n) ,    1 ≤ i, j ≤ I.

This formula is obtained from the confidence interval for a single difference
Iµi −µj = (ȳi. − ȳj. ) ± tI(n−1) (α/2) · sp √(2/n) ,    1 ≤ i, j ≤ I,
by applying the Bonferroni correction of the significance level
α → αc = 2α/(I(I − 1)).

Observe that the pairwise differences µi − µj are not independent as required by the Bonferroni method: for example,
knowing the differences (µ1 − µ2 ) and (µ2 − µ3 ) we may exactly predict the third difference µ1 − µ3 using the equality

µ1 − µ2 + µ2 − µ3 = µ1 − µ3 .

As a result the Bonferroni method gives slightly wider intervals compared to the Tukey method introduced below.

Example: seven labs


For the seven labs example, the 95% Bonferroni interval takes the form
Bµi −µj = (ȳi. − ȳj. ) ± t63 (0.025/21) · sp /√5 = (ȳi. − ȳj. ) ± 0.086,

where t63 (0.0012) = 3.17. Compared to the naive interval

Iµi −µj = (ȳi. − ȳj. ) ± 0.055,

the Bonferroni interval


Bµi −µj = (ȳi. − ȳj. ) ± 0.086
is much wider, resulting in only 3 significant differences: (1, 4), (1, 5), (1, 6).
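In R the Bonferroni half-width can be computed directly from a t-quantile; a small sketch using the numbers of this example (sp = 0.061, n = 10, c = 21 comparisons, 63 degrees of freedom):
    sp <- 0.061; n <- 10; c.comp <- 21; alpha <- 0.05
    qt(1 - alpha / (2 * c.comp), df = 63) * sp * sqrt(2 / n)   # half-width, approximately 0.086
Bonferroni-adjusted p-values for all 21 pairs can also be obtained with pairwise.t.test(labs$amount, labs$lab, p.adjust.method = "bonferroni"), using the hypothetical data frame from the one-way ANOVA sketch above.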

Tukey method
Under the normality assumption with equal variances, the centred sample means
Xi = Ȳi. − µi ∼ N(0, σ/√n),    i = 1, . . . , I,
are independent normal variables. Consider the range of all pairwise differences Xi − Xj ,
max{X1 , . . . , XI } − min{X1 , . . . , XI },
giving the largest pairwise difference between the components of the vector (X1 , . . . , XI ). The corresponding normalised
range has a distribution that is free from the parameter σ:
(max{X1 , . . . , XI } − min{X1 , . . . , XI }) / (Sp /√n) ∼ SRk1 ,k2 ,    k1 = I, k2 = I(n − 1).

Here, the so-called studentised range distribution SRk1 ,k2 is controlled by two parameters: the number of independent
samples k1 and the number of degrees of freedom k2 used in the variance estimate s2p .
Tukey’s 100(1 − α)% simultaneous confidence interval:    Tµi −µj = (ȳi. − ȳj. ) ± qI,I(n−1) (α) · sp /√n .

The Tukey interval is built using the factor qk1 ,k2 (α) determined by

P(Q > qk1 ,k2 (α)) = α, where Q ∼ SRk1 ,k2 .

In contrast to Bonferroni, Tukey takes into account the dependences between the differences (µi − µj ).

Example: seven labs


Using the R-command qtukey(0.95,7,63) we find q7,63 (0.05) = 4.307, so that
Tµi −µj = (ȳi. − ȳj. ) ± q7,63 (0.05) · 0.061/√10 = (ȳi. − ȳj. ) ± 0.083,

recognising four significant pairwise differences: (1, 4), (1, 5), (1, 6), (3, 4).
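With the aov fit from the one-way ANOVA sketch above, the same interval comes out of standard R functions; a minimal sketch:
    qtukey(0.95, nmeans = 7, df = 63)   # 4.307, the studentised range quantile used above
    TukeyHSD(fit, conf.level = 0.95)    # simultaneous intervals for all 21 pairwise differences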

8.3 Kruskal-Wallis test


The F-test requires a residual analysis in which, for example with the help of the normal QQ-plot of the residuals, one
should evaluate whether the assumptions of normality and equal variances are reasonable. If these assumptions are not
valid, then one can use the non-parametric Kruskal-Wallis test described next.
Assume that we have I independent random samples and consider the null hypothesis of no effect

H0 : the underlying I population distributions are equal.

Extending the idea of the rank sum test, turn to the pooled sample of size N = I · n. Let rik be the pooled ranks of
the sample values yik , so that
Σi Σk rik = 1 + 2 + . . . + N = N(N + 1)/2,
implying that the overall mean rank is
r̄.. = N(N + 1)/(2N) = (N + 1)/2.
The Kruskal-Wallis test statistic
w = (12n/(N(N + 1))) Σi (r̄i. − (N + 1)/2)²
measures the discrepancy between the sample means of the ranks
r̄i. = (ri1 + . . . + rin )/n,    i = 1, . . . , I.
A large value of w indicates a deviation from the null hypothesis. Thus the Kruskal-Wallis test rejects H0 for
large values of w. For I = 3, n ≥ 5 or I ≥ 4, n ≥ 4, one can use the approximate null distribution
W ≈ χ²I−1 under H0 .

Example: seven labs
In the table below the actual measurements are replaced by their ranks 1 to 70. For the observed Kruskal-Wallis test
statistic w = 28.17, using the χ²6 -distribution table, we see that the p-value of the Kruskal-Wallis test is smaller than
0.005.

Labs 1 2 3 4 5 6 7
70 4 35 6 46 48 38
63 3 45 7 21 5 50
53 65 40 13 47 22 52
64 69 41 20 8 28 58
59 66 57 16 14 37 68
54 39 32 26 42 2 1
43 44 51 17 9 31 15
61 56 25 11 10 34 23
67 24 29 27 33 49 60
55 19 30 12 36 18 62
Means 58.9 38.9 38.5 15.5 26.6 27.4 42.7
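The same test is available in R as kruskal.test; a one-line check, again assuming the hypothetical data frame labs from the earlier sketch:
    kruskal.test(amount ~ lab, data = labs)   # chi-squared statistic with I - 1 = 6 degrees of freedom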

8.4 Two-way layout ANOVA


Referring to the two-way layout model from Section 1.2, assume that the dataset at hand

{yijk , i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n}

is generated in the following way

Yijk = µ + αi + βj + δij + σZijk , i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n,

where Zijk are independent random variables having the standard normal distribution N(0, 1). We start by assuming
that n ≥ 2, so that for each combination of levels (i, j) of the main factors we have at least two replications, which
would allow us to measure the variation in the response variable within each cell (i, j). The important case n = 1 will
be addressed separately.
The maximum likelihood estimates under the assumption of normality take the form
µ̂ = ȳ... = (1/(IJn)) Σi Σj Σk yijk ,
α̂i = ȳi.. − ȳ... ,    ȳi.. = (1/(Jn)) Σj Σk yijk ,
β̂j = ȳ.j. − ȳ... ,    ȳ.j. = (1/(In)) Σi Σk yijk ,
δ̂ij = ȳij. − ȳ... − α̂i − β̂j = ȳij. − ȳi.. − ȳ.j. + ȳ... ,    ȳij. = (1/n) Σk yijk .

Given n ≥ 2, after computing the residuals

êijk = yijk − ȳij. , k = 1, . . . , n,

we arrive at the crucial decomposition


yijk = µ̂ + α̂i + β̂j + δ̂ij + êijk .

Case study: iron retention


The table below gives the percentages of iron retained in 108 mice randomly allocated into six groups. Here the factor
A has I = 2 levels representing two iron forms, and the factor B has J = 3 levels representing different dosage
concentrations. Each sample has size n = 18.

Fe3+ (10.2) Fe3+ (1.2) Fe3+ (0.3) Fe2+ (10.2) Fe2+ (1.2) Fe2+ (0.3)
0.71 2.20 2.25 2.20 4.04 2.71
1.66 2.93 3.93 2.69 4.16 5.43
2.01 3.08 5.08 3.54 4.42 6.38
2.16 3.49 5.82 3.75 4.93 6.38
2.42 4.11 5.84 3.83 5.49 8.32
2.42 4.95 6.89 4.08 5.77 9.04
2.56 5.16 8.50 4.27 5.86 9.56
2.60 5.54 8.56 4.53 6.28 10.01
3.31 5.68 9.44 5.32 6.97 10.08
3.64 6.25 10.52 6.18 7.06 10.62
3.74 7.25 13.46 6.22 7.78 13.80
3.74 7.90 13.57 6.33 9.23 15.99
4.39 8.85 14.76 6.97 9.34 17.90
4.50 11.96 16.41 6.97 9.91 18.25
5.07 15.54 16.96 7.52 13.46 19.32
5.26 15.89 17.56 8.36 18.40 19.87
8.15 18.3 22.82 11.65 23.89 21.60
8.24 18.59 29.13 12.45 26.39 22.25

The six boxplots for the retention data (see the next figure, left panel) show that all samples come from population
distributions skewed to the right, and that there is a clear discrepancy among different sample standard deviations.
With the data skewed to the right, taking the logarithms might serve as a simple remedy. For this reason, we will
focus on the transformed data (yijk ) obtained by taking the natural logarithms of the percentage of iron retention.
The right panel of the next figure shows that the boxplots for the transformed data (yijk ) look more symmetric and
exhibit less discrepancy between the within-sample variations.

The next table lists the sample means (ȳij. , ȳi.. , ȳ.j. , ȳ... ) for the transformed data (yijk ):
              10.2   1.2    0.3    Level mean
Fe3+          1.16   1.90   2.28   1.78
Fe2+          1.68   2.09   2.40   2.06
Level mean    1.42   2.00   2.34   1.92
The resulting maximum likelihood estimates are
µ̂ = 1.92, α̂1 = −0.14, α̂2 = 0.14, β̂1 = −0.50, β̂2 = 0.08, β̂3 = 0.42,
and  
−0.12 0.04 0.08
(δ̂ij ) = .
0.12 −0.04 −0.08
It is useful to graphically compare the two profiles shown in the figure below, where the red line connects the three
sample means for the iron form Fe3+ (see the first row of the previous table) and the blue line does the same for
Fe2+. In particular, the fact that the two rows are not parallel may indicate a possible interaction between the two
main factors.

Observe that the estimated effect size for the log-retention is

α̂2 − α̂1 = ȳ2.. − ȳ1.. = 0.28.

This yields a multiplicative effect of e^0.28 = 1.32 on the original scale, implying that the retention percentage of
Fe2+ is higher than that of Fe3+ by a factor of 1.32.

Three F -tests for the two-way ANOVA


The two-way ANOVA version of the decomposition of the sums of squares has the form

ssT = ssA + ssB + ssAB + ssE ,


where
ssT = Σi Σj Σk (yijk − ȳ... )² ,    df T = IJn − 1,
ssA = Jn Σi α̂i² ,    df A = I − 1,
ssB = In Σj β̂j² ,    df B = J − 1,
ssAB = n Σi Σj δ̂ij² ,    df AB = (I − 1)(J − 1),
ssE = Σi Σj Σk êijk² ,    df E = IJ(n − 1).

The corresponding mean sums of squares and their expected values are
msA = ssA /df A ,    E(MSA ) = σ² + (Jn/(I − 1)) Σi αi² ,
msB = ssB /df B ,    E(MSB ) = σ² + (In/(J − 1)) Σj βj² ,
msAB = ssAB /df AB ,    E(MSAB ) = σ² + (n/((I − 1)(J − 1))) Σi Σj δij² ,
msE = ssE /df E ,    E(MSE ) = σ² .

The error mean sum of squares msE , an unbiased estimate of σ², is in fact the pooled sample variance
sp² = msE = Σi Σj Σk (yijk − ȳij. )² / (IJ(n − 1)).

In the two-way ANOVA setting, there are three different null hypotheses of interest

HA : α1 = . . . = αI = 0, there is no systematic difference between different levels of the factor A,


HB : β1 = . . . = βJ = 0, there is no systematic difference between different levels of the factor B,
HAB : all δij = 0, there is no interaction between the factors A and B.

Following the idea of the F-test in the one-way setting, we arrive at three F-tests addressing the three null hypotheses
using three different F-test statistics.
Null hypothesis    No-effect property               Test statistic and its null distribution
HA                 E(MSA ) = E(MSE ) = σ²           FA = MSA /MSE ∼ Fdf A ,df E
HB                 E(MSB ) = E(MSE ) = σ²           FB = MSB /MSE ∼ Fdf B ,df E
HAB                E(MSAB ) = E(MSE ) = σ²          FAB = MSAB /MSE ∼ Fdf AB ,df E

Again, we reject the null hypothesis for the values of the F-test statistic larger than the critical value determined by
the corresponding F-distribution.

Example: iron retention
The two-way ANOVA table for the transformed iron retention data takes the form
Source df ss ms f p
Iron form 1 2.074 2.074 5.99 0.017
Dosage 2 15.588 7.794 22.53 0.000
Interaction 2 0.810 0.405 1.17 0.315
Error 102 35.296 0.346
Total 107 53.768
Referring to the three p-values in the rightmost column, obtained using the three null distributions

(F1,102 , F2,102 , F2,102 ),

we draw the following three conclusions


• there is a significant effect due to iron form,

• the dosage effect is undoubtedly significant, however, this is to be expected,


• the interaction effect between the iron form and the dosage is not statistically significant.
Concerning the main conclusion, recall that our previous analysis of two samples at the intermediate dosage has
produced a non-significant result.

Finally, after inspecting the normal QQ-plot for the residuals (êijk ), we accept the assumptions of normality and
equal variance for the transformed data.
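The two-way ANOVA table can be reproduced in R; a sketch assuming a data frame iron with the log-retention in a numeric column logret and factor columns form and dosage (these names are ours, not part of the dataset):
    fit2 <- aov(logret ~ form * dosage, data = iron)
    summary(fit2)   # rows for form, dosage, the form:dosage interaction, and the error term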

Randomised block design


The famous data collecting principle suggests:

block what you can, randomise what you cannot.

Blocking is used to remove the effects of the most important nuisance variable.
Randomisation is then used to reduce the contaminating effects of
the remaining nuisance variables. Applied to the ANOVA two-way layout
setting, this principle leads to the following experimental design:
randomly assign I treatments within each of J blocks. The first of the next three examples is illustrated by the figure
above.
• Block: a homogeneous plot of land divided into I subplots. Treatments: I fertilisers, each applied to a randomly chosen subplot. Observation: the yield on subplot (i, j).
• Block: a four-wheel car. Treatments: 4 types of tires tested on the same car. Observation: the tire’s life-length.
• Block: a litter of I animals. Treatments: I diets randomly assigned to the I siblings. Observation: the weight gain.

8.5 Additive model


In this section we assume that n = 1. With only one replication per cell (i, j), we cannot estimate the interaction
effect between the two main factors A and B. This restricts us to the additive model without interaction

Yij = µ + αi + βj + σZij , Zij ∼ N(0, 1).

For the given data (yij ), we find the maximum likelihood estimates and the residuals using
µ̂ = ȳ.. ,    α̂i = ȳi. − ȳ.. ,    β̂j = ȳ.j − ȳ.. ,
êij = yij − µ̂ − α̂i − β̂j = yij − ȳi. − ȳ.j + ȳ.. .

It is easy to check that


yij = µ̂ + α̂i + β̂j + êij .
In this case, the decomposition of the sums of squares takes a reduced form

ssT = ssA + ssB + ssE ,

with
ssT = Σi Σj (yij − ȳ.. )² ,    df T = IJ − 1,
ssA = J Σi α̂i² ,    df A = I − 1,
ssB = I Σj β̂j² ,    df B = J − 1,
ssE = Σi Σj êij² ,    df E = (I − 1)(J − 1).
Using
msA = ssA /df A ,    E(MSA ) = σ² + (J/(I − 1)) Σi αi² ,
msB = ssB /df B ,    E(MSB ) = σ² + (I/(J − 1)) Σj βj² ,
msE = ssE /df E ,    E(MSE ) = σ² ,
we can apply two F-tests for two different null hypotheses:
HA : α1 = . . . = αI = 0,    FA = MSA /MSE ∼ Fdf A ,df E under HA ,
HB : β1 = . . . = βJ = 0,    FB = MSB /MSE ∼ Fdf B ,df E under HB .

Example: itching treatments


In a study the following I = 7 treatments
Treat 1: no drug, Treat 2: placebo, Treat 3: papaverine, Treat 4: morphine,
Treat 5: aminophylline, Treat 6: pentabarbital, Treat 7: tripelennamine.
were compared for their effect in relieving itching. Each treatment was applied to J = 10 male volunteers aged 20 to 30
years after an itching condition had been initiated by a certain injection. The data in the table give the durations of
the itching in seconds (yij ).
Subject Treat 1 Treat 2 Treat 3 Treat 4 Treat 5 Treat 6 Treat 7
BG 174 263 105 199 141 108 141
JF 224 213 103 143 168 341 184
BS 260 231 145 113 78 159 125
SI 225 291 103 225 164 135 227
BW 165 168 144 176 127 239 194
TS 237 121 94 144 114 136 155
GM 191 137 35 87 96 140 121
SS 100 102 133 120 222 134 129
MU 115 89 83 100 165 185 79
OS 189 433 237 173 168 188 317
The resulting two-way ANOVA table
Source df ss ms f p
Drugs 6 53013 8835 2.85 0.018
Subjects 9 103280 11476 3.71 0.001
Error 54 167130 3096
Total 69 323422
shows that the treatment effect is significant. The differences between the subjects are significant, as expected. The
two p-values are computed from the F6,54 and F9,54 -distributions.
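In R the additive model is fitted by simply leaving out the interaction term; a sketch assuming a data frame itch with the durations in a column seconds and factor columns drug and subject (hypothetical names):
    fit.add <- aov(seconds ~ drug + subject, data = itch)
    summary(fit.add)   # F-tests for Drugs (6 and 54 df) and Subjects (9 and 54 df)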

Friedman test
Here we introduce a non-parametric test for the two-way layout of the ANOVA setting with n = 1, which does
not require the noise terms to be normally distributed. The Friedman test addresses the null hypothesis H0 of no
treatment effect and is based on within-block ranking. Let the ranks within the j-th block be:

(r1j , . . . , rIj ) = ranks of (y1j , . . . , yIj ),

so that
r1j + . . . + rIj = 1 + 2 + . . . + I = I(I + 1)/2,
and the average ranks are
r̄i. = (ri1 + . . . + riJ )/J ,    r̄.. = (r̄1. + . . . + r̄I. )/I = (I + 1)/2.
The Friedman test statistic
q = (12J/(I(I + 1))) Σi (r̄i. − (I + 1)/2)²
has an approximate null distribution Q ≈ χ²I−1 under H0 .

The Friedman test statistic q is a measure of agreement between J rankings, so we reject H0 for large values of q.

Example: itching treatments


The boxplots for the seven treatments indicate violations of the assumption of normality. Notably, the placebo results
exhibit a larger variance of the itching duration in seconds. We complement the previous ANOVA results with the
Friedman non-parametric test based on the following rank values (rij ) and their treatment means (r̄1. , . . . , r̄7. ).

Subject Treat 1 Treat 2 Treat 3 Treat 4 Treat 5 Treat 6 Treat 7


BG 5 7 1 6 3.5 2 3.5
JF 6 5 1 2 3 7 4
BS 7 6 4 2 1 5 3
SI 6 7 1 4 3 2 5
BW 3 4 2 5 1 7 6
TS 7 3 1 5 2 4 6
GM 7 5 1 2 3 6 4
SS 1 2 5 3 7 6 4
MU 5 3 2 4 6 7 1
OS 4 7 5 2 1 3 6
r̄i. 5.10 4.90 2.30 3.50 3.05 4.90 4.25
Each subject acts as a jury member by ranking the seven treatment effects. Notice how the subject BG resolves the
tie between treatments 5 and 7. The Friedman test statistic q is a measure of agreement among the jury members:
if q is larger than the critical value, then the observed agreement will allow us to reject the null hypothesis of no
difference. With (I + 1)/2 = 4 and 12J/(I(I + 1)) = 120/56 ≈ 2.14, we find the Friedman test statistic value to be
q = 2.14 · [(5.10 − 4)² + (4.90 − 4)² + (2.30 − 4)² + (3.50 − 4)² + (3.05 − 4)² + (4.90 − 4)² + (4.25 − 4)²] = 14.28.

Using the chi-squared distribution table with k = 6 degrees of freedom we find that the p-value of the test is approxi-
mately 3%. The Friedman test also rejects the null hypothesis and suggests the following top 3 list in the effectiveness
against itching:
(1) papaverine, (2) aminophylline, (3) morphine.
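The same analysis is carried out in R by friedman.test; a sketch assuming the itching times are stored in a 10 × 7 matrix, one row per subject and one column per treatment (the matrix name is ours):
    friedman.test(as.matrix(itching))   # blocks = rows (subjects), groups = columns (treatments)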

8.6 Exercises
Problem 1
For the one-way analysis of variance with two treatment groups, show that the F -test statistic equals t2 , where t is
the test statistic for the two-sample t-test.

Problem 2
A study on the tensile strength of aluminium rods is conducted. Forty identical rods are randomly divided into four
groups, each of size 10. Each group is subjected to a different heat treatment, and the tensile strength, in thousands
of pounds per square inch, of each rod is determined. The following table presents the results of the measurements.
Consider the null hypothesis of equality between the four treatment means of tensile strength

H0 : µ1 = µ2 = µ3 = µ4 .

Treatment 1 2 3 4 Combined data
18.9 18.3 21.3 15.9 18.9 18.3 21.3 15.9
20.0 19.2 21.5 16.0 20.0 19.2 21.5 16.0
20.5 17.8 19.9 17.2 20.5 17.8 19.9 17.2
20.6 18.4 20.2 17.5 20.6 18.4 20.2 17.5
19.3 18.8 21.9 17.9 19.3 18.8 21.9 17.9
19.5 18.6 21.8 16.8 19.5 18.6 21.8 16.8
21.0 19.9 23.0 17.7 21.0 19.9 23.0 17.7
22.1 17.5 22.5 18.1 22.1 17.5 22.5 18.1
20.8 16.9 21.7 17.4 20.8 16.9 21.7 17.4
20.7 18.0 21.9 19.0 20.7 18.0 21.9 19.0
mean 20.34 18.34 21.57 17.35 19.40
variance 0.88 0.74 0.88 0.89 3.58
skewness 0.16 0.14 -0.49 -0.08 0.05
kurtosis 2.51 2.59 2.58 2.46 1.98
(a) Test the null hypothesis applying an ANOVA test. Show clearly how all the sums of squares are computed using
the sample means and variances given in the table.
(b) What are the assumptions of the ANOVA model you used? Do they appear fulfilled?
(c) The Bonferroni method gives the following formula for computing simultaneous 95% confidence intervals for six
pairwise differences between four treatment means
Bµi −µj = (ȳi. − ȳj. ) ± t36 (0.025/6) · 0.4472 · sp .

Explain this formula and using it check which of the pairs of treatments have significantly different means.

Problem 3
Consider the likelihood ratio test for the null hypothesis of the one-way layout under the normality assumption, and
show that it is equivalent to the F-test.

Problem 4
Suppose in a one-way layout there are 10 treatments and seven observations under each treatment. What is the
ratio of the length of a simultaneous confidence interval for the difference of two means formed by Tukey’s method to
that of one formed by the Bonferroni method? How do both of these compare in length to an interval based on the
t-distribution that does not take account of multiple comparisons?

Problem 5
During each of four experiments on the use of carbon tetrachloride as a worm killer, ten rats were infested with larvae
(Armitage 1983). Eight days later, five rats were treated with carbon tetrachloride; the other five were kept as controls.
After two more days, all the rats were killed and the numbers of worms were counted. The table below gives the counts
of worms for the four control groups.
Group I Group II Group III Group IV
279 378 172 381
338 275 335 346
334 412 335 340
198 265 282 471
303 286 250 318
Significant differences among the control groups, although not expected, might be attributable to changes in the
experimental conditions. A finding of significant differences could result in more carefully controlled experimentation
and thus greater precision in later work.
Use both graphical techniques and the F-test to test whether there are significant differences among the four groups.
Use a nonparametric technique as well.

Problem 6
The concentrations (in nanograms per milliliter) of plasma epinephrine were measured for 10 dogs under isofluorane,
halothane, and cyclopropane anesthesia. The measurements are given in the following table (Perry et al. 1974).
Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10
Isofluorane 0.28 0.51 1.00 0.39 0.29 0.36 0.32 0.69 0.17 0.33
Halothane 0.30 0.39 0.63 0.68 0.38 0.21 0.88 0.39 0.51 0.32
Cyclopropane 1.07 1.35 0.69 0.28 1.24 1.53 0.49 0.56 1.02 0.30

Is there a difference in treatment effects? Use a parametric and a nonparametric analysis.

Problem 7
The following table gives the survival times (in hours) for animals in an experiment whose design consisted of three
poisons, four treatments, and four observations per cell.
Treatment A Treatment B Treatment C Treatment D
Poison I 3.1 4.5 8.2 11.0 4.3 4.5 4.5 7.1
4.6 4.3 8.8 7.2 6.3 7.6 6.6 6.2
Poison II 3.6 2.9 9.2 6.1 4.4 3.5 5.6 10.0
4.0 2.3 4.9 12.4 3.1 4.0 7.1 3.8
Poison III 2.2 2.1 3.0 3.7 2.3 2.5 3.0 3.6
1.8 2.3 3.8 2.9 2.4 2.2 3.1 3.3
(a) Conduct a two-way analysis of variance to test the effects of the two main factors and their interaction.
(b) Box and Cox (1964) analysed the reciprocals of the data, pointing out that the reciprocal of a survival time can
be interpreted as the rate of death. Conduct a two-way analysis of variance, and compare to the results of part
(a). Comment on how well the standard two-way ANOVA model fits and on the interaction in both analyses.

Problem 8
Officials of a small transit system with only five buses want to evaluate four types of tires with respect to wear.
Applying a randomized block design, they decided to put one tire of each type on each of the five buses. The tires are
run for 15,000 miles, after which the tread wear, in millimeters, is measured.
Bus Tire 1 Tire 2 Tire 3 Tire 4 Mean
1 9.1 17.1 20.8 11.8 14.7
2 13.4 20.3 28.3 16.0 19.5
3 15.6 24.6 23.7 16.2 20.0
4 11.0 18.2 21.4 14.1 16.2
5 12.7 19.8 25.1 15.8 18.4
Mean 12.4 20.0 23.9 14.8 17.8
(a) State the most appropriate null hypothesis by referring to a suitable parametric model. What are the main
assumptions of the parametric model?
(b) Using a non-parametric procedure test the null hypothesis of no difference between the four types of tires.
(c) What kind of external effects are controlled by the suggested randomised block design? How the wheel positions
for different tire types should be assigned for each of the five buses?

Problem 9
The accompanying data resulted from an experiment carried out to investigate whether yield from a certain chemical
process depended either on the formulation of a particular input or on mixer speed.
Speed
60 70 80 Means
1 189.7 185.1 189.0
1 188.6 179.4 193.0 187.03
1 190.1 177.3 191.1
Formulation
2 165.1 161.7 163.3
2 165.9 159.8 166.6 164.66
2 167.6 161.6 170.3
Means 177.83 170.82 178.88 175.84
A statistical computer package gave
ssA = 2253.44, ssB = 230.81, ssAB = 18.58, ssE = 71.87.
(a) Calculate estimates of the main effects.
(b) Does there appear to be interaction between Formulation and Speed? In which various ways interaction between
these two factors could manifest itself? Illustrate with graphs.
(c) Does yield appear to depend either on formulation or speed.
(d) Why is it important to inspect the normal QQ-plot of the 18 residuals?

Problem 10
Three different varieties of tomato (Harvester, Pusa Early Dwarf, and Ife No. 1) and four different plant densities (10,
20, 30, and 40 thousand plants per hectare) are being considered for planting in a particular region. To see whether
either variety or plant density affects yield, each combination of variety and plant density is used in three different
plots, resulting in the following data on yields:

Variety Density 10,000 Density 20,000 Density 30,000 Density 40,000 mean
Har 10.5, 9.2, 7.9 12.8, 11.2, 13.3 12.1, 12.6, 14.0 10.8, 9.1, 12.5 11.33
Ife 8.1, 8.6, 10.1 12.7, 13.7, 11.5 14.4, 15.4, 13.7 11.3, 12.5, 14.5 12.21
PED 16.1, 15.3, 17.5 16.6, 19.2, 18.5 20.8, 18.0, 21.0 18.4, 18.9, 17.2 18.13
mean 11.48 14.39 15.78 13.91 13.89

(a) Fill in the ANOVA table for the missing numbers

Source of variation ss df ms f
Varieties
Density
Interaction 8.03
Errors 38.04

(b) Clearly state the three pairs of hypotheses of interest. Test them using the normal theory approach.
(c) Estimate the noise size σ.

Problem 11
A population with mean µ consists of three subpopulations with means µ1 , µ2 , µ3 and the same variance σ 2 . Three
independent random samples, each of size n = 13, from the three subpopulation distributions gave the following sample
means and standard deviations:
Sample 1 Sample 2 Sample 3
Mean 6.3 5.6 6.0
SD 2.14 2.47 3.27

(a) Compute a stratified sample mean, assuming that the three subpopulation sizes have the ratios N1 : N2 : N3 =
0.3 : 0.2 : 0.5. Prove that this is an unbiased estimate for the population mean µ.
(b) Assume that all three subpopulation distributions are normal. Write down a simultaneous confidence interval
for the differences µ1 − µ2 , µ1 − µ3 , and µ2 − µ3 .
(c) Would you reject the null hypothesis of equality µ1 = µ2 = µ3 in this case?

Problem 12
In an experimental study two volunteer male subjects aged 23 and 25 underwent three treatments to compare a new
drug against no drug and placebo. Each volunteer had one treatment per day and the time order of these three
treatments was randomised.
(a) Comment on the details of the experimental design.
(b) Find the exact null distribution for the test statistic of an appropriate non-parametric test.

Chapter 9

Categorical data analysis

Categorical data appear in the form of a contingency table containing the sample counts for several, say k, mutually
exclusive categories. The corresponding population distribution can be modelled as Mn(1, π1 , . . . , πk ), an extension of
the Bernoulli distribution with k possible outcomes having probabilities (π1 , . . . , πk ) such that
π1 + . . . + πk = 1.
In this case, the sample counts for a random sample of size n have the multinomial distribution
(C1 , . . . , Ck ) ∼ Mn(n, π1 , . . . , πk ).
Consider a cross-classification for a pair of categorical factors A and B. If factor A has I levels and factor B has
J levels, then the population distribution of a single cross classification event has the form
b1 b2 ... bJ total
a1 π11 π12 ... π1J π1·
a2 π21 π22 ... π2J π2·
... ... ... ... ... ...
aI πI1 πI2 ... πIJ πI·
total π·1 π·2 ... π·J 1
Here
πij = P(A = ai , B = bj )
are the joint probabilities, and
πi· = P(A = ai ), π·j = P(B = bj )
are the marginal probabilities. The null hypothesis of independence claims that there is no relationship between factors
A and B
H0 : πij = πi· π·j for all pairs (i, j).
The conditional probabilities
πi|j = P(A = ai |B = bj ) = πij /π·j
are summarised in the next table
b1 b2 ... bJ
a1 π1|1 π1|2 ... π1|J
a2 π2|1 π2|2 ... π2|J
... ... ... ... ...
aI πI|1 πI|2 ... πI|J
total 1 1 ... 1
The null hypothesis of homogeneity states the equality of J population distributions
H0 : πi|j = πi· for all pairs (i, j).

9.1 Chi-squared test of homogeneity


Consider a table of I × J observed counts obtained from J independent random samples
sample 1 sample 2 ... sample J total counts
category a1 c11 c12 ... c1J c1
category a2 c21 c22 ... c2J c2
... ... ... ... ... ...
category aI cI1 cI2 ... cIJ cI
sample sizes n1 n2 ... nJ n

This model is described by J multinomial distributions

(C1j , . . . , CIj ) ∼ Mn(nj ; π1|j , . . . , πI|j ), j = 1, . . . , J.

Notice that the total number of degrees of freedom for J independent samples from I-dimensional multinomial distri-
butions is J(I − 1).
Under the hypothesis of homogeneity
H0 : πi|j = πi for all (i, j),
the maximum likelihood estimates of πi are the pooled sample proportions

π̂i = ci /n, i = 1, . . . , I.

These estimates consume (I − 1) degrees of freedom, since their sum is 1. Using these maximum likelihood estimates
we compute the expected cell counts
Eij = nj · π̂i = ci nj /n
and the chi-squared test statistic takes the form
x² = Σi Σj (cij − Eij )²/Eij = Σi Σj (cij − ci nj /n)²/(ci nj /n).

We reject H0 for larger values of x² using the approximate null distribution X² ≈ χ²df with
df = J(I − 1) − (I − 1) = (I − 1)(J − 1).

Example: small cars and personality

A car company studies how customers’ attitudes toward small cars relate to different personality types. The next
table summarises the observed counts:

cautious middle-of-the-road explorer total


favourable 79 58 49 186
neutral 10 8 9 27
unfavourable 10 34 42 86
total 99 100 100 299
After putting in the brackets the corresponding expected counts

cautious middle-of-the-road explorer total


favourable 79(61.6) 58(62.2) 49(62.2) 186
neutral 10(8.9) 8(9.0) 9(9.0) 27
unfavourable 10(28.5) 34(28.8) 42(28.8) 86
total 99 100 100 299

we find the chi-squared test statistic to be
x² = 27.24 with df = (3 − 1) · (3 − 1) = 4.
Since 27.24 is larger than the table value χ²4 (0.005) = 14.86, we reject the hypothesis of homogeneity at the 0.5% significance
level. Conclusion: the persons who saw themselves as cautious are more likely to express a favourable opinion of small
cars.
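A quick R check of this calculation, with the counts entered as a 3 × 3 matrix (column by column):
    counts <- matrix(c(79, 10, 10,  58, 8, 34,  49, 9, 42), nrow = 3,
                     dimnames = list(c("favourable", "neutral", "unfavourable"),
                                     c("cautious", "middle-of-the-road", "explorer")))
    chisq.test(counts)   # should reproduce x2 = 27.24 with df = 4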

9.2 Chi-squared test of independence


Suppose we have single random sample of size n and the sample counts for a cross-classification are summarised in
the following way.
b1 b2 ... bJ total
a1 c11 c12 ... c1J c1
a2 c21 c22 ... c2J c2
... ... ... ... ... ...
aI cI1 cI2 ... cIJ cI
total n1 n2 ... nJ n

In contrast to the previous setting of J independent samples, the total counts (n1 , n2 , . . . , nJ ) are random outcomes
of the cross-classification of the single sample of size n. These data are modelled by the joint multinomial distribution

(C11 , . . . , CIJ ) ∼ Mn(n; π11 , . . . , πIJ ),

characterised by IJ − 1 degrees of freedom. Using such data we would like to test the null hypothesis of independence

H0 : πij = πi· π·j for all pairs (i, j).

The maximum likelihood estimates of πi· and π·j are
π̂i· = ci /n ,    π̂·j = nj /n .
Under the null hypothesis of independence the expected cell counts
Eij = nπ̂ij = nπ̂i· π̂·j = ci nj /n
are exactly the same as for the homogeneity test, with the same number of degrees of freedom

df = (IJ − 1) − (I − 1 + J − 1) = (I − 1)(J − 1).

The chi-squared tests of homogeneity and independence have the same test rejection rule.

Example: marital status and educational level


A sample is drawn from a population of married women. Each observation is placed in a 2 × 2 contingency table
depending on the woman’s educational level and her marital status.

Married only once Married more than once Total


College 550 (523.8) 61(87.2) 611
No college 681(707.2) 144(117.8) 825
Total 1231 205 1436

We test H0 : no relationship between the marital status and the education level, by applying the chi-squared test of
independence. Using the expected counts given in the brackets, we find the chi-squared test statistic to be x² = 16.01.
With df = 1 we can use the normal distribution table, since Z ∼ N(0, 1) is equivalent to Z² ∼ χ²1 . Thus

P(X² > 16.01|H0 ) ≈ P(|Z| > 4.001) = 2(1 − Φ(4.001)).

It follows that the p-value of the test is less than 0.1%, and we reject the null hypothesis of independence.
Conclusion: the college educated women, once they marry, are less likely to divorce.
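A short R check; note that for a 2 × 2 table chisq.test applies a continuity correction by default, so correct = FALSE is needed to reproduce the value 16.01:
    tab <- matrix(c(550, 681, 61, 144), nrow = 2,
                  dimnames = list(c("College", "No college"),
                                  c("Married once", "Married more than once")))
    chisq.test(tab, correct = FALSE)   # X-squared close to 16.01, df = 1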

9.3 Matched-pairs designs


We start with an illuminating example concerning Hodgkin's disease, which has a very low incidence of 2 in 10 000.

Case study: Hodgkin’s disease and tonsillectomy


To test a possible influence of tonsillectomy on the onset of Hodgkin’s disease, researchers use cross-classification data
of the form
A Ā total
E c11 c12 c1
Ē c21 c22 c2
total n1 n2 n
where the four counts distinguish among sampled individuals who are

either affected or unaffected (A or Ā),


and either exposed to tonsillectomy or non-exposed (E or Ē).

There are three possible sampling designs to choose from:

• a single random sample would give counts like (0 0; 0 n), as the majority of people are unaffected and non-exposed,

• a prospective study: take an E-sample of size c1 and a control Ē-sample of size c2 , then watch who gets affected;
this would give counts like (0 c1 ; 0 c2 ), because of the very low incidence of Hodgkin's disease,

• a retrospective study: take an A-sample of size n1 and a control Ā-sample of size n2 , then find out which of them
had been exposed to tonsillectomy in the past; this would give informative counts.
Two retrospective case-control studies produced opposite results. Study (a) by Vianna, Greenwald, and Davis
(1971) gave the next counts.
A Ā total
E 67 43 110
Ē 34 64 98
total 101 107 208
The chi-squared test of homogeneity was applied, which resulted in the test statistic x²a = 14.29. With df = 1, the
p-value was found to be very small:
P(X²a ≥ 14.29|H0 ) ≈ 2(1 − Φ(√14.29)) = 0.0002.

Study (b) by Johnson and Johnson (1972) was summarised by the table
A Ā total
E 41 33 74
Ē 44 52 96
total 85 85 170
and the chi-squared test of homogeneity was applied. With x²b = 1.53 and df = 1, the p-value was found to be
strikingly different:
P(X²b ≥ 1.53|H0 ) ≈ 2(1 − Φ(√1.53)) = 0.215.
It turned out that study (b) was based on a matched-pairs design violating the assumptions of the chi-squared
test of homogeneity. The sample consisted of 85 sibling pairs of the same sex and close age: one of the siblings was
affected, the other not. A proper summary of the study (b) sample distinguishes among four groups of sibling pairs:

(AE, ĀE), (AE, ĀĒ), (AĒ, ĀE), (AĒ, ĀĒ).

The next table lists the corresponding sample counts.

ĀE ĀĒ total


AE 26 15 41
AĒ 7 37 44
total 33 52 85

Notice that this contingency table contains the information of the previous one in terms of the marginal counts.
An appropriate test in this setting is McNemar’s test introduced in Section 7.4. For the data of study (b), the
McNemar test statistic is
x² = (7 − 15)²/(7 + 15) = 2.91,
giving the p-value
P(X² ≥ 2.91|H0 ) ≈ 2(1 − Φ(√2.91)) = 0.09.
The correct p-value of 0.09 is much smaller than that of 0.215 computed using the test of homogeneity. Since there
are only 7 + 15 = 22 informative observations, more data is required.
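The same test is available in R; a sketch using the sibling-pair counts (correct = FALSE switches off the continuity correction and reproduces the value 2.91):
    pairs <- matrix(c(26, 7, 15, 37), nrow = 2)   # rows AE, AĒ; columns ĀE, ĀĒ
    mcnemar.test(pairs, correct = FALSE)          # X-squared = (15 - 7)^2 / 22 = 2.91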

9.4 Odds ratios


The probability of a random event A is a number P(A) between 0 and 1. The odds of the event A are defined as
odds(A) = P(A)/P(Aᶜ) = P(A)/(1 − P(A)),
taking non-negative values. Clearly,
P(A) = odds(A)/(1 + odds(A)),
and for small P(A), the values of odds(A) and P(A) are close to each other. Conditional odds for A given B are defined
similarly:
odds(A|B) = P(A|B)/P(Aᶜ|B) = P(AB)/P(Aᶜ B).

Defining the odds ratio for a pair of events (A, B) by
∆AB = odds(A|B)/odds(A|Bᶜ),
we find that
∆AB = P(AB)P(Aᶜ Bᶜ) / (P(Aᶜ B)P(ABᶜ)),
implying
∆AB = ∆BA ,    ∆ABᶜ = 1/∆AB .
The odds ratio is a measure of dependence between a pair of random events. It has the following properties:
if ∆AB = 1, then the events A and B are independent,
if ∆AB > 1, then P(A|B) > P(A|Bᶜ) so that B increases the probability of A,
if ∆AB < 1, then P(A|B) < P(A|Bᶜ) so that B decreases the probability of A.

Odds ratios for case-control studies


To illustrate the use of odds ratios in statistical inference, return to the case-control study by Vianna, Greenwald, and
Davis (1971) on Hodgkin's disease. Based on the counts for the two independent samples
A Ā total
E 67 43 110
Ē 34 64 98
total 101 107 208
the chi-squared test of homogeneity rejected the null hypothesis of no relationship between tonsillectomy and the
disease. However, it would be useful to have a measure of the strength of this relationship. Such a measure is provided
by the odds ratio
∆AE = odds(A|E)/odds(A|Ē) = odds(E|A)/odds(E|Ā),
if we were able to estimate it from the data. As we show next, we can estimate the odds ratio by the ratio of the
products of the counts on the two diagonals,
∆̂AE = (67 · 64)/(34 · 43) = 2.93,
and conclude that tonsillectomy increases the odds for the onset of Hodgkin’s disease by a factor of 2.93.
To this end, we refer to two population distributions underlying the two retrospective samples
A Ā
E P(E|A) P(E|Ā)
Ē P(Ē|A) P(Ē|Ā)
total 1 1
estimated by
A Ā
E 67/101 43/107

Ē 34/101 64/107

total 1 1
Thus the odds ratio
∆AE = (P(E|A)/P(Ē|A)) · (P(Ē|Ā)/P(E|Ā))
is consistently estimated by
∆̂AE = (67/101)/(34/101) · (64/107)/(43/107) = (67 · 64)/(34 · 43) = 2.93.

9.5 Exercises
Problem 1
Adult-onset diabetes is known to be highly genetically determined. A study was done comparing frequencies of a
particular allele in a sample of such diabetics and a sample of non-diabetics.
diabetic normal total
Bb or bb 12 4 16
BB 39 49 88
total 51 53 104
Is there a relationship between the allele frequencies and the adult-onset diabetes?

Problem 2
Overfield and Klauber (1980) published the following data on the incidence of tuberculosis in relation to blood groups
in a sample of Eskimos. Is there any association of the disease and blood group within the ABO system or within the
MN system?
O A AB B MM MN NN
moderate 7 5 3 13 moderate 21 6 1
minimal 27 32 8 18 minimal 54 27 5
not present 55 50 7 24 not present 74 51 11

Problem 3
It is conventional wisdom in military squadrons that pilots tend to father more girls than boys. Snyder (1961) gathered
data for military fighter pilots. The sex of the pilots’ offspring were tabulated for three kinds of flight duty during the
month of conception, as shown in the following table.
girl boy
flying fighter 51 38
flying transport 14 16
not flying 38 46

(a) Is there any significant difference between the three groups?


(b) In the United States in 1950, 105.37 males were born for every 100 females. Are the data consistent with this
sex ratio?

Problem 4
A randomised double-blind experiment compared the effectiveness of several drugs in ameliorating postoperative
nausea. All patients were anesthetized with nitrous oxide and ether. The following table shows the incidence of
nausea during the first four hours for each of several drugs and a placebo (Beecher 1959).
number of patients incidence of nausea
placebo 165 95
chlorpromazine 152 52
dimenhydrinate 85 52
pentobarbital (100 mg) 67 35
pentobarbital (150 mg) 85 37
Compare the drugs to each other and to the placebo.

Problem 5
In a study of the relation of blood type to various diseases, the following data were gathered in London and Manchester
(Woolf 1955).
London Control Peptic Ulcer Manchester Control Peptic Ulcer
Group A 4219 579 Group A 3775 246
Group O 4578 911 Group O 4532 361
First, consider the two tables separately. Is there a relationship between blood type and propensity to peptic ulcer?
If so, evaluate the strength of the relationship. Are the data from London and Manchester comparable?

Problem 6
Records of 317 patients at least 48 years old who were diagnosed as having endometrial carcinoma were obtained from
two hospitals (Smith et al. 1975). Matched controls for each case were obtained from the two institutions: the controls
had cervical cancer, ovarian cancer, or carcinoma of the vulva. Each control was matched by age at diagnosis (within
four years) and year of diagnosis (within two years) to a corresponding case of endometrial carcinoma.
The following table gives the number of cases and controls who had taken estrogen for at least 6 months prior to
the diagnosis of cancer.

Controls: estrogen used Controls: estrogen not used Total


Cases: estrogen used 39 113 152
Cases: estrogen not used 15 150 165
Total 54 263 317

(a) Is there a significant relationship between estrogen use and endometrial cancer?

(b) This sort of design, called a retrospective case-control study, is frequently used in medical investigations where
a randomised experiment is not possible. Do you see any possible weak points in a retrospective case-control
design?

Problem 7
A psychological experiment was done to investigate the effect of anxiety on a person’s desire to be alone or in company
(Lehman 1975). A group of 30 subjects was randomly divided into two groups of sizes 13 and 17. The subjects were
told that they would be subjected to some electric shocks, but one group (high-anxiety) was told that the shocks
would be quite painful and the other group (low-anxiety) was told that they would be mild and painless. Both groups
were told that there would be a 10-min wait before the experiment began, and each subject was given the choice of
waiting alone or with the other subjects. The following are the results:
Wait Together Wait Alone Total
High-Anxiety 12 5 17
Low-Anxiety 4 9 13
Total 16 14 30
Use Fisher’s exact test to test whether there is a significant difference between the high- and low-anxiety groups. What
is a reasonable one-sided alternative, if any?

Problem 8
Hill and Barton (2005) asked the question: red against blue outfits - does it matter in combat sports? Although other
colors are also present in animal displays, it is specifically the presence and intensity of red coloration that correlates
with male dominance and testosterone levels. Increased redness during aggressive interactions may reflect relative
dominance.
In the 2004 Olympic Games, contestants in four combat sports were randomly assigned red and blue outfits. The
winner counts in different sports

Red Blue Total


Boxing 148 120 268
Freestyle wrestling 27 24 51
Greco-Roman wrestling 25 23 48
Tae Kwon Do 45 35 80
Total 245 202 447

Is there any evidence that wearing red is more favourable in some of the sports than others?

Problem 9
Suppose that a company wishes to examine the relationship of gender to job satisfaction, grouping job satisfaction into
four categories: very satisfied, somewhat satisfied, somewhat dissatisfied, and very dissatisfied. The company plans to
ask the opinion of 100 employees. Should you, the company’s statistician, carry out a chi-square test of independence
or a test of homogeneity?

Problem 10
Questions concerning hypotheses testing methodology. Try to give detailed answers.
(a) Consider a hypothetical study of the effects of birth control pills. In such a case, it would be impossible to
assign women to a treatment or a placebo at random. However, a non-randomized study might be conducted
by carefully matching control to treatments on such factors as age and medical history.
The two groups might be followed up on for some time, with several variables being recorded for each subject
such as blood pressure, psychological measures, and incidences of various problems. After termination of the
study, the two groups might be compared on each of these many variables, and it might be found, say, that there
was a "significant difference" in the incidence of melanoma.
What is a common problem with such "significant findings"?
(b) You analyse cross-classification data summarized in a two by two contingency table. You wanted to apply the
chi-square test, but it showed that one of the expected counts was below 5. What alternative statistical test may
you try applying?
(c) Why are tests like the rank sum test, the Friedman test, and the Kruskal-Wallis test often called distribution-free tests?

97
Problem 11
A public policy polling group is investigating whether people living in the same household tend to make independent
political choices. They select 200 homes where exactly three voters live. The residents are asked separately for their
opinion ("yes" or "no") on a city charter amendment. The results of the survey are summarized in the table:
Number saying "yes" 0 1 2 3
Frequency 30 56 73 41
Based on these data can we claim that opinions are formed independently?

Problem 12
Verify that the hypothesis of homogeneity

H0 : πi|j = πi· for all pairs (i, j),

is mathematically equivalent to the hypothesis of independence

H0 : πij = πi· π·j for all pairs (i, j).

Problem 13
A study of the relationship between facility conditions at gasoline stations and aggressiveness in the pricing of gasoline
is based on n = 441 stations.
Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 15 17 56
Standard condition 52 73 80 205
Modern condition 58 86 36 180
Total 134 174 133 441

(a) Suggest a parametric model for the data and write down the corresponding likelihood function.

(b) What is a relevant null hypothesis for the data?


(c) Properly analyse the data and draw your conclusions.

Problem 14
In a controlled clinical trial which began in 1982 and ended in 1987, more than 22000 physicians participated. The
participants were randomly assigned to two groups: Aspirin and Placebo. The aspirin group had been taking 325 mg of
aspirin every second day. At the end of the trial, the number of participants who suffered from myocardial infarctions was
assessed.
MyoInf No MyoInf Total
Aspirin 104 10933 11037
Placebo 189 10845 11034
The popular measure in assessing the results of clinical trials is the Risk Ratio

RR = Risk of Aspirin / Risk of Placebo = (104/11037) / (189/11034) = 0.55.

(a) How would you interpret the obtained value of the risk ratio? What ratio of conditional probabilities is estimated
by RR?
(b) Is the observed value of RR significantly different from 1?

98
Chapter 10

Multiple regression

Karl Pearson collected the heights of 1,078 fathers and their full-grown sons in England, circa 1900. The following
scatter diagram illustrates the data, with one dot for each father-son pair. The straight line on the left panel is the
so called regression line, which indicates the average son-height for the various father-heights. Using this line one can
predict the average son-height for a given father-height. In this sense the father-height x serves as a predictor variable
for the response variable y, the son-height.

The right panel adds several important insights on this dataset. It shows that the regression line goes through the
middle point (x̄, ȳ) of the scatter plot, where x̄ is the sample mean of the father-heights and ȳ is the sample mean of
the son-heights. The cross going through the central point (x̄, ȳ) divides the scatter plot into four quadrants. We see
that the majority of the dots lie in the first and the third quadrants, reflecting the fact that the heights of the father
and the son are positively correlated.
Consider now only the dots to the right of the vertical line, that is, when the fathers are taller than average. For
this part of the dots, we see that on average the son is shorter than his father. To the left of the vertical line the
relation is the opposite: on average the son is taller than his father. Francis Galton called this remarkable phenomenon
regression to mediocrity.

10.1 Simple linear regression model


According to the simple linear regression model the response

Y = β0 + β1 x + σZ, Z ∼ N(0, 1),

is the linear function of the predictor variable x plus a normally distributed noise of size σ. The key assumption of
homoscedasticity requires that the noise size is independent of the x-value. Whenever this assumption is violated, the
situation is described by the term heteroscedasticity.

99
Under this model, the data consist of n pairs

(x1 , y1 ), . . . , (xn , yn ),

where
yi = β0 + β1 xi + ei , i = 1, . . . , n,
involves a vector (e1 , . . . , en ) of independent realisations of a random variable with distribution N(0, σ).

Maximum likelihood estimates


The corresponding likelihood is a function of the three-dimensional parameter θ = (β0, β1, σ²):

L(θ) = ∏_{i=1}^n (1/(√(2π)σ)) exp{−(yi − β0 − β1xi)²/(2σ²)} = Cσ^(−n) exp{−(1/(2σ²)) Σ_{i=1}^n ei²},

where C = (2π)^(−n/2). This implies the following expression for the log-likelihood function l(θ) = ln L(θ):

l(θ) = ln C − n ln σ − (Σ_{i=1}^n ei²)/(2σ²).

Observe that ei = yi − β0 − β1xi, implying

n⁻¹ Σ_{i=1}^n ei² = β0² + 2β0β1x̄ − 2β0ȳ − 2β1·xy + β1²·x2 + y2,

where the five summary statistics

x̄ = (x1 + ... + xn)/n,   ȳ = (y1 + ... + yn)/n,   x2 = (x1² + ... + xn²)/n,   y2 = (y1² + ... + yn²)/n,   xy = (x1y1 + ... + xnyn)/n

form the set of sufficient statistics.
To obtain the maximum likelihood estimates θ̂ = (b0 , b1 , σ̂ 2 ) of θ = (β0 , β1 , σ 2 ) compute the derivatives

∂l/∂β0 = −(n/σ²)(β0 + β1x̄ − ȳ),
∂l/∂β1 = −(n/σ²)(β0x̄ − xy + β1·x2),
∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n ei²,

and set them all equal to zero. As a result we get a system of three equations

b0 + b1x̄ = ȳ,   b0x̄ + b1·x2 = xy,   σ̂² = ssE/n,
where

ssE = ê1² + ... + ên²

is the error sum of squares defined in terms of the residuals êi = yi − b0 − b1xi.
Solving the first two equations we obtain the maximum likelihood estimates for the slope and intercept

b1 = (xy − x̄ȳ)/(x2 − x̄²),   b0 = ȳ − b1x̄.
Notice that the estimates (b0, b1) of parameters (β0, β1) are obtained by minimising the sum of squares

ssE = min over (β0, β1) of Σ_{i=1}^n (yi − β0 − β1xi)²,

and for this reason, (b0, b1) are called the least squares estimates of (β0, β1).
The maximum likelihood estimate σ̂² = ssE/n is a biased but asymptotically unbiased estimate of σ². An unbiased
estimate of σ² is given by

s² = ssE/(n − 2).
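The least squares computation above is easy to carry out numerically. The following is a minimal sketch (not part of the original text), assuming Python with numpy and a small hypothetical data set; it evaluates the five summary statistics and then b1, b0, σ̂² and s² exactly by the formulas above.

import numpy as np

# hypothetical paired sample (any (x_i, y_i) data would do)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

# the five summary statistics
xbar, ybar = x.mean(), y.mean()
x2bar, y2bar, xybar = (x**2).mean(), (y**2).mean(), (x*y).mean()

# least squares (= maximum likelihood) estimates of slope and intercept
b1 = (xybar - xbar*ybar) / (x2bar - xbar**2)
b0 = ybar - b1*xbar

# residuals, error sum of squares and the two variance estimates
e_hat = y - b0 - b1*x
ssE = np.sum(e_hat**2)
sigma2_mle = ssE / n        # biased maximum likelihood estimate of sigma^2
s2 = ssE / (n - 2)          # unbiased estimate of sigma^2
print(b0, b1, sigma2_mle, s2)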

100
Residuals
For each predictor value xi define the predicted response by

ŷi = b0 + b1 xi ,

so that the corresponding residual is given by the difference êi = yi − ŷi . The residuals (ê1 , . . . , ên ) are linearly
connected via
ê1 + . . . + ên = 0, x1 ê1 + . . . + xn ên = 0, ŷ1 ê1 + . . . + ŷn ên = 0,
implying that êi are uncorrelated with xi and êi are uncorrelated with ŷi . Thus under the simple linear regression
model, the scatter plot of the residuals êi versus xi should look like a horizontal blur. If the linear model is not valid,
it will show up in a somewhat bowed shape of the (xi , êi ) scatter plot. In some cases, such a non-linearity problem
can be fixed by some kind of a log-transformation of the data.
The residuals êi are realisations of random variables Êi having normal distributions with zero means, whose
variances and covariances are given by
Var(Êi) = σ²(1 − cii/(n − 1)),   Cov(Êi, Êj) = −σ²·cij/(n − 1),

where

cij = Σ_{k=1}^n (xi − xk)(xj − xk) / (n·sx²).
For larger n, the random variables Êi can be viewed as independent having the same N(0, σ) distribution.

Sample correlation coefficient


Define the sample variances and the sample covariance by
sx² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²,   sy² = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)²,   sxy = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)(yi − ȳ).

Then the sample correlation coefficient between the predictor and response variables is defined as

r = sxy/(sx·sy).

Noticing that

b1 = r·sy/sx,

we find that the fitted regression line y = b0 + b1x can be written in the form

y = ȳ + r·(sy/sx)(x − x̄),

or equivalently

(y − ȳ)/sy = r·(x − x̄)/sx.
The last formula is most appealing to the intuition: it claims that the standardised versions of the predictor and the
response are connected through the regression factor.

Coefficient of determination
Using the formula for the predicted responses

ŷi = ȳ + b1(xi − x̄),   b1 = r·sy/sx,

and the relation

yi − ȳ = (ŷi − ȳ) + êi,

we obtain a decomposition of the sum of squares

ssT = ssR + ssE,

where

ssT = Σi (yi − ȳ)² = (n − 1)sy²

is the total sum of squares, and

ssR = Σi (ŷi − ȳ)² = (n − 1)b1²sx² = (n − 1)r²sy²

is the regression sum of squares. The obtained relations yield

ssR/ssT = r²,   ssE/ssT = 1 − r².

These equalities explain why the squared sample correlation coefficient r² is called the coefficient of determination.
The coefficient of determination r² is the proportion of variation in the response variable explained by the variation
of the predictor variable.

The coefficient of determination r² has a more intuitive meaning than the sample correlation coefficient r.

On the other hand, we have

ssE = ssT(1 − r²) = (n − 1)sy²(1 − r²),

implying the following useful expression for the unbiased estimate of σ²:

s² = ((n − 1)/(n − 2))·sy²(1 − r²).

10.2 Confidence intervals and hypothesis testing


The least squares estimators (b0, b1) are unbiased and consistent. Due to the normality assumption we have the
following exact distributions for the underlying random variables (B0, B1):

B0 ∼ N(β0, σ0),   σ0² = σ²·Σxi² / (n(n − 1)sx²),   sb0² = s²·Σxi² / (n(n − 1)sx²),   (B0 − β0)/SB0 ∼ tn−2,
B1 ∼ N(β1, σ1),   σ1² = σ² / ((n − 1)sx²),   sb1² = s² / ((n − 1)sx²),   (B1 − β1)/SB1 ∼ tn−2.

There is a weak correlation between the two estimators:

Cov(B0, B1) = −σ²x̄ / ((n − 1)sx²),

which is negative if x̄ > 0, and positive if x̄ < 0.

For β0 and β1, we get two exact 100(1 − α)% confidence intervals Iβi = bi ± tn−2(α/2) · sbi.

For i = 0 or i = 1 and a given value β*, one could be interested in testing the null hypothesis H0: βi = β*.
Use the test statistic

t = (bi − β*)/sbi,

which is a realisation of a random variable T that has the exact null distribution T ∼ tn−2.

Two important examples of hypothesis testing:

1. The model utility test is built around the null hypothesis

H0: β1 = 0,

stating that there is no relationship between the predictor variable x and the response y. The corresponding
test statistic, often called the t-value,

t = b1/sb1,

has tn−2 as the null distribution (a small numerical sketch of this test follows the list).
2. The zero-intercept test aims at

H0: β0 = 0.

Its t-value t = b0/sb0 has the same null distribution tn−2.
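As a sketch of how the model utility test can be carried out numerically (illustrative code, not from the text; it assumes Python with numpy and scipy), the t-value b1/sb1 is compared with the tn−2 distribution:

import numpy as np
from scipy import stats

def model_utility_test(x, y):
    """Two-sided t-test of H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar)**2)             # equals (n-1)*s_x^2
    b1 = np.sum((x - xbar)*(y - ybar)) / sxx
    b0 = ybar - b1*xbar
    ssE = np.sum((y - b0 - b1*x)**2)
    s2 = ssE / (n - 2)                      # unbiased estimate of sigma^2
    sb1 = np.sqrt(s2 / sxx)                 # standard error of b1
    t_value = b1 / sb1
    p_value = 2 * stats.t.sf(abs(t_value), df=n - 2)
    return t_value, p_value

# usage: t_value, p_value = model_utility_test(x, y) for paired numpy arrays x, y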

102
Intervals for individual observations
Suppose a sample of size n resulted in the least squares estimates (b0 , b1 ), the noise size estimate s, the sample mean x̄
and the sample standard deviation sx for the predictor values in the sample. We are going to make a new measurement
with the predictor value x = xp and wish to say something on the response value

yp = β0 + β1 xp + σzp ,

where zp is generated by the N(0, 1) distribution independently of the available sample (x1 , y1 ), . . . , (xn , yn ). The
expected value of Yp ,
µp = β0 + β1 xp
is estimated by
µ̂p = b0 + b1 xp .
The standard error σp of µ̂p is computed as the square root of

Var(B0 + B1xp) = Var(B0) + xp²Var(B1) + 2xpCov(B0, B1) = σ²/n + (σ²/(n − 1))·((xp − x̄)/sx)².

This leads to the following exact 100(1 − α)% confidence interval for µp:

Iµp = b0 + b1xp ± tn−2(α/2) · s·√(1/n + (1/(n − 1))·((xp − x̄)/sx)²).

The new feature of this section is the so called prediction interval for the response value yp, which takes into account
the noise factors accompanying the new measurement.

Exact 100(1 − α)% prediction interval: Iyp = b0 + b1xp ± tn−2(α/2) · s·√(1 + 1/n + (1/(n − 1))·((xp − x̄)/sx)²).

This prediction interval has wider limits since it reflects the uncertainty due to the noise factors. The larger factor
appearing under the square root comes from the variance formula

Var(Yp − µ̂p) = Var(µp + σZp − µ̂p) = σ² + Var(µ̂p) = σ²(1 + 1/n + (1/(n − 1))·((xp − x̄)/sx)²).

The next figure illustrates the relationship between the 95% confidence and prediction intervals (Iµp, Iyp). Two
features are important to be aware of. Firstly, the prediction interval is broader: as n → ∞, the width
of the confidence interval goes to zero, while the half-width of the 95% prediction interval converges to 1.96σ. Secondly,
both intervals get wider as the value xp is placed further from the sample mean x̄ of the predictor values.
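A small sketch of the two interval formulas, assuming Python with numpy and scipy; it returns the exact 100(1 − α)% confidence interval for µp and the prediction interval for yp at a new predictor value xp.

import numpy as np
from scipy import stats

def intervals_at(xp, x, y, alpha=0.05):
    """Confidence interval for mu_p and prediction interval for y_p at x = xp."""
    n = len(x)
    xbar = x.mean()
    sx = np.sqrt(np.sum((x - xbar)**2) / (n - 1))          # sample st. deviation of predictor
    b1 = np.sum((x - xbar)*(y - y.mean())) / ((n - 1)*sx**2)
    b0 = y.mean() - b1*xbar
    s = np.sqrt(np.sum((y - b0 - b1*x)**2) / (n - 2))      # noise size estimate
    t = stats.t.ppf(1 - alpha/2, df=n - 2)
    mu_hat = b0 + b1*xp
    half_ci = t * s * np.sqrt(1/n + ((xp - xbar)/sx)**2 / (n - 1))
    half_pi = t * s * np.sqrt(1 + 1/n + ((xp - xbar)/sx)**2 / (n - 1))
    return (mu_hat - half_ci, mu_hat + half_ci), (mu_hat - half_pi, mu_hat + half_pi)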

10.3 Multiple linear regression


An important extension of the simple linear regression model involving a single predictor x is the multiple regres-
sion model involving p − 1 predictors with an arbitrary p ≥ 2. The corresponding data set consists of n vectors
(xi,1 , . . . , xi,p−1 , yi ) with n > p, such that

y1 = β0 + β1 x1,1 + . . . + βp−1 x1,p−1 + e1 ,


...
yn = β0 + β1 xn,1 + . . . + βp−1 xn,p−1 + en ,

where e1 , . . . , en are independently generated by the N(0, σ) distribution. In terms of the column vectors

y = (y1 , . . . , yn )⊺ , β = (β0 , . . . , βp−1 )⊺ , e = (e1 , . . . , en )⊺ ,

the multiple regression model is compactly described by the relation
y = Xβ + e,
where X is the so called design matrix assumed to have rank p:
 
1 x1,1 . . . x1,p−1
X =  ... ... ... ... .
1 xn,1 . . . xn,p−1
The machinery developed for the simple linear regression model works well for the multiple regression. The least
squares estimates b = (b0 , . . . , bp−1 )⊺ are obtained as
b = (X⊺ X)−1 X⊺ y,
and yield the next formula for the predicted responses ŷ = Xb:
ŷ = Py, where P = X(X⊺ X)−1 X⊺ .
Turning to the random vector B behind the least squares estimates b, we find that
E(B) = β.
Furthermore, the covariance matrix, the p × p matrix with elements Cov(Bi , Bj ), is given by
E{(B − β)(B − β)⊺ } = σ 2 (X⊺ X)−1 .
The vector of residuals
ê = y − ŷ = (I − P)y
has a zero mean vector and a covariance matrix σ 2 (I − P).

An unbiased estimate of σ² is given by s² = ssE/(n − p), where ssE = ê1² + ... + ên².

Denote by d0², ..., dp−1² the diagonal elements of the matrix (X⊺X)⁻¹. Then the standard error of bj is computed as
sbj = s·dj, entailing the exact sampling distributions for the least squares estimates

(Bj − βj)/SBj ∼ tn−p,   j = 0, 1, ..., p − 1.
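The matrix formulas translate directly into code. The sketch below (an illustration under the stated model, not software from the text) uses numpy to compute b = (X⊺X)⁻¹X⊺y, the unbiased estimate s² and the standard errors sbj = s·dj.

import numpy as np

def fit_multiple_regression(X, y):
    """Least squares fit of y = X b + e; returns b, s^2 and the standard errors s_bj."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                   # least squares estimates
    resid = y - X @ b                       # residual vector (I - P)y
    s2 = np.sum(resid**2) / (n - p)         # unbiased estimate of sigma^2
    se = np.sqrt(s2 * np.diag(XtX_inv))     # standard errors s_bj = s * d_j
    return b, s2, se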

Special case: p = 2
For the simple linear regression case, the design matrix X has dimensions n × 2, so that

    X⊺ = ( 1   ...  1  )        X⊺X = ( n              x1 + ... + xn  )  =  n ( 1    x̄  )
         ( x1  ...  xn )               ( x1 + ... + xn  x1² + ... + xn² )       ( x̄    x2 )

It follows that

    (X⊺X)⁻¹ = 1/(n(x2 − x̄²)) · ( x2   −x̄ )        X⊺y = n ( ȳ  )
                                ( −x̄    1 )                ( xy )

and the formulas for b0 and b1 are recovered in matrix form:

    b = (X⊺X)⁻¹X⊺y = 1/(x2 − x̄²) · ( x2   −x̄ ) ( ȳ  )
                                    ( −x̄    1 ) ( xy )

Case study: catheter length


Doctors want predictions on heart catheter length depending on child’s height and weight. The data consist of n = 12
observations for the distance to pulmonary artery:

Height (in) Weight (lb) Length (cm)


42.8 40.0 37.0
63.5 93.5 49.5
37.5 35.5 34.5
39.5 30.0 36.0
45.5 52.0 43.0
38.5 17.0 28.0
43.0 38.5 37.0
22.5 8.5 20.0
37.0 33.0 33.5
23.5 9.5 30.5
33.0 21.0 38.5
58.0 79.0 47.0

104
For this example the response variable y is the catheter length, and the two predictors are x1, the height of a patient,
and x2, the same patient's weight. In this case p = 3, with the response vector and the design matrix given by

   
    y = ( 20.0 )        X = ( 1  22.5   8.5 )
        ( 30.5 )            ( 1  23.5   9.5 )
        ( 38.5 )            ( 1  33.0  21.0 )
        ( 33.5 )            ( 1  37.0  33.0 )
        ( 34.5 )            ( 1  37.5  35.5 )
        ( 28.0 )            ( 1  38.5  17.0 )        (10.1)
        ( 36.0 )            ( 1  39.5  30.0 )
        ( 37.0 )            ( 1  42.8  40.0 )
        ( 37.0 )            ( 1  43.0  38.5 )
        ( 43.0 )            ( 1  45.5  52.0 )
        ( 47.0 )            ( 1  58.0  79.0 )
        ( 49.5 )            ( 1  63.5  93.5 )

Geometrically, the task is to fit a plane

y = b0 + b1x1 + b2x2

to the three-dimensional scatter plot of n = 12 points. Applying the formulas given above, we compute four point
estimates b0 = 21, b1 = 0.20, b2 = 0.19, s = 3.9, and three standard errors sb0 = 8.8, sb1 = 0.36, sb2 = 0.17. Using the
t-distribution with n − p = 9 degrees of freedom, we compute three 95% confidence intervals

Iβ0 = 21 ± 2.2622 · 8.8 = 21 ± 19.91,


Iβ1 = 0.20 ± 2.2622 · 0.36 = 0.20 ± 0.81,
Iβ2 = 0.19 ± 2.2622 · 0.17 = 0.19 ± 0.38.

We see that the last two intervals cover zero.
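For readers who want to check the numbers, here is a hedged numerical sketch (assuming Python with numpy and scipy) that applies the normal-equation formulas to the catheter data; the printed estimates and 95% intervals should come out close to those quoted above, up to rounding.

import numpy as np
from scipy import stats

height = np.array([42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43.0, 22.5, 37.0, 23.5, 33.0, 58.0])
weight = np.array([40.0, 93.5, 35.5, 30.0, 52.0, 17.0, 38.5,  8.5, 33.0,  9.5, 21.0, 79.0])
length = np.array([37.0, 49.5, 34.5, 36.0, 43.0, 28.0, 37.0, 20.0, 33.5, 30.5, 38.5, 47.0])

n, p = len(length), 3
X = np.column_stack([np.ones(n), height, weight])    # design matrix with columns 1, x1, x2
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ length                            # point estimates (b0, b1, b2)
s2 = np.sum((length - X @ b)**2) / (n - p)
se = np.sqrt(s2 * np.diag(XtX_inv))                   # standard errors
t = stats.t.ppf(0.975, df=n - p)                      # 2.2622 for df = 9
for bj, sej in zip(b, se):
    print(f"{bj:6.2f} +/- {t*sej:5.2f}")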

10.4 Coefficients of multiple determination and model utility tests


The coefficient of multiple determination can be computed similarly to the coefficient of determination R² for the simple
linear regression model as

R² = 1 − ssE/ssT,   ssT = (n − 1)sy².

The problem with R² is that it increases with p even if totally irrelevant predictor variables are added to the model.
To punish for irrelevant variables it is better to use the adjusted coefficient of multiple determination

Ra² = 1 − ((n − 1)/(n − p))·(ssE/ssT).

The adjustment factor (n − 1)/(n − p) reduces the coefficient of determination as p gets larger. Representing

Ra² = 1 − s²/sy²,

we arrive at the following transparent interpretation of the adjusted coefficient of determination. Since s²/sy² is the
proportion of the variance in y explained by the noise, the adjusted R² is the proportion of the variance in y explained
by the regression model.
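A tiny sketch of the two formulas (illustrative only; the numbers in the usage comment are hypothetical):

def r2_and_adjusted(ssE, ssT, n, p):
    """Coefficient of multiple determination and its adjusted version."""
    r2 = 1 - ssE/ssT
    r2_adj = 1 - (n - 1)/(n - p) * ssE/ssT
    return r2, r2_adj

# usage (hypothetical inputs): r2_and_adjusted(ssE=137.0, ssT=718.0, n=12, p=3)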

Example: flow rate vs stream depth


The data in the following table were gathered for an environmental impact study that examined the relationship
between the depth of a stream and the rate of its flow (Ryan et al, 1976).
Depth x 0.34 0.29 0.28 0.42 0.29 0.41 0.76 0.73 0.46 0.40
Flow rate y 0.64 0.32 0.73 1.33 0.49 0.92 7.35 5.89 1.98 1.12
A bowed shape of the plot of the residuals versus depth suggests that the relation between x and y is not linear. The
multiple linear regression framework can be applied to the quadratic model

y = β0 + β1x + β2x²,

with x1 = x and x2 = x². The results of such an analysis are presented in the table:

105
Coefficient Estimate Standard Error t value
β0 1.68 1.06 1.52
β1 −10.86 4.52 −2.40
β2 23.54 4.27 5.51
The residuals show no sign of systematic misfit. The test statistic t = 5.51 of the utility test for the crucial null
hypothesis H0 : β2 = 0, is larger than the critical value t7 (0.0005) = 5.4079. We conclude that the quadratic term in
the model is statistically significant at the 0.1% significance level.
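The quadratic fit can be reproduced with the multiple regression machinery by building a design matrix with columns 1, x, x²; a sketch assuming Python with numpy (the printed values should be close to those in the table above):

import numpy as np

depth = np.array([0.34, 0.29, 0.28, 0.42, 0.29, 0.41, 0.76, 0.73, 0.46, 0.40])
flow  = np.array([0.64, 0.32, 0.73, 1.33, 0.49, 0.92, 7.35, 5.89, 1.98, 1.12])

n, p = len(depth), 3
X = np.column_stack([np.ones(n), depth, depth**2])   # predictors x1 = x and x2 = x^2
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ flow
s2 = np.sum((flow - X @ b)**2) / (n - p)
se = np.sqrt(s2 * np.diag(XtX_inv))
print(b)        # estimates, expected to be close to (1.68, -10.86, 23.54)
print(b / se)   # t-values, the last one close to 5.51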

Case study: catheter length


Doctors want to predict the distance to the pulmonary artery (y) depending on a child's height (h) and weight (w). The data
consist of n = 12 triplets (hi, wi, yi). We will compare three linear regressions

model 1: y = β0 + β1 h + σz, model 2: y = β0 + β1 w + σz, model 3: y = β0 + β1 h + β2 w + σz.

The analysis of these regression models is summarised in the next table


estimates model 1 t-value model 2 t-value model 3 t-value
b0 (sb0 ) 12.1 (4.3) 2.8 25.6 (2.0) 12.8 21 (8.8) 2.39
b1 (sb1 ) 0.60 (0.10) 6.0 0.28 (0.04) 7.0 0.20 (0.36) 0.56
b2 (sb2 ) – – – – 0.19 (0.17) 1.12
s 4.0 3.8 3.9.
In contrast to the two simple linear regression models, for the multiple regression model we cannot reject either
H1: β1 = 0 or H2: β2 = 0, because the t-values 0.56 and 1.12 are not significant according to the t-distribution table
with df = 9. This paradox is explained by the different meaning of the slope parameters in the simple and multiple
regression models. In the multiple model β1 is the expected change in y when h is increased by one unit and w is held
constant.
The comparison of the coefficient of determination R² for these three models, R² = 0.78, 0.80, 0.81 for models
1, 2, 3 respectively, leaves the false impression that the multiple regression model 3 is better than the simple linear
regression models 1 and 2. Turning to the adjusted coefficient of determination Ra² = 0.76, 0.78, 0.77, we conclude
that model 2 is the best among the three regressions. This is an example of the collinearity problem: height
and weight have a strong linear relationship, see the next figure; therefore, there is little or no gain from adding the
predictor h to the regression model 2 with the single predictor w.

10.5 Exercises
Problem 1
Suppose we are given a two-dimensional random sample

(x1 , y1 ), . . . , (xn , yn ).

Verify that the sample covariance is an unbiased estimate of the population covariance.

Problem 2
Draw by hand a scatter plot for ten pairs of measurements
x 0.34 1.38 -0.65 0.68 1.40 -0.88 -0.30 -1.18 0.50 -1.75
y 0.27 1.34 -0.53 0.35 1.28 -0.98 -0.72 -0.81 0.64 -1.59

106
(a) Fit a straight line y = a + bx by the method of least squares, and sketch it on the plot.
(b) Fit a straight line x = c + dy by the method of least squares, and sketch it on the plot.
(c) Why are the lines in (a) and (b) not the same?

Problem 3
Each student receives two consecutive grade point averages
x = the high school GPA and y = the freshman GPA.
Starting from two coupled models, for female students

y = β0′ + β1x + σz,   z ∼ N(0, 1),

and for male students

y = β0′′ + β1x + σz,   z ∼ N(0, 1),

suggest a joint multiple regression model involving an extra predictor f which equals 1 if the student is female and 0
otherwise. Give the form of the design matrix for such a model.

Problem 4
Check that P = X(X⊺ X)−1 X⊺ is a projection matrix such that P2 = P.

Problem 5
The sample (x1 , y1 ), . . . , (xn , yn ) was collected for n students who took a statistical course, with
x giving the midterm grade and y giving the final grade.
The data produced the following five summary statistics
r = 0.5, x̄ = ȳ = 75, sx = sy = 10.
(a) Given a student’s midterm score x = 95, predict the final score of the student.
(b) Given the final score y = 85 and not knowing the midterm score of a student, predict the student’s midterm
score.

Problem 6
Let X ∼ N(0, 1) and Z ∼ N(0, 1) be two independent random variables and consider a third one
Y = X + βZ.
(a) Show that the correlation coefficient for X and Y is

ρ = 1/√(1 + β²).

(b) Using the result of part (a) suggest a way of generating bivariate vectors (x, y) with a specified positive population
correlation coefficient.

Problem 7
The stopping distance of an automobile on a certain road was studied as a function of velocity (Brownee, 1960)
velocity of a car x (mi/h) 20.5 20.5 30.5 40.5 48.8 57.8
stopping distance y (ft) 15.4 13.3 33.9 73.1 113.0 142.6

Fit y and √y as linear functions of the velocity, and examine the residuals in each case. Which fit is better? Can you
suggest any physical reason that explains why?

Problem 8
An excerpt from Wikipedia:
"The American Psychological Association’s 1995 report "Intelligence: Knowns and Unknowns" stated
that the correlation between IQ and crime was −0.2. It was −0.19 between IQ scores and number of
juvenile offences in a large Danish sample; with social class controlled, the correlation dropped to −0.17.
A correlation of 0.20 means that the explained variance is less than 4%."
Explain how the 4% appearing in the last sentence was computed.

107
Problem 9
Verify that the test statistic for the model utility test can be expressed in terms of the sample correlation r as follows

t = b1/sb1 = r√(n − 2)/√(1 − r²).

Problem 10
The article "Effects of gamma radiation on juvenile and mature cuttings of quaking aspen" (Forest science, 1967)
reports the following data on exposure time to radiation x and dry weight of roots y:
x 0 2 4 6 8
y 110 123 119 86 62

The estimated quadratic regression function is y = 111.8857 + 8.0643x − 1.8393x2 .


(a) What is the underlying multiple regression model? Write down the corresponding design matrix.
(b) Compute the predicted responses. Find an unbiased estimate s2 of the noise variance σ 2 .
(c) Compute the coefficient of multiple determination.

Problem 11
A sports statistician studied the relation between the time (y in seconds) for a particular competitive swimming event
and the swimmer’s age (x in years) for 20 swimmers with age ranging from 8 to 18. She employed a quadratic regression
model and obtained the following result

ŷ = 147 − 11.11x + 0.2730x2 .

The standard error for the curvature effect coefficient was estimated as sb2 = 0.1157.

(a) Plot the estimated regression function. Would it be reasonable to use this regression function when the swimmer’s
age is 40?
(b) Construct a 99 percent confidence interval for the curvature effect coefficient. Interpret your interval estimate.
(c) Test whether or not the curvature effect can be dropped from the quadratic regression model, controlling the α
risk at 0.01. State the alternatives, the decision rule, the value of the test statistic, and the conclusion. What is
the p-value of the test?

Problem 12
Suppose that the grades of 10 students on the midterm and final exams have a correlation coefficient of 0.5 and both
exams have an average score of 75 and a standard deviation of 10.
(a) Sketch a scatterplot illustrating performance on two exams for this group of 10 students.
(b) If Carl’s score on the midterm is 90, what would you predict his score on the final to be? How uncertain is this
prediction?

(c) If Maria scored 80 on the final, what would you guess that her score on the midterm was?
(d) Exactly what assumptions do you make to make your calculations in (b) and (c)?

108
Chapter 11

Statistical distribution tables

11.1 Normal distribution table


If Z has a standard normal distribution N(0,1), then the following table gives the probabilities

Φ(z) = P(Z ≤ z).

For example, the number on the row 1.9 and column 0.06 gives Φ(1.96) = 0.9750.

109
11.2 Critical values of the t-distribution
If T has a tk -distribution with k degrees of freedom, then for a chosen α, the following table gives the value of tk (α)
such that P(T > tk (α)) = α. For example, if k = 5 and α = 0.05, then t5 (0.05) = 2.015.

110
11.3 Critical values of the chi-squared distribution
If X has a χ2k -distribution with k degrees of freedom, then for a chosen α, the following table gives the value of xk (α)
such that P(X > xk (α)) = α. For example, if k = 5 and α = 0.05, then x5 (0.05) = 11.070.

111
11.4 Critical values of the F-distribution
If F has a Fdf,k -distribution, then the tables below give the critical value Fdf,k (α) such that P(F > Fdf,k (α)) = α.
Different k are shown as columns and different α are shown as rows.
For example, if df = 2, k = 6, and α = 0.025, then F2,6 (0.025) = 7.2599.

F1,k -distribution table

F2,k -distribution table

F3,k -distribution table

F4,k -distribution table

112
F5,k -distribution table

F6,k -distribution table

F7,k -distribution table

F8,k -distribution table

F9,k -distribution table

113
Chapter 12

Solutions to exercises

Solutions to Section 1.7


Solution 1
We have
Var(Xi ) = E((Xi − µi )2 ) = E(Xi2 − 2µi Xi + µ2i ) = E(Xi2 ) − 2µi E(Xi ) + µ2i = E(Xi2 ) − µ2i ,
and similarly
Cov(X1, X2) = E((X1 − µ1)(X2 − µ2)) = E(X1X2 − µ1X2 − µ2X1 + µ1µ2) = E(X1X2) − µ1µ2.

Solution 2
With

Z1 = (X1 − µ1)/σ1,   Z2 = (X2 − µ2)/σ2,

we get

E(Zi) = E((Xi − µi)/σi) = (1/σi)E(Xi − µi) = (1/σi)(µi − µi) = 0,

and

Var(Zi) = E(((Xi − µi)/σi)²) = (1/σi²)E((Xi − µi)²) = σi²/σi² = 1.

By the definition of the correlation coefficient,

ρ = (1/(σ1σ2))E((X1 − µ1)(X2 − µ2)) = E(Z1Z2).

This number is a dimensionless quantity in the interval [−1, 1] by the Cauchy–Schwarz inequality, see Wikipedia. To
illustrate, let X1 and X2 be the height and the weight of a randomly chosen person. The standardised height Z1
and weight Z2 would not depend on whether the numbers (X1, X2) are given in (centimetres, kilograms) or (inches,
pounds).

Solution 3
Given (X1 , . . . , Xr ) ∼ Mn(n; p1 , . . . , pr ), the sum X = Xi + Xj can be treated as the number of outcomes among n
independent trials with r possible outcomes such that either i or j are observed. Since X is the number of successes
among n trials with the probability pi + pj of a success, we conclude that
Xi + Xj ∼ Bin(n, pi + pj ).

Solution 4
If X ∼ Gam(α, λ), then the probability density function of X is

f(x) = (1/Γ(α))·λ^α x^(α−1) e^(−λx),   x > 0.

The distribution function of the scaled random variable Y = λX is

P(Y ≤ x) = P(X ≤ λ⁻¹x) = F(λ⁻¹x).

Taking the derivatives over x on both sides, we find the probability density function of Y to be

λ⁻¹f(λ⁻¹x) = λ⁻¹·(1/Γ(α))·λ^α (λ⁻¹x)^(α−1) e^(−λλ⁻¹x) = (1/Γ(α))·x^(α−1) e^(−x),

indeed the density of the gamma distribution Gam(α, 1).

114
Solution 5

Var(X + Y ) = E((X + Y )2 ) − (E(X + Y ))2 = E((X + Y )2 ) − (µx + µy )2


= E(X 2 ) + E(Y 2 ) + 2E(XY ) − µ2x − µ2y − 2µx µy
= (E(X 2 ) − µ2x ) + (E(Y 2 ) − µ2y ) + 2(E(XY ) − µx µy )
= Var(X) + Var(Y ) + 2Cov(X, Y ).

Solution 6
Because the average event rate is 2.5 goals per match, the Poisson formula with µ = 2.5 gives

P(x goals in a match) = 2.5^x e^(−2.5)/x!,   x = 0, 1, ...

Therefore, we get

P(0 goals in a match) = 2.5⁰e^(−2.5)/0! = e^(−2.5) ≈ 0.082,
P(1 goal in a match) = 2.5¹e^(−2.5)/1! = 2.5e^(−2.5) ≈ 0.205,
P(2 goals in a match) = 2.5²e^(−2.5)/2! = 6.25e^(−2.5)/2 ≈ 0.257.

Solution 7
If XN has the hypergeometric distribution Hg(N, n, p), then

P(XN = x) = C(Np, x)·C(N(1 − p), n − x) / C(N, n).

As N → ∞, we have

C(N, n) ∼ N^n/n!,   C(Np, x) ∼ (Np)^x/x!,   C(N(1 − p), n − x) ∼ (N(1 − p))^(n−x)/(n − x)!,

and the binomial distribution approximation follows:

P(XN = x) → n!p^x(1 − p)^(n−x)/(x!(n − x)!) = C(n, x)·p^x(1 − p)^(n−x),   N → ∞.

Solution 8
We have

P(X > n) = 1 − p Σ_{x=0}^{n−1} (1 − p)^x = (1 − p)^n.

Solutions to Section 2.7


Solution 1
For the given population distribution
possible values x 1 2 4 8
probabilities P(X = x) 1/5 2/5 1/5 1/5

the population mean and variance are computed in three steps:

µ = 1·(1/5) + 2·(2/5) + 4·(1/5) + 8·(1/5) = 3.4,
E(X²) = 1·(1/5) + 4·(2/5) + 16·(1/5) + 64·(1/5) = 17.8,
σ² = 17.8 − µ² = 6.24.

The next table lists the possible values of the sample mean x̄ = (x1 + x2)/2 together with their probabilities given in
brackets, obtained by multiplication of the marginal probabilities.

115
x2 = 1 x2 = 2 x2 = 4 x2 = 8 marginal prob.
x1 = 1 1.0 (1/25) 1.5 (2/25) 2.5 (1/25) 4.5 (1/25) 1/5
x1 = 2 1.5 (2/25) 2.0 (4/25) 3.0 (2/25) 5.0 (2/25) 2/5
x1 = 4 2.5 (1/25) 3.0 (2/25) 4.0 (1/25) 6.0 (1/25) 1/5
x1 = 8 4.5 (1/25) 5.0 (2/25) 6.0 (1/25) 8.0 (1/25) 1/5
marginal prob. 1/5 2/5 1/5 1/5 total prob. = 1

This table yields the following sampling distribution of X̄:


possible values x̄ 1 1.5 2 2.5 3 4 4.5 5 6 8 total prob.
probabilities P(X̄ = x̄) 1/25 4/25 4/25 2/25 4/25 1/25 2/25 4/25 2/25 1/25 1

From this sampling distribution table, using the same three steps as for the computation of µ and σ², we find

E(X̄) = 1·(1/25) + 1.5·(4/25) + 2·(4/25) + 2.5·(2/25) + 3·(4/25) + 4·(1/25) + 4.5·(2/25) + 5·(4/25) + 6·(2/25) + 8·(1/25) = 3.4,
E(X̄²) = 1/25 + (1.5)²·(4/25) + 4·(4/25) + (2.5)²·(2/25) + 9·(4/25) + 16·(1/25) + (4.5)²·(2/25) + 25·(4/25) + 36·(2/25) + 64·(1/25) = 14.68,
Var(X̄) = 14.68 − (3.4)² = 3.12.

As a result, we see that indeed,

E(X̄) = 3.4 = µ,   Var(X̄) = 3.12 = σ²/n.

Solution 2
The question is about a simple random sample, with unspecified population size N . We assume that N is much larger
than the sample size n = 1500 and use the formulas for a random sample with independent observations.
For the given dichotomous data, we have
n = 1500,   p̂ = 0.55,   1 − p̂ = 0.45,   sp̂ = √(p̂(1 − p̂)/(n − 1)) = √(0.55 × 0.45/1499) = 0.013,

and want to estimate the population margin of victory

v = p − (1 − p) = 2p − 1.

Replacing here p by the sample proportion p̂, we estimate margin of victory by

v̂ = p̂ − (1 − p̂) = 2p̂ − 1 = 0.1.

(a) Since

Var(V̂) = Var(2P̂ − 1) = Var(2P̂) = 4Var(P̂),

the standard error of v̂ is twice the standard error of p̂:

√Var(V̂) = 2√Var(P̂).

Therefore, the estimated standard error of v̂ is

sv̂ = 2sp̂ = 2√(p̂(1 − p̂)/(n − 1)) = 0.026.
(b) An approximate 95% confidence interval for v is found using the usual formula based on the point estimate v̂
and its standard error sv̂ :
Iv ≈ v̂ ± 1.96 · sv̂ = 0.10 ± 0.05.
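A short numerical check of this solution (an illustrative Python sketch):

from math import sqrt

n, p_hat = 1500, 0.55
s_p = sqrt(p_hat*(1 - p_hat)/(n - 1))
v_hat = 2*p_hat - 1            # estimated margin of victory
s_v = 2*s_p                    # its standard error
print(v_hat, v_hat - 1.96*s_v, v_hat + 1.96*s_v)   # approximately 0.10 +/- 0.05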

Solution 3
By the central limit theorem the t-score (X̄ − µ)/SX̄ is asymptotically N(0, 1)-distributed. Therefore, referring to the normal
distribution table we get

0.90 ≈ P((X̄ − µ)/SX̄ > −1.28) = P(−∞ < µ < X̄ + 1.28SX̄),
0.95 ≈ P((X̄ − µ)/SX̄ < 1.645) = P(X̄ − 1.645SX̄ < µ < ∞),

implying

k1 = 1.28,   k2 = 1.645.

116
Solution 4
It is enough to verify that for any random variable X and constant θ,

E((X − θ)2 ) = Var(X) + (µ − θ)2 .

where µ is the expected value of X. This is done by a simple transformation on the left hand side and opening the
brackets

E((X − θ)2 ) = E((X − µ + µ − θ)2 ) = Var(X) + 2E((X − µ)(µ − θ)) + (µ − θ)2 = Var(X) + (µ − θ)2 .

Solution 5
Based on the data summary
X X
N = 2000, n = 25, xi = 2351, x2i = 231305,

we apply the formulas from the Chapter 2.3.

(a) Unbiased estimate of µ is

x̄ = 2351/25 = 94.04.

(b) Sample variance

s² = (n/(n − 1))(x2 − x̄²) = (25/24)(231305/25 − (94.04)²) = 425.71.

Unbiased estimate of σ² is

((N − 1)/N)·s² = (1999/2000)·425.71 = 425.49.

Unbiased estimate of Var(X̄) is

sx̄² = (s²/n)(1 − n/N) = 16.81.

(c) An approximate 95% confidence interval for µ:

Iµ = x̄ ± 1.96sx̄ = 94.04 ± 1.96√16.81 = 94.04 ± 8.04.

Solution 6
The bias size is computed as follows:

E(X̄²) − µ² = E(X̄²) − (EX̄)² = Var(X̄) = (σ²/n)(1 − (n − 1)/(N − 1)).

For large sample sizes n, the bias is small.

Solution 7
The problem deals with a stratified population of size N = 2010 having k = 7 strata.

(a) With n = 100, we get the following answers using the relevant formulas
Stratum number j 1 2 3 4 5 6 7 Weighted mean
Stratum proportion wj 0.196 0.229 0.195 0.166 0.084 0.056 0.074
Stratum mean µj 5.4 16.3 24.3 34.5 42.1 50.1 63.8 µ = 26.311
Stratum standard deviation σj 8.3 13.3 15.1 19.8 24.5 26.0 35.2 σ̄ = 17.018
Optimal allocation n·wjσj/σ̄ 10 18 17 19 12 9 15
Proportional allocation nwj 20 23 19 17 8 6 7

(b) Since σ̄² = 289.62 and Σj wjσj² = 343.28, we have

Var(X̄so) = σ̄²/n = 2.896,   Var(X̄sp) = Σj wjσj²/n = 3.433,   Var(X̄) = σ²/n = 6.20,

where σ² is computed in the next item.

(c) We have µ = 26.311, and

Σ_{j=1}^k wj(µj − µ)² = 276.889.

Therefore

σ² = 343.28 + 276.89 = 620.17,   σ = 24.90.
(d) If n1 = ... = n7 = 10 and n = 70, then

Var(X̄s) = w1²σ1²/n1 + ... + wk²σk²/nk = 4.45.

The requested sample size x is found from the equation

Var(X̄) = σ²/x = 620.17/x = 4.45,   x = 620.17/4.45 = 139.364.

After rounding up, we find the answer x = 140, which is twice the size n = 70 of the stratified sample.

(e) The proportional allocation of the sample size n = 70 gives the variance of the stratified sample mean
Var(X̄sp) = 4.90. Notice that it is larger than the Var(X̄s) = 4.45 of item (d). Solving the equation

Var(X̄) = σ²/x = 620.17/x = 4.90,   x = 620.17/4.90 = 126.57,

we find that the corresponding random sample size is x = 127.

Solution 9
We are dealing with a stratified population having
N = 5, k = 2, w1 = 0.6, w2 = 0.4, µ1 = 1.67, µ2 = 6, σ12 = 0.21, σ22 = 4.
Given n1 = n2 = 1 and n = 2, the stratified sample mean x̄s = 0.6x1 + 0.4x2 can take four possible values listed in
the table below (with probabilities in the brackets).
x1 = 1 x1 = 2 marginal prob.
x2 = 4 2.2 (1/6) 2.8 (2/6) 1/2
x2 = 8 3.8 (1/6) 4.4 (2/6) 1/2
marginal prob. 1/3 2/3 1
Using this table we find that

E(X̄s) = 2.2·(1/6) + 2.8·(2/6) + 3.8·(1/6) + 4.4·(2/6) = 3.4,
(E(X̄s))² = 11.56,
E(X̄s²) = (2.2)²·(1/6) + (2.8)²·(2/6) + (3.8)²·(1/6) + (4.4)²·(2/6) = 12.28,
Var(X̄s) = 12.28 − 11.56 = 0.72.

These results are in agreement with the formulas of Section 2.5:

E(X̄s) = µ,   Var(X̄s) = w1²σ1²/n1 + ... + wk²σk²/nk = 0.36σ1² + 0.16σ2².

Solution 10
Data: a random sample of size n = 16 taken from a normal distribution.

(a) The summary statistics computed from the data

x̄ = 3.6109,   s² = 3.4181,   sx̄ = 0.4622

suggest an estimate for µ to be 3.6109, and an estimate for σ² to be 3.4181.

(b), (c) The following exact confidence intervals are computed using the formulas of Section 2.6
90% 95% 99%
Iµ 3.61 ± 0.81 3.61 ± 0.98 3.61 ± 1.36
Iσ 2 (2.05; 7.06) (1.87; 8.19) (1.56; 11.15)
Iσ (1.43; 2.66) (1.37; 2.86) (1.25; 3.34)
(d) To find the sample size x that halves the confidence interval length, we set up an equation using the exact
confidence interval formula for the mean

t15(α/2)·s/√16 = 2·tx−1(α/2)·s′/√x,

where s′ is the sample standard deviation for the sample of size x. A simplified version of this equation, 1/4 = 2/√x,
implies x ≈ (2·4)² = 64. Further adjustment for a 95% confidence interval is obtained using

t15(α/2) = 2.13,   tx−1(α/2) ≈ 2,

yielding x ≈ (2·4·2/2.13)² = 56.4. We conclude that going from a sample of size 16 to a sample of size 56 would halve
the length of the confidence interval for µ.

118
Solutions to Section 3.4
Solution 1
A method of moments estimate of the parameter µ for the Poisson distribution model is given by the sample mean
µ̃ = 3.9. Using this value we compute the expected counts, see the table below. Comparing the observed and expected counts
by eye, we see that the Poisson model does not fit well. The sample variance is close to 5, which shows that
there is over-dispersion (the estimated variance is larger than the estimated µ).
This extra variation in the data can be explained by the fact that the 300 intervals were distributed over various
hours of the day and various days of the week.
x observed counts expected counts
0 14 6.1
1 30 23.8
2 36 46.3
3 68 60.1
4 43 58.5
5 43 45.6
6 30 29.6
7 14 16.4
8 10 8.0
9 6 3.5
10 4 1.3
11 1 0.5
12 1 0.2
13+ 0 0.1
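The expected counts in the table can be recomputed with a few lines of Python (a sketch assuming scipy; the last cell collects the tail probability for 13 or more):

from scipy import stats

n, mu = 300, 3.9                                    # sample size and the method of moments estimate
observed = [14, 30, 36, 68, 43, 43, 30, 14, 10, 6, 4, 1, 1]
expected = [n * stats.poisson.pmf(x, mu) for x in range(13)]
expected.append(n * stats.poisson.sf(12, mu))       # the "13+" cell
for x, (o, e) in enumerate(zip(observed + [0], expected)):
    print(x, o, round(e, 1))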

Solution 2
The likelihood function has the form

L(p) = ∏_{i=1}^n (1 − p)^(xi−1) p = (1 − p)^(t−n) p^n,

where

t = x1 + ... + xn

is a sufficient statistic. The log-likelihood function

l(p) = (t − n) ln(1 − p) + n ln p

has the derivative

l′(p) = −(t − n)/(1 − p) + n/p.

Thus the maximum likelihood estimate satisfies

n/p̂ = (t − n)/(1 − p̂),
yielding p̂ = n/t = 1/x̄. This is a consistent estimate due to the law of large numbers.

Solution 3
The population distribution is discrete: the random variable X takes the values x = 0, 1, 2, 3 with probabilities

p0 = (2/3)·θ,   p1 = (1/3)·θ,   p2 = (2/3)·(1 − θ),   p3 = (1/3)·(1 − θ),

so that

p0 + p1 = θ,   p2 + p3 = 1 − θ.
We are given a random sample with
n = 10, x̄ = 1.5, s = 1.08,
and observed counts
x 0 1 2 3 Total
cx 2 3 3 2 10

119
(a) Method of moments. The expression for the first population moment

µ = 1·(1/3)θ + 2·(2/3)(1 − θ) + 3·(1/3)(1 − θ) = 7/3 − 2θ

leads to the equation

x̄ = 7/3 − 2θ̃.

It gives an unbiased estimate

θ̃ = 7/6 − x̄/2 = 7/6 − 3/4 = 0.417.

(b) To find sθ̃, observe that

Var(Θ̃) = (1/4)Var(X̄) = σ²/40.

Thus we need to find sθ̃, which estimates σθ̃ = σ/6.325. Next we estimate σ using two methods.

Method 1. From

σ² = E(X²) − µ² = (1/3)θ + 4·(2/3)(1 − θ) + 9·(1/3)(1 − θ) − (7/3 − 2θ)² = 2/9 + 4θ − 4θ²,

we can estimate σ after replacing θ with θ̃:

σ̃ = √(2/9 + 4θ̃ − 4θ̃²) = 1.093.

This gives

sθ̃ = 1.093/6.325 = 0.173.

Method 2. Using the sample standard deviation s we get

sθ̃ = s/6.325 = 1.08/6.325 = 0.171.

(c) Using the multinomial distribution (C0, C1, C2, C3) ∼ Mn(n, p0, p1, p2, p3) we obtain the likelihood function to be

L(θ) = ((2/3)θ)^c0 ((1/3)θ)^c1 ((2/3)(1 − θ))^c2 ((1/3)(1 − θ))^c3 = const · θ^t (1 − θ)^(n−t),

where t = c0 + c1 is a sufficient statistic. Notice that the underlying random variable T = C0 + C1 has the binomial
distribution Bin(n, θ). The log-likelihood function and its derivative take the form

l(θ) = const + t ln θ + (n − t) ln(1 − θ),
l′(θ) = t/θ − (n − t)/(1 − θ).

Setting the last expression to zero, we find

t/θ̂ = (n − t)/(1 − θ̂)   ⇒   θ̂ = t/n = (2 + 3)/10 = 1/2.

Thus the maximum likelihood estimate is the sample proportion, which is an unbiased estimate of the population
proportion θ.

(d) We find sθ̂ using the formula for the standard error of the sample proportion

sθ̂ = √(θ̂(1 − θ̂)/(n − 1)) = 0.167.

Solution 4
For a given n and X = x the likelihood function is

L(p) = C(n, x) p^x (1 − p)^(n−x) ∝ p^x (1 − p)^(n−x).

(a) We find p̂ maximising L(p) by maximising

ln(p^x (1 − p)^(n−x)) = x ln p + (n − x) ln(1 − p).

Since

∂/∂p (x ln p + (n − x) ln(1 − p)) = x/p − (n − x)/(1 − p),

we have to solve the familiar equation x/p = (n − x)/(1 − p),
which brings the maximum likelihood estimate formula p̂ = x/n.

(b) The graph of the likelihood function L(p) = 252p5 (1 − p)5 is symmetric around the middle point p̂ = 0.5.

120
Solution 5
The observed serial number x = 888 can be modelled by the discrete uniform distribution

P(X = x) = N −1 , x = 1, . . . , N.

(a) Since for the uniform distribution

µ = (N + 1)/2,

and the sample mean is x̄ = x = 888, the method of moments estimate of N is obtained from the equation

888 = (Ñ + 1)/2.

It gives Ñ = 2x̄ − 1 = 1775. This is an unbiased estimate of N = 2µ − 1 since x̄ is an unbiased estimate of µ.

(b) With x = 888, the likelihood function

L(N) = P(X = x) = 1{1 ≤ x ≤ N}/N = 1{N ≥ 888}/N

reaches its maximum at N̂ = 888. We see that in this case the maximum likelihood estimate is severely biased.

Solution 6
We will treat x as the number of black balls obtained by sampling k = 50 balls with replacement from an urn with N
balls of which n = 100 balls are black. Then x = 20 is a realisation of the binomial distribution Bin(50, 100
N ). Thus
the likelihod function is
(N − 100)30
 
50 100 20 N −100 30
L(N ) = P(X = 20) = ( N ) ( N ) = const · .
20 N 50
The log-likelihood function takes the form

l(N ) = const + 30 ln(N − 100) − 50 ln N.

Take the derivative and set it equal to 0:


30 50
= .
N − 100 N
Solving this equation we obtain the maximum likelihood estimate N̂ = 250.

Solution 7
(a) Given Σ_{i=1}^n xi = 58 and Σ_{i=1}^n xi² = 260, the likelihood function is

L(µ, σ²) = ∏_{i=1}^n (1/(√(2π)σ)) e^(−(xi−µ)²/(2σ²)) = (1/((2π)^(n/2)σ^n)) e^(−(Σxi² − 2µΣxi + nµ²)/(2σ²)) = (1/((2π)⁸σ¹⁶)) e^(−(260 − 116µ + 16µ²)/(2σ²)).

(b) It is sufficient to know the values of Σ_{i=1}^n xi and Σ_{i=1}^n xi² to compute the likelihood function.

(c) Turning to the log-likelihood function

l(µ, σ²) := ln L(µ, σ²) = −(n/2) ln(2π) − n ln σ − (Σxi² − 2µΣxi + nµ²)/(2σ²),

take two derivatives and put them equal to zero:

(−2Σ_{i=1}^n xi + 2nµ)/(2σ²) = 0,
−n/σ + (Σxi² − 2µΣxi + nµ²)/σ³ = 0.

The solution of this pair of equations gives us the maximum likelihood estimates

µ̂ = x̄ = 3.63,   σ̂² = n⁻¹Σ_{i=1}^n xi² − x̄² = 3.11.

Since

E(n⁻¹Σ_{i=1}^n Xi² − X̄²) = σ² − σ²/n = ((n − 1)/n)σ²,

σ̂² is a biased estimate of σ².

121
Solution 8
Consider a random sample (x1, ..., xn) drawn from the uniform distribution U(0, θ) with density

f(x|θ) = (1/θ)·1{0 ≤ x ≤ θ}.

The population mean and variance are given by the formulas

µ = θ/2,   σ² = θ²/12.

(a) The method of moments estimate θ̃ = 2x̄ is unbiased, and its sampling distribution has the following mean and
variance:

E(Θ̃) = θ,   Var(Θ̃) = 4σ²/n = θ²/(3n).

(b) In terms of x(n) = max(x1, ..., xn), the likelihood function takes the form

L(θ) = f(x1|θ) ··· f(xn|θ) = θ^(−n)·1{θ ≥ x1} ··· 1{θ ≥ xn} = θ^(−n)·1{θ ≥ x(n)},

so that x(n) is a sufficient statistic. The maximum of this function is achieved at θ̂ = x(n).

(c) The sampling distribution of the maximum likelihood estimate θ̂ = x(n) is computed as

P(X(n) ≤ x) = P(X1 ≤ x, ..., Xn ≤ x) = P(X1 ≤ x) ··· P(Xn ≤ x) = (x/θ)^n,

where we used independence between different Xi. Taking the derivative, we find the probability density function to be

fΘ̂(x) = n·x^(n−1)/θ^n,   0 ≤ x ≤ θ.

Using this density function we can compute the mean and variance of Θ̂:

E(Θ̂) = (n/θ^n)∫₀^θ x^n dx = nθ/(n + 1),   E(Θ̂²) = nθ²/(n + 2),   Var(Θ̂) = nθ²/((n + 1)²(n + 2)).

The maximum likelihood estimate is biased, but asymptotically unbiased. Notice the unusual asymptotics

Var(Θ̂) ≈ θ²/n²,   n → ∞,

indicating that the conditions on the parametric model implying asymptotic normality of the maximum likelihood estimate are violated.
Comparing the two mean square errors:

MSE(Θ̂) = E(Θ̂ − θ)² = (θ/(n + 1))² + nθ²/((n + 1)²(n + 2)) = 2θ²/((n + 1)(n + 2)),
MSE(Θ̃) = θ²/(3n),

we conclude that for sufficiently large n, the maximum likelihood estimate θ̂ is more efficient than the method of
moments estimate θ̃.

(d) The corrected maximum likelihood estimate

θ̂c = ((n + 1)/n)·x(n)

becomes unbiased, E(Θ̂c) = θ, with Var(Θ̂c) = θ²/(n(n + 2)).

Solution 9
The data in hand is summarised by four observed counts

c1 = 1997, c2 = 906, c3 = 904, c4 = 32

obtained from a random sample of size n = 3839 drawn from the discrete distribution

p1 = P(X = 1) = (2 + θ)/4,   p2 = P(X = 2) = (1 − θ)/4,   p3 = P(X = 3) = (1 − θ)/4,   p4 = P(X = 4) = θ/4.

Using the multinomial distribution (C1, C2, C3, C4) ∼ Mn(n, p1, p2, p3, p4) for the given realisation

(c1, c2, c3, c4) with c1 + c2 + c3 + c4 = n,

with n = 3839, we compute the likelihood function as

L(θ) = (n choose c1, c2, c3, c4) p1^c1 p2^c2 p3^c3 p4^c4 ∝ (2 + θ)^c1 (1 − θ)^(c2+c3) θ^c4 4^(−n) ∝ (2 + θ)^c1 θ^c4 (1 − θ)^(n−c1−c4),

where ∝ means that we drop the factors depending only on (n, c1, c2, c3, c4) and not involving θ. The last expression
reveals that we have a case of two sufficient statistics (c1, c4). Putting

(d/dθ) ln L(θ) = c1/(2 + θ) + c4/θ − (n − c1 − c4)/(1 − θ)

equal to zero, we arrive at the equation

c1/(2 + θ) + c4/θ = (n − c1 − c4)/(1 − θ),

or equivalently

θ²n + θu − 2c4 = 0,

where

u = 2c2 + 2c3 + c4 − c1 = 2n − c4 − 3c1.

We find the maximum likelihood estimate to be

θ̂ = (−u + √(u² + 8nc4))/(2n) = 0.0357.
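A quick numerical check of the quadratic-equation solution (an illustrative Python sketch):

from math import sqrt

n = 3839
c1, c2, c3, c4 = 1997, 906, 904, 32
u = 2*c2 + 2*c3 + c4 - c1              # equals 2n - c4 - 3*c1
theta_hat = (-u + sqrt(u**2 + 8*n*c4)) / (2*n)
print(round(theta_hat, 4))             # 0.0357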

Solution 10
By the law of total probability the probability of the "yes" answer to the randomly generated question equals

p = P(a "yes" answer) = 1/6 + (4/6)·q = (1 + 4q)/6,

where 1/6 is the probability of the number 1 on the top face of the die, implying the answer "yes", 4/6 is the probability
of the numbers 2, 3, 4, or 5, and q is the probability of the honest answer "yes". For a random sample of size n, put

X = the number of "yes" responses.

Under the independence assumptions, we have X ∼ Bin(n, p). Given X = x, we estimate the unknown parameter p
by the sample proportion p̂ = x/n. The underlying random variable P̂ = X/n has the mean value p and the variance

Var(P̂) = Var(X)/n² = p(1 − p)/n.

After computing the sample proportion p̂, we may apply the method of moments by turning to the linear equation

p̂ = (1 + 4q̃)/6.

Its solution gives us the following estimate of the population proportion q:

q̃ = (6p̂ − 1)/4.

This estimate is unbiased because p̂ is an unbiased estimate of p:

E(Q̃) = (6p − 1)/4 = q.

The variance of the sampling distribution equals

Var(Q̃) = (9/4)·Var(P̂) = (9/4)·p(1 − p)/n = (1 + 4q)(5 − 4q)/(16n).

To illustrate, take n = 100 and x = 20, so that the sample proportion of "yes" answers is p̂ = 0.2. In this case,

q̃ = (6p̂ − 1)/4 = 0.05,

implying that the proportion of interest q is around 5%. The corresponding estimated standard error is 6%:

sq̃ = √((1 + 4q̃)(5 − 4q̃)/(16n)) = 0.06,

indicating that it is desirable to increase the sample size if we want a more precise estimate of q.

123
Solutions to Section 4.6
Solution 1
The z-score

Z = (X − 100p)/(10√(p(1 − p)))

has a distribution that is approximated by N(0, 1).

(a) Under H0 we get Z = Z0, where

Z0 = (X − 100p0)/(10√(p0(1 − p0))) = (X − 50)/5,

implying that the significance level in question is (using a continuity correction)

α = P(|X − 50| > 10 | H0) = P(|X − 50| ≥ 11 | H0) ≈ P(|Z0| > 10.5/5 | H0) ≈ 2(1 − Φ(2.1)) = 2 · 0.018 = 0.036.

(b) The power of the test is a function of the population proportion p (here for simplicity computed without
continuity correction):

Pw(p) = P(|X − 50| > 10) = P(X < 40) + P(X > 60)
      = P(Z < (40 − 100p)/(10√(p(1 − p)))) + P(Z > (60 − 100p)/(10√(p(1 − p))))
      ≈ Φ((4 − 10p)/√(p(1 − p))) + Φ((10p − 6)/√(p(1 − p))).

Putting δ = 1/2 − p, we see that the power function

Pw(p) = Φ((10δ − 1)/√(1/4 − δ²)) + Φ(−(10δ + 1)/√(1/4 − δ²)) = Φ((10|δ| − 1)/√(1/4 − δ²)) + Φ(−(10|δ| + 1)/√(1/4 − δ²))

is symmetric around p = 1/2:


p 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
Pw(p) 0.986 0.853 0.500 0.159 0.046 0.159 0.500 0.853 0.986
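The power values in the table can be recomputed from the normal approximation above; a short Python sketch assuming scipy:

from math import sqrt
from scipy.stats import norm

def power(p):
    """Approximate power of the test that rejects when |X - 50| > 10, X ~ Bin(100, p)."""
    s = sqrt(p*(1 - p))
    return norm.cdf((4 - 10*p)/s) + norm.cdf((10*p - 6)/s)

for p in [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]:
    print(p, round(power(p), 3))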

Solution 2
(a) The two composite nested hypotheses have the form

H0 : µ ≤ µ0 , H : −∞ < µ < ∞.

(b) The likelihood function takes the form

L(µ) = ∏_{i=1}^n (1/√(2π)) e^(−(xi−µ)²/2) ∝ exp{−Σ_{i=1}^n (xi − µ)²/2} ∝ exp{(2nx̄µ − nµ²)/2} ∝ exp{−(n/2)(µ − x̄)²}.

Observe that in the last step, the term exp{−nx̄²/2} is treated as a constant of proportionality, as it does not involve
the parameter of interest µ and will not influence the maximum likelihood estimates.

(c) The maximum likelihood estimate under H is µ̂ = x̄. The maximum likelihood estimate under H0 is µ̂0 = x̄ if
x̄ < µ0, and µ̂0 = µ0 if x̄ ≥ µ0. Therefore, the likelihood-ratio equals

w = L(µ̂0)/L(µ̂) = exp{−(n/2)(µ̂0 − x̄)²}/exp{−(n/2)(µ̂ − x̄)²} = exp{−(n/2)(µ̂0 − x̄)²},

which equals 1 if x̄ < µ0, and e^(−(n/2)(x̄−µ0)²) if x̄ ≥ µ0.

(d) The likelihood-ratio test rejects H0 for small values of w or equivalently, for large positive values of (x̄ − µ0).
In particular, the 5% rejection region with n = 25 takes the form

R = {x̄ − µ0 ≥ 1.645/5} = {x̄ ≥ µ0 + 0.33}.

Here we used X̄ ∼ N(µ0, 1/5) as the null distribution, since the other eligible choices of µ ∈ H0 result in a smaller area
under the curve to the right of the critical value.

124
Solution 3
In this case the likelihood function takes the form

L(µ) = ∏_{i=1}^n (1/xi!) µ^(xi) e^(−µ) = e^(−µn) µ^y ∏_{i=1}^n (1/xi!),

where

y = x1 + ... + xn
is a sufficient statistic.
Case 1: two simple hypotheses
H0 : µ = µ0 , H1 : µ = µ1 .
Reject H0 for small values of the likelihood-ratio

L(µ0)/L(µ1) = (µ0/µ1)^y e^(−n(µ0−µ1)).
If µ1 > µ0 , then we reject H0 for large values of y. If µ1 < µ0 , then we reject H0 for small values of y. Test statistic
Y has null distribution Pois(nµ0 ).
Case 2: two-sided alternative hypothesis

H0 : µ = µ0 , H1 : µ ̸= µ0 .

Reject H0 for small values of the generalised likelihood ratio

L(µ0)/L(µ̂) = (µ0/µ̂)^y e^(−n(µ0−µ̂)),   µ̂ = y/n.

Reject H0 for large values of |y − nµ0|. The test statistic Y has the null distribution Pois(nµ0).

Solution 4
We have a random sample of size n = 25 drawn from the population distribution N(µ, 10) and would like to test two
simple hypotheses
H0 : µ = 0, H1 : µ = 1.5
An appropriate test statistic and its exact sampling distribution are

X̄ ∼ N(µ, 2),

where the standard deviation 2 is obtained as σ/√n = 10/√25 = 2. Its null and alternative distributions are

X̄ ∼ N(0, 2) under H0,   X̄ ∼ N(1.5, 2) under H1.

(a) The rejection region at α = 0.1 is {x̄ > x}, where x is the solution of the equation

0.1 = P(X̄ > x|H0 ) = 1 − P(X̄/2 ≤ x/2|H0 ) = 1 − Φ(x/2).

From the normal distribution table we find x/2 = 1.28, so that x = 2.56 and the rejection region is

R = {x̄ > 2.56}.

The corresponding confidence interval method is based on the one-sided 90% confidence interval for the mean

Iµ = (x̄ − 2.56, ∞).

We reject H0 if the interval does not cover µ0 = 0, that is when x̄ − 2.56 > 0.

(b) The power of the test (a) is

P(X̄ > 2.56 | H1) = P((X̄ − 1.5)/2 > 0.53 | H1) = 1 − Φ(0.53) = 1 − 0.7019 = 0.298.

(c) For α = 0.01, since 1 − Φ(2.33) = 0.01, the rejection region is

R = {x̄ > 4.66}.

The power of this test is

P(X̄ > 4.66 | H1) = P((X̄ − 1.5)/2 > 1.58 | H1) = 1 − Φ(1.58) = 1 − 0.9429 = 0.057.

125
Solution 5
We have a pair of alternative distribution densities (which are two beta-distribution densities)

f (x|H0 ) = 2x, f (x|H1 ) = 3x2 , 0 ≤ x ≤ 1.

(a) The likelihood-ratio as a function of the data value x is

f(x|H0)/f(x|H1) = 2/(3x),   0 ≤ x ≤ 1.

The corresponding likelihood-ratio test of H0 versus H1 rejects H0 for large values of x.


(b) The rejection region of a level α test is computed from the equation

P(X > xcrit |H0 ) = α,

that is
1 − x2crit = α.
We conclude that √
R = {x : x > 1 − α}.
(c) The power of the test is given by

P(X > 1 − α|H1 ) = 1 − (1 − α)3/2 .

Solution 6
Using the confidence interval method of hypotheses testing we reject H0 in favour of the two-sided alternative, since
the value µ = −3 is not covered by the two-sided confidence interval (−2, 3).

Solution 7
Under the normality assumption the random variable 14·S²/σ² has the χ²14-distribution, implying

P(14·S²/σ² ≤ 6.571) = 0.05.

Under H0: σ = 1, this entails the following one-sided rejection region

R = {s² ≤ 6.571/14} = {s ≤ 0.685}.

Given s = 0.7, we do not reject H0 : σ = 1 in favor of H1 : σ < 1 at α = 0.05.

Solution 8
The following analysis is the basis of the sign test.

(a) The likelihood-ratio is computed as

w = L(p0)/L(p̂) = [C(n, x) p0^x (1 − p0)^(n−x)] / [C(n, x) p̂^x (1 − p̂)^(n−x)] = (1/2)^n / ((x/n)^x ((n − x)/n)^(n−x)) = (n/2)^n / (x^x (n − x)^(n−x)).

(b) The likelihood-ratio test rejects H0 for small values of

ln w = n ln(n/2) − x ln x − (n − x) ln(n − x),

or equivalently, for large values of


x ln x + (n − x) ln(n − x),
or equivalently, for large values of

a(y) = (n/2 + y) ln(n/2 + y) + (n/2 − y) ln(n/2 − y),

where
y = |x − n/2|.
It remains to observe that the function a(y) is monotone over y ∈ [0, n/2], since

a′(y) = ln((n/2 + y)/(n/2 − y)) > 0,

and therefore the test rejects the null hypothesis for the large values of y.

(c) The significance level for the rejection region |x − n/2| > k is computed by

α = P(|X − n/2| > k | H0) = 2 Σ_{i < n/2 − k} C(n, i) 2^(−n).

(d) In particular, for n = 10 and k = 2 we get

α = 2^(−9) Σ_{i=0}^{2} C(10, i) = (1 + 10 + 45)/512 = 0.11.

(e) Using the normal approximation for n = 100 and k = 10, we find

α = P(|X − n/2| > k | H0) ≈ 2(1 − Φ(k/√(n/4))) = 2(1 − Φ(2)) = 0.046.

Solution 9
(a) The two-sided p-value = 0.134.

(b) The one-sided p-value = 0.067.

Solution 10
We are supposed to test
H0 : death cannot be postponed,
H1 : death can be postponed until after an important date.
(a) Jewish data: n = 1919 death dates
x = 922 deaths during the week before Passover,
n − x = 997 deaths during the week after Passover.
Under the binomial model X ∼ Bin(n, p), we would like to test
H0 : p = 0.5 against H1 : p < 0.5.
We apply the large-sample test for the proportion. Since the observed test statistic value is

z0 = (x − np0)/√(np0(1 − p0)) = (922 − 1919 · 0.5)/(√1919 · 0.5) = −1.712,

the corresponding one-sided p-value of the test equals

Φ(−1.712) = 1 − Φ(1.712) = 1 − 0.9564 = 0.044.
Conclusion: we reject H0 in favour of the one-sided H1 at the significance level 5%.

(b) To control for the seasonal effect the Chinese and Japanese data were studied
n = 852, x = 418, n − x = 434, z = −0.548.
The one-sided p-value for this data is 29%, showing no significant effect.

(c) Overeating during the important occasion might be a contributing factor.

Solution 11
We apply the multinomial model
(C1 , C2 , C3 ) ∼ Mn(190, p1 , p2 , p3 )
for testing the composite null hypothesis of the Hardy-Weinberg equilibrium:
H0: p1 = (1 − θ)²,   p2 = 2θ(1 − θ),   p3 = θ².

It yields the following likelihood function and the maximum likelihood estimate:

L(θ) = (190 choose 10, 68, 112) · 2⁶⁸ θ²⁹² (1 − θ)⁸⁸,   θ̂ = 292/380 = 0.768.
The Pearson chi-squared test based on the table

127
x 1 2 3 Total
observed counts cx 10 68 112 190
expected counts Ex 10.23 67.71 112.07 190

results in the chi-squared test statistic x² = 0.0065. With df = 1, we find the p-value to be 2(1 − Φ(√0.0065)) = 0.94.

Conclusion: the Hardy-Weinberg equilibrium model fits well to the haptoglobin data.

Solution 12
month cj number of days Ej cj − E j
Jan 1867 31 1994 −127
Feb 1789 28 1801 −12
Mar 1944 31 1994 −50
Apr 2094 30 1930 164
May 2097 31 1994 103
Jun 1981 30 1930 51
Jul 1887 31 1994 -107
Aug 2024 31 1994 30
Sep 1928 30 1930 -2
Oct 2032 31 1994 38
Nov 1978 30 1930 48
Dec 1859 31 1994 -135
Here we deal with the simple null hypothesis

H0: p1 = p3 = p5 = p7 = p8 = p10 = p12 = 31/365,   p2 = 28/365,   p4 = p6 = p9 = p11 = 30/365.

The total number of suicides is n = 23480, so that the expected counts presented in the table are computed as

Ej = n·pj(0),   j = 1, ..., 12.

Using the table values we find the chi-squared test statistic to be

x² = Σj (cj − Ej)²/Ej = 47.4.

Since df = 12 − 1 = 11, and χ211 (0.005) = 26.8, we reject H0 of no seasonal variation. Merry Christmas!
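A short Python sketch (illustrative, assuming scipy) that recomputes the expected counts and the chi-squared statistic:

from scipy.stats import chi2

observed = [1867, 1789, 1944, 2094, 2097, 1981, 1887, 2024, 1928, 2032, 1978, 1859]
days     = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
n = sum(observed)                                  # 23480
expected = [n * d / 365 for d in days]
x2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
p_value = chi2.sf(x2, df=11)
print(round(x2, 1), p_value)                       # x2 close to 47.4, p-value far below 0.005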

Solution 13
We model the number of heads by
Y ∼ Bin(n, p), n = 17950.
(a) For H0: p = 0.5 the observed z-score is

z0 = (y − np0)/√(np0(1 − p0)) = 3.46.

According to the three-sigma rule this is a significant result and we reject H0 .

(b) Pearson’s chi-squared test for the simple null hypothesis


H0 : p0 = (0.5)5 = 0.031, p1 = 5 · (0.5)5 = 0.156, p2 = 10 · (0.5)5 = 0.313,
p3 = 10 · (0.5)5 = 0.313, p4 = 5 · (0.5)5 = 0.156, p5 = (0.5)5 = 0.031,
is applied using the table
number of heads 0 1 2 3 4 5 Total
observed counts 100 524 1080 1126 655 105 3590
expected counts 112.2 560.9 1121.9 1121.9 560.9 112.2 3590
The chi-squared test statistic takes the value x² = 21.58. With df = 5, it gives the p-value = 0.001. We reject this
H0.

(c) For the composite null hypothesis

H0: px = C(5, x) p^x (1 − p)^(5−x),   x = 0, 1, 2, 3, 4, 5,

we apply Pearson’s chi-squared test based on the maximum likelihood estimate p̂ = 0.5129. From the table

128
number of heads 0 1 2 3 4 5 Total
observed counts 100 524 1080 1126 655 105 3590
expected counts 98.4 518.3 1091.5 1149.3 605.1 127.4 3590
we get x2 = 8.74. Here df = 6 − 1 − 1 = 4, yielding p-value = 0.07. We do not reject this H0 at the 5% significance
level.

Solutions to Section 5.5


Solution 1
Since
f (x|θ) ∝ θ5 (1 − θ)5 ,
and the prior is flat, we get
h(θ|x) ∝ f (x|θ) ∝ θ5 (1 − θ)5 .
We conclude that the posterior distribution is Beta (6, 6). This yields

θ̂map = θ̂pme = 0.5.

Solution 2
The data in the table summarises a random sample

x = (x1 , . . . , xn ), n = 130, nx̄ = 363.

Under the geometric model X ∼ Geom(p) the likelihood is

f(x|p) = (1 − p)^(nx̄−n) p^n = p^130 (1 − p)^233.

(d) Using the uniform prior U(0, 1), we find the posterior to be proportional to the likelihood

h(p|x1 , . . . , xn) ∝ p^130 (1 − p)^233.

It is a beta distribution Beta(131, 234) with the posterior mean


µ = a/(a + b) = 131/(131 + 234) = 0.36,
and the standard deviation of the posterior distribution

σ = √(µ(1 − µ)/(a + b + 1)) = √(0.36 · 0.64/366) = 0.025.
Observe that in this setting, the posterior mean estimate
p̂pme = (1 + 1/n)/(x̄ + 2/n),

for large n, is close to the maximum likelihood estimate p̂ = 1/x̄.
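A short Python sketch of this posterior (not part of the original solution; it assumes SciPy):

from scipy.stats import beta

n, total = 130, 363                    # sample size and sum of the observations
a, b = 1 + n, 1 + total - n            # Beta(131, 234) posterior under the flat prior
post = beta(a, b)
print(post.mean(), post.std())         # ≈ 0.36 and 0.025
print((1 + 1/n) / (total/n + 2/n))     # the closed-form posterior mean
print(n / total)                       # the MLE 1/x̄ ≈ 0.358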

Solution 3
Let us use the binomial model X ∼ Bin(n, p) for the number x of successes in n independent trials. Given x = n,
we would like to estimate p using the Bayesian approach. Applying the uniform prior Beta(1, 1), we find that the
posterior distribution for the parameter p is Beta(n + 1, 1). Since the posterior mean is (n + 1)/(n + 2), we get

p̂pme = (n + 1)/(n + 2).

Solution 4
Data: one observation of X = x. Likelihood ratio test: reject for small values of the likelihood ratio P(x|H0)/P(x|H1).

(a) The likelihood-ratio is computed on the bottom line of the table:


x                    x4     x2     x1    x3
P(x|H0)              0.2    0.3    0.2   0.3
P(x|H1)              0.4    0.4    0.1   0.1
P(x|H0)/P(x|H1)      0.5    0.75   2     3

(b) The null distribution of the likelihood ratio is discrete. Under H0 the test statistic w = P(x|H0)/P(x|H1) takes values
0.5, 0.75, 2, 3 with probabilities 0.2, 0.3, 0.2, 0.3. Therefore

P(W ≤ 0.5|H0 ) = 0.2, P(W ≤ 0.75|H0 ) = 0.5, P(W ≤ 2|H0 ) = 0.7, P(W ≤ 3|H0 ) = 1.

Since H0 is rejected for the small values of w, at α = 0.2 we reject H0 only if w = 0.5, that is when x = x4 . Similarly,
at α = 0.5 we reject H0 for w ≤ 0.75, that is when x is either x4 or x2 .

(c) By the Bayes formula,

P(H0|x) = P(x|H0)P(H0) / (P(x|H0)P(H0) + P(x|H1)P(H1)) = P(x|H0) / (P(x|H0) + P(x|H1)).

Thus the posterior odds equals the likelihood-ratio

P(H0|x)/P(H1|x) = P(x|H0)/P(x|H1),

and we conclude that the outcomes x1 and x3 favour H0 since with these outcomes the likelihood-ratio is larger than 1.

(d) For the general prior


P(H0 ) = π0 , P(H1 ) = π1 = 1 − π0 ,
we get
P(Hi|x) = P(x|Hi)πi / (P(x|H0)π0 + P(x|H1)π1),
yielding the following relation for the posterior odds

P(H0|x)/P(H1|x) = P(x|H0)π0 / (P(x|H1)π1) = w · π0/π1.

Assuming equal costs cost0 = cost1 , the rejection rule is

P(H0|x)/P(H1|x) ≤ cost1/cost0 = 1,

so that in terms of the likelihood-ratio, we reject H0 when


w · π0/π1 ≤ 1,   that is   w ≤ π1/π0 = 1/π0 − 1,   or equivalently   π0 ≤ 1/(1 + w).
Recall that at α = 0.2 the rejection region was

R = {x = x4 } = {w = 0.5},

which in terms of the prior probabilities imposes the restriction


π0 ≤ 1/(1 + w) = 1/(1 + 0.5) = 2/3.
On the other hand, at α = 0.5 the rejection region was

R = {x = x4 } ∪ {x = x2 } = {w = 0.5} ∪ {w = 0.75}.

To be able to reject for the value w = 0.75, we have to put the restriction
π0 ≤ 1/(1 + 0.75) = 4/7.
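The posterior calculation of parts (a) and (c) can be sketched in a few lines of Python (not part of the original text), using the table of probabilities above and the equal prior P(H0) = P(H1) = 1/2:

p_H0 = {'x1': 0.2, 'x2': 0.3, 'x3': 0.3, 'x4': 0.2}
p_H1 = {'x1': 0.1, 'x2': 0.4, 'x3': 0.1, 'x4': 0.4}
for x in ['x4', 'x2', 'x1', 'x3']:
    w = p_H0[x] / p_H1[x]            # likelihood ratio
    post_H0 = w / (1 + w)            # posterior probability of H0 under equal priors
    print(x, w, round(post_H0, 3))   # x1 and x3 give posterior probability above 1/2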

Solution 5
For a single observation X ∼ N(µ, σ), where σ is known, we would like to test H0 : µ = 0 against H1 : µ = 1 using the
prior probabilities
P(H0) = 2/3,    P(H1) = 1/3.
(a) Since the likelihood-ratio equals

f(x|0)/f(x|1) = e^{−x²/(2σ²)} / e^{−(x−1)²/(2σ²)} = e^{(1−2x)/(2σ²)},

and the prior odds is P(H0)/P(H1) = 2, the posterior odds takes the form

P(H0|x)/P(H1|x) = 2 e^{(1−2x)/(2σ²)}.
Choosing H0 for x such that
P(H0 |x)
> 1,
P(H1 |x)
or in other words for
x < 0.5 + σ² ln 2,
we arrive at the next answer
σ2 0.1 0.5 1 5
choose H0 for x < 0.57 x < 0.85 x < 1.19 x < 3.97
(b) In the long run, the proportion of the time H0 will be chosen is

P(X < 1/2 + σ² ln 2) = (2/3) P(X < 1/2 + σ² ln 2 | µ = 0) + (1/3) P(X < 1/2 + σ² ln 2 | µ = 1)
                     = (2/3) Φ(σ ln 2 + 1/(2σ)) + (1/3) Φ(σ ln 2 − 1/(2σ)).

Using this formula we conclude that


σ2 0.1 0.5 1 5
proportion of the time H0 will be chosen 0.67 0.73 0.78 0.94

Solution 6
We are given a pair of beta-densities

f (x|H0 ) = 2x, f (x|H1 ) = 3x2 , 0 ≤ x ≤ 1,

and equal prior probabilities


P(H0 ) = P(H1 ) = 0.5.
By the Bayes formula, the posterior probabilities are
h(H0|x) = 0.5 f(x|H0) / (0.5 f(x|H0) + 0.5 f(x|H1)) = 2/(2 + 3x),    h(H1|x) = 3x/(2 + 3x).

Therefore, the posterior probability of H0 is greater than that of H1 for x satisfying 2 > 3x, that is when x < 2/3.

Solution 7
The parameters (a, b) of the prior beta distribution are found from the equations
a/(a + b) = 1/3,    (1/3)(1 − 1/3)/(a + b + 1) = 1/32.
The prior pseudo-counts are well approximated by a = 2 and b = 4. Thus the posterior beta distribution has
parameters (10, 16) giving the posterior mean estimate p̂pme = 0.38.

Solution 8
(a) We are given two independent samples:
1. sample one has the sample size n = 56 and the sample proportion p̂1 = 8/56 = 0.143,
2. sample two has the sample size m = 74 and the sample proportion p̂2 = 12/74 = 0.162.
An asymptotic 95% confidence interval for the population difference is given by

Ip1−p2 ≈ p̂1 − p̂2 ± 1.96 · √( p̂1(1 − p̂1)/(n − 1) + p̂2(1 − p̂2)/(m − 1) ) = −0.019 ± 0.125 = [−0.144, 0.106].
(b) To find a credibility interval for the parameter p, we can use the uniform prior Beta(1, 1). Adding the pseudo-counts (1, 1) to the total counts (8 + 12, 48 + 62) we get Beta(21, 111) as the posterior distribution. Using the posterior mean µ = 21/(21 + 111) = 0.16 and standard deviation σ = √(0.16(1 − 0.16)/132) = 0.03, we arrive at the normal approximation of the posterior distribution with mean 0.16 and standard deviation 0.03. This yields an approximate 95% credibility interval

Jp ≈ 0.16 ± 1.96 · 0.03 = [0.10, 0.22].
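Both intervals can be checked with a short Python sketch (not part of the original solution; it assumes SciPy):

from math import sqrt
from scipy.stats import beta

n, x1, m, x2 = 56, 8, 74, 12
p1, p2 = x1 / n, x2 / m
# (a) approximate 95% confidence interval, using n-1 and m-1 as in the text
se = sqrt(p1 * (1 - p1) / (n - 1) + p2 * (1 - p2) / (m - 1))
print(p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
# (b) 95% equal-tailed credibility interval from the Beta(21, 111) posterior
post = beta(1 + x1 + x2, 1 + (n - x1) + (m - x2))
print(post.ppf([0.025, 0.975]))        # close to the normal approximation [0.10, 0.22]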

Solution 9
(a) The exponential prior with parameter 1 is Gam(1, 1). Applying the suggested updating rule consecutively four
times:
(1, 1) → (3, 2) → (3, 3) → (5, 4) → (10, 5),
we find the posterior distribution to be Gam(10, 5). Therefore, θ̂PME = 10/5 = 2.

(b) The general updating rule for the random sample (x1 , . . . , xn ) taken from a Poisson distribution, becomes

- the shape parameter α1 = α + nx̄,


- the inverse scale parameter λ1 = λ + n.

By the formula for the mean of the Gam(α1, λ1) distribution,

θ̂PME = (α + nx̄)/(λ + n).

Comparing this to the maximum likelihood estimator θ̂MLE = x̄, we see that
θ̂PME − θ̂MLE = (α + nx̄)/(λ + n) − x̄ = (α − λx̄)/(λ + n) → 0,
as n → ∞. This demonstrates that the role of the prior is less important for large sample sizes.
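A brief Python sketch of the updating in part (a) (not part of the original solution; the individual counts 2, 0, 2, 5 are read off from the chain of updates shown above):

alpha, lam = 1.0, 1.0                 # exponential prior Gam(1, 1)
for x in [2, 0, 2, 5]:                # Poisson counts consistent with the updates shown
    alpha += x                        # the shape gains the observed count
    lam += 1                          # the inverse scale gains one per observation
    print(alpha, lam)
print("posterior mean:", alpha / lam)  # 10/5 = 2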

Solution 10
The Beta(a, b) probability density function g(p) is proportional to p^(a−1)(1 − p)^(b−1). Taking the derivative of this product and setting it equal to zero we arrive at the equation

(a − 1) p^(a−2) (1 − p)^(b−1) = (b − 1) p^(a−1) (1 − p)^(b−2).

Solving this equation we obtain


p∗ = (a − 1)/(a + b − 2).

Solution 11
(a) Take the 95% equal-tailed credibility interval. By definition it is formed by the 2.5% and 97.5% quantiles of the posterior distribution. It follows that the interval must include the posterior median.

(b) All three options produce the same interval since the beta-posterior Beta(5, 5) is symmetric about its mean.

(c) Due to symmetry of the N(3, 0.2) distribution, the 95% HPDI is computed as

3 ± 1.96 · 0.2 = 3 ± 0.39.

Solutions to Section 6.7


Solution 1
For a fixed x, the empirical distribution function F̂ (x) = p̂ is the sample proportion giving an unbiased estimate of
p = F (x) = x. The variance of F̂ (x) is given by

y² = σ²p̂ = p(1 − p)/n = x(1 − x)/n,    x ∈ [0, 1].
Setting n = 16, we get the requested formula

(x − 0.5)² + 16 y² = 0.25.

Solution 2
Let 0 ≤ x ≤ y ≤ 1. We have

F̂(x) = (1{X1≤x} + . . . + 1{Xn≤x})/n,   E(F̂(x)) = F(x) = x,   Var(F̂(x)) = x(1 − x)/n,
F̂(y) = (1{X1≤y} + . . . + 1{Xn≤y})/n,   E(F̂(y)) = F(y) = y,   Var(F̂(y)) = y(1 − y)/n.

Using

F̂(x) · F̂(y) = (1/n²) [ Σi 1{Xi≤x}1{Xi≤y} + Σi Σj≠i 1{Xi≤x}1{Xj≤y} ],

and that

1{Xi≤x}1{Xi≤y} = 1{Xi≤x},

we obtain

E(F̂(x) · F̂(y)) = (1/n²) [ Σi x + Σi Σj≠i xy ] = (x + (n − 1)xy)/n,

implying

Cov(F̂(x), F̂(y)) = E(F̂(x) · F̂(y)) − E(F̂(x)) · E(F̂(y)) = (x + (n − 1)xy)/n − xy = x(1 − y)/n.

Solution 3
Take a look at the ordered sample x(1) , . . . , x(n)

12.28 12.92 13.33 13.64 13.65 13.66 13.68


13.73 13.75 13.83 13.96 13.98 13.98 14.01
14.04 25% quantile
14.10 14.19 14.23 14.27 14.30 14.32 14.41
14.41 14.43 14.44 14.47 14.49 14.52 14.56
14.57 50% quantile
14.57 14.62 14.65 14.68 14.73 14.75 14.77
14.80 14.87 14.90 14.92 15.02 15.03 15.10
15.13 75% quantile
15.15 15.18 15.21 15.28 15.31 15.38 15.40
15.47 15.47 15.49 15.56 15.63 15.91 17.09

(a) The figure below shows the empirical distribution function and the normal QQ-plot.

The distribution appears to be close to normal. The 10% sample quantile equals

(x(6) + x(7))/2 = (13.66 + 13.68)/2 = 13.67.
(b) The expected percentages under different dilution levels are computed as follows
1% dilution µ1 = 14.58 · 0.99 + 85 · 0.01 = 15.28 can not be detected,
3% dilution µ3 = 14.58 · 0.97 + 85 · 0.03 = 16.69 can be detected,
5% dilution µ5 = 14.58 · 0.95 + 85 · 0.05 = 18.10 can be detected.

We see that the value 15.28 can not be detected as an outlier, since it coincides with the 82% sample quantile. There
is only one sample value larger than 16.69, therefore 3% dilution would be easier to detect. Obviously, 5% dilution
resulting in 18.10 is very easy to detect.

Solution 4
Taking the negative derivative of the survival function

1 − F(x) = e^{−α x^β},

we obtain the density function

f(x) = αβ x^(β−1) e^{−α x^β},

and dividing the latter by the former we obtain the hazard function of the Weibull distribution

h(x) = αβ x^(β−1).

Solution 5
Take the Weibull distribution with parameters α and β.

• If β = 1, then h(x) = α is constant and the distribution is memoryless of the current age x.
• If β > 1, say β = 2, then h(x) = 2αx increases with x, implying that the older individuals die more often than
the younger.
• If 0 < β < 1, say β = 0.5, then h(x) = 0.5αx−0.5 decreases with x, predicting that the longer you live the
healthier you become.

Solution 6
The left panel shows a QQ-plot with x for the seeded clouds and y for the control clouds. The QQ-plot is approximated by the line x = 2.5y, suggesting 2.5 times more rainfall from the seeded clouds.

On the right panel, the QQ-plot (constructed after the log-transformation of the data) is approximated by the line

ln x = 2 + 0.8 ln y,

revealing a decreasing slope in the non-linear relationship x = 7.4 y^0.8.

Solution 7
(a) The Laplace curve is symmetric. Its shape is formed by two exponentially declining curves: one for positive x
and the other for the negative x.

(b) For λ = 2 the mean is 0, the standard deviation is 1, the skewness is 0, and the kurtosis is 6. Compared to
the normal curve with the same mean and standard deviation but smaller kurtosis (=3), the Laplace distribution has
heavier tails. Moreover, since the variances are equal, the two curves should cross 4 times as they cover the same area.
This implies that the Laplace curve must also have higher peakedness.

Solution 8
(a) The null distribution of Y is Bin(25, 12 ) as each observation is smaller than the true median (assuming that the
distribution is continuous) with probability 0.5.

(b) A non-parametric CI for the median M is given by (x(k), x(n−k+1)) where k is such that

P(Y > n − k|H0 ) ≈ 0.025.

With n = 25 we find k using the normal approximation with continuity correction:

0.025 ≈ P(Y > 25 − k | H0) = P( (Y − 12.5)/2.5 > (13 − k)/2.5 | H0 ) ≈ 1 − Φ( (13 − k)/2.5 ).

Thus (13 − k)/2.5 ≈ 1.96 and we get k = 8. The approximate 95% CI for the median is given by (x(8), x(18)).
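The choice k = 8 can be checked against the exact binomial coverage with a small Python sketch (not part of the original solution; it assumes SciPy):

from scipy.stats import binom

n = 25
# exact coverage of (x_(k), x_(n-k+1)) as a CI for the median: P(k <= Y <= n-k)
for k in range(6, 10):
    coverage = binom.cdf(n - k, n, 0.5) - binom.cdf(k - 1, n, 0.5)
    print(k, round(coverage, 3))       # k = 8 gives roughly 95% coverage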

Solution 9
(a) When the population under investigation has a clear structure it is more effective to use stratified sampling for
estimating the overall population mean as it has a smaller standard error. In accordance with the optimal allocation
formula:

ni = n · wi σi / σ̄,
the allocation of observations should follow the next two key rules: put more observations in the larger strata, and
put more observations in the strata with higher variation.

(b) The sample kurtosis is computed from a sample (x1, . . . , xn) as

m4 = (1/(n s⁴)) Σ_{i=1}^n (xi − x̄)⁴,

where x̄ = (x1 + . . . + xn)/n is the sample mean and s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)² is the sample variance. If the corresponding coefficient of skewness is close to zero and m4 ≈ 3, then we get an indication that the shape of the population distribution curve is close to the normal distribution.

(c) The standard error for the sample mean is sx̄ = s/√200. Roughly: the range of heights 160–200 centimeters covers 95% of the population distribution. Treating this interval as the mean plus-minus two standard deviations, we find s ≈ 10 cm, so sx̄ is something like 0.7 cm. We conclude that a random sample of size 200 may give a decent estimate of the population mean height.

Solution 10
(a) The sign test statistic y = number of positive xi has the null distribution

Y ∼ Bin(25, 1/2) ≈ N(25/2, 5/2)   under H0.

At 5% significance level we would reject H0 in favour of the one-sided alternative for {y ≥ k}, where k is found from
0.05 = P(Y ≥ k | H0) = P(Y > k − 1 | H0) ≈ 1 − Φ( (k − 0.5 − 12.5)/2.5 ) = 1 − Φ( (k − 13)/2.5 ).

This equation gives

(k − 13)/2.5 = Φ⁻¹(0.95) = 1.645,
which yields k = 17. Thus the rejection region for the sign test is

R = {y ≥ 17}.

We know that the true population distribution is X ∼ N(0.3, 1), therefore the true probability of a positive observation
is
P(X > 0|X ∼ N(0.3, 1)) = 1 − Φ(−0.3) = Φ(0.3) = 0.62.
It follows that for computing the power of the sign test we should use the alternative distribution

Y ∼ Bin(25, 0.62) ≈ N(15.5, 2.43).

The power of the sign test is

P(Y ≥ 17 | Y ∼ Bin(25, 0.62)) ≈ 1 − Φ( (17 − 0.5 − 15.5)/2.43 ) = 1 − Φ(0.41) = 0.34.

(b) Under the normal distribution model X ∼ N(µ, 1) with n = 25, we have X̄ ∼ N(µ, 1/5), and we reject H0 : µ = 0 for

5x̄ > 1.645,   that is for   x̄ > 0.33.

The power of this test,

P(X̄ > 0.33 | X̄ ∼ N(0.3, 1/5)) = 1 − Φ( (0.33 − 0.3)/(1/5) ) = 1 − Φ(0.15) = 0.44,

is higher than the power of the sign test.
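Both power values can be reproduced with a minimal Python sketch (not part of the original solution; it assumes SciPy):

from scipy.stats import norm, binom

n, mu = 25, 0.3
p_pos = norm.cdf(mu)                          # P(X > 0) ≈ 0.62 when X ~ N(0.3, 1)
print(binom.sf(16, n, p_pos))                 # exact power of the sign test, ≈ 0.33
print(norm.sf(1.645 / 5, loc=mu, scale=1/5))  # power of the z-test, ≈ 0.44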

Solutions to Section 7.6


Solution 1
(a) The sample means x̄ = 0.555 and ȳ = 1.624 give unbiased estimates of µ1 and µ2 . The difference µ1 − µ2 is
estimated by ȳ − x̄ = 1.069.

(b) Using two sample variances s2x = 0.216, s2y = 1.180, we compute the pooled sample variance s2p = 0.767, which
gives us an unbiased estimate of σ 2 .

(c) The standard error of ȳ − x̄ = 1.069 is quite large sȳ−x̄ = 0.587 due to the small sample sizes.

(d) Based on the t7 -distribution, an exact 90% confidence interval for the mean difference is

Iµ2 −µ1 = 1.069 ± 1.113.

(e) It is more appropriate to use the two-sided alternative since we do not have access to any prior information
about the relationship between µ1 and µ2 .

(f) From the observed test statistic value t0 = 1.821, we find the two-sided p-value to be larger than 10% using the
t-distribution table for df = 7.

(g) No, because the obtained p-value is larger than 0.1.

(h) Given σ² = 1, we answer differently to some of the above questions:

(b∗ ) σ 2 = 1,
(c∗ ) sȳ−x̄ = 0.6708,
(d∗ ) Iµy −µx = 1.069 ± 1.104,
(f∗ ) using the observed z-score z0 = 1.594, we find the two-sided p-value = 0.11 with the help of the normal
distribution table.

Solution 2
If m = n, then by the definition of the pooled sample variance,
! Pn Pn
1 1 2 2
i=1 (xi − x̄) + i=1 (yi − ȳ)
2
s2 + s22 s2 s2y
2
sp + = · = 1 = x+ = s2x̄ + s2ȳ .
n m n 2n − 2 n n m

Solution 3
We would like to test the null hypothesis of no drug effect
H0 : µ1 = µ2 , the drug is not effective for reducing high blood pressure.
The suggested measurement design: during the same n = 10 days take blood pressure measurements on 4 people, two
on the treatment

x1,1 , . . . , x1,n ,
x2,1 , . . . , x2,n ,

and two controls

y1,1 , . . . , y1,n ,
y2,1 , . . . , y2,n .

Dependencies across the subjects make both the two-sample t-test and the rank sum test inappropriate, since we can treat neither

(x1 , . . . , x2n ) = (x1,1 , . . . , x1,n , x2,1 , . . . , x2,n ),

nor

(y1 , . . . , y2n ) = (y1,1 , . . . , y1,n , y2,1 , . . . , y2,n ),

as random samples taken from two population distributions.

Solution 4
The paradox arises from the way the data was collected. This is an observational study rather than a properly planned randomised experiment, so the two samples are not random samples. A large proportion of patients in a bad condition were allocated to hospital A, which resulted in a smaller overall survival rate for hospital A.

Solution 5
Two independent samples
x1, . . . , xn,    y1, . . . , yn,

are taken from two population distributions with equal standard deviation σ = 10. An approximate 95% confidence interval is

Iµ1−µ2 ≈ x̄ − ȳ ± 1.96 · 10 · √(2/n) = x̄ − ȳ ± 27.72/√n.

For the confidence interval to have width 2, we need

27.72/√n = 1,

implying n ≈ 768.

Solution 6
We are dealing with two independent samples:
Rank Type I Type II Rank
1 3.03 3.19 2
8 5.53 4.26 3
9 5.60 4.47 4
11 9.30 4.53 5
13 9.92 4.67 6
14 12.51 4.69 7
17 12.95 6.79 10
18 15.21 9.37 12
19 16.04 12.75 15
20 16.84 12.78 16
Rank sum 130 80

(a) Assume equal variances and apply the two-sample t-test. Using the summary statistics

x̄ = 10.693,   ȳ = 6.750,   s²x = 23.226,   s²y = 12.978,   sx̄−ȳ = √(s²x̄ + s²ȳ) = 1.903,

compute the observed test statistic

t0 = (10.693 − 6.750)/1.903 = 2.072.
For df = 18, the t-distribution table says that the two-sided p-value of test is larger than 5%. We do not reject the
null hypothesis of equality at the 5% significance level.

(b) The rank sum test statistics are r1 = 130, r2 = 80. Using the normal approximation R1 ≈ N(µ, σ) under H0 with

µ = n(2n + 1)/2 = 105,    σ = √(n²(2n + 1)/12) = 13.23,
we find the two-sided p-value to be

2 P(R1 ≥ 130) ≈ 2(1 − Φ((129.5 − 105)/13.23)) = 0.064,

which is greater than 5%. Again, we can not reject the null hypothesis of equality.

(c) The non-parametric test in (b) is more relevant, since the QQ-plot for the pooled deviations shows non-normality.

(d) To estimate the probability π, that a type I bearing will outlast a type II bearing, we turn to the ordered
pooled sample
X-YYYYYY-XX-Y-X-Y-XX-YY-XXXX.
Pick a pair (xi, yj) at random from the table; the probability that a type II bearing outlasts a type I bearing is estimated by

P(X ≤ Y) = (number of pairs with xi ≤ yj)/(total number of pairs (xi, yj)) = (10 + 4 + 4 + 3 + 2 + 2)/100 = 0.25.
This implies a point estimate π̂ = 0.75.
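A Python sketch of parts (b) and (d) (not part of the original solution; it assumes NumPy and SciPy):

import numpy as np
from scipy.stats import norm, rankdata

type1 = np.array([3.03, 5.53, 5.60, 9.30, 9.92, 12.51, 12.95, 15.21, 16.04, 16.84])
type2 = np.array([3.19, 4.26, 4.47, 4.53, 4.67, 4.69, 6.79, 9.37, 12.75, 12.78])
n = len(type1)
ranks = rankdata(np.concatenate([type1, type2]))
r1 = ranks[:n].sum()                                      # rank sum 130
mu, sigma = n * (2 * n + 1) / 2, np.sqrt(n**2 * (2 * n + 1) / 12)
pval = 2 * (1 - norm.cdf((r1 - 0.5 - mu) / sigma))        # ≈ 0.064
print(r1, pval)
print((type1[:, None] <= type2[None, :]).mean())          # estimate of P(X <= Y) ≈ 0.25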

Solution 7
Consider a random sample of the differences d1 , . . . , dn taken from a continuous population distribution which is
symmetric around the unknown median m. The signed rank test statistic

w = Σ_{i=1}^n ri · 1{di > 0}

is computed using the following steps


step 1: remove signs |d1 |, . . . , |dn |,
step 2: assign ranks r1 , . . . , rn to |d1 |, . . . , |dn |,
step 3: attach the original signs of di to the ranks r1 , . . . , rn ,
step 4: compute w as the sum of the positive ranks.
Under the null hypothesis of no difference H0 : m = 0, at step 3 with n = 4, the signs ± are assigned to the ranks (1, 2, 3, 4) symmetrically at random, due to the model assumption that the population distribution is symmetric around the median. As a result there are 16 equally likely outcomes:
1 2 3 4 w
− − − − 0
+ − − − 1
− + − − 2
+ + − − 3
− − + − 3
+ − + − 4
− + + − 5
+ + + − 6
− − − + 4
+ − − + 5
− + − + 6
+ + − + 7
− − + + 7
+ − + + 8
− + + + 9
+ + + + 10
Thus the null distribution of W is given by the table

k     0      1      2      3      4      5      6      7      8      9      10
pk    1/16   1/16   1/16   2/16   2/16   2/16   2/16   2/16   1/16   1/16   1/16

The smallest one-sided p-value available is 1/16 = 0.06, which is higher than 5%. We conclude that n = 4 is too small a sample size for the signed rank test.

Solution 8
The critical values w°α based on the normal approximation

W ≈ N( n(n + 1)/4, √(n(n + 1)(2n + 1)/24) )

satisfy (after the continuity correction)

w°0.05 = 0.5 + n(n + 1)/4 − 1.96 · √(n(n + 1)(2n + 1)/24),
w°0.01 = 0.5 + n(n + 1)/4 − 2.58 · √(n(n + 1)(2n + 1)/24).
This gives, without rounding,

                                n = 10   n = 20   n = 25
n(n + 1)/4                      27.5     105      162.5
√(n(n + 1)(2n + 1)/24)          9.81     26.79    37.17
α = 0.05                        8.8      53.0     90.1
α = 0.01                        2.7      36.5     68.1

The last two rows should be compared with the values obtained using the exact null distribution:

          n = 10   n = 20   n = 25
α = 0.05  8        52       89
α = 0.01  3        38       68
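The approximate critical values can be reproduced with a short Python sketch (not part of the original solution):

from math import sqrt

def critical_value(n, z):
    # normal approximation with continuity correction for the signed rank statistic
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return 0.5 + mean - z * sd

for n in (10, 20, 25):
    print(n, round(critical_value(n, 1.96), 1), round(critical_value(n, 2.58), 1))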

Solution 9
(a) Since the variance of the difference D = X − Y equals

Var(D) = σ12 + σ22 − 2Cov(X, Y ) = 100 + 100 − 100 = 100,

we have

D ≈ N(0, 10)   under H0.

It follows that the sample mean of the differences with n = 25 satisfies

D̄ = X̄ − Ȳ ≈ N(0, 10/√25) = N(0, 2)   under H0.

The rejection region becomes

R = { d̄/2 > 1.645 } = { d̄ > 3.29 }.

The power function (under the one-sided alternative µ1 − µ2 = δ > 0) is computed using the alternative distribution

D̄ ≈ N(δ, 2),

as follows

Pw(δ) ≈ P(D̄ > 3.29 | N(δ, 2)) = 1 − Φ((3.29 − δ)/2) = Φ((δ − 3.29)/2).
(b) With two independent samples,

D̄ ≈ N(0, √(100/25 + 100/25)) = N(0, 2.83)   under H0.

The 5% rejection region is

R = { d̄/√8 > 1.645 } = { d̄ > 4.65 }.

The corresponding power function equals

Pw(δ) ≈ P(D̄ > 4.65 | N(δ, 2.83)) = 1 − Φ((4.65 − δ)/2.83) = Φ((δ − 4.65)/2.83).

Conclusion. The two power functions are compared graphically on the next figure, where the x-axis represents the
effect size δ = µ1 − µ2 . Clearly, the power of the test for the paired samples is higher.

[Figure: the two power functions plotted against the effect size δ; the paired-sample power reaches 0.5 at δ = 3.29, the independent-sample power at δ = 4.65, and both start at the 5% level at δ = 0.]

Solution 10
Summary statistics for n = 15 pairs (xi , yi ):

x̄ = 85.26, s1 = 21.20, sx̄ = 5.47,


ȳ = 84.82, s2 = 21.55, sȳ = 5.57,
d¯ = x̄ − ȳ = 0.44,
sd = 4.63, sx̄−ȳ = 1.20.

If the pairing had been erroneously ignored, then the two independent samples formula would give a 6 times larger standard error

sx̄−ȳ = √(s²x̄ + s²ȳ) = 7.81.

To test H0 : µ1 = µ2 against H1 : µ1 ≠ µ2 assume D ∼ N(µ1 − µ2, σ) and apply the one-sample t-test based on the test statistic

t0 = d̄/sd̄ = 0.44/1.20 = 0.368.
With df = 14, the two-sided p-value is much larger than 20%, so that we can not reject H0 .

Solution 11
Possible explanations are given below. These are observational studies, based on the data which was not properly
randomised.

(a) Rich patients get rooms with a window and also a higher quality of health care.

(b) A smoking husband may make his wife's life miserable, which increases her risk of cancer.

(c) A more stressed condition of a person may result in skipping breakfast as well as in accidents.

(d) It may be the condition of schizophrenia that causes the use of marijuana.

(e) Being part of a community can have a positive effect on mental health and emotional well-being.

Solution 12
(a) This is another example of Simpson's paradox. It is resolved by referring to the key factor of how difficult it is to enter the different programmes. Men tend to apply for easy programs, while women more often apply for programs with low admission rates.

(b) A simple hypothetical experimental study could be based on two independent random samples. Focus on one
major program, say major F. Take n randomly chosen female candidates and n randomly chosen male candidates.
Ask all of them to apply for major F. Compare two sample proportions of the admitted applicants. (Of course this
experiment is impossible to perform in practice.)

140
Solution 13
(a) We use the binomial model for the number of females Y ∼ Bin(36, p). Given the observed count y = 13,
we have to test H0 : p = 0.5 against the two-sided alternative H1 : p ̸= 0.5. The approximate null distribution is
Y ∼ N(18, 3), therefore, an approximate two-sided p-value becomes

2(1 − Φ((18 − 13)/3)) = 2(1 − Φ(1.67)) = 2 × 0.048 = 9.6%.

With such a high p-value we can not reject the null hypothesis of equal sex ratio.

(b) The random sample mean is

x̄ = (n1 x̄1 + n2 x̄2)/n = (13 × 62.8 + 23 × 69.7)/36 = 67.2,

and the stratified sample mean

x̄s = (1/2) x̄1 + (1/2) x̄2 = (62.8 + 69.7)/2 = 66.3.
(c) The standard error of the stratified sample mean is

sx̄s = (1/2) √(s²1/n1 + s²2/n2) = (1/2) √((6.8)²/13 + (11.7)²/23) = 1.54.
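The arithmetic of parts (b) and (c) can be verified with a few lines of Python (not part of the original solution):

from math import sqrt

n1, xbar1, s1 = 13, 62.8, 6.8
n2, xbar2, s2 = 23, 69.7, 11.7
print((n1 * xbar1 + n2 * xbar2) / (n1 + n2))   # random sample mean ≈ 67.2
print((xbar1 + xbar2) / 2)                     # stratified mean with weights 1/2, ≈ 66.3
print(0.5 * sqrt(s1**2 / n1 + s2**2 / n2))     # its standard error ≈ 1.54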

Solution 14
(a) This is an example of a paired sample, therefore the signed rank test is more appropriate for testing the null
hypothesis of no difference.

(b) We use the signed rank test. The observed test statistics are w = 39 as the sum of the positive ranks and
w = 6 as the sum of the negative ranks.

Animal Site I Site II Difference Signed rank


1 50.6 38.0 12.6 8
2 39.2 18.6 20.6 9
3 35.2 23.2 12.0 7
4 17.0 19.0 -2.0 -2
5 11.2 6.6 4.6 4
6 14.2 16.4 -2.2 -3
7 24.2 14.4 9.8 5
8 37.4 37.6 -0.2 -1
9 35.2 24.4 10.8 6

Using the normal approximation under H0

W ≈ N( n(n + 1)/4, √(n(n + 1)(2n + 1)/24) ) = N(22.5, 8.44),

we find that the two-sided p-value

2 P(W ≤ 6 | H0) ≈ 2 Φ((6.5 − 22.5)/8.44) = 2 Φ(−1.90) = 2(1 − Φ(1.90)) = 2 · (1 − 0.9713) = 0.0574

is larger than 5%. Therefore, we do not reject the null hypothesis of equality in favour of the two-sided alternative.

(c) The extract from the course text book reminds that the null hypothesis for the signed rank test, beside equality
of two population distributions, assumes a symmetric distribution for the differences. It also explains why such an
assumption is reasonable.
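For comparison, a Python sketch of the exact signed rank test on these nine pairs (not part of the original solution; it assumes SciPy):

import numpy as np
from scipy.stats import wilcoxon

site1 = np.array([50.6, 39.2, 35.2, 17.0, 11.2, 14.2, 24.2, 37.4, 35.2])
site2 = np.array([38.0, 18.6, 23.2, 19.0, 6.6, 16.4, 14.4, 37.6, 24.4])
# SciPy reports the smaller rank sum (here 6) and, for n = 9, an exact two-sided p-value
print(wilcoxon(site1, site2, alternative='two-sided'))   # p ≈ 0.055, close to 0.0574 above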

Solution 15
The data is collected using the matched pairs design for 50 independent trials with four possible outcomes (slow
correct, fast correct), (slow correct, fast wrong), (slow wrong, fast correct), (slow wrong, fast wrong). We are given
fast correct fast wrong total
slow correct 42
slow wrong 0 8
total 35 15 50
which allows us to fill in the table as follows.

fast correct fast wrong total
slow correct 35 7 42
slow wrong 0 8 8
total 35 15 50
We apply the McNemar test having the test statistic

x² = (7 − 0)²/(7 + 0) = 7.
With only 7 informative counts, the null distribution is roughly approximated by the χ21 −distribution. Since the
square root of 7 is 2.65, the standard normal distribution gives the (two-sided) p-value 0.8%. We conclude that the
observed difference is statistically significant.
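A minimal Python sketch of the McNemar calculation (not part of the original solution; it assumes SciPy):

from math import sqrt
from scipy.stats import norm, chi2

b, c = 7, 0                       # the two discordant counts from the table
x2 = (b - c)**2 / (b + c)
print(x2, chi2.sf(x2, df=1), 2 * norm.sf(sqrt(x2)))   # 7 and p ≈ 0.008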

Solutions to Section 8.6


Solution 1
Recall the formula for the one-way ANOVA F-test statistic
f = msA/msE = [ In(n − 1)/(I − 1) ] · Σ_{i=1}^I (ȳi· − ȳ··)² / Σ_{i=1}^I Σ_{j=1}^n (yij − ȳi·)².

For I = 2, put
ȳ1· = x̄,    ȳ2· = ȳ,    ȳ·· = (x̄ + ȳ)/2.
In this two-sample setting, the F-test statistic becomes

f = 2n(n − 1) · [ (x̄ − (x̄ + ȳ)/2)² + (ȳ − (x̄ + ȳ)/2)² ] / [ Σ_{j=1}^n (xj − x̄)² + Σ_{j=1}^n (yj − ȳ)² ] = 2n ( (x̄ − ȳ)/(2sp) )² = t²,

where t = (x̄ − ȳ)/(sp √(2/n)) is the two-sample t-test statistic.

Solution 2
(a) Using the table with the data and the available summary statistics, we compute three sums of squares:

ssA = 10((20.34 − 19.40)2 + (18.34 − 19.40)2 + (21.57 − 19.40)2 + (17.35 − 19.40)2 ) = 109.2,
ssE = 9(0.88 + 0.74 + 0.88 + 0.89) = 30.5,
ssT = 3.58 · 39 = 139.7 = 109.2 + 30.5.

Then the ANOVA table takes the form


Source ss df ms f
Treatment 109.2 3 36.4 42.9
Error 30.5 36 0.85
Total 139.7 39

Comparing the observed test statistic 42.9 with the critical value F3,36(0.001) = 6.7436, see Section 11.4, we see that the p-value of the F-test is much smaller than 0.001, so that we can reject the null hypothesis of no difference.

(b) The normality assumption is supported by the four skewness and kurtosis values, the former being close to zero and the latter close to 3. Moreover, the four sample variances are close to each other, making the assumption of equal variances realistic.

(c) Since sp = √msE = 0.92 and the t-distribution table gives approximately

t36(0.0042) ≈ t40(0.005) = 2.7,

we get the following Bonferroni interval

Bµi−µj = ȳi. − ȳj. ± 1.11.
Therefore all observed pairwise differences, except (2, 4), are significant:
Pairs (1, 2) (1, 3) (1, 4) (2, 3) (2, 4) (3, 4)
Differences 2.00 -1.23 2.99 -3.23 0.99 4.22
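The F-test in part (a) can be reproduced from the sums of squares with a short Python sketch (not part of the original solution; it assumes SciPy):

from scipy.stats import f

ss_a, ss_e, df_a, df_e = 109.2, 30.5, 3, 36
ms_a, ms_e = ss_a / df_a, ss_e / df_e
f_stat = ms_a / ms_e
print(f_stat, f.sf(f_stat, df_a, df_e))    # ≈ 42.9, p-value far below 0.001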

142
Solution 3
Assume that under the null hypothesis
H0 : µ1 = . . . = µI = µ
all the data
{yij , i = 1, . . . , I, j = 1, . . . , n}
come from a single normal distribution N(µ, σ). Then the corresponding parameter space Ω0 has the dimension dim
Ω0 = 2. On the other hand, the dimension of the general parameter space Ω is

dim Ω = I + 1,

since the setting beyond H0 is described by the parameters µ1 , . . . , µI and σ. The likelihood-ratio

w = L0(µ̂, σ̂0) / L(µ̂1, . . . , µ̂I , σ̂),

is expressed in terms of two likelihood functions

L(µ1, . . . , µI , σ) = Π_{i=1}^I Π_{j=1}^n (2πσ²)^(−1/2) e^{−(yij − µi)²/(2σ²)} ∝ σ^(−N) exp{ −Σ_i Σ_j (yij − µi)²/(2σ²) },
L0(µ, σ) = L(µ, . . . , µ, σ) ∝ σ^(−N) exp{ −Σ_i Σ_j (yij − µ)²/(2σ²) },

where N = In is the total number of observations. We find the maximum likelihood estimates to be
µ̂ = ȳ··,   σ̂0² = ssT/N,   µ̂i = ȳi·,   σ̂² = ssE/N,

so that

w = [ σ̂0^(−N) exp{−Σ_i Σ_j (yij − µ̂)²/(2σ̂0²)} ] / [ σ̂^(−N) exp{−Σ_i Σ_j (yij − µ̂i)²/(2σ̂²)} ]
  = (σ̂²/σ̂0²)^(N/2) · exp{−ssT/(2ssT/N)} / exp{−ssE/(2ssE/N)} = (σ̂²/σ̂0²)^(N/2).
The likelihood ratio test rejects the null hypothesis for the smaller values of w or, equivalently, for the larger values of the ratio

σ̂0²/σ̂² = ssT/ssE = 1 + ssA/ssE = 1 + (I − 1)msA/(I(n − 1)msE) = 1 + ((I − 1)/(I(n − 1))) · f,

that is, for the larger values of the F-test statistic f.

Recall that under H0

−2 ln w ≈ χ²_k,   where k = dim(Ω) − dim(Ω0) = I − 1.

To see that the exact null distribution for the F-test statistic agrees with the asymptotic null distribution of the likelihood-ratio test, observe that for large n,

−2 ln w = In · ln(1 + ((I − 1)/(I(n − 1))) · f) ≈ In · ((I − 1)/(I(n − 1))) · f ≈ (I − 1)f = ssA/msE.

Noticing that msE converges to σ² as n → ∞, we find that

−2 ln w ≈ ssA/σ².

It remains to recognise that the null distribution of the ratio ssA/σ² is approximated by the chi-squared distribution with I − 1 degrees of freedom.

Solution 4
Consider the one-way layout setting with I = 10, n = 7, and a dataset generated by independent random variables

Yij ∼ N(µi , σ).

Then the corresponding pooled sample variance


1 XX
s2p = msE = (yij − ȳi. )2
I(n − 1) i j

uses I(n − 1) = 60 degrees of freedom.

143
(a) In this case, the 95% confidence interval for a single difference µi − µj takes the form

Iµi−µj = ȳi· − ȳj· ± t60(0.025) sp √(2/n) = ȳi· − ȳj· ± 2.82 · sp/√n.

(b) The Bonferroni simultaneous 95% confidence interval for the (10 choose 2) = 45 differences (µi − µj) is

Bµi−µj = ȳi· − ȳj· ± t60(0.025/45) sp √(2/n) = ȳi· − ȳj· ± 4.79 · sp/√n,

giving the ratio

Bonferroni/Single pair = 4.79/2.82 = 1.7.
(c) The Tukey simultaneous 95% confidence interval for the differences (µi − µj) is

Tµi−µj = ȳi· − ȳj· ± q10,60(0.05) sp/√n = ȳi· − ȳj· ± 4.65 · sp/√n,

giving the ratio

Bonferroni/Tukey = 4.79/4.65 = 1.03.
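The three multipliers of sp/√n can be checked with a Python sketch (not part of the original solution; it assumes a recent SciPy that provides studentized_range):

from math import sqrt
from scipy.stats import t, studentized_range

I, df = 10, 60
single = t.ppf(1 - 0.025, df) * sqrt(2)            # ≈ 2.83
bonf = t.ppf(1 - 0.025 / 45, df) * sqrt(2)         # ≈ 4.79
tukey = studentized_range.ppf(0.95, I, df)         # ≈ 4.65
print(single, bonf, tukey)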

Solution 5
For I = 4 control groups of n = 5 mice each, we would like to test H0 : no systematic differences between groups.
Applying the one-way F-test we get the following ANOVA table.
Source ss df ms f p
Columns 27230 3 9078 2.271 0.12
Error 63950 16 3997
Total 91190 19

The p-value of 12% is obtained using the F3,16 -distribution. We do not reject H0 at 10% significance level.
After inspecting the boxplots of the four samples we see that the assumption of normality and equal variances is
clearly violated. The largest difference is between the third and the fourth boxplots. A control question: explain why
the third boxplot has no upper whisker?

Turning to the non-parametric Kruskal-Wallis test, consider the pooled sample ranks and their group means:
Group I 2 6 9 11 14 r̄1. = 8.4
Group II 4 5 8 17 19 r̄2. = 10.6
Group III 1 3 7 12.5 12.5 r̄3. = 7.2
Group IV 10 15 16 18 20 r̄4. = 15.8
The observed Kruskal-Wallis test statistic takes the value

w = (12 · 5/(20 · 21)) · [ (8.4 − 10.5)² + (10.6 − 10.5)² + (7.2 − 10.5)² + (15.8 − 10.5)² ] = 6.20.
This is close to the critical value x3 (0.1) = 6.25 from the table of Section 11.3, implying that the p-value of the
Kruskal-Wallis test is close to 10%. We do not reject the null hypothesis of no difference between the four control
groups.

Solution 6
The two-way ANOVA table presents the results of two F-tests in the case of I = 3 treatments applied to J = 10
subjects with n = 1 replications.

Source SS df MS F P
Columns (blocks) 0.517 9 0.0574 0.4683 0.8772
Rows (treatments) 1.081 2 0.5404 4.406 0.0277
Error 2.208 18 0.1227
Total 3.806 29

We reject
H0 : no treatment effects
at 5% significance level. Notably, the differences among the subjects are not significant.

The non-parametric Friedman test is based on the ranking within the blocks:
Dog 1 Dog 2 Dog 3 Dog 4 Dog 5 Dog 6 Dog 7 Dog 8 Dog 9 Dog 10 r̄i.
Isof 1 2 3 2 1 2 1 3 1 3 1.9
Halo 2 1 1 3 2 1 3 1 2 2 1.8
Cycl 3 3 2 1 3 3 2 2 3 1 2.3
The observed value of the Friedman test statistic

q = (12 · 10/(3 · 4)) · [ (1.9 − 2)² + (1.8 − 2)² + (2.3 − 2)² ] = 1.4
is smaller than the critical value x2 (0.1) = 4.61 for the chi-squared distribution with 2 degrees of freedom. Thus the
p-value of the Friedman test is larger than 10% and we do not reject the null hypothesis of no difference between the
three treatments.
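The Friedman statistic can be reproduced from the mean ranks with a couple of lines of Python (not part of the original solution):

I, J = 3, 10
rbar = [1.9, 1.8, 2.3]                       # within-block mean ranks of the treatments
q = 12 * J / (I * (I + 1)) * sum((r - (I + 1) / 2) ** 2 for r in rbar)
print(q)                                     # 1.4, to be compared with the chi2(2) critical value 4.61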

Solution 7
The data consists of 48 survival times obtained for I = 3 poisons, J = 4 treatments, and n = 4 observations per cell.
The cell means for the survival times are listed in the next table.
A B C D
I 4.125 8.800 5.675 6.100
II 3.200 8.150 3.750 6.625
III 2.100 3.350 2.350 3.250
If you draw by hand the three profiles connecting the means for three poisons, you will find that the profiles I and II
cross each other, and the profile III is more flat than the other two. The three profiles being non-parallel indicates
that there might be an interaction between the main two factors: the poisons and treatments. We have three null
hypotheses of interest
HA : no poison effect,
HB : no treatment effect,
HAB : no interaction.
(a) Referring to the two-way ANOVA table
Source SS df MS F P
Columns (treatments) 91.9 3 30.63 14.01 0.0000
Rows (poisons) 103 2 51.52 23.57 0.0000
Interaction 24.75 6 4.124 1.887 0.1100
Error 78.69 36 2.186
Total 298.4 47

we reject HA and HB at 1% significance level. We can not reject HAB even at 10% significance level. Conclusion:
3 poisons act differently,
4 treatments act differently,
some indication of interaction.
The analysis of the residuals shows the following results:
the normal QQ-plot reveals non-normality (see the left panel of the figure below),
the coefficient of skewness = 0.59 is positive,
the coefficient of kurtosis = 4.1 is larger than 3.

(b) After transforming the data by death rate = 1/survival time, we obtain new cell means for the death rates
given by the next table.
A B C D
I 0.249 0.116 0.186 0.169
II 0.327 0.139 0.271 0.171
III 0.480 0.303 0.427 0.309
If you now draw three profiles, you will see that they look more parallel. Using the two-way ANOVA results for the
transformed data
Source SS df MS F P
Columns (treatments) 0.204 3 0.068 28.41 0.0000
Rows (poisons) 0.349 2 0.174 72.84 0.0000
Interaction 0.01157 6 0.0026 1.091 0.3864
Error 0.086 36 0.0024
Total 0.6544 47
we reject HA and HB at 1% significance level, and do not reject HAB . Conclusions:
3 poisons act differently,
4 treatments act differently,
no interaction.
The normal QQ-plot of the residuals for the transformed data supports the normality assumption of the three performed F-tests; see the right panel of the figure above.

Solution 8
(a) The assumptions of the normal theory for the two-way ANOVA require that the 20 measurements of this
example (yij ) are generated by the normal distributions

Yij ∼ N(µij , σ),

where
µij = µ + αj + βi , i = 1, 2, 3, 4, 5, j = 1, 2, 3, 4,
satisfying

Σ_j αj = Σ_i βi = 0.

Under this model the most interesting is the null hypothesis of no difference among different types of tires

H0 : α1 = α2 = α3 = α4 = 0.

(b) The Friedman test based on the ranked measurements


Bus Tire 1 Tire 2 Tire 3 Tire 4
1 1 3 4 2
2 1 3 4 2
3 1 4 3 2
4 1 3 4 2
5 1 3 4 2
Mean rank 1.0 3.2 3.8 2.0
results in the test statistic value q = 14.04. The null distribution is approximated by a chi-square distribution with
k = 3 degrees of freedom, whose table gives a p-value less than 0.5%. We reject H0 using the Friedman test.

146
Solution 9
(a) In terms of the two-way ANOVA model
Yijk = µ + αi + βj + δij + σZijk , Zijk ∼ N(0, 1)
( grand mean + main effects + interaction+noise), we estimate the main effects as

α̂1 = 11.9, α̂2 = −11.8, β̂1 = 1.99, β̂2 = −5.02, β̂3 = 3.04.
(Notice the effect of rounding errors.)

(b) Compute the cell means


Speed
60 70 80
1 189.7 185.1 189.0
1 188.6 179.4 193.0
1 190.1 177.3 191.1
Cell means 189.5 180.6 191.0
2 165.1 161.7 163.3
2 165.9 159.8 166.6
2 167.6 161.6 170.3
Cell means 166.2 161.0 166.7
and draw two lines for the speed depending on two different formulations, see the left panel on the figure below. These
two lines are almost parallel indicating to the absence of interaction between two main factors. This is confirmed by
the ANOVA table below showing that the interaction is not significant.

[Figure: yield (roughly 160–190) versus speed (60, 70, 80) for the two formulations — nearly parallel profiles on the left panel; a hypothetical interaction pattern on the right panel.]

One possible interaction effect could have the form on the right panel. In this case the formulation 2 interacts with
the speed factor in such a way that the yield becomes largest at the speed 70.

(c) The two-way ANOVA table


Source df ss ms f Critical values Significance
Formulation 1 2253.44 2253.44 376.2 F1,12 (0.001) = 18.6433 Highly significant
Speed 2 230.81 115.41 19.3 F2,12 (0.001) = 12.9737 Highly significant
Interaction 2 18.58 9.29 1.55 F2,12 (0.100) = 2.9245 Not significant
Error 12 71.87 5.99
Total 17
(d) It remains to check the normality assumption underlying the three F-tests.

Solution 10
(a)
Source of variation ss df ms f
Varieties 328.24 2 164.12 103.55
Density 86.68 3 28.89 18.23
Interaction 8.03 6 1.34 0.84
Errors 38.04 24 1.59
(b) Using the critical values
F2,24 (0.001) = 9.3394, F3,24 (0.001) = 7.5545, F6,24 (0.100) = 2.0351,
we reject both null hypotheses on the main factors and do not reject the null hypothesis on interaction.

(c) s = √1.59 = 1.26.

Solution 11
(a) The stratified sample mean is

x̄s = 0.3 · 6.3 + 0.2 · 5.6 + 0.5 · 6.0 = 6.01.

(b) We are in the one-way ANOVA setting with I = 3 and J = 13. The 95% Bonferroni simultaneous confidence interval for the differences (µi − µj) has the form

Bµi−µj = x̄i − x̄j ± t36(0.05/6) sp √(2/13),

involving the pooled sample variance

s²p = (12 · s²1 + 12 · s²2 + 12 · s²3)/36 = (2.14² + 2.47² + 3.27²)/3 = 2.67².
This yields
Bµi −µj = x̄i − x̄j ± 2.5 · 2.67 · 0.39 = x̄i − x̄j ± 2.62,
and therefore,
Bµ1 −µ2 = 0.7 ± 2.62, Bµ1 −µ3 = 0.3 ± 2.62, Bµ3 −µ2 = 0.4 ± 2.62.
(c) We would not reject the null hypothesis of equality µ1 = µ2 = µ3 , since all three simultaneous confidence
intervals contain zero:

0.7 ± 2.62, 0.3 ± 2.62, 0.4 ± 2.62.
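A Python sketch of the Bonferroni intervals in part (b) (not part of the original solution; it assumes SciPy):

from math import sqrt
from scipy.stats import t

s_p2 = (2.14**2 + 2.47**2 + 3.27**2) / 3
half_width = t.ppf(1 - 0.05 / 6, 36) * sqrt(s_p2) * sqrt(2 / 13)
print(sqrt(s_p2), half_width)          # ≈ 2.67 and ≈ 2.6
for diff in (0.7, 0.3, 0.4):           # the observed differences between group means
    print(diff - half_width, diff + half_width)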

Solution 12
(a) This is an example of a randomised block for I = 3 treatments, J = 2 blocks, and n = 1 observation for each of
6 combinations of levels. The subjects (blocks) were matched by gender and mail to reduce the influence of external
factors.

(b) Consider the Friedman test for I = 3 treatments and J = 2 blocks. The test statistic

q = 2 Σ_{i=1}^3 (r̄i. − 2)²

is obtained from the ranks given by two subjects (rij ) to the three treatments. There are 62 = 36 possible rank
combinations:
           
1 1 1 1 1 2 3 1 3 1 3 3
(rij ) =  2 2  ,  2 3 , 2 1 ,..., 2 2 , 2 3 , 2 2 
3 3 3 2 3 3 1 3 1 2 1 1

Under the null hypothesis, these 36 combinations are equally likely. The corresponding vector of rank averages
(r̄1. , r̄2. , r̄3. ) may take 5 different values (up to permutations)

A1 = (1, 2, 3), A2 = (1, 2.5, 2.5), A3 = (1.5, 1.5, 3), A4 = (1.5, 2, 2.5), A5 = (2, 2, 2)

according to the following table

1, 2, 3 1, 3, 2 2, 1, 3 2, 3, 1 3, 1, 2 3, 2, 1
1, 2, 3 A1 A2 A3 A4 A4 A5
1, 3, 2 A2 A1 A4 A3 A5 A4
2, 1, 3 A3 A4 A1 A5 A2 A4
2, 3, 1 A4 A3 A5 A1 A4 A2
3, 1, 2 A4 A5 A2 A4 A1 A3
3, 2, 1 A5 A4 A4 A2 A3 A1

From this table, we find the possible outcomes of interest and their probabilities due to the random sampling assuming
that the null hypothesis is true:

(r̄1. , r̄2. , r̄3. ) = A1 A2 A3 A4 A5


q= 4 3 3 1 0
probability = 1/6 1/6 1/6 1/3 1/6

Thus the null distribution of Q is the following one

P(Q = 0) = 1/6, P(Q = 1) = 1/3, P(Q = 3) = 1/3, P(Q = 4) = 1/6.

Solutions to Section 9.5
Solution 1
We will test the null hypothesis of homogeneity

H0 : equal genotype frequencies for diabetics and normal groups

using the chi-squared test. The table below gives the expected counts along with the observed counts:

diabetic normal total


Bb or bb 12 (7.85) 4 (8.15) 16
BB 39 (43.15) 49 (44.85) 88
total 51 53 104

From here the chi-squared test statistic is computed to be x2 =5.10. With df=1, the p-value of the test = 0.024 is
found using the normal distribution table. We reject H0 and conclude that the diabetics have the genotype BB less
often than the normals.

Solution 2
Here we twice apply the chi-squared test of independence.

(a) We test H0 : no association of the disease and the ABO blood group, using the observed and expected counts:
O A AB B Total
Moderate 7 (10.4) 5 (9.8) 3 (2.0) 13 (6.2) 28
Minimal 27 (30.4) 32 (29.7) 8 (6.1) 18 (18.8) 85
Not present 55 (48.6) 50 (47.5) 7 (9.8) 24 (30.0) 136
Total 89 87 18 55 249

With x2 = 15.37 and df = 6, the chi-squared distribution table says that the p-value of the test is less than 2.5%. We
reject H0 .

(b) We test H0 : no association of the disease and the MN blood group, using the observed and expected counts:

MM MN NN Total
Moderate 21 (16.7) 6 (9.4) 1 (1.9) 28
Minimal 54 (51.3) 27 (28.9) 5 (5.8) 86
Not present 74 (81.1) 51 (45.7) 11 (9.2) 136
Total 149 84 17 250

With x2 = 4.73, df = 4, the chi-squared distribution table says that the p-value of the test is larger than 10%. We
can not reject H0 .

Solution 3
(a) Applying the chi-squared test of homogeneity:
girl boy total
flying fighter 51 (45.16) 38 (43.84) 89
flying transport 14 (15.22) 16 (14.78) 30
not flying 38 (42.62) 46 (41.38) 84
total 103 100 203

we get x2 = 2.75, df = 2, giving the p-value larger than 10%. Conclusion: the data does not confirm the difference in
boy proportion fathered by different professionals.

(b) We apply the goodness of fit chi-squared test for the null hypothesis

H0 : boys proportions π12 = π22 = π32 = 0.513,

where 0.513 is the population proportion obtained as


105.37/(105.37 + 100) = 0.513.
Using the observed and expected counts

girl boy total
flying fighter 51 (43.34) 38 (45.66) 89
flying transport 14 (14.61) 16 (15.39) 30
not flying 38 (40.91) 46 (43.09) 84
Total 103 100 203

we get x2 = 3.09, df = 3, giving the p-value larger than 10%. Conclusion: can not reject H0 .
Why we use df = 3 here is explained next. The general model is described by three independent parameters
π12, π22, π32, which gives us
dim Ω = 3
degrees of freedom to start with. Since the null hypothesis model is simple, we get

dim Ω0 = 0,

and the resulting number of degrees of freedom becomes

df = 3 − 0 = 3.

Solution 4
We use the chi-squared test for homogeneity involving five samples.
no nausea incidence of nausea total
placebo 70 (84) 95 (81) 165
chlorpromazine 100 (78) 52 (74) 152
dimenhydrinate 33 (43) 52 (42) 85
pentobarbital (100 mg) 32 (34) 35 (33) 67
pentobarbital (150 mg) 48 (43) 37 (42) 85
total 283 271 554

The observed test statistic x2 = 35.8 according to the χ24 -distribution table gives the p-value smaller than 0.005.
Comparing the observed and expected counts we conclude that chlorpromazine is most effective in ameliorating
postoperative nausea.

Solution 5
(a) We test H0 : no relation between blood group and disease in London, using the chi-squared test of homogeneity.

control peptic ulcer total


group A 4219 (4103.0) 579 (695.0) 4798
group O 4578 (4694.0) 911 (795.0) 5489
total 8797 1490 10287

The test statistic x2 = 42.40, with df = 1, according to the normal distribution table gives very small p-value = 0.000.
We reject H0. The odds ratio

∆̂ = (4219 · 911)/(4578 · 579) = 1.45

says that blood group O increases the odds of peptic ulcer by a factor of 1.45 compared to group A.

(b) We test H0 : no relation between blood group and disease in Manchester, using the chi-squared test of homo-
geneity.

control peptic ulcer total


group A 3775 (3747.2) 246 (273.8) 4021
group O 4532 (4559.8) 361 (333.2) 4893
total 8307 607 8914

The test statistic x2 = 5.52, with df = 1, according to the normal distribution table gives the p-value = 0.019. We reject H0. The odds ratio for the Manchester data is ∆̂ = 1.22.

(c) We test H0 : London group A and Manchester group A have the same propensity to Peptic Ulcer, using the
chi-squared test of homogeneity.

control peptic ulcer total


London group A 4219 (4349.2) 579 (448.8) 4798
Manchester group A 3775 (3644.8) 246 (376.2) 4021
total 7994 825 8819

The test statistic x2 = 91.3, with df = 1, gives a very small p-value = 0.000. We reject H0 .

We test H0 : London Group O and Manchester Group O have the same propensity to Peptic Ulcer, using the
chi-squared test of homogeneity.

control peptic ulcer total


London group O 4578 (4816.5) 911 (672.5) 5489
Manchester group O 4532 (4293.5) 361 (599.5) 4893
total 9110 1272 10382

The test statistic x2 = 204.5, with df = 1, gives a very small p-value = 0.000. We reject H0 .

Solution 6
Denote the two main factors of the study by
D = endometrical carcinoma,

X = estrogen taken at least 6 months prior to the diagnosis of cancer.


(a) For this retrospective case-control study with matched controls, we have the following counts.
D̄X D̄X̄ total
DX 39 113 152
DX̄ 15 150 165
total 54 263 317
Applying the McNemar test we find that the test statistic

x² = (113 − 15)²/(113 + 15) = 75

is highly significant, as √75 = 8.7 and the corresponding two-sided p-value obtained from the N(0, 1) table is very small. Conclusion: since 113 is significantly larger than 15, taking estrogen increases the risk of endometrical carcinoma.

(b) Possible weak points in a retrospective case-control design

• selection bias: some patients have died prior the study,


• information bias: have to rely on other sources of information.

Solution 7
The exact Fisher test uses Hg(30, 17, 16/30) as the null distribution of the test statistic whose observed value is x = 12.
We will test
H0 : no difference between high-anxiety and low-anxiety groups
against the two-sided alternative

H1 : there is a difference between high-anxiety and low-anxiety groups.

Personally, I do not see a natural choice of the one-sided alternative.


Next, we compute the two-sided p-value of the Fisher test using the normal approximation of the hypergeometric
distribution:
Hg(30, 17, 16/30) ≈ N(9.1, 1.4).

With help of the continuity correction, we find the one-sided p-value to be

P(X ≥ 12 | H0) = P(X > 11 | H0) ≈ 1 − Φ((11.5 − 9.1)/1.4) = 1 − Φ(1.71) = 0.044.

Thus the two-sided p-value is approximately 9% and we do not reject the null hypothesis.
It is interesting to compare this result with what the chi-squared test yields: in this case the test statistic is x² = 4.69. With df = 1, the two-sided p-value of the chi-squared test

2(1 − Φ(√4.69)) = 2(1 − Φ(2.16)) = 0.03

is smaller than 5%, showing that the chi-squared test is a poor substitute for the exact Fisher test due to the small sample size.

Solution 8
Denote
π1 = probability that red wins in boxing,
π2 = probability that red wins in freestyle wrestling,
π3 = probability that red wins in Greco-Roman wrestling,
π4 = probability that red wins in Tae Kwon Do.
Is there evidence that wearing red is more favourable in some of the sports than others? We test

H0 : π 1 = π 2 = π 3 = π 4 vs H1 : πi ̸= πj for some i ̸= j

using the chi-squared test of homogeneity. From


Red Blue Total
Boxing 148 (147) 120 (121) 268
Freestyle wrestling 27 (28) 24 (23) 51
Greco-Roman wrestling 25 (26) 23 (22) 48
Tae Kwon Do 45 (44) 35 (36) 80
Total 245 202 447
Marginal proportions 0.55 0.45 1.00

we find that the test statistic x² = 0.3 with df = 3 is not significant. We can not reject H0. We also see that, overall, the red outfit gives an advantage over the blue one, as π̂ = 0.55.

Solution 9
After asking the opinion of 50 female employees and 50 male employees, the company’s statistician should carry out
a chi-square test of homogeneity.

Solution 10
(a) Multiple testing problem.

(b) Exact Fisher’s test.

(c) Nonparametric tests do not assume a particular form of the population distribution like normal distribution.

Solution 11
The null hypothesis is that everybody votes independently. Let p be the population proportion for ’yes’. Then the
number of ’yes’ for three voters in a household has the binomial distribution model X ∼ Bin(3, p) with an unspecified
parameter p. Thus the null hypothesis can be expressed in the following form

H0 : p0 = (1 − p)3 , p1 = 3p(1 − p)2 , p2 = 3p2 (1 − p), p3 = p3 .

We apply the Pearson chi-squared test with the expected counts based on the maximum likelihood estimate of p, the
sample proportion p̂ = 0.5417:

E0 = n(1 − p̂)³ = 19,   E1 = 3np̂(1 − p̂)² = 68,   E2 = 3np̂²(1 − p̂) = 81,   E3 = np̂³ = 32.

The observed chi-square test statistic is x2 = 11.8 which has the p-value less than 0.5% according to the approximate
null distribution χ2df with df = 4 − 1 − 1 = 2. Conclusion: we reject the null hypothesis of independent voting.

Solution 12
The equivalence of
H0 : πi|j = πi·
and
H0 : πij = πi· π·j
follows from the formula for the conditional probabilities

πi|j = πij/π·j.

Solution 13
(a) This is a single sample of size n = 441. Each of the n observations falls in one of 9 groups. The multinomial distribution
model
(C11 , C12 , C13 , C21 , C22 , C23 , C31 , C32 , C33 ) ∼ Mn(n, p11 , p12 , p13 , p21 , p22 , p23 , p31 , p32 , p33 )
gives the likelihood function

L(p11, p12, p13, p21, p22, p23, p31, p32, p33)
= P(C11 = 24, C12 = 15, C13 = 17, C21 = 52, C22 = 73, C23 = 80, C31 = 58, C32 = 86, C33 = 36)
= [441!/(24! 15! 17! 52! 73! 80! 58! 86! 36!)] · p11^24 · p12^15 · p13^17 · p21^52 · p22^73 · p23^80 · p31^58 · p32^86 · p33^36.
(b) The null hypothesis of independence H0 : pij = pi. · p.j meaning that there is no relationship between facility
conditions at gasoline stations and aggressiveness in the pricing of gasoline.

(c) The chi-squared test statistic x2 = 22.5 computed from the observed and expected counts should be compared
with the critical values of the χ24 -distribution. Since it is larger than 14.86, we conclude that the p-value of the test is
smaller than 0.5%. We reject the null hypothesis of independence and infer that there is a relationship between the
facility conditions at gasoline stations and aggressiveness in the pricing of gasoline.
Pricing policy
Aggressive Neutral Nonaggressive Total
Substandard condition 24 (17) 15 (22) 17 (17) 56
Standard condition 52 (62.3) 73 (80.9) 80 (61.8) 205
Modern condition 58 (54.7) 86 (71) 36 (54.3) 180
Total 134 174 133 441
It looks like the standard conditions are coupled with the least aggressive pricing strategy.

Solution 14
(a) The risk ratio compares the chances to suffer from myocardial infarction under the aspirin treatment versus
the chances to suffer from myocardial infarction under the placebo treatment:
RR = P(MyoInf | Aspirin) / P(MyoInf | Placebo).
(b) The null hypothesis of RR = 1 is equivalent to the hypothesis of homogeneity.
MyoInf No MyoInf Total
Aspirin 104 (146.5) 10933 (10887.5) 11037
Placebo 189 (146.5) 10845 (10887.5) 11034
Total 293 21778 22071
The corresponding chi-squared test statistic is

x² = 42.5²/146.5 + 42.5²/146.5 + 42.5²/10887.5 + 42.5²/10887.5 = 25.
Since df = 1, we can use the normal distribution table. The square root of 25 is 5 making the result highly significant.
Aspirin works!

Solutions to Section 10.5


Solution 1
Recall the formulas for the sample and population covariances:

sxy = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)(yi − ȳ),    Cov(X, Y) = E(XY) − E(X)E(Y).

We have to check that

E(SXY ) = E(XY ) − E(X)E(Y ).

To this end, observe that


Σ_{i=1}^n (xi − x̄)(yi − ȳ) = Σ_{i=1}^n xi yi − x̄ Σ_{i=1}^n yi − ȳ Σ_{i=1}^n xi + n x̄ ȳ = Σ_{i=1}^n xi yi − n x̄ ȳ,

and

n² x̄ ȳ = Σ_{i=1}^n xi · Σ_{i=1}^n yi = Σ_{i=1}^n xi yi + Σ_{i≠j} xi yj,

so that

SXY = (1/n) Σ_{i=1}^n Xi Yi − (1/(n(n − 1))) Σ_{i≠j} Xi Yj.

It remains to see that

E( Σ_{i=1}^n Xi Yi ) = n E(XY),    E( Σ_{i≠j} Xi Yj ) = n(n − 1) E(X) E(Y).

Solution 2
After ordering over the x values we get
x −1.75 −1.18 −0.88 −0.65 −0.30 0.34 0.50 0.68 1.38 1.40
y −1.59 −0.81 −0.98 −0.53 −0.72 0.27 0.64 0.35 1.34 1.28
From this sample we compute the following five sufficient statistics
x̄ = −0.046, ȳ = −0.075, sx = 1.076, sy = 0.996, r = 0.98.
(a) If x is the predictor and y is the response, then the regression line takes the form

y − ȳ = r · (sy/sx)(x − x̄),

yielding

y = −0.033 + 0.904 · x.

The estimated noise size is

s = √( (n − 1)/(n − 2) · s²y (1 − r²) ) = 0.22.
(b) If y is the predictor and x is the response, then the regression line takes the form

x − x̄ = r · (sx/sy)(y − ȳ),

yielding

x = 0.033 + 1.055 · y.

The estimated noise size is

s = √( (n − 1)/(n − 2) · s²x (1 − r²) ) = 0.24.
(c) The first fitted line
y = −0.033 + 0.904 · x
is different from the second
y = −0.031 + 0.948 · x,
because in (a) we minimise the vertical residuals while in (b) we minimise the horizontal residuals.
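Both fitted slopes can be checked with a short Python sketch on the listed data (not part of the original solution; it assumes NumPy):

import numpy as np

x = np.array([-1.75, -1.18, -0.88, -0.65, -0.30, 0.34, 0.50, 0.68, 1.38, 1.40])
y = np.array([-1.59, -0.81, -0.98, -0.53, -0.72, 0.27, 0.64, 0.35, 1.34, 1.28])
r = np.corrcoef(x, y)[0, 1]
b_yx = r * y.std(ddof=1) / x.std(ddof=1)      # slope for y on x, ≈ 0.90
b_xy = r * x.std(ddof=1) / y.std(ddof=1)      # slope for x on y, ≈ 1.06
print(r, b_yx, b_xy)
print(y.mean() - b_yx * x.mean(), x.mean() - b_xy * y.mean())   # the two intercepts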

Solution 3
Using an extra explanatory variable f which equals 1 for females and 0 for males, we combine two simple linear
regression models into the multiple regression
y = β0 + β1 x + β2 f + σZ,
so that
β0 = β0′′ , β2 = β0′ − β0′′ .
Here p = 3 and the design matrix is

X = [ 1 x1 f1 ; . . . ; 1 xn fn ],

with the i-th row equal to (1, xi, fi).
After β0 , β1 , β2 are estimated, we can compare the original intercepts β0′ and β0′′ using the relations
β0′′ = β0 , β0′ = β0 + β2 .
We may also test the key null hypothesis β2 = 0 saying that the grades of the female and male students satisfy the
same simple linear regression model.

Solution 4
Using P = X(X⊺ X)−1 X⊺ , we get

P2 = X(X⊺ X)−1 X⊺ X(X⊺ X)−1 X⊺ = X(X⊺ X)−1 X⊺ = P.

Solution 5
(a) Given x = 95, we predict the final score by

ŷ = 75 + 0.5(95 − 75) = 85,

based on the formula for the predicted response


(ŷ − ȳ)/sy = r · (x − x̄)/sx.

Regression to mediocrity in action: since the predictor x = 95 is larger than the average 75, the predicted response
value 85 is smaller than 95.

(b) Given y = 85, we predict the midterm score by

x̂ = 75 + 0.5(85 − 75) = 80,

based on the formula for the predicted response


(x̂ − x̄)/sx = r · (y − ȳ)/sy.

Again, regression to mediocrity.

Solution 6
(a) First, we find the correlation coefficient ρ between X and Y . Since EX = 0, we have

Cov(X, Y ) = E(XY ) = E(X 2 + βXZ) = 1, VarY = VarX + β 2 VarZ = 1 + β 2 ,

implying that the correlation coefficient is always positive


ρ = 1/√(1 + β²).

(b) To generate pairs (x, y) with a given positive correlation coefficient ρ, one can proceed as follows. From ρ = 1/√(1 + β²), find

β = √(ρ⁻² − 1).

After generating a pair (x, z) of N(0, 1) numbers, the pair (x, y) with

y = x + z √(ρ⁻² − 1)

will produce the required result.
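A minimal Python sketch of this recipe (not part of the original solution; it assumes NumPy):

import numpy as np

rng = np.random.default_rng(0)
rho = 0.7
beta = np.sqrt(rho**-2 - 1)
x = rng.standard_normal(100_000)
z = rng.standard_normal(100_000)
y = x + beta * z                      # Corr(X, Y) = 1/sqrt(1 + beta^2) = rho
print(np.corrcoef(x, y)[0, 1])        # ≈ 0.7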

Solution 7
Two regression models

y = −62.05 + 3.49 · x, r2 = 0.984,



y = −0.88 + 0.2 · x, r2 = 0.993,

produce two residual plots: the blue circles for the first model and the red stars for the second. Clearly, the second
model has a much better fit to the data.

The kinetic energy formula explains why the second model is better.

Solution 8
The coefficient of determination is the squared sample correlation coefficient

r² = (0.2)² = 0.04.

Solution 9
Using the formulas for b1 and sb1 we obtain

t = b1/sb1 = (r sy/sx) / √( s²/((n − 1)s²x) ) = r sy / √( s²y (1 − r²)/(n − 2) ) = r √(n − 2) / √(1 − r²).

Solution 10
(a) The corresponding multiple regression model takes the form yi = β0 + β1 xi + β2 x2i + ei , where ei , i = 1, . . . , 5
are independent realisations of a normal random variable with distribution N(0, σ). The corresponding design matrix
is
 
    1  0  0
    1  2  4
X = 1  4  16
    1  6  36
    1  8  64
(b) Using the formula ŷi = 111.8857 + 8.0643xi − 1.8393x2i we get

xi 0 2 4 6 8
yi 110 123 119 86 62
ŷi 111.8857 120.6571 114.7143 94.0571 58.6857
implying

s² = Σ_{i=1}^n (yi − ŷi)²/(n − p) = 103.3/2 = 51.65.
(c) The coefficient of multiple determination equals
R² = 1 − ssE/ssT = 1 − 103.3/2630 = 0.961.

Solution 11
(a) The underlying parabola makes the unrealistic prediction that ŷ40 = 139 sec compared to ŷ10 = 63 sec and ŷ20 = 34 sec. One should be careful about extending the range of the explanatory variable beyond that used in the data.

An empirical relationship developed in a region might break down
if extrapolated to a wider region in which no data have been observed.

156
(b) Using t17 (0.005) = 2.898 we get the exact confidence interval (under the assumption of normality and ho-
moscedasticity for the noise component)

Iµ = 0.2730 ± 2.898 · 0.1157 = (−0.0623, 0.6083).

(c) Since the confidence interval from (b) covers zero, we do not reject the null hypothesis H0 : β2 = 0 at the 1% significance level. The observed t-test statistic is 0.2730/0.1157 = 2.36, and according to the t17-distribution table the two-sided p-value lies between 2% and 5%.

Solution 12
(b) The fitted regression line for the final score y as a function of the midterm score x is y = 37.5 + 0.5x. Given
x = 90 we get a point prediction y = 82.5. The estimate of σ 2 is
s² = ((n − 1)/(n − 2)) s²y (1 − r²) = 84.4.

Thus the 95% prediction interval for Carl's final score is

I = 82.5 ± t8(0.025) · s · √( 1 + 1/10 + (1/9)(15/10)² ) = 82.5 ± 24.6.

(c) The fitted regression line for the midterm score x as a function of the final score y is x = 37.5 + 0.5y. Given
y = 80 we get a point prediction x = 77.5.
