Lecture Notes 1

UNIVERSITY OF CAPE COAST
DEPARTMENT OF STATISTICS
STA 403:
STATISTICAL METHODS II
delivered by
PROF. NATHANIEL HOWARD

COURSE OUTLINE
Further methods for discrete data: examples and formulation - binomial,
multinomial and Poisson distributions. Comparison of two binomials;
McNemar’s test for matched pairs; theory and transformations of
variables; multiple linear regression; selection of variables; use of dummy
variables. Introduction to logistic regression and generalized linear
modeling. Non-parametric methods. Use of least squares principle;
estimation of contrasts, two-way crossed classified data.
Pre-requisite: STA 303
(Statistical methods I)
Recommended Literature
1. Ott, R.L. (1992). An Introduction to Statistical Methods and Data
Analysis; 4th Ed.; Duxbury Press, Belmont, California, USA.
2. Freund, J.E. (1992). Mathematical Statistics; 5th Ed.; Prentice-Hall
Int. Inc., London, UK.
3. Milton, J.S., Corbet, J.J. and McTeer, P.M. (1986). Introduction to
Statistics. D.C. Heath & Co., Toronto, Canada.
4. Gordor, B. K. and Howard, N. K. (2006). Introduction to Statistical
Methods; Ghana Mathematical Group.
5. Wetherill, G.B. (1981). Intermediate Statistical Methods. Chapman
and Hall, London, UK.
Quiz Schedules
Quiz 1: (18th January, 2024)
[20%]
Quiz 2: (End of week 9)
[20%]
Schedules of Lectures
Tuesday 12:30 – 1:30pm (LT 20)
Tuesday 1:30 – 2:30pm (LT 20)
Wednesday: 4:30 – 5:30pm (LT 21)
Thursday: 8:30 – 9:30pm (PGR)

Thursday: 9:30 – 10:30pm (PGR)
Thursday: 10:30 – 11:30pm (PGR)
COURSE OBJECTIVES
By the end of this course, students should be able to:
• Perform hypothesis tests for quantitative and qualitative data sets.
• Perform and interpret correlation analysis.
• Perform and interpret simple and multiple linear regression analyses.
• Describe logistic regression models and solve problems involving them.
• Describe generalized linear models (GLMs) and solve problems involving them.
• Perform non-parametric tests.
HYPOTHESES AND TEST PROCEDURES
What is Hypothesis?
A statistical hypothesis is a statement or assertion which may or may not be
true about the value of a population parameter.
Example
The mean age $\mu$ of Master of Public Health students in the Department
of Physician Assistantship equals 24 years. That is, $\mu = 24$.
Types of Hypotheses
There are two types of hypotheses, namely the null hypothesis and the
alternative hypothesis.
The main hypothesis which we wish to test is called the null
hypothesis. It is denoted by $H_0$.
The hypothesis that will be accepted when $H_0$ is rejected is called
the alternative hypothesis. It is an assertion that contradicts the null
hypothesis. It is denoted by $H_1$.
Classification of Hypothesis Tests
In any hypothesis testing problem, the null and alternative
hypotheses may further be classified as simple or composite,
and the test as one-sided or two-sided, depending on how the test is
set up.
Simple and Composite Tests
• If a hypothesis specifies a single value for the parameter $\theta$,
  it is called a simple hypothesis.
  Example: $H_0: \theta = 0$ or $H_1: \theta = 20$.
• If a hypothesis allows $\theta$ to take on multiple values,
  it is called a composite hypothesis.
  Example: $H_0: \theta \le 100$ or $H_1: 10 < \theta < 100$.
One-sided and two-sided tests
When both the null and alternative hypotheses are composite and each
represents one side of the parameter space around some value $\theta_0$,
the test is said to be a one-sided test. One-sided tests are also called
one-tailed tests.
Example
$H_0: \theta \le \theta_0$ against $H_1: \theta > \theta_0$

When the null hypothesis is simple and the alternative hypothesis represents
the rest of the parameter space $\Theta$, the test is said to be a two-sided
test. Two-sided tests are also called two-tailed tests.
Example
$H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$
Errors in Hypothesis Testing
The purpose of hypothesis testing is to determine whether the evidence from
the available data tends to refute $H_0$. Since $H_0$ can either be true
or false, and at the end of the experiment we can either reject or fail to
reject $H_0$, there are four possible outcomes.
Hypothesis Testing Decisions
1. Reject $H_0$ when it is true (wrong decision – Type I error).
2. Reject $H_0$ when it is false (correct decision).
3. Fail to reject $H_0$ when it is true (correct decision).
4. Fail to reject $H_0$ when it is false (wrong decision – Type II error).
Decision table for hypothesis testing

                        $H_0$ is true       $H_0$ is false
Reject $H_0$            Type I error        Correct decision
Fail to reject $H_0$    Correct decision    Type II error

Hypothesis testing involves the use of sample data to decide whether the
null hypothesis should be rejected or not. The decision to reject the null
hypothesis or not is based on the value of a test statistic. A test statistic
is a quantity whose value is calculated from sample data and whose
distribution is known under the assumption that the null hypothesis is true.
If there is a large difference between what is expected under the null
hypothesis and what is observed in a sample, then the null hypothesis is
rejected; and the result is said to be statistically significant. If, on the
other hand, the difference between what is expected and what is observed
is small, then there is not enough evidence to reject the null hypothesis;
and the result is said to be not statistically significant.
There are two approaches to determining whether to reject the null
hypothesis or not. One involves the determination of the rejection (or
critical) region of the test. The rejection region is the set of values
of the test statistic for which we reject $H_0$. It is obtained by
using a pre-determined level of significance (or size of the test). The level
of significance, denoted by $\alpha$, is the probability of committing a
Type I error. The levels of significance often used in the literature include
$\alpha = 1\%$ (or 0.01), $\alpha = 5\%$ (or 0.05) and $\alpha = 10\%$ (or 0.10).
The second approach involves calculation of the p-value of the test. The
p-value is the probability, computed under the null hypothesis, of observing
a value of the test statistic at least as extreme as the one actually
observed. The null hypothesis is rejected for “small” p-values (usually for
$p \le 0.05$). Generally, the null hypothesis is rejected at the level of
significance $\alpha$ if $p \le \alpha$. For values of $p > \alpha$, there is
not enough evidence to reject the null hypothesis.
We shall limit ourselves to the first approach in this module.
In general, hypothesis testing in statistics involves the following steps:
Step 1: State the hypothesis that is to be questioned ($H_0$).
Step 2: State an alternative hypothesis which will be accepted if the null
hypothesis is rejected ($H_1$).
Step 3: Select the decision rule about when to reject $H_0$ and when to fail
to reject it.
Step 4: Evaluate the appropriate test statistic using sample data from the
population of interest.
Step 5: Carry out your decision.
TESTS CONCERNING A SINGLE POPULATION MEAN
Suppose that we wish to test the null hypothesis that the mean $\mu$ of a
normal population with variance $\sigma^2$ equals a specific value $\mu_0$.
That is, if we wish to test the null hypothesis $H_0: \mu = \mu_0$ against
any of the three alternatives $H_1: \mu < \mu_0$ or $H_1: \mu > \mu_0$ or
$H_1: \mu \ne \mu_0$, then we need to perform one of the tests in the table
below, based on a random sample of size $n$ from this population.

Null hypothesis $H_0: \mu = \mu_0$ against various alternatives

(a)                   (b)                   (c)
$H_0: \mu = \mu_0$    $H_0: \mu = \mu_0$    $H_0: \mu = \mu_0$
$H_1: \mu < \mu_0$    $H_1: \mu > \mu_0$    $H_1: \mu \ne \mu_0$
Critical regions for testing $H_0$ are shown below.

$H_1$                 Reject $H_0$ if
$\mu < \mu_0$         $z < -z_{\alpha}$ (lower-tailed test)
$\mu > \mu_0$         $z > z_{\alpha}$ (upper-tailed test)
$\mu \ne \mu_0$       $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ (two-tailed test)

• $-z_{\alpha}$ is the value of $z$ that leaves an area of $\alpha$ to its
  left, and $z_{\alpha}$ is the value of $z$ that leaves an area of $\alpha$
  to its right.
• $-z_{\alpha/2}$ leaves an area of $\alpha/2$ to its left and $z_{\alpha/2}$
  leaves an area of $\alpha/2$ to its right.
For the test of means from a single population, there are three possible
scenarios to be considered:
1. Tests for means from a single normal population with a known $\sigma$.
2. Tests for means from a single population with unknown $\sigma$ but large
sample size.
3. Tests for means from a single population with unknown $\sigma$ but small
sample size.
Test for means from a single normal population with a known $\sigma$
If the population we are sampling is normal and $\sigma$ is known, then the
test statistic is given by
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Example
The scores of students in an examination have been normally distributed
for some time with mean 200 and standard deviation 25. Currently, some
lecturers think that the performance has changed. To support this claim,
the scores of 100 students were taken and the mean was found to be 212.
(a) Set up the null and alternative hypotheses for the test.
(b) Will you agree with the lecturers’ claim at the 5% significance level?
Solution
(a) The null and alternative hypotheses are:
$H_0: \mu = 200$
$H_1: \mu \ne 200$
(b) Substituting $\bar{x} = 212$, $\mu_0 = 200$, $\sigma = 25$ and $n = 100$
into the formula gives
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{212 - 200}{25/\sqrt{100}} = 4.8$$
Note that $\alpha = 0.05$. Because the test is two-tailed, we find
$z_{\alpha/2} = z_{0.05/2} = z_{0.025}$. From the statistical table,
$z_{0.025} = 1.96$.
Since $z_{cal} = 4.8$ is greater than $z_{0.025} = 1.96$, we reject $H_0$ and
agree with the lecturers that the performance of the students has changed.
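The calculation above can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `one_sample_z` is illustrative and not part of the lecture material, and the critical value is computed rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar, mu0, sigma, n):
    """Return the z statistic for testing H0: mu = mu0 when sigma is known."""
    return (xbar - mu0) / (sigma / sqrt(n))


# Values taken from the examination-scores example.
z_cal = one_sample_z(xbar=212, mu0=200, sigma=25, n=100)

# Two-tailed test: the critical value is z_{alpha/2}.
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z_cal) > z_crit
print(f"z = {z_cal:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```

For a one-tailed test the same sketch would use `NormalDist().inv_cdf(1 - alpha)` and compare without taking the absolute value.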
Test for means from a single population with unknown $\sigma$ but large
sample size
If the population standard deviation $\sigma$ is unknown but the sample size
is large ($n \ge 30$), the test statistic becomes
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},$$
where $s$ is the standard deviation of the sample data.
Example
A random sample of $n = 100$ observations taken from a population
with mean $\mu$ yielded the sample mean $\bar{x} = 18.9$ and sample standard
deviation $s = 12.6$. If the hypotheses are
$H_0: \mu = 16$ and $H_1: \mu \ne 16$,
(a) Calculate the value of the appropriate test statistic for this test;
(b) Hence determine whether $H_0$ should be rejected at the 1% level of
significance.
Solution
(a) In this problem, $\sigma$ is unknown, hence we replace it with $s$.
Therefore, substituting $\bar{x} = 18.9$, $\mu_0 = 16$, $s = 12.6$ and
$n = 100$ in the formula gives
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{18.9 - 16}{12.6/\sqrt{100}} = 2.30$$
(b) Now we have $z_{\alpha/2} = z_{0.01/2} = z_{0.005}$. From the z-tables,
$z_{0.005} = 2.575$. Since $z_{cal} = 2.30$ is neither greater
than 2.575 nor less than $-2.575$, we cannot reject the null
hypothesis at the 1% level of significance.
Test for means from a single population with unknown $\sigma$ but
small sample size
If the population standard deviation $\sigma$ is unknown but the sample
size is small ($n < 30$), the test statistic becomes
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}},$$
which has an approximate t-distribution with $n - 1$ degrees of
freedom.
The critical regions for such tests are shown below.

$H_1$                 Reject $H_0$ if
$\mu < \mu_0$         $t < -t_{\alpha}$
$\mu > \mu_0$         $t > t_{\alpha}$
$\mu \ne \mu_0$       $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
Example
The manufacturer of a new fiberglass tire claims that its average life will
be at least 40,000 miles. To verify this claim, a sample of 12 tires is
selected, with their lifetimes (in 1000s of miles) as follows:

Tire   1     2     3     4     5     6     7     8     9     10    11    12
Life   36.1  40.2  33.8  38.5  42.0  35.8  37.0  41.0  36.8  37.2  33.0  36.0

Test the manufacturer’s claim at the 5% level of significance.
Solution
We wish to test $H_0: \mu = 40$ against $H_1: \mu < 40$.
Since $\sigma$ is unknown and the sample size $n = 12$ is small, we have to
use the t-distribution with $n - 1 = 11$ degrees of freedom. From the data,
$\bar{x} = 37.283$, $\mu_0 = 40$, $s = 2.732$ and $n = 12$.
Therefore,
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{37.283 - 40}{2.732/\sqrt{12}} = -3.445$$
From the one-tailed t-table, $t_{0.05}(11) = 1.796$.
Since $t = -3.445$ is less than $-t_{0.05}(11) = -1.796$, we reject $H_0$
and conclude that the average life of the new fiberglass tire is less than
40,000 miles.
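The summary statistics and the t value in this solution can be reproduced from the raw lifetimes. The sketch below uses only the Python standard library; because the standard library has no t-distribution quantile function, the critical value 1.796 is simply copied from the table value quoted in the solution.

```python
from math import sqrt
from statistics import mean, stdev

# Tire lifetimes in thousands of miles, from the example.
lifetimes = [36.1, 40.2, 33.8, 38.5, 42.0, 35.8,
             37.0, 41.0, 36.8, 37.2, 33.0, 36.0]
mu0 = 40.0

n = len(lifetimes)
xbar = mean(lifetimes)
s = stdev(lifetimes)  # sample standard deviation (divisor n - 1)

t_cal = (xbar - mu0) / (s / sqrt(n))

# Table value t_{0.05} with 11 degrees of freedom, as quoted in the solution.
t_crit = 1.796
reject_h0 = t_cal < -t_crit  # lower-tailed test

print(f"xbar = {xbar:.3f}, s = {s:.3f}, t = {t_cal:.3f}, reject H0: {reject_h0}")
```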
TEST CONCERNING A POPULATION PROPORTION
(LARGE SAMPLE)
Suppose that we wish to test the null hypothesis $H_0: p = p_0$
against any of the alternatives
$H_1: p < p_0$ or $H_1: p > p_0$ or $H_1: p \ne p_0$.
If $n$ is large and $H_0$ is true, then the test statistic is given by
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}},$$
where $\hat{p}$ is the sample proportion of the characteristic of interest.
The critical regions for testing are shown below.

$H_1$                 Reject $H_0$ if
$H_1: p < p_0$        $z < -z_{\alpha}$
$H_1: p > p_0$        $z > z_{\alpha}$
$H_1: p \ne p_0$      $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
Example
An oil company claims that less than 20% of all car owners have not tried its
gasoline. Test the claim at the 0.01 level of significance, if a random check
reveals that 22 out of 200 car owners have not tried the company’s gasoline.
Solution
We wish to test $H_0: p = 0.20$ against $H_1: p < 0.20$.
Here, $p_0 = 0.20$, the number of successes is $x = 22$, and the
sample size is $n = 200$. Thus,
$$\hat{p} = \frac{22}{200} = 0.11.$$
Substituting these into the formula gives:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.11 - 0.20}{\sqrt{0.20(1 - 0.20)/200}} = \frac{-0.09}{0.0283} = -3.1802$$
From the z-table, $z_{0.01} = 2.33$. Therefore the rejection region is
$z < -2.33$. Since $z = -3.1802 < -2.33$, we reject the null
hypothesis. We therefore conclude that less than 20% of all car owners
have not tried the company’s gasoline.
TEST CONCERNING A SINGLE POPULATION
PROPORTION (SMALL SAMPLES)
Suppose that we wish to test the null hypothesis $H_0: p = p_0$ against
$H_1: p < p_0$ at the $100\alpha\%$ significance level. Then the critical
region is $x \le k_{\alpha}$, where
$x$ is the number of observed successes;
$k_{\alpha}$ is the largest integer for which
$$\sum_{k=0}^{k_{\alpha}} B(k; n, p_0) \le \alpha;$$
and $B(k; n, p_0)$ is the probability of observing $k$ successes in $n$
binomial trials when $p = p_0$.

If the alternative hypothesis was $H_1: p > p_0$, the critical region would be
$x \ge k^*_{\alpha}$, where $k^*_{\alpha}$ is the smallest integer for which
$$\sum_{k=k^*_{\alpha}}^{n} B(k; n, p_0) \le \alpha.$$

Similarly, if the alternative hypothesis was $H_1: p \ne p_0$, the
critical region would be $x \le k_{\alpha/2}$ or $x \ge k^*_{\alpha/2}$,
where $k_{\alpha/2}$ is the largest integer for which
$$\sum_{k=0}^{k_{\alpha/2}} B(k; n, p_0) \le \frac{\alpha}{2},$$
and $k^*_{\alpha/2}$ is the smallest integer for which
$$\sum_{k=k^*_{\alpha/2}}^{n} B(k; n, p_0) \le \frac{\alpha}{2}.$$
Example
It is claimed that 40% of patients that attend a certain clinic on
any day are smokers. Suppose that on a particular day, 3 out of
a sample of 13 patients attending the clinic were found to be smokers.
Test the hypothesis $H_0: p = 0.40$ against $H_1: p \ne 0.40$
at the 5% significance level.
Solution
In this problem, $x = 3$ and $n = 13$. Since $\alpha = 0.05$,
$k_{\alpha/2} = k_{0.025}$.
Now, from binomial tables,
$$\sum_{k=0}^{1} B(k; 13, 0.40) = B(0; 13, 0.40) + B(1; 13, 0.40) = 0.0013 + 0.0113 = 0.0126$$
Thus, $\sum_{k=0}^{1} B(k; 13, 0.40) = 0.0126 \le 0.025$, implying that the
largest integer $k_{0.025}$ for which
$\sum_{k=0}^{k_{0.025}} B(k; 13, 0.40) \le 0.025$ is 1.
Similarly, the smallest integer $k^*_{0.025}$ for which
$\sum_{k=k^*_{0.025}}^{13} B(k; 13, 0.40) \le 0.025$ is 10.
That is,
$$\sum_{k=10}^{13} B(k; 13, 0.40) = B(10; 13, 0.40) + \cdots + B(13; 13, 0.40) = 0.0065 + 0.0012 + 0.0001 + 0.0000 = 0.0078$$
To be able to reject the null hypothesis, the number of
successes, $x$, must be either less than or equal to 1, or greater than or
equal to 10.
Since $x = 3$ is neither less than or equal to 1 nor greater than or equal
to 10, we cannot reject the null hypothesis. We therefore conclude
that the data are consistent with the claim that 40% of patients that attend
the clinic on any day are smokers.
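The two critical values can also be found by accumulating binomial probabilities directly instead of reading them from tables. The following is a sketch under that assumption, using only the Python standard library; the helper names `binom_pmf`, `lower_critical` and `upper_critical` are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """B(k; n, p): probability of k successes in n binomial trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_critical(n, p0, tail_area):
    """Largest k with P(X <= k) <= tail_area, or None if no such k exists."""
    cumulative, k_crit = 0.0, None
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p0)
        if cumulative > tail_area:
            break
        k_crit = k
    return k_crit


def upper_critical(n, p0, tail_area):
    """Smallest k with P(X >= k) <= tail_area, or None if no such k exists."""
    cumulative, k_crit = 0.0, None
    for k in range(n, -1, -1):
        cumulative += binom_pmf(k, n, p0)
        if cumulative > tail_area:
            break
        k_crit = k
    return k_crit


# Values from the clinic example: two-sided test at the 5% level.
n, p0, alpha, x = 13, 0.40, 0.05, 3
k_low = lower_critical(n, p0, alpha / 2)
k_high = upper_critical(n, p0, alpha / 2)

reject_h0 = (k_low is not None and x <= k_low) or (
    k_high is not None and x >= k_high
)
print(f"reject H0 if x <= {k_low} or x >= {k_high}; x = {x}, reject: {reject_h0}")
```

For a one-sided alternative, the same helpers would be called with the full `alpha` as the tail area.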
TEST CONCERNING A SINGLE
POPULATION VARIANCE
Suppose that we wish to test the null hypothesis
$H_0: \sigma^2 = \sigma_0^2$ against any of the alternatives
$H_1: \sigma^2 < \sigma_0^2$ or $H_1: \sigma^2 > \sigma_0^2$ or
$H_1: \sigma^2 \ne \sigma_0^2$.
If the population we are sampling is normal, then the test statistic is
given by
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2},$$
where $\chi^2$ has $(n - 1)$ degrees of freedom.
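The critical regions for testing $H_0: \sigma^2 = \sigma_0^2$ are shown below.

$H_1$                         Reject $H_0$ if
$\sigma^2 < \sigma_0^2$       $\chi^2 < \chi^2_{1-\alpha}$
$\sigma^2 > \sigma_0^2$       $\chi^2 > \chi^2_{\alpha}$
$\sigma^2 \ne \sigma_0^2$     $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$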
Example
Given that $n = 25$, $s^2 = 9$ and $\sigma_0^2 = 10$, test
$H_0: \sigma^2 = 10$ against the two-sided alternative
$H_1: \sigma^2 \ne 10$ at the 1% significance level.
Solution
Substituting $n = 25$ and $s^2 = 9$ into the formula gives:
$$\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} = \frac{(25 - 1)(9)}{10} = 21.6$$
The critical region is $\chi^2$ less than $\chi^2_{0.995}$ or greater than
$\chi^2_{0.005}$. From chi-square tables, the value of $\chi^2_{0.995}$
(with 24 df) is 9.886 and that for $\chi^2_{0.005}$ (with 24 df) is 45.558.
Since $\chi^2 = 21.6$ is neither less than 9.886 nor greater than 45.558, we
cannot reject the null hypothesis.
TEST CONCERNING TWO POPULATION MEANS
(INDEPENDENT SAMPLES)
Suppose that we have two independent random samples with means
$\bar{x}_1$ and $\bar{x}_2$ and respective sample sizes $n_1$ and $n_2$, from
normal populations with means $\mu_1$ and $\mu_2$ and variances
$\sigma_1^2$ and $\sigma_2^2$.
We can compare $\mu_1$ and $\mu_2$ by testing $H_0: \mu_1 - \mu_2 = \delta$,
where $\delta$ is a given constant, against any of the alternatives
$H_1: \mu_1 - \mu_2 < \delta$ or $H_1: \mu_1 - \mu_2 > \delta$ or
$H_1: \mu_1 - \mu_2 \ne \delta$
under the following three conditions:
1. Large independent samples with known $\sigma_1^2$ and $\sigma_2^2$.
2. Large independent samples with unknown $\sigma_1^2$ and $\sigma_2^2$.
3. Small independent samples with unknown $\sigma_1^2$ and $\sigma_2^2$.

Large independent samples with $\sigma_1^2$ and $\sigma_2^2$ known
The test statistic for testing two population means from independent samples
with $\sigma_1^2$ and $\sigma_2^2$ known is given by
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$
where $\delta = \mu_1 - \mu_2$ under $H_0$ and $z$ is the usual standard
normal random variable.
Example
A random sample of 100 observations is drawn from a normal population
with variance 16 and the sample mean was found to be 10.8. Another
sample of 64 observations is drawn from a second and independent
normal population with variance 25 and the sample mean was found to
be 9.6. Test, at the 5% significance level, the hypotheses:
$H_0$: the population means are equal
against
$H_1$: the population means are not equal.
Solution
The hypotheses above are equivalent to
$H_0: \mu_1 - \mu_2 = 0$
$H_1: \mu_1 - \mu_2 \ne 0$
We now evaluate the test statistic by substituting
$n_1 = 100$, $\bar{x}_1 = 10.8$, $\sigma_1^2 = 16$, $n_2 = 64$,
$\bar{x}_2 = 9.6$, $\sigma_2^2 = 25$ and $\delta = 0$ into the formula.
This gives
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(10.8 - 9.6) - 0}{\sqrt{\dfrac{16}{100} + \dfrac{25}{64}}} = \frac{1.2}{0.7420} = 1.617$$
From z-tables, $z_{\alpha/2} = z_{0.025} = 1.96$, giving the critical region as
$z < -1.96$ or $z > 1.96$.
Since $z = 1.617$ is neither less than $-1.96$ nor greater than 1.96, we fail
to reject $H_0$. We therefore conclude that there is not enough evidence to
say that the population means are different.

Large independent samples with $\sigma_1^2$ and $\sigma_2^2$ unknown
The test statistic for testing two population means from independent
samples with $\sigma_1^2$ and $\sigma_2^2$ unknown is given by
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$
where $s_1^2$ and $s_2^2$ are the respective sample estimates for
$\sigma_1^2$ and $\sigma_2^2$.
Example
Suppose that we have randomly selected two independent samples from
populations having means $\mu_1$ and $\mu_2$. If $\bar{x}_1 = 25$,
$\bar{x}_2 = 20$, $s_1 = 3$, $s_2 = 4$, $n_1 = 100$ and $n_2 = 100$, test
$H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$
at the 0.05 level of significance.
Solution
Substituting $\bar{x}_1 = 25$, $\bar{x}_2 = 20$, $s_1 = 3$, $s_2 = 4$,
$n_1 = 100$, $n_2 = 100$ and $\delta = 0$ in the formula, we obtain
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(25 - 20) - 0}{\sqrt{\dfrac{9}{100} + \dfrac{16}{100}}} = 10$$
From the z-tables, we have $z_{\alpha} = z_{0.05} = 1.645$. This gives the
critical region as $z > 1.645$.
Since $z = 10$ is greater than 1.645, we reject the null
hypothesis and conclude that $\mu_1$ is greater than $\mu_2$.
Small independent samples with $\sigma_1^2$ and $\sigma_2^2$ unknown
Tests concerning two population means (independent samples)
with $\sigma_1^2$ and $\sigma_2^2$ unknown and sample sizes $n_1 < 30$ and
$n_2 < 30$ can be performed under two different assumptions about the
population variances:
1. $\sigma_1^2$ and $\sigma_2^2$ are unknown and both are assumed to be equal
to a common variance $\sigma^2$.
2. $\sigma_1^2$ and $\sigma_2^2$ are unknown and are assumed to be different
from each other.
Assumption 1
Under Assumption 1, the appropriate test statistic is given by
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},$$
where $s_p$, called the pooled sample standard deviation, is given by
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},$$
with $s_1^2$ and $s_2^2$ the respective variances for samples 1 and 2, and
$t$ has the t-distribution with $n_1 + n_2 - 2$ degrees of freedom.
Example
Two independent random samples of sizes $n_1 = 16$ and $n_2 = 10$ from
normal populations with unknown standard deviations have means
$\bar{x}_1 = 23.4$ and $\bar{x}_2 = 18.2$, with corresponding standard
deviations $s_1 = 3.5$ and $s_2 = 4.8$.
Test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$ at the 10%
significance level, assuming that the population variances are equal.
Solution
We first evaluate $s_p$ by substituting $n_1 = 16$, $s_1 = 3.5$, $n_2 = 10$
and $s_2 = 4.8$ into the formula to give
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(16 - 1)(3.5)^2 + (10 - 1)(4.8)^2}{16 + 10 - 2}} = 4.04$$
Now substituting
$s_p = 4.04$, $n_1 = 16$, $\bar{x}_1 = 23.4$, $n_2 = 10$, $\bar{x}_2 = 18.2$
and $\delta = 0$ into the test statistic gives
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(23.4 - 18.2) - 0}{4.04\sqrt{\dfrac{1}{16} + \dfrac{1}{10}}} = 3.193$$
From the t-tables, $t_{0.10}$ with 24 degrees of freedom is 1.318. Thus the
critical region is $t > 1.318$. Since $t = 3.193 > 1.318$, we reject $H_0$
and conclude that $\mu_1 > \mu_2$.
Under Assumption 2, the appropriate test statistic is given by
$$t^* = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$
where $t^*$ has approximately the t-distribution with $v$ degrees of freedom
given by
$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}.$$
If $v$ is not a whole number, then we have to round it to the nearest whole
number.
Example
Two independent random samples of sizes $n_1 = 16$ and $n_2 = 10$ from
normal populations with unknown standard deviations have means
$\bar{x}_1 = 23.4$ and $\bar{x}_2 = 18.2$, with corresponding standard
deviations $s_1 = 3.5$ and $s_2 = 4.8$.
Test $H_0: \mu_1 - \mu_2 = 0$ against $H_1: \mu_1 - \mu_2 > 0$ at the 10%
significance level, assuming that the population variances are different.
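Solution
Evaluate $t^*$ by substituting $n_1 = 16$, $s_1 = 3.5$, $\bar{x}_1 = 23.4$,
$n_2 = 10$, $s_2 = 4.8$, $\bar{x}_2 = 18.2$ and $\delta = 0$ into the test
statistic to give
$$t^* = \frac{(\bar{x}_1 - \bar{x}_2) - \delta}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(23.4 - 18.2) - 0}{\sqrt{\dfrac{3.5^2}{16} + \dfrac{4.8^2}{10}}} = 2.9680$$
We evaluate $v$ by substituting $n_1 = 16$, $s_1 = 3.5$, $n_2 = 10$ and
$s_2 = 4.8$ to obtain
$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}} = \frac{\left(\dfrac{3.5^2}{16} + \dfrac{4.8^2}{10}\right)^2}{\dfrac{\left(3.5^2/16\right)^2}{16 - 1} + \dfrac{\left(4.8^2/10\right)^2}{10 - 1}} = 14.98 \approx 15$$
From the t-tables, $t_{0.10}$ with 15 degrees of freedom is 1.341. Thus the
critical region is $t^* > 1.341$. Since $t^* = 2.9680 > 1.341$, we reject
$H_0$ and conclude that $\mu_1 > \mu_2$.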
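The two small-sample statistics differ only in how the standard error and the degrees of freedom are formed, which a short script makes easy to compare. This is a sketch using only the Python standard library; the function names `pooled_t` and `unequal_variance_t` are illustrative, and critical values still have to be read from t-tables.

```python
from math import sqrt


def pooled_t(x1, s1, n1, x2, s2, n2, delta=0.0):
    """t statistic and degrees of freedom under Assumption 1 (equal variances)."""
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    t = (x1 - x2 - delta) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def unequal_variance_t(x1, s1, n1, x2, s2, n2, delta=0.0):
    """t* statistic and approximate df under Assumption 2 (unequal variances)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (x1 - x2 - delta) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Summary statistics shared by the two examples.
args = dict(x1=23.4, s1=3.5, n1=16, x2=18.2, s2=4.8, n2=10)

t_pooled, df_pooled = pooled_t(**args)
t_star, df_star = unequal_variance_t(**args)

print(f"Assumption 1: t  = {t_pooled:.3f} with {df_pooled} df")
print(f"Assumption 2: t* = {t_star:.3f} with about {df_star:.2f} df")
```

The pooled value printed here uses the unrounded $s_p$, so it can differ in the third decimal place from a hand calculation that rounds $s_p$ first.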
TEST CONCERNING TWO POPULATION MEANS
(PAIRED DATA)
Suppose that $x_1, x_2, \ldots, x_n$ are the observations on $n$ individuals
before an experiment, and $y_1, y_2, \ldots, y_n$ are the corresponding
observations after the experiment. Then the pairs
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ constitute a paired data set.
Consider the test of $H_0: \mu_2 - \mu_1 = \delta$ against the various
alternatives, where $\mu_1$ and $\mu_2$ are the population means before and
after the experiment.

(a)                           (b)                           (c)
$H_0: \mu_2 - \mu_1 = \delta$   $H_0: \mu_2 - \mu_1 = \delta$   $H_0: \mu_2 - \mu_1 = \delta$
$H_1: \mu_2 - \mu_1 < \delta$   $H_1: \mu_2 - \mu_1 > \delta$   $H_1: \mu_2 - \mu_1 \ne \delta$

By calculating the differences $d_i = y_i - x_i$ ($i = 1, 2, \ldots, n$)
between corresponding observations, the test reduces to testing
$H_0: \mu_d = \delta$ against various alternatives

(a)                     (b)                     (c)
$H_0: \mu_d = \delta$   $H_0: \mu_d = \delta$   $H_0: \mu_d = \delta$
$H_1: \mu_d < \delta$   $H_1: \mu_d > \delta$   $H_1: \mu_d \ne \delta$

Let $\mu_d$ be the mean of the normally distributed population of paired
differences, and let $\bar{d}$ and $s_d$ be the mean and standard deviation
of a sample of $n$ paired differences that have been selected.
Then the appropriate test statistic for conducting any of the tests in the
table above is given by
$$t = \frac{\bar{d} - \delta}{s_d/\sqrt{n}},$$
where $t$ has the t-distribution with $n - 1$ degrees of freedom.
Example
The data below are the weights before and after ten boxers were fed with
a weight reducing diet:

i              1   2   3   4   5   6   7   8   9   10
Before, $x_i$  69  50  61  72  78  66  75  89  86  54
After, $y_i$   66  49  63  70  71  65  75  88  87  51

Test the null hypothesis $H_0: \mu_d = 0$ against the alternative
hypothesis $H_1: \mu_d < 0$ at the 5% level of significance.
Solution
i              1   2   3   4   5   6   7   8   9   10
$x_i$          69  50  61  72  78  66  75  89  86  54
$y_i$          66  49  63  70  71  65  75  88  87  51
$y_i - x_i$    -3  -1   2  -2  -7  -1   0  -1   1  -3

Considering the differences as one sample data set, we find that
$n = 10$, $\bar{d} = -1.5$, $s_d = 2.5$ and $\delta = 0$.
Substituting these into the test statistic gives
$$t = \frac{\bar{d} - \delta}{s_d/\sqrt{n}} = \frac{-1.5 - 0}{2.5/\sqrt{10}} = -1.8973$$
From the t-tables, $t_{0.05}$ with 9 degrees of freedom is 1.833. Thus the
critical region is $t < -1.833$. Since $t = -1.897 < -1.833$, we reject
$H_0$ and conclude that $\mu_2 < \mu_1$.
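The paired test is a one-sample t test applied to the differences, which the following sketch makes explicit. It uses only the Python standard library and copies the critical value 1.833 from t-tables, since the standard library has no t quantile function. Because it keeps $s_d$ unrounded, the printed statistic differs slightly in the third decimal place from a hand calculation that rounds $s_d$ to 2.5.

```python
from math import sqrt
from statistics import mean, stdev

# Weights of the ten boxers before and after the diet, from the example.
before = [69, 50, 61, 72, 78, 66, 75, 89, 86, 54]
after = [66, 49, 63, 70, 71, 65, 75, 88, 87, 51]

# Differences d_i = y_i - x_i (after minus before).
diffs = [y - x for x, y in zip(before, after)]

n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)  # sample standard deviation of the differences
delta = 0

t_cal = (d_bar - delta) / (s_d / sqrt(n))

# Table value t_{0.05} with n - 1 = 9 degrees of freedom.
t_crit = 1.833
reject_h0 = t_cal < -t_crit  # lower-tailed test, H1: mu_d < 0

print(f"d_bar = {d_bar}, s_d = {s_d:.4f}, t = {t_cal:.4f}, reject H0: {reject_h0}")
```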
TEST CONCERNING TWO POPULATION PROPORTIONS
Suppose that we have two independent random samples of sizes $n_1$ and $n_2$
with sample proportions $\hat{p}_1 = \dfrac{x_1}{n_1}$ and
$\hat{p}_2 = \dfrac{x_2}{n_2}$, drawn from two populations with proportions
$p_1$ and $p_2$.
We can compare $p_1$ and $p_2$ by testing $H_0: p_1 - p_2 = \delta$ against
any of the alternatives
$H_1: p_1 - p_2 < \delta$, or $H_1: p_1 - p_2 > \delta$, or
$H_1: p_1 - p_2 \ne \delta$
under two conditions if the sample sizes are large:
1. $\delta = 0$
2. $\delta \ne 0$
Condition 1
If $\delta = 0$, the appropriate test statistic is given by
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},$$
where $\hat{p}$, called the combined sample proportion, is given by
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}.$$
Condition 2
If $\delta \ne 0$, the appropriate test statistic is given by
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1 - 1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2 - 1}}}.$$
Example (a)
If $x_1 = 18$, $x_2 = 15$, $n_1 = 35$ and $n_2 = 42$, test the null hypothesis
$H_0: p_1 - p_2 = 0$
against
$H_1: p_1 - p_2 > 0$
at the 5% significance level.
Solution
Since $\delta = 0$, we first find the combined sample proportion as follows:
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{18 + 15}{35 + 42} = 0.4286$$
Therefore, substituting
$\hat{p}_1 = \dfrac{18}{35} = 0.5143$, $\hat{p}_2 = \dfrac{15}{42} = 0.3571$,
$\hat{p} = 0.4286$ and $\delta = 0$
into the test statistic gives:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(0.5143 - 0.3571) - 0}{\sqrt{0.4286(0.5714)\left(\dfrac{1}{35} + \dfrac{1}{42}\right)}} = 1.3875$$
From the z-tables, $z_{0.05} = 1.645$, resulting in a critical region of
$z > 1.645$.
Since $z = 1.3875 < 1.645$, we fail to reject $H_0$ and conclude that there
is not enough evidence that $p_1$ exceeds $p_2$.
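The Condition 1 calculation can be sketched in a few lines of Python. Only the standard library is used; the function name `pooled_proportion_z` is illustrative, and the critical value 1.645 is the z-table value quoted in the solution.

```python
from math import sqrt


def pooled_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 - p2 = 0 using the combined sample proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / standard_error


# Counts from Example (a).
z_cal = pooled_proportion_z(x1=18, n1=35, x2=15, n2=42)

z_crit = 1.645  # z_{0.05}, upper-tailed test
reject_h0 = z_cal > z_crit

print(f"z = {z_cal:.4f}, reject H0: {reject_h0}")
```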
Example (b)
If $x_1 = 18$, $x_2 = 15$, $n_1 = 35$ and $n_2 = 42$, test the null hypothesis
$H_0: p_1 - p_2 = 0.15$
against
$H_1: p_1 - p_2 > 0.15$
at the 5% significance level. Interpret your result.
Solution
Since $\delta \ne 0$, we substitute
$\hat{p}_1 = \dfrac{18}{35} = 0.5143$, $\hat{p}_2 = \dfrac{15}{42} = 0.3571$,
$n_1 = 35$, $n_2 = 42$ and $\delta = 0.15$ into the
test statistic to give:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - \delta}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1 - 1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2 - 1}}} = \frac{(0.5143 - 0.3571) - 0.15}{\sqrt{\dfrac{0.5143(0.4857)}{35 - 1} + \dfrac{0.3571(0.6429)}{42 - 1}}} = \frac{0.0072}{0.1138} = 0.063$$
From the z-tables, $z_{0.05} = 1.645$, resulting in a critical region of
$z > 1.645$.
Since $z = 0.063 < 1.645$, we fail to reject $H_0$. That is, there is not
enough evidence to conclude that $p_1$ exceeds $p_2$ by more than 15%.
TEST CONCERNING TWO POPULATION VARIANCES
Suppose that we have two independent random samples of sizes $n_1$ and $n_2$
from populations with variances $\sigma_1^2$ and $\sigma_2^2$.
Then we can compare $\sigma_1^2$ and $\sigma_2^2$ by testing the null
hypothesis
$H_0: \sigma_1^2 = \sigma_2^2$
against any of the alternatives
$H_1: \sigma_1^2 < \sigma_2^2$, or $H_1: \sigma_1^2 > \sigma_2^2$, or
$H_1: \sigma_1^2 \ne \sigma_2^2$
The appropriate test statistic is given by
$$F = \frac{s_1^2}{s_2^2},$$
where $s_1^2$ and $s_2^2$ are the sample variances.
The F-statistic has the F-distribution with
$n_1 - 1$ as numerator degrees of freedom and
$n_2 - 1$ as denominator degrees of freedom.
The critical regions for testing $H_0: \sigma_1^2 = \sigma_2^2$ are as shown:

$H_1$                               Reject $H_0$ if
$H_1: \sigma_1^2 < \sigma_2^2$      $F < F_{1-\alpha}(n_1 - 1, n_2 - 1)$
$H_1: \sigma_1^2 > \sigma_2^2$      $F > F_{\alpha}(n_1 - 1, n_2 - 1)$
$H_1: \sigma_1^2 \ne \sigma_2^2$    $F < F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$ or
                                    $F > F_{\alpha/2}(n_1 - 1, n_2 - 1)$
You may find the following identity useful.
$$F_{1-\alpha}(n_1 - 1, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1, n_1 - 1)}$$
Example
Suppose that observations from two independent random samples from
two normal populations yielded the following result:
$n_1 = 11$, $s_1^2 = 18.4$, $n_2 = 16$ and $s_2^2 = 13.5$.
Test the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ against
$H_1: \sigma_1^2 \ne \sigma_2^2$ at the 10% significance level.
Solution
Substituting $s_1^2 = 18.4$ and $s_2^2 = 13.5$ into the test statistic gives:
$$F = \frac{s_1^2}{s_2^2} = \frac{18.4}{13.5} = 1.363.$$
The test is two-sided, so we need to evaluate
$F_{1-\alpha/2}(n_1 - 1, n_2 - 1)$ and $F_{\alpha/2}(n_1 - 1, n_2 - 1)$.
From the F-distribution table, $F_{0.05}(10, 15) = 2.54$.
Using the identity
$$F_{1-\alpha}(n_1 - 1, n_2 - 1) = \frac{1}{F_{\alpha}(n_2 - 1, n_1 - 1)}$$
we have
$$F_{0.95}(10, 15) = \frac{1}{F_{0.05}(15, 10)} = \frac{1}{2.85} = 0.35$$
Therefore the critical region is $F < 0.35$ or $F > 2.54$.
Since $F = 1.36$ is neither less than 0.35 nor greater than 2.54, we cannot
reject the null hypothesis. We therefore conclude that there is not enough
evidence that $\sigma_1^2$ and $\sigma_2^2$ are different.
TEST ON CATEGORICAL DATA
Some common tests that can be considered appropriate for
qualitative data include the following:
1. The multinomial distribution
2. Goodness-of-fit tests (when categorical probabilities are
completely defined)
3. Goodness-of-fit tests for the Poisson, binomial and normal
distributions
4. Goodness-of-fit tests for independence
THE MULTINOMIAL DISTRIBUTION
The multinomial distribution is an extension of the binomial
distribution. Its properties are as follows:
1. The experiment consists of $n$ identical trials.
2. There are $k$ possible outcomes associated with
each trial.
3. The probabilities of the $k$ outcomes, denoted by $p_1, p_2, \ldots, p_k$,
remain constant from trial to trial; and $p_1 + p_2 + \cdots + p_k = 1$.
4. The $n$ trials are independent of each other.
5. The random variables of interest are the counts
in each of the $k$ cells.
Example
The table below shows the market share for different brands of
television.

Brand of TV    Market share
LG             20%
Samsung        30%
Panasonic      35%
Sony           15%

Can this be said to follow a multinomial distribution?
Solution
It is clear that the brands of television are independent of each
other. The TV brands LG, Samsung, Panasonic and Sony have
probabilities $p_1 = 0.20$, $p_2 = 0.30$, $p_3 = 0.35$
and $p_4 = 0.15$, respectively. So we have
$0.20 + 0.30 + 0.35 + 0.15 = 1$
and therefore, the distribution of brands of television sets follows a
multinomial distribution.
GOODNESS OF FIT TESTS (WHEN CATEGORICAL
PROBABILITIES ARE COMPLETELY DEFINED)
Suppose we wish to test the null hypothesis
$H_0: p_1 = p_{10},\ p_2 = p_{20},\ \ldots,\ p_k = p_{k0}$
(each multinomial probability equals its hypothesized value) against
$H_1$: at least one of the multinomial probabilities is not equal to its
hypothesized value.
Then the test statistic is given by
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i},$$
where
$k$ denotes the number of classes
$o_i$ denotes the number of observations in class $i$
$e_i$ denotes the expected number of observations in class $i$ ($e_i = np_i$)
$p_i$ denotes the probability of an observation falling in class $i$
$n$ denotes the sample size
The test statistic has an approximate chi-square distribution with
$k - 1$ degrees of freedom.
NOTE:
The approximation is good if the sample size is large enough so
that, for every cell, the expected cell frequency is 5 or more.
Example
The head teacher of a primary school is interested in knowing whether
there exist colour preferences among the pupils in his school. A sample of
100 pupils was drawn from the school and shown identically shaped
objects, coloured red, blue, yellow, green or pink. When each child was
asked to pick the most preferred colour, 30 picked red, 18 blue, 12 yellow,
20 green and 20 pink. Test, at the 5% significance level, the hypotheses:
$H_0$: colour preferences do not exist
against
$H_1$: colour preferences do exist
Solution
If there are no preferences, then the probability of choosing
any colour is the same. That is, $p_i = \dfrac{1}{5} = 0.20$
($i = 1, 2, \ldots, 5$).
Thus, we are testing the hypotheses
$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = 0.20$
against
$H_1$: at least one of the $p_i$'s is not 0.20
Colour             Red   Blue   Yellow   Green   Pink
Observed, $o_i$    30    18     12       20      20
Expected, $e_i$    20    20     20       20      20
$o_i - e_i$        10    -2     -8       0       0

Thus
$$\chi^2 = \sum_{i=1}^{5} \frac{(o_i - e_i)^2}{e_i} = \frac{10^2}{20} + \frac{(-2)^2}{20} + \frac{(-8)^2}{20} + \frac{0^2}{20} + \frac{0^2}{20} = 8.4$$
At the 5% significance level, from the chi-square tables,
$\chi^2_{0.05}$ at $df = 5 - 1 = 4$ is 9.49. Therefore, the critical region
is $\chi^2 > 9.49$.
Since $\chi^2 = 8.4 < \chi^2_{0.05}(4) = 9.49$, we cannot
reject $H_0$. That is, we do not have enough evidence against the
null hypothesis. Therefore we conclude that there is no evidence of
colour preferences among the pupils.
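The chi-square statistic for this example is a simple sum over the five colour classes, as the sketch below shows. It uses only the Python standard library; the function name `chi_square_statistic` is illustrative, and the critical value 9.49 is the table value quoted in the solution.

```python
def chi_square_statistic(observed, expected):
    """Sum of (o_i - e_i)^2 / e_i over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed counts for red, blue, yellow, green and pink.
observed = [30, 18, 12, 20, 20]
n = sum(observed)

# Under H0 every colour has probability 0.20, so e_i = n * p_i.
probabilities = [0.20] * 5
expected = [n * p for p in probabilities]

chi2 = chi_square_statistic(observed, expected)

chi2_crit = 9.49  # table value for alpha = 0.05 with k - 1 = 4 df
reject_h0 = chi2 > chi2_crit

print(f"chi-square = {chi2:.1f}, reject H0: {reject_h0}")
```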
GOODNESS-OF-FIT TESTS FOR THE POISSON,
BINOMIAL AND NORMAL DISTRIBUTION
The goodness-of-fit tests can be applied to test sample data set as
coming from a population having a Poisson, or binomial or
normal distribution. The test statistic is given by
k oi  ei 
2
 
2
,
i 1 ei
107
where
k denotes the number of classes
oi denotes the number of observations in class i
ei denotes the number of expected observations
in class i ei  npi 
pi denotes the probability of observing an observation in i
n denotes the sample size
108
The test statistic has an approximate chi-square distribution with
$k - m - 1$ degrees of freedom, where $m$ is the number of
independent parameters that have to be estimated from the
sample.
NB: The approximation is good if the sample size is large
enough so that, for every cell, the expected cell frequency is 5 or
more.
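As an illustration of the degrees-of-freedom rule and the expected-frequency condition, a small helper along the following lines could be used. This is only a sketch: the function name `gof_summary` and its arguments are hypothetical, and it is shown on the colour-preference data (30, 18, 12, 20, 20 with $p_i = 0.20$), where no parameter is estimated so $m = 0$.

```python
def gof_summary(observed, probs, m=0, min_expected=5.0):
    """Return (chi-square statistic, degrees of freedom, sparse class indices).

    observed     : observed class frequencies o_i
    probs        : class probabilities p_i under H0
    m            : number of parameters estimated from the sample
    min_expected : threshold for the "expected frequency is 5 or more" rule
    """
    if len(observed) != len(probs):
        raise ValueError("observed and probs must have the same length")
    n = sum(observed)
    expected = [n * p for p in probs]                    # e_i = n * p_i
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - m - 1                           # k - m - 1
    sparse = [i for i, e in enumerate(expected) if e < min_expected]
    return stat, df, sparse


stat, df, sparse = gof_summary([30, 18, 12, 20, 20], [0.2] * 5, m=0)
print(round(stat, 2), df, sparse)
```

An empty `sparse` list means every class meets the condition; a non-empty list points to the classes that may need to be merged with a neighbour before the test is applied.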
109
Example 1
The weekly number of power failures reported in a certain district
in 50 weeks is recorded as follows
Number of failures Number of Weeks
0 6
1 8
2 13
3 11
4 7
5 4
6 1
110
Determine whether the weekly number of power failures in the district
follows a Poisson distribution at the 5% significance level.
Solution
We wish to test the hypothesis
$H_0$: the weekly number of power failures follows a Poisson distribution
against
$H_1$: the weekly number of power failures does not follow a Poisson
distribution.
111
We first calculate the expected frequencies using Poisson
probabilities given by
\[
p_i = \frac{\lambda^i e^{-\lambda}}{i!}, \quad i = 0, 1, 2, \ldots, 6,
\]
where $\lambda$ is the mean of the distribution.
x        f      fx
0        6      0
1        8      8
2        13     26
3        11     33
4        7      28
5        4      20
6        1      6
Total    50     121
112
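Therefore, the sample mean $\bar{x}$, which is used to estimate $\lambda$, is given by
\[
\bar{x} = \frac{\sum fx}{\sum f} = \frac{121}{50} = 2.42 \approx 2.4
\]
We can calculate the various probabilities as follows
\[
p_0 = \frac{(2.4)^0 e^{-2.4}}{0!} = 0.091
\]
\[
p_1 = \frac{(2.4)^1 e^{-2.4}}{1!} = 0.218
\]
The rest is summarized in the table below.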
113
Number of    Number of    Poisson          Expected
Failures     Weeks        Probabilities    frequencies
i            n_i          p_i              e_i = np_i
0            6            0.091            4.55
1            8            0.218            10.90
2            13           0.261            13.05
3            11           0.209            10.45
4            7            0.125            6.25
5            4            0.060            3.00
6            1            0.024            1.20
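The probabilities and expected frequencies in this table can be reproduced programmatically. The sketch below is an illustration only (the helper `poisson_pmf` and the variable names are not from the notes); it follows the notes in rounding the mean to 2.4 and the probabilities to three decimal places.

```python
import math

weeks = [6, 8, 13, 11, 7, 4, 1]      # weeks with 0, 1, ..., 6 failures
n = sum(weeks)                       # 50 weeks in total

# Sample mean: sum(f * x) / sum(f) = 121 / 50
mean_exact = sum(x * f for x, f in enumerate(weeks)) / n
lam = 2.4                            # rounded mean, as used in the notes


def poisson_pmf(i, lam):
    """Poisson probability of observing exactly i events with mean lam."""
    return lam ** i * math.exp(-lam) / math.factorial(i)


# Round as in the notes: probabilities to 3 d.p., frequencies to 2 d.p.
probs = [round(poisson_pmf(i, lam), 3) for i in range(len(weeks))]
expected = [round(n * p, 2) for p in probs]

for i, (f, p, e) in enumerate(zip(weeks, probs, expected)):
    print(i, f, p, e)
```

Note that the seven probabilities sum to slightly less than 1, because the Poisson distribution also gives a small probability to more than 6 failures.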
114
From the table above, three of the seven expected frequencies (43%) are
less than 5. To satisfy the condition, we merge the last three classes as
shown in the table below. After merging, only the first class (expected
frequency 4.55) remains slightly below 5.
Number of    Number of    Poisson          Expected
Failures     Weeks        Probabilities    frequencies
i            n_i          p_i              e_i = np_i
0            6            0.091            4.55
1            8            0.218            10.90
2            13           0.261            13.05
3            11           0.209            10.45
4 to 6       12           0.209            10.45
115
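The test statistic becomes
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i} \\
       &= \frac{(6 - 4.55)^2}{4.55} + \frac{(8 - 10.90)^2}{10.90} + \cdots + \frac{(12 - 10.45)^2}{10.45} \\
       &= 0.4621 + 0.7716 + \cdots + 0.2299 \\
       &= 1.49
\end{aligned}
\]
Here $k = 5$ classes remain after merging and $m = 1$, because the mean was
estimated from the sample. The critical region for the test is
\[
\chi^2 > \chi^2_{\alpha}(k - m - 1) = \chi^2_{0.05}(5 - 1 - 1) = \chi^2_{0.05}(3) = 7.815
\]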
116
Since 1.49 is less than 7.815, we fail to reject the null
hypothesis and conclude that the data are consistent with the weekly
number of power failures following a Poisson distribution.
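The merging step and the resulting statistic can be sketched in Python as follows. This is an illustrative check, not part of the notes: the helper `pool_tail` is a hypothetical name, the expected frequencies are the rounded values from the table, and the critical value 7.815 is taken from the chi-square tables.

```python
observed = [6, 8, 13, 11, 7, 4, 1]
expected = [4.55, 10.90, 13.05, 10.45, 6.25, 3.00, 1.20]


def pool_tail(values, start):
    """Merge values[start:] into one final class, keeping the earlier ones."""
    return values[:start] + [sum(values[start:])]


# Merge the classes for 4, 5 and 6 failures into a single class.
obs_pooled = pool_tail(observed, 4)      # [6, 8, 13, 11, 12]
exp_pooled = pool_tail(expected, 4)      # last class is 6.25 + 3.00 + 1.20

chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))

m = 1                                    # one parameter (the mean) estimated
df = len(obs_pooled) - m - 1             # k - m - 1 = 5 - 1 - 1 = 3
CRITICAL_5PCT_DF3 = 7.815                # table value for alpha = 0.05, df = 3
reject_h0 = chi_sq > CRITICAL_5PCT_DF3

print(round(chi_sq, 2), df, reject_h0)
```

Merging both the observed and the expected frequencies with the same helper keeps the two lists aligned, which is easy to get wrong when done by hand.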
117
Example 2
Four identical six-sided dice, each with faces marked 1 to 6, are rolled
200 times. At each rolling, a record is made of the number of dice
whose scores on the uppermost face are even. The results are shown below.
Number of even scores, x_i    0     1     2     3     4
Frequency, f_i                10    41    70    57    22
Test, at the 5% level of significance, that the number of even scores
follows a binomial distribution with $n = 4$ and $p = 0.5$.
118
Solution
We wish to test the hypothesis:
$H_0$: Number of even scores is $\sim B(4, 0.5)$
against
$H_1$: Number of even scores is not $\sim B(4, 0.5)$
We have
\[
p(x) = {}^{n}C_{x}\, p^{x} (1 - p)^{n - x}
\]
Thus,
119
\[
\begin{aligned}
p(0) &= {}^{4}C_{0}\,(0.5)^0 (0.5)^4 = 0.0625 \\
p(1) &= {}^{4}C_{1}\,(0.5)^1 (0.5)^3 = 0.2500 \\
p(2) &= {}^{4}C_{2}\,(0.5)^2 (0.5)^2 = 0.3750 \\
p(3) &= {}^{4}C_{3}\,(0.5)^3 (0.5)^1 = 0.2500 \\
p(4) &= {}^{4}C_{4}\,(0.5)^4 (0.5)^0 = 0.0625
\end{aligned}
\]
We can now calculate the expected cell frequencies and
summarize them in a table as shown.
120
i    o_i    p_i       e_i = np_i (here n = 200 rolls)
0    10     0.0625    12.50
1    41     0.2500    50.00
2    70     0.3750    75.00
3    57     0.2500    50.00
4    22     0.0625    12.50
The test statistic becomes
121
k oi  ei 
2
 
2
i 1 ei
10  12.50 2
22  12.50 2
 
12.50 12.50
 0.500  1.620    7.220
 10.653
The critical region for the test is

   k  m  1    
2 2 2 2
0.05 4  5  0  1  9.488
B conclude
Since 10.653 is greater than 9.488, we reject the null hypothesis and 4,0.5.
that the number of even scores is not approximately
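The binomial probabilities, expected frequencies and test statistic of this example can be verified with a few lines of Python. The sketch below is illustrative only; the variable names are assumptions, and the critical value 9.488 is copied from the chi-square tables.

```python
import math

freq = [10, 41, 70, 57, 22]      # observed counts for 0..4 even scores
dice, p = 4, 0.5                 # B(4, 0.5) specified by H0
rolls = sum(freq)                # 200 rollings

# Binomial probabilities: C(4, x) * p^x * (1 - p)^(4 - x)
probs = [math.comb(dice, x) * p ** x * (1 - p) ** (dice - x)
         for x in range(dice + 1)]
expected = [rolls * q for q in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(freq, expected))

df = len(freq) - 0 - 1           # k - m - 1 with m = 0 (p is given, not estimated)
CRITICAL_5PCT_DF4 = 9.488        # table value for alpha = 0.05, df = 4
reject_h0 = chi_sq > CRITICAL_5PCT_DF4

print(probs, expected, round(chi_sq, 3), reject_h0)
```

Most of the statistic (7.22 out of about 10.65) comes from the last class, where 22 rollings with four even scores were observed against 12.5 expected.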
122
Example 3
Three hundred marbled ducks in Quack town are weighed and the results are shown
in the following table.
Mass (g)           Frequency
m < 470            10
470 ≤ m < 520      158
520 ≤ m < 570      123
m ≥ 570            9
Set up the hypotheses and test, at the 10% significance level, whether the mass of
marbled ducks can be modelled by a normal distribution with mean 520 g and
standard deviation 30 g.
123
Solution
$H_0$: Mass of the marbled ducks can be modelled by the normal distribution with
mean 520 and standard deviation 30.
against
$H_1$: Mass of the marbled ducks cannot be modelled by the normal distribution
with mean 520 and standard deviation 30.
Now we calculate the required probabilities as follows:
\[
\begin{aligned}
\Pr(M < 470) &= \Pr\left(\frac{M - 520}{30} < \frac{470 - 520}{30}\right) \\
             &= \Pr(Z < -1.67) \\
             &= 0.5000 - 0.4525 \\
             &= 0.0475
\end{aligned}
\]
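124
\[
\begin{aligned}
\Pr(470 \le M < 520) &= \Pr\left(\frac{470 - 520}{30} \le Z < \frac{520 - 520}{30}\right) \\
                     &= \Pr(-1.67 \le Z < 0) \\
                     &= 0.4525
\end{aligned}
\]
\[
\begin{aligned}
\Pr(520 \le M < 570) &= \Pr\left(\frac{520 - 520}{30} \le Z < \frac{570 - 520}{30}\right) \\
                     &= \Pr(0 \le Z < 1.67) \\
                     &= 0.4525
\end{aligned}
\]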
125
\[
\begin{aligned}
\Pr(M \ge 570) &= \Pr\left(\frac{M - 520}{30} \ge \frac{570 - 520}{30}\right) \\
               &= \Pr(Z \ge 1.67) \\
               &= 0.5000 - 0.4525 \\
               &= 0.0475
\end{aligned}
\]
Mass (g)           Frequency    Probability
m < 470            10           0.0475
470 ≤ m < 520      158          0.4525
520 ≤ m < 570      123          0.4525
m ≥ 570            9            0.0475
126
Calculate the expected frequencies using $e_i = np_i$, with $n = 300$.
Mass (g)           Frequency    Probability    e_i = np_i
m < 470            10           0.0475         14.25
470 ≤ m < 520      158          0.4525         135.75
520 ≤ m < 570      123          0.4525         135.75
m ≥ 570            9            0.0475         14.25
127
We summarize the rest of the calculations as follows:
o_i     e_i       o_i - e_i    (o_i - e_i)^2    (o_i - e_i)^2 / e_i
10      14.25     -4.25        18.06            1.268
158     135.75    22.25        495.06           3.647
123     135.75    -12.75       162.56           1.198
9       14.25     -5.25        27.56            1.934
Total                                           8.047
Thus,
$\chi^2 = 8.047$
128
From tables, $\chi^2_{\alpha}(k - m - 1) = \chi^2_{0.10}(4 - 0 - 1) = \chi^2_{0.10}(3) = 6.251$.
Since $\chi^2_{cal} = 8.047$ is greater than $\chi^2_{0.10}(3) = 6.251$, we reject
$H_0$ and conclude that the mass of the marbled
ducks cannot be modelled by the normal distribution with mean 520 and standard
deviation 30.
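The class probabilities for this example can also be obtained directly from the normal distribution function instead of tables. The following Python sketch is an illustration under that assumption, using `statistics.NormalDist` from the standard library; the variable names are not from the notes. Because it uses the exact value $z = 50/30$ rather than the table value 1.67, its statistic differs slightly from the hand calculation, but the decision is the same.

```python
from statistics import NormalDist

model = NormalDist(mu=520, sigma=30)   # distribution specified by H0
cuts = [470, 520, 570]                 # class boundaries in grams
observed = [10, 158, 123, 9]           # m<470, 470-520, 520-570, m>=570
n = sum(observed)                      # 300 ducks

# Cumulative probabilities at the boundaries, padded with 0 and 1,
# then differenced to give the probability of each class.
cdf = [0.0] + [model.cdf(c) for c in cuts] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 0 - 1             # k - m - 1; mean and s.d. are given
CRITICAL_10PCT_DF3 = 6.251             # table value for alpha = 0.10, df = 3
reject_h0 = chi_sq > CRITICAL_10PCT_DF3

print([round(p, 4) for p in probs], round(chi_sq, 3), reject_h0)
```

Differencing the cumulative probabilities guarantees that the class probabilities sum to 1, so the expected frequencies sum to the sample size.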
129
GOODNESS-OF-FIT TESTS FOR HOMOGENEITY
This test is used to determine whether frequency counts for a
given variable are distributed identically across different
populations. That is, a single categorical variable from two or
more populations is studied.
130
This approach is considered appropriate when the following
conditions hold.
1. The method for selecting a sample from each population is
simple random sampling.
2. The variable under study is categorical.
3. The expected frequency for each cell should be at least 5.
131
The test statistic is given by
\[
\chi^2 = \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{\left(n_{ij} - E(n_{ij})\right)^2}{E(n_{ij})},
\]
where
$p$ is the number of populations and $l$ is the number of categories.
$E(n_{ij})$ is the expected cell frequency for the $(ij)$th cell.
$n_{ij}$ is the number of observations that fall into the $(ij)$th cell,
called the observed cell frequency.
The test statistic has a chi-square distribution with
$(p - 1)(l - 1)$ degrees of freedom.
132
Example
In a study of television viewing habits of children, a
developmental psychologist selects a random sample of 300
primary school pupils, 100 boys and 200 girls. Each child is
asked which of the following television programmes they like
best: The Talented Kids, The Pulpit, or Maths and Science
Quiz. The results are shown below.
133
                  Viewing Preferences
         The Talented Kids    The Pulpit    Maths and Science Quiz
Boys     50                   30            20
Girls    50                   80            70
Do boys’ preferences for the television programmes differ
significantly from the girls’ preferences? Use the 0.05 level of
significance.
134
Solution
                  Viewing Preferences
         The Talented Kids    The Pulpit    Maths and Science Quiz    Totals
Boys     50 (33.33)           30            20                        100
Girls    50 (66.67)           80            70                        200
Totals   100                  110           90                        300
The hypotheses we wish to test are as follows:
135
$H_{01}$: Proportion of boys who prefer The Talented Kids equals proportion of girls who
prefer The Talented Kids
$H_{02}$: Proportion of boys who prefer The Pulpit equals proportion of girls who
prefer The Pulpit
$H_{03}$: Proportion of boys who prefer Maths and Science Quiz equals proportion of
girls who prefer Maths and Science Quiz
against
$H_1$: At least one of the $H_0$ statements is false
136
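We now calculate the expected frequencies, where $P_i$ is the total for
population (row) $i$, $L_j$ is the total for category (column) $j$ and $n$ is
the grand total:
\[
\begin{aligned}
E(n_{11}) &= \frac{P_1 \times L_1}{n} = \frac{100 \times 100}{300} = 33.33 \\
E(n_{12}) &= \frac{P_1 \times L_2}{n} = \frac{100 \times 110}{300} = 36.67 \\
E(n_{13}) &= \frac{P_1 \times L_3}{n} = \frac{100 \times 90}{300} = 30.00 \\
E(n_{21}) &= \frac{P_2 \times L_1}{n} = \frac{200 \times 100}{300} = 66.67 \\
E(n_{22}) &= \frac{P_2 \times L_2}{n} = \frac{200 \times 110}{300} = 73.33 \\
E(n_{23}) &= \frac{P_2 \times L_3}{n} = \frac{200 \times 90}{300} = 60.00
\end{aligned}
\]
137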
Substituting them into the test statistic gives
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{p} \sum_{j=1}^{l} \frac{\left(n_{ij} - E(n_{ij})\right)^2}{E(n_{ij})} \\
       &= \frac{(50 - 33.33)^2}{33.33} + \cdots + \frac{(70 - 60)^2}{60} \\
       &= 8.3375 + \cdots + 1.6667 \\
       &= 19.3255
\end{aligned}
\]
The degrees of freedom is given by
\[
df = (p - 1)(l - 1) = (2 - 1)(3 - 1) = 2
\]
138
From the chi-square tables, the value of the chi-square at the
0.05 level of significance, with 2 degrees of freedom, is 5.99.
Since 19.3255 is greater than 5.99, we reject the null
hypothesis and conclude that at least one of the null
hypotheses is false. That is, boys' preferences differ significantly
from girls' preferences.
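The expected frequencies and the test statistic for this example can be reproduced with the short Python sketch below. It is an illustration only: the dictionary layout and variable names are assumptions, and the critical value 5.99 is copied from the chi-square tables. Because the code keeps the expected frequencies unrounded, it gives about 19.32 rather than 19.3255, which does not affect the decision.

```python
# Observed counts: Talented Kids, The Pulpit, Maths and Science Quiz.
table = {
    "Boys": [50, 30, 20],
    "Girls": [50, 80, 70],
}

row_totals = {group: sum(counts) for group, counts in table.items()}   # P_i
col_totals = [sum(col) for col in zip(*table.values())]                # L_j
n = sum(col_totals)                                                    # 300

chi_sq = 0.0
for group, counts in table.items():
    for j, o in enumerate(counts):
        e = row_totals[group] * col_totals[j] / n    # E(n_ij) = P_i * L_j / n
        chi_sq += (o - e) ** 2 / e

df = (len(table) - 1) * (len(col_totals) - 1)        # (p - 1)(l - 1)
CRITICAL_5PCT_DF2 = 5.99                             # table value, alpha = 0.05
reject_h0 = chi_sq > CRITICAL_5PCT_DF2

print(round(chi_sq, 4), df, reject_h0)
```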
139
GOODNESS-OF-FIT TESTS FOR INDEPENDENCE
Goodness-of-fit tests for independence are applied to two categorical
variables from a single population.
In these tests, the null hypothesis is that the variables are
independent, against the alternative that the variables are not
independent.
140
Data for the goodness-of-fit test of independence are usually presented in a
contingency table. The following table is an $r \times c$ contingency table.
                          Variable 2
Variable 1    1       2       ...     c       Totals
1             n_11    n_12    ...     n_1c    R_1
2             n_21    n_22    ...     n_2c    R_2
...           ...     ...     ...     ...     ...
r             n_r1    n_r2    ...     n_rc    R_r
Totals        C_1     C_2     ...     C_c     n
The test statistic for the goodness-of-fit test is given by
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},
\]
141
where
$c$ denotes the number of columns
$r$ denotes the number of rows
$o_{ij}$ denotes the observed cell frequency
$e_{ij}$ denotes the expected cell frequency
$n$ denotes the grand total
\[
e_{ij} = \frac{R_i \times C_j}{n}
\]
142
The test statistic follows a chi-square distribution with $(c - 1)(r - 1)$
degrees of freedom. We reject the null hypothesis if
\[
\chi^2_{cal} > \chi^2_{\alpha}\big((c - 1)(r - 1)\big)
\]
143
Example
The following table is based on the classification by size and colour of a
sample of 120 shirts drawn from a large population.
Colour              Size
          Small    Medium    Large
Red       10       13        12
Yellow    12       11        14
Green     18       20        10
Test the hypothesis that size and colour are independent at the 5% significance
level.
144
Solution
$H_0$: size and colour are independent
against
$H_1$: size and colour are not independent
The row marginal and column marginal totals are
$R_1 = 35$, $R_2 = 37$, $R_3 = 48$, and $C_1 = 40$, $C_2 = 44$, $C_3 = 36$
respectively. We now calculate the expected frequencies using
\[
e_{ij} = \frac{R_i \times C_j}{n}
\]
145
\[
\begin{aligned}
e_{11} &= \frac{35 \times 40}{120} = 11.67, & e_{12} &= \frac{35 \times 44}{120} = 12.83, & e_{13} &= \frac{35 \times 36}{120} = 10.50 \\
e_{21} &= \frac{37 \times 40}{120} = 12.33, & e_{22} &= \frac{37 \times 44}{120} = 13.57, & e_{23} &= \frac{37 \times 36}{120} = 11.10 \\
e_{31} &= \frac{48 \times 40}{120} = 16.00, & e_{32} &= \frac{48 \times 44}{120} = 17.60, & e_{33} &= \frac{48 \times 36}{120} = 14.40
\end{aligned}
\]
146
The test statistic is given by
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \\
       &= \frac{(10 - 11.67)^2}{11.67} + \frac{(13 - 12.83)^2}{12.83} + \cdots + \frac{(10 - 14.40)^2}{14.40} \\
       &= 3.63
\end{aligned}
\]
The degrees of freedom is $df = (3 - 1)(3 - 1) = 4$.
Therefore, from chi-square tables, $\chi^2_{0.05}(4) = 9.49$.
Since $\chi^2 = 3.63$ is less than 9.49, we fail to reject $H_0$.
Therefore, we have no significant evidence against the independence of
size and colour.
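As a check on the shirt example, the marginal totals, expected frequencies and cell-by-cell contributions can be computed in Python. The sketch below is illustrative and not part of the notes; the list-of-lists layout and names such as `contrib` are assumptions, and the critical value 9.49 is taken from the chi-square tables.

```python
# Rows: Red, Yellow, Green. Columns: Small, Medium, Large.
counts = [
    [10, 13, 12],
    [12, 11, 14],
    [18, 20, 10],
]

R = [sum(row) for row in counts]             # row totals R_i
C = [sum(col) for col in zip(*counts)]       # column totals C_j
n = sum(R)                                   # grand total, 120

# Expected frequencies e_ij = R_i * C_j / n
expected = [[r * c / n for c in C] for r in R]

# Contribution of each cell to the statistic: (o_ij - e_ij)^2 / e_ij
contrib = [[(o - e) ** 2 / e for o, e in zip(o_row, e_row)]
           for o_row, e_row in zip(counts, expected)]
chi_sq = sum(sum(row) for row in contrib)

df = (len(R) - 1) * (len(C) - 1)             # (r - 1)(c - 1)
CRITICAL_5PCT_DF4 = 9.49                     # table value, alpha = 0.05
reject_h0 = chi_sq > CRITICAL_5PCT_DF4

print(R, C, round(chi_sq, 2), df, reject_h0)
```

Keeping the individual contributions in `contrib` makes it easy to see which cells depart most from independence; here the largest single contribution comes from the Green/Large cell, although the total is still well below the critical value.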

147
