Statistics c.1
CHAPTER ONE
is made after recording the required characteristics. Therefore, the population size at
each selection remains the same (constant), which implies that under this method of
selection a unit can be selected more than once. Thus, under this
method there are $N^n$ possible samples of size $n$. Note also that the probability of selection at each
draw is constant at 1/N. Under sampling without replacement, by contrast, a selected unit is not returned, so the
population size reduces by one for each selection from the immediately preceding one.
Hence, there are $\binom{N}{n}$ (NCn, N combination n) possible samples under this method. The
probability of selection, therefore, is not constant at each selection: 1/N for the first
unit, 1/(N-1) for the next, 1/(N-2) for the third, and so on.
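The two counts of possible samples can be checked by brute-force enumeration. The sketch below is illustrative only: the function name `count_samples` and the five-value population are assumptions, and ordered draws are counted for sampling with replacement while unordered selections are counted for sampling without replacement.

```python
from itertools import combinations, product
from math import comb


def count_samples(population, n):
    """Count possible samples of size n by explicit enumeration."""
    # With replacement: every ordered sequence of n draws is possible.
    with_replacement = sum(1 for _ in product(population, repeat=n))
    # Without replacement: unordered selections of n distinct units.
    without_replacement = sum(1 for _ in combinations(population, n))
    return with_replacement, without_replacement


# Hypothetical population of N = 5 units, samples of size n = 3.
counts = count_samples([10, 20, 30, 40, 50], 3)
print(counts, (5 ** 3, comb(5, 3)))
```

For anything but very small N and n, the closed forms `N ** n` and `math.comb(N, n)` are preferable to enumeration.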
In practice, however, sampling is almost always done without replacement. But to help
us understand some essential concepts, we will also consider here sampling with
replacement in which every elementary unit has an equal chance of being selected for
each successive observation. It is also evident that sample outcomes are statistically
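• The allocation of sample sizes is termed proportional if the sampling fraction, i.e., the
ratio of the sample size to the population size, remains the same in all the strata.
Mathematically, the principle of proportional allocation gives:
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \frac{n_3}{N_3} = \dots = \frac{n_k}{N_k}$$
By the property of ratios and proportions, each of these ratios is equal to the ratio of the sum of the
numerators to the sum of the denominators, i.e.,
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \dots = \frac{n_k}{N_k} = \frac{\sum n_i}{\sum N_i} = \frac{n}{N} = c \ (\text{constant}),$$
since the total sample size $n$ and the population size $N$ are fixed.
Hence, $n_1 = N_1\left(\frac{n}{N}\right)$, $n_2 = N_2\left(\frac{n}{N}\right)$, $n_3 = N_3\left(\frac{n}{N}\right)$, and in general $n_i = N_i\left(\frac{n}{N}\right)$, $(i = 1, 2, 3, \dots, k)$.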
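Example: A stratified sample of size n = 80 is to be taken from a population of size N = 2000,
which consists of four strata for which N1 = 500, N2 = 1200, N3 = 200 and N4 = 100. If we use
proportional allocation, how large a sample must be taken from each stratum?
• Solution: In proportional allocation, we know that the sample size for the ith stratum is
given by
• $n_i = N_i\left(\frac{n}{N}\right)$, $(i = 1, 2, 3, \dots, k)$. Then
• $n_1 = N_1\left(\frac{n}{N}\right) = 500\left(\frac{80}{2000}\right) = 20$ must be taken from the 1st stratum,
• $n_2 = N_2\left(\frac{n}{N}\right) = 1200\left(\frac{80}{2000}\right) = 48$,
• $n_3 = N_3\left(\frac{n}{N}\right) = 200\left(\frac{80}{2000}\right) = 8$,
• $n_4 = N_4\left(\frac{n}{N}\right) = 100\left(\frac{80}{2000}\right) = 4$.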
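The allocation rule can be sketched in a few lines of Python. The function name `proportional_allocation` is an assumption made for illustration; the figures are those of the example above.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n so that n_i / N_i is the same in every stratum."""
    N = sum(stratum_sizes)  # population size
    return [N_i * n / N for N_i in stratum_sizes]


# Strata of 500, 1200, 200 and 100 units; total sample size 80.
allocation = proportional_allocation([500, 1200, 200, 100], 80)
print(allocation)
```

When the results are not whole numbers they have to be rounded, and the rounded values may need a small adjustment so that they still sum to n.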
For any given sample size $n$ taken from a population with mean $\mu$ and standard
deviation $\sigma$, the value of the sample mean $\bar{X}$ would vary from sample to sample if
several random samples were obtained from the population. This variability serves as
the basis for the sampling distribution.
The sampling distribution of the mean is described by two parameters: the expected
value $E(\bar{X}) = \mu_{\bar{X}}$, or mean of the sampling distribution of the mean, and the standard
deviation of the mean, $\sigma_{\bar{X}}$, called the standard error of the mean.
Example:
A population consists of the following ages: 10, 20, 30, 40, and 50. A random sample of
three is to be selected from this population and mean computed. Develop the sampling
distribution of the mean.
Solution:
The number of simple random samples of size n that can be drawn without replacement
from a population of size N is NCn. With N= 5 and n = 3, 5C3 = 10 samples can be drawn
from the population as:
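$\mu_{\bar{X}} = \mu = 30$, regardless of the sample size.
The population standard deviation is
$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = \sqrt{\frac{1000}{5}} = 14.142,$$
so the standard error of the mean, with the finite population correction, is
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}} = \frac{14.142}{\sqrt{3}} \cdot \sqrt{\frac{5-3}{5-1}} = 5.774.$$
The same value follows directly from the ten sample means:
$$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu_{\bar{X}})^2}{10}} = \sqrt{\frac{333.4}{10}} = 5.774.$$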
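The sampling distribution in this example is small enough to enumerate. The following sketch lists every sample of size 3 drawn without replacement from the five ages and compares the standard deviation of the sample means with the finite-population formula. The variable and function names are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = [10, 20, 30, 40, 50]
N, n = len(population), 3


def sampling_distribution(values, size):
    """Return the mean of every sample of the given size drawn without replacement."""
    return [mean(sample) for sample in combinations(values, size)]


means = sampling_distribution(population, n)
mu_xbar = mean(means)        # mean of the sampling distribution
se_enumerated = pstdev(means)  # standard error taken directly from the sample means

# Standard error from the formula with the finite population correction.
se_formula = pstdev(population) / sqrt(n) * sqrt((N - n) / (N - 1))

print(len(means), mu_xbar, round(se_enumerated, 3), round(se_formula, 3))
```

The two standard errors agree because the finite population correction is exact for simple random sampling without replacement.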
Since averaging reduces variability, $\sigma_{\bar{X}} < \sigma$ except in the cases where $\sigma = 0$ or $n = 1$.
The relationship between the shape of the population distribution and the shape of
the sampling distribution of the mean is described by the Central Limit Theorem.
The significance of the Central Limit Theorem is that it permits us to use sample
statistics to make inferences about population parameters without knowing anything
about the shape of the frequency distribution of that population other than what we can
get from the sample. It also permits us to use the normal distribution (curve) for
analyzing distributions whose shape is unknown. It creates the potential for applying
the normal distribution to many problems when the sample is sufficiently large.
Example:
1. The distribution of annual earnings of all bank tellers with five years of experience is
skewed negatively. This distribution has a mean of Birr 15,000 and a standard deviation
of Birr 2000. If we draw a random sample of 30 tellers, what is the probability that their
earnings will average more than Birr 15,750 annually?
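Solution:
Steps:
1. Calculate $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$
$\mu_{\bar{X}} = \mu$ = Birr 15,000
$\sigma_{\bar{X}} = \sigma/\sqrt{n} = 2000/\sqrt{30}$ = Birr 365.15
2. Calculate Z for $\bar{X}$
$$Z_{\bar{X}} = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}, \qquad Z_{15,750} = \frac{15,750 - 15,000}{365.15} = 2.05$$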
2. Suppose that during any hour in a large department store, the average number of
shoppers is 448, with a standard deviation of 21 shoppers. What is the probability of
randomly selecting 49 different shopping hours, counting the shoppers, and having the
sample mean fall between 441 and 446 shoppers, inclusive?
Solution:
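1. Calculate $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$
$\mu_{\bar{X}} = \mu$ = 448 shoppers
$\sigma_{\bar{X}} = \sigma/\sqrt{n} = 21/\sqrt{49} = 3$
2. Calculate Z for $\bar{X}$
$$Z_{\bar{X}} = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$$
$$Z_{441} = \frac{441 - 448}{3} = -2.33, \qquad Z_{446} = \frac{446 - 448}{3} = -0.67$$
3. Find the area covered by the interval
$P(441 \le \bar{X} \le 446) = P(-2.33 \le Z \le -0.67)$
= P (0 to -2.33) - P (0 to -0.67)
= 0.49010 - 0.24857
= 0.24153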
4. Interpret the results
There is a 24.153% chance of randomly selecting 49 hourly periods for which the
sample mean falls between 441 and 446 shoppers.
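The same probability can be obtained without a Z table. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `prob_sample_mean_between` is an assumption made for illustration. Because it does not round Z to two decimals, the result is close to, but not identical to, the table-based 0.24153.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_between(mu, sigma, n, low, high):
    """P(low <= sample mean <= high) under the normal approximation."""
    se = sigma / sqrt(n)       # standard error of the mean
    dist = NormalDist(mu, se)  # sampling distribution of the mean
    return dist.cdf(high) - dist.cdf(low)


# Department store example: mu = 448, sigma = 21, n = 49.
p = prob_sample_mean_between(mu=448, sigma=21, n=49, low=441, high=446)
print(round(p, 4))
```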
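3. A production company's 350 hourly employees average 37.6 years of age, with a
standard deviation of 8.3 years. If a random sample of 45 hourly employees is taken,
what is the probability that the sample will have an average age of less than 40 years?
Solution:
1. Calculate $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$
$\mu_{\bar{X}} = \mu$ = 37.6 years; n/N = 45/350 > 5%, so the finite population correction factor (FPCF) is needed.
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}} = \frac{8.3}{\sqrt{45}} \cdot \sqrt{\frac{350-45}{350-1}} = 1.16$$
2. Calculate Z for $\bar{X}$
$$Z_{\bar{X}} = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}, \qquad Z_{40} = \frac{40 - 37.6}{1.16} = 2.07$$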
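A minimal sketch of the standard error with the finite population correction follows. The function name `standard_error_mean` and the choice to apply the correction only when n/N exceeds 5% (the rule used in these notes) are assumptions of the sketch.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_mean(sigma, n, N=None):
    """Standard error of the mean; applies the FPCF when n/N exceeds 5%."""
    se = sigma / sqrt(n)
    if N is not None and n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))  # finite population correction factor
    return se


# Hourly employees example: sigma = 8.3, n = 45, N = 350.
se = standard_error_mean(8.3, 45, N=350)
# Probability that the sample mean age is below 40 when mu = 37.6.
p_below_40 = NormalDist(37.6, se).cdf(40)
print(round(se, 2), round(p_below_40, 4))
```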
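4. Suppose that a random sample of size 36 is being drawn from a population with a mean
of 278. If 86% of the time the sample mean is less than 280, what is the population
standard deviation?
Solution:
$\mu = 278$, $n = 36$, $P(\bar{X} < 280) = 0.86$, $\sigma = ?$
The Z value that leaves an area of 0.36 between it and the mean is +1.08.
$$Z_{\bar{X}} = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
$$1.08 = \frac{280 - 278}{\sigma_{\bar{X}}} \;\Rightarrow\; \sigma_{\bar{X}} = \frac{2}{1.08} = 1.85$$
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{36}} = 1.85 \;\Rightarrow\; \sigma = 6 \times 1.85 = 11.1$$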
5. A teacher gives a test to a class containing several hundred students. It is known that
the standard deviation of the scores is about 12 points. A random sample of 36 scores is
obtained.
a) What is the probability that the sample mean will differ from the population mean
by more than 6 points?
b) What is the probability that the sample mean will be within 6 points of the
population mean?
Solution:
a)
$n = 36$, $\sigma = 12$, $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{36}} = \frac{12}{6} = 2$
$P(\bar{X} > \mu + 6) + P(\bar{X} < \mu - 6) = ?$
$$Z_{+6} = \frac{6}{2} = 3, \qquad Z_{-6} = \frac{-6}{2} = -3$$
$P(\bar{X} > \mu + 6) + P(\bar{X} < \mu - 6) = P(Z > 3) + P(Z < -3)$
= [0.5 - P (0 to +3)] + [0.5 - P (0 to -3)]
= (0.5 - 0.49865) + (0.5 - 0.49865)
= 0.00135(2) = 0.00270
b)
$n = 36$, $\sigma = 12$, $\sigma_{\bar{X}} = \frac{12}{\sqrt{36}} = 2$
$P(\mu - 6 \le \bar{X} \le \mu + 6) = P(-3 \le Z \le 3)$
= P (0 to 3)*2
= 0.49865*2
= 0.9973
If the population standard deviation is 12, in a random sample of 36 scores there is a
99.73% chance that the sample mean score lies within 6 points of the population
mean.
Whereas the mean is computed by averaging a set of values, the sample proportion is
computed by dividing the frequency with which a given characteristic occurs in a sample by
the number of items in the sample.
$$\bar{p} = \frac{X}{n}$$
Where $\bar{p}$ = sample proportion
X = number of items in a sample that possess the characteristic
n = number of items in the sample
Like other probability distributions, the sampling distribution of the proportion is described
by two parameters: the mean of the sample proportions, $E(\bar{p})$, and the standard
deviation of the proportions, $\sigma_{\bar{p}}$, which is called the standard error of the proportion.
Answer: By applying the Central Limit Theorem. The CLT states that the normal
distribution approximates the shape of the distribution of sample proportions if np and
nq are greater than 5. Consequently, we solve problems involving sample proportions
by using a normal distribution whose mean and standard deviation are:
$$\mu_{\bar{p}} = P, \qquad \sigma_{\bar{p}} = \sqrt{\frac{Pq}{n}}, \qquad Z = \frac{\bar{p} - P}{\sigma_{\bar{p}}}$$
Example:
1. Suppose that 60% of the electrical contractors in a region use a particular brand of wire.
What is the probability of taking a random sample of size 120 from these electrical
contractors and finding that 0.5 or less use that brand of wire?
Solution:
Steps:
1. Check that np and nq > 5
120*0.6 = 72, and 120*0.4 = 48. Both are greater than 5.
2. Calculate $\sigma_{\bar{p}}$
$$\sigma_{\bar{p}} = \sqrt{\frac{Pq}{n}} = \sqrt{\frac{0.6 \times 0.4}{120}} = 0.0447$$
3. Calculate Z for $\bar{p}$
$$Z_{\bar{p}} = \frac{\bar{p} - P}{\sigma_{\bar{p}}}, \qquad Z_{0.5} = \frac{0.5 - 0.6}{0.0447} = -2.24$$
4. Find the area covered by the interval
$P(\bar{p} \le 0.5) = P(Z \le -2.24)$
= 0.5 - P (0 to -2.24)
= 0.5 - 0.48745
= 0.01255
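The four steps can be sketched as a small function. The name `prob_sample_proportion_at_most` is an assumption made for illustration, and the check on np and nq mirrors the rule of thumb stated in these notes. Since Z is not rounded, the result differs slightly from the table value 0.01255.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_proportion_at_most(P, n, p_bar):
    """P(sample proportion <= p_bar) using the normal approximation."""
    q = 1 - P
    if n * P <= 5 or n * q <= 5:
        raise ValueError("normal approximation needs np and nq greater than 5")
    se = sqrt(P * q / n)  # standard error of the proportion
    return NormalDist(P, se).cdf(p_bar)


# Electrical contractors example: P = 0.6, n = 120, sample proportion of 0.5 or less.
p = prob_sample_proportion_at_most(0.6, 120, 0.5)
print(round(p, 4))
```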
About 6.81% of the time, twelve or more defective parts would appear in a random
sample of eighty parts when the population proportion is 0.10.
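3. Suppose that a population proportion is 0.40 and that 80% of the time you draw a
random sample from this population, you get a sample proportion of 0.35 or more.
How large a sample were you taking?
Solution:
$P = 0.4$, $P(\bar{p} > 0.35) = 0.80$, $n = ?$
The Z value that leaves an area of 0.30 between it and the mean is -0.84.
$$Z_{0.35} = \frac{0.35 - 0.4}{\sigma_{\bar{p}}} \;\Rightarrow\; -0.84 = \frac{-0.05}{\sigma_{\bar{p}}} \;\Rightarrow\; \sigma_{\bar{p}} = \frac{0.05}{0.84} = 0.0595$$
$$\sigma_{\bar{p}} = \sqrt{\frac{Pq}{n}} \;\Rightarrow\; 0.0595 = \sqrt{\frac{0.4 \times 0.6}{n}}; \text{ squaring both sides, } 0.00354 = \frac{0.24}{n}$$
$$n = \frac{0.24}{0.00354} \approx 68$$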
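4. If a population proportion is 0.28 and the sample size is 140, 30% of the time the
sample proportion will be less than what value, if you are taking random samples?
Solution:
$P = 0.28$, $n = 140$, $P(\bar{p} < X) = 0.30$, $X = ?$
The Z value that leaves an area of 0.20 between it and the mean is -0.52.
$$Z = \frac{\bar{p} - P}{\sigma_{\bar{p}}}, \qquad \sigma_{\bar{p}} = \sqrt{\frac{Pq}{n}} = \sqrt{\frac{0.28 \times 0.72}{140}} = 0.0379$$
$$-0.52 = \frac{\bar{p} - 0.28}{0.0379} \;\Rightarrow\; -0.0197 = \bar{p} - 0.28 \;\Rightarrow\; \bar{p} = 0.26$$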
This distribution is concerned with the difference between sample means drawn
from two populations; the interest lies in determining whether the mean of one population is
equal to the mean of another, for example:
- Whether the mean life expectancy of females is equal to the mean life
expectancy for males
- Whether the mean productivity of women and men are equal or not
- Whether the mean CGPA for business students is equal to the mean CGPA
for social science students
- Whether the mean number of white blood cells in a droplet of blood is equal
to the mean number of red blood cells
- etc.
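In each case we have two different populations (p1 and p2). Population 1 has mean $\mu_1$
and variance $\sigma_1^2$, and population 2 has mean $\mu_2$ and variance $\sigma_2^2$.
1st, two distributions of the two populations, which have means and variances of
$\mu_1$, $\mu_2$ and $\sigma_1^2$, $\sigma_2^2$ respectively.
3rd, one sampling distribution of $\bar{x}_1 - \bar{x}_2$ with mean $\mu_1 - \mu_2$ and standard error $\sigma_{\bar{X}_1 - \bar{X}_2}$.
$$\text{Variance}(\bar{X}_1 - \bar{X}_2) = \sigma^2_{\bar{X}_1 - \bar{X}_2} = \sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
[If X and Y are independent random variables: var (X-Y) = var (X) + var (Y)]
If more than 5% of the population is sampled without replacement, we must apply the
finite population correction factor and the formula becomes:
$$\sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} \cdot \frac{N_1 - n_1}{N_1 - 1} + \frac{\sigma_2^2}{n_2} \cdot \frac{N_2 - n_2}{N_2 - 1}$$
$$Z_{\bar{X}_1 - \bar{X}_2} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}$$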
Example:
1 A financial loan officer claims that the mean monthly payment for credit cards is Br 80
with a variance of 1400 for single females and Br 80 with a variance of 1320 for single
males. You take a random sample of 100 females (population 1) and an independent
random sample of 120 males (population 2). What is the probability that the sample mean for
females will be at least Br 5 higher than the sample mean for males?
Solution:
$\mu_1 = 80$, $\mu_2 = 80$, $\sigma_1^2 = 1400$, $\sigma_2^2 = 1320$, $n_1 = 100$, $n_2 = 120$
$P(\bar{x}_1 - \bar{x}_2 \ge 5) = ?$
$$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2 = 80 - 80 = 0$$
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{1400}{100} + \frac{1320}{120}} = 5$$
$$Z_{\bar{X}_1 - \bar{X}_2} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}, \qquad Z_5 = \frac{5 - 0}{5} = 1.0$$
$P(\bar{x}_1 - \bar{x}_2 \ge 5) = P(Z \ge 1.0)$
= 0.5 - P (0 to +1.00)
= 0.5 - 0.34134
= 0.15866
There is a 15.9% chance that the sample mean monthly credit card payment for single
females will be higher than that for single males by at least Birr 5.
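As a cross-check on the credit card example, the probability for a difference between two independent sample means can be computed as follows. The function name `prob_mean_difference_at_least` is an assumption made for illustration, and the sketch omits the finite population correction.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_difference_at_least(mu1, var1, n1, mu2, var2, n2, d):
    """P(x1_bar - x2_bar >= d) for independent samples, normal approximation."""
    se = sqrt(var1 / n1 + var2 / n2)  # standard error of the difference
    return 1 - NormalDist(mu1 - mu2, se).cdf(d)


# Females: mean 80, variance 1400, n = 100; males: mean 80, variance 1320, n = 120.
p = prob_mean_difference_at_least(80, 1400, 100, 80, 1320, 120, 5)
print(round(p, 5))
```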
2 MOHA soft drinks factory produces two soft drinks: 7 up and Pepsi-cola. The daily
production of 7 up averages $\mu_1 = 15,000$ bottles and is normally distributed with a
standard deviation $\sigma_1 = 2,000$ bottles; the daily production of Pepsi-cola has mean $\mu_2 = 12,500$ bottles and standard deviation $\sigma_2 = 2,500$
bottles. A sample of five randomly selected daily production figures is taken from each
of the plants. What is the probability that the sample mean production for 7 up will be
less than or equal to the sample mean production for Pepsi-cola?
Solution:
$\mu_1 = 15,000$ bottles, $\mu_2 = 12,500$ bottles, $\sigma_1 = 2,000$ bottles, $\sigma_2 = 2,500$ bottles, $n_1 = 5$, $n_2 = 5$
$P(\bar{x}_1 - \bar{x}_2 \le 0) = ?$
$$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2 = 15,000 - 12,500 = 2,500$$
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{2000^2}{5} + \frac{2500^2}{5}} = 1,431.78$$
$$Z_{\bar{X}_1 - \bar{X}_2} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}, \qquad Z_0 = \frac{0 - 2,500}{1431.78} = -1.75$$
$P(\bar{x}_1 - \bar{x}_2 \le 0) = P(Z \le -1.75)$
= 0.5 - P (0 to -1.75)
= 0.5 - 0.45994
= 0.04006
Thus, there is only a 4% chance that the sample mean production for 7UP will be smaller than
the sample mean production for Pepsi-cola. So, if the owner of the two plants found the
first sample mean to be smaller than the second (say, $\bar{x}_2 = 13,500$ bottles) in independent random samples of five
randomly selected days from each plant, he would suspect that either the sampling was
faulty or that the difference in the plants' mean daily outputs had changed.
3 X company claims that the mean annual repair bill for its rental cars is Br 290 and the
standard deviation is Br 50. Y Company also claims its mean annual repair bill is Br
290 and the standard deviation is Br 50. If independent random samples of 100 cars
from each company are obtained, what is the probability that $|\bar{x}_1 - \bar{x}_2|$ exceeds Br 5?
Solution:
$\mu_1 = 290$, $\mu_2 = 290$, $\sigma_1 = 50$, $\sigma_2 = 50$, $n_1 = 100$, $n_2 = 100$
$P(|\bar{X}_1 - \bar{X}_2| > 5) = ?$
$$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2 = 290 - 290 = 0$$
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{50^2}{100} + \frac{50^2}{100}} = 7.071$$
$$Z_{\bar{X}_1 - \bar{X}_2} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}, \qquad Z_5 = \frac{5 - 0}{7.071} = 0.71$$
$P(|\bar{x}_1 - \bar{x}_2| > 5) = P(Z > 0.71) \times 2$
= [0.5 - P (0 to 0.71)] × 2
= [0.5 - 0.26115] × 2
= 0.4777
There is a 47.77% chance that the difference between the sample mean annual repair bills for X
and Y companies exceeds Br 5.
4 Two populations of measurements are normally distributed with $\mu_1 = 57$ and $\mu_2 = 25$. The
two population standard deviations are $\sigma_1 = 12$ and $\sigma_2 = 6$. Two independent samples
of n1 = n2 = 36 are taken from the populations.
a. What is the expected value of the difference in sample means, $\bar{x}_1 - \bar{x}_2$?
b. What is the standard deviation of the distribution of $\bar{x}_1 - \bar{x}_2$?
Solution:
a. Expected value of the difference in sample means: $E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$
= 57 - 25
= 32
b. The standard deviation of the distribution of $\bar{x}_1 - \bar{x}_2$ is the standard error of the
difference between two sample means:
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{12^2}{36} + \frac{6^2}{36}} = 2.24$$
If the populations from which the samples are drawn are normal in shape,
then the distribution of $\bar{x}_1 - \bar{x}_2$ will be normal in shape.
If the populations from which samples are drawn are not normal in shape,
then the distribution of $\bar{x}_1 - \bar{x}_2$ will be approximately normal, owing to the
central limit effect, if the sample sizes n1 and n2 are both large.
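5 Two production processes are, on the average, identical. Both use an average of
$\mu_1 = \mu_2 = 500$ kg of raw material per day. Both have the same standard deviation of
daily use, $\sigma_1 = \sigma_2 = 9$ kg per day. Thus the daily use of material may vary for the two
processes, but on the average they are the same. Find the probability that the sample means, based on samples of sizes n1 = 81 and n2 = 36, differ by no
more than 1.0 kg.
Solution:
$\mu_1 = 500$, $\mu_2 = 500$, $\sigma_1 = 9$, $\sigma_2 = 9$, $n_1 = 81$, $n_2 = 36$
$P(|\bar{X}_1 - \bar{X}_2| \le 1) = P(-1 \le \bar{X}_1 - \bar{X}_2 \le 1) = ?$
$$E(\bar{X}_1 - \bar{X}_2) = \mu_1 - \mu_2 = 500 - 500 = 0$$
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{9^2}{81} + \frac{9^2}{36}} = 1.8028$$
$$Z_{\bar{X}_1 - \bar{X}_2} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{X}_1 - \bar{X}_2}}, \qquad Z_1 = \frac{1 - 0}{1.8028} = 0.55$$
$P(|\bar{x}_1 - \bar{x}_2| \le 1) = P(0 \le Z \le 0.55) \times 2$
= [P (0 to 0.55)] × 2
= [0.20884] × 2
= 0.41768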
Suppose we take independent samples of size n1 and n2 from two populations. Let $P_1$
and $P_2$ be the proportions of items in each population that possess a certain
characteristic, and let $q_1 = 1 - P_1$ and $q_2 = 1 - P_2$.
If n1P1, n1q1 are greater than 5 and n2P2, n2q2 are greater than 5, then the random variable
$\bar{p}_1 - \bar{p}_2$ is approximately normally distributed with
Mean: $E(\bar{p}_1 - \bar{p}_2) = P_1 - P_2$; and
Variance: $\text{Var}(\bar{p}_1 - \bar{p}_2) = \frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}$; if $\frac{n_1}{N_1} > 0.05$ or $\frac{n_2}{N_2} > 0.05$, the finite population
correction factor is used.
$$Z_{\bar{p}_1 - \bar{p}_2} = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}$$
Example:
1 At Addis Ababa University there is a movement to re-establish the Students' Union.
Approximately 90% of all students favor the reinstatement. A pro-union student
takes a random sample of 100 students. An anti-union student takes an independent
random sample of 100 students. Let $\bar{p}_1$ denote the proportion of students who favor the
union in the sample taken by the pro-union student and $\bar{p}_2$ denote the proportion of
students who favor the union in the sample taken by the anti-union student. Calculate
the probability that $\bar{p}_1$ exceeds $\bar{p}_2$ by 0.1 or more.
Solution:
Pro-union: P1 = 0.9, q1 = 0.1, n1 = 100
Anti-union: P2 = 0.9, q2 = 0.1, n2 = 100
$P(\bar{p}_1 - \bar{p}_2 \ge 0.10) = ?$
$$E(\bar{p}_1 - \bar{p}_2) = P_1 - P_2 = 0.9 - 0.9 = 0$$
$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}} = \sqrt{\frac{0.9 \times 0.1}{100} + \frac{0.9 \times 0.1}{100}} = 0.04243$$
$$Z_{\bar{p}_1 - \bar{p}_2} = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}, \qquad Z_{0.1} = \frac{0.1 - 0}{0.04243} = 2.36$$
$P(\bar{p}_1 - \bar{p}_2 \ge 0.10) = P(Z \ge 2.36)$
= 0.5 - P (0 to +2.36)
= 0.5 - 0.49086
= 0.00914
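The Students' Union calculation can be sketched in code as below. The function name `prob_proportion_difference_at_least` is an assumption made for illustration, and the finite population correction is left out. The unrounded Z gives a value very close to the table-based 0.00914.

```python
from math import sqrt
from statistics import NormalDist


def prob_proportion_difference_at_least(P1, n1, P2, n2, d):
    """P(p1_bar - p2_bar >= d) for independent samples, normal approximation."""
    se = sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)
    return 1 - NormalDist(P1 - P2, se).cdf(d)


# Both samples come from a population in which 90% favor the union.
p = prob_proportion_difference_at_least(0.9, 100, 0.9, 100, 0.10)
print(round(p, 4))
```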
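$P(\bar{p}_1 - \bar{p}_2 \ge 0) = ?$
$$E(\bar{p}_1 - \bar{p}_2) = P_1 - P_2 = 0.6 - 0.5 = 0.10$$
$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}} = \sqrt{\frac{0.6 \times 0.4}{400} + \frac{0.5 \times 0.5}{400}} = 0.035$$
$$Z_{\bar{p}_1 - \bar{p}_2} = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}, \qquad Z_0 = \frac{0 - 0.1}{0.035} = -2.86$$
$P(\bar{p}_1 - \bar{p}_2 \ge 0) = P(Z \ge -2.86)$
= 0.5 + P (0 to -2.86)
= 0.5 + 0.49788
= 0.99788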
CHAPTER TWO
STATISTICAL ESTIMATION
Introduction
The sampling distribution of the mean shows how far sample means could be from a
known population mean. Similarly, the sampling distribution of the proportion shows
how far sample proportions could be from a known population proportion. In
estimation, our aim is to determine how far an unknown population mean could be
from the mean of a simple random sample selected from that population; or how far an
unknown population proportion could be from a sample proportion. Those are the
concerns of statistical inference, in which a statement about an unknown population
parameter is derived from information contained in a random sample selected from the
population.
Basic concepts:
$\bar{x}$, $\bar{p}$, $s^2$, $s$ .......... Estimators
$\mu$, $P$, $\sigma^2$, $\sigma$ .......... Items being estimated
1, 0.5, 9, 3 .......... Estimates
Types of Estimates:
We can make two types of estimates about a population: a point estimate and an
interval estimate.
o Sample mean $\bar{x}$ for population mean $\mu$;
o Sample proportion $\bar{p}$ for population proportion $P$;
o Sample variance $s^2$ for population variance $\sigma^2$; and
o Sample standard deviation $s$ for population standard deviation $\sigma$.
Example:
Suppose we have the sample 10, 20, 30, 40 and 50 selected randomly from a population
whose mean is unknown. The sample mean, $\bar{x} = 30$, is a point estimate of the population mean $\mu$.
On the other hand, if we state that the mean, $\mu$, lies within $\bar{x} \pm 10$, the range of values
from 20 (30-10) to 40 (30+10) is an interval estimate.
INTERVAL ESTIMATION
When the population distribution is normal and at the same time $\sigma$ is known, we can
estimate $\mu$ (regardless of the sample size) using the following formula¹.
¹ This formula also works for problems which involve a large sample size (n>30) even though the
population is not normally distributed. And if n>.05N, the finite population correction factor may be used.
$$\mu = \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
Where:
$\bar{X}$ = sample mean
Z = value from the standard normal table reflecting the confidence level
σ = population standard deviation
n = sample size
α = the proportion of incorrect statements (α = 1 - C)
µ = unknown population mean
From the above formula we can learn that an interval estimate is constructed by adding
and subtracting the error term to and from the point estimate. That is, the point estimate
is found at the center of the confidence interval.
To find the interval estimate of population mean, we have the following steps.
1. Compute the standard error of the mean, $\sigma_{\bar{X}}$
2. Compute $\alpha/2$ from the confidence coefficient.
3. Find the Z value for $\alpha/2$ from the table
4. Construct the confidence interval
5. Interpret the results
Example:
1. The vice president of operations for Ethiopian Tele Communication Corporation (ETC)
is in the process of developing a strategic management plan. He believes that the ability
to estimate the length of the average phone call on the system is important. He takes a
random sample of 60 calls from the company records and finds that the mean sample
length for a call is 4.26 minutes. Past history for these types of calls has shown that the
population standard deviation for call length is about 1.1 minutes. Assuming that the
population is normally distributed and he wants to have a 95% confidence, help him in
estimating the population mean.
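Solution:
n = 60 calls, $\bar{X}$ = 4.26 minutes, σ = 1.1 minutes, C = 0.95
i. $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{1.1}{\sqrt{60}} = 0.142$
ii. α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. $\mu = \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
= 4.26 ± 1.96(0.142)
= 4.26 ± 0.28
3.98 ≤ µ ≤ 4.54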
The vice-president of ETC can be 95% confident that the average length of a call for the
population is between 3.98 and 4.54 minutes.
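The five steps can be condensed into a short function. In this sketch the name `z_confidence_interval` is an assumption made for illustration, and the Z value is taken from `statistics.NormalDist().inv_cdf` rather than from a printed table, so it is not rounded to 1.96.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known (or n is large)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z value for alpha/2 in the upper tail
    margin = z * sigma / sqrt(n)             # Z times the standard error
    return x_bar - margin, x_bar + margin


# Phone call example: n = 60, sample mean 4.26, sigma = 1.1, 95% confidence.
low, high = z_confidence_interval(4.26, 1.1, 60, 0.95)
print(round(low, 2), round(high, 2))
```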
2. A survey conducted by “Addis Zemen Gazetta” found that the sample mean age of
men was 44 years and the sample mean age of women was 47 years. Altogether, 454
people from Addis were included in the reader poll –340 women and 114 men. Assume
that the population standard deviation of age for both men and women is 8 years.
a. Develop a 95% confidence interval estimate for the mean age of the population
men who read the gazetta.
b. Develop a 95% confidence interval estimate for the mean age of the population
women who read the gazetta.
c. Compare the widths of the two interval estimates from parts (a) and (b). Which one
has better precision? Why?
Solution:
a.
n = 114 men, $\bar{X}$ = 44 years, σ = 8 years, C = 0.95
i. $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{114}} = 0.75$
ii. α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. $\mu = \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
= 44 ± 1.96(0.75)
= 44 ± 1.47
42.53 ≤ µ ≤ 45.47
b.
n = 340 women, $\bar{X}$ = 47 years, σ = 8 years, C = 0.95
i. $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{340}} = 0.434$
ii. α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. $\mu = \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
= 47 ± 1.96(0.434)
= 47 ± 0.85
46.15 ≤ µ ≤ 47.85
c. Part (b) has better precision: its interval is narrower because the sample size is larger than in part
(a).
3. Time magazine reports information on the time required for caffeine from products
such as coffee and soft drinks to leave the body after consumption. Assume that the
99% confidence interval estimate of the population mean time for adults is 5.6 hrs to 6.4
hrs.
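a. What is the point estimate of the mean time for caffeine to leave the body after
consumption?
b. If the population standard deviation is 2 hrs, how large a sample was used to
provide the interval estimate?
Solution:
C = 0.99; confidence interval: 5.6 ≤ µ ≤ 6.4
a. Point estimate = $\frac{5.6 + 6.4}{2}$ = 6 hours
Or:
$5.6 = \bar{X} - Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$6.4 = \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Adding the two equations: 12 = 2$\bar{X}$
$\bar{X}$ = 6 hours
b. $6.4 = \bar{X} + Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
$6.4 = 6 + 2.57 \cdot \frac{2}{\sqrt{n}}$
$0.4 = \frac{5.14}{\sqrt{n}}$; rearranging the expression,
$\sqrt{n} = \frac{5.14}{0.4} = 12.85$; squaring both sides,
n = 165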
We state with 99% confidence that the mean time required for caffeine to leave the body
after consumption lies between 5.6 and 6.4 hrs.
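The back-calculation of the sample size from a reported interval can be sketched as follows. The function name `sample_size_from_interval` is an assumption made for illustration. Because the code uses an unrounded Z value instead of the table value 2.57, it returns a figure slightly above 165.

```python
from statistics import NormalDist


def sample_size_from_interval(low, high, sigma, confidence):
    """Recover the point estimate and the implied sample size from a Z interval."""
    point_estimate = (low + high) / 2  # the interval is centred on the sample mean
    margin = (high - low) / 2          # Z * sigma / sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z * sigma / margin) ** 2
    return point_estimate, n


# Caffeine example: 99% interval from 5.6 to 6.4 hours, sigma = 2 hours.
x_bar, n = sample_size_from_interval(5.6, 6.4, 2, 0.99)
print(round(x_bar, 2), round(n, 1))
```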
Confidence interval estimate of µ - Normal population, σ unknown, n
large
If we know that the population is normal, and we know the population standard
deviation, the confidence interval for µ should be constructed in the manner already
shown, i.e., $\mu = \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$. If the population standard deviation is
unknown, it has to be estimated from the sample; i.e., when σ is unknown, we use the
sample standard deviation: $S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$. Then, the standard error of the mean, $\sigma_{\bar{X}}$,
is estimated by the sample standard error of the mean: $S_{\bar{X}} = \frac{S}{\sqrt{n}}$.
Example:
1. Suppose that a car rental firm in Addis wants to estimate the average number of miles
traveled by each of its cars rented. A random sample of 110 cars rented reveals that the
sample mean travel distance per day is 85.5 miles, with a sample standard deviation of
19.3 miles. Compute a 99% confidence interval to estimate µ.
Solution:
n = 110 rented cars, $\bar{X}$ = 85.5 miles, s = 19.3 miles, C = 0.99
i. $S_{\bar{X}} = \frac{S}{\sqrt{n}} = \frac{19.3}{\sqrt{110}} = 1.84$
ii. α = 1 - C = 1 - 0.99 = 0.01
α/2 = 0.01/2 = 0.005
iii. $Z_{\alpha/2} = Z_{0.005} = 2.57$
iv. $\mu = \bar{X} \pm Z_{\alpha/2} \frac{s}{\sqrt{n}}$
= 85.5 ± 2.57(1.84)
= 85.5 ± 4.73
80.77 ≤ µ ≤ 90.23
We state with 99% confidence that the average distance traveled by rented cars lies
between 80.77 and 90.23 miles.
2. A study is being conducted in a company that has 800 engineers. A random sample of
50 of these engineers reveals that the average sample age is 34.3 years, and the sample
standard deviation is 8 years. Assuming normality, construct a 98% confidence interval
to estimate the average age of all engineers in this company.
Solution:
n = 50 engineers, N = 800 engineers, $\bar{X}$ = 34.3 years, s = 8 years, C = 0.98
i. $S_{\bar{X}} = \frac{S}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}} = \frac{8}{\sqrt{50}} \cdot \sqrt{\frac{800-50}{800-1}} = 1.10$ (see footnote 3)
² This formula also works for a large sample size even though the parent population is not normally
distributed.
³ Since the sample size is greater than 5% of the population size, the finite population multiplier is used to
calculate the sample standard error of the mean.
ii. α = 1 - C = 1 - 0.98 = 0.02
α/2 = 0.02/2 = 0.01
iii. $Z_{\alpha/2} = Z_{0.01} = 2.33$
iv. $\mu = \bar{X} \pm Z_{\alpha/2} S_{\bar{X}}$
= 34.3 ± 2.33(1.10)
= 34.3 ± 2.56
31.74 ≤ µ ≤ 36.86
We state with 98% confidence that the mean age of engineers lies between 31.74 and
36.86 years.
If the sample size is small (n<30), we can develop an interval estimate of a population
mean only if the population has a normal probability distribution. If the sample
standard deviation s is used as an estimator of the population standard deviation σ and
if the population has a normal distribution, interval estimation of the population mean
can be based upon a probability distribution known as the t-distribution.
Characteristics of t-distribution
1. The t-distribution is symmetric about its mean (0) and ranges from - ∞ to ∞.
2. The t-distribution is bell-shaped (unimodal) and has approximately the same
appearance as the standard normal distribution (Z- distribution).
3. The t-distribution depends on a parameter ν (Greek nu)⁴, called the degrees of
freedom of the distribution: ν = n - 1, where n is the sample size. The degrees of freedom,
ν, refer to the number of values we can choose freely.
4. The variance of the t-distribution is ν/ (ν-2) for ν>2.
5. The variance of the t-distribution always exceeds 1.
⁴ What are degrees of freedom? We can define them as the number of values we can choose freely. In
general, the degrees of freedom for a t statistic are the degrees of freedom associated with the sum of
squares used to obtain an estimate of the variance. The variance estimate depends not only on the
sample size but also on how many parameters must be estimated with the sample:
Degrees of freedom = Number of observations - Number of parameters that must be estimated beforehand
Here we calculate the sample variance by using n observations and estimating one parameter (the mean).
Thus, there are (n - 1) degrees of freedom.
3. Look up $t_{\alpha/2, \nu}$
4. Construct the confidence interval
5. Interpret results
Example:
1. A random sample of 27 items produces $\bar{x}$ = 128.4 and s = 20.6. What is the 98%
confidence interval for µ? Assume that x is normally distributed for the population.
What is the point estimate?
Solution:
The point estimate of the population mean is the sample mean; in this case, 128.4 is the
point estimate.
With ν = 27 - 1 = 26 and α/2 = 0.01, $t_{0.01, 26} = 2.479$, so $\mu = 128.4 \pm 2.479 \cdot \frac{20.6}{\sqrt{27}} = 128.4 \pm 9.83$.
118.57 ≤ µ ≤ 138.23
We state with 98% confidence that the population mean lies between 118.57 and 138.23.
2. A sample of 20 cab fares in Bahir Dar city shows a sample mean of Br 2.50 and a sample
standard deviation of Br. 0.50. Develop a 90% confidence interval estimate of the mean
cab fares in Bahir Dar city. Assume the population of cab fares has a normal
distribution.
2.31 ≤ µ ≤ 2.69
We state with 90% confidence that the mean of cab fares in Bahir Dar city lies between
Birr 2.31 and 2.69.
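A small-sample interval of this kind can be sketched as follows. The Python standard library has no t-distribution, so the critical value is passed in by the caller; the value 1.729 used here is the usual table entry for $t_{0.05, 19}$ and is supplied as an assumption, as is the function name `t_confidence_interval`.

```python
from math import sqrt


def t_confidence_interval(x_bar, s, n, t_critical):
    """Confidence interval for the mean using a t critical value looked up by the caller."""
    margin = t_critical * s / sqrt(n)  # t times the sample standard error
    return x_bar - margin, x_bar + margin


# Cab fare example: n = 20, sample mean 2.50, s = 0.50, 90% confidence.
# t_critical = 1.729 is the table value for alpha/2 = 0.05 with 19 degrees of freedom.
low, high = t_confidence_interval(2.50, 0.50, 20, t_critical=1.729)
print(round(low, 2), round(high, 2))
```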
3. Sales personnel for X Company are required to submit weekly reports listing customer
contacts made during the week. A sample of 61 weekly contact reports showed a mean
of 22.4 customer contacts per week for the sales personnel. The sample standard
deviation was 5 contacts.
a. Develop a 95% confidence interval estimate for the mean number of weekly
customer contacts for the population of sales personnel.
b. Assume that the population of weekly contact data has a normal distribution.
Use the t distribution to develop a 95% confidence interval for the mean number
of weekly customer contacts.
c. Compare your answer for parts (a) and (b). What do you conclude from your
results?
Solution:
a.
n = 61 weekly contact reports⁵, $\bar{X}$ = 22.4 contacts, s = 5 contacts, C = 0.95
i. $S_{\bar{X}} = \frac{S}{\sqrt{n}} = \frac{5}{\sqrt{61}} = 0.64$
ii. α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. $\mu = \bar{X} \pm Z_{\alpha/2} \frac{s}{\sqrt{n}}$
= 22.4 ± 1.96(0.64)
= 22.4 ± 1.25
21.15 ≤ µ ≤ 23.65
I state with 95% confidence that the mean weekly contact lies between 21.15 and 23.65
contacts.
b.
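n = 61 weekly contact reports, $\bar{X}$ = 22.4 contacts, s = 5 contacts, C = 0.95
i. $S_{\bar{X}} = \frac{S}{\sqrt{n}} = \frac{5}{\sqrt{61}} = 0.64$; ν = n - 1 = 61 - 1 = 60
ii. α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
⁵ Since the sample size is large, we use the Z-distribution to construct the confidence interval.
iii. $t_{\alpha/2, \nu} = t_{0.025, 60} = 2.000$
iv. $\mu = \bar{X} \pm t_{\alpha/2, \nu} \frac{s}{\sqrt{n}}$
= 22.4 ± 2.000(0.64)
= 22.4 ± 1.28
21.12 ≤ µ ≤ 23.68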
I state with 95% confidence that the mean weekly contact lies between 21.12 and 23.68
contacts.
This is solved by non-parametric tests, which do not require assumption about the
underlying form of the population data
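Interval Estimation of the Population Proportion
Solving for P results in $P = \bar{p} - Z\sqrt{\bar{p}\bar{q}/n}$, and since Z can assume both positive and
negative values, $P = \bar{p} \pm Z\sqrt{\bar{p}\bar{q}/n}$. Since Z represents the confidence level, we write it as
$$P = \bar{p} \pm Z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}} = \bar{p} \pm Z_{\alpha/2} S_{\bar{p}}$$
Where: $\bar{p}$ = sample proportion
$\bar{q}$ = 1 - $\bar{p}$
α = 1 - C
n = sample size
P = unknown population proportion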
Example:
1. Recently, a study of 87 randomly selected companies with telemarketing operation was
completed. The study revealed that 39% of the sampled companies had used
telemarketing to assist them in order processing. Using this information estimate the
population proportion of telemarketing companies who use their telemarketing
operation to assist them in order processing taking a 95% confidence level.
Solution:
n = 87, $\bar{p}$ = 0.39, $\bar{q}$ = 0.61, C = 0.95
i. $S_{\bar{p}} = \sqrt{\frac{\bar{p}\bar{q}}{n}} = \sqrt{\frac{0.39 \times 0.61}{87}} = 0.0523$
ii. α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. $P = \bar{p} \pm Z_{\alpha/2} S_{\bar{p}}$
= 0.39 ± 1.96(0.0523)
= 0.39 ± 0.1025
0.2875 ≤ P ≤ 0.4925
We state with 95% confidence that the proportion of companies which use telemarketing
to assist order processing lies between 0.2875 and 0.4925.
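The interval for a proportion follows the same pattern as the interval for a mean. A minimal sketch, assuming the illustrative function name `proportion_confidence_interval` and an unrounded Z value from the standard library:

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p_bar, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    q_bar = 1 - p_bar
    se = sqrt(p_bar * q_bar / n)  # sample standard error of the proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_bar - z * se, p_bar + z * se


# Telemarketing example: 39% of 87 sampled companies, 95% confidence.
low, high = proportion_confidence_interval(0.39, 87, 0.95)
print(round(low, 4), round(high, 4))
```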
2. A fast food restaurant took a random sample of 400 customers to determine the
proportion of customers who are female. A confidence interval of .73 to .87 was
reported.
a. Find the number of females and the sample proportion
b. Find the level of confidence of this interval
Solution:
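a.
n = 400, 0.73 ≤ P ≤ 0.87, $\bar{p}$ = ?, number of females = ?
Point estimate = $\frac{0.73 + 0.87}{2}$ = 0.80
Or:
$0.73 = \bar{p} - Z_{\alpha/2} S_{\bar{p}}$
$0.87 = \bar{p} + Z_{\alpha/2} S_{\bar{p}}$
Adding the two equations: 1.60 = 2$\bar{p}$
$\bar{p}$ = 0.8
Number of females (X) = n*$\bar{p}$ = 400*0.8 = 320
b. $P = \bar{p} \pm Z_{\alpha/2} S_{\bar{p}}$
$0.87 = 0.8 + Z_{\alpha/2} S_{\bar{p}}$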
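It is clear that the unbiased point estimate of the difference between the means of two
populations, $\mu_1 - \mu_2$, is the difference between two sample means, $\bar{x}_1 - \bar{x}_2$, where each
sample is a random sample taken from the respective target population. The confidence
interval is constructed by adding to and subtracting from this point estimate a margin of error: the
standard error of the difference between means multiplied by the Z value for the desired confidence level.
If the two parent populations are normal, then the sampling distribution of the
difference between two means will be normally distributed regardless of n (sample
size). And we can estimate $\mu_1 - \mu_2$ (regardless of n1 and n2) using the following formula,
given that $\sigma_1$ and $\sigma_2$ are known (see footnote 6):
$$\mu_1 - \mu_2 = (\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\, \sigma_{\bar{X}_1 - \bar{X}_2}$$
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma^2_{\bar{X}_1} + \sigma^2_{\bar{X}_2}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
When $\sigma_1$ and $\sigma_2$ are not known, the standard error of the difference between two sample means,
$\sigma_{\bar{X}_1 - \bar{X}_2}$, is estimated by the sample standard error of the difference between two
sample means, $S_{\bar{X}_1 - \bar{X}_2} = \sqrt{S^2_{\bar{X}_1} + S^2_{\bar{X}_2}} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$, and the interval estimation takes the
form $\mu_1 - \mu_2 = (\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\, S_{\bar{X}_1 - \bar{X}_2}$, provided the sample sizes are
large.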
Example:
1. In a sex discrimination case, an employee alleged that a large corporation paid men
more than women for comparable work. Let population 1 represent all male
employees performing certain jobs and population 2 represent all female employees
performing comparable jobs at the corporation. Independent samples are taken of
$n_1 = 100$ males and $n_2 = 100$ females; the sample means are $\bar{x}_1$ = Birr 20,600 and
$\bar{x}_2$ = Birr 19,700, and the sample standard deviations are $s_1$ = Birr 3,000 and
$s_2$ = Birr 2,500. Construct a 95% confidence interval for $\mu_1 - \mu_2$. What do you conclude
from this?
Solution:
Male employees: $n_1 = 100$, $\bar{x}_1$ = Birr 20,600, $s_1$ = Birr 3,000
Female employees: $n_2 = 100$, $\bar{x}_2$ = Birr 19,700, $s_2$ = Birr 2,500
⁶ This formula also works for problems which involve large sample sizes ($n_1$ and $n_2 \ge 30$) even though the
parent population may not be normally distributed.
Because this interval contains only positive values, we can be quite confident that
$\mu_1 - \mu_2 > 0$. Thus, it is reasonable to assume that the mean salary for males exceeds the
mean salary for females.
2. A farmer wants to determine if different types of feed can influence the mean number
of eggs that hens lay per month. In a random sample of 100 hens that ate feed 1, the
average number of eggs per month was $\bar{x}_1 = 15.2$ with variance 4. In a random sample
of 100 hens that ate feed 2, the average number of eggs per month was $\bar{x}_2 = 14$ with
variance 4. Construct a 95% confidence interval for $\mu_1 - \mu_2$. What do you conclude?
Solution:
Feed 1: $n_1 = 100$ hens, $\bar{x}_1 = 15.2$ eggs, $s_1^2 = 4$
Feed 2: $n_2 = 100$ hens, $\bar{x}_2 = 14$ eggs, $s_2^2 = 4$
C = 0.95
Steps:
i. Calculate the (sample) standard error of the difference between two means
$$S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} = \sqrt{\frac{4}{100} + \frac{4}{100}} = \sqrt{0.08} = 0.283$$
ii. Compute α/2
α = 1 - C = 1 - 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. Look up $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. Construct the confidence interval
$\mu_1 - \mu_2 = (\bar{X}_1 - \bar{X}_2) \pm Z_{\alpha/2}\, S_{\bar{X}_1 - \bar{X}_2}$
= (15.2 - 14) ± 1.96(0.283)
= 1.2 ± 0.5547
0.6453 ≤ $\mu_1 - \mu_2$ ≤ 1.7547
We state with 95% confidence that the difference between the mean numbers of eggs laid by hens which ate the
two types of feed lies between 0.6453 eggs and 1.7547 eggs.
Since the interval contains only positive values, we conclude that hens which ate feed type 1
are more productive than hens that ate feed type 2.
When the sample sizes are small, the population standard deviations are unknown, and
the population distributions are normal, we use the t-distribution to construct a confidence
interval for $\mu_1 - \mu_2$. Moreover, to use a t-distribution we have to assume that the two
variances (standard deviations) are equal. In short, to use a t-distribution for
constructing a confidence interval for $\mu_1 - \mu_2$, we assume the following:
However, in most cases, $\sigma^2$ is unknown; thus the two sample variances $s_1^2$ and $s_2^2$ must
be used to develop the estimate of $\sigma^2$. Since
$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}}$
is based on the assumption that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we do not need separate estimates of $\sigma_1^2$ and $\sigma_2^2$. In fact,
we can combine the data from the two samples to provide the best single estimate of $\sigma^2$.
The process of combining the results of two independent simple random samples to
provide one estimate of $\sigma^2$ is referred to as pooling. The pooled estimator of the variance
$\sigma^2$, denoted by $S_P^2$, is the weighted average of the two sample variances, $s_1^2$ and $s_2^2$,
with the degrees of freedom associated with each sample being used as the weights.
The formula for the pooled estimator of $\sigma^2$ is:
$S_P^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} = \frac{\sum (X_{i1} - \bar{X}_1)^2 + \sum (X_{i2} - \bar{X}_2)^2}{n_1 + n_2 - 2}$
Where:
$S_P^2$ = pooled estimate of the variance
$n_1$ = sample size drawn from population 1
$n_2$ = sample size drawn from population 2
$S_1^2$ = sample variance of the sample drawn from population 1
$S_2^2$ = sample variance of the sample drawn from population 2
$n_1 + n_2 - 2$ = pooled degrees of freedom
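Based on the assumption that the population standard deviations are equal, the
standard error of the difference between means is estimated by the sample standard
error of the difference between two sample means, $S_{\bar{X}_1 - \bar{X}_2}$, according to the following
equation:
$S_{\bar{X}_1 - \bar{X}_2} = \sqrt{S_P^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{\frac{S_P^2}{n_1} + \frac{S_P^2}{n_2}}$
The confidence interval is then
$\mu_1 - \mu_2 = (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,\nu}\,S_{\bar{X}_1 - \bar{X}_2}$
Where:
$S_{\bar{X}_1 - \bar{X}_2} = \sqrt{S_P^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
ν = pooled degrees of freedom ($n_1 + n_2 - 2$)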
Example:
1. Two manufacturing companies produce drill tips that are used to cut holes in steel
sheets. A customer wishing to know which drill tips have the longer life purchases
independent samples of $n_1 = 20$ drill tips from company 1 and $n_2 = 15$ drill tips from
company 2. The mean lives of the drill tips are $\bar{x}_1 = 78$ minutes and $\bar{x}_2 = 84$ minutes.
The population variances are unknown but assumed to be equal. The sample variances
are $s_1^2 = 41$ and $s_2^2 = 36$. Construct a 95% confidence interval for $\mu_1 - \mu_2$, assuming that
the two populations are normally distributed.
Solution:
Company One                    Company Two
$n_1 = 20$ drill tips          $n_2 = 15$ drill tips          C = 0.95
$\bar{x}_1 = 78$ minutes       $\bar{x}_2 = 84$ minutes
$S_1^2 = 41$                   $S_2^2 = 36$
i. Calculate the sample standard error of the difference between two means and the
pooled degrees of freedom
$S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{S_P^2}{n_1} + \frac{S_P^2}{n_2}} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
$= \sqrt{\frac{(20 - 1)41 + (15 - 1)36}{20 + 15 - 2}\left(\frac{1}{20} + \frac{1}{15}\right)}$
$= \sqrt{\frac{1{,}283}{33}\left(\frac{1}{20} + \frac{1}{15}\right)}$
= 2.13
ν = $n_1 + n_2 - 2$
= 20 + 15 - 2 = 33
ii. Compute $\alpha/2$ and look up $t_{\alpha/2,\,\nu}$
α = 1-C = 1- 0.95 = 0.05
α/2 = 0.05/2 = 0.025
$t_{\alpha/2,\,\nu} = t_{0.025,\,33} = 2.04$
iii. Construct the confidence interval
$\mu_1 - \mu_2 = (\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,\nu}\,S_{\bar{X}_1 - \bar{X}_2} = (78 - 84) \pm 2.04(2.13) \approx -6 \pm 4.34$
The 95% confidence interval is (-10.34 to -1.66). This interval contains only negative
values, indicating that the drill tips made by company 1 do not last as long, on average,
as those made by company 2.
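The pooled-variance interval above can be reproduced with a short script. This is a sketch under the equal-variance assumption stated in the example; the function name `pooled_ci` is hypothetical, and the critical value 2.04 is passed in as the table value used in the text rather than computed.

```python
from math import sqrt

def pooled_ci(x1, x2, var1, var2, n1, n2, t_crit):
    """Pooled-variance confidence interval for mu1 - mu2 (small samples)."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))                         # standard error
    diff = x1 - x2
    return se, diff - t_crit * se, diff + t_crit * se

# Drill tip example: t table value for 33 degrees of freedom as used in the text
se, low, high = pooled_ci(78, 84, 41, 36, 20, 15, t_crit=2.04)
print(round(se, 2), round(low, 2), round(high, 2))
```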
2. Five-year-old children were being studied to determine whether children whose parents are
college graduates watched more or less TV than children whose parents are not college
graduates. Independent random samples of 21 children were selected from each
population. The sample means and variances were $\bar{x}_1 = 22$ hrs, $s_1^2 = 16$, $\bar{x}_2 = 26$ hrs, and $s_2^2 = 14$.
The population variances are assumed to be equal and the populations are assumed to
be normal. Calculate the 95% confidence interval for the difference between the two
population means.
Solution:
College graduates' children    Non-college graduates' children
$n_1 = 21$ children            $n_2 = 21$ children            C = 0.95
$\bar{x}_1 = 22$ hours         $\bar{x}_2 = 26$ hours
$S_1^2 = 16$                   $S_2^2 = 14$
i. Calculate the sample standard error of the difference between two means and the
pooled degrees of freedom
$S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{S_P^2}{n_1} + \frac{S_P^2}{n_2}} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
$= \sqrt{\frac{(20)16 + (20)14}{40}\left(\frac{1}{21} + \frac{1}{21}\right)} = \sqrt{15\left(\frac{2}{21}\right)} = 1.195$
ν = 21 + 21 - 2 = 40
ii. With α/2 = 0.025, $t_{0.025,\,40} = 2.021$
iii. $(22 - 26) \pm 2.021(1.195) = -4 \pm 2.42$
$-6.42 \leq \mu_1 - \mu_2 \leq -1.58$
We state with 95% confidence that the difference between the two population means
lies between -6.42 and -1.58.
Children whose parents are college graduates watched less TV, on average, than children
whose parents are not college graduates.
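proportions $p_1 - p_2$
We know that the unbiased estimator of the difference between the proportions of two
populations, $p_1 - p_2$, is the difference between two sample proportions, $\bar{p}_1 - \bar{p}_2$, where
each sample is a random sample taken from the respective target population.
Moreover, based on the CLT, if $n_1 p_1$, $n_1 q_1$ and $n_2 p_2$, $n_2 q_2$ are greater than 5, the sampling
distribution of $\bar{p}_1 - \bar{p}_2$ is approximately normal with standard deviation
$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$
However, here $p_1$ and $p_2$ are unknown, and we want to estimate $p_1$ and $p_2$ by $\bar{p}_1$ and $\bar{p}_2$
respectively, and hence Z becomes:
$Z = \frac{(\bar{p}_1 - \bar{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}}}$
That is, $\sigma_{\bar{p}_1 - \bar{p}_2}$ is replaced by $S_{\bar{p}_1 - \bar{p}_2}$.
Solving for $p_1 - p_2$ gives the confidence interval:
$p_1 - p_2 = (\bar{p}_1 - \bar{p}_2) \pm Z_{\alpha/2} \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}}$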
Where:
$\bar{p}_1$ = the sample proportion of success in the first sample
$\bar{p}_2$ = the sample proportion of success in the second sample
$\bar{q}_1 = 1 - \bar{p}_1$
$\bar{q}_2 = 1 - \bar{p}_2$
$n_1$ = sample size drawn from the first population
$n_2$ = sample size drawn from the second population
α = 1 - C
Example:
1. A TV executive is interested in determining if the proportion of people who watch a
late-night talk show is higher with the regular host or a guest host. In a random sample
of 400 people, 175 watch the show when the regular host is on. In an independent
random sample of 500 people, 185 watch the show when a guest host is on. Calculate a 95%
confidence interval for $p_1 - p_2$. What do you conclude?
Solution:
Regular host                                Guest host
$n_1 = 400$    $\bar{p}_1 = 0.4375$         $n_2 = 500$    $\bar{p}_2 = 0.37$
$X_1 = 175$    $\bar{q}_1 = 0.5625$         $X_2 = 185$    $\bar{q}_2 = 0.63$
C = 0.95
i. Calculate the sample standard error of the difference between two proportions
$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}} = \sqrt{\frac{0.4375 \times 0.5625}{400} + \frac{0.37 \times 0.63}{500}} = 0.033$
ii. Compute $\alpha/2$
α = 1-C = 1- 0.95 = 0.05
α/2 = 0.05/2 = 0.025
iii. Look up $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. Construct the confidence interval
$p_1 - p_2 = (\bar{p}_1 - \bar{p}_2) \pm Z_{\alpha/2} \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}}$
$= (0.4375 - 0.37) \pm 1.96(0.033)$
= 0.0675 ± 0.065
$0.0025 \leq p_1 - p_2 \leq 0.1325$
We state with 95% confidence that the true difference $p_1 - p_2$ is between 0.0025
and 0.1325. Since this interval contains only positive values, it is reasonable to say that
the proportion of people who watch the show when the regular host is on is greater than
when the guest host is on.
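The same steps can be sketched in code. This is an illustrative sketch only; the function name `two_prop_ci` is hypothetical. Because it does not round the standard error or the margin of error, its endpoints differ slightly from the hand-calculated 0.0025 and 0.1325.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ci(x1, n1, x2, n2, conf=0.95):
    """Large-sample confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Talk show example: 175 of 400 (regular host) versus 185 of 500 (guest host)
low, high = two_prop_ci(175, 400, 185, 500)
print(round(low, 4), round(high, 4))
```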
2. A city planner claims that home owners tend to have closer ties to their community than
do renters. Thus, home owners are more willing to pay for good schools and
recreational facilities than are renters. In a random sample of 120 home owners, 51
stated that the local tax rates were too high and 69 stated that tax rates were "about
right". In an independent random sample of 200 renters, 70 stated that the tax rates were
too high and 130 thought they were "about right".
a. Find a 99% confidence interval for the difference in the proportions who think that
taxes are too high.
b. Do the data support the city planner's claim?
Solution:
Home owners                                 Renters
$n_1 = 120$    $\bar{p}_1 = 0.425$          $n_2 = 200$    $\bar{p}_2 = 0.35$
$X_1 = 51$     $\bar{q}_1 = 0.575$          $X_2 = 70$     $\bar{q}_2 = 0.65$
C = 0.99
i. Calculate the sample standard error of the difference between two proportions
$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{0.425 \times 0.575}{120} + \frac{0.35 \times 0.65}{200}} \approx 0.056$
ii. With α/2 = 0.005, $Z_{0.005} = 2.57$, so the interval is $(0.425 - 0.35) \pm 2.57(0.056) = 0.075 \pm 0.144$
$-0.069 \leq p_1 - p_2 \leq 0.219$
We state with 99% confidence that the difference between the proportions of home
owners and renters who said that the tax rates are too high lies between -0.069 and
0.219.
Since the confidence interval contains negative, zero, and positive values, we cannot
state with confidence that home owners are more willing to pay for good schools and
recreational facilities than are renters. Hence, the data do not necessarily support
the city planner's claim.
The reason for taking a sample from a population is that it would be too costly to gather
data for the whole population. But collecting sample data also costs money; and the
larger the sample, the higher the cost. To hold cost down, we want to use as small a
sample as possible. On the other hand, we want a sample to be large enough to provide
“good” approximation/estimates of population parameters. Consequently, the question
is “How large should the sample be?”
$n = \left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2$
Example:
1. A gasoline service station shows a standard deviation of Birr 6.25 for the charges made
by the credit card customers. Assume that the station's management would like to
estimate the population mean gasoline bill for its credit card customers to within ±
Birr 1.00. For a 95% confidence level, how large a sample would be necessary?
Solution:
e = Birr 1.00    σ = Birr 6.25    C = 0.95    $Z_{\alpha/2} = Z_{0.025} = 1.96$
$n = \left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2 = \left(\frac{1.96 \times 6.25}{1}\right)^2 = 150.06 \approx 151$ (see footnote 7)
2. The National Travel and Tour Organization (NTO) would like to estimate the mean
amount of money spent by a tourist to within Birr 100 with 95% confidence. If the
amount of money spent by a tourist is considered to be normally distributed with a
standard deviation of Birr 200, what sample size would be necessary for the NTO to meet
its objective in estimating this mean amount?
Solution:
e = Birr 100    σ = Birr 200    C = 0.95    $Z_{\alpha/2} = Z_{0.025} = 1.96$
$n = \left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2 = \left(\frac{1.96 \times 200}{100}\right)^2 = 15.37 \approx 16$
Footnote 7: If a procedure for determining sample size produces a non-integer value, always round to the next
larger integer.
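The two sample-size calculations above can be sketched as follows. The function name `sample_size_mean` is hypothetical; the intermediate rounding to eight decimals is only a guard against floating-point noise before rounding up to the next integer.

```python
from math import ceil

def sample_size_mean(sigma, e, z):
    """Sample size needed to estimate a mean to within +/- e."""
    n_raw = (z * sigma / e) ** 2
    return ceil(round(n_raw, 8))   # always round up to the next larger integer

n_gasoline = sample_size_mean(sigma=6.25, e=1.00, z=1.96)   # gasoline bills
n_tourist = sample_size_mean(sigma=200, e=100, z=1.96)      # tourist spending
print(n_gasoline, n_tourist)
```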
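The confidence interval for p is $P = \bar{p} \pm Z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}}$. The expression $Z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}}$ is called the
error term (e). That is,
$e = Z_{\alpha/2}\sqrt{\frac{\bar{p}\bar{q}}{n}}$; squaring both sides,
$e^2 = Z_{\alpha/2}^2\,\frac{\bar{p}\bar{q}}{n}$; solving for n,
$n = \frac{Z_{\alpha/2}^2\,\bar{p}\bar{q}}{e^2}$
Since we are trying to determine n, we cannot have $\bar{p}$ and $\bar{q}$. Instead, we should use p
and q, so it becomes
$n = \left(\frac{Z_{\alpha/2}}{e}\right)^2 pq$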
Example
1. Suppose that a production facility purchases a particular component part in large lots
from a supplier. The production manager wants to estimate the proportion of defective
parts received from this supplier. She believes that the proportion of defects is no more
than 0.2 and wants to be within 0.02 of the true proportion of defects with a 90% level
of confidence. How large a sample should she take?
Solution:
e = 0.02    p = 0.2    q = 0.8    C = 0.90    $Z_{\alpha/2} = Z_{0.05} = 1.64$
$n = \left(\frac{Z_{\alpha/2}}{e}\right)^2 pq = \left(\frac{1.64}{0.02}\right)^2 (0.2 \times 0.8) = 1075.84 \approx 1076$
2. What is the largest sample size that would be needed in estimating a population
proportion to within ± 0.02, with a confidence coefficient of 0.95?
Solution:
e = 0.02    C = 0.95    $Z_{\alpha/2} = Z_{0.025} = 1.96$
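One way to approach the "largest sample size" question is to note that the product pq = p(1 - p) is largest at p = 0.5, so that value gives the most conservative n. The sketch below assumes the formula $n = (Z_{\alpha/2}/e)^2 pq$ from the derivation above; the function name `sample_size_prop` is hypothetical.

```python
from math import ceil

def sample_size_prop(p, e, z):
    """Sample size needed to estimate a proportion to within +/- e."""
    n_raw = (z / e) ** 2 * p * (1 - p)
    return ceil(round(n_raw, 8))   # guard against floating-point noise, then round up

# p(1 - p) peaks at p = 0.5, which gives the largest required sample
candidates = [i / 100 for i in range(1, 100)]
worst_p = max(candidates, key=lambda p: p * (1 - p))

n_largest = sample_size_prop(worst_p, e=0.02, z=1.96)
n_defects = sample_size_prop(0.2, e=0.02, z=1.64)   # defective parts example
print(worst_p, n_largest, n_defects)
```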
$n = \frac{2\,Z_{\alpha/2}^2\,\sigma^2}{e^2} = 2\left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2$
The above formula suggests that the necessary sample sizes for comparing two sample
means are each twice as large as the required sample size for estimating a single
mean. It is clear that the larger the sample, the more it costs. Thus sample size
formulas can be effective aids in ensuring that a research project's goals are met and
that the cost of sampling is minimized.
Example:
1. A college admissions officer wants to estimate the difference in the average GMAT
scores of men and women. She plans to take a random sample of men and women who
have taken the GMAT at the same time. She wants to be within 10 points of the true
difference in the mean scores of men and women and 95% confident of her results. Past
GMAT test results indicate that the standard deviation of GMAT test scores is about 105
points. How large should the sample sizes be?
Solution:
e = 10 points    σ = 105 points    C = 0.95    $Z_{\alpha/2} = Z_{0.025} = 1.96$    n = ?
$n = 2\left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2 = 2\left(\frac{1.96 \times 105}{10}\right)^2 = 2(423.54) = 847.07 \approx 848$
2. A researcher wants to estimate the difference between the average price of a 21-inch
black and white TV and the average price of a 21-inch color TV set. He believes that the
standard deviation of the price of a 21-inch TV set is about Birr 100. He wants to be 99%
confident of his results and within Birr 20 of the true difference. How large a sample
should he take for each type of television set?
Solution:
e = Birr 20    σ = Birr 100    C = 0.99    $Z_{\alpha/2} = Z_{0.005} = 2.57$    n = ?
$n = 2\left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2$
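The two-sample formula can be evaluated for both examples with a short sketch. The function name `sample_size_two_means` is hypothetical, and the z values are the table values quoted in the solutions.

```python
from math import ceil

def sample_size_two_means(sigma, e, z):
    """Per-group sample size for estimating mu1 - mu2 to within +/- e."""
    n_raw = 2 * (z * sigma / e) ** 2
    return ceil(round(n_raw, 8))   # round up to the next larger integer

n_gmat = sample_size_two_means(sigma=105, e=10, z=1.96)   # GMAT example
n_tv = sample_size_two_means(sigma=100, e=20, z=2.57)     # TV price example
print(n_gmat, n_tv)
```

Under these inputs the TV example works out to 2(12.85)^2 = 330.245, which rounds up to 331 sets of each type.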
CHAPTER THREE
HYPOTHESIS TESTING
Introduction
In Chapter II, Estimation, we used the information obtained in a simple random sample
to construct a confidence interval estimate of the unknown value of a population
parameter. In this chapter, hypothesis testing, we start with an assumed value of a
population parameter; then we use sample evidence to decide whether the assumed
value is unreasonable and should be rejected or whether it should be accepted.
The assumptions we make about the values of population parameters are called
hypotheses. Sample evidence is used to test the reasonableness of hypotheses; hence,
the statistical inferences made in this chapter are referred to as hypothesis testing. A
procedure based on sample evidence and probability theory to determine whether the
hypothesis is a reasonable statement is called hypothesis testing.
In the process of hypothesis testing, the null hypothesis is initially assumed to be true.
The data are gathered and examined to determine whether the evidence against the null
hypothesis is strong enough to reject it. When the researcher is testing an
industry standard or a widely accepted value, the standard or accepted value is
assumed to be true in the null hypothesis. Null in this sense means that nothing is new,
or there is no new value or standard. The burden is then placed on the researcher to
demonstrate through gathered data that the null hypothesis is false.
Ha (or H1) = research (alternative) hypothesis: a statement that is accepted if the sample data provide
enough evidence that Ho is false.
To test these competing statements, or hypotheses, a trial is held. The testimony and
evidence obtained during the trial provide the sample information. If the sample
information is not inconsistent with the assumption of innocence, the null hypothesis
that the defendant is innocent cannot be rejected. However, if the sample information is
inconsistent with the assumption of innocence, the null hypothesis will be rejected. In
this case, action will be taken based upon the alternative hypothesis that the defendant
is guilty.
Example
1. The manager of a hotel has stated that the mean guest bill for a weekend is Birr 400 or
less. A member of the hotel's accounting staff has noticed that the total charges for
guest bills have been increasing in recent months. The accountant will use a sample of
weekend guest bills to test the manager's claim.
Solution:
Ho: $\mu \leq 400$
Ha: $\mu > 400$
2. Production workers at XY Company have been trained in their jobs by using two
different training programs. The company training director would like to know
whether there is a difference in mean productivity for workers trained in the two
programs.
Required: Develop the null and alternative hypotheses.
Solution
Ho: $\mu_1 = \mu_2$ or $\mu_1 - \mu_2 = 0$
Ha: $\mu_1 \neq \mu_2$ or $\mu_1 - \mu_2 \neq 0$
3. The manager at a drugstore claims that the company’s employees are honest. However,
there have been many shortages from the cash register lately.
Solution:
Ho: Employees are honest
Ha: Employees are dishonest
In many situations, the choice of Ho and Ha is not obvious; in such cases, judgment on
the part of the user is needed to select the proper form of Ho and Ha. However, the
equality part of the expression (either =, ≤ or ≥) always appears in the null hypothesis.
Type I Error
In hypothesis testing, sample evidence is used to test the null hypothesis Ho.
Occasionally the sample data gathered in the research process lead to a decision to reject a
null hypothesis when actually it is true. A Type I error is committed when a true null
hypothesis is rejected. In short, rejecting a true Ho is called a Type I error. The
probability of committing a Type I error is represented by alpha (α), or the level of
significance. Alpha is sometimes referred to as the amount of risk taken in an
experiment. Alpha represents the proportion of the area of the curve occupied by the
rejection region. The most commonly used values of α are 0.001, 0.01, 0.05 and 0.10.
The larger the area of the rejection region, the greater is the risk of committing a Type I
error.
Type II Error
A Type II error is committed by failing to reject a false null hypothesis. That is to say,
accepting a null hypothesis when it is false is called a Type II error. The
probability of committing a Type II error is represented by beta (β).
Alpha (α) is determined before the experiment; beta (β), however, is computed using
alpha, the hypothesized parameter, and various theoretical alternatives to the null
hypothesis.
                   State of Nature (Null Hypothesis)
Decision           Ho True               Ho False
Accept Ho          Correct Decision      Type II Error
Reject Ho          Type I Error          Correct Decision
There is a tradeoff between alpha and beta (Type I and Type II errors). The probability
of making one type of error can be reduced only if we are willing to increase the
probability of making the other type of error. However, this does not mean that α + β = 1;
rather, it means that the smaller α is, the larger β will be, and the larger α is, the smaller
β will be.
Leads to a two-tailed test    Leads to a right-tailed test    Leads to a left-tailed test
A right-tailed test will reject the null hypothesis if the sample statistic is significantly
higher than the hypothesized population parameter.
A left-tailed test will reject the null hypothesis if the sample statistic is significantly
lower than the hypothesized population parameter.
2. Select the test statistic that will be used to decide whether or not to reject the null
hypothesis.
E.g. Z-distribution, t-distribution, F-distribution, $\chi^2$-distribution
3. Select the level of significance to determine the critical values and develop the
rejection rule that indicates the values of the test statistic that will lead to the
rejection of Ho.
E.g. α = 0.05, $Z_{0.025} = 1.96$; reject Ho if |sample Z| > 1.96
4. Collect the sample data, and compute the value of the test statistic. A test statistic is a
random variable whose value is used to determine whether we reject the null
hypothesis.
E.g. Sample Z = 2.0
5. Compare the value of the test statistic to the critical value(s) and make the decision
(either reject Ho or accept Ho/do not reject Ho).
use Z-Value to test the hypothesis; regardless of the sample size, n. It is also applicable
1. Matador-Addis Tyre Share Company claims that its tires have a mean life of 35,000
miles. A random sample of 16 of these tires is tested, and the sample mean is 33,000 miles.
Assume that the population standard deviation is 3,000 miles and the lives of tires are
approximately normally distributed. Test the share company's claim using a 5% level
of significance.
Solution
1. Ho: μ = 35,000 miles
Ha: μ ≠ 35,000 miles
2. Z-distribution, two-tailed test
3. α = 0.05
α/2 = 0.025
$Z_{0.025} = \pm 1.96$
Reject Ho if |sample Z| > 1.96
4. $\bar{X}$ = 33,000 miles
σ = 3,000 miles
n = 16 tires
Sample Z = ?
$Z = \frac{33{,}000 - 35{,}000}{3{,}000 / \sqrt{16}} = -2.67$
5. Reject Ho because |-2.67| > 1.96
Similar to the first method, the critical value method uses the formula $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
or $Z = \frac{\bar{x} - \mu}{S / \sqrt{n}}$, depending on the knowledge of the population standard deviation.
However, instead of a calculated Z, a critical $\bar{X}$ value, $\bar{X}_c$, is determined. The critical
value $Z_c$ is inserted into the formula, along with µ and σ. Thus,
$Z_c = \frac{\bar{X}_c - \mu}{\sigma / \sqrt{n}}$.
With the critical value method, most of the computational work is done ahead of time.
In this case, before the sample mean is computed, it is known that a sample mean
of less than 33,530 miles or greater than 36,470 miles (that is, outside 35,000 ± 1.96 × 750)
must be obtained in order to reject the hypothesized population mean. Because the sample
mean for this problem is 33,000 miles,
we reject the null hypothesis. This method is particularly attractive in industrial settings
where standards can be set ahead of time and then quality control technicians can
gather data and compare actual measurements of products to specifications.
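As an illustration of the critical value method, the sketch below solves $Z_c = (\bar{X}_c - \mu)/(\sigma/\sqrt{n})$ for $\bar{X}_c$ using the tyre example's figures (μ = 35,000, σ = 3,000, n = 16, $Z_c$ = ±1.96). The function name `critical_means` is hypothetical.

```python
from math import sqrt

def critical_means(mu0, sigma, n, z_crit):
    """Critical sample means for a two-tailed z test of Ho: mu = mu0."""
    margin = z_crit * sigma / sqrt(n)   # z_crit standard errors
    return mu0 - margin, mu0 + margin

lower, upper = critical_means(mu0=35000, sigma=3000, n=16, z_crit=1.96)
sample_mean = 33000
reject = sample_mean < lower or sample_mean > upper
print(round(lower), round(upper), reject)
```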
For example, in the mean life of Matador-Addis Tyre case, the computed value of z was
-2.67. The Z table lists the probability of a value this large or larger (this small or
smaller) occurring by chance as 0.00379 (0.5 - 0.49621). As this probability is smaller
than α/2, the null hypothesis is rejected.
In order to reject the null hypothesis with the probability method, the probability of
the computed value must be less than α for a one tailed test or less than α/2 for a two
tailed test.
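The probability method can be sketched with the standard library's normal distribution instead of a Z table. The function name `z_test_mean` is hypothetical. Note that the text rounds Z to -2.67 before consulting the table, so its tail probability (0.00379) differs slightly from the unrounded value computed here.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n):
    """Return the z statistic and the one-tail probability beyond it."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    tail = NormalDist().cdf(-abs(z))   # area in the tail beyond |z|
    return z, tail

z, tail = z_test_mean(xbar=33000, mu0=35000, sigma=3000, n=16)
alpha = 0.05
reject = tail < alpha / 2              # two-tailed test: compare with alpha/2
print(round(z, 2), round(tail, 5), reject)
```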
2. A teachers' union is on strike for higher wages. The union claims that the mean salary
for teachers is at most Birr 8,400 per year. The legislator does not want to reject the
union's claim, however, unless the evidence is very strong against it. Assume that
salaries follow a normal distribution and the population standard deviation is known to
be Birr 3,000. A random sample of 64 teachers is obtained, and the sample mean is Birr
9,400. Test whether the state legislator should accept the union's claim at the 1% significance level.
Solution:
1. Ho: μ ≤ Birr 8,400
Ha: μ > Birr 8,400
2. Z-distribution, right-tailed test
3. α = 0.01
$Z_{\alpha} = Z_{0.01} = 2.33$
Reject Ho if sample Z > +2.33
4. n = 64
$\bar{X}$ = Birr 9,400
σ = Birr 3,000
Sample Z = ?
$Z = \frac{9{,}400 - 8{,}400}{3{,}000 / \sqrt{64}} = 2.67$
5. Reject Ho because +2.67 > 2.33
3. A fertilizer company claims that the use of its product will result in a yield of at least 35
quintals of wheat per hectare, on average. Application of the fertilizer to a randomly
selected sample of 36 hectares resulted in a yield of 34 quintals per hectare. Assume the
population standard deviation is 5 quintals and yields per hectare are normally
distributed. Test the company's claim at the 1% level of significance.
Solution
1. Ho: μ ≥ 35 quintals
Ha: μ < 35 quintals
2. Z-distribution, left-tailed test
3. α = 0.01
$Z_{\alpha} = Z_{0.01} = 2.33$
Reject Ho if sample Z < -2.33
4. $\bar{X}$ = 34 quintals
n = 36
σ = 5 quintals
Sample Z = ?
$Z = \frac{34 - 35}{5 / \sqrt{36}} = -1.20$
5. Do not reject Ho because -1.20 > -2.33
4. A survey of college graduates showed that the average yearly cash income for these
graduates is at least Birr 12,000. In Addis, where you live, this average does not seem
possible, so you decide to test this claim. You randomly select 48 graduates who are
working. The sample average income for these working graduates is Birr 11,400 with a
standard deviation of Birr 2,280. Is there enough evidence from this sample data to
reject the national claim for your area as being too high? Use α = 0.10.
Solution
1. Ho: μ ≥ Birr 12,000
Ha: μ < Birr 12,000
2. Z-distribution, left-tailed test
3. α = 0.1
$Z_{\alpha} = Z_{0.1} = -1.28$
Reject Ho if sample Z < -1.28
4. $\bar{X}$ = Birr 11,400
S = Birr 2,280
n = 48
Sample Z = ?
$Z = \frac{11{,}400 - 12{,}000}{2{,}280 / \sqrt{48}} = -1.82$
5. Reject Ho because -1.82 < -1.28
1. A contractor assumes that construction workers are idle for 75 minutes or less per day.
A random sample of 25 construction workers was taken and the mean idle time was
found to be 84 minutes per day with a sample standard deviation of 20 minutes.
Assume that the population is approximately normally distributed, and use a 5% level of
significance to test the contractor's assumption.
Solution
1. Ho: μ ≤ 75 minutes
Ha: μ > 75 minutes
2. t-distribution, right-tailed test
3. α = 0.05
n = 25
ν = n - 1 = 25 - 1 = 24
$t_{\alpha,\,\nu} = t_{0.05,\,24} = 1.711$
Reject Ho if sample t > 1.711
4. n = 25
$\bar{X}$ = 84 minutes
S = 20 minutes
Sample t = ?
$t = \frac{84 - 75}{20 / \sqrt{25}} = 2.25$
5. Reject Ho because +2.25 > 1.711. Workers are idle for more than 75 minutes per day.
2. A director of a secretarial school claims that its graduates can type at least 50 words per
minute on average. Suppose you want to hire some of these graduates if the director's
claim is true; you test the typing speed of 18 of the graduates and obtain a mean of
40 words per minute with a sample variance of 720. Assuming the typing speed for the
graduates of the secretarial school is normally distributed, test the director's claim and
decide whether to hire the graduates or not, using a 5% level of significance.
Solution
1. Ho: μ ≥ 50 words
Ha: μ < 50 words
2. t-distribution, left-tailed test
3. α = 0.05
n = 18
ν = n - 1
= 18 - 1 = 17
$t_{\alpha,\,\nu} = t_{0.05,\,17} = 1.74$
Reject Ho if sample t < -1.74
4. $\bar{X}$ = 40 words
n = 18
$s^2$ = 720
Sample t = ?
$t = \frac{40 - 50}{\sqrt{720} / \sqrt{18}} = -1.58$
5. Do not reject Ho because -1.58 > -1.74
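The sample t values in the two examples above can be checked with a small sketch. The function name `t_statistic` is hypothetical; the critical values are the table values quoted in the solutions and are not computed here.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic: (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / sqrt(n))

# Contractor example: right-tailed test, table value t(0.05, 24) = 1.711
t_idle = t_statistic(xbar=84, mu0=75, s=20, n=25)
reject_idle = t_idle > 1.711

# Secretarial school example: left-tailed test, table value t(0.05, 17) = 1.74
t_typing = t_statistic(xbar=40, mu0=50, s=sqrt(720), n=18)
reject_typing = t_typing < -1.74

print(round(t_idle, 2), reject_idle, round(t_typing, 2), reject_typing)
```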
HYPOTHESIS TESTING ABOUT A POPULATION PROPORTION (P)
A proportion is a value between 0 and 1 that expresses the part of the whole that
possesses a given characteristic.
The formulas (methods) for proportions based on the central limit theorem make
possible the testing of hypotheses about the population proportion in a manner similar
to that of the formula used to test sample means. Similar to hypothesis testing
about a population mean, hypothesis testing about a population proportion has three
forms.
The first form is a two-tailed test, whereas the second and third forms are one-tailed
tests. The specific form used depends upon the application of interest.
The central limit theorem applied to sample proportions states that $\bar{p}$ values are
approximately normally distributed, with a mean of P and a standard deviation of
$\sigma_{\bar{p}} = \sqrt{\frac{PQ}{n}}$, when nP
and nQ are greater than or equal to 5. If nP and nQ are greater than or equal to 5, a Z-test
is used to test hypotheses about P.
$Z = \frac{\bar{p} - P}{\sigma_{\bar{p}}}$
Example:
1. A magazine claims that 25% of its readers are college students. A random sample of
200 readers is taken. It is found that 42 of these readers are college students. Use a 10%
level of significance and test the magazine's claim.
Solution
1. Ho: P = 0.25
Ha: P ≠ 0.25
2. Z-distribution, two-tailed test
3. α = 0.1, α/2 = 0.05
$Z_{\alpha/2} = Z_{0.05} = 1.64$
Reject Ho if |sample Z| > 1.64
4. n = 200
x = 42
$\bar{p}$ = 0.21
Sample Z = ?
$Z = \frac{0.21 - 0.25}{\sqrt{\frac{0.25 \times 0.75}{200}}} = -1.31$
5. Do not reject Ho because |-1.31| < 1.64
2. An economist states that more than 35% of Addis's labor force is unemployed. You
don't know if the economist's estimate is too high or too low. Thus, you want to test the
economist's claim using a 5% level of significance. You obtain a random sample of 400
people in the labor force, of whom 128 are unemployed. Would you reject the
economist's claim?
Solution
1. Ho: P ≥ 0.35
Ha: P < 0.35
2. Z-distribution, left-tailed test
3. α = 0.05
$Z_{0.05} = 1.64$
Reject Ho if sample Z < -1.64
4. n = 400
x = 128
$\bar{p}$ = 0.32
Sample Z = ?
$Z = \frac{0.32 - 0.35}{\sqrt{\frac{0.35 \times 0.65}{400}}} = -1.26$
5. Do not reject Ho because -1.26 > -1.64
3. A survey of the morning beverage market has shown that the primary breakfast
beverage for 60% of Ethiopian town and city dwellers is tea. The Ethiopian Coffee and Tea
Authority believes that the figure is higher for Addis. To test this idea, one of the
employees of the Ethiopian Coffee and Tea Authority contacts a random sample of 500
residents in Addis and asks which primary beverage they consumed for breakfast that
day. Suppose 325 replied that tea was the primary beverage. Using a 0.01 level of
significance, test the idea that the tea figure is higher for Addis.
Solution
1. Ho: P = 0.60
Ha: P > 0.60
2. Z-distribution, right-tailed test
3. α = 0.01
$Z_{0.01} = 2.33$
Reject Ho if sample Z > 2.33
4. n = 500
X = 325
$\bar{p}$ = 0.65
Sample Z = ?
$Z = \frac{0.65 - 0.60}{\sqrt{\frac{0.60 \times 0.40}{500}}} = 2.28$
5. Do not reject Ho because 2.28 < 2.33
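The sample Z values for the proportion examples can be reproduced with the sketch below. The function name `prop_z` is hypothetical. The standard error uses the hypothesized proportion (for instance 0.60 × 0.40 in the tea example), since the null hypothesis is assumed true when the statistic is computed.

```python
from math import sqrt

def prop_z(x, n, p0):
    """z statistic for a one-sample test about a population proportion."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)   # standard error under the null hypothesis
    return (p_hat - p0) / se

z_magazine = prop_z(42, 200, 0.25)    # two-tailed, critical value 1.64
z_labor = prop_z(128, 400, 0.35)      # left-tailed, critical value -1.64
z_tea = prop_z(325, 500, 0.60)        # right-tailed, critical value 2.33
print(round(z_magazine, 2), round(z_labor, 2), round(z_tea, 2))
```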
null hypothesis can take three forms along with the corresponding alternative
hypothesis.
For large sample sizes the sampling distribution of the difference between two sample
means is normally distributed, with a Z-test statistic
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Whenever $n_1$ and $n_2 \geq 30$, we can use $S_1^2$ and $S_2^2$ as estimates of $\sigma_1^2$ and $\sigma_2^2$ to
compute Z if $\sigma_1^2$ and $\sigma_2^2$ are unknown, and Z will be computed as
$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
In most hypothesis tests about two means, the hypothesized difference is zero.
Example:
1. Is there any difference between the average salary of a legal secretary and a medical
secretary? In an effort to answer that question, a researcher takes a random sample of 33
legal secretaries across a region, resulting in a sample average annual salary of Birr
20,000 with a standard deviation of Birr 1,550. The researcher then takes a random
sample of 35 medical secretaries across the region, which yields an average annual
salary of Birr 18,500 with a standard deviation of Birr 2,100. Use α = 0.01 to test this
question.
Solution
1. Ho: $\mu_1 - \mu_2 = 0$
Ha: $\mu_1 - \mu_2 \neq 0$
2. Z-distribution, two-tailed test
3. α = 0.01, α/2 = 0.005
$Z_{\alpha/2} = Z_{0.005} = 2.57$
4. $Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} = \frac{(20{,}000 - 18{,}500) - 0}{\sqrt{\frac{1{,}550^2}{33} + \frac{2{,}100^2}{35}}} = 3.36$
5. Reject Ho because |sample Z| = 3.36 > 2.57. There is a difference in the average
annual salary of legal and medical secretaries.
2. A firm is studying the delivery times for two raw material suppliers. The firm is
basically satisfied with supplier A and is prepared to stay with this supplier provided
that the mean delivery time is the same as or less than that of supplier B. However, if the
firm finds that the mean delivery time from supplier B is less than that of supplier A, it
will begin making raw material purchases from supplier B.
a. What are the null and alternative hypotheses for this situation?
b. Assume that independent samples show the following delivery times for the two
suppliers.
Supplier A                     Supplier B
$n_1 = 50$                     $n_2 = 30$
$\bar{X}_1$ = 14 days          $\bar{X}_2$ = 12.5 days
$S_1$ = 3 days                 $S_2$ = 2 days
Using α = 0.05, what is your conclusion for the hypotheses from part (a)? What action
do you recommend in terms of supplier selection?
Solution
1. Ho: $\mu_1 - \mu_2 \leq 0$
Ha: $\mu_1 - \mu_2 > 0$
2. Z-distribution, right-tailed test
3. α = 0.05
$Z_{\alpha} = Z_{0.05} = 1.64$
Reject Ho if sample Z > 1.64
4. Supplier A                  Supplier B
$n_1 = 50$                     $n_2 = 30$
$\bar{X}_1$ = 14 days          $\bar{X}_2$ = 12.5 days
$S_1$ = 3 days                 $S_2$ = 2 days
Sample Z = ?
$Z = \frac{(14 - 12.5) - 0}{\sqrt{\frac{3^2}{50} + \frac{2^2}{30}}} = 2.68$
5. Reject Ho because 2.68 > 1.64. The mean delivery time of supplier B appears to be less than
that of supplier A, so the firm should begin purchasing from supplier B.
3. In a wage discrimination case involving male and female employees, it is assumed that
male employees have a mean salary less than or equal to that of female employees. To
test this, independent random samples of male and female employees were taken
and the following results obtained.
Male employees                 Female employees
$n_1 = 100$                    $n_2 = 100$
$\bar{X}_1$ = Birr 20,600      $\bar{X}_2$ = Birr 19,700
$S_1$ = Birr 3,000             $S_2$ = Birr 2,500
Test the hypothesis with α = 0.025. Does wage discrimination appear to exist in this
case?
Solution
1. Ho: $\mu_1 - \mu_2 \leq 0$
Ha: $\mu_1 - \mu_2 > 0$
2. Z-distribution, right-tailed test
3. α = 0.025
$Z_{\alpha} = Z_{0.025} = 1.96$
Reject Ho if sample Z > 1.96
4. Male employees              Female employees
$n_1 = 100$                    $n_2 = 100$
$\bar{X}_1$ = Birr 20,600      $\bar{X}_2$ = Birr 19,700
$S_1$ = Birr 3,000             $S_2$ = Birr 2,500
Sample Z = ?
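The sample Z for the wage discrimination example can be worked out with the large-sample formula given earlier. The sketch below is illustrative; the function name `two_mean_z` is hypothetical, and the hypothesized difference defaults to zero as in the stated null hypothesis.

```python
from math import sqrt

def two_mean_z(x1, x2, s1, s2, n1, n2, d0=0.0):
    """Large-sample z statistic for Ho about mu1 - mu2 (hypothesized difference d0)."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((x1 - x2) - d0) / se

z_wage = two_mean_z(20600, 19700, 3000, 2500, 100, 100)
reject_wage = z_wage > 1.96            # right-tailed test at alpha = 0.025
print(round(z_wage, 2), reject_wage)
```

Under these figures the statistic is about 2.30, which exceeds 1.96, so Ho would be rejected; this is consistent with the confidence interval found for the same data in the estimation chapter.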
t-distribution: used when the populations are normal, $\sigma_1$ and $\sigma_2$ are unknown, and $n_1$ and/or $n_2$
< 30. The unknown population standard deviations are approximated by sample
standard deviations as:
$\sigma_{\bar{X}_1 - \bar{X}_2} \approx S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{S_1^2 (n_1 - 1) + S_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$; where df = $n_1 + n_2 - 2$
This approximation is based on the assumption that the two population standard
deviations are equal.
If $\sigma_1$ and $\sigma_2$ are not equal, the sample standard error of the difference between two
Example:
1. A marketing research firm wishes to know if the mean number of hours of TV viewing per
week is the same for teenage boys and teenage girls using a 5% level of significance.
The unknown population variances are assumed to be equal. The following data were
2. A time-and-motion study is conducted to test whether the mean length of time required
to perform a certain task is lesser for employees on the day shift than for employees on
the night shift. The data are as follows.
Day shift Night Shift
n1 = 10 n2 = 8
X 1 = 20 hrs X 2 = 29 hrs
S1 = 64 hrs
2
S22 = 50 hrs
72
Use a 1% level of significance to test the hypothesis. Assume the populations are
approximately normal, the population variances are equal, and the samples are
independent.
Solution
1. Ho: µ1 - µ2 ≥ 0
   Ha: µ1 - µ2 < 0
2. t – distribution, left-tailed test
3. α = 0.01, ν = 10 + 8 - 2 = 16
4. tα,ν = t0.01,16 = 2.583
   Reject Ho if sample t < -2.583
   Sample t = ?
   $t = \frac{(20 - 29) - 0}{\sqrt{\frac{64(10 - 1) + 50(8 - 1)}{10 + 8 - 2}} \sqrt{\frac{1}{10} + \frac{1}{8}}} = -2.49$
5. Do not reject Ho, because -2.49 is not less than -2.583. At the 1% level of
significance, the data do not show that the mean length of time required to perform the
task is less for day-shift employees than for night-shift employees.
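The pooled-variance t statistic for the shift example can be reproduced with a few lines of Python. This is a sketch under the text's assumptions (normal populations, equal variances, independent samples); the helper name `pooled_t` is hypothetical.

```python
from math import sqrt

def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-sample t statistic using a pooled variance estimate."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances
    pooled_var = (var1 * (n1 - 1) + var2 * (n2 - 1)) / df
    se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se, df

# Day shift vs. night shift (sample variances 64 and 50)
t_stat, df = pooled_t(20, 29, 64, 50, 10, 8)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

The magnitude of the result (about 2.49) is below the tabulated t0.01,16 = 2.583, so the sample statistic does not fall in the rejection region.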
To test hypotheses about the difference between two population proportions, we obtain
independent random samples of n1 items from the first population and n2 items from the
second, and calculate $\bar{p}_1 = X_1 / n_1$ and $\bar{p}_2 = X_2 / n_2$. Once we obtain $\bar{p}_1$ and $\bar{p}_2$, we use a test
based on the standard normal distribution.
The first form leads to a two-tailed test while the latter two lead to a one-tailed test.
The Central Limit Theorem applied to the difference between two population
proportions states that $\bar{p}_1 - \bar{p}_2$ values are normally distributed, with a mean of P1 - P2 and a
standard deviation of
$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{P_1 (1 - P_1)}{n_1} + \frac{P_2 (1 - P_2)}{n_2}}$, and $Z = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sqrt{\frac{P_1 q_1}{n_1} + \frac{P_2 q_2}{n_2}}}$.
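However, since the standard error is unknown, it has to be estimated from the sample
data. While we may be tempted to use $\bar{p}_1$ and $\bar{p}_2$ as we did with the interval estimation
procedure, in hypothesis testing we often adjust it to a slightly different form. For the
special case where the hypotheses involve no difference between the population
proportions (i.e. Ho: P1-P2 = 0, Ho: P1-P2 ≥ 0, or Ho: P1-P2 ≤ 0), the standard error is modified to reflect
the fact that when we assume Ho to be true at the equality, we are assuming P1 = P2.
When this occurs, we combine or pool the two sample proportions to provide one
estimate. This pooled estimator, denoted by $\bar{P}$, is calculated as
$\bar{P} = \frac{n_1 \bar{p}_1 + n_2 \bar{p}_2}{n_1 + n_2} = \frac{X_1 + X_2}{n_1 + n_2}$,
and the standard deviation $\sigma_{\bar{p}_1 - \bar{p}_2}$ is estimated by $S_{\bar{p}_1 - \bar{p}_2}$, which is calculated as
$S_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{P}(1 - \bar{P}) \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \sqrt{\bar{P}\bar{Q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \sqrt{\frac{\bar{P}\bar{Q}}{n_1} + \frac{\bar{P}\bar{Q}}{n_2}}$,
and Z is calculated as
$Z = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sqrt{\bar{P}\bar{Q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$. And when P1 = P2, Z becomes $Z = \frac{\bar{p}_1 - \bar{p}_2}{\sqrt{\bar{P}\bar{Q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$.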
Example:
1. In a sample of 400 products produced by machine I, 200 were defective, and in a sample
of 400 products produced by machine II, 170 were defective. Using α = 0.05, test the
hypothesis that the rate of defect is the same for both machine I and machine II.
Solution
1. Ho: P1-P2 = 0
   Ha: P1-P2 ≠ 0
2. Z – distribution, two-tailed test
3. α = 0.05,
   α/2 = 0.025
   Zα/2 = Z0.025 = 1.96
   Reject Ho if sample Z > 1.96 or sample Z < -1.96
4. Sample Z
   $Z = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sqrt{\bar{P}\bar{Q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} = \frac{(0.5 - 0.425) - 0}{\sqrt{0.4625 \times 0.5375 \left( \frac{1}{400} + \frac{1}{400} \right)}} = 2.13$
5. Reject Ho
2. To test the effectiveness of the approach and layout of two direct mail brochures, a
marketing manager of SELAM Inc mailed out 150 copies of each brochure and recorded
the number of responses generated by each. There were 30 responses generated by the
first brochure and 15 generated by the second. Can the marketing manager conclude
that the first brochure is more effective? Use α = 0.05.
Solution
1. Ho: P1-P2 ≤ 0
   Ha: P1-P2 > 0
2. Z – distribution, right-tailed test
3. α = 0.05,
   Zα = Z0.05 = 1.64
   Reject Ho if sample Z > 1.64
4.      Brochure I        Brochure II
        n1 = 150          n2 = 150
        x1 = 30           x2 = 15
   $\bar{P} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{30 + 15}{150 + 150} = 0.15$
   Sample Z
   $Z = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sqrt{\bar{P}\bar{Q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} = \frac{(0.20 - 0.10) - 0}{\sqrt{0.15 \times 0.85 \left( \frac{1}{150} + \frac{1}{150} \right)}} = 2.43$
5. Reject Ho, because 2.43 > 1.64. YES, the marketing manager can conclude that the first brochure
is more effective than the second.
included 50 assembled during the first shift and 50 assembled during the second shift.
Of the Video Cassette Recorders assembled during the first shift 10 were defective; and
20 were defective from the second shift. From the data, would the production foreman
reject the hypothesis that the proportion of defectives assembled by the first shift is
greater than or equal to that for the second shift? Use a 0.05 level of significance.
Solution
1. Ho: P1-P2 ≥ 0
   Ha: P1-P2 < 0
2. Z distribution, left-tailed test
3. α = 0.05,
   Zα = Z0.05 = 1.64
   Reject Ho if sample Z < -1.64
4.      Shift 1           Shift 2
        n1 = 50           n2 = 50
        x1 = 10           x2 = 20
   $\bar{P} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{10 + 20}{50 + 50} = 0.30$
   Sample Z
   $Z = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sqrt{\bar{P}\bar{Q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} = \frac{(0.20 - 0.40) - 0}{\sqrt{0.30 \times 0.70 \left( \frac{1}{50} + \frac{1}{50} \right)}} = -2.18$
5. Reject Ho.
The proportion of defectives assembled by the second shift is greater than that for the
first shift.
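The pooled-proportion Z statistic used in the three examples above can be sketched as a small Python function. The name `pooled_proportion_z` is an assumption made for illustration; the formula is the pooled one given in the text for the case P1 = P2 under Ho.

```python
from math import sqrt

def pooled_proportion_z(x1, n1, x2, n2):
    """Z statistic for Ho: P1 - P2 = 0 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled estimate of the common proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z_machines = pooled_proportion_z(200, 400, 170, 400)   # machine I vs. machine II
z_brochures = pooled_proportion_z(30, 150, 15, 150)    # brochure I vs. brochure II
z_shifts = pooled_proportion_z(10, 50, 20, 50)         # shift 1 vs. shift 2
print(round(z_machines, 2), round(z_brochures, 2), round(z_shifts, 2))
```

Each value can then be compared with the tabulated critical value for the chosen α and the direction of the test.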
Determining the probability of committing a Type II error is more complex than finding
the probability of committing a Type I error. The probability of committing a Type I
error either is given in a problem or is stated by the researcher before proceeding with
the study. The probability of committing a Type II error, β, varies with the possible values of the alternative
parameter.
For example, suppose that a researcher is conducting a statistical test on the following
hypothesis:
Ho: µ ≥ 12 oz
Ha: µ < 12 oz
A Type II error can be committed only when the researcher fails to reject a false null
hypothesis. In these hypotheses, if the null hypothesis, µ ≥ 12 oz, is false, what is the true
value for the population mean? Is the mean really 11.99 oz or 11.90 oz, or 11.50 oz or 10
oz? For each of possible values of the population mean, the researcher can compute the
probability of committing a Type II error. Often, when the null hypothesis is false, the
value of the alternative mean is unknown, so the researcher will compute the
probability of committing Type II errors for several possible values. How can the
probability of committing a Type II error be computed for a specific alternative value of
mean?
Suppose that, in testing the hypotheses above, a sample of 60 cans of beverage yields a
sample mean of 11.985 oz, with a standard deviation of 0.10 oz. For α = 0.05 and a one-tailed
test, the table Z value is -1.64. The calculated Z value is
$Z = \frac{11.985 - 12.00}{0.10 / \sqrt{60}} = -1.16$
From this calculated value of z, the researcher determines not to reject the null
hypothesis. By not rejecting the null hypothesis, the researcher either made a correct
decision or committed a Type II error. What is the probability of committing a Type II
error in this problem if the population mean actually is 11.99?
The first step in determining the probability of a Type II error is to calculate the critical
value for the mean, $\bar{X}_c$. This value is used as a cutoff point for the acceptance region in
testing the null hypothesis. For any sample mean obtained that is less than $\bar{X}_c$ (or greater
for an upper tail rejection region), the null hypothesis is rejected. Any sample mean
greater than $\bar{X}_c$ (or less for an upper tail rejection region) causes the researcher to accept
the null hypothesis. Solving for the critical value of the mean gives
$Z_c = \frac{\bar{X}_c - \mu}{S / \sqrt{n}}$
$-1.64 = \frac{\bar{X}_c - 12.00}{0.10 / \sqrt{60}}$ ;
$\bar{X}_c = 11.979$
From the above computation, we can learn that the null hypothesis will be rejected for a
sample mean value of less than 11.979 oz. Assume that the alternative hypothesis is µa =
11.99 oz. How often will the researcher accept µ = 12 as true when, in reality, µ =
11.99 is true? If the null hypothesis is false, the null hypothesis will be incorrectly
accepted whenever $\bar{X}$ falls in the acceptance region, $\bar{X}$ ≥ 11.979 oz. If µ actually equals
11.99 oz, what is the probability of failing to reject µ = 12 oz when 11.979 oz is the critical
value? The researcher calculates this probability by extending the critical value ($\bar{X}_c$ =
11.979 oz) and finding the area to the right of 11.979.
$Z = \frac{\bar{X}_c - \mu_a}{S / \sqrt{n}} = \frac{11.979 - 11.99}{0.10 / \sqrt{60}} = -0.85$
This value of Z yields an area of 0.3023. The probability of committing a Type II error is
all the area to the right of $\bar{X}_c$ = 11.979, or 0.3023 + 0.5000 = 0.8023. Hence, there is an
80.23% chance of committing a Type II error if the alternative mean is 11.99 oz.
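The two steps just described (find the critical mean, then find the area beyond it under the alternative distribution) can be sketched in Python. This is an illustrative sketch only: `beta_lower_tail` is a hypothetical helper, and it uses `statistics.NormalDist` from the standard library in place of a printed Z table. Because it does not round the critical value or the Z score, its result differs slightly from the table-based 0.8023.

```python
from math import sqrt
from statistics import NormalDist

def beta_lower_tail(mu0, mu_alt, sd, n, z_crit):
    """Type II error probability for Ho: mu >= mu0 (rejection region in the lower tail).

    z_crit is the negative table value, e.g. -1.64 for alpha = 0.05.
    """
    se = sd / sqrt(n)
    x_crit = mu0 + z_crit * se            # critical value of the sample mean
    z_alt = (x_crit - mu_alt) / se        # critical value measured from the alternative mean
    beta = 1 - NormalDist().cdf(z_alt)    # area to the right of the critical value
    return beta, x_crit

beta, x_crit = beta_lower_tail(12.00, 11.99, 0.10, 60, -1.64)
print(f"critical mean = {x_crit:.3f} oz, beta = {beta:.4f}")
```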
With two-tailed tests, both tails of the distribution contain rejection regions. If the null
hypothesis is false, obtaining a calculated statistic falling in the tails results in the
correct decision: to reject the null hypothesis. In this case, the probability of committing
a Type II error exists only for the area between the two critical values (the acceptance
region). However, the right critical value is so far away from the alternative mean that
the area between the right critical value and the mean essentially is 0.5000. Had there
been any area past the upper critical value of Pc (0.46), it would have been subtracted
from 0.5000, slightly reducing the value of 0.7454.
Example:
1. Suppose that you are conducting a two-tailed hypothesis test of proportions. The null
hypothesis is that the population proportion is 0.40. The alternative hypothesis is that
the population proportion is not 0.40. A random sample of 250 produces a sample
proportion of 0.44. Using alpha of 0.05 and assuming that the alternative population
proportion really is 0.36, what is the probability of committing a Type II error?
Solution:
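For an alpha value of 0.05, the table Z value for α/2 is 1.96. Using 1.96, solve for the
critical value of the proportion.
$Z_c = \frac{\bar{p}_c - P}{\sqrt{Pq / n}}$
$\pm 1.96 = \frac{\bar{p}_c - 0.40}{\sqrt{0.4 \times 0.6 / 250}}$
$\bar{p}_c = 0.40 \pm 0.06$
The critical values are 0.34 on the lower end and 0.46 on the upper end. The alternative
population proportion is 0.36.
Solving for the area between the lower critical value and the alternative proportion yields
$Z = \frac{0.34 - 0.36}{\sqrt{0.36 \times 0.64 / 250}} = -0.66$
The area associated with Z = -0.66 is 0.2454. The probability of committing a Type II error
is 0.5000 + 0.2454 = 0.7454.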
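For a two-tailed test the Type II error probability is the area between the two critical values, measured under the alternative distribution. A minimal sketch follows, assuming the critical values are built from the null proportion and the area is measured with the standard error of the alternative proportion, as in the worked solution; the function name is hypothetical. It keeps unrounded critical values, so the result is close to, but not identical with, the 0.7454 obtained from the rounded values 0.34 and 0.46.

```python
from math import sqrt
from statistics import NormalDist

def beta_two_tailed_proportion(p0, p_alt, n, z_crit):
    """Type II error probability for a two-tailed test of a proportion."""
    se_null = sqrt(p0 * (1 - p0) / n)
    lower = p0 - z_crit * se_null          # lower critical proportion
    upper = p0 + z_crit * se_null          # upper critical proportion
    se_alt = sqrt(p_alt * (1 - p_alt) / n)
    std_normal = NormalDist()
    # Area of the acceptance region under the alternative distribution
    return std_normal.cdf((upper - p_alt) / se_alt) - std_normal.cdf((lower - p_alt) / se_alt)

beta = beta_two_tailed_proportion(0.40, 0.36, 250, 1.96)
print(f"beta = {beta:.4f}")
```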
2. Suppose that the null hypothesis is that the population mean is greater than or equal to
100. Suppose further that a random sample of 48 items is taken and the sample standard
deviation is 14. For each of the following alpha values, compute the probability of
committing a Type II error if the population mean actually is 99.
a. α = 0.01
b. α = 0.05
c. α = 0.10
d. Based on the answers to (a), (b), and (c), what happens to the value of β as α gets
larger?
Solution:
a. For an alpha value of 0.01, the table Z value for α is -2.33. Using -2.33, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-2.33 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 95.292$
The critical value is 95.292 on the lower end. The alternative population mean is 99.
Solving the area between $\bar{X}_c$ = 95.292 and µ = 99 yields
$Z = \frac{95.292 - 99}{14 / \sqrt{48}} = -1.83$
The area associated with Z = -1.83 is 0.46638. The probability of committing a Type II
error is 0.5000 + 0.46638 = 0.96638.
b. For an alpha value of 0.05, the table Z value for α is -1.64. Using -1.64, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-1.64 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 96.686$
The critical value is 96.686 on the lower end. The alternative population mean is 99.
c. For an alpha value of 0.10, the table Z value for α is -1.28. Using -1.28, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-1.28 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 97.413$
The critical value is 97.413 on the lower end. The alternative population mean is 99.
d. Based on the answers to (a), (b), and (c), the value of β gets smaller as α gets larger.
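The trade-off in part (d) can be seen by evaluating β for the three α values at once. The sketch below uses an algebraically equivalent shortcut, β = 1 − Φ(Zα + (µ0 − µa)/(s/√n)), which combines the two steps of the worked solution; the dictionary of table Z values and the variable names are illustrative assumptions, and the results are unrounded, so they may differ from table-based figures in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_alt, s, n = 100, 99, 14, 48
se = s / sqrt(n)

# Lower-tail table Z values quoted in the solution
table_z = {0.01: -2.33, 0.05: -1.64, 0.10: -1.28}

betas = {}
for alpha, z_alpha in table_z.items():
    # Critical value expressed as a Z score relative to the alternative mean
    z_alt = z_alpha + (mu0 - mu_alt) / se
    betas[alpha] = 1 - NormalDist().cdf(z_alt)

for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f}  beta = {beta:.4f}")
```

The printed values decrease as α increases, in line with the statement in part (d).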
3. For exercise 2 above, use α = 0.05 and solve for the probability of committing a Type II error
for the following possible true alternative means.
a. µa = 98.5
b. µa = 98
c. µa = 97
d. µa = 96
e. What happens to the probability of committing a Type II error as the
alternative value of the mean gets farther from the null hypothesized value of
100?
Solution:
a. For an alpha value of 0.05, the table Z value for α is -1.64. Using -1.64, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-1.64 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 96.686$
The critical value is 96.686 on the lower end. The alternative population mean is 98.5.
Solving the area between $\bar{X}_c$ = 96.686 and µ = 98.5 yields
$Z = \frac{96.686 - 98.5}{14 / \sqrt{48}} = -0.90$
The area associated with Z = -0.90 is 0.31594. The probability of committing a Type II
error is 0.5000 + 0.31594 = 0.81594.
b. For an alpha value of 0.05, the table Z value for α is -1.64. Using -1.64, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-1.64 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 96.686$
The critical value is 96.686 on the lower end. The alternative population mean is 98.
Solving the area between $\bar{X}_c$ = 96.686 and µ = 98 yields
$Z = \frac{96.686 - 98}{14 / \sqrt{48}} = -0.65$
The area associated with Z = -0.65 is 0.24215. The probability of committing a Type II
error is 0.5000 + 0.24215 = 0.74215.
c. For an alpha value of 0.05, the table Z value for α is -1.64. Using -1.64, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-1.64 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 96.686$
The critical value is 96.686 on the lower end. The alternative population mean is 97.
Solving the area between $\bar{X}_c$ = 96.686 and µ = 97 yields
$Z = \frac{96.686 - 97}{14 / \sqrt{48}} = -0.16$
The area associated with Z = -0.16 is 0.06356. The probability of committing a Type II
error is 0.5000 + 0.06356 = 0.56356.
d. For an alpha value of 0.05, the table Z value for α is -1.64. Using -1.64, solve for the
critical value of the mean.
$Z_c = \frac{\bar{X}_c - \mu}{s / \sqrt{n}}$
$-1.64 = \frac{\bar{X}_c - 100}{14 / \sqrt{48}}$
$\bar{X}_c = 96.686$
The critical value is 96.686 on the lower end. The alternative population mean is 96.
Solving the area between $\bar{X}_c$ = 96.686 and µ = 96 yields
$Z = \frac{96.686 - 96}{14 / \sqrt{48}} = +0.34$
The area associated with Z = +0.34 is 0.13307. The probability of committing a Type II
error is 0.5000 - 0.13307 = 0.36693.
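Parts (a) to (d) repeat the same calculation for different alternative means, so they lend themselves to a loop. The following sketch, with a hypothetical helper `beta_for_alternative`, tabulates β against the alternative mean for α = 0.05; it works with unrounded intermediate values, so the figures agree with the table-based answers only to about two decimal places.

```python
from math import sqrt
from statistics import NormalDist

def beta_for_alternative(mu_alt, mu0=100, s=14, n=48, z_crit=-1.64):
    """Type II error probability for a lower-tail test at a given alternative mean."""
    se = s / sqrt(n)
    x_crit = mu0 + z_crit * se                         # same critical mean in every part
    return 1 - NormalDist().cdf((x_crit - mu_alt) / se)

alternatives = [98.5, 98, 97, 96]
betas = [beta_for_alternative(mu) for mu in alternatives]
for mu, beta in zip(alternatives, betas):
    print(f"alternative mean = {mu:>5}  beta = {beta:.4f}")
```

The output shows β falling steadily as the alternative mean moves farther below the hypothesized value of 100.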
CHAPTER FOUR
CHI-SQUARE DISTRIBUTIONS
Having the above characteristics, the χ² distribution has the following areas of application:
1. Testing for the equality of several proportions
2. Test for independence between two variables
3. Goodness of fit tests (Binomial, Normal, and Poisson )
Example:
1. A company planning a TV advertising campaign wants to determine which TV shows
its target audience watches and thereby to know whether the choice of TV program an
individual watches is independent of the individual's income. The table supporting this
is shown below. Use a 5% level of significance and test the null hypothesis.
Medium 90 67 43 200
High 17 13 20 50
Solution
1. Ho: Choice of TV program an individual watches is independent of the individual's
income
Ha: Income and Choice of TV program are not independent
2. Decision rule
α = 0.05
ν = (R-1)(C-1) [8]
= (3-1)(3-1)
= 4
χ²α,ν = χ²0.05,4 = 9.49
e11 = 250 x 250/500 = 125    e21 = 200 x 250/500 = 100    e31 = 50 x 250/500 = 25
[8] For the RxC contingency table, the degrees of freedom are calculated as (R-1)(C-1). The degrees of
freedom refer to the number of expected frequencies that can be chosen freely provided the row and
column totals of expected frequencies are identical to the row and column totals of the observed
frequency table.
A test of the null hypothesis that variables are independent of one another is based on
the magnitudes of the differences between the observed frequencies and the expected
frequencies. Large differences between oij and eij provide evidence that the null
hypothesis is false. The test is based on the following Chi-square test statistic.
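$\chi^2 = \sum \sum \frac{(O_{ij} - e_{ij})^2}{e_{ij}}$ or $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
Where:
Oij (fo) = observed frequency for contingency table category in row i and column j.
eij (fe) = expected frequency for contingency table category in row i and column j.
$\chi^2 = \frac{(143 - 125)^2}{125} + \frac{(70 - 75)^2}{75} + \frac{(37 - 50)^2}{50} + \frac{(90 - 100)^2}{100} + \frac{(67 - 60)^2}{60} + \frac{(43 - 40)^2}{40} + \frac{(17 - 25)^2}{25} + \frac{(13 - 15)^2}{15} + \frac{(20 - 10)^2}{10} = 21.174$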
4. Reject the null hypothesis that choice of TV program is independent from income
level.
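The test of independence can be sketched in plain Python: expected frequencies come from (row total x column total)/n, and the statistic sums (fo − fe)²/fe over all cells. The function name `chi_square_independence` is illustrative. The observed counts are the nine values that appear in the computation above; the assignment of the first row to the lowest income group is an assumption, since only two rows of the table are shown.

```python
def chi_square_independence(observed):
    """Chi-square statistic and degrees of freedom for an R x C contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n   # expected frequency e_ij
            chi2 += (f_o - f_e) ** 2 / f_e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

observed = [
    [143, 70, 37],   # first income row (assumed to be the low-income group)
    [90, 67, 43],    # Medium
    [17, 13, 20],    # High
]
chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.3f}, df = {df}")
```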
2. A human resource manager at EAGLE Inc. was interested in knowing whether the
voluntary absence behavior of the firm’s employees was independent of marital status.
The employee files contained data on marital status and on voluntary absenteeism
behavior; the data for a sample of 500 employees are shown below.
                              Marital Status
Absence behavior    Married   Divorced   Widowed   Single   Total
Often absent           36        16         14        34      100
Seldom absent          64        34         20        82      200
Never absent           50        50         16        84      200
Total                 150       100         50       200      500
Solution
1. Ho: Voluntary absence behavior is independent of marital status
                 Applicant Status
           Selected   Not selected   Total
Male           7           33          40
Female         5           35          40
Total         12           68          80
Solution
a.
1. Ho: There is no selection bias in favor of males. (Selection status and gender of the
applicant are independent).
Ha: There is selection bias in favor of males. (Selection status and gender of the
applicant are not independent).
2. α = 0.1
ν = (R-1)(C-1)
= (2-1)(2-1) = 1
χ²α,ν = χ²0.1,1 = 2.71
Reject Ho if sample χ² > 2.71
3. Sample χ²
Observed freq (fo)   Expected freq (fe)   (fo-fe)²   (fo-fe)²/fe
        7                    6                1         0.1667
       33                   34                1         0.0294
        5                    6                1         0.1667
       35                   34                1         0.0294
$\sum \frac{(f_o - f_e)^2}{f_e} = 0.3922$
c. There is no shortcut method to answer this question. Therefore, let's proceed by trial: increase the
number of male applicants who are accepted and decrease the number of female
applicants who are accepted, one at a time, and recompute the sample χ² each time.
1. Ho: There is no selection bias in favor of males. (Selection status and gender of the
applicant are independent).
Ha: There is selection bias in favor of males. (Selection status and gender of the
applicant are not independent).
2. α = 0.1
ν = (R-1)(C-1)
= (2-1)(2-1) = 1
χ²α,ν = χ²0.1,1 = 2.71
Therefore, 8 male and 4 female applicants must be hired for the 12 open positions so as
to avoid selection bias in favor of males.
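The trial-and-error search described in part (c) can be automated. The sketch below assumes, as in the table above, 40 male and 40 female applicants and 12 positions, and looks for the largest number of selected males whose sample χ² stays at or below the critical value of 2.71; the function and variable names are hypothetical.

```python
def chi_square_2x2(male_selected, total_selected=12, males=40, females=40):
    """Chi-square statistic for the selection table with a given number of selected males."""
    female_selected = total_selected - male_selected
    observed = [
        [male_selected, males - male_selected],
        [female_selected, females - female_selected],
    ]
    n = males + females
    row_totals = [males, females]
    col_totals = [total_selected, n - total_selected]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            f_e = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - f_e) ** 2 / f_e
    return chi2

CRITICAL_VALUE = 2.71  # chi-square table value for alpha = 0.1, 1 degree of freedom

# Try 6 (an even split) up to 12 selected males
results = {m: chi_square_2x2(m) for m in range(6, 13)}
largest_without_rejection = max(m for m, chi2 in results.items() if chi2 <= CRITICAL_VALUE)
print(largest_without_rejection, round(results[largest_without_rejection], 4))
```

With 8 males selected the statistic is still below 2.71, while with 9 it exceeds it, which matches the conclusion stated in the text.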
The Chi-square test for independence is useful in helping to determine whether a
relationship exists between two variables, but it does not enable us to estimate or
predict the values of one variable based on the value of the other. If it is determined
that a dependence does exist between two quantitative variables, then the techniques of
regression analysis are useful in helping to find a mathematical formula that expresses
the nature of mathematical relationship.
Small expected frequencies can lead to inordinately large chi-square values with the chi-
square test of independence. Hence contingency tables should not be used with
expected cell values of less than 5. One way to avoid small expected values is to
combine columns or rows whenever possible and whenever doing so makes sense.
Ho: P1 = P10; P2 = P20; P3 = P30; ...; Pk = Pk0; and the alternative hypothesis takes the
following form:
Ha: The population proportions are not equal to the hypothesized values
Example:
1. In the business credit institution industry the accounts receivable for companies are
classified as being “current,” “moderately late,” “very late,” and “uncollectible.”
Industry figures show that the ratio of these four classes is 9: 3: 3: 1. ENDURANCE firm
has 800 accounts receivable, with 439, 168, 133, and 60 falling in each class. Are these
proportions in agreement with the industry ratio? Let α = 0.05.
Solution
1. Ho: P1 = 9/16; P2 = 3/16; P3 = 3/16; P4 = 1/16
Ha: One or more of the proportions are not equal to the proportions given in the null
hypothesis.
2. α = 0.05
ν = K - 1 = 4 - 1 = 3
2. ETHIO Plastic Factory sells its products in three primary colors: Red, blue, and yellow.
The marketing manager feels that customers have no color preference for the product.
To test this hypothesis the manager set up a test in which 120 purchasers were given
equal opportunity to buy the product in each of the three colors. The results were that
60 bought red, 20 bought blue, and 40 bought yellow. Test the marketing manager’s
null hypothesis, using α = 0.05.
Solution
1. Ho: People have no color preference with this product; P1 = P2 = P3 = 1/3
Ha: People have color preference with this product
2. α = 0.05
ν = K - 1 = 3 - 1 = 2
χ²α,ν = χ²0.05,2 = 5.99
Reject Ho if sample χ² is greater than 5.99.
3. Sample χ²
Class     Observed freq (fo)   Expected freq (fe = npi; pi = 1/3) [9]   (fo-fe)²   (fo-fe)²/fe
Red              60                         40                            400         10.00
Blue             20                         40                            400         10.00
Yellow           40                         40                              0          0.00
$\sum \frac{(f_o - f_e)^2}{f_e} = 20.00$
[9] Since the null hypothesis states that there is no color preference, each of the three
colors is expected to be preferred by one third of the purchasers.
4. Reject Ho; because 20 > 5.99. This means that customers do have color preference. It
appears that red is the most popular color and blue is the least popular.
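The goodness-of-fit statistic for hypothesized proportions takes only a few lines of Python. This is a minimal sketch; the helper name `chi_square_gof` is an assumption made for illustration, and the data are the color counts from the example above.

```python
def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit statistic for hypothesized class proportions."""
    n = sum(observed)
    expected = [n * p for p in proportions]          # fe = n * pi
    chi2 = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))
    return chi2, expected

# Red, blue, yellow purchases under Ho: P1 = P2 = P3 = 1/3
chi2, expected = chi_square_gof([60, 20, 40], [1 / 3, 1 / 3, 1 / 3])
reject_h0 = chi2 > 5.99   # table value for alpha = 0.05, 2 degrees of freedom
print(f"chi-square = {chi2:.2f}, reject Ho: {reject_h0}")
```

The same function can be reused for unequal hypothesized proportions, such as a 9: 3: 3: 1 ratio, by passing the corresponding fractions.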
3. Rating sciences, Inc., a TV program – rating service, surveyed 600 families where the
television was turned on during the prime time on week nights. They found the
following numbers of people tuned to the various networks.
a) Test the hypothesis that all four networks have the same proportion of viewers
during this prime time period. Use α = 0.05
b) Eliminate the results for PBS and repeat the test of hypothesis for the three
commercial networks, using α = 0.05
c) Test the hypothesis that each of the three major networks has 30% of the weeknight
prime time market and PBS has 10% using α = 0.005
Solution
a.
1. Ho: All of the four networks do have equal number of viewers; P1 = P2 = P3 = P4 = 1/4.
Ha: All of the four networks do not have equal number of viewers.
2. α = 0.05
ν = K - 1 = 4 - 1 = 3
χ²α,ν = χ²0.05,3 = 7.81
Reject Ho if sample χ² is greater than 7.81
3. Sample χ²
Class   Observed freq (fo)   Expected freq (fe = npi; pi = 1/4) [10]   (fo-fe)²   (fo-fe)²/fe
NBC           210                        150                            3,600       24.0000
CBS           170                        150                              400        2.6667
ABC           165                        150                              225        1.5000
PBS            55                        150                            9,025       60.1667
$\sum \frac{(f_o - f_e)^2}{f_e} = 88.3334$
[10] Since equal number of viewers is expected to watch each network, each of the four
networks is expected to be watched by one fourth of the viewers.
4. Reject Ho; because 88.33 > 7.81.
b.
1. Ho: All of the three commercial networks do have equal number of viewers; P1 = P2 =
P3 = 1/3.
Ha: All of the three commercial networks do not have equal number of viewers.
2. α = 0.05
ν = K - 1 = 3 - 1 = 2
χ²α,ν = χ²0.05,2 = 5.99
Reject Ho if sample χ² is greater than 5.99.
3. Sample χ²
Class   Observed freq (fo)   Expected freq (fe = npi; pi = 1/3)   (fo-fe)²   (fo-fe)²/fe
NBC           210                      181.67                      802.60      4.4179
CBS           170                      181.67                      136.20      0.7497
ABC           165                      181.67                      277.90      1.5297
$\sum \frac{(f_o - f_e)^2}{f_e} = 6.6973$
4. Reject Ho; because 6.70 > 5.99.
c.
3. Sample χ²
4. Do not reject Ho
[11] Expected frequency value for P4 is less than 5 (200*0.02 = 4). So, we have to combine P4 with one of the other
expected frequencies, say P3, to obtain a combined expected frequency of 30 (200*0.15). It can also be combined
with other expected frequency values.
The chi-square test is widely used for a variety of analyses. One of the more important
uses of Chi-Square is the goodness-of-fit test. That is, it can be used to decide whether a
particular probability distribution, such as the binomial, Poisson or normal, is the
appropriate distribution. This is an important ability, because as decision makers using
statistics, we will need to choose a certain probability distribution to represent the
distribution of the data we happen to be considering.
In tests of hypothesis (Chapter 5), we assumed that the population was normal and
tested the hypothesis µ = µo, p = Po, etc. But what if we want to check on the assumption
of normality itself? The multinomial χ² goodness-of-fit test can be applied.
The null hypothesis for a goodness-of-fit test is that the distribution of the population
from which a sample is taken is the one specified. The alternative hypothesis is that the
actual distribution is not the specified distribution. Generally, a researcher specifies
only the name of the distribution and uses the sample data to estimate the particular
parameters of the distribution. In this situation one degree of freedom is lost for each
parameter that has to be estimated. However, if the researcher completely specifies the
distribution including parameter values, then no additional degrees of freedom are lost.
Example (Binomial)
1. Mrs. Tsion, saleswoman for MOON Paper Company, has five accounts to visit per day.
It is suggested that sales by Mrs. Tsion may be described by the binomial distribution,
with the probability of selling each account being 0.4. Given the following frequency
distribution of Mrs. Tsion’s number of sales per day, can we conclude that the data do
in fact follow the binomial distribution? Use the 0.05 significance level.
Solution
1. Ho: The frequency distribution can be best described by a binomial distribution with n = 5,
P = 0.4
Ha: The frequency distribution can’t be best described by a binomial distribution with n = 5,
P = 0.4
2. α = 0.05
ν = K - 1 - m = 5 - 1 - 0 = 4
χ²α,ν = χ²0.05,4 = 9.49
Reject Ho if sample χ² > 9.49
3. Sample χ²
No. of sales   No. of days with that   Prob. with      Expected freq   (fo-fe)²    (fo-fe)²/fe
per day        no. of sales (fo)       n=5, P=0.4      (fe = npi)
0                    12                  .0778             7.78         17.8084      2.2890
1                    38                  .2592            25.92        145.9264      5.6299
2                    27                  .3456            34.56         57.1536      1.6538
3                    17                  .2304            23.04         36.4816      1.5834
4&5                   6                  .0870             8.70          7.2900      0.8379
$\sum \frac{(f_o - f_e)^2}{f_e} = 11.9940$
4. Reject Ho. The number of sales per day is not binomially distributed with n = 5, P = 0.4.
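The binomial goodness-of-fit computation can be sketched in Python using exact binomial probabilities instead of table values. The observed frequencies are the five values in the table above (with the last two classes combined); the variable names are illustrative. Because the probabilities are not rounded to four decimals, the statistic differs from 11.9940 in the second decimal place.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_trials, p_success = 5, 0.4
observed = [12, 38, 27, 17, 6]          # classes 0, 1, 2, 3, and "4 or 5" combined

probs = [binomial_pmf(k, n_trials, p_success) for k in range(4)]
probs.append(binomial_pmf(4, n_trials, p_success) + binomial_pmf(5, n_trials, p_success))

total = sum(observed)
expected = [total * p for p in probs]   # fe = n * pi
chi2 = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))
print(f"chi-square = {chi2:.3f}")
```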
3. The Ethiopian postal service is interested in modeling the “mangled letter” problem. It
has been suggested that any letter sent to a certain area has a 0.15 chance of being
mangled. Since the post office is so big, it can be assumed that two letters' chances of
being mangled are independent. A sample of 310 people was selected, and two test
letters were mailed to each of them. The number of people receiving zero, one, or two
mangled letters was 260, 40, and 10, respectively. At the 0.10 level of significance, is it
reasonable to conclude that the number of mangled letters received by people follows a
binomial distribution with P = 0.15?
Solution
1. Ho: The number of mangled letters received by people follows a binomial
distribution with n = 2, P = 0.15.
Ha: The number of mangled letters received by people doesn’t follow a binomial
distribution with n = 2, P = 0.15.
2. α = 0.1
ν = K - 1 - m = 3 - 1 - 0 = 2
χ²α,ν = χ²0.1,2 = 4.61
Reject Ho if sample χ² > 4.61
Sample χ²
No. of mangled   Observed     Prob. with      Expected freq    (fo-fe)²     (fo-fe)²/fe
letters          freq. (fo)   n=2, P=0.15     (fe = npi)
0                   260         0.7225          223.9750       1297.8006      5.7944
1                    40         0.2550           79.0500       1524.9025     19.2904
Example (Poisson)
1. It is hypothesized that the number of breakdowns per month of a computer system at a
major university follows a Poisson distribution with µ = 2. The data below show the
observed number of breakdowns per month during a sample of 100 months. Use a 5%
level of significance and test the null hypothesis.
4. Do not Reject Ho. The number of breakdowns per month of a computer system at
the university follows a Poisson distribution with µ = 2.
2. Suppose that a teller supervisor believes that the distribution of random arrivals at a
local bank is Poisson and sets out to test this hypothesis by gathering information. The
following data represent a distribution of frequency of arrivals during one-minute
intervals at a bank. Use α = 0.05 to test these data in an effort to determine whether they
are Poisson distributed.
Solution
Before we solve the question, first we have to compute the arrival rate per minute, and
hence one degree of freedom is lost.
$\lambda = \frac{\sum (\text{number of arrivals} \times \text{observed frequency})}{\sum \text{observed frequency}} = \frac{0(7) + 1(18) + 2(25) + 3(17) + 4(12) + 5(5)}{84} = \frac{192}{84} \approx 2.3$ cust/min
4. Do not Reject Ho. The arrival of customers at a bank follows a Poisson distribution
with λ = 2.3.
3. The number of automobile accidents occurring per day in a particular city is believed to
have a Poisson distribution. A sample of 80 days during the past year gives the data
shown below. Do the data support the belief that the number of accidents per day has a
Poisson distribution? Use α = 0.05.
No. of accidents           0     1     2     3     4
Observed freq. (days)     34    25    11     7     3
Solution
Before we solve the question, first we have to compute the occurrence rate per day, and
hence one degree of freedom is lost.
$\lambda = \frac{\sum (\text{number of accidents} \times \text{observed frequency})}{\sum \text{observed frequency}} = \frac{0(34) + 1(25) + 2(11) + 3(7) + 4(3)}{80} = \frac{80}{80} = 1$ accident/day
1. Ho: The occurrence of accidents per day follows a Poisson distribution with λ = 1.0
Ha: The occurrence of accidents per day does not follow a Poisson distribution with
λ = 1.0
2. α = 0.05
ν = K - 1 - m = 4 - 1 - 1 = 2
χ²α,ν = χ²0.05,2 = 5.99
Reject Ho if sample χ² > 5.99
3. Sample χ²
Number of     Observed     Prob. with   Expected freq   (fo-fe)²    (fo-fe)²/fe
accidents     freq. (fo)   λ=1.0        (fe = npi)
0                34         0.3679        29.4320        20.8666      0.7090
1                25         0.3679        29.4320        19.6426      0.6674
2                11         0.1839        14.7120        13.7789      0.9366
3 or more        10         0.0803         6.4240        12.7878      1.9906
$\sum \frac{(f_o - f_e)^2}{f_e} = 4.3036$
4. Do not Reject Ho. The occurrence of accidents per day follows a Poisson distribution
with λ = 1.0
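The Poisson goodness-of-fit steps (estimate λ from the data, compute class probabilities, combine the sparse upper classes, then sum (fo − fe)²/fe) can be sketched as follows. The code is illustrative only; it uses exact Poisson probabilities rather than four-decimal table values, so the statistic agrees with 4.3036 to about two decimal places.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences when the mean rate is lam."""
    return exp(-lam) * lam**k / factorial(k)

accidents = [0, 1, 2, 3, 4]
days = [34, 25, 11, 7, 3]
n = sum(days)

# Estimated occurrence rate; estimating it costs one degree of freedom
lam = sum(a * d for a, d in zip(accidents, days)) / n

# Classes 0, 1, 2 and "3 or more" (the last class absorbs the upper tail)
observed = [34, 25, 11, 7 + 3]
probs = [poisson_pmf(k, lam) for k in range(3)]
probs.append(1 - sum(probs))

expected = [n * p for p in probs]
chi2 = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))
df = len(observed) - 1 - 1   # K - 1 - m with one estimated parameter
print(f"lambda = {lam}, chi-square = {chi2:.3f}, df = {df}")
```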
Example (Normal)
1. Suppose that Ato Paulos developed an overall attitude scale to determine how his
company’s employees feel toward their company. In theory the scores can vary from 0
to 50. Ato Paulos pretests his measurement instrument on a randomly selected group of
100 employees. He tallies the scores and summarizes them into six categories as shown
below. Are these pretest scores approximately normally distributed with µ = 24.9 and σ
$z_{25} = \frac{25 - 24.9}{7.194} = 0.01$, area = 0.00399
$z_{30} = \frac{30 - 24.9}{7.194} = 0.71$, area = 0.26115
Expected probability = 0.26115 - 0.00399 = 0.25716
The six probabilities do not sum to 1.00. Even though observed frequencies were
obtained only for these six categories, getting a score less than 10 or greater than 40 was
also possible. Because 0.5 of the probabilities lie in each half of a normal distribution
and utilizing the sum of expected probabilities on each side of the mean, 24.9, we can
obtain a probability of the < 10 category: 0.5 – (0.06456 + 0.16446 + 0.25175) = 0.01923.
Similarly, we can obtain the probability of >40 category: 0.5 – (0.00399 + 0.25716 +
0.15809 + 0.06290) = 0.01786. Expected frequencies can then be obtained by multiplying
each expected probability by the total frequency (100), as shown below.
As the < 10 and > 40 categories have values of less than 5, each must be combined with
the adjacent category. As a result, the < 10 category becomes part of the 10 – 15 category
Score category    Probability    Expected freq (fe = npi)
10 – 15            0.08379            8.379
15 – 20            0.16446           16.446
20 – 25            0.25574           25.574
25 – 30            0.25716           25.716
30 – 35            0.15809           15.809
35 – 40            0.08076            8.076
4. Do not Reject Ho. The attitude scores are normally distributed with mean 24.9 and
standard deviation 7.194.
2. The director of a major soccer team believes that the ages of purchasers of game tickets
are normally distributed. If the following data represent the distribution of ages for a
sample of observed purchasers of major soccer game tickets, use the chi-square
goodness-of-fit test to determine whether this distribution is significantly different from
the normal distribution. Assume that α = 0.05.
Solution
1. Ho: The ages of purchasers of soccer game tickets are normally distributed.
Ha: The ages of purchasers of soccer game tickets aren’t normally distributed
2. α = 0.05
ν = K - 1 - 2 = 6 - 1 - 2 = 3
χ²α,ν = χ²0.05,3 = 7.81
Reject Ho if sample χ² > 7.81
3. Sample χ²
Age category   Observed freq (f)   Mid point (M)      fM          fM²
10-20                16                 15             240         3,600
20-30                44                 25           1,100        27,500
30-40                61                 35           2,135        74,725
40-50                56                 45           2,520       113,400
50-60                35                 55           1,925       105,875
60-70                19                 65           1,235        80,275
Total               231                          ΣfM = 9,155   ΣfM² = 405,375
$\bar{X} = \frac{\sum fM}{n} = \frac{9,155}{231} = 39.63$
$S = \sqrt{\frac{\sum fM^2 - \frac{(\sum fM)^2}{n}}{n - 1}} = \sqrt{\frac{405,375 - \frac{9,155^2}{231}}{231 - 1}} = 13.60$
With $Z = \frac{X - \mu}{\sigma}$, the expected probability of each category can be obtained as follows:
$z_{30} = \frac{30 - 39.63}{13.6} = -0.71$, area = 0.26115
$z_{40} = \frac{40 - 39.63}{13.6} = 0.03$, area = 0.01197
Expected probability = 0.26115 + 0.01197 = 0.27312
The six probabilities do not sum to 1.00. Even though observed frequencies were
obtained only for these six categories, getting an age less than 10 or greater than 70 is
also possible.
For > 70
Probability between 70 and the mean = 0.05394 + 0.15682 + 0.2640 + 0.01197 = 0.48713
Probability > 70 = 0.5 – 0.48713 = 0.01287
Since the < 10 and > 70 categories have values of less than 5, each must be combined
with the adjacent category. As a result, the < 10 category becomes part of the 10 – 20
category and the > 70 category becomes part of the 60 – 70 category.
4. Do not Reject Ho. The age of purchasers of soccer game tickets are normally
distributed.
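The whole procedure for the ticket-purchaser ages (grouped mean and standard deviation, class probabilities from the normal curve, tails folded into the end classes, then χ²) can be sketched in Python. This is an illustrative sketch, not the text's table-based computation: it uses `statistics.NormalDist` for the areas, and all variable names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

bounds = [10, 20, 30, 40, 50, 60, 70]          # class boundaries
observed = [16, 44, 61, 56, 35, 19]            # observed frequency per class
mids = [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]

n = sum(observed)
sum_fm = sum(f * m for f, m in zip(observed, mids))
sum_fm2 = sum(f * m * m for f, m in zip(observed, mids))

mean = sum_fm / n
sd = sqrt((sum_fm2 - sum_fm**2 / n) / (n - 1))

# Cumulative areas at the interior boundaries 20, 30, 40, 50, 60
dist = NormalDist(mean, sd)
cdf = [dist.cdf(b) for b in bounds[1:-1]]

# First and last classes absorb the "< 10" and "> 70" tails
probs = [cdf[0]] + [hi - lo for lo, hi in zip(cdf, cdf[1:])] + [1 - cdf[-1]]
expected = [n * p for p in probs]
chi2 = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))
df = len(observed) - 1 - 2                     # two parameters estimated from the data
print(f"mean = {mean:.2f}, S = {sd:.2f}, chi-square = {chi2:.3f}, df = {df}")
```

The statistic obtained this way stays well below the critical value of 7.81, consistent with the decision not to reject Ho.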
3. The instructor for an Introductory Statistics course attempts to construct the final
examination so that the grades are normally distributed with a mean of 65. From the
sample of grades appearing in the accompanying frequency distribution table, can you
conclude that the instructor has achieved this objective? Use α = 0.05.
Solution
1. Ho: The grades of students are normally distributed with a mean of 65.
Ha: The grades of students are not normally distributed with a mean of 65.
2. = 0.05
V = K-1 – 1 = 5-1-1 = 3
X2, ν = X2 0.05,3 = 7.81
Reject Ho if sample χ2 > 7.81
3. Sample χ2
Grade Observed freq Mid point (M) fm fm2
30-40 4 35 140 4,900
40-50 17 45 765 34,425
50-60 29 55 1,595 87,725
60-70 49 65 3,185 207,025
70-80 33 75 2,475 185,625
80-90 18 85 1,530 130,050
150 fm = 9, 690 fm = 649,750
2
fm 2
9,690
2
fm 2
n
649,750
150
S 12.63
n 1 150 1
X
With Z , the expected probability of each category can be obtained as follows:
The six probabilities do not sum to 1.00. Even though observed frequencies were
obtained only for these six categories, getting a score less than 30 or greater than 90 is
also possible.
Since the < 30, 30-40 and > 90 categories have values of less than 5, they must be
combined with the adjacent categories. As a result, the < 30 and 30-40 categories become
part of the 40 – 50 category; and the > 90 category becomes part of 80-90 category.
Grade category    Probability    Expected freq (fe = npi, n = 150)
40-50             0.11702        17.553
50-60             0.22756        34.134
60-70             0.31084        46.626
70-80             0.22756        34.134
80-90             0.11702        17.553
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = 1.6190$$
4. Do not reject Ho, since the sample χ² (1.6190) is less than 7.81. Yes, the grades of
students are consistent with a normal distribution with a mean of 65.
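The arithmetic in this solution can be checked with a short script. This is only a sketch: it reuses the category probabilities exactly as read from the normal table above rather than recomputing them, and the variable names are illustrative.

```python
import math

# Grouped grade data from the example: (lower limit, upper limit, observed frequency)
classes = [(30, 40, 4), (40, 50, 17), (50, 60, 29),
           (60, 70, 49), (70, 80, 33), (80, 90, 18)]

n = sum(f for _, _, f in classes)
sum_fm = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
sum_fm2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)

# Sample standard deviation computed from the grouped data
s = math.sqrt((sum_fm2 - sum_fm ** 2 / n) / (n - 1))

# Observed frequencies after merging 30-40 into 40-50 (the tails have no observations)
observed = [4 + 17, 29, 49, 33, 18]
# Category probabilities as given in the solution (tails already merged in)
probabilities = [0.11702, 0.22756, 0.31084, 0.22756, 0.11702]

expected = [n * p for p in probabilities]
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

print(round(s, 2), round(chi_square, 4))
```

Because the sample χ² is well below the critical value of 7.81, the script agrees with the decision not to reject Ho.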
CHAPTER FIVE
ANALYSIS OF VARIANCE
When testing for differences in the means of more than two populations, we usually do not
proceed by considering all combinations of two populations at a time and testing for
differences in each pair, for two reasons:
1. Such an approach would require several tests rather than just one.
2. If each individual test were conducted using a level of significance of, say, α =
0.05, then the overall level of significance would be higher than 0.05. For
example, if Ho: µ1 = µ2 = µ3 is tested through three pairwise tests, the overall α (the
probability of rejecting a true null hypothesis) = 1 – 0.95³ = 0.143.
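The inflation of the overall significance level can be illustrated numerically. The sketch below treats the pairwise tests as independent, which is the simplifying assumption behind the 1 – 0.95³ figure; the function name is hypothetical.

```python
from math import comb

def overall_alpha(k_populations, alpha):
    """Chance of at least one false rejection across all pairwise tests,
    treating the tests as independent (a simplifying assumption)."""
    n_tests = comb(k_populations, 2)  # number of distinct pairs of populations
    return 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    print(k, round(overall_alpha(k, 0.05), 3))
```

The overall level grows quickly with the number of populations, which is why a single simultaneous test is preferred.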
Thus, we want to test simultaneously for differences among the means of all the
populations, and we want the joint level of significance of the test to be α. To perform
this test we make use of the F-distribution and use a method called ANOVA.
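ANOVA compares two estimates of the population variance σ²: one based on the variation
among the sample means and one based on the variation within the samples. Since both are
estimates of σ², they should be approximately equal in value when the null hypothesis is
true. If the null hypothesis is not true, these two estimates will differ considerably.
The three steps in ANOVA, then, are: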
1. Determine one estimate of the population variance from the variation among
sample means
2. Determine a second estimate of the population variance from the variation
within the samples
3. Compare these two estimates. If they are approximately equal in value, accept
the null hypothesis.
The variance among the sample means is called Between Column Variance or Mean
Square between (MSB).
Sample variance: $S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Now, because we are working with sample means and the grand mean, let’s substitute
$\bar{X}$ for $X$, $\bar{\bar{X}}$ for $\bar{X}$, and K (number of samples) for n to get the formula for the variance
among the sample means:
Variance among sample means: $S_{\bar{X}}^2 = \frac{\sum (\bar{X} - \bar{\bar{X}})^2}{K - 1}$
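In sampling distribution of the mean we have calculated the standard error of the mean
as $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$. Cross multiplying the terms: $\sigma = \sigma_{\bar{X}} \sqrt{n}$. Squaring both sides: $\sigma^2 = \sigma_{\bar{X}}^2 \cdot n$.
In ANOVA, we do not have all the information needed to use the above equation to
find σ². Specifically, we do not know $\sigma_{\bar{X}}^2$. We could, however, calculate the variance
among the sample means, $S_{\bar{X}}^2$, using $S_{\bar{X}}^2 = \frac{\sum (\bar{X} - \bar{\bar{X}})^2}{K - 1}$. So, why not substitute $S_{\bar{X}}^2$ for $\sigma_{\bar{X}}^2$
and calculate an estimate of the population variance? This will give us:
$$\hat{\sigma}^2 = S_{\bar{X}}^2 \cdot n = \frac{n \sum (\bar{X} - \bar{\bar{X}})^2}{K - 1} = \frac{\sum n (\bar{X} - \bar{\bar{X}})^2}{K - 1}$$
if $n_1, n_2, \ldots, n_k$ are equal.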
When the sample sizes are not equal, we solve this problem by multiplying each $(\bar{X}_j - \bar{\bar{X}})^2$ by its own
appropriate $n_j$, and hence $\hat{\sigma}^2$ becomes:
$$MSB = \hat{\sigma}^2 = \frac{\sum n_j (\bar{X}_j - \bar{\bar{X}})^2}{K - 1}$$
Where:
$\hat{\sigma}^2$ = First estimate of the population variance based on the variation among sample
means (the Between Column Variance – MSB)
$n_j$ = the size of the jth sample
$\bar{X}_j$ = the sample mean of the jth sample
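The second estimate of the population variance is based on the variation of the sample
observations within each sample. It is called the within column variance or Mean Square
Within (MSW). We calculate the sample variance of each sample,
$S_j^2 = \frac{\sum (X - \bar{X}_j)^2}{n_j - 1}$, and combine these variances:
$$MSW = \hat{\sigma}^2 = \frac{\sum_{j=1}^{k} (n_j - 1) S_j^2}{n_T - k}$$
If $n_1, n_2, \ldots, n_k$ are equal:
$$MSW = \hat{\sigma}^2 = \frac{\sum_{j=1}^{k} (n - 1) S_j^2}{k(n - 1)}$$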
Where:
$\hat{\sigma}^2$ = Second estimate of the population variance based on the variation within the
samples (the Within Column Variance – MSW)
$n_j$ = the size of the jth sample
$n_j - 1$ = degrees of freedom in each sample
$n_T - k$ = degrees of freedom associated with MSW
$S_j^2$ = the sample variance of the jth sample
K = the number of samples
$n_T = \sum n_j$ = the total sample size = $n_1 + n_2 + \ldots + n_k$.
Note: MSW is based on the variation within each of the samples; it is not influenced by whether or not the
null hypothesis is true. Thus, MSW always provides an unbiased estimate of the population variance.
The estimate of population variance based on variation that exists between sample
means (MSB) is somewhat suspect because it is based on the notion that all the
populations have the same mean. That is, the estimate MSB is a good estimate of σ²
only if Ho is true and all the population means are equal: µ1 = µ2 = µ3 = ... = µk.
If the unknown population means are not equal, and perhaps are radically different
from one another, then the sample means ($\bar{X}_j$) will most likely be radically different
from each other too. This difference will have a marked effect on MSB. That is to say,
the $\bar{X}_j$ values will vary a great deal and the $(\bar{X}_j - \bar{\bar{X}})^2$ terms will be large. Thus, if the
population means are not all equal, then the MSB estimate will be large relative to the
MSW estimate. That is, if the MSB is large relative to the MSW, then the hypothesis
that all the population means are equal is not likely to be true.
The important question is, of course, how large is “large”? Also, how do we measure
the relative sizes of the two variance estimates? The answer to these questions is given
by the F-distribution.
If k samples of $n_j$ (j = 1, 2, ..., k) items each are taken from k normal populations that
have equal variances and for which the hypothesis Ho: µ1 = µ2 = ... = µk is true, then the
ratio of the MSB to the MSW is an F-value that follows an F-probability distribution.
$$F = \frac{MSB}{MSW}$$
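The two variance estimates and their ratio can be computed directly from raw samples. The following is a minimal sketch of the formulas above; the function name and the data are hypothetical and chosen only so the result is easy to verify by hand.

```python
from statistics import mean, variance

def one_way_anova(samples):
    """Return (MSB, MSW, F) for a list of samples, following the formulas above."""
    k = len(samples)
    sizes = [len(s) for s in samples]
    n_total = sum(sizes)
    means = [mean(s) for s in samples]
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Between-column variance: weighted squared deviations of the sample means
    msb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    # Within-column variance: sample variances pooled by degrees of freedom
    msw = sum((n - 1) * variance(s) for n, s in zip(sizes, samples)) / (n_total - k)
    return msb, msw, msb / msw

# Hypothetical data: three samples of three observations each
msb, msw, f_ratio = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(msb, msw, f_ratio)
```

Here each sample has variance 1 and the sample means are 2, 3 and 4, so MSW is 1 and MSB is 3. The resulting F would then be compared with the tabulated F-value for K – 1 and nT – K degrees of freedom.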
THE F-DISTRIBUTION
Characteristics of F-distribution
Example
1. The training director of a company is trying to evaluate three different methods of
training new employees. The first method assigns each to an experienced employee for
individual help in the factory. The second method puts all new employees in a training
room separate from the factory, and the third method uses training films and
programmed learning materials. The training director chooses 18 new employees
assigned at random to the three training methods and records their daily production
after they complete the programs. Below are productivity measures for individuals
trained by each method.
At the 0.05 level of significance, do the three training methods lead to different levels of
productivity?
Solution
1. Ho: µ1 = µ2 = µ3
Ha: µ1, µ2, and µ3 are not all equal
2. α = 0.05
3. Sample F
$$MSW = \hat{\sigma}^2 = \frac{\sum (n_j - 1) S_j^2}{n_T - K} = \frac{5(30.17 + 47.60 + 31.07)}{15} = \frac{108.84}{3} = 36.28$$
$$F = \frac{MSB}{MSW} = \frac{60.33}{36.28} = 1.663$$
2. A department store chain is considering building a new store at one of the four different
sites. One of the important factors in the decision is the annual household income of the
residents of the four areas. Suppose that, in a preliminary study, various residents in
each area are asked what their annual household incomes are. The results are shown in
the accompanying table below. Is there sufficient evidence to conclude that differences
exist in the average annual household incomes among the four communities? Use α =
0.01.
19 18
51
27
ΣX:  159    294    182    138
$\bar{X}_1$ = 26.50, $\bar{X}_2$ = 32.67, $\bar{X}_3$ = 26.00, $\bar{X}_4$ = 27.60, $\bar{\bar{X}}$ = 28.63
$S_1^2$ = 26.30, $S_2^2$ = 107.5, $S_3^2$ = 136.33, $S_4^2$ = 81.30
Solution
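1. Ho: µ1 = µ2 = µ3 = µ4
Ha: µ1, µ2, µ3 and µ4 are not all equal
2. α = 0.01
3. Sample F
$$MSB = \frac{\sum n_j (\bar{X}_j - \bar{\bar{X}})^2}{K - 1} = \frac{6(26.5 - 28.63)^2 + 9(32.67 - 28.63)^2 + 7(26.00 - 28.63)^2 + 5(27.60 - 28.63)^2}{4 - 1} = \frac{227.84}{3} = 75.95$$
$$MSW = \frac{\sum (n_j - 1) S_j^2}{n_T - K} = \frac{5(26.3) + 8(107.5) + 6(136.33) + 4(81.3)}{27 - 4} = \frac{2134.68}{23} = 92.81$$
$$F = \frac{MSB}{MSW} = \frac{75.95}{92.81} = 0.82$$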
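When only the sample sizes, means and variances are available, as in this solution, the same quantities can be computed from those summaries. This is a sketch with a hypothetical function name; the sample sizes 6, 9, 7 and 5 are the multipliers that appear in the MSB calculation above, and the grand mean is recomputed from the rounded sample means, so the last digit may differ slightly from the hand calculation.

```python
def anova_from_summary(sizes, means, variances):
    """Return (MSB, MSW, F) from per-sample sizes, means and sample variances."""
    k = len(sizes)
    n_total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total

    msb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    msw = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    return msb, msw, msb / msw

# Summary figures from the household income example
sizes = [6, 9, 7, 5]
means = [26.50, 32.67, 26.00, 27.60]
variances = [26.30, 107.5, 136.33, 81.30]

msb, msw, f_ratio = anova_from_summary(sizes, means, variances)
print(round(msb, 2), round(msw, 2), round(f_ratio, 2))
```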
CHAPTER SIX
SIMPLE LINEAR REGRESSION AND CORRELATION
Linear regression and correlation analysis study and measure the linear relationship among two or
more variables. When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis, and when there are more than two variables the
terms multiple regression and partial correlation are used.
Correlation Analysis: deals with the measurement of the closeness of the relationship which is
described in the regression equation.
We say there is correlation if the two series of items vary together directly or inversely.
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called the “subject” or
“independent” variable, while the effect is called the “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists
between two variables is due to their being related to some third force.
Example:
Let X1 = ESLCE result
Y1 = rate of surviving in the University
Y2 = the rate of getting a scholarship.
Both X1 & Y1 and X1 & Y2 have high positive correlation. Likewise, Y1 & Y2 have positive
correlation; they are not directly related, but they are related to each other via X1.
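3. Chance. The two variables are unrelated, and the observed correlation arises purely by
coincidence.
Examples: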
Price of teff in Addis Ababa and grade of students in USA.
Weight of individuals in Ethiopia and income of individuals in Kenya.
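$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
and the short cut formula is
$$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$$
$$r = \frac{\sum XY - n \bar{X} \bar{Y}}{\sqrt{[\sum X^2 - n \bar{X}^2][\sum Y^2 - n \bar{Y}^2]}}$$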
Remark: r always lies between -1 and 1 inclusive, and it is symmetric (the correlation of X
with Y equals the correlation of Y with X).
Interpretation of r
1. Perfect positive linear relationship (if r = 1)
2. Some positive linear relationship (if r is between 0 and 1)
3. No linear relationship (if r = 0)
4. Some negative linear relationship (if r is between -1 and 0)
5. Perfect negative linear relationship (if r = -1)
Examples:
1. Calculate the simple correlation between mid-semester and final exam scores of 10 students
(both out of 50)
$$r = \frac{\sum XY - n \bar{X} \bar{Y}}{\sqrt{[\sum X^2 - n \bar{X}^2][\sum Y^2 - n \bar{Y}^2]}} = \frac{10331 - 10(31.2)(32.9)}{\sqrt{(9920 - 10(973.4))(11003 - 10(1082.4))}} = \frac{66.2}{182.5} = 0.363$$
This means mid-semester exam and final exam scores have a weak positive correlation.
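The short cut formula lends itself to a quick check from the sums used in this example. The sketch below uses a hypothetical function name and takes the sums and means quoted in the calculation as its inputs; small differences in the last digit can arise because the hand calculation rounds the squared means.

```python
import math

def pearson_r_from_sums(n, sum_xy, mean_x, mean_y, sum_x2, sum_y2):
    """Short cut formula: r = (ΣXY - n·X̄·Ȳ) / sqrt[(ΣX² - n·X̄²)(ΣY² - n·Ȳ²)]."""
    numerator = sum_xy - n * mean_x * mean_y
    denominator = math.sqrt((sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2))
    return numerator / denominator

# Sums for the 10 students in the example
r = pearson_r_from_sums(n=10, sum_xy=10331, mean_x=31.2, mean_y=32.9,
                        sum_x2=9920, sum_y2=11003)
print(round(r, 3))
```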
Exercise The following data were collected from a certain household on the monthly income (X)
and consumption (Y) for the past 10 months. Compute the simple correlation coefficient.
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
The above formula and procedure are only applicable to quantitative data. When we have
qualitative data such as efficiency, honesty, intelligence, etc., we calculate what is called
Spearman’s rank correlation coefficient as follows:
Steps
i. Rank the different items in X and Y.
ii. Find the difference of the ranks in a pair , denote them by Di
iii. Use the following formula
$$r_s = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)}$$
Where $r_s$ = coefficient of rank correlation
D = the difference between paired ranks
n = the number of pairs
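The three steps above translate directly into code once the items have been ranked. This is a minimal sketch that assumes the ranks contain no ties; the function name and the two rankings are hypothetical.

```python
def spearman_rank_correlation(ranks_x, ranks_y):
    """Spearman's r_s = 1 - 6·ΣD² / (n(n² - 1)), assuming no tied ranks."""
    if len(ranks_x) != len(ranks_y):
        raise ValueError("both rankings must cover the same number of items")
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of five items by two judges
rs = spearman_rank_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rs)
```

For these rankings the differences are -1, 1, -1, 1 and 0, so ΣD² = 4 and r_s = 1 – 24/120 = 0.8.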
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See if there is correlation
between the tastes of the ladies.
Lipstick types A B C D E F G
Aster 2 1 4 3 5 7 6
Almaz 1 3 2 4 5 6 7
Solution:
X (R1)    Y (R2)    R1-R2 (D)    D²
2         1         1            1
1         3         -2           4
4         2         2            4
3         4         -1           1
5         5         0            0
7         6         1            1
6         7         -1           1
Total                            12
$$r_s = 1 - \frac{6 \sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$$
Yes, there is positive correlation.
Simple Linear Regression
- Simple linear regression refers to the linear relationship between two variables
- We usually denote the dependent variable by Y and the independent variable by X.
- A simple regression line is the line fitted to the points plotted in the scatter diagram, which
would describe the average relationship between the two variables. Therefore, to see the type
of relationship, it is advisable to prepare scatter plot before fitting the model.
- The linear model is: $Y = \alpha + \beta X + \varepsilon$
Where: Y = dependent variable
X = independent variable
α = regression constant
β = regression slope
ε = random disturbance term
$Y \sim N(\alpha + \beta X, \sigma^2)$
$\varepsilon \sim N(0, \sigma^2)$
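- In the fitted line $\hat{Y} = a + bX$, a is a constant which gives the value of Y when X = 0. It is
called the Y-intercept. b is a constant indicating the slope of the regression line, and it gives
a measure of the change in Y for a unit change in X. It is also called the regression coefficient
of Y on X.
- a and b are found by minimizing $SSE = \sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2$:
$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2}$$
$$a = \bar{Y} - b \bar{X}$$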
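The least-squares formulas for a and b can be sketched in a few lines. The function name and the data points are hypothetical; the points lie exactly on the line Y = 2 + 3X so that the result is easy to confirm.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y-hat = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Short cut forms of Σ(X - X̄)(Y - Ȳ) and Σ(X - X̄)²
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical points lying exactly on Y = 2 + 3X
a, b = fit_line([1, 2, 3, 4], [5, 8, 11, 14])
print(a, b)
```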
Example 1: The following data show the scores of 12 students in Accounting and Statistics
Examinations.
No.    Accounting (X)    Statistics (Y)    X²         Y²         XY
1      74.00             81.00             5476.00    6561.00    5994.00
2      93.00             86.00             8649.00    7396.00    7998.00
3      55.00             67.00             3025.00    4489.00    3685.00
4      41.00             35.00             1681.00    1225.00    1435.00
The Coefficient of Correlation (r) has a value of 0.92. This indicates that the two variables are
positively correlated (Y increases as X increases).
b) $\hat{Y} = 7.0194 + 0.9560X$; for X = 85:
$\hat{Y} = 7.0194 + 0.9560(85) = 88.28$
Exercise: A car rental agency is interested in studying the relationship between the distance
driven in kilometers (Y) and the maintenance cost for their cars (X in birr). The following
summarized information is given based on a sample of size 5.
$\sum_{i=1}^{5} X_i^2 = 147{,}000{,}000$, $\sum_{i=1}^{5} Y_i^2 = 314$
- To know how far the regression equation has been able to explain the variation in Y we use a
measure called coefficient of determination ($r^2$),
i.e. $r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$
Where r = the simple correlation coefficient.
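The coefficient of determination can be computed as the ratio of explained to total variation after fitting the line. The sketch below uses a hypothetical function name and made-up data.

```python
def coefficient_of_determination(xs, ys):
    """r² = Σ(Y-hat - Ȳ)² / Σ(Y - Ȳ)² for the least-squares line of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x

    fitted = [a + b * x for x in xs]
    explained = sum((y_hat - mean_y) ** 2 for y_hat in fitted)
    total = sum((y - mean_y) ** 2 for y in ys)
    return explained / total

# Hypothetical data
r2 = coefficient_of_determination([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 4))
```

For these data the fitted line is Ŷ = 2.2 + 0.6X, the explained variation is 3.6 and the total variation is 6, so r² = 0.6; that is, 60% of the variation in Y is explained by the regression.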
$$S_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} = \frac{\sum XY - n \bar{X} \bar{Y}}{n - 1}$$
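$$\hat{X} = a_1 + b_1 Y$$
$$b_1 = \frac{\sum XY - n \bar{X} \bar{Y}}{\sum Y^2 - n \bar{Y}^2}$$
$$a_1 = \bar{X} - b_1 \bar{Y}, \quad r = \frac{b_1 S_Y}{S_X}$$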
Here X is dependent and Y is independent.
Example: The regression lines between height (X) in inches and weight (Y) in lbs of male
students are:
$4Y - 15X + 530 = 0$ and
$20X - 3Y - 975 = 0$
Determine which is the regression of Y on X and which is the regression of X on Y.
Solution
We will assume one of the equations as regression of X on Y and the other as Y on X and
calculate r:
$$r^2 = b_{YX} \cdot b_{XY} = \frac{15}{4} \cdot \frac{3}{20} = \frac{9}{16} \in (0, 1)$$