Statistical Data Analysis Book Dang Quang The Hong
Preface
Statistics is the science of collecting, organizing and interpreting numerical and non-numerical
facts, which we call data.
The collection and study of data are important in the work of many professions, so training in
the science of statistics is valuable preparation for a variety of careers, for example, economists
and financial advisors, businessmen, engineers and farmers.
Knowledge of probability and statistical methods is also useful for informatics specialists in
various fields such as data mining, knowledge discovery, neural networks, fuzzy systems
and so on.
Whatever else it may be, statistics is first and foremost a collection of tools used for converting
raw data into information to help decision makers in their work.
The science of data - statistics - is the subject of this course.
Chapter 1 is an introduction to the statistical analysis of data. Chapters 2 and 3 deal with
statistical methods for presenting and describing data. Chapters 4 and 5 introduce the basic
concepts of probability and probability distributions, which are the foundation for our study of
statistical inference in later chapters. Sampling and sampling distributions is the subject of
Chapter 6. The remaining seven chapters discuss statistical inference - methods for drawing
conclusions from properly produced data. Chapter 7 deals with estimating characteristics of a
population by observing the characteristic of a sample. Chapters 8 to 13 describe some of the
most common methods of inference: for drawing conclusions about means, proportions and
variances from one and two samples, about relations in categorical data, regression and
correlation and analysis of variance. In every chapter we include examples to illustrate the
concepts and methods presented. The use of computer packages such as SPSS and
STATGRAPHICS will be involved.
Audience
This tutorial, as an introductory course to statistics, is intended mainly for users such as
engineers, economists and managers who need to use statistical methods in their work, and for
students. However, many aspects will also be useful for computer trainers.
Objectives
Understanding statistical reasoning
Mastering basic statistical methods for analyzing data such as descriptive and inferential
methods
Ability to use methods of statistics in practice with the help of computer software
Entry requirements
High school algebra course (+elements of calculus)
Elementary computer skills
http://www.netnam.vn/unescocourse/statistics/stat_frm.htm
CONTENTS
Chapter 1 Introduction
1.1 What is Statistics
1.2 Populations and samples
1.3 Descriptive and inferential statistics
1.4 Brief history of statistics
1.5 Computer softwares for statistical analysis
Chapter 1
Introduction
Definition 1.2
A sample is a subset of data selected from a population
Example 1.1 The population may be all women in a country, for example, in Vietnam. If from
each city or province we select 50 women, then the set of selected women is a sample.
Example 1.2 The set of all whisky bottles produced by a company is a population. For
quality control, 150 whisky bottles are selected at random. This portion is a sample.
Definition 1.3
The branch of statistics devoted to the summarization and description of data
(population or sample) is called descriptive statistics.
If it is too expensive or impossible to acquire every measurement in the
population, then we select a sample of data from the population and use the sample
to infer the nature of the population.
Definition 1.4
The branch of statistics concerned with using sample data to make an inference
about a population of data is called inferential statistics.
Chapter 2
Data presentation
CONTENTS
2.1. Introduction
2.2. Types of data
2.3. Qualitative data presentation
2.4. Graphical description of qualitative data
2.5. Graphical description of quantitative data: Stem and Leaf displays
2.6. Tabulating quantitative data: Relative frequency distributions
2.7. Graphical description of quantitative data: histogram and polygon
2.8. Cumulative distributions and cumulative polygons
2.9. Summary
2.10. Exercises
2.1 Introduction
The objective of data description is to summarize the characteristics of a data set. Ultimately, we want to
make the data set more comprehensible and meaningful. In this chapter we will show how to construct
charts and graphs that convey the nature of a data set. The procedure that we will use to accomplish this
objective in a particular situation depends on the type of data that we want to describe.
Definition 2.1
Numerical data that represent the quantity or amount of something are said to be quantitative data.
In other words, quantitative data are those that represent the quantity or amount of something.
Example 2.1 Height (in centimeters), weight (in kilograms) of each student in a group are both
quantitative data.
Definition 2.2
Non-numerical data that can only be classified into one of a group of categories are
said to be qualitative data.
In other words, qualitative data are those that have no quantitative interpretation, i.e., they can
only be classified into categories.
Example 2.2 Education level, nationality, sex of each person in a group of people are qualitative
data.
Definition 2.3
The category frequency for a given category is the number of observations that fall in that category.
Definition 2.4
The category relative frequency for a given category is the proportion of the total
number of observations that fall in that category.
Instead of the relative frequency for a category one usually uses percentage for a category,
which is computed as follows
Percentage for a category = Relative frequency for the category x 100%
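As a small illustration of these definitions, here is a minimal Python sketch (the data below are hypothetical) that computes the category frequency, relative frequency and percentage for each category of a qualitative data set:

```python
from collections import Counter

# Hypothetical qualitative data: score categories of 10 students
data = ["Good", "Bad", "Good", "Medium", "Excellent",
        "Good", "Medium", "Excellent", "Good", "Bad"]

n = len(data)
for category, freq in sorted(Counter(data).items()):
    rel = freq / n            # category relative frequency
    pct = rel * 100           # percentage for the category
    print(f"{category:10s} frequency={freq:2d} relative={rel:.2f} percentage={pct:.1f}%")
```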
Example 2.3 The classification of students of a group by the score on the subject Statistical
analysis is presented in Table 2.0a. The table of frequencies for the data set generated by
computer using the software SPSS is shown in Figure 2.1.
Table 2.0a Classification of 45 students by score category

No of stud.  CATEGORY     No of stud.  CATEGORY     No of stud.  CATEGORY     No of stud.  CATEGORY
 1           Bad          13           Good         24           Good         35           Good
 2           Medium       14           Excellent    25           Medium       36           Medium
 3           Medium       15           Excellent    26           Bad          37           Good
 4           Medium       16           Excellent    27           Good         38           Excellent
 5           Good         17           Excellent    28           Bad          39           Good
 6           Good         18           Good         29           Bad          40           Good
 7           Excellent    19           Excellent    30           Good         41           Medium
 8           Excellent    20           Excellent    31           Excellent    42           Bad
 9           Excellent    21           Good         32           Excellent    43           Excellent
10           Excellent    22           Excellent    33           Excellent    44           Excellent
11           Bad          23           Excellent    34           Good         45           Good
12           Good
CATEGORY

                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Bad                 6      13.3            13.3                 13.3
       Excellent          18      40.0            40.0                 53.3
       Good               15      33.3            33.3                 86.7
       Medium              6      13.3            13.3                100.0
       Total              45     100.0           100.0
Figure 2.1 Output from SPSS showing the frequency table for the variable
CATEGORY.
Example 2.4a (Bar Graph) The bar graph generated by computer using SPSS for the variable
CATEGORY is depicted in Figure 2.2.
Figure 2.2 Bar graph showing the number of students of each category
Pie charts divide a complete circle (a pie) into slices, each corresponding to a category, with the
central angle and hence the area of the slice proportional to the category relative frequency.
Example 2.4b (Pie Chart) The pie chart generated by computer using EXCEL CHARTS for the
variable CATEGORY is depicted in Figure 2.3.
Figure 2.3 Pie chart showing the number of students of each category
Depending on the data, a display can use one, two or five lines per stem. Among these
variants, two-line stems are the most widely used.
Example 2.5 The quantity of glucose in blood of 100 persons is measured and recorded in
Table 2.0b (unit is mg %). Using SPSS we obtain the following Stem-and-Leaf display for this
data set.
Table 2.0b

 70  79  80  83  85  85  85  85  86  86
 86  87  87  88  89  90  91  91  92  92
 93  93  93  93  94  94  94  94  94  94
 95  95  96  96  96  96  96  97  97  97
 97  97  98  98  98  98  98  98 100 100
101 101 101 101 101 101 102 102 102 103
103 103 103 104 104 104 105 106 106 106
106 106 106 106 106 106 106 107 107 107
107 108 110 111 111 111 111 111 112 112
112 115 116 116 116 116 119 121 121 126
GLUCOSE

Frequency    Stem &  Leaf
 1.00 Extremes    (=<70)
 1.00         7 .  9
 2.00         8 .  03
11.00         8 .  55556667789
15.00         9 .  011223333444444
18.00         9 .  556666677777888888
18.00        10 .  001111112223333444
16.00        10 .  5666666666677778
 9.00        11 .  011111222
 6.00        11 .  566669
 2.00        12 .  11
 1.00 Extremes    (>=126)

Stem width: 10
Each leaf:   1 case(s)

Figure 2.4 Output from SPSS showing the Stem-and-Leaf display for the data set of glucose
The stem and leaf display of Figure 2.4 partitions the data set into 12 classes corresponding to
12 stems. Thus, here two-line stems are used. The number of leaves in each class gives the
class frequency.
Advantages of a stem and leaf display over a frequency distribution (considered in the
next section):
1. The original data are preserved.
2. A stem and leaf display arranges the data in an orderly fashion and makes it easy to
determine certain numerical characteristics to be discussed in the following chapter.
3. The classes and numbers falling in them are quickly determined once we have selected the
digits that we want to use for the stems and leaves.
Disadvantage of a stem and leaf display:
Sometimes there is not much flexibility in choosing the stems.
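The following Python sketch shows how a basic stem-and-leaf display can be built by hand; it uses one line per stem (the SPSS display above uses two-line stems and separates extreme values, which this sketch does not attempt):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Print a simple one-line-per-stem stem-and-leaf display."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)   # stem = leading digits, leaf = last digit
    for stem in sorted(stems):
        leaves = "".join(str(leaf) for leaf in stems[stem])
        print(f"{len(stems[stem]):3d}   {stem:3d} | {leaves}")

# A small subset of the glucose data of Table 2.0b
stem_and_leaf([70, 79, 80, 83, 85, 85, 86, 91, 92, 93, 101, 106])
```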
A frequency distribution is a table that organizes data into classes. It shows the number of
observations from the data set that fall into each of classes. It should be emphasized that we
always have in mind non-overlapping classes, i.e. classes without common items.
3. For each class, count the number of observations that fall in that class. This
number is called the class frequency.
4. Calculate each class relative frequency:

Class relative frequency = Class frequency / Total number of observations

Besides the frequency distribution and the relative frequency distribution, one usually uses the
class percentage, which is calculated by the formula:

Class percentage = Class relative frequency × 100%
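As an illustration of steps 3 and 4, here is a minimal Python sketch that counts class frequencies for equal-width, half-open classes [lower + i·width, lower + (i+1)·width) and prints the relative frequencies; the class parameters and data below are chosen for illustration only:

```python
def frequency_table(data, lower, width, n_classes):
    """Count class frequencies and relative frequencies for equal-width classes."""
    n = len(data)
    counts = [0] * n_classes
    for x in data:
        i = int((x - lower) // width)       # index of the class containing x
        if 0 <= i < n_classes:
            counts[i] += 1
    for i, f in enumerate(counts):
        a = lower + i * width
        print(f"[{a:5.1f}, {a + width:5.1f})  frequency={f:2d}  rel. frequency={f / n:.2f}")

frequency_table([70, 79, 80, 83, 85, 85, 86, 91, 92, 93], lower=62, width=4, n_classes=9)
```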
Example 2.6 Construct a frequency table for the data set of quantities of glucose in blood of 100
persons recorded in Table 2.0b (unit is mg%).
Using the software STATGRAPHICS, taking Lower limit = 62, Upper limit = 150 and Total
number of classes = 22 we obtained the following table.
Table 2.1 Frequency table for the glucose data set (output from STATGRAPHICS)

Class   Lower   Upper   Midpoint   Frequency   Relative    Cumulative   Cum. Rel.
        Limit   Limit                          Frequency   Frequency    Frequency
  0      62      66        64          0          0.00           0         0.00
  1      66      70        68          1          0.01           1         0.01
  2      70      74        72          0          0.00           1         0.01
  3      74      78        76          0          0.00           1         0.01
  4      78      82        80          2          0.02           3         0.03
  5      82      86        84          8          0.08          11         0.11
  6      86      90        88          5          0.05          16         0.16
  7      90      94        92         14          0.14          30         0.30
  8      94      98        96         18          0.18          48         0.48
  9      98     102       100         11          0.11          59         0.59
 10     102     106       104         18          0.18          77         0.77
 11     106     110       108          6          0.06          83         0.83
 12     110     114       112          8          0.08          91         0.91
 13     114     118       116          5          0.05          96         0.96
 14     118     122       120          3          0.03          99         0.99
 15     122     126       124          1          0.01         100         1.00
 16     126     130       128          0          0.00         100         1.00
 17     130     134       132          0          0.00         100         1.00
 18     134     138       136          0          0.00         100         1.00
 19     138     142       140          0          0.00         100         1.00
 20     142     146       144          0          0.00         100         1.00
 21     146     150       148          0          0.00         100         1.00
Remarks:
1. All classes of frequency table must be mutually exclusive.
2. Classes may be open-ended when either the lower or the upper end of a quantitative
classification scheme is limitless. For example
Class: age
birth to 7
8 to 15
........
64 to 71
72 and older
3. Classification schemes can be either discrete or continuous. Discrete classes are separate
entities that do not progress from one class to the next without a break, such as the
number of children in each family or the number of trucks owned by moving companies.
Discrete data are data that can take only a limited number of values. Continuous data do
progress from one class to the next without a break. They involve numerical measurement
such as the weights of cans of tomatoes or the kilograms of pressure on concrete. Usually,
continuous classes are half-open intervals. For example, the classes in Table 2.1 are half-open intervals [62, 66), [66, 70), ...
Figure 2.5 Frequency histogram for quantities of glucose, tabulated in Table 2.1
Remark: When comparing two or more sets of data, the various histograms cannot be
constructed on the same graph because superimposing the vertical bars of one on another
would cause difficulty in interpretation. For such cases it is necessary to construct relative
frequency or percentage polygons.
2.7.2 Polygons
As with histograms, when plotting polygons the phenomenon of interest is plotted along the
horizontal axis, while the vertical axis represents the number, proportion or percentage of
observations per class interval, depending on whether the particular polygon is a frequency
polygon, a relative frequency polygon or a percentage polygon. For example, the frequency
polygon is a line graph connecting the midpoints of each class interval in a data set, plotted at
a height corresponding to the frequency of the class.
Example 2.8 Figure 2.6 is a frequency polygon constructed from data in Table 2.1.
Figure 2.6 Frequency polygon for data of glucose in Table 2.1
Advantages of polygons: unlike histograms, two or more polygons can be superimposed on the
same graph for comparison.

2.8 Cumulative distributions and cumulative polygons

A cumulative distribution may be developed from the frequency distribution table, the relative
frequency distribution table or the percentage distribution table.
A cumulative frequency distribution enables us to see how many observations lie above or
below certain values, rather than merely recording the number of items within intervals.
A less-than cumulative frequency distribution may be developed from the frequency table as
follows:
Suppose a data set is divided into n classes by boundary points x1, x2, ..., xn, xn+1. Denote the
classes by C1, C2, ..., Cn. Thus, the class Ck = [xk, xk+1). See Figure 2.7.
[Figure 2.7 The classes C1 = [x1, x2), C2 = [x2, x3), ..., Ck = [xk, xk+1), ..., Cn = [xn, xn+1) laid out along the number line]
Suppose the frequency and relative frequency of class Ck are fk and rk (k = 1, 2, ..., n),
respectively. Then the cumulative frequency of observations falling into classes C1, C2, ..., Ck,
i.e., lying below the value xk+1, is the sum f1 + f2 + ... + fk. The corresponding cumulative
relative frequency is r1 + r2 + ... + rk.
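In code, the cumulative sums are a one-liner; a short Python sketch using the class frequencies of the glucose data from Table 2.1:

```python
from itertools import accumulate

# Class frequencies f1, ..., fn taken from Table 2.1 (glucose data)
freqs = [0, 1, 0, 0, 2, 8, 5, 14, 18, 11, 18, 6, 8, 5, 3, 1, 0, 0, 0, 0, 0, 0]
n = sum(freqs)

cum_freqs = list(accumulate(freqs))        # f1 + f2 + ... + fk for each k
cum_rel = [c / n for c in cum_freqs]       # r1 + r2 + ... + rk for each k
print(cum_freqs)                           # ends with 100
print([round(r, 2) for r in cum_rel])      # ends with 1.0
```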
Example 2.9 Table 2.1 gives the frequency, relative frequency, cumulative frequency and
cumulative relative frequency distribution for the quantity of glucose in blood of 100 persons.
According to this table the number of persons having quantity of glucose not exceeding 90 is 16.
[Figure: cumulative frequency polygon (ogive) for the glucose data of Table 2.1]
2.9 Summary
This chapter discussed methods for presenting data sets of qualitative and quantitative variables.
For a qualitative data set we first define categories and the category frequency which is the
number of observations falling in each category. Further, the category relative frequency
and the percentage for a category are introduced. Bar graphs and pie charts as the graphical
pictures of the data set are constructed.
If the data are quantitative and the number of the observations is small the categorization and
the determination of class frequencies can be done by constructing a stem and leaf display.
Large sets of data are best described using relative frequency distribution. The latter presents a
table that organizes data into classes with their relative frequencies. For describing the
quantitative data graphically histogram and polygon are used.
2.10 Exercises
1) In a national cancer institute survey, 1,580 adult women recently responded to the question
"In your opinion, what is the most serious health problem facing women?" The responses
are summarized in the following table:

The most serious health problem for women   Relative frequency
Breast cancer                               0.44
Other cancers                               0.31
Emotional stress                            0.07
                                            0.06
Heart trouble                               0.03
Other problems                              0.09
2) The following data are the waiting times of 14 patients:

16  21  20  24  11  17  29
18  26  14  25  26  15  16
a) Arrange the data in an array from lowest to highest. What comment can you make
about patient waiting time from your data array?
b) Construct a frequency distribution using 6 classes. What additional interpretation can
you give to the data from the frequency distribution?
c) Construct the cumulative relative frequency polygon and from this ogive state how long
75% of the patients should expect to wait.
3) Bacteria are the most important component of microbial ecosystems in sewage treatment
plants. Water management engineers must know the percentage of active bacteria at each
stage of the sewage treatment. The accompanying data represent the percentages of
respiring bacteria in 25 raw sewage samples collected from a sewage plant.
42.3  50.6  41.7  36.5  28.6
40.7  48.1  48.0  45.7  39.9
32.3  31.7  39.6  37.5  40.8
50.1  39.2  38.5  35.6  45.6
34.9  46.1  38.3  44.5  37.2
22.8  21.9  22.0  20.7  20.9  25.0  22.2
22.8  20.1  25.3  20.7  22.5  21.2  23.8
23.3  20.9  22.9  23.5  19.5  23.7  20.3
23.6  19.0  25.1  25.0  19.5  24.1  24.2
21.8  21.3  21.5  23.1  19.9  24.2  24.1
19.8  23.9  22.8  23.9  19.7  24.2  23.8
20.7  23.8  24.3  21.1  20.9  21.6  22.7
Chapter 3
CONTENTS
3.1. Introduction
3.2. Types of numerical descriptive measures
3.3. Measures of central tendency
3.4. Measures of data variation
3.5. Measures of relative standing
3.6. Shape
3.7. Methods for detecting outlier
3.8. Calculating some statistics from grouped data
3.9. Computing descriptive summary statistics using computer softwares
3.10. Summary
3.11. Exercises
3.1 Introduction
In the previous chapter data were collected and appropriately summarized into tables and charts. In
this chapter a variety of descriptive summary measures will be developed. These descriptive measures
are useful for analyzing and interpreting quantitative data, whether collected in raw form (ungrouped
data) or summarized into frequency distributions (grouped data).
Definition 3.1
The arithmetic mean of a sample (or simply the sample mean) of n observations
$x_1, x_2, \ldots, x_n$, denoted by $\bar{x}$, is computed as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Definition 3.1a
The population mean, denoted by $\mu$, is defined by the formula

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,$$

where N is the size of the population.
Note that the definitions of the population mean and the sample mean are the same. It is also
valid for the definition of other measures of central tendency. But in the next section we will give
different formulas for variances of population and sample.
Example 3.1 Consider 7 observations: 4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0.
By definition, $\bar{x} = (4.2 + 4.3 + 4.7 + 4.8 + 5.0 + 5.1 + 9.0)/7 = 37.1/7 = 5.3$.
3.3.2. Median
Definition 3.2
The median m of a sample of n observations x1 , x2 , , xn arranged in ascending or
descending order is the middle number that divides the data set into two equal
halves: one half of the items lie above this point, and the other half lie below it.
$$m = \text{Median} = \begin{cases} x_k & \text{if } n = 2k - 1 \ (n \text{ is odd}) \\ \tfrac{1}{2}(x_k + x_{k+1}) & \text{if } n = 2k \ (n \text{ is even}) \end{cases}$$
Example 3.2 Find the median of the data set consisting of the observations 7, 4, 3, 5, 6, 8, 10.
Solution First, we arrange the data set in ascending order
3 4 5 6 7 8 10.
Since the number of observations is odd, n = 7 = 2 × 4 − 1, the median is m = x4 = 6. We see that half of
the observations, namely 3, 4, 5, lie below the value 6 and the other half, namely 7, 8 and 10, lie above
the value 6.
Example 3.3 Suppose we have an even number of the observations 7, 4, 3, 5, 6, 8, 10, 1. Find
the median of this data set.
Solution First, we arrange the data set in ascending order
1 3 4 5 6 7 8 10.
Since the number of observations is even, n = 8 = 2 × 4, by Definition 3.2
Median = (x4 + x5)/2 = (5 + 6)/2 = 5.5
Advantage of the median over the mean: Extreme values in data set do not affect the median
as strongly as they do the mean.
Indeed, for the data set in Example 3.1 we have
mean = 5.3 and median = 4.8.
The extreme value of 9.0 does not affect the median.
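A quick check of this point with Python's standard library, using the data of Example 3.1:

```python
import statistics

data = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]   # data of Example 3.1

print(statistics.mean(data))     # 5.3  -- pulled upward by the extreme value 9.0
print(statistics.median(data))   # 4.8  -- unaffected by the extreme value
```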
3.3.3 Mode
Definition 3.3
The mode of a data set x1, x2, ..., xn is the value of x that occurs with the greatest
frequency, i.e., is repeated most often in the data set.
Example 3.4 Find the mode of the data set in Table 3.1.
Table 3.1 Quantity of glucose (mg%) in blood of 25 students

70  88  95  101  106
79  93  96  101  107
83  93  97  103  108
86  93  97  103  112
87  95  98  106  115
This data set contains 25 numbers. We see that, the value of 93 is repeated most often.
Therefore, the mode of the data set is 93.
Multimodal distribution: a data set with more than one mode is called a multimodal
distribution.
The mode is not used as often to measure central tendency as are the mean and the
median. Too often, there is no modal value because the data set contains no values that
occur more than once. Other times, every value is the mode because every value occurs
the same number of times. Clearly, the mode is a useless measure in these cases.
When data sets contain two, three, or many modes, they are difficult to interpret and
compare.
Definition 3.4
Suppose all the n observations in a data set $x_1, x_2, \ldots, x_n$ are positive. Then the geometric
mean of the data set is defined by the formula

$$\bar{x}_G = \sqrt[n]{x_1 x_2 \cdots x_n}$$

The geometric mean is appropriate to use whenever we need to measure the average rate of
change (the growth rate) over a period of time.
From the above formula it follows that

$$\log \bar{x}_G = \frac{1}{n}\sum_{i=1}^{n} \log x_i,$$

where log is the logarithmic function of any base.
Thus, the logarithm of the geometric mean of the values of a data set is equal to the arithmetic mean of
the logarithms of the values of the data set.
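A small Python sketch of both computations; the growth factors below are hypothetical:

```python
import math

factors = [1.05, 1.10, 0.98, 1.20]   # hypothetical per-period growth factors

# Geometric mean via the logarithm identity above
g = math.exp(sum(math.log(x) for x in factors) / len(factors))

# Equivalent direct computation: the n-th root of the product
assert math.isclose(g, math.prod(factors) ** (1 / len(factors)))
print(g)   # average growth factor per period, about 1.08
```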
Definition 3.5
The range of a quantitative data set is the difference between the largest and smallest values
in the set.
Range = Maximum - Minimum,
where Maximum = Largest value, Minimum = Smallest value.
Definition 3.6
The population variance of a population of N observations $x_1, x_2, \ldots, x_N$ is defined by the formula

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N},$$

where $\mu$ = population mean.
From the Definition 3.6 we see that the population variance is the average of the squared
distances of the observations from the mean.
Definition 3.7
The standard deviation of a population is equal to the square root of the variance:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$
Note that for the variance, the units are the squares of the units of the data. And for the
standard deviation, the units are the same as those used in the data.
Definition 3.6a
The sample variance of a sample of n observations $x_1, x_2, \ldots, x_n$ is defined by the formula

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1},$$

where $\bar{x}$ = sample mean. The sample standard deviation is

$$s = \sqrt{s^2}$$
Remark: In the denominator of the formula for s² we use n − 1 instead of n because statisticians
proved that if s² is defined as above then s² is an unbiased estimate of the variance of the
population from which the sample was selected (i.e., the expected value of s² is equal to the
population variance).
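A direct Python translation of Definitions 3.6a and 3.7, applied to the data of Example 3.1:

```python
def sample_variance(xs):
    """Sample variance with the n - 1 denominator (an unbiased estimate)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [4.2, 4.3, 4.7, 4.8, 5.0, 5.1, 9.0]   # data of Example 3.1
s2 = sample_variance(data)
print(s2, s2 ** 0.5)   # sample variance and sample standard deviation
```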
Uses of the standard deviation
The standard deviation enables us to determine, with a great deal of accuracy, where the
values of a frequency distribution are located in relation to the mean. We can do this according
to a theorem devised by the Russian mathematician P.L. Chebyshev (1821-1894).
Chebyshev's Theorem
For any data set with the mean x̄ and the standard deviation s, at least 75% of the
values will fall within the interval x̄ ± 2s, and at least 89% of the values will fall within
the interval x̄ ± 3s.
We can measure with even more precision the percentage of items that fall within specific
ranges under a symmetrical, bell-shaped curve. In these cases we have:
x̄ ± s:  close to 68%
x̄ ± 2s: close to 95%
x̄ ± 3s: near 100%
Definition 3.8
The coefficient of variation of a data set is the ratio of its standard deviation to its mean:

cv = Coefficient of variation = (Standard deviation / Mean) × 100%
Example 3.6 Suppose that each day laboratory technician A completes 40 analyses with a
standard deviation of 5. Technician B completes 160 analyses per day with a standard deviation
of 15. Which employee shows less variability?
At first glance, it appears that technician B has three times more variation in the output rate than
technician A. But B completes analyses at a rate 4 times faster than A. Taking all this
information into account, we compute the coefficient of variation for both technicians:
For technician A: cv = 5/40 × 100% = 12.5%
For technician B: cv = 15/160 × 100% = 9.4%.
So, we find that, technician B who has more absolute variation in output than technician A, has
less relative variation.
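The computation of Example 3.6 in a couple of lines of Python:

```python
def coefficient_of_variation(mean, std):
    """cv = standard deviation relative to the mean, in percent."""
    return std / mean * 100

print(coefficient_of_variation(40, 5))     # technician A: 12.5
print(coefficient_of_variation(160, 15))   # technician B: 9.375, i.e. about 9.4
```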
Descriptive measures that locate the relative position of an observation in relation to the other
observations are called measures of relative standing.
A measure that expresses this position in terms of a percentage is called a percentile for the
data set.
Definition 3.9
Suppose a data set is arranged in ascending (or descending ) order. The pth
percentile is a number such that p% of the observations of the data set fall below and
(100-p)% of the observations fall above it.
The median, by definition, is the 50th percentile.
The 25th percentile, the median and the 75th percentile are often used to describe a data set
because they divide the data set into 4 groups, with each group containing one-fourth (25%) of
the observations. They would also divide the relative frequency distribution for a data set into 4
parts, each containing the same area (0.25), as shown in Figure 3.1. Consequently, the 25th
percentile, the median, and the 75th percentile are called the lower quartile, the mid-quartile, and
the upper quartile, respectively, for a data set.
Definition 3.10
The lower quartile, QL, for a data set is the 25th percentile
Definition 3.11
The mid-quartile, M, for a data set is the 50th percentile.
Definition 3.12
The upper quartile, QU, for a data set is the 75th percentile.
Definition 3.13
The interquartile range of a data set is QU - QL .
For large data sets, quartiles are found by locating the corresponding areas under the relative
frequency distribution polygon as in Figure 3.1. However, when the sample data set is small, it
may be impossible to find an observation in the data set that exceeds, say, exactly 25% of the
remaining observations. Consequently, the lower and the upper quartiles for small data set are
not well defined. The following box describes a procedure for finding quartiles for small data
sets.
1. Rank the n observations in the data set in ascending order of magnitude.
2. Calculate the quantity (n+1)/4 and round to the nearest integer. The observation
with this rank represents the lower quartile. If (n+1)/4 falls halfway between two
integers, round up.
3. Calculate the quantity 3(n+1)/4 and round to the nearest integer. The observation
with this rank represents the upper quartile. If 3(n+1)/4 falls halfway between two
integers, round down.
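The rounding rule in the box translates directly into Python; applied to the 25 observations of Table 3.1 it reproduces the quartiles found in Example 3.7 below:

```python
import math

def quartiles(data):
    """Lower and upper quartiles by the small-sample rounding rule above."""
    xs = sorted(data)
    n = len(xs)
    lo, hi = (n + 1) / 4, 3 * (n + 1) / 4
    # Halfway cases: round the lower rank up and the upper rank down
    rank_lo = math.ceil(lo) if lo % 1 == 0.5 else round(lo)
    rank_hi = math.floor(hi) if hi % 1 == 0.5 else round(hi)
    return xs[rank_lo - 1], xs[rank_hi - 1]

glucose = [70, 88, 95, 101, 106, 79, 93, 96, 101, 107, 83, 93, 97,
           103, 108, 86, 93, 97, 103, 112, 87, 95, 98, 106, 115]
print(quartiles(glucose))   # (93, 103)
```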
Example 3.7 Find the lower quartile, the median, and the upper quartile for the data set in
Table 3.1.
Solution For this data set n = 25. Therefore, (n+1)/4 = 26/4 = 6.5, 3(n+1)/4 = 3*26/4 = 19.5. We
round 6.5 up to 7 and 19.5 down to 19. Hence, the lower quartile = 7th observation = 93, the
upper quartile =19th observation = 103. We also have the median = 13th observation = 97. The
location of these quartiles is presented in Figure 3.2.
Min = 70,  QL = 93,  m = 97,  QU = 103,  Max = 115

Figure 3.2 Location of the quartiles for the data set of Table 3.1
Another measure of relative standing is the z-score for an observation (or standard score).
It describes how far an individual item in a distribution departs from the mean of the distribution.
The standard score gives us the number of standard deviations a particular observation lies below
or above the mean.
Definition 3.14
The standard score (or z-score) is defined as follows:

For a population: z = (x − μ)/σ, where μ and σ are the mean and the standard deviation of the population.

For a sample: z = (x − x̄)/s, where x̄ and s are the mean and the standard deviation of the sample.
3.6 Shape
The fourth important numerical characteristic of a data set is its shape. In describing a
numerical data set it is not only necessary to summarize the data by presenting appropriate
measures of central tendency, dispersion and relative standing, it is also necessary to consider
the shape of the data, that is, the manner in which the data are distributed.
There are two measures of the shape of a data set: skewness and kurtosis.
3.6.1 Skewness
If the distribution of the data is not symmetrical, it is called asymmetrical or skewed.
Skewness characterizes the degree of asymmetry of a distribution around its mean. For a
sample data, the skewness is defined by the formula:
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$$
Figure 3.3a Right-skewed distribution
3.6.2 Kurtosis
Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the
bell-shaped distribution (normal distribution).
Kurtosis of a sample data set is calculated by the formula:

$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
Figure 3.4 Distributions with positive and negative kurtosis
Outliers occur when the relative frequency distribution of the data set is extremely skewed,
because such a distribution has a tendency to include extremely large or small
observations.
There are two widely used methods for detecting outliers.
Method of using z-score:
According to Chebyshev's theorem almost all the observations in a data set will have z-scores less
than 3 in absolute value, i.e., fall into the interval (x̄ − 3s, x̄ + 3s), where x̄ is the mean and s is
the standard deviation of the sample. Therefore, the observations with z-scores greater than 3
in absolute value will be suspected as outliers.
Example 3.8 The doctor of a school has measured the height of pupils in class 5A. The
results (in cm) are as follows:
132  138  136  131  153  131  133  129
133  110  132  129  134  135  132  135
134  133  132  130  131  134  135
For this data set, x̄ = 132.77, s = 6.06, 3s = 18.18. The z-score of the observation 153 is
(153 − 132.77)/6.06 = 3.34 and the z-score of 110 is (110 − 132.77)/6.06 = −3.76. Since the absolute
values of the z-scores of 153 and 110 are more than 3, the height of 153 cm and the height of 110
cm are outliers in the data set.
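The z-score method of Example 3.8 as a short Python function:

```python
def zscore_outliers(data):
    """Return observations whose z-score exceeds 3 in absolute value."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return [x for x in data if abs((x - mean) / s) > 3]

heights = [132, 138, 136, 131, 153, 131, 133, 129, 133, 110, 132, 129,
           134, 135, 132, 135, 134, 133, 132, 130, 131, 134, 135]
print(zscore_outliers(heights))   # [153, 110]
```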
Box plot method
Another procedure for detecting outliers is to construct a box plot of the data. Below we present
steps to follow in constructing a box plot.
[Schematic of a box plot: the box spans the interquartile range from QL to QU (IQR = QU − QL); the inner fences lie 1.5 × IQR beyond the quartiles and the outer fences a further 1.5 × IQR beyond the inner fences; observations beyond the fences (marked *) are suspected outliers]
Figure 3.6 Output from SPSS showing box plot for the data set in Table 3.2
CLASS (DOLLARS)   FREQUENCY
  0 - 49.99           78
 50 - 99.99          123
100 - 149.99         187
150 - 199.99          82
200 - 249.99          51
250 - 299.99          47
300 - 349.99          13
350 - 399.99
400 - 449.99
450 - 499.99
From the information in this table, we can easily compute an estimate of the value of the mean
and the standard deviation.
Formulas for calculating the mean and the standard deviation for grouped data:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{k} f_i x_i^2 - \left(\sum_{i=1}^{k} f_i x_i\right)^2 / n}{n-1},$$

where k = number of classes, x_i = midpoint of the i-th class, f_i = frequency of the i-th class and
n = f_1 + f_2 + ... + f_k.
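A minimal Python sketch of these grouped-data formulas; the class midpoints and frequencies below are hypothetical:

```python
def grouped_mean_std(midpoints, freqs):
    """Approximate mean and standard deviation from grouped data."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
    mean = sum_fx / n
    var = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, var ** 0.5

# Hypothetical grouped data: class midpoints and class frequencies
midpoints = [17.0, 22.0, 27.0, 32.0]
freqs = [5, 11, 8, 4]
print(grouped_mean_std(midpoints, freqs))
```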
Variable: GLUCOSE.GLUCOSE
---------------------------------------------
Sample size              100
Average                  100.
Median                   100.5
Mode                     106.
Geometric mean           99.482475
Variance                 102.767677
Standard deviation       10.137439
Standard error           1.013744
Minimum                  70.
Maximum                  126.
Range                    56.
Lower quartile           94.
Upper quartile           106.
Interquartile range      12.
Skewness                 -0.051526
Kurtosis                 0.131118
Coeff. of variation      10.137439
3.10 Summary
Numerical descriptive measures enable us to construct a mental image of the relative frequency
distribution for a data set pertaining to a numerical variable. There are 4 types of these
measures: location, dispersion, relative standing and shape.
Three numerical descriptive measures used to locate a relative frequency distribution are
the mean, the median, and the mode. Each conveys a special piece of information. In a sense,
the mean is the balancing point for the data. The median, which is insensitive to extreme values,
divides the data set into two equal halves: half of the observations will be less than the median
and half will be larger. The mode is the observation that occurs with greatest frequency. It is the
value of the data set that locates the point where the relative frequency distribution achieves its
maximum relative frequency.
The range and the standard deviation measure the spread of a relative frequency distribution.
In particular, we can obtain a very good notion of the way data are distributed around the mean
by constructing the intervals and referring to Chebyshev's theorem and the Empirical rule.
Percentiles, quartiles, and z-scores measure the relative position of an observation in a data set.
The lower and upper quartiles and the distance between them called the inter-quartile range can
also help us visualize a data set. Box plots constructed from intervals based on the inter-quartile
range and z-scores provide an easy way to detect possible outliers in the data.
The two numerical measures of the shape of a data set are skewness and kurtosis. The
skewness characterizes the degree of asymmetry of a distribution around its mean. The kurtosis
characterizes the relative peakedness or flatness of a distribution compared with the bell-shaped distribution.
3.11 Exercises
1. The ages of a sample of the people attending a training course on networking in IOIT in
Hanoi are:
29  20  23  22  30  32  28
23  24  27  28  31  32  33
31  28  26  25  24  23  22
26  28  31  25  28  27  34
a) Construct a frequency distribution with intervals 15-19, 20-24, 25-29, 30-34, 35-39.
b) Compute the mean and the standard deviation of the raw data set.
c) Compute the approximate values for the mean and the standard deviation using the
constructed frequency distribution table. Compare these values with ones obtained in b).
2. Industrial engineers periodically conduct work measurement analyses to determine the
time used to produce a single unit of output. At a large processing plant, the total number of
man-hours required per day to perform a certain task was recorded for 50 days. This
information will be used in a work measurement analysis. The total man-hours required for
each of the 50 days are listed below.
128  119   95   97  124  128  142   98  108  120
113  109  124   97  138  133  136  120  112  146
128  103  135  114  109  100  111  131  113  132
124  131  133   88  118  116   98  112  138  100
112  111  150  117  122   97  116   92  122  125
a) Compute the mean, the median, and the mode of the data set.
b) Find the range, the variance and the standard deviation of the data set.
c) Construct the intervals x̄ ± s, x̄ ± 2s, x̄ ± 3s. Count the number of observations that fall
within each interval and find the corresponding proportions. Compare the results to
Chebyshev's theorem. Do you detect any outliers?
e) Find the 75th percentile for the data on total daily man-hours.
3. An engineer tested nine samples of each of three designs of a certain bearing for a new
electrical winch. The following data are the number of hours it took for each bearing to fail when
the winch motor was run continuously at maximum output, with a load on the winch equivalent
to 1.9 times the intended capacity.
DESIGN
A:  16  18  21  16  27  17  53  34  23
B:  21  34  32  17  32  21  25  19  18
C:  30  34  21  21  17  28  45  43  19
125  145  180  175  167  154  143
120  180  175  190  200  145  165
210  120  187  179  167  165  134
167  189  182  145  178  231  185
200  231  240  230  180  154
Chapter 4
CONTENTS
Example 4.1 Toss a coin and observe the upper face. There are two possible outcomes:
H: Head is observed,
T: Tail is observed.
You cannot predict with certainty which outcome will be observed.
Example 4.2 Toss a die and observe the number of dots on its upper face. You may observe
one, two, three, four, five or six dots on the upper face of the die. You cannot predict
this number in advance.
Example 4.3 When you draw one card from a standard 52-card bridge deck, the outcome of
this experiment cannot be predicted with certainty in advance.
The probability of an event A, denoted by P(A), in general, is the chance that A will happen.
But how to measure the chance of occurrence, i.e., how to determine the probability of an event?
The answer to this question will be given in the next sections.
[Figure: Venn diagrams showing two non-mutually exclusive events, their sum A + B and their product AB]
In every problem in the theory of probability one has to deal with an experiment (under some
specific set of conditions) and some specific family of events S.
Definition 4.3
A family S of events is called a field of events if it satisfies the following properties:
1. If the events A and B belong to the family S, then so do the events AB, A + B and the complements of A and B.
2. The family S contains the certain event E and the impossible event ∅.
We see that the sample space of an experiment together with all the events generated from the
events of this space by the operations of sum, product and complement constitutes a field of
events. Thus, for every experiment we have a field of events.
Thus, for the classical definition of probability we suppose that all possible simple events are
equally likely. Then the probability of an event A is

P(A) = m/N,

where m = number of the simple events into which the event A can be decomposed (the simple
events favorable to A), and N = total number of simple events of the experiment.
Example 4.4 Consider again the experiment of tossing a balanced coin (see Example 4.1). In
this experiment the sample space consists of two simple events: H (Head is observed ) and T
(Tail is observed ). These events are equally likely. Therefore, P(H)=P(T)=1/2.
Example 4.5 Consider again the experiment of tossing a balanced die (see Example 4.2). In
this experiment the sample space consists of 6 simple events: D1, D2, D3, D4, D5, D6, where Dk
is the event that k dots (k=1, 2, 3, 4, 5, 6) are observed on the upper face of the die. These
events are equally likely. Therefore, P(Dk) =1/6 (k=1, 2, 3, 4, 5, 6).
Since Dodd = D1 + D3 + D5 and Deven = D2 + D4 + D6, where Dodd is the event that an odd number of dots
is observed and Deven the event that an even number of dots is observed, we have P(Dodd) = 3/6 = 1/2
and P(Deven) = 3/6 = 1/2. If we denote by A the event that fewer than 6 dots are observed then
P(A) = 5/6 because A = D1 + D2 + D3 + D4 + D5.
According to the above definition, every event belonging to the field of events S has a well-defined
probability. Therefore, the probability P(A) may be regarded as a function of the event A
defined over the field of events S. This function has the following properties, which are easily
proved.
The properties of probability:
1. For every event A of the field S, P(A) ≥ 0.
2. For the certain event E, P(E) = 1.
3. If the event A is decomposed into the mutually exclusive events B and C
belonging to S then P(A) = P(B) + P(C).
This property is called the theorem on the addition of probabilities.
4. The probability of the event Ā complementary to the event A is given by the
formula P(Ā) = 1 − P(A).
5. The probability of the impossible event is zero, P(∅) = 0.
6. If the event A implies the event B then P(A) ≤ P(B).
7. The probability of any event A lies between 0 and 1: 0 ≤ P(A) ≤ 1.
Example 4.6 Consider the experiment of tossing two fair coins. Find the probability of the event
Fortunately, for the events to which the classical definition of probability is applicable, the
statistical probability is equal to the probability in the sense of the classical definition.
4.4.3 Axiomatic construction of the theory of probability (optional)
The classical and statistical definitions of probability reveal some restrictions and shortcomings
when dealing with complex natural phenomena; especially, they may lead to paradoxical
conclusions, for example, the well-known Bertrand's paradox. Therefore, in order to find wide
applications of the theory of probability, mathematicians have constructed a rigorous foundation
of this theory. The first work they have done is the axiomatic definition of probability that
includes as special cases both the classical and statistical definitions of probability and
overcomes the shortcomings of each.
Below we formulate the axioms that define probability.
Obviously, the classical and statistical definitions of probability, which deal with finite sums of
events, satisfy the axioms formulated above. The necessity for introducing the extended axiom
of addition is motivated by the fact that in probability theory we constantly have to consider
events that decompose into an infinite number of sub-events.
Definition 4.6
The probability of an event A, given that an event B has occurred, is called the
conditional probability of A given B and denoted by the symbol P(A|B).
Example 4.7 Consider the experiment of tossing a fair die. Denote by A the event that an even
number of dots is observed and by B the event that the number of dots observed does not exceed 3.
If the event B has occurred then it reduces the sample space of the experiment from 6 simple
events to 3 simple events (namely those D1, D2, D3 contained in event B). Since the only even
number of three numbers 1, 2, 3 is 2 there is only one simple event D2 of reduced sample
space that is contained in the event A. Therefore, we conclude that the probability that A occurs
given that B has occurred is one in three, or 1/3, i.e., P(A|B) = 1/3.
For the above example it is easy to verify that P(A|B) = P(AB)/P(B). In the general case, we use

P(A|B) = P(AB)/P(B),    (1)

P(B|A) = P(AB)/P(A).    (1')
Each of the formulas (1) and (1') is equivalent to the so-called Multiplication Theorem.
Multiplication Theorem
The probability of the product of two events is equal to the product of the probability
of one of the events by the conditional probability of the other event, given that the
first event has occurred, namely

P(AB) = P(A) P(B|A) = P(B) P(A|B).    (2)
Definition 4.7
We say that an event A is independent of an event B if
P(A|B) = P(A),
i.e., the occurrence of the event B does not affect the probability of the event A.
If the event A is independent of the event B, then it follows from (2) that
P(A) P(B|A) = P(B) P(A). From this we find P(B|A) = P(B) if P(A)>0, i.e., the event B is also
independent of A. Thus, independence is a symmetrical relation.
Example 4.8 Consider the experiment of tossing a fair die and define the following events:
P(A|B) = P(AB)/P(B) = (1/3)/(2/3) = 1/2 = P(A).
Thus, assuming B has occurred does not alter the probability of A. Therefore, the events A and
B are independent.
The concept of independence of events plays an important role in the theory of probability and
its applications. In particular, the greater part of the results presented in this course is obtained
on the assumption that the various events considered are independent.
In practical problems, we rarely resort to verifying that relations P(A|B) = P(A) or P(B|A) = P(B)
are satisfied in order to determine whether or not the given events are independent. To
determine independence, we usually make use of intuitive arguments based on experience.
The Multiplication Theorem in the case of independent events takes on a simple form.
We next generalize the notion of the independence of two events to that of a collection of
events.
Definition 4.8
The events B1, B2, ..., Bn are called collectively independent or mutually independent
if for any event Bp (p = 1, 2,..., n) and for any group of other events Bq, Br, ...,Bs of
this collection, the event Bp and the event BqBr...Bs are independent.
Note that for several events to be mutually independent, it is not sufficient that they be pairwise
independent.
Addition rule
If the events A1, A2, ..., An are pairwise mutually exclusive then

P(A1 + A2 + ... + An) = P(A1) + P(A2) + ... + P(An).

Example 4.9 In a box there are 10 red balls, 20 blue balls, 10 yellow balls and 10 white balls.
One ball is drawn at random from the box. Find the probability that this ball is colored (i.e., not white).
Solution Call the event that the ball drawn is red R, blue B, yellow Y, white W and colored C. Then
P(R) = 10/(10+20+10+10) = 10/50 = 1/5, P(B) = 20/50 = 2/5, P(Y) = 10/50 = 1/5.
Since C = R + B + Y and the events R, B and Y are mutually exclusive, we have P(C) = P(R+B+Y)
= 1/5 + 2/5 + 1/5 = 4/5.
In the preceding section we also obtained the Multiplication Theorem. For the purpose of
computing probabilities we recall it here.
Multiplicative rule
For any two events A and B from the same field of events there holds the formula

P(AB) = P(A) P(B|A) = P(B) P(A|B).

Now suppose that the event B may occur together with one and only one of n mutually exclusive
events A1, A2, ..., An, that is

B = A1B + A2B + ... + AnB.

Then, by the addition and multiplicative rules, P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An)
(the total probability formula), and for each k

P(Ak|B) = P(Ak) P(B|Ak) / P(B).
Bayes's Formula
If the event B may occur together with one and only one of n mutually exclusive
events A1, A2, ..., An then

$$P(A_k|B) = \frac{P(A_k)P(B|A_k)}{P(B)} = \frac{P(A_k)P(B|A_k)}{\sum_{j=1}^{n} P(A_j)P(B|A_j)}$$
The formula of Bayes is sometimes called the formula for probabilities of hypotheses.
Example 4.11 As in Example 4.10, there are 5 boxes of lamps:
3 boxes with the content A1: 9 good lamps and 1 defective lamp,
2 boxes with the content A2: 4 good lamps and 2 defective lamps.
From one of the boxes, chosen at random, a lamp is withdrawn. It turns out to be defective
(event B). What is the probability, after the experiment has been performed (the a posteriori
probability), that the lamp was taken from a box of content A1?
Solution We have calculated P(A1) = 3/5, P(A2) = 2/5, P(B|A1) = 1/10, P(B|A2) = 2/6 = 1/3, P(B)
= 29/150. Hence, the formula of Bayes gives
P(A1|B) = (3/5 × 1/10) / (29/150) = 9/29 ≈ 0.31.
Thus, the probability that the lamp was taken from a box of content A1, given that the experiment
has been performed, is equal to 0.31.
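The computation of Example 4.11 expressed as a small Python function for Bayes's formula:

```python
def bayes(priors, likelihoods, k):
    """P(A_k | B) from the priors P(A_i) and the likelihoods P(B | A_i)."""
    p_b = sum(p * l for p, l in zip(priors, likelihoods))   # total probability P(B)
    return priors[k] * likelihoods[k] / p_b

# Example 4.11: P(A1) = 3/5, P(A2) = 2/5, P(B|A1) = 1/10, P(B|A2) = 1/3
print(bayes([3/5, 2/5], [1/10, 1/3], 0))   # 9/29, about 0.31
```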
4.7 Summary
In this chapter we introduced the notion of experiment whose outcomes called the events could
not be predicted with certainty in advance. The uncertainty associated with these events was
measured by their probabilities. But what is probability? To answer this question we
briefly discussed approaches to probability and gave the classical and statistical definitions of
probability. The classical definition of probability reduces the concept of probability to the
concept of equiprobability of simple events. According to the classical definition, the probability
of an event A is equal to the number of possible simple events favorable to A divided by the
total number of possible simple events of the experiment. In turn, by the statistical definition the
probability of an event A is approximated by the proportion of times that A occurs when the
experiment is repeated a very large number of times.
4.8 Exercises
A, B, C are random events.
1) Explain the meaning of the relations:
a) ABC = A;
b) A + B + C = A.
2) Simplify the expressions
a) (A+B)(B+C);
b) (A + B)(A + B̄);
3)
4)
5)
6)
7)
Chapter 5
CONTENTS
Definition 5.1
A random variable is a variable that assumes numerical values associated with
events of an experiment.
Example 5.1 Observe 100 babies to be born in a clinic. The number of boys, which have been
born, is a random variable. It may take values from 0 to 100.
Example 5.2 Number of patients of a clinic daily is a random variable.
Example 5.3 Select one student from a university, measure his/her height and record this
height by x. Then x is a random variable, assuming values from, say, 100 cm to 250 cm,
depending on the specific student.
Example 5.4 The weight of babies at birth also is a random variable. It can assume values in
the interval, for example, from 800 grams to 6000 grams.
Classification of random variables: Random variables may be divided into two types:
discrete random variables and continuous random variables.
Definition 5.2
A discrete random variable is one that can assume only a countable number of
values.
A continuous random variable can assume any value in one or more intervals on
a line.
Among the random variables described above the number of boys in Example 5.1 and the
number of patients in Example 5.2 are discrete random variables, the height of students and the
weight of babies are continuous random variables.
Example 5.5 Suppose you randomly select a student attending your university. Classify each of
the following random variables as discrete or continuous:
a) Number of credit hours taken by the student this semester
b) Current grade point average of the student.
Solution a) The number of credit hours taken by the student this semester is a discrete random
variable because it can assume only a countable number of values (for example 10, 11, 12, and
so on). It is not continuous since the number of credit hours cannot assume values such as 11.5678,
15.3456 or 12.9876 hours.
b) The grade point average for the student is a continuous random variable because it could
theoretically assume any value (for example, 5.455, 8.986) corresponding to the points on the
interval from 0 to 10 of a line.
Thus, the probability distribution for a discrete random variable x may be given in one of the
following ways:
1. the table

x      x1   x2   ...   xn
p(x)   p1   p2   ...   pn

where pk is the probability that the variable x assumes the value xk (k = 1, 2, ..., n);
2. a formula for calculating p(xk) (k = 1, 2, ..., n);
3. a graph presenting the probability of each value xk.
Example 5.6 A balanced coin is tossed twice and the number x of heads is observed. Find the
probability distribution for x.
Solution Let Hk and Tk denote the observation of a head and a tail, respectively, on the kth toss,
for k = 1, 2. The four simple events and the associated values of x are shown in Table 5.1.
Table 5.1

SIMPLE EVENT   DESCRIPTION   PROBABILITY   NUMBER OF HEADS
E1             H1H2          0.25          2
E2             H1T2          0.25          1
E3             T1H2          0.25          1
E4             T1T2          0.25          0

The event x = 0 is the collection of all simple events that yield a value of x = 0, namely, the
simple event E4. Therefore, the probability that x assumes the value 0 is p(0) = P(E4) = 0.25.
Similarly, p(1) = P(E2) + P(E3) = 0.5 and p(2) = P(E1) = 0.25. Thus the probability distribution
of x is:

x      0      1      2
p(x)   0.25   0.50   0.25
Figure 5.1 Probability distribution for x, the number of heads in two tosses of a coin

Note that the probabilities in the probability distribution of a discrete random variable x must satisfy

$$\sum_{\text{all } x} p(x) = 1$$
Relationship between the probability distribution for a discrete random variable and the
relative frequency distribution of data:
Suppose you were to toss two coins over and over again a very large number of times and
record the number x of heads for each toss. A relative frequency distribution for the resulting
collection of 0s, 1s and 2s would be very similar to the probability distribution shown in Figure
5.1. In fact, if it were possible to repeat the experiment an infinitely large number of times, the
two distributions would be almost identical.
Thus, the probability distribution of Figure 5.1 provides a model for a conceptual population of
values of x: the values of x that would be observed if the experiment were to be repeated an
infinitely large number of times.
Definition 5.4
Let x be a discrete random variable with probability distribution p(x). Then the mean
or expected value of x is

$$\mu = E(x) = \sum_{\text{all } x} x\, p(x)$$
Example 5.6 (continued) Refer to the two-coin tossing experiment and the probability
distribution for the random variable x, shown in Figure 5.1. Demonstrate that the formula for E(x)
gives the mean of this probability distribution.
Solution If we were to repeat the two-coin tossing experiment a large number of times, say
400,000 times, we would expect to observe x = 0 heads approximately 100,000 times, x = 1
head approximately 200,000 times and x = 2 heads approximately 100,000 times. Calculating
the mean of these 400,000 values of x, we obtain

$$\frac{100{,}000(0) + 200{,}000(1) + 100{,}000(2)}{400{,}000} = \frac{1}{4}(0) + \frac{1}{2}(1) + \frac{1}{4}(2) = \sum_{\text{all } x} x\,p(x) = 1$$
If x is a random variable then any function g(x) of x also is a random variable. The expected
value of g(x) is defined as follows:
Definition 5.5
Let x be a discrete random variable with probability distribution p(x) and let g(x) be a
function of x . Then the mean or expected value of g(x) is
$$E[g(x)] = \sum_{\text{all } x} g(x)\, p(x)$$
Definition 5.6
Let x be a discrete random variable with probability distribution p(x). Then the
variance of x is
$$\sigma^2 = E[(x - \mu)^2]$$

The standard deviation of x is the positive square root of the variance of x:

$$\sigma = \sqrt{\sigma^2}$$
Example 5.7 Refer to the two-coin tossing experiment and the probability distribution for x,
shown in Figure 5.1. Find the variance and standard deviation of x.
Solution In Example 5.6 we found that the mean of x is μ = 1. Then

$$\sigma^2 = E[(x - \mu)^2] = \sum_{x=0}^{2}(x - \mu)^2 p(x) = (0-1)^2\tfrac{1}{4} + (1-1)^2\tfrac{1}{2} + (2-1)^2\tfrac{1}{4} = \frac{1}{2}$$

and

$$\sigma = \sqrt{\sigma^2} = \sqrt{1/2} \approx 0.707$$
The binomial probability distribution, its mean and its standard deviation are given by the following
formulas:

1. The probability distribution:

$$p(x) = \binom{n}{x} p^x q^{n-x} \quad (x = 0, 1, \ldots, n),$$

where $\binom{n}{x} = \frac{n!}{x!(n-x)!}$ = number of combinations of x from n, p = probability of
success on a single trial and q = 1 − p.

2. The mean: $\mu = np$

3. The variance: $\sigma^2 = npq$
Example 5.11 (see also Example 5.9) Test for impurities commonly found in drinking water from
private wells showed that 30% of all wells in a particular country have impurity A. If a random
sample of 5 wells is selected from the large number of wells in the country, what is the
probability that:
a) Exactly 3 will have impurity A?
b) At least 3?
c) Fewer than 3?
Solution First we confirm that this experiment possesses the characteristics of a binomial
experiment. This experiment consists of n = 5 trials, one corresponding to each randomly
selected well. Each trial results in an S (the well contains impurity A) or an F (the well does not
contain impurity A). Since the total number of wells in the country is large, the probability of
drawing a single well and finding that it contains impurity A is equal to 0.30 and this probability
will remain the same for each of the 5 selected wells. Further, since the sampling is random, we
assume that the outcome on any one well is unaffected by the outcome of any other and that
the trials are independent. Finally, we are interested in the number x of wells in the sample of n
= 5 that contain impurity A. Therefore, the sampling process represents a binomial experiment
with n = 5 and p = 0.30.
a) The probability of drawing exactly x = 3 wells containing impurity A is

$$p(3) = \frac{5!}{3!\,2!}(0.30)^3(1-0.30)^{5-3} = 0.1323.$$

b) P(x ≥ 3) = p(3) + p(4) + p(5). We have calculated p(3) = 0.1323 and we leave it to the reader to
verify that p(4) = 0.02835 and p(5) = 0.00243. As a result, P(x ≥ 3) = 0.1323 + 0.02835 + 0.00243 =
0.16308.
c) Although P(x < 3) = p(0) + p(1) + p(2), we can avoid calculating 3 probabilities by using the
complementary relationship: P(x < 3) = 1 − P(x ≥ 3) = 1 − 0.16308 = 0.83692.
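The three answers of Example 5.11 can be checked with a few lines of Python:

```python
from math import comb

def binomial_pmf(x, n, p):
    """p(x) = C(n, x) p^x (1 - p)^(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

p3 = binomial_pmf(3, 5, 0.30)                                     # a) 0.1323
p_at_least_3 = sum(binomial_pmf(x, 5, 0.30) for x in (3, 4, 5))   # b) 0.16308
print(p3, p_at_least_3, 1 - p_at_least_3)                         # c) 0.83692
```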
The probability distribution, mean and variance for a Poisson random variable x:

1. The probability distribution:

$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!} \quad (x = 0, 1, 2, \ldots),$$

where λ = mean number of events during the given unit of time.

2. The mean: $\mu = \lambda$

3. The variance: $\sigma^2 = \lambda$
Note that instead of time, the Poisson random variable may be considered in the experiment of
counting the number x of times a particular event occurs during a given unit of area, volume,
etc.
Example 5.12 Suppose that we are investigating the safety of a dangerous intersection. Past
police records indicate a mean of 5 accidents per month at this intersection. Suppose the
number of accidents is distributed according to a Poisson distribution. Calculate the probability
in any month of exactly 0, 1, 2, 3 or 4 accidents.
Solution Since the number of accidents is distributed according to a Poisson distribution and the
mean number of accidents per month is 5, the probability of x accidents happening in any month is

$$p(x) = \frac{5^x e^{-5}}{x!}.$$

By this formula we can calculate
p(0) = 0.00674, p(1) = 0.03369, p(2) = 0.08425, p(3) = 0.14042, p(4) = 0.17552.
The probability distribution of the number of accidents per month is presented in Table 5.3 and
Figure 5.2.
Table 5.3 Poisson probability distribution of the number of accidents per month

X - NUMBER OF ACCIDENTS   P(X) - PROBABILITY
 0                        0.006738
 1                        0.033690
 2                        0.084224
 3                        0.140374
 4                        0.175467
 5                        0.175467
 6                        0.146223
 7                        0.104445
 8                        0.065278
 9                        0.036266
10                        0.018133
11                        0.008242
12                        0.003434
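Table 5.3 can be reproduced with a few lines of Python:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """p(x) = lam^x e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

for x in range(13):
    print(x, round(poisson_pmf(x, 5), 6))   # matches Table 5.3
```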
Many random variables observed in practice can take on any value within an interval: for example,
the daily rainfall at some location, the strength of a steel bar and the intensity of sunlight at a
particular time of day. In Section 5.1 these random variables were called continuous random variables.
The distinction between discrete random variables and continuous random variables is usually
based on the difference in their cumulative distribution functions.
Definition 5.7
Let ξ be a continuous random variable assuming any value in the interval (−∞, +∞).
Then the cumulative distribution function F(x) of the variable ξ is defined as follows:

F(x) = P(ξ ≤ x),

i.e., F(x) is equal to the probability that the variable ξ assumes values which are
less than or equal to x.
Note that here and from now on we denote by the letter ξ a continuous random variable and
by x a point on the number line.
From the definition of the cumulative distribution function F(x) it is easy to show the following
properties.
In Chapter 2 we described a large data set by means of a relative frequency distribution. If the
data represent measurements on a continuous random variable and if the amount of data is
very large, we can reduce the width of the class intervals until the distribution appears to be a
smooth curve. A probability density is a theoretical model for this distribution.
Definition 5.8
If F(x) is the cumulative distribution function for a continuous random variable ξ, then
the probability density function f(x) for ξ is

f(x) = F′(x),

i.e., f(x) is the derivative of the distribution function F(x).
The density function for a continuous random variable ξ, the model for some real-life population
of data, will usually be a smooth curve as shown in Figure 5.3. It satisfies

$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$

Thus, the cumulative area under the curve between −∞ and a point x0 is equal to F(x0).
The density function for a continuous random variable must always satisfy the two properties
given in the box.
1. f(x) ≥ 0

2. $\int_{-\infty}^{+\infty} f(x)\,dx = F(+\infty) = 1$

The mean (expected value) of a continuous random variable ξ is

$$E(\xi) = \int_{-\infty}^{+\infty} x f(x)\,dx$$
Definition 5.9
Let ξ be a continuous random variable with density function f(x) and let g(x) be a
function of x. Then the mean or the expected value of g(ξ) is

$$E[g(\xi)] = \int_{-\infty}^{+\infty} g(x) f(x)\,dx$$

Definition 5.10
Let ξ be a continuous random variable with the expected value E(ξ) = μ. Then the
variance of ξ is

$$\sigma^2 = E[(\xi - \mu)^2]$$

The standard deviation of ξ is the positive square root of the variance: $\sigma = \sqrt{\sigma^2}$
The density function, mean and variance for a normal random variable

The density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$$

The parameters μ and σ² are the mean and the variance, respectively, of the normal
random variable.
There are an infinite number of normal density functions, one for each combination of μ and σ. The
mean measures the location of the distribution and the variance measures its spread. Several
different normal density functions are shown in Figure 5.4.
Figure 5.4 Several normal density functions (Curves 1, 2 and 3)
If μ = 0 and σ = 1 then
f(x) = (1 / √(2π)) e^{−x² / 2}
This is called the standardized normal distribution. The graph of the standardized normal density
distribution is shown in Figure 5.5.
Figure 5.5 The standardized normal density distribution
To standardize a normal random variable X with mean μ and standard deviation σ we use the
transformation
z = (x − μ) / σ
The cumulative distribution function of the standardized normal variable is
Φ(x) = (1 / √(2π)) ∫_{−∞}^{x} e^{−t² / 2} dt
P(μ − σ ≤ X ≤ μ + σ) = 0.6826
P(μ − 2σ ≤ X ≤ μ + 2σ) = 0.9544
P(μ − 3σ ≤ X ≤ μ + 3σ) = 0.9973
These equalities are known as the σ, 2σ and 3σ rules, respectively, and are often used in statistics.
Namely, if a population of measurements has approximately a normal distribution, the probability
that a randomly selected observation falls within the intervals (μ − σ, μ + σ), (μ − 2σ, μ + 2σ), and
(μ − 3σ, μ + 3σ) is approximately 0.6826, 0.9544 and 0.9973, respectively.
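If a statistical package is at hand, these three probabilities can be checked directly from the standardized normal cumulative distribution function. The sketch below uses Python's scipy (an assumption; any package with a normal CDF will do):

    from scipy.stats import norm

    # P(mu - k*sigma <= X <= mu + k*sigma) does not depend on mu or sigma,
    # so we evaluate it on the standardized normal distribution
    for k in (1, 2, 3):
        print(k, round(norm.cdf(k) - norm.cdf(-k), 4))

The printed values 0.6827, 0.9545 and 0.9973 agree with the σ, 2σ and 3σ rules up to rounding.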
Although the normal distribution is continuous, it is interesting to note that it can sometimes be
used to approximate discrete distributions. Namely, we can use the normal distribution to
approximate the binomial probability distribution.
Suppose we have a binomial distribution defined by two parameters: the number of trials n and
the probability of success p. The normal distribution with mean μ = np and standard deviation
σ = √(np(1 − p)) will be a good approximation for that binomial distribution if both np and
n(1 − p) are sufficiently large (a common rule of thumb requires both to be at least 5).
Figure 5.6 Approximation of binomial distribution (bar graph) with n=10, p=0.5
by a normal distribution (smoothed curve)
Table 5.4 Binomial probabilities for n = 10, p = 0.5 and their normal approximations

X      BINOMIAL DISTRIBUTION    NORMAL DISTRIBUTION
0      0.000977                 0.001700
1      0.009766                 0.010285
2      0.043945                 0.041707
3      0.117188                 0.113372
4      0.205078                 0.206577
5      0.246094                 0.252313
6      0.205078                 0.206577
7      0.117188                 0.113372
8      0.043945                 0.041707
9      0.009766                 0.010285
10     0.000977                 0.001700
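The normal column of Table 5.4 appears to be the normal density with μ = np = 5 and σ = √(npq) ≈ 1.58 evaluated at each integer point. A small Python sketch (assuming scipy) reproduces both columns:

    import math
    from scipy.stats import binom, norm

    n, p = 10, 0.5
    mu = n * p                           # mean of the binomial: 5
    sigma = math.sqrt(n * p * (1 - p))   # standard deviation: about 1.58

    for x in range(n + 1):
        exact = binom.pmf(x, n, p)       # binomial probability
        approx = norm.pdf(x, mu, sigma)  # normal density at the same point
        print(x, round(exact, 6), round(approx, 6))

For example, at x = 5 the code prints 0.246094 and 0.252313, the two central entries of Table 5.4.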
5.9. Summary
This chapter introduced the notion of a random variable, one of the fundamental concepts of
probability theory. It is a rule that assigns one and only one value of a variable x to each
simple event in the sample space. A variable is said to be discrete if it can assume only a
countable number of values.
The probability distribution of a discrete random variable is a table, graph or formula that gives
the probability associated with each value of x. The expected value E(x) = μ is the mean of
this probability distribution and E[(x − μ)²] = σ² is its variance.
Two discrete random variables, the binomial and the Poisson, were presented, along with
their probability distributions.
In contrast to discrete random variables, a continuous random variable can assume values
corresponding to the infinitely large number of points contained in one or more intervals on the
real line. The relative frequency distribution for a population of data associated with a continuous
random variable can be modeled using a probability density function. The expected value (or
mean) of a continuous random variable x is defined in the same manner as for discrete random
variables, except that integration is substituted for summation. The most important probability
distribution, the normal distribution, was considered along with its properties.
5.10 Exercises
1) The continuous random variable X is called a uniform random variable if its density function
is
f(x) = 1/(b − a)   if a ≤ x ≤ b
f(x) = 0           elsewhere
Show that for this variable, the mean μ = (a + b)/2 and the variance σ² = (b − a)²/12.
2) The continuous random variable X is called an exponential random variable if its density
function is
f(x) = (1/β) e^{−x/β}   (0 ≤ x < ∞)
Show that for this random variable
μ = β,  σ² = β².
3) Find the area beneath a standardized normal curve between the mean z = 0 and the point
z = −1.26.
4) Find the probability that a normally distributed random variable lies more than z = 2 standard
deviations above its mean.
5) Suppose y is a normally distributed random variable with mean 10 and standard deviation 2.1.
a) Find P(y ≥ 11).
b) Find P(7.6 ≤ y ≤ 12.2).
Chapter 6.
Sampling Distributions
CONTENTS
in the first sample (Figure 6.2a) had no child. This value from the first sample compares
favorably with the 7% of "no children" for the entire population (Figure 6.1).
Table 6.1 Frequency distribution of number of children ever born for 4,171 women

Number of Children    Frequency    Relative Frequency
0                     312          0.07
1                     708          0.17
2                     881          0.21
3                     737          0.18
4                     570          0.14
5                     354          0.08
6                     243          0.06
7                     172          0.04
>7                    194          0.05
Total                 4,171        1.00

Figure 6.1 Relative frequency distribution of number of children ever born for 4,171 women

Table 6.2 Frequency distributions of number of children ever born for each of two samples of
50 women selected from 4,171 women

a. First sample
Number of Children    Frequency    Relative Frequency
0                     5            0.10
1                     8            0.16
2                     10           0.20
3                     9            0.18
4                     8            0.16
5                     3            0.06
6                     4            0.08
7                     2            0.04
>7                    1            0.02
Total                 50           1.00

b. Second sample
Number of Children    Frequency    Relative Frequency
0                     0            0.00
1                     8            0.16
2                     8            0.16
3                     13           0.26
4                     9            0.18
5                     6            0.12
6                     2            0.04
7                     4            0.08
>7                    0            0.00
Total                 50           1.00

Figure 6.2 Relative frequency distributions of number of children ever born for each of two
samples of 50 women selected from 4,171 women
To rephrase the question posed in the example, we could ask: Which of the two samples is
more representative of, or characteristic of, the number of children ever born for all 4,171 of the
VNDHS's women? Clearly, the information provided by the first sample (Table and Figure 6.2a)
gives a better picture of the actual population of numbers of children ever born. Its relative
frequency distribution is closer to that for the entire population (Table and Figure 6.1) than is the
one provided by the second sample (Table and Figure 6.2b). Thus, if we were to rely on
information from the second sample only, we might have a distorted, or biased, impression of the
true situation with respect to numbers of children ever born.
How is it possible that two samples from the same population can provide contradictory
information about the population? The key issue is the method by which the samples are
obtained. The examples in this section demonstrate that great care must be taken in order to
select a sample that will give an unbiased picture of the population about which inferences are
to be made. One way to cope with this problem is to use random sampling. Random sampling
eliminates the possibility of bias in selecting a sample and, in addition, provides a probabilistic
basis for evaluating the reliability of an inference. We will have more to say about random
sampling in Section 6.2.
C_N^n = N! / (n!(N − n)!) = 8! / (3!5!) = (8*7*6*5*4*3*2*1) / ((3*2*1)(5*4*3*2*1)) = 56
A, B, C A, C, F A, E, G B, C, G B, E, H C, E, F D, E, H
A, B, D A, C, G A, E, H B, C, H B, F, G C, E, G D, F, G
A, B, E A, C, H A, F, G B, D, E B, F, H C, E, H D, F, H
A, B, F A, D, E A, F, H B, D, F B, G, H C, F, G D, G, H
A, B, G A, D, F A, G, H B, D, G C, D, E C, F, H E, F, G
A, B, H A, D, G B, C, D B, D, H C, D, F C, G, H E, F, H
A, C, D A, D, H B, C, E B, E, F C, D, G D, E, F E, G, H
A, C, E A, E, F B, C, F B, E, G C, D, H D, E, G F, G, H
b. Each sample must have the same chance of being selected in order to ensure that we have
a random sample. Since there are 56 possible samples of size n = 3, each must have a
probability equal to 1/56 of being selected by the sampling procedure.
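The count of 56 samples can be verified by brute force. The following Python sketch (the labels A-H stand for the eight population elements listed above) enumerates every possible sample of size n = 3:

    from itertools import combinations
    from math import comb

    elements = "ABCDEFGH"                      # the N = 8 population elements
    samples = list(combinations(elements, 3))  # all samples of size n = 3
    print(len(samples), comb(8, 3))            # both print 56

Under random sampling, each of these 56 listed samples is selected with probability 1/56.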
What procedures may one use to generate a random sample? If the population is not too large,
each observation may be recorded on a piece of paper and placed in a suitable container. After
the collection of papers is thoroughly mixed, the researcher can remove n pieces of paper from
the container; the elements named on these n pieces of paper would be the ones included in the
sample.
However, this method has the following drawbacks: it is not feasible when the population
consists of a large number of observations, and, since it is very difficult to achieve a thorough
mixing, the procedure provides only an approximation to a random sample.
A more practical method of generating a random sample, and one that may be used with larger
populations, is to use a table of random numbers. At present, almost all statistical program
packages use this method to select random samples. For example, SPSS PC, a
comprehensive system for analyzing data, provides a procedure to select a random sample
based on an approximate percentage or an exact number of observations. The two samples in
Example 6.1 were drawn by SPSS's "Select cases" procedure from the data on fertility of
4,171 women recorded in Appendix A.
For the first sample, the mean is
x̄ = Σxf/n = (0*5 + 1*8 + 2*10 + 3*9 + 4*8 + 5*3 + 6*4 + 7*2 + 8*1)/50 = 2.96
For the second sample, the mean is
x̄ = Σxf/n = (0*0 + 1*8 + 2*8 + 3*13 + 4*9 + 5*6 + 6*2 + 7*4 + 8*0)/50 = 3.38
whereas the mean for all 4,171 observations is 3.15. In the next section, we discuss how to judge
the performance of a statistic computed from a random sample.
Definition 6.2
A numerical descriptive measure of a population is called a parameter.
Definition 6.3
A quantity computed from the observations in a random sample is called a statistic.
You may have observed that the value of a population parameter (for example, the mean μ) is a
constant (although it is usually unknown to us); its value does not vary from sample to sample.
However, the value of a sample statistic (for example, the sample mean x̄) is highly dependent
on the particular sample that is selected. As seen in the previous section, the means of two
samples of the same size n = 50 are different.
Since statistics vary from sample to sample, any inferences based on them will necessarily be
subject to some uncertainty. How, then, do we judge the reliability of a sample statistic as a tool
in making an inference about the corresponding population parameter? Fortunately, the
uncertainty of a statistic generally has characteristic properties that are known to us, and that
are reflected in its sampling distribution. Knowledge of the sampling distribution of a particular
statistic provides us with information about its performance over the long run.
Definition 6.4
A sampling distribution of a sample statistic (based on n observations) is the relative
frequency distribution of the values of the statistic theoretically generated by taking
repeated random samples of size n and computing the value of the statistic for each
sample. (See Figure 6.3.)
We will illustrate the notion of a sampling distribution with an example, in which our interest
focuses on the numbers of children ever born of 4,171 women in VNDHS 1988. The data are
given in Appendix A. In particular, we wish to estimate the mean number of children ever born to
all women. In this case, the 4,171 observations constitute the entire population and we know
that the true value of μ, the mean of the population, is 3.15 children.
Example 6.3 How could we generate the sampling distribution of x̄, the mean of a
random sample of n = 5 observations from the population of 4,171 numbers of children ever
born in Appendix A?
Solution The sampling distribution for the statistic x̄, based on a random sample of n =
5 measurements, would be generated in this manner: Select a random sample of five
measurements from the population of 4,171 observations on number of children ever
born in Appendix A; compute and record the value of x̄ for this sample. Then return
these five measurements to the population and repeat the procedure (see Figure 6.3). If
this sampling procedure could be repeated an infinite number of times, the infinite
number of values of x̄ obtained could be summarized in a relative frequency
distribution, called the sampling distribution of x̄.
The task described in Example 6.3, which may seem impractical if not impossible, is not
performed in actual practice. Instead, the sampling distribution of a statistic is obtained by
applying mathematical theory or computer simulation, as illustrated in the next example.
Figure 6.3 Generating the theoretical sampling distribution of the sample mean x
Example 6.4 Use computer simulation to find the approximate sampling distribution of
x̄, the mean of a random sample of n = 5 observations from the population of 4,171
numbers of children ever born in Appendix A.
Solution We used a statistical program, for example SPSS, to obtain 100 random
samples of size n = 5 from the target population. The first ten of these samples are presented
in Table 6.3.
Table 6.3 The first ten of the 100 random samples of size n = 5, with their sample means x̄
(among them 1.4, 2.4, 3.4, 1.6, 3.6, 3.8, 3.2, 2.8, 4.2 and 1.8)
For each sample of five observations, the sample mean x̄ was computed. The relative
frequency distribution of the number of children ever born for the entire population of 4,171
women is plotted in Figure 6.4 and the 100 values of x̄ are summarized in the relative
frequency distribution shown in Figure 6.5.
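Readers without SPSS can imitate this simulation in a few lines of Python. The sketch below rebuilds the population from the frequencies in Table 6.1 (with ">7" coded as 8, the same coding used in the sample-mean computations above) and draws 100 samples of size n = 5; it is an illustration under these assumptions, not the procedure that produced Figure 6.5:

    import numpy as np

    rng = np.random.default_rng(0)

    # Reconstruct the 4,171 "children ever born" values from Table 6.1
    counts = [312, 708, 881, 737, 570, 354, 243, 172, 194]
    population = np.repeat(np.arange(9), counts)   # values 0..8

    # 100 random samples of size n = 5, each drawn without replacement
    means = [rng.choice(population, size=5, replace=False).mean()
             for _ in range(100)]
    print(round(np.mean(means), 2), round(np.std(means, ddof=1), 3))

The mean of the 100 sample means should land near μ = 3.15, with a standard deviation near σ/√5.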
Figure 6.4 Relative frequency distribution for 4,171 numbers of children ever born
We can see that the values of x̄ in Figure 6.5 tend to cluster around the population mean, μ =
3.15 children. Also, the values of the sample mean are less spread out (that is, they have less
variation) than the population values shown in Figure 6.4. These two observations are borne out
by comparing the means and standard deviations of the two sets of observations, as shown in
Table 6.4.
Figure 6.5 Relative frequency distribution of 100 values of x̄ based on samples of size n = 5
Table 6.4 Comparison of the population distribution and the approximate sampling distribution
of x̄ based on 100 samples of size n = 5

                                      Mean        Standard Deviation
Population of 4,171 observations      μ = 3.15    σ = 2.229
100 values of x̄ (n = 5)              3.11        .920
Example 6.5 Refer to Example 6.4. Simulate the sampling distribution of x̄ for samples
of size n = 25 from the population of 4,171 observations of number of children ever born.
Compare the result with the sampling distribution of x̄ based on samples of
n = 5, obtained in Example 6.4.
Solution We obtained 100 computer-generated random samples of size n = 25 from the
target population. A relative frequency distribution for the 100 corresponding values of x̄ is
shown in Figure 6.6.
It can be seen that, as with the sampling distribution based on samples of size n = 5, the values
of x̄ tend to center about the population mean. However, a visual inspection shows that the
variation of the x̄-values about their mean in Figure 6.6 is less than the variation in the values
of x̄ based on samples of size n = 5 (Figure 6.5). The mean and standard deviation for these
100 values of x̄ are shown in Table 6.5 for comparison with the previous results.
Table 6.5 Comparison of the population distribution and the approximate sampling distributions
of x̄ based on 100 samples of size n = 5 and n = 25

                                      Mean        Standard Deviation
Population of 4,171 observations      μ = 3.15    σ = 2.229
100 values of x̄ (n = 5)              3.11        .920
100 values of x̄ (n = 25)             3.14        .492
Figure 6.6 Relative frequency distribution of 100 values of x̄ based on samples of size n = 25
From Table 6.5 we observe that, as the sample size increases, there is less variation in the
sampling distribution of x̄; that is, the values of x̄ tend to cluster more closely about the
population mean as n gets larger. This intuitively appealing result will be stated formally in the
next section.
1. The sampling distribution of x̄ has a mean equal to the mean of the population
from which the sample was selected. That is, if we let μ_x̄ denote the mean of the
sampling distribution of x̄, then
μ_x̄ = μ
2. The sampling distribution of x̄ has a standard deviation equal to the standard
deviation of the population from which the sample was selected, divided by the
square root of the sample size. That is, if we let σ_x̄ denote the standard deviation
of the sampling distribution of x̄, then
σ_x̄ = σ/√n    (*)
Example 6.6 Show that the empirical evidence obtained in Examples 6.4 and 6.5
supports the Central Limit Theorem and the two properties of the sampling distribution of x̄.
Recall that in Examples 6.4 and 6.5, we obtained repeated random samples of size n = 5
and n = 25 from the population of numbers of children ever born in Appendix A. For this
target population, we know the values of the parameters μ and σ:
Population mean: μ = 3.15 children
Population standard deviation: σ = 2.229 children
Solution In Figures 6.5 and 6.6, we note that the values of x̄ tend to cluster about the
population mean, μ = 3.15. This is guaranteed by property 1, which implies that, in the long
run, the average of all values of x̄ that would be generated in infinite repeated sampling would
be equal to μ.
We also observed, from Table 6.5, that the standard deviation of the sampling distribution of x̄
decreases as the sample size increases from n = 5 to n = 25. Property 2 quantifies the decrease
and relates it to the sample size. As an example, note that, for our approximate sampling
distribution based on samples of size n = 5, we obtained a standard deviation of .920, whereas
property 2 tells us that, for the actual sampling distribution of x̄, the standard deviation is equal
to
σ_x̄ = σ/√n = 2.229/√5 = .997
Similarly, for samples of size n = 25, the sampling distribution of x̄ actually has a standard
deviation of
σ_x̄ = σ/√n = 2.229/√25 = .446
Solution
a. Although we have no information about the shape of the relative frequency distribution of the
heights of the children, we can apply the Central Limit Theorem to conclude that the
sampling distribution of the sample mean height of the 100 three-year-old children is
approximately normally distributed. In addition, the mean μ_x̄ and the standard deviation σ_x̄
of the sampling distribution are given by
μ_x̄ = μ = 89.67 cm  and  σ_x̄ = σ/√n = 4.99/√100 = .499 cm
To find P(x̄ ≥ 91), we compute
z = (x̄ − μ_x̄)/σ_x̄ = (91 − 89.67)/.499 = 2.67
Thus, P(x̄ ≥ 91) = P(z ≥ 2.67), and this probability (area) may be found in Table 1 of Appendix
C:
P(x̄ ≥ 91) = P(z ≥ 2.67) = .5 − A (see Figure 6.7) = .5 − .4962 = .0038
Figure 6.7 The area A between 89.67 (z = 0) and 91.00 (z = 2.67) under the sampling
distribution of x̄
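The same probability can be computed without the normal table; a minimal Python sketch, assuming scipy:

    import math
    from scipy.stats import norm

    mu, sigma, n = 89.67, 4.99, 100
    sigma_xbar = sigma / math.sqrt(n)             # .499 cm
    z = (91 - mu) / sigma_xbar                    # about 2.67
    print(round(z, 2), round(1 - norm.cdf(z), 4)) # P(xbar >= 91) = .0038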
These properties of the sampling distribution of x̄ assure us that the sample mean x̄ is a
reasonable statistic to use in making inferences about the population mean μ, and they allow us
to compute a measure of the reliability of inferences made about μ. As we noticed earlier, we
will not be required to obtain sampling distributions by simulation or by mathematical arguments.
Rather, for all the statistics to be used in this course, the sampling distribution and its properties
will be presented as the need arises.
6.5 Summary
The objective of most statistical investigations is to make an inference about a population
parameter. Since we often base inferences upon information contained in a sample from the
target population, it is essential that the sample be properly selected. A procedure for obtaining
a random sample using statistical software (SPSS) was described in this chapter.
After the sample has been selected, we compute a statistic that contains information about the
target parameter. The sampling distribution of the statistic characterizes the relative frequency
distribution of values of the statistic over an infinitely large number of samples.
The Central Limit Theorem provides information about the sampling distribution of the sample
mean, x . In particular, if you have used random sampling, the sampling distribution of x will be
approximately normal if the sample size is sufficiently large.
6.6 Exercises
6.1
6.2
Repeat parts a, b, and c of Exercise 6.1, using random samples of size n = 10. Compare
the relative frequency distribution with that of Exercise 6.1. Do the values of x̄ generated from
samples of size n = 10 tend to cluster more closely about μ?
6.3
f. n = 500
6.4
6.5
b. P(x̄ < 73.6)
c. P(69.1 < x̄ < 74.0)
d. P(x̄ < 65.5)
This past year, an elementary school began using a new method to teach arithmetic to first
graders. A standardized test, administered at the end of the year, was used to measure
the effectiveness of the new method. The relative frequency distribution of the test scores
in past years had a mean of 75 and a standard deviation of 10. Consider the standardized
test scores for a random sample of 36 first graders taught by the new method.
a. If the relative frequency distribution of test scores for first graders taught by the new
method is no different from that of the old method, describe the sampling distribution of
x̄, the mean test score for a random sample of 36 first graders.
b. If the sample mean test score was computed to be x̄ = 79, what would you conclude
about the effectiveness of the new method of teaching arithmetic? (Hint: Calculate
P(x̄ ≥ 79) using the sampling distribution described in part a.)
Chapter 7 Estimation
CONTENTS
7.1 Introduction
7.2 Estimation of a population mean: Large-sample case
7.3 Estimation of a population mean: small sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population means: Independent samples
7.6 Estimation of the difference between two population means: Matched pairs
7.7 Estimation of the difference between two population proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary
7.11 Exercises
7.1 Introduction
In preceding chapters we learned that populations are characterized by numerical descriptive
measures (parameters), and that inferences about parameter values are based on statistics
computed from the information in a sample selected from the population of interest. In this
chapter, we will demonstrate how to estimate population means, proportions, or variances, and
how to estimate the difference between two population means or proportions. We will also be
able to assess the reliability of our estimates, based on knowledge of the sampling distributions
of the statistics being used.
Example 7.1 Suppose we are interested in estimating the average number of children
ever born to all 4,171 women in the VNDHS 1998 in Appendix A. Although we already
know the value of the population mean, this example will be continued to illustrate the
concepts involved in estimation. How could one estimate the parameter of interest in this
situation?
Solution An intuitively appealing estimate of a population mean, μ, is the sample mean,
x̄, computed from a random sample of n observations from the target population.
Assume, for example, that we obtain a random sample of size n = 30 from numbers of
children ever born in Appendix A, and then compute the value of the sample mean to be
x̄ = 3.05 children. This value of x̄ provides a point estimate of the population mean.
Definition 7.1
A point estimate of a parameter is a statistic, a single value computed from the
observations in a sample that is used to estimate the value of the target parameter.
How reliable is a point estimate for a parameter? In order to be truly practical and meaningful,
an inference concerning a parameter must consist of more than just a point estimate; that is, we
need to be able to state how close our estimate is likely to be to the true value of the population
parameter. This can be done by using the characteristics of the sampling distribution of the
statistic that was used to obtain the point estimate; the procedure will be illustrated in the next
section.
x̄ ± 1.96σ_x̄ = x̄ ± 1.96 σ/√n
where σ is the population standard deviation of the 4,171 numbers of children ever born and
n is the sample size.
Figure 7.1 The sampling distribution of x̄: an area of .95 lies within 1.96σ_x̄ of μ
(The value 1.96, shown in Figure 7.1, is obtained from Table 1 of Appendix C.) This implies
that, before the sample of measurements is drawn, the probability that x̄ will fall within
the interval μ ± 1.96σ_x̄ is .95.
Step 2
If in fact the sample yields a value of x̄ that falls within the interval
μ ± 1.96σ_x̄, then it is true that the interval x̄ ± 1.96σ_x̄ will contain μ, as demonstrated in
Figure 7.2. For a particular value of x̄ that falls within the interval μ ± 1.96σ_x̄, a
distance of 1.96σ_x̄ is marked off both to the left and to the right of x̄. You
can see that the value of μ must fall within x̄ ± 1.96σ_x̄.
Step 3
Steps 1 and 2 combined imply that, before the sample is drawn, the
probability that the interval x̄ ± 1.96σ_x̄ will enclose μ is approximately .95.
the population mean μ. The term large-sample refers to the sample being of a sufficiently large
size that we can apply the Central Limit Theorem to determine the form of the sampling
distribution of x̄.
Definition 7.2
A confidence interval for a parameter is an interval of numbers within which we
expect the true value of the population parameter to be contained. The endpoints of
the interval are computed based on sample information.
x̄ ± 1.96σ_x̄ = x̄ ± 1.96 σ/√30
In most practical applications, the value of the population standard deviation σ will be unknown.
However, for larger samples (n ≥ 30), the sample standard deviation s provides a good
approximation to σ, and may be used in the formula for the confidence interval. For this
example, we obtain
x̄ ± 1.96 s/√n = 88.62 ± 1.96 (4.09/√30) = 88.62 ± 1.46
or (87.16, 90.08). Hence, we estimate that the population mean height falls within the interval
from 87.16 cm to 90.08 cm.
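The interval arithmetic is simple enough to reproduce in any language; here is a minimal Python sketch with the numbers of this example:

    import math

    xbar, s, n = 88.62, 4.09, 30
    half_width = 1.96 * s / math.sqrt(n)   # about 1.46
    print(round(xbar - half_width, 2), round(xbar + half_width, 2))
    # prints 87.16 90.08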
How much confidence do we have that μ, the true population mean height, lies within the
interval (87.16, 90.08)? Although we cannot be certain whether the sample interval contains μ
(unless we calculate the true value of μ for all 823 observations in Appendix B), we can be
reasonably sure that it does. This confidence is based on the interpretation of the confidence
interval procedure: If we were to select repeated random samples of size n = 30 heights, and
form a 1.96 standard deviation interval around x̄ for each sample, then approximately 95% of
the intervals constructed in this manner would contain μ. Thus, we are 95% confident that the
particular interval (87.16, 90.08) contains μ, and this is our measure of the reliability of the
point estimate x̄.
Example 7.4
To illustrate the classical interpretation of a confidence interval, we
generated 40 random samples, each of size n = 30, from the population of heights in
Appendix B. For each sample, the sample mean and standard deviation are presented in
Table 7.1. We then constructed the 95% confidence interval for , using the information
from each sample. Interpret the results, which are shown in Table 7.2.
Table 7.1 Means and standard deviations for 40 random samples of 30 heights from Appendix B

Sample   Mean    Standard Deviation    Sample   Mean    Standard Deviation
1        89.53   6.39                  21       91.17   5.67
2        90.70   4.64                  22       89.47   6.68
3        89.02   5.08                  23       88.86   4.63
4        90.45   4.69                  24       88.70   5.02
5        89.96   4.85                  25       90.13   5.07
6        89.96   5.53                  26       91.10   5.27
7        89.81   5.60                  27       89.27   4.91
8        90.12   6.70                  28       88.85   4.77
9        89.45   3.46                  29       89.34   5.68
10       89.00   4.61                  30       89.07   4.85
11       89.95   4.48                  31       91.17   5.30
12       90.18   6.34                  32       90.33   5.60
13       89.15   5.98                  33       89.31   5.82
14       90.11   5.86                  34       91.05   4.96
15       90.40   4.50                  35       88.30   5.48
16       90.04   5.26                  36       90.13   6.74
17       88.88   4.29                  37       90.33   4.77
18       90.98   4.56                  38       86.82   4.82
19       88.44   3.64                  39       89.63   6.37
20       89.44   5.05                  40       88.00   4.51

Table 7.2 95% confidence intervals for μ for 40 random samples of 30 heights from Appendix B
(μ = 89.67 cm)

Sample   LCL     UCL                   Sample   LCL     UCL
1        87.24   91.81                 21       89.14   93.20
2        89.04   92.36                 22       87.07   91.86
3        87.20   90.84                 23       87.20   90.52
4        88.77   92.13                 24       86.90   90.50
5        88.23   91.69                 25       88.31   91.95
6        87.99   91.94                 26       89.22   92.99
7        87.81   91.82                 27       87.51   91.02
8        87.72   92.51                 28       87.14   90.56
9        88.21   90.69                 29       87.31   91.37
10       87.35   90.65                 30       87.33   90.80
11       88.35   91.56                 31       89.27   93.07
12       87.91   92.45                 32       88.33   92.33
13       87.01   91.29                 33       87.23   91.39
14       88.01   92.21                 34       89.27   92.83
15       88.79   92.01                 35       86.34   90.26
16       88.16   91.92                 36       87.71   92.54
17       87.35   90.41                 37       88.62   92.04
18       89.35   92.61                 38       85.10   88.55
19       87.14   89.75                 39       87.35   91.91
20       87.63   91.25                 40       86.39   89.62
Solution For the target population of 823 heights, we have obtained the population mean
value μ = 89.67 cm. In the 40 repetitions of the confidence interval procedure described
above, note that only two of the intervals (those based on samples 38 and 40) do not contain
the value of μ, whereas the remaining 38 intervals (or 95% of the 40 intervals) do contain the
true value of μ.
Note that, in actual practice, you would not know the true value of μ and you would not perform
this repeated sampling; rather, you would select a single random sample and construct the
associated 95% confidence interval. The one confidence interval you form may or may not contain
μ, but you can be fairly sure it does because of your confidence in the statistical procedure, the
basis for which was illustrated in this example.
Suppose you want to construct an interval that you believe will contain μ with some degree of
confidence other than 95%; in other words, you want to choose a confidence coefficient other
than .95.
Definition 7.3
The confidence coefficient is the proportion of times that a confidence interval
encloses the true value of the population parameter if the confidence interval
procedure is used repeatedly a very large number of times.
The first step in constructing a confidence interval with any desired confidence coefficient is to
notice from Figure 7.1 that, for a 95% confidence interval, the confidence coefficient of .95 is
equal to the total area under the sampling distribution (1.00), less .05 of the area, which is
divided equally between the two tails of the distribution. Thus, each tail has an area of .025.
Second, consider that the tabulated value of z (Table 1 of Appendix C) that cuts off an area of
.025 in the right tail of the standard normal distribution is 1.96 (see Figure 7.3). The value z =
1.96 is also the distance, in terms of standard deviations, that x̄ is from each endpoint of the
95% confidence interval. By assigning a confidence coefficient other than .95 to a confidence
interval, we change the area under the sampling distribution between the endpoints of the
interval, which in turn changes the tail area associated with z. Thus, this z-value provides the
key to constructing a confidence interval with any desired confidence coefficient.
Definition 7.4
We define z / 2 to be the z-value such that an area of / 2 lies to its right (see Figure
7.4).
Figure 7.4 Locating z_{α/2} on the standard normal curve: areas of α/2 lie to the right of
z_{α/2} and to the left of −z_{α/2}
The large-sample confidence interval for μ with confidence coefficient (1 − α) is then
x̄ ± z_{α/2}σ_x̄
Example 7.5 In statistical problems using confidence interval techniques, a very common
confidence coefficient is .90. Determine the value of z_{α/2} that would be used in
constructing a 90% confidence interval for a population mean based on a large sample.
Solution For a confidence coefficient of .90, we have
1 − α = .90
α = .10
α/2 = .05
and we need to obtain the value z_{α/2} = z.05 that locates an area of .05 in the upper tail of the
standard normal distribution. Since the total area to the right of 0 is .50,
z.05 is the value such that the area between 0 and z.05 is .50 − .05 = .45. From Table 1 of
Appendix C, we find that z.05 = 1.645 (see Figure 7.5). We conclude that a large-sample 90%
confidence interval for a population mean is given by
x̄ ± 1.645σ_x̄
In Table 7.3, we present the values of z_{α/2} for the most commonly used confidence coefficients.
Table 7.3 Commonly used confidence coefficients and their z-values

Confidence Coefficient (1 − α)    α/2     z_{α/2}
.90                               .050    1.645
.95                               .025    1.960
.98                               .010    2.330
.99                               .005    2.58

Figure 7.5 The z-value z.05 = 1.645 locates an area of .05 in the upper tail of the standard
normal distribution
Large-sample (1 − α)100% confidence interval for a population mean, μ
x̄ ± z_{α/2}σ_x̄ = x̄ ± z_{α/2} σ/√n
where z_{α/2} is the z-value that locates an area of α/2 to its right, σ is the standard
deviation of the population from which the sample was selected, n is the sample size,
and x̄ is the value of the sample mean.
Assumption: n ≥ 30
[When the value of σ is unknown, the sample standard deviation s may be used to
approximate σ in the formula for the confidence interval. The approximation is generally quite
satisfactory when n ≥ 30.]
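The z-values of Table 7.3 are percentiles of the standard normal distribution and can be recovered from any statistical package. A sketch using Python's scipy (an assumption, not a tool used in the text):

    from scipy.stats import norm

    for conf in (.90, .95, .98, .99):
        alpha = 1 - conf
        print(conf, round(norm.ppf(1 - alpha / 2), 3))
    # prints 1.645, 1.960, 2.326, 2.576; Table 7.3 rounds the
    # last two values to 2.330 and 2.58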
Example 7.6 Suppose that in the previous year all graduates at a certain university
reported the number of hours spent on their studies during a certain week; the average
was 40 hours and the standard deviation was 10 hours. Suppose we want to investigate
the problem whether students now are studying more than they used to. This year a
random sample of n = 50 students is selected. Each student in the sample was
interviewed about the number of hours spent on his/her study. This experiment produced
the following statistics:
x̄ = 41.5 hours
s = 9.2 hours
Estimate μ, the mean number of hours spent on study, using a 99% confidence interval.
Interpret the interval in terms of the problem.
Solution The general form of a large-sample 99% confidence interval for μ is
x̄ ± 2.58σ_x̄ ≈ x̄ ± 2.58 s/√n = 41.5 ± 2.58 (9.2/√50) = 41.5 ± 3.36
or (38.14, 44.86).
We can be 99% confident that the interval (38.14, 44.86) encloses the true mean weekly time
spent on study this year. Since all the values in the interval fall above 38 hours and below 45
hours, we conclude that students now tend to spend more than 6 hours and less
than 7.5 hours per day on average on their studies (supposing that they don't study on Sunday).
Example 7.7 Refer to Example 7.6.
a. Using the sample information in Example 7.6, construct a 95% confidence interval for mean
weekly time spent on study of all students in the university this year.
b. For a fixed sample size, how is the width of the confidence interval related to the confidence
coefficient?
Solution
a. The form of a large-sample 95% confidence interval for a population mean μ is
x̄ ± 1.96σ_x̄ ≈ x̄ ± 1.96 s/√n = 41.5 ± 1.96 (9.2/√50) = 41.5 ± 2.55
or (38.95, 44.05).
b. The 99% confidence interval for μ was determined in Example 7.6 to be (38.14, 44.86).
The 95% confidence interval, obtained in this example and based on the same sample
information, is narrower than the 99% confidence interval. This relationship holds in general,
as stated in the next box.
Relationship between width of confidence interval and confidence coefficient
For a given sample size, the width of the confidence interval for a parameter increases
as the confidence coefficient increases. Intuitively, the interval must become wider for
us to have greater confidence that it contains the true parameter value.
Example 7.8 Refer to Example 7.6.
a. Assume that the given values of the statistics x̄ and s were based on a sample of size n =
100 instead of a sample of size n = 50. Construct a 99% confidence interval for μ, the
population mean weekly time spent on study of all students in the university this year.
b. For a fixed confidence coefficient, how is the width of the confidence interval related to the
sample size?
Solution
a. Substitution of the values of the sample statistics into the general formula for a 99%
confidence interval for μ yields
x̄ ± 2.58σ_x̄ ≈ x̄ ± 2.58 s/√n = 41.5 ± 2.58 (9.2/√100) = 41.5 ± 2.37
or (39.13, 43.87).
b. The 99% confidence interval based on a sample of size n = 100, constructed in part a., is
narrower than the 99% confidence interval based on a sample of size n = 50, constructed in
Example 7.6. This will also hold in general, as stated in the box.
x̄ ± t_{α/2} s/√n
where the distribution of t is based on (n − 1) degrees of freedom.
Upon comparing this to the large-sample confidence interval for μ, you will observe that the
sample standard deviation s replaces the population standard deviation σ. Also, the sampling
distribution upon which the confidence interval is based is known as a Student's t-distribution.
Table 7.6 Some values for Student's t-distribution

Degrees of
freedom    t.100   t.050   t.025    t.010    t.005    t.001    t.0005
1          3.078   6.314   12.706   31.821   63.657   318.31   636.62
2          1.886   2.920   4.303    6.965    9.925    22.326   31.598
3          1.638   2.353   3.182    4.541    5.841    10.213   12.924
4          1.533   2.132   2.776    3.747    4.604    7.173    8.610
5          1.476   2.015   2.571    3.365    4.032    5.893    6.869
6          1.440   1.943   2.447    3.143    3.707    5.208    5.959
7          1.415   1.895   2.365    2.998    3.499    4.785    5.408
8          1.397   1.860   2.306    2.896    3.355    4.501    5.041
9          1.383   1.833   2.262    2.821    3.250    4.297    4.781
10         1.372   1.812   2.228    2.764    3.169    4.144    4.587
11         1.363   1.796   2.201    2.718    3.106    4.025    4.437
12         1.356   1.782   2.179    2.681    3.055    3.930    4.318
13         1.350   1.771   2.160    2.650    3.012    3.852    4.221
14         1.345   1.760   2.145    2.624    2.977    3.787    4.140
15         1.341   1.753   2.131    2.602    2.947    3.733    4.073
Example 7.9 Use Table 7.6 to determine the t-value that would be used in constructing
a 95% confidence interval for μ based on a sample of size n = 14.
Solution For a confidence coefficient of .95, we have
1 − α = .95
α = .05
α/2 = .025
We require the value of t.025 for a Student's t-distribution based on (n − 1) = (14 − 1) = 13
degrees of freedom. In the df = 13 row of Table 7.6, we find t.025 = 2.160 (see Figure 7.6).
Thus, the 95% confidence interval is
x̄ ± 2.160 s/√14
Figure 7.6 The t-distribution with 13 df: t.025 = 2.160
At this point, the reasoning for the arbitrary cutoff point of n = 30 for distinguishing between
large and small samples may be better understood. Observe that the values in the last row of
Table 2 in Appendix C (corresponding to df = ∞) are the values from the standard normal
z-distribution. This phenomenon occurs because, as the sample size increases, the t-distribution
becomes more like the z-distribution. By the time n reaches 30, i.e., df = 29, there is very little
difference between tabulated values of t and z.
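A small-sample interval is computed just like the large-sample one, with t_{α/2} in place of z_{α/2}. The Python sketch below uses hypothetical values x̄ = 90.0, s = 5.0 and n = 14 (not data from the text) so that the t-value matches Example 7.9:

    import math
    from scipy.stats import t

    xbar, s, n = 90.0, 5.0, 14           # hypothetical sample statistics
    t_crit = t.ppf(1 - 0.025, df=n - 1)  # t.025 with 13 df = 2.160
    half_width = t_crit * s / math.sqrt(n)
    print(round(t_crit, 3), round(xbar - half_width, 2),
          round(xbar + half_width, 2))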
Before concluding this section, we will comment on the assumption that the sampled population
is normally distributed. In the real world, we rarely know whether a sampled population has an
exact normal distribution. However, empirical studies indicate that moderate departures from
this assumption do not seriously affect the confidence coefficients for small-sample confidence
intervals. As a consequence, the small-sample confidence interval given in this section is
frequently used by experimenters when estimating the population mean of a nonnormal
distribution, as long as the distribution is bell-shaped and only moderately skewed.
proportion p of all crimes committed in the area in which some type of firearm was
reportedly used.
Solution A logical candidate for a point estimate of the population proportion p is the
proportion of observations in the sample that have the characteristic of interest (called a
"success"); we will call this sample proportion p̂ (read "p hat"). In this example, the
sample proportion of crimes related to firearms is given by
p̂ = 180/300 = .60
That is, 60% of the crimes in the sample were related to firearms; the value p̂ = .60 serves as
our point estimate of the population proportion p.
To assess the reliability of the point estimate p̂, we need to know its sampling distribution. This
information may be derived by an application of the Central Limit Theorem. Properties of the
sampling distribution of p̂ are given in the next box.
Sampling distribution of p̂
For sufficiently large samples, the sampling distribution of p̂ is approximately normal,
with
Mean: μ_p̂ = p
and
Standard deviation: σ_p̂ = √(pq/n)
where q = 1 − p.
A large-sample confidence interval for p may be constructed by using a procedure analogous to
that used for estimating a population mean.
Large-sample (1 − α)100% confidence interval for a population proportion, p
p̂ ± z_{α/2}σ_p̂ ≈ p̂ ± z_{α/2} √(p̂q̂/n)
where p̂ is the sample proportion of observations with the characteristic of interest,
q̂ = 1 − p̂, and the sample proportions approximate p and q in the formula for the
confidence interval. This approximation will be valid as long as the sample size n is sufficiently
large.
Example 7.11 Refer to Example 7.10. Construct a 95% confidence interval for p, the
population proportion of crimes committed in the area in which some type of firearm is
reportedly used.
Solution For a confidence coefficient of .95, we have α = .05; α/2 = .025;
z.025 = 1.96. In Example 7.10, we obtained p̂ = 180/300 = .60, so q̂ = .40. The 95% confidence
interval is then
p̂ ± z_{α/2} √(p̂q̂/n) = .60 ± 1.96 √((.60)(.40)/300) = .60 ± .06
or (.54, .66). Note that the approximation is valid since the interval does not contain 0 or 1.
We are 95% confident that the interval from .54 to .66 contains the true proportion of crimes
committed in the area that are related to firearms. That is, in repeated construction of 95%
confidence intervals, 95% of all samples would produce confidence intervals that enclose p.
It should be noted that small-sample procedures are available for the estimation of a population
proportion p. We will not discuss the details here, however, because most surveys in actual
practice use samples that are large enough to employ the procedure of this section.
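For the firearms example the whole computation fits in a few lines; a minimal Python sketch:

    import math

    p_hat, n = 0.60, 300            # 180 firearm-related crimes out of 300
    q_hat = 1 - p_hat
    half_width = 1.96 * math.sqrt(p_hat * q_hat / n)
    print(round(p_hat - half_width, 2), round(p_hat + half_width, 2))
    # prints 0.54 0.66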
In Section 7.2, we learned how to estimate the parameter μ based on a large sample from a
single population. We now proceed to a technique for using the information in two samples to
estimate the difference between two population means. For example, we may want to compare
the mean heights of the children in province No. 18 and in province No. 14 using the
observations in Appendix B. The technique to be presented is a straightforward extension of
that used for large-sample estimation of a single population mean.
Example 7.12 To estimate the difference between the mean heights for all children of
province No. 18 and province No. 14 use the following information
1. A random sample of 30 heights of children in province No. 18 produced a sample mean of
91.72 cm and a standard deviation of 4.50 cm.
2. A random sample of 40 heights of children in province No. 14 produced a sample mean of
86.67 cm and a standard deviation of 3.88 cm.
Calculate a point estimate for the difference between the mean heights of children in the two
provinces.
Solution We will let subscript 1 refer to province No. 18 and subscript 2 to province
No. 14. We will also define the following notation, summarized in Table 7.5:

Table 7.5
                           Province No. 18    Province No. 14
Sample size                n1 = 30            n2 = 40
Sample mean                x̄1 = 91.72 cm     x̄2 = 86.67 cm
Sample standard deviation  s1 = 4.50 cm       s2 = 3.88 cm
To estimate (μ1 − μ2), it seems sensible to use the difference between the sample means
(x̄1 − x̄2) = (91.72 − 86.67) = 5.05 as our point estimate of the difference between the two
population means. The properties of the point estimate (x̄1 − x̄2) are summarized by its
sampling distribution shown in Figure 7.8.
Figure 7.8 Sampling distribution of (x̄1 − x̄2)
Sampling distribution of (x̄1 − x̄2)
For sufficiently large sample sizes (n1 and n2 ≥ 30), the sampling distribution of
(x̄1 − x̄2), based on independent random samples from two populations, is
approximately normal with
Mean: μ_(x̄1 − x̄2) = (μ1 − μ2)
Standard deviation: σ_(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2)
where σ1² and σ2² are the variances of the two populations from which the samples
were selected.
As was the case with large-sample estimation of a single population mean, the requirement of
large sample sizes enables us to apply the Central Limit Theorem to obtain the sampling
distribution of (x̄1 − x̄2); it also suffices to use s1² and s2² as approximations to the respective
population variances, σ1² and σ2².
The procedure for forming a large-sample confidence interval for (μ1 − μ2) appears in the
accompanying box.
Large-sample (1 − α)100% confidence interval for (μ1 − μ2)
(x̄1 − x̄2) ± z_{α/2}σ_(x̄1 − x̄2) = (x̄1 − x̄2) ± z_{α/2} √(σ1²/n1 + σ2²/n2)
≈ (x̄1 − x̄2) ± z_{α/2} √(s1²/n1 + s2²/n2)
(Note: We have used the sample variances as approximations to the corresponding
population parameters.)
The assumptions upon which the above procedure is based are the following:
Assumptions required for large-sample estimation of (μ1 − μ2)
1. The two random samples are selected in an independent manner from the target
populations. That is, the choice of elements in one sample does not affect, and is
not affected by, the choice of elements in the other sample.
2. The sample sizes n1 and n2 are sufficiently large (at least 30 each).
Example 7.13 Refer to Example 7.12. Construct a 95% confidence interval for (μ1 − μ2),
the difference between the mean heights of all children in province No. 18 and province No.
14. Interpret the interval.
Solution The general form of a 95% confidence interval for (μ1 − μ2), based on large
samples from the target populations, is given by
(x̄1 − x̄2) ± z_{α/2} √(σ1²/n1 + σ2²/n2)
Recall that z.025 = 1.96 and use the information in Table 7.5 to make the following substitutions
to obtain the desired confidence interval:
(91.72 − 86.67) ± 1.96 √(σ1²/30 + σ2²/40)
≈ (91.72 − 86.67) ± 1.96 √((4.50)²/30 + (3.88)²/40)
= 5.05 ± 2.01
or (3.04, 7.06).
The use of this method of estimation produces confidence intervals that will enclose (μ1 − μ2),
the difference between the population means, 95% of the time. Hence, we can be reasonably sure
that the mean height of children in province No. 18 was between 3.04 cm and 7.06 cm greater
than the mean height of children in province No. 14 at the survey time.
When estimating the difference between two population means, based on small samples from
each population, we must make specific assumptions about the relative frequency distributions
of the two populations, as indicated in the box.
Assumptions required for small-sample estimation of (μ1 − μ2)
1. Both of the populations from which the samples are selected have relative frequency
distributions that are approximately normal.
2. The variances σ² of the two populations are equal.
Small-sample confidence interval for (μ1 − μ2):
(x̄1 − x̄2) ± t_{α/2} √(s_p² (1/n1 + 1/n2))
where
s_p² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)
and t_{α/2} is based on (n1 + n2 − 2) degrees of freedom.
Since we assume that the two populations have equal variances σ², we
construct an estimate of σ² based on the information contained in both samples. This pooled
estimate is denoted by s_p² and is computed as in the previous box.
achievement. For each pair, one member would be randomly selected to be taught by method
1; the other member would be assigned to a class taught by method 2. Then the differences
between the matched pairs of achievement test scores should provide a clearer picture of the
difference in achievement for the two reading methods, because the matching would tend to
cancel the effects of the factors that formed the basis of the matching.
In the following boxes, we give the assumptions required and the procedure to be used for
estimating the difference between two population means based on matched-pairs data.
Assumptions required for estimation of (μ1 − μ2): Matched pairs
1. The sample of paired observations is randomly selected from the target population
of paired observations.
2. The population of paired differences is normally distributed.
Small-sample confidence interval for μd = (μ1 − μ2):
d̄ ± t_{α/2} s_d/√n
where d̄ is the mean of the n sample differences, s_d is their standard deviation, and t_{α/2}
is based on (n − 1) degrees of freedom.
Example 7.14 Suppose that the n = 10 pairs of achievement test scores are given in
Table 7.7. Find a 95% confidence interval for the difference in mean achievement,
μd = (μ1 − μ2).

Table 7.7 Reading achievement test scores for Example 7.14

Student pair       1    2    3    4    5    6    7    8    9    10
Method 1 score     78   63   72   89   91   49   68   76   85   55
Method 2 score     71   44   61   84   74   51   55   60   77   39
Pair difference    7    19   11   5    17   -2   13   16   8    16
Solution The differences between matched pairs of reading achievement test scores are
computed as
d = (method 1 score − method 2 score)
The mean, variance, and standard deviation of the differences are
d̄ = Σd/n = 110/10 = 11.0
s_d² = (Σd² − (Σd)²/n) / (n − 1) = (1,594 − (110)²/10) / 9 = (1,594 − 1,210) / 9 = 42.6667
s_d = √42.67 = 6.53
The value of t.025, based on (n − 1) = 9 degrees of freedom, is given in Table 2 of Appendix C as
t.025 = 2.262. Substituting these values into the formula for the confidence interval, we obtain
d̄ ± t.025 s_d/√n = 11.0 ± 2.262 (6.53/√10) = 11.0 ± 4.7
or (6.3, 15.7).
We estimate, with 95% confidence, that the difference between the mean reading achievement
test scores for methods 1 and 2 falls within the interval from 6.3 to 15.7. Since all the values
within the interval are positive, method 1 seems to produce a mean achievement test score that is
substantially higher than the mean score for method 2.
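The full matched-pairs computation of Example 7.14, from raw scores to interval, can be sketched in Python (assuming scipy for the t-value):

    import math
    from scipy.stats import t

    method1 = [78, 63, 72, 89, 91, 49, 68, 76, 85, 55]
    method2 = [71, 44, 61, 84, 74, 51, 55, 60, 77, 39]
    d = [a - b for a, b in zip(method1, method2)]

    n = len(d)
    d_bar = sum(d) / n                                         # 11.0
    s_d = math.sqrt(sum((x - d_bar)**2 for x in d) / (n - 1))  # 6.53
    half_width = t.ppf(0.975, df=n - 1) * s_d / math.sqrt(n)   # about 4.7
    print(round(d_bar - half_width, 1), round(d_bar + half_width, 1))
    # prints 6.3 15.7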
                                            1990          1998
Number surveyed                             n1 = 1,400    n2 = 1,400
Number in sample who said they were
satisfied with their life                   462           674

We define
p1 = population proportion of adults who said that they were satisfied with their life in 1990
p2 = population proportion of adults who said that they were satisfied with their life in 1998
As a point estimate of (p1 − p2), we will use the difference between the corresponding sample
proportions, (p̂1 − p̂2), where
p̂1 = (number of adults in 1990 who said that they were satisfied with their life) /
(number of adults surveyed in 1990) = 462/1,400 = .33
and
p̂2 = (number of adults in 1998 who said that they were satisfied with their life) /
(number of adults surveyed in 1998) = 674/1,400 = .48
Sampling distribution of (p̂1 − p̂2)
For sufficiently large sample sizes n1 and n2, the sampling distribution of (p̂1 − p̂2),
based on independent random samples from two populations, is approximately normal
with
Mean: μ_(p̂1 − p̂2) = (p1 − p2)
and
Standard deviation: σ_(p̂1 − p̂2) = √(p1q1/n1 + p2q2/n2)
It follows that a large-sample confidence interval for (p1 − p2) may be obtained as shown in the
box.
Large-sample (1 − α)100% confidence interval for (p1 − p2)
(p̂1 − p̂2) ± z_{α/2}σ_(p̂1 − p̂2) ≈ (p̂1 − p̂2) ± z_{α/2} √(p̂1q̂1/n1 + p̂2q̂2/n2)
where p̂1 and p̂2 are the sample proportions of observations with the characteristic
of interest.
Assumption: The samples are sufficiently large so that the approximation is valid. As a
general rule of thumb, we will require that the intervals
p̂1 ± 2√(p̂1q̂1/n1)  and  p̂2 ± 2√(p̂2q̂2/n2)
do not contain 0 or 1.
Example 7.16 Refer to Example 7.15. Estimate the difference between the proportions of
the adults in this country in 1990 and in 1998 who said that they were satisfied with their
life, using a 95% confidence interval.
Solution From Example 7.15, we have n1 = n2 = 1,400, p̂1 = .33 and p̂2 = .48.
Thus, q̂1 = 1 − .33 = .67 and q̂2 = 1 − .48 = .52. Note that the intervals
p̂1 ± 2√(p̂1q̂1/n1) = .33 ± 2√((.33)(.67)/1,400) = .33 ± .025
p̂2 ± 2√(p̂2q̂2/n2) = .48 ± 2√((.48)(.52)/1,400) = .48 ± .027
do not contain 0 or 1. Thus, we can apply the large-sample confidence interval for
(p1 − p2).
The 95% confidence interval is
(p̂1 − p̂2) ± z.025 √(p̂1q̂1/n1 + p̂2q̂2/n2)
= (.33 − .48) ± 1.96 √((.33)(.67)/1,400 + (.48)(.52)/1,400)
= −.15 ± .036
or (−.186, −.114). Thus we estimate that the interval (−.186, −.114) encloses the difference
(p1 − p2) with 95% confidence. It appears that there were between 11.4% and 18.6% more adults
in 1998 than in 1990 who said that they were satisfied with their life.
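Example 7.16 in Python, as a minimal sketch:

    import math

    n1 = n2 = 1400
    p1_hat, p2_hat = 462 / n1, 674 / n2   # .33 and about .48
    half_width = 1.96 * math.sqrt(p1_hat * (1 - p1_hat) / n1
                                  + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    print(round(diff - half_width, 3), round(diff + half_width, 3))
    # prints an interval close to (-.186, -.114); the small discrepancy
    # comes from the text rounding the proportions to .33 and .48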
1.96σ_x̄ = 1.96 σ/√n
of the mean shipping time, μ, is approximately .95 (see Figure 7.9). Therefore, we want to
choose the sample size n so that 1.96σ/√n equals .5 day:
1.96 σ/√n = .5
Figures 7.9 and 7.10 The sampling distribution of x̄, with 1.96σ_x̄ = .5 day
Figure 7.9 provides the information we need to find an approximation for σ. Since the Empirical
Rule tells us that almost all the observations in a data set will fall within the interval μ ± 3σ, it
follows that the range of a population is approximately 6σ. If the range of the population of
shipping times is 7 days, then
6σ = 7 days
and σ is approximately equal to 7/6 or 1.17 days.
The final step in determining the sample size is to substitute this approximate value of σ into
the equation obtained previously and solve for n.
Thus, we have
1.96 (1.17)/√n = .5  or  √n = 1.96(1.17)/.5 = 4.59
so that n = (4.59)² ≈ 21.1. Rounding upward, a sample of size n = 22 will suffice.
In general, to estimate μ to within d units with probability (1 − α), we set
z_{α/2} σ/√n = d
where the value of z_{α/2} is obtained from Table 1 of Appendix C. The solution is given by
n = (z_{α/2} σ/d)²
For example, for a confidence coefficient of .90, we would require a sample size of
n = (1.64 σ/d)²
Choosing the sample size for estimating a population mean μ to within d units
with probability (1 − α)
n = (z_{α/2} σ/d)²
Choosing the sample size for estimating a population proportion p to within d units
with probability (1 − α)
n = (z_{α/2}/d)² pq
where p is the value of the population proportion that we are attempting to estimate
and q = 1 − p.
(Note: This technique requires previous estimates of p and q. If none are available, use
p = q = .5 for a conservative choice of n.)
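Both sample-size formulas are one-liners. The sketch below redoes the shipping-time calculation and adds the conservative p = q = .5 case with a hypothetical tolerance d = .05 (not a value from the text):

    import math

    # Mean: estimate mu to within d = .5 day with 95% confidence,
    # using sigma of about 1.17 days (the range/6 approximation)
    z, sigma, d = 1.96, 1.17, 0.5
    n_mean = (z * sigma / d) ** 2
    print(math.ceil(n_mean))        # about 21.1, rounded up to 22

    # Proportion: conservative choice p = q = .5, tolerance d = .05
    p, q, d = 0.5, 0.5, 0.05
    n_prop = (z / d) ** 2 * p * q
    print(math.ceil(n_prop))        # 385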
Values of χ²_a that locate an area of a to the right under the chi-square distribution
(cf. Table 3 of Appendix C)

Degrees of
freedom    χ².100     χ².050     χ².025     χ².010     χ².005
1          2.70554    3.84146    5.02389    6.63490    7.87944
2          4.60517    5.99147    7.37776    9.21034    10.59660
3          6.25139    7.81473    9.34840    11.34490   12.83810
4          7.77944    9.48773    11.14330   13.27670   14.86020
5          9.23635    11.07050   12.83250   15.08630   16.74960
6          10.64460   12.59160   14.44940   16.81190   18.54760
7          12.01700   14.06710   16.01280   18.47530   20.27770
8          13.36160   15.50730   17.53460   20.09020   21.95500
9          14.68370   16.91900   19.02280   21.66600   23.58930
10         15.98710   18.30700   20.48310   23.20930   25.18820
11         17.27500   19.67510   21.92000   24.72500   26.75690
12         18.54940   21.02610   23.33670   26.21700   28.29950
13         19.81190   22.36210   24.73560   27.68830   29.81940
14         21.06420   23.68480   26.11900   29.14130   31.31930
15         22.30720   24.99580   27.48840   30.57790   32.80130
16         23.54180   26.29620   28.84540   31.99990   34.26720
17         24.76900   27.58710   30.19100   33.40870   35.71850
18         25.98940   28.86930   31.52640   34.80530   37.15640
19         27.20360   30.14350   32.85230   36.19080   38.58220
A (1 − α)100% confidence interval for a population variance, σ²
(n − 1)s²/χ²_{α/2} ≤ σ² ≤ (n − 1)s²/χ²_{1−α/2}
where χ²_{α/2} and χ²_{1−α/2} are values of χ² that locate an area of α/2 to the right and α/2
to the left, respectively, of a chi-square distribution based on (n − 1) degrees of
freedom.
Assumption: The population from which the sample is selected has an approximate
normal distribution.
For a 95% confidence interval, (1 − α) = .95 and α/2 = .05/2 = .025. Therefore, we need the
tabulated values χ².025 and χ².975 for (n − 1) = 143 df. Looking in the
df = 150 row of Table 3 of Appendix C (the row with the df value closest to 143), we find
χ².025 = 185.800 and χ².975 = 117.985. Substituting into the formula given in the box, we obtain
(144 − 1)(376.6)²/185.800 ≤ σ² ≤ (144 − 1)(376.6)²/117.985
We are 95% confident that the true variance in weights of contaminated fish in the river falls
between 109,156.8 and 171,898.4.
Figure 7.11 The location of χ²_{1−α/2} and χ²_{α/2} for a chi-square distribution
Example 7.20 Refer to Example 7.19. Find a 95% confidence interval for , the true
standard deviation of the fish weights.
Solution A confidence interval for is obtained by taking the square roots of the lower
and upper endpoints of a confidence interval for 2. Thus, the 95% confidence interval is
109,156.8 ≤ σ² ≤ 171,898.4
330.4 ≤ σ ≤ 414.6
Thus, we are 95% confident that the true standard deviation of the fish weights is between
330.4 grams and 414.6 grams.
Note that the procedure for calculating a confidence interval for σ² in Example 7.19 (and the
confidence interval for σ in Example 7.20) requires an assumption regardless of whether the
sample size n is large or small (see box). We must assume that the population from which the
sample is selected has an approximate normal distribution. It is reasonable to expect this
assumption to be satisfied in Examples 7.19 and 7.20, since the histogram of the 144 fish
weights in the sample is approximately normal.
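The chi-square interval of Examples 7.19 and 7.20 can be sketched in Python. Note that scipy gives the exact 143-df percentiles, while the text approximates them with the df = 150 row of the table, so the endpoints differ somewhat:

    import math
    from scipy.stats import chi2

    n, s = 144, 376.6
    df = n - 1
    chi2_upper = chi2.ppf(0.975, df)     # area .025 to its right
    chi2_lower = chi2.ppf(0.025, df)     # area .975 to its right

    var_low = df * s**2 / chi2_upper
    var_high = df * s**2 / chi2_lower
    print(round(var_low, 1), round(var_high, 1))     # interval for sigma^2
    print(round(math.sqrt(var_low), 1), round(math.sqrt(var_high), 1))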
7.10 Summary
This chapter presented the technique of estimation - that is, using sample information to make
an inference about the value of a population parameter, or the difference between two
population parameters. In each instance, we presented the point estimate of the parameter of
interest, its sampling distribution, the general form of a confidence interval, and any
assumptions required for the validity of the procedure. In addition, we provided techniques for
determining the sample size necessary to estimate each of these parameters.
7.11 Exercises
7.1. Use Table 1 of Appendix C to determine the value of z_{α/2} that would be used to construct a
large-sample confidence interval for μ, for each of the following confidence coefficients:
a) .85
b) .95
c) .975
7.2. Suppose a random sample of size n = 100 produces a mean of x̄ = 81 and a standard
deviation of s = 12.
a) Construct a 90% confidence interval for .
b) Construct a 95% confidence interval for .
c) Construct a 99% confidence interval for .
7.3. Use Table 2 of Appendix C to determine the values of t_{α/2} that would be used in the
construction of a confidence interval for a population mean for each of the following
combinations of confidence coefficient and sample size:
a) Confidence coefficient .99, n = 18.
b) Confidence coefficient .95, n = 10.
c) Confidence coefficient .90, n = 15.
7.4. A random sample of n = 10 measurements from a normally distributed population yields
x̄ = 9.4 and s = 1.8.
a) Calculate a 90% confidence interval for μ.
b) Calculate a 95% confidence interval for μ.
c) Calculate a 99% confidence interval for μ.
7.5. The mean and standard deviation of n measurements randomly sampled from a normally
distributed population are 33 and 4, respectively. Construct a 95% confidence interval for
μ when:
a) n = 5
b) n = 15
c) n = 25
7.6. Random samples of n measurements are selected from a population with unknown
proportion of successes p. Compute an estimate of σ_p̂ for each of the following situations:
a) n = 250, p̂ = .4
b) n = 500, p̂ = .85
c) n = 95, p̂ = .25
7.7. A random sample of size 150 is selected from a population and the number of successes
is 60.
a) Find p̂.
b) Construct a 90% confidence interval for p.
c) Construct a 95% confidence interval for p.
d) Construct a 99% confidence interval for p.
7.8. Independent random samples from two normal populations produced the sample means
and variances listed in the following table.

                   Sample from population 1    Sample from population 2
Sample size        n1 = 14                     n2 = 7
Sample mean        x̄1 = 53.2                  x̄2 = 43.4
Sample variance    s1² = 96.8                  s2² = 102.0
d̄ = 2.3, s_d = 2.67
a) Find a 90% confidence interval for μd.
b) Find a 95% confidence interval for μd.
c) Find a 99% confidence interval for μd.
Chapter 8
Hypothesis Testing
CONTENTS
8.1 Introduction
8.2 Formulating Hypotheses
8.3 Types of errors for a Hypothesis Test
8.4 Rejection Regions
8.5 Summary
8.6 Exercises
8.1 Introduction
In this chapter we will study another method of inference-making: hypothesis testing. The
procedures to be discussed are useful in situations where we are interested in making a
decision about a parameter value, rather than obtaining an estimate of its value. It is often
desirable to know whether some characteristics of a population is larger than a specified value,
or whether the obtained value of a given parameter is less than a value hypothesized for the
purpose of comparison.
It should be stressed that researchers frequently put forward a null hypothesis in the hope that
they can discredit it. For example, consider an educational researcher who designed a new way
to teach a particular concept in science, and wanted to test experimentally whether this new
method worked better than the existing method. The researcher would design an experiment
comparing the two methods. Since the null hypothesis would be that there is no difference
between the two methods, the researcher would be hoping to reject the null hypothesis and
conclude that the method he or she developed is the better of the two.
The null hypothesis is typically a hypothesis of no difference, as in the above example where it
is the hypothesis that there is no difference between the two population means. That is why the
word "null" in "null hypothesis" is used: it is the hypothesis of no difference.
Example 8.1 Formulate appropriate null and alternative hypotheses for testing the
demographer's theory that the mean number of children born to urban women is less
than the mean number of children born to rural women.
Solution The hypotheses must be stated in terms of a population parameter or
parameters. We will thus define
μ1 = mean number of children ever born to urban women
μ2 = mean number of children ever born to rural women
and, since the demographer's theory is supported if μ1 < μ2, formulate the hypotheses
H0: μ1 = μ2
Ha: μ1 < μ2
In Example 8.2, for testing whether a population proportion exceeds .80, the hypotheses were
formulated as
H0: p = .80
Ha: p > .80
Observe that the statement of H0, in these examples and in general, is written with an equality (=) sign. In Example 8.2, you may have been tempted to write the null hypothesis as H0: p ≤ .80. However, since the alternative of interest is that p > .80, any evidence that would cause you to reject the null hypothesis H0: p = .80 in favor of Ha: p > .80 would also cause you to reject H0: p = p', for any value of p' that is less than .80. In other words, H0: p = .80 represents the worst possible case, from the researcher's point of view, when the alternative hypothesis is not correct. Thus, for mathematical ease, we combine all possible situations for describing the opposite of Ha into one statement involving equality.
Example 8.3
A metal lathe is checked periodically by quality control inspectors to
determine if it is producing machine bearings with a mean diameter of .5 inch. If the
mean diameter of the bearings is larger or smaller than .5 inch, then the process is out of
control and needs to be adjusted. Formulate the null and alternative hypotheses that
could be used to test whether the bearing production process is out of control.
Solution We define the following parameter:
μ = True mean diameter (in inches) of all bearings produced by the lathe
If either μ > .5 or μ < .5, then the metal lathe's production process is out of control. Since we wish to be able to detect either possibility, the null and alternative hypotheses would be
H0: μ = .5
Ha: μ ≠ .5
Definition 8.1
A Type I error is the error of rejecting the null hypothesis when it is true. The probability of making a Type I error is usually denoted by α.
Definition 8.2
A Type II error is the error of accepting the null hypothesis when it is false. The probability of making a Type II error is usually denoted by β.
The null hypothesis can be either true or false; further, we will make a conclusion either to reject or not to reject the null hypothesis. Thus, there are four possible situations that may arise in testing a hypothesis (see Table 8.1).
Table 8.1 Conclusions and consequences for testing a hypothesis

                                          Conclusions
"State of Nature"              Do not reject Null Hypothesis    Reject Null Hypothesis
Null Hypothesis true           Correct conclusion               Type I error
Alternative Hypothesis true    Type II error                    Correct conclusion
The kind of error that can be made depends on the actual state of affairs (which, of course, is
unknown to the investigator). Note that we risk a Type I error only if the null hypothesis is rejected, and we risk a Type II error only if the null hypothesis is not rejected. Thus, we may make no error, or we may make either a Type I error (with probability α) or a Type II error (with probability β), but not both. We don't know which type of error corresponds to actuality and so would like to keep the probabilities of both types of errors small. There is an intuitively appealing relationship between the probabilities for the two types of error: as α increases, β decreases; similarly, as β increases, α decreases. The only way to reduce α and β simultaneously is to increase the amount of information available in the sample, i.e., to increase the sample size.
Example 8.4
Refer to Example 8.3. Specify what Type I and Type II errors would
represent, in terms of the problem.
Solution A Type I error is the error of incorrectly rejecting the null hypothesis. In our
example, this would occur if we conclude that the process is out of control when in fact
the process is in control, i.e., if we conclude that the mean bearing diameter is different
from .5 inch, when in fact the mean is equal to .5 inch. The consequence of making such
an error would be that unnecessary time and effort would be expended to repair the
metal lathe.
A Type II error, that of accepting the null hypothesis when it is false, would occur if we conclude that the mean bearing diameter is equal to .5 inch when in fact the mean differs from .5 inch. The practical significance of making a Type II error is that the metal lathe would not be repaired, when in fact the process is out of control.
The probability of making a Type I error (α) can be controlled by the researcher (how to do this will be explained in Section 8.4). α is often used as a measure of the reliability of the conclusion and is called the level of significance (or significance level) for a hypothesis test.
You may note that we have carefully avoided stating a decision in terms of "accept the null
hypothesis H0." Instead, if the sample does not provide enough evidence to support the
alternative hypothesis Ha, we prefer a decision "not to reject H0." This is because, if we were to
"accept H0," the reliability of the conclusion would be measured by , the probability of Type II
error. However, the value of is not constant, but depends on the specific alternative value of the
parameter and is difficult to compute in most testing situations.
In summary, we recommend the following procedure for formulating hypotheses and stating
conclusions.
Formulating hypotheses and stating conclusions
1. State the hypothesis of interest as the alternative hypothesis Ha.
2. The null hypothesis, H0, will be the opposite of Ha and will contain an equality sign.
3. If the sample evidence supports the alternative hypothesis, the null hypothesis will be rejected and the probability of having made an incorrect decision (when in fact H0 is true) is α, a quantity that can be manipulated to be as small as the researcher wishes.
4. If the sample does not provide sufficient evidence to support the alternative hypothesis, then
conclude that the null hypothesis cannot be rejected on the basis of your sample. In this
situation, you may wish to collect more information about the phenomenon under study.
Example 8.5 The logic used in hypothesis testing has often been likened to that used in
the courtroom in which a defendant is on trial for committing a crime.
a. Formulate appropriate null and alternative hypotheses for judging the guilt or innocence of
the defendant.
b. Interpret the Type I and Type II errors in this context.
c. If you were the defendant, would you want α to be small or large? Explain.
Solution
a. Under a judicial system, a defendant is "innocent until proven guilty." That is, the burden of
proof is not on the defendant to prove his or her innocence; rather, the court must collect
sufficient evidence to support the claim that the defendant is guilty. Thus, the null and
alternative hypotheses would be
H0: Defendant is innocent
Ha: Defendant is guilty
b. The four possible outcomes are shown in Table 8.2. A Type I error would be to conclude that the defendant is guilty, when in fact he or she is innocent; a Type II error would be to conclude that the defendant is innocent, when in fact he or she is guilty.

Table 8.2 Conclusions and consequences of a trial

                                      Decision
"State of Nature"          Defendant is innocent    Defendant is guilty
Defendant is innocent      Correct decision         Type I error
Defendant is guilty        Type II error            Correct decision
c. Most would probably agree that the Type I error in this situation is by far the more serious. Thus, we would want α, the probability of committing a Type I error, to be very small indeed.
A convention that is generally observed when formulating the null and alternative hypotheses of
any statistical test is to state H0 so that the possible error of incorrectly rejecting H0 (Type I
error) is considered more serious than the possible error of incorrectly failing to reject H0 (Type
II error). In many cases, the decision as to which type of error is more serious is admittedly not
as clear-cut as that of Example 8.5; experience will help to minimize this potential difficulty.
Ha: > 72
What is the general format for carrying out a statistical test of hypothesis?
Solution The first step is to obtain a random sample from the population of interest. The
information provided by this sample, in the form of a sample statistic, will help us decide
whether to reject the null hypothesis. The sample statistic upon which we base our
decision is called the test statistic.
The second step is to determine a test statistic that is reasonable in the context of a given hypothesis test. For this example, we are hypothesizing about the value of the population mean μ. Since our best guess about the value of μ is the sample mean x̄ (see Section 7.2), it seems reasonable to use x̄ as a test statistic. We will learn how to choose the test statistic for other hypothesis-testing situations in the examples that follow.
The third step is to specify the range of possible computed values of the test statistic for which the null hypothesis will be rejected. That is, what specific values of the test statistic will lead us to reject the null hypothesis in favor of the alternative hypothesis? These specific values are known collectively as the rejection region for the test. For this example, we would need to specify the values of x̄ that would lead us to believe that Ha is true, i.e., that μ is greater than 72. We will learn how to find an appropriate rejection region in later examples.
Once the rejection region has been specified, the fourth step is to use the data in the sample to
compute the value of the test statistic. Finally, we make our decision by observing whether the
computed value of the test statistic lies within the rejection region. If in fact the computed value
falls within the rejection region, we will reject the null hypothesis; otherwise, we do not reject
the null hypothesis.
An outline of the hypothesis-testing procedure developed in Example 8.6 is given by the steps above.
Example 8.7 Refer to Example 8.1, where we wish to test
H0: (μ1 - μ2) = 0
Ha: (μ1 - μ2) < 0
where μ1 and μ2 are the population mean numbers of children born to urban women and rural women, respectively. Suggest an appropriate test statistic in the context of this problem.
Solution The parameter of interest is (μ1 - μ2), the difference between the two population means. Therefore, we will use (x̄1 - x̄2), the difference between the corresponding sample means, as a basis for deciding whether to reject H0. If the difference between the sample means, (x̄1 - x̄2), falls greatly below the hypothesized value of (μ1 - μ2) = 0, then we have evidence that disagrees with the null hypothesis. In fact, it would support the alternative hypothesis that (μ1 - μ2) < 0. Again, we are using the point estimate of the target parameter as the test statistic in the hypothesis-testing approach. In general, when the hypothesis test involves a specific population parameter, the test statistic to be used is the conventional point estimate of that parameter.
In step 3, we divide all possible values of the test statistic into two sets: the rejection region and its
complement. If the computed value of the test statistic falls within the rejection region, we reject
the null hypothesis. If the computed value of the test statistic does not fall within the rejection
region, we do not reject the null hypothesis.
Ha: > 72
indicate which decision you may make for each of the following values of the test statistic:
a. x = 110
b. x = 59
c. x = 73
Solution
a. If x̄ = 110, then much doubt is cast upon the null hypothesis. In other words, if the null hypothesis were true (i.e., if μ is in fact equal to 72), then it is very unlikely that we would observe a sample mean x̄ as large as 110. We would thus tend to reject the null hypothesis on the basis of information contained in this sample.
b. Since the alternative of interest is μ > 72, this value of the sample mean, x̄ = 59, provides no support for Ha. Thus, we would not reject H0 in favor of Ha: μ > 72, based on this sample.
c. Does a sample value of x̄ = 73 cast sufficient doubt on the null hypothesis to warrant its rejection? Although the sample mean x̄ = 73 is larger than the null hypothesized value of μ = 72, is this due to chance variation, or does it provide strong enough evidence to conclude in favor of Ha? We think you will agree that the decision is not as clear-cut as in parts a and b, and that we need a more formal mechanism for deciding what to do in this situation.
We now illustrate how to determine a rejection region that takes into account such factors as the
sample size and the maximum probability of a Type I error that you are willing to tolerate.
Example 8.9 Refer to Example 8.8. Specify completely the form of the rejection region for a test of
H0: μ = 72
Ha: μ > 72
at significance level α = .05.
Solution The rejection region is stated in terms of a standardized test statistic:

z = (x̄ - μx̄)/σx̄ = (x̄ - 72)/(σ/√n) ≈ (x̄ - 72)/(s/√n)

The z-score is obtained by using the values of μx̄ and σx̄ that would be valid if the null hypothesis were true, i.e., if μ = 72. The z-score then gives us a measure of how many standard deviations the observed x̄ is from what we would expect to observe if H0 were true.
We examine Figure 8.1a and observe that the chance of obtaining a value of x̄ more than 1.645 standard deviations above 72 is only .05, when in fact the true value of μ is 72. We are assuming that the sample size is large enough to ensure that the sampling distribution of x̄ is approximately normal. Thus, if we observe a sample mean located more than 1.645 standard deviations above 72, then either H0 is true and a relatively rare (with probability .05 or less) event has occurred, or Ha is true and the population mean exceeds 72. We would tend to favor the latter explanation for obtaining such a large value of x̄, and would then reject H0.
Ha: < 72
at significance level = .01.
Solution Here, we want to be able to detect the directional alternative that is less than
72; in this case, it is "sufficiently small" values of the test statistic x that would cast
doubt on the null hypothesis. As in Example 8.9, we will standardize the value of the test
statistic to obtain a measure of the distance between x and the null hypothesized value of
72:
z=
(x x )
x 72
x 72
s/
This z-value tells us how many standard deviations the observed x̄ is from what would be expected if H0 were true. Here, we have also assumed that n ≥ 30 so that the sampling distribution of x̄ will be approximately normal. The appropriate modifications for small samples will be indicated in Chapter 9.
Figure 8.2a shows us that, when in fact the true value of μ is 72, the chance of observing a value of x̄ more than 2.33 standard deviations below 72 is only .01. Thus, at significance level (probability of Type I error) equal to .01, we would reject the null hypothesis for all values of z that are less than -2.33 (see Figure 8.2b), i.e., for all values of x̄ that lie more than 2.33 standard deviations below 72.
Example 8.11 Specify the form of the rejection region for a test of
H0: μ = 72
Ha: μ ≠ 72
where we are willing to tolerate a .05 chance of making a Type I error.
Solution For this two-sided (non-directional) alternative, we would reject the null hypothesis for "sufficiently small" or "sufficiently large" values of the standardized test statistic

z ≈ (x̄ - 72)/(s/√n)

Now, from Figure 8.3a, we note that the chance of observing a sample mean x̄ more than 1.96 standard deviations below 72 or more than 1.96 standard deviations above 72, when in fact H0 is true, is only α = .05. Thus, the rejection region consists of two sets of values: We will reject H0 if z is either less than -1.96 or greater than 1.96 (see Figure 8.3b). For this rejection rule, the probability of a Type I error is .05.
The three previous examples all exhibit certain common characteristics regarding the rejection region; in particular:
3. The location of the rejection region depends on whether the test is one-tailed or two-tailed, and on the pre-specified significance level, α.
a. For a one-tailed test in which the symbol ">" occurs in Ha, the rejection region consists of values in the upper tail of the sampling distribution of the standardized test statistic. The critical value is selected so that the area to its right is equal to α.
b. For a one-tailed test in which the symbol "<" appears in Ha, the rejection region consists of values in the lower tail of the sampling distribution of the standardized test statistic. The critical value is selected so that the area to its left is equal to α.
c. For a two-tailed test, in which the symbol "≠" occurs in Ha, the rejection region consists of two sets of values. The critical values are selected so that the area in each tail of the sampling distribution of the standardized test statistic is equal to α/2.
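These three critical values come directly from the standard normal distribution. The following minimal sketch, written in Python with scipy.stats (the text itself relies on printed tables and packages such as SPSS and STATGRAPHICS), computes them for an illustrative α = .05; the variable names are ours, not the text's.

from scipy.stats import norm

alpha = 0.05

# Upper-tailed test (">" in Ha): area alpha to the right of the critical value
z_upper = norm.ppf(1 - alpha)        # about 1.645

# Lower-tailed test ("<" in Ha): area alpha to the left of the critical value
z_lower = norm.ppf(alpha)            # about -1.645

# Two-tailed test ("≠" in Ha): area alpha/2 in each tail
z_two = norm.ppf(1 - alpha / 2)      # about 1.96

print("upper-tailed: reject H0 if z >", round(z_upper, 3))
print("lower-tailed: reject H0 if z <", round(z_lower, 3))
print("two-tailed: reject H0 if |z| >", round(z_two, 3))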
Figure 8.4 Size of the upper-tail rejection region for different values of α
Steps 4 and 5 of the hypothesis-testing approach require the computation of a test statistic from the sample information. Then we determine whether the standardized value of the test statistic lies within the rejection region in order to make a decision about whether to reject the null hypothesis.
Example 8.12 Refer to Example 8.9. Suppose the following statistics were calculated based on a random sample of n = 30 measurements: x̄ = 73, s = 13. Perform a test of
H0: μ = 72
Ha: μ > 72
at a significance level of α = .05.
Solution In Example 8.9, we determined the following rejection rule for the given value of α and the alternative hypothesis of interest:
Reject H0 if z > 1.645.
The standardized test statistic, computed assuming H0 is true, is given by

z = (x̄ - μx̄)/σx̄ = (x̄ - 72)/(σ/√n) ≈ (x̄ - 72)/(s/√n) = (73 - 72)/(13/√30) = .42

Since z = .42 is not greater than 1.645, the computed value of the test statistic does not fall within the rejection region. We therefore do not reject H0 at the α = .05 level of significance; this sample provides insufficient evidence to conclude that μ exceeds 72.
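For readers who want to reproduce the arithmetic of Example 8.12, here is a minimal sketch in Python using scipy.stats (rather than the SPSS or STATGRAPHICS packages mentioned in the text); the summary statistics n = 30, x̄ = 73, s = 13 are taken from the example.

from math import sqrt
from scipy.stats import norm

n, xbar, s = 30, 73.0, 13.0    # summary statistics from Example 8.12
mu0, alpha = 72.0, 0.05        # hypothesized mean and significance level

z = (xbar - mu0) / (s / sqrt(n))     # standardized test statistic, about .42
z_crit = norm.ppf(1 - alpha)         # upper-tail critical value, about 1.645

print("z =", round(z, 2), " critical value =", round(z_crit, 3))
print("reject H0" if z > z_crit else "do not reject H0")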
8.5 Summary
In this chapter, we have introduced the logic and general concepts involved in the statistical
procedure of hypothesis testing. The techniques will be illustrated more fully with practical
applications in Chapter 9.
8.6 Exercises
8.1. A medical researcher would like to determine whether the proportion of males admitted to
a hospital because of heart disease differs from the corresponding proportion of females.
Formulate the appropriate null and alternative hypotheses and state whether the test is
one-tailed or two-tailed.
8.2. Why do we avoid stating a decision in terms of "accept the null hypothesis H0"?
8.3. Suppose it is desired to test
H0: μ = 65
Ha: μ ≠ 65
at significance level α = .02. Specify the form of the rejection region. (Hint: assume that the sample size will be sufficient to guarantee the approximate normality of the sampling distribution of x̄.)
8.4. Indicate the form of the rejection region for a test of

z = (p̂ - p0)/√(p0q0/n)
The key to correctly diagnosing a hypothesis test is to determine first the parameter of interest. In this section, we will present several examples illustrating how to determine the parameter of interest. The following are the key words to look for when conducting a hypothesis test about a population parameter.
Determining the parameter of interest

PARAMETER      DESCRIPTION
μ              Mean; average
(μ1 - μ2)      Difference in means or averages; comparison of means or averages
μd             Mean difference
p              Proportion; percentage; fraction; rate
(p1 - p2)      Difference in proportions; comparison of proportions
σ²             Variance; variation; spread
σ1²/σ2²        Ratio of variances; comparison of variability
In the following sections we will present a summary of the hypothesis-testing procedures for
each of the parameters listed in the previous box.
Suppose we want to determine whether the mean time spent on studies of all students at a university is in excess of 40 hours per week. That is, we will test
H0: μ = 40
Ha: μ > 40
where μ is the mean time spent on studies per week of all students at the university.
Large-sample test of hypothesis about a population mean μ

ONE-TAILED TEST
H0: μ = μ0
Ha: μ > μ0 (or Ha: μ < μ0)
Rejection region: z > zα (or z < -zα)

TWO-TAILED TEST
H0: μ = μ0
Ha: μ ≠ μ0
Rejection region: z < -zα/2 or z > zα/2

Test statistic (both cases):

z = (x̄ - μ0)/σx̄ ≈ (x̄ - μ0)/(s/√n)

where zα is the z-value such that P(z > zα) = α, and zα/2 is the z-value such that P(z > zα/2) = α/2. [Note: μ0 is our symbol for the particular numerical value specified for μ in the null hypothesis.]
Assumption: The sample size must be sufficiently large (say, n ≥ 30) so that the sampling distribution of x̄ is approximately normal and so that s provides a good approximation to σ.
Example 9.1 The mean time spent on studies of all students at a university last year was
40 hours per week. This year, a random sample of 35 students at the university was
drawn. The following summary statistics were computed:
x̄ = 42.1 hours;
s = 13.85 hours
Test the hypothesis that μ, the population mean time spent on studies per week, is equal to 40 hours against the alternative that μ is larger than 40 hours. Use a significance level of α = .05.
Solution We wish to test
H0: μ = 40
Ha: μ > 40
Note that the sample size n = 35 is sufficiently large so that the sampling distribution of x̄ is
approximately normal and that s provides a good approximation to σ. Since the required assumption is satisfied, we may proceed with a large-sample test of hypothesis about μ.
Using a significance level of α = .05, we will reject the null hypothesis for this one-tailed test if
z > zα = z.05
i.e., if z > 1.645. This rejection region is shown in Figure 9.1.
z = (x̄ - μ0)/(s/√n) = (42.1 - 40)/(13.85/√35) = .897
Since this value does not fall within the rejection region (see Figure 9.1), we do not reject H0.
We say that there is insufficient evidence (at α = .05) to conclude that the mean time spent on studies per week of all students at the university this year is greater than 40 hours. We would need to take a larger sample before we could detect whether μ > 40, if in fact this were the case.
Example 9.2 A sugar refiner packs sugar into bags weighing, on average, 1 kilogram. The setting of the machine tends to drift, i.e., the average weight of bags filled by the machine sometimes increases and sometimes decreases. It is important to control the average weight of bags of sugar, so the refiner wishes to detect shifts in the mean weight of bags as quickly as possible and reset the machine. In order to detect shifts in the mean weight, he will periodically select 50 bags, weigh them, and calculate the sample mean and standard deviation. The data of one periodic sample are as follows:
x̄ = 1.03 kg
s = .05 kg
Solution We wish to test
H0: μ = 1
Ha: μ ≠ 1
The sample size (50) exceeds 30, so we may proceed with the large-sample test about μ. Because shifts in μ in either direction are important, the test is two-tailed. At significance level α = .01, we will reject the null hypothesis for this two-tailed test if
z < -z.005 = -2.576 or z > z.005 = 2.576
For this sample, z = (1.03 - 1)/(.05/√50) = 4.24, which falls in the rejection region; the refiner would conclude that the mean weight has shifted and reset the machine.
Example 9.3 Prior to the institution of a new safety program, the average number of on-the-job accidents per day at a factory was 4.5. To determine whether the program has been effective, the daily numbers of accidents were recorded for a random sample of n = 30 days after the program began, with the following results:
x̄ = 3.7
s = 1.3
a. Is there sufficient evidence to conclude (at significance level .01) that the average number of on-the-job accidents per day at the factory has decreased since the institution of the safety program?
b. What is the practical interpretation of the test statistic computed in part a?
b. What is the practical interpretation of the test statistic computed in part a?
Solution
a. In order to determine whether the safety program was effective, we will conduct a large-sample test of
H0: μ = 4.5 (i.e., no change in the average number of accidents per day)
Ha: μ < 4.5 (i.e., the average number of accidents per day has decreased)
where μ represents the average number of on-the-job accidents per day at the factory after institution of the new safety program. For a significance level of α = .01, we will reject the null hypothesis if
z < -z.01 = -2.33
z = (x̄ - μ0)/(s/√n) = (3.7 - 4.5)/(1.3/√30) = -3.37
Since this value does fall within the rejection region, there is sufficient evidence (at α = .01) to conclude that the average number of on-the-job accidents per day at the factory has decreased since the institution of the safety program. It appears that the safety program was effective in reducing the average number of accidents per day.
b. If the null hypothesis is true, μ = 4.5. Recall that for large samples, the sampling distribution of x̄ is approximately normal, with mean μx̄ = μ and standard deviation σx̄ = σ/√n. Then the z-score for x̄, under the assumption that H0 is true, is given by

z ≈ (x̄ - 4.5)/(s/√n)

That is, the computed value z = -3.37 tells us that the observed sample mean x̄ = 3.7 lies 3.37 standard deviations below the hypothesized mean of 4.5, a very unlikely occurrence if H0 were true.
For large samples, the critical values of the test depend only upon the type of test (one- or two-tailed). But if we use small samples, then the critical values will depend upon the degrees of freedom as well as the type of test.
A hypothesis test about a population mean, μ, based on a small sample (n < 30) consists of the elements listed in the accompanying box.
Small-sample test of hypothesis about a population mean μ

ONE-TAILED TEST
H0: μ = μ0
Ha: μ > μ0 (or Ha: μ < μ0)
Rejection region: t > tα (or t < -tα)

TWO-TAILED TEST
H0: μ = μ0
Ha: μ ≠ μ0
Rejection region: t < -tα/2 or t > tα/2

Test statistic (both cases):

t = (x̄ - μ0)/(s/√n)

where the distribution of t is based on (n - 1) degrees of freedom.
Assumption: The relative frequency distribution of the population from which the sample was selected is approximately normal.
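The dependence of the critical value on the degrees of freedom is easy to see numerically. The following is a small illustrative sketch in Python with scipy.stats; the sample sizes shown are arbitrary choices for illustration, not values from the text.

from scipy.stats import norm, t

alpha = 0.05
print("large-sample z critical value:", round(norm.ppf(1 - alpha), 3))   # about 1.645

# For small samples, the t critical value grows as the degrees of freedom shrink
for n in (5, 10, 15, 30):
    df = n - 1                       # a one-sample t test uses n - 1 degrees of freedom
    print("n =", n, " t critical value:", round(t.ppf(1 - alpha, df), 3))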
Consider, for example, testing whether the mean lifetime μ of a certain brand of electric light bulbs is 1500 hours:
H0: μ = 1500
Ha: μ ≠ 1500
Since we are restricted to a small sample, we must make the assumption that the lifetimes of the electric light bulbs have a relative frequency distribution that is approximately normal. Under this assumption, the test statistic has a t-distribution with (n - 1) degrees of freedom.
Large-sample test of hypothesis about a population proportion p

ONE-TAILED TEST
H0: p = p0
Ha: p > p0 (or Ha: p < p0)
Rejection region: z > zα (or z < -zα)

TWO-TAILED TEST
H0: p = p0
Ha: p ≠ p0
Rejection region: z < -zα/2 or z > zα/2

Test statistic (both cases):

z = (p̂ - p0)/√(p0q0/n)

where q0 = 1 - p0.
Example 9.4 The proportion of defective components produced by a certain process has historically been 10%. It is suspected that the proportion has increased, and this will be checked by drawing randomly a sample of 150 components. In the sample, 20 are defective. Does this evidence indicate that the true proportion of defective components is significantly larger than 10%? Test at significance level α = .05.
Solution We wish to perform a large-sample test about a population proportion, p:
H0: p = .10 (i.e., no change in proportion of defectives)
Ha: p > .10 (i.e., the proportion of defectives has increased)
At α = .05, we will reject H0 if z > z.05 = 1.645. The sample proportion of defectives is
p̂ = 20/150 = .133
Noting that q0 = 1 - p0 = 1 - .10 = .90, we obtain the following value of the test statistic:

z = (p̂ - p0)/√(p0q0/n) = (.133 - .10)/√((.10)(.90)/150) = 1.361
This value of z lies outside the rejection region, so we conclude that the excess proportion of defectives in the sample is not statistically significant. We have no evidence to reject the null hypothesis that the proportion defective is .10 at the 5% level of significance. Note that in failing to reject H0 we now risk a Type II error (accepting H0 when, in fact, it is not true), whose probability β depends on the true value of p.
[Note that the interval p̂ ± 2√(p̂q̂/n) = .133 ± .056 does not contain 0 or 1, so the sample is large enough for the normal approximation to be valid.]
Although small-sample procedures are available for testing hypotheses about a population proportion, the details are omitted from our discussion. In our experience they are of limited utility, since most surveys of binomial populations performed in practice use samples that are large enough to employ the techniques of this section.
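As a numerical check of the defective-components example above, the following minimal Python sketch carries out the same large-sample test about a proportion; the counts (20 defectives in 150) and the hypothesized p0 = .10 are from the example.

from math import sqrt
from scipy.stats import norm

x, n = 20, 150           # defectives observed in the sample
p0, alpha = 0.10, 0.05   # hypothesized proportion and significance level

p_hat = x / n                                   # about .133
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # about 1.36
z_crit = norm.ppf(1 - alpha)                    # about 1.645

print("p_hat =", round(p_hat, 3), " z =", round(z, 2))
print("reject H0" if z > z_crit else "do not reject H0")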
Suppose we wish to test H0: (μ1 - μ2) = 0 against the alternative Ha: (μ1 - μ2) > 0. The large-sample procedure described in the box is applicable for testing a hypothesis about (μ1 - μ2), the difference between two population means.
Large-sample test of hypothesis about (μ1 - μ2)

ONE-TAILED TEST
H0: (μ1 - μ2) = D0
Ha: (μ1 - μ2) > D0 (or Ha: (μ1 - μ2) < D0)
Rejection region: z > zα (or z < -zα)

TWO-TAILED TEST
H0: (μ1 - μ2) = D0
Ha: (μ1 - μ2) ≠ D0
Rejection region: z < -zα/2 or z > zα/2

Test statistic (both cases):

z = [(x̄1 - x̄2) - D0]/σ(x̄1 - x̄2) ≈ [(x̄1 - x̄2) - D0]/√(s1²/n1 + s2²/n2)

Assumptions:
1. The sample sizes n1 and n2 are sufficiently large (n1 ≥ 30 and n2 ≥ 30).
2. The samples are selected randomly and independently from the target populations.
Example 9.6 A consumer group selected independent random samples of supermarkets located throughout a country for the purpose of comparing the retail prices per pound of coffee of brands A and B. The results of the investigation are summarized in Table 9.1. Does this evidence indicate that the mean retail price per pound of brand A coffee is significantly higher than the mean retail price per pound of brand B coffee? Use a significance level of α = .01.

Table 9.1 Retail prices per pound of coffee
Brand A: n1 = 75, x̄1 = $3.00, s1 = $.11
Brand B: n2 = 64, x̄2 = $2.95, s2 = $.09
Ha: (1 - 2) > 0
(i.e., mean retail price per pound of brand A is higher than that of brand
B)
where
z = [(x̄1 - x̄2) - D0]/√(s1²/n1 + s2²/n2) = [(3.00 - 2.95) - 0]/√((.11)²/75 + (.09)²/64) = 2.947

Since z = 2.947 exceeds 2.33, we reject H0 at α = .01; there is sufficient evidence to conclude that the mean retail price per pound of brand A coffee is significantly higher than that of brand B.
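A sketch of the same large-sample comparison of two means, written in Python for readers who prefer to verify the computation; the summary statistics are those of Table 9.1.

from math import sqrt
from scipy.stats import norm

n1, xbar1, s1 = 75, 3.00, 0.11   # brand A (Table 9.1)
n2, xbar2, s2 = 64, 2.95, 0.09   # brand B
alpha, D0 = 0.01, 0.0

z = ((xbar1 - xbar2) - D0) / sqrt(s1**2 / n1 + s2**2 / n2)   # about 2.95
z_crit = norm.ppf(1 - alpha)                                 # about 2.33

print("z =", round(z, 3), " critical value =", round(z_crit, 2))
print("reject H0" if z > z_crit else "do not reject H0")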
Small-sample test of hypothesis about (μ1 - μ2)

ONE-TAILED TEST
H0: (μ1 - μ2) = D0
Ha: (μ1 - μ2) > D0 (or Ha: (μ1 - μ2) < D0)
Rejection region: t > tα (or t < -tα)

TWO-TAILED TEST
H0: (μ1 - μ2) = D0
Ha: (μ1 - μ2) ≠ D0
Rejection region: t < -tα/2 or t > tα/2

Test statistic (both cases):

t = [(x̄1 - x̄2) - D0]/√(sp²(1/n1 + 1/n2))

where the pooled sample variance is

sp² = [(n1 - 1)s1² + (n2 - 1)s2²]/(n1 + n2 - 2)

and the distribution of t is based on (n1 + n2 - 2) degrees of freedom.
Assumptions:
1. The populations from which the samples are selected both have approximately normal relative frequency distributions.
2. The variances of the two populations are equal.
3. The random samples are selected in an independent manner from the two populations.
Example 9.7 There was a research project on the weights at birth of the children of urban and rural women. The researcher suspects there is a significant difference between the mean weights at birth of children of urban and rural women. To test this hypothesis, he selects independent random samples of weights at birth of children of mothers from each group, calculates the mean weights and standard deviations, and summarizes the results in Table 9.2. Test the researcher's belief, using a significance level of α = .02.

Table 9.2 Weights at birth (kg)
Children of urban women: n1 = 15, x̄1 = 3.5933 kg, s1 = .3706 kg
Children of rural women: n2 = 14, x̄2 = 3.2029 kg, s2 = .4927 kg
Solution We wish to test
H0: (μ1 - μ2) = 0 (i.e., no difference between the mean weights at birth)
Ha: (μ1 - μ2) ≠ 0 (i.e., the mean weights at birth of children of urban and rural women differ)
where μ1 and μ2 are the true mean weights at birth of children of urban and rural women, respectively.
Since the sample sizes for the study are small (n1 = 15, n2 = 14), the following assumptions are
required:
1. The populations of weights at birth of children both have approximately normal distributions.
2. The variances of the populations of weights at birth of children for two groups of mothers are
equal.
3. The samples were independently and randomly selected.
If these three assumptions are valid, the test statistic will have a t-distribution with (n1 + n2 - 2) = (15 + 14 - 2) = 27 degrees of freedom. With a significance level of α = .02, the rejection region is given by
t < -t.01 = -2.473 or t > t.01 = 2.473
The pooled sample variance is
sp² = [(n1 - 1)s1² + (n2 - 1)s2²]/(n1 + n2 - 2) = [14(.3706)² + 13(.4927)²]/27 = .1881
Using this pooled sample variance in the computation of the test statistic, we obtain
t = [(x̄1 - x̄2) - D0]/√(sp²(1/n1 + 1/n2)) = [(3.5933 - 3.2029) - 0]/√(.1881(1/15 + 1/14)) = 2.422
Now the computed value of t does not fall within the rejection region; thus, we fail to reject the null hypothesis (at α = .02) and conclude that there is insufficient evidence of a difference between the mean weights at birth of children of urban and rural women.
In this example, we can see that the computed value of t is very close to the upper boundary of the rejection region. This region is specified by the significance level and the degrees of freedom. How is the conclusion about the difference between the mean weights at birth affected if the significance level is α = .05? We will answer this question in the next example.
Example 9.8 Refer to Example 9.7. Test the researcher's belief, using a significance level of α = .05.
Solution With a significance level of α = .05, the rejection region is given by
t < -t.025 = -2.052 or t > t.025 = 2.052
(see Figure 9.5)
Since the sample data are not changed, the test statistic is the same as in Example 9.7, t = 2.422.
Now the value of t falls in the rejection region, and we have sufficient evidence at a significance level of α = .05 to conclude that the mean weight at birth of children of urban women differs significantly from (in fact, is higher than) the mean weight at birth of children of rural women. But you should notice that the probability of our having committed a Type I error is α = .05.
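The sensitivity of the conclusion to the choice of α in Examples 9.7 and 9.8 can be reproduced with a few lines of Python; the summary statistics are those of Table 9.2 as given above.

from math import sqrt
from scipy.stats import t

n1, xbar1, s1 = 15, 3.5933, 0.3706   # urban sample (Table 9.2)
n2, xbar2, s2 = 14, 3.2029, 0.4927   # rural sample

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance, about .1881
t_stat = (xbar1 - xbar2) / sqrt(sp2 * (1 / n1 + 1 / n2))      # about 2.42
df = n1 + n2 - 2                                              # 27 degrees of freedom

for alpha in (0.02, 0.05):
    t_crit = t.ppf(1 - alpha / 2, df)    # two-tailed critical value
    verdict = "reject H0" if abs(t_stat) > t_crit else "do not reject H0"
    print("alpha =", alpha, " |t| =", round(abs(t_stat), 3), " vs", round(t_crit, 3), "->", verdict)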
Large-sample test of hypothesis about (p1 - p2)

ONE-TAILED TEST
H0: (p1 - p2) = D0
Ha: (p1 - p2) > D0 (or Ha: (p1 - p2) < D0)
Rejection region: z > zα (or z < -zα)

TWO-TAILED TEST
H0: (p1 - p2) = D0
Ha: (p1 - p2) ≠ D0
Rejection region: z < -zα/2 or z > zα/2

Test statistic (both cases):

z = [(p̂1 - p̂2) - D0]/σ(p̂1 - p̂2)

where

σ(p̂1 - p̂2) = √(p1q1/n1 + p2q2/n2) ≈ √(p̂1q̂1/n1 + p̂2q̂2/n2)

and q̂1 = 1 - p̂1, q̂2 = 1 - p̂2.
For the special case where D0 = 0, calculate

σ(p̂1 - p̂2) ≈ √(p̂q̂(1/n1 + 1/n2))

where q̂ = 1 - p̂ and, with (x1 + x2) the total number of successes in the combined samples,

p̂1 = p̂2 = p̂ = (x1 + x2)/(n1 + n2)

Assumption: The intervals p̂1 ± 2√(p̂1q̂1/n1) and p̂2 ± 2√(p̂2q̂2/n2) do not contain 0 and 1.
When testing the null hypothesis that (p1 - p2) equals some specified difference D0, we make a distinction between the case D0 = 0 and the case D0 ≠ 0. For the special case D0 = 0, i.e., when we are testing H0: (p1 - p2) = 0 or, equivalently, H0: p1 = p2, the best estimate of p1 = p2 = p is found by dividing the total number of successes in the combined samples by the total number of observations in the two samples. That is, if x1 is the number of successes in sample 1 and x2 is the number of successes in sample 2, then

p̂ = (x1 + x2)/(n1 + n2)

In this case, the best estimate of the standard deviation of the sampling distribution of (p̂1 - p̂2) is found by substituting p̂ for both p̂1 and p̂2:

σ(p̂1 - p̂2) ≈ √(p̂q̂/n1 + p̂q̂/n2) = √(p̂q̂(1/n1 + 1/n2))
For all cases in which D0 ≠ 0 [for example, when testing H0: (p1 - p2) = .2], we use the estimate σ(p̂1 - p̂2) ≈ √(p̂1q̂1/n1 + p̂2q̂2/n2) given in the box.
Example 9.9 Two types of needles, the old type and the new type, were used for injection of medical patients with a certain substance. The patients were allocated at random to two groups, one to receive the injection from needles of the old type, the other to receive the injection from needles of the new type. Table 9.3 shows the numbers of patients showing reactions to the injection. Does the information support the belief that the proportion of patients giving reactions to needles of the old type is less than the corresponding proportion of patients giving reactions to needles of the new type? Test at significance level α = .01.
Solution We compute
p̂1 = Sample proportion of patients giving reactions with needles of the old type = 37/100 = .37
p̂2 = Sample proportion of patients giving reactions with needles of the new type = 56/100 = .56
Hence,
q̂1 = 1 - p̂1 = 1 - .37 = .63 and q̂2 = 1 - p̂2 = 1 - .56 = .44
We wish to test
H0: (p1 - p2) = 0
Ha: (p1 - p2) < 0
At significance level α = .01, we will reject H0 if z < -z.01 = -2.33. Since D0 = 0, the test statistic is

z = [(p̂1 - p̂2) - D0]/√(p̂q̂(1/n1 + 1/n2))

where

p̂ = (x1 + x2)/(n1 + n2) = (37 + 56)/(100 + 100) = .465

Then we have

z = [(.37 - .56) - 0]/√((.465)(.535)(1/100 + 1/100)) = -2.69
This value falls below the critical value of -2.33. Thus, at α = .01, we reject the null hypothesis; there is sufficient evidence to conclude that the proportion of patients giving reactions to needles of the old type is significantly less than the corresponding proportion of patients giving reactions to needles of the new type, i.e., p1 < p2.
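The same two-proportion test, with the pooled estimate appropriate to D0 = 0, can be verified with this minimal Python sketch; the counts are those of Example 9.9.

from math import sqrt
from scipy.stats import norm

x1, n1 = 37, 100    # reactions with old-type needles (Example 9.9)
x2, n2 = 56, 100    # reactions with new-type needles
alpha = 0.01

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)    # .465; valid because D0 = 0
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se          # about -2.69
z_crit = norm.ppf(alpha)            # lower-tail critical value, about -2.33

print("z =", round(z, 2), " critical value =", round(z_crit, 2))
print("reject H0" if z < z_crit else "do not reject H0")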
The inference derived from the test in Example 9.9 is valid only if the sample sizes, n1 and n2, are sufficiently large to guarantee that the intervals

p̂1 ± 2√(p̂1q̂1/n1) and p̂2 ± 2√(p̂2q̂2/n2)

do not contain 0 or 1. In our example,

p̂1 ± 2√(p̂1q̂1/n1) = .37 ± 2√((.37)(.63)/100) = .37 ± .097 or (.273, .467)
p̂2 ± 2√(p̂2q̂2/n2) = .56 ± 2√((.56)(.44)/100) = .56 ± .099 or (.461, .659)

Since neither interval contains 0 or 1, the sample sizes are large enough for the inference to be valid.
Example 9.10 A quality control supervisor in a cannery knows that the exact amount each can contains will vary. The mean fill per can is important, but equally important is the variation, σ², of the amount of fill. If σ² is large, some cans will contain too little and others too much. Suppose regulatory agencies specify that the standard deviation of the amount of fill should be less than .1 ounce. The quality control supervisor sampled n = 10 cans and calculated s = .04. Does this value of s provide sufficient evidence to indicate that the standard deviation of the fill measurements is less than .1 ounce?
Test of hypothesis about a population variance σ²

ONE-TAILED TEST
H0: σ² = σ0²
Ha: σ² > σ0² (or Ha: σ² < σ0²)
Rejection region: χ² > χ²α (or χ² < χ²1-α)

TWO-TAILED TEST
H0: σ² = σ0²
Ha: σ² ≠ σ0²
Rejection region: χ² < χ²1-α/2 or χ² > χ²α/2

Test statistic (both cases):

χ² = (n - 1)s²/σ0²

where χ²α and χ²1-α are values of χ² that locate an area of α to the right and α to the left, respectively, of a chi-square distribution based on (n - 1) degrees of freedom. [Note: σ0² is our symbol for the particular numerical value specified for σ² in the null hypothesis.]
Assumption: The population from which the random sample is selected has an approximately normal distribution.
Solution Since the null and alternative hypotheses must be stated in terms of σ² (rather than σ), we will want to test the null hypothesis that σ² = .01 against the alternative that σ² < .01. Therefore, the elements of the test are
H0: σ² = .01
Ha: σ² < .01
Test statistic: χ² = (n - 1)s²/σ0²
Rejection region: The smaller the value of s² we observe, the stronger the evidence in favor of Ha. Thus, we reject H0 for "small values" of the test statistic. With α = .05 and 9 df, the χ² value for rejection is found in Table 3, Appendix C and pictured in Figure 9.7. We will reject H0 if χ² < 3.32511.
Remember that the area given in Table 3 of Appendix C is the area to the right of the numerical value in the table. Thus, to determine the lower-tail value that has α = .05 to its left, we use the χ².95 column in Table 3 of Appendix C.
Since

χ² = (n - 1)s²/σ0² = 9(.04)²/.01 = 1.44
is less than 3.32511, the supervisor can conclude that the variance of the population of all amounts of fill is less than .01 (i.e., σ < .1) with 95% confidence. As usual, the confidence is in the procedure used - the χ² test. If this procedure is repeatedly used, it will incorrectly reject H0 only 5% of the time. Thus, the quality control supervisor is confident in the decision that the cannery is operating within the desired limits of variability.
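The chi-square computation for the fill-variance example can be checked numerically. The following sketch in Python uses scipy.stats, with n = 10, s = .04, and σ0² = .01 from the example.

from scipy.stats import chi2

n, s, sigma0_sq = 10, 0.04, 0.01   # sample size, sample std deviation, hypothesized variance
alpha = 0.05

chi2_stat = (n - 1) * s**2 / sigma0_sq   # 9(.04)^2/.01 = 1.44
chi2_crit = chi2.ppf(alpha, n - 1)       # lower-tail critical value, about 3.3251

print("chi-square =", round(chi2_stat, 2), " critical value =", round(chi2_crit, 5))
print("reject H0" if chi2_stat < chi2_crit else "do not reject H0")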
Variance tests have broad applications in business. For example, a production manager may be interested in comparing the variation in the length of eye-screws produced on each of two assembly lines. A line with a large variation produces too many individual eye-screws that do not meet specifications (either too long or too short), even though the mean length may be satisfactory. Similarly, an investor might want to compare the variation in the monthly rates of return for two different stocks that have the same mean rate of return. In this case, the stock with the smaller variance may be preferred because it is less risky - that is, it is less likely to have many very low and very high monthly return rates.
Test of hypothesis for comparing two population variances σ1² and σ2²

ONE-TAILED TEST
H0: σ1²/σ2² = 1 (i.e., σ1² = σ2²)
Ha: σ1²/σ2² > 1 (i.e., σ1² > σ2²) [or Ha: σ1²/σ2² < 1 (i.e., σ1² < σ2²)]
Test statistic: F = s1²/s2² (or F = s2²/s1²)
Rejection region: F > Fα

TWO-TAILED TEST
H0: σ1²/σ2² = 1 (i.e., σ1² = σ2²)
Ha: σ1²/σ2² ≠ 1 (i.e., σ1² ≠ σ2²)
Test statistic: F = Larger sample variance/Smaller sample variance, i.e., F = s1²/s2² or F = s2²/s1²
Rejection region: F > Fα/2

where Fα and Fα/2 are values that locate an area α and α/2, respectively, in the upper tail of the F-distribution with ν1 = numerator degrees of freedom (i.e., the df for the sample variance in the numerator) and ν2 = denominator degrees of freedom (i.e., the df for the sample variance in the denominator).
Assumptions: 1. Both of the populations from which the samples are selected have relative frequency distributions that are approximately normal.
2. The random samples are selected in an independent manner from the two populations.
Variance tests can also be applied prior to conducting a small-sample t test for (μ1 - μ2), discussed in Section 9.4. Recall that the t test requires the assumption that the variances of the two sampled populations are equal. If the two population variances are greatly different, any inferences derived from the t test are suspect. Consequently, it is important that we detect a significant difference between the two variances, if it exists, before applying the small-sample t test.
The common statistical procedure for comparing two population variances, σ1² and σ2², makes an inference about the ratio σ1²/σ2². This is because the sampling distribution of the estimator for σ1²/σ2² is well known when the samples are randomly and independently selected from two normal populations.
The elements of a hypothesis test for the ratio of two population variances, σ1²/σ2², are given in the preceding box.
Example 9.11 A class of 31 students was randomly divided into an experimental set of size n1 = 18 that received instruction in a new statistics unit and a control set of size n2 = 13 that received the standard statistics instruction. All students were given a test of computational skill at the end of the course. A summary of the results appears in Table 9.4. Do the data provide sufficient evidence to indicate a difference in the variability of this skill in the hypothetical population of students who might be given the new instruction and the population of students who might be given the standard instruction? Test using α = .10.

Table 9.4 Computational skill test results
New instruction: n1 = 18, s1 = 1.93
Standard instruction: n2 = 13, s2 = 3.10
Solution We wish to test
H0: σ1²/σ2² = 1 (σ1² = σ2²)
Ha: σ1²/σ2² ≠ 1 (σ1² ≠ σ2²)
According to the box, the test statistic for this two-tailed test is

F = Larger s²/Smaller s² = s2²/s1² = (3.10)²/(1.93)² = 2.58
To find the appropriate rejection region, we need to know the sampling distribution of the test statistic. Under the assumption that both samples of test scores come from normal populations, the F statistic, F = s2²/s1², possesses an F-distribution with ν1 = (n2 - 1) = 12 numerator degrees of freedom and ν2 = (n1 - 1) = 17 denominator degrees of freedom. From Table 4 of Appendix C,
F.05 = 2.38
As shown in Figure 9.8, α/2 = .05 is the tail area to the right of 2.38 in the F-distribution with 12 numerator df and 17 denominator df. Thus, the probability that the F statistic will exceed 2.38 is α/2 = .05.
Given this information on the F-distribution, we are now able to find the rejection region for this test. Since the test is two-tailed, we will reject H0 if F > Fα/2. For α = .10, we have α/2 = .05 and F.05 = 2.38 (based on ν1 = 12 and ν2 = 17 df). Thus, the rejection region is
F > 2.38
Since the computed value F = 2.58 falls in the rejection region, we reject H0 at α = .10 and conclude that the variability in computational skill differs between the two methods of instruction.
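The F computation of Example 9.11 can be reproduced as follows; this is an illustrative Python sketch, using the sample standard deviations of Table 9.4 and the convention of placing the larger sample variance in the numerator.

from scipy.stats import f

n1, s1 = 18, 1.93   # new instruction (Table 9.4)
n2, s2 = 13, 3.10   # standard instruction
alpha = 0.10

F = s2**2 / s1**2                    # larger variance in the numerator, about 2.58
df_num, df_den = n2 - 1, n1 - 1      # 12 and 17 degrees of freedom
F_crit = f.ppf(1 - alpha / 2, df_num, df_den)   # F.05 with (12, 17) df, about 2.38

print("F =", round(F, 2), " critical value =", round(F_crit, 2))
print("reject H0" if F > F_crit else "do not reject H0")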
When a one-tailed test is desired, we can always label the populations so that the hypothesized larger variance appears in the numerator of the ratio of the population variances in H0 and Ha. That is, we can always make a one-tailed test an upper-tailed test.
The following excerpt from Table 4 of Appendix C gives the upper-tail values F.05 of the F-distribution.

Percentage points of the F-distribution (α = .05)

Denominator    Numerator degrees of freedom ν1
df ν2          10      12      15      20      24      30      40      60      120     ∞
1              241.90  243.90  245.90  248.00  249.10  250.10  251.10  252.20  253.33  254.30
2              19.40   19.41   19.43   19.45   19.45   19.46   19.47   19.48   19.49   19.50
3              8.79    8.74    8.70    8.66    8.64    8.62    8.59    8.57    8.55    8.53
4              5.96    5.91    5.86    5.80    5.77    5.75    5.72    5.69    5.66    5.63
5              4.74    4.68    4.62    4.56    4.53    4.50    4.46    4.43    4.40    4.36
6              4.06    4.00    3.94    3.87    3.84    3.81    3.77    3.74    3.70    3.67
7              3.64    3.57    3.51    3.44    3.41    3.38    3.34    3.30    3.27    3.23
8              3.35    3.28    3.22    3.15    3.12    3.08    3.04    3.01    2.97    2.93
9              3.14    3.07    3.01    2.94    2.90    2.86    2.83    2.79    2.75    2.71
10             2.98    2.91    2.85    2.77    2.74    2.70    2.66    2.62    2.58    2.54
11             2.85    2.79    2.72    2.65    2.61    2.57    2.53    2.49    2.45    2.40
12             2.75    2.69    2.62    2.54    2.51    2.47    2.43    2.38    2.34    2.30
13             2.67    2.60    2.53    2.46    2.42    2.38    2.34    2.30    2.25    2.21
14             2.60    2.53    2.46    2.39    2.35    2.31    2.27    2.22    2.18    2.13
15             2.54    2.48    2.40    2.33    2.29    2.25    2.20    2.16    2.11    2.07
16             2.49    2.42    2.35    2.28    2.24    2.19    2.15    2.11    2.06    2.01
17             2.45    2.38    2.31    2.23    2.19    2.15    2.10    2.06    2.01    1.96
9.8 Summary
In this chapter we have learned the procedures for testing hypotheses about various population parameters. Often the comparison focuses on the means. As we noted with the estimation techniques of Chapter 7, fewer assumptions about the sampled populations are required when the sample sizes are large. It should be emphasized that statistical significance differs from practical significance, and the two must not be confused. A reasonable approach to hypothesis testing blends a valid application of the formal statistical procedures with the researcher's knowledge of the subject matter.
9.9 Exercises
9.1. For each of the following situations, conduct a large-sample test of hypothesis:
a. H0: μ = 40, Ha: μ > 40; n = 35, x̄ = 60, s² = 64; α = .05
b. H0: μ = 120, Ha: μ ≠ 120; n = 40, x̄ = 140.5, s = 9.6; α = .01
c. H0: μ = 11, Ha: μ < 11; n = 48, x̄ = 9.5, s = .6; α = .10
9.2. x̄ = 50.3
9.3. = 68
a. Test the null hypothesis that μ = 1.18 against the alternative that μ < 1.18. Use α = .01.
b. Test the null hypothesis that μ = 1.18 against the alternative that μ < 1.18. Use α = .10.
A random sample of n observations is selected from a binomial population. For each of the following situations, specify the rejection region, test statistic value, and conclusion.
9.4. Two independent random samples are selected from populations with means μ1 and μ2, respectively. The sample sizes, means, and standard deviations are shown in the table.

          Sample 1    Sample 2
x̄         7.5         6.5
s         3.0         1.0
n         45          55

a. Test the null hypothesis H0: (μ1 - μ2) = 0 against the alternative hypothesis Ha: (μ1 - μ2) ≠ 0 at α = .05.
b. Test the null hypothesis H0: (μ1 - μ2) = .5 against the alternative hypothesis Ha: (μ1 - μ2) ≠ .5 at α = .05.
9.5. Independent random samples selected from two binomial populations produced the results given in the table.

                      Sample 1    Sample 2
Number of successes   80          74
Sample size           100         100
9.6. A random sample of n = 10 observations yields x̄ = 231.7 and s² = 15.5. Test the null hypothesis H0: σ² = 20 against the alternative hypothesis Ha: σ² < 20. Use α = .05. What assumptions are necessary for the test to be valid?
9.7.
9.8. Calculate the value of the test statistic for testing H0: σ1²/σ2² in each of the following cases:
Chapter 10
Categorical Data Analysis and Analysis of Variance
CONTENTS
10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises
10.1 Introduction
In this chapter we present some methods for the treatment of categorical data. The methods involve the comparison of a set of observed frequencies with frequencies specified by some hypothesis to be tested. A test of such a hypothesis is called a test of goodness of fit.
We will show how to test the hypothesis that two categorical variables are independent. The test statistics discussed have sampling distributions that are approximated by chi-square distributions. The tests are called chi-square tests. The analysis-of-variance methods presented later in the chapter are useful for comparing two or more population means.
In this chapter we will discuss the procedures for selecting sample data and analyzing variances.
The objective of these sections is to introduce some aspects of experimental design and analysis
of data from such experiments using an analysis of variance.
10.2 Tests of goodness of fit
Example 10.1 A demographer estimates that, among the women of a certain region, 28% can only read and write, 61% have a primary degree, and 11% have a secondary degree or above. To check these percentages, a random sample of n = 100 women in the region was selected and their levels of education recorded. The number of women whose level of education falls into each of the three categories is shown in Table 10.1.

Table 10.1 Numbers of women in each level-of-education category

Level of education    Can read/write    Primary    Secondary and above    Total
Number of women       22                64         14                     100
Do the data given in Table 10.1 disagree with the percentages of 28%, 61%, and 11%
estimated by the demographer? As a first step in answering this question, we need to find the
number of women in the sample of 100 that would be expected to fall in each of the three
educational categories of Table 10.1, assuming that the demographer's percentages are
accurate.
Solution Each woman in the sample was assigned to one and only one of the three educational categories listed in Table 10.1. If the demographer's percentages are correct, then the probabilities that a woman's level of education will fall in the three educational categories are as shown in Table 10.2.
Table 10.2 Category probabilities based on the demographer's percentages

Level of education    Can read/write    Primary     Secondary and above    Total
Cell number           1                 2           3
Cell probability      p1 = .28          p2 = .61    p3 = .11               1.00
Consider first the "Can read/write" cell of Table 10.2. If we assume that the level of education of any woman is independent of the level of education of any other, then the observed number O1 of responses falling into cell 1 is a binomial random variable and its expected value is
e1 = np1 = (100)(.28) = 28
Similarly, the expected numbers of responses in cells 2 and 3 (categories 2 and 3) are
e2 = np2 = (100)(.61) = 61
and
e3 = np3 = (100)(.11) = 11
The observed numbers of responses and the corresponding expected numbers (in parentheses)
are shown in Table 10.3.
Table 10.3 Observed and expected numbers of responses falling in the cell categories for Example 10.1

Level of education    Can read/write    Primary    Secondary and above    Total
Observed numbers      22                64         14                     100
Expected numbers      (28)              (61)       (11)                   100
In general, the expected count for cell i is ei = npi, where pi is the hypothesized probability for cell i. We wish to test
H0: The category (cell) probabilities are p1 = .28, p2 = .61, p3 = .11
Ha: At least two of the probabilities, p1, p2, p3, differ from the values specified in the null hypothesis
The test statistic is

χ² = Σ (Oi - ei)²/ei, summed over the k cells

Substituting the values of the observed and expected cell counts from Table 10.3 into the formula for calculating χ², we obtain

χ² = (22 - 28)²/28 + (64 - 61)²/61 + (14 - 11)²/11 = 1.286 + .148 + .818 = 2.25
Example 10.2 Specify the rejection region for the test described in the preceding discussion. Use α = .05. Test to determine whether the sample data disagree with the demographer's estimated percentages.
Solution Since the value of chi-square increases as the differences between the observed and expected cell counts increase, we will reject H0 for large values of χ², i.e., if χ² > χ²α.
The critical values of the χ² distribution are given in Table 3 of Appendix C. The degrees of freedom for the chi-square statistic used to test the goodness of fit of a set of cell probabilities will always be 1 less than the number of cells. For example, if k cells were used in the categorization of the sample data, then
Degrees of freedom: df = k - 1
For our example, df = (k - 1) = (3 - 1) = 2 and α = .05. From Table 3 of Appendix C, the tabulated value of χ².05 corresponding to df = 2 is 5.99147.
The rejection region for the test, χ² > χ².05, is illustrated in Figure 10.1. We will reject H0 if χ² > 5.99147.
Since the calculated value of the test statistic, χ² = 2.25, is less than χ².05 = 5.99147, we cannot reject H0. There is insufficient information to indicate a lack of fit of the sample data to the percentages estimated by the demographer.
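The goodness-of-fit computation of Examples 10.1 and 10.2 can be verified with scipy.stats, which implements exactly this χ² statistic; the observed and expected counts are those of Table 10.3.

from scipy.stats import chisquare, chi2

observed = [22, 64, 14]   # observed counts (Table 10.3)
expected = [28, 61, 11]   # expected counts e_i = n * p_i under H0

stat, p_value = chisquare(observed, f_exp=expected)   # chi-square statistic, about 2.25
crit = chi2.ppf(0.95, df=len(observed) - 1)           # critical value for alpha = .05, about 5.99

print("chi-square =", round(stat, 2), " critical value =", round(crit, 5))
print("reject H0" if stat > crit else "do not reject H0")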
In general, for a one-way table with k cells, the test statistic is

χ² = Σ (Oi - ei)²/ei, summed over the k cells

where Oi is the observed count and ei = npi is the expected count for cell i.
10.3 The analysis of contingency tables
Table 10.4 Contingency table for views of women and men on a proposal

           In favour    Opposed    Undecided    Total
Women      118          62         25           205
Men        84           78         37           199
Total      202          140        62           404
We are to test the statement that there is no difference in opinion between men and women, i.e.
the response is independent of the sex of the person interviewed, and we adopt this as our null
hypothesis. Now if the statement is not true, then the response will depend on the sex of the
person interviewed, and the table will enable us to calculate the degree of dependence. A table
constructed in this way (to indicate dependence or association) is called a contingency table. "Contingency" means dependence; many of you will be familiar with the term "contingency planning", i.e., plans that will be put into operation if certain things happen. Thus, the purpose of a contingency table analysis is to determine whether a dependence exists between the two qualitative variables.
We adopt the null hypothesis that there is no association between the response and the sex of the person interviewed. On this basis we may deduce that the proportion of the sample who are female is 205/404, and as 202 people are in favour of the proposal, the expected number of women in favour of the proposal is (205/404) × 202 = 102.5. Therefore, the estimated expected number of women (row 1) in favour of the proposal (column 1) is
e11 = (Row 1 total/n) × (Column 1 total) = (205/404) × 202 = 102.5
404
Also, as 140 people are against the proposal, the expected number of women against the
proposal is (row 1, column 2)
Row 1 total
205
e12 =
(Column 2 total) = 71
140 =
n
404
Row 1 total
205
e13 =
(Column 3 total) = 31.5
62 =
n
404
We now move to row 2 for men and note that the row total is 199. Therefore, we would expect the proportion of the sample who are male to be 199/404 for all three types of opinion. The estimated expected cell counts for the columns of row 2 are
e21 = (Row 2 total/n) × (Column 1 total) = (199/404) × 202 = 99.5
e22 = (Row 2 total/n) × (Column 2 total) = (199/404) × 140 = 69
e23 = (Row 2 total/n) × (Column 3 total) = (199/404) × 62 = 30.5
The formula for calculating any estimated expected value can be deduced from the values calculated above. Each estimated expected cell count is equal to the product of its respective row and column totals divided by the total sample size n:

eij = RiCj/n

where eij = Estimated expected count for the cell in row i and column j
Ri = Row total corresponding to row i
Cj = Column total corresponding to column j
n = Sample size
The observed and estimated expected (in parentheses) cell counts for the opinion survey are shown in Table 10.5.

Table 10.5 Observed and estimated expected cell counts

           In favour       Opposed     Undecided
Women      118 (102.5)     62 (71)     25 (31.5)
Men        84 (99.5)       78 (69)     37 (30.5)
In this example, the chi-square test statistic, χ², is calculated in the same manner as shown in Example 10.1:

χ² = (118 - 102.5)²/102.5 + (62 - 71)²/71 + (25 - 31.5)²/31.5 + (84 - 99.5)²/99.5 + (78 - 69)²/69 + (37 - 30.5)²/30.5 = 9.80
The appropriate degrees of freedom for a contingency table analysis will always be (r - 1)(c - 1), where r is the number of rows and c is the number of columns in the table. In this example, we have (2 - 1)(3 - 1) = 2 degrees of freedom. Consulting Table 3 of Appendix C, we see that the critical values for χ² are 5.99 at a significance level of α = .05 and 9.21 at a level of α = .01. In both cases, the computed test statistic is larger than these critical values. Hence, we reject the null hypothesis and accept the alternative hypothesis that men and women think differently, with 99% confidence.
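The contingency table analysis above can be reproduced in a few lines of Python; scipy.stats computes the same χ² statistic and the same expected counts as Table 10.5 (chi2_contingency applies no continuity correction here because df > 1).

from scipy.stats import chi2_contingency

# Observed counts from Table 10.4 (rows: women, men; columns: in favour, opposed, undecided)
observed = [[118, 62, 25],
            [84, 78, 37]]

stat, p_value, df, expected = chi2_contingency(observed)

print("chi-square =", round(stat, 2), " df =", df, " p-value =", round(p_value, 4))
print(expected)   # matches the estimated expected counts shown in Table 10.5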
In general, for a contingency table with r rows and c columns, the test statistic is

χ² = Σ Σ (Oij - eij)²/eij, summed over all r × c cells

where Oij is the observed count and eij the estimated expected count for the cell in row i and column j.
10.4 Contingency tables in statistical software packages
Below we use SPSS to construct a contingency table and compute the value of the χ² statistic to test the dependence of education level on the living region of the women interviewed in the DHS Survey 1988 in Vietnam (data of the survey are given in Appendix A).
CROSSTABS
/TABLES=urban BY gd1
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ CC PHI
/CELLS= COUNT EXPECTED ROW .
Crosstabs
Case Processing Summary

         Valid              Missing            Total
         N       Percent    N       Percent    N       Percent
Cases    4172    100.0%     0       .0%        4172    100.0%
URBAN * Education Level Crosstabulation

                             Education Level
                             Can read/write    Primary    Secondary and above    Total
Urban   Count                163               299        266                    728
        Expected Count       197.5             415.5      115.0                  728.0
        % within URBAN       22.4%             41.1%      36.5%                  100.0%
Rural   Count                969               2082       393                    3444
        Expected Count       934.5             1965.5     544.0                  3444.0
        % within URBAN       28.1%             60.5%      11.4%                  100.0%
Total   Count                1132              2381       659                    4172
        Expected Count       1132.0            2381.0     659.0                  4172.0
        % within URBAN       27.1%             57.1%      15.8%                  100.0%
Chi-Square Tests

                               Value        df    Asymp. Sig. (2-sided)
Pearson Chi-Square             287.084(a)   2     .000
Likelihood Ratio               241.252      2     .000
Linear-by-Linear Association   137.517      1     .000
N of Valid Cases               4172

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 114.99.
Before moving on to analysis of variance, we make some remarks on methods for treating categorical data.
Surveys that allow for more than two categories for a single response (a one-way table) can be analyzed using the chi-square goodness-of-fit test. The appropriate test statistic, called the χ² statistic, has a sampling distribution approximated by the chi-square probability distribution and measures the amount of disagreement between the observed number of responses and the expected number of responses in each category.
A contingency table analysis is an application of the χ² test for a two-way (or two-variable) classification of data. The test allows us to determine whether the two directions of classification are independent.
10.5 Introduction to analysis of variance
Example 10.3 To compare three workbooks used to teach reading, 18 children were randomly divided into three groups of six, and each group practiced with a different workbook. The reading scores achieved at the end of the study are shown in Table 10.6.

Table 10.6 Reading scores by workbook used

                Workbook 1    Workbook 2    Workbook 3
                2             9             4
                4             10            5
                3             10            6
                4             7             3
                5             8             7
                6             10            5
Sums            24            54            30
Sample means    4             9             5

Figure 10.2 Reading scores by workbook used and for combined sample
The means of the three samples are 4, 9, and 5, respectively. Figure 10.2 shows these as the
centers of the three samples; there is clearly variability from group to group. The variability in
the entire pooled sample of 18 is shown by the last line.
In contrast to this rather typical allocation, we consider Tables 10.7 and 10.8 as illustrations of
extreme cases. In Table 10.7 every observation in Group A is 3, every observation in Group B is
5, and every observation in Group C is 8. There is no variation within groups, but there is
variation between groups.
Table 10.7 Data with no variation within groups

          Group A    Group B    Group C
          3          5          8
          3          5          8
          3          5          8
          3          5          8
Means     3          5          8
In Table 10.8 the mean of each group is 3. There is no variation among the group means,
although there is variability within each group. Neither extreme can be expected to occur in an
actual data set. In actual data, one needs to make an assessment of the relative sizes of the
between-groups and within-groups variability. It is to this assessment that the term "analysis of
variance" refers.
Table 10.8 Data with no variation among group means

          Group A    Group B    Group C
          3          3          1
          5          6          4
          1          2          3
          3          1          4
Means     3          3          3
In Example 10.3, the overall mean, x̄, is the sum of all the observations divided by the total number of observations:

x̄ = (2 + 4 + ... + 5)/18 = 108/18 = 6

The sum of squared deviations of all 18 observations from the mean of the combined sample is a measure of the variability of the combined sample. This sum is called the Total Sum of Squares and is denoted by SS(Total).
SS(Total) = (2 - 6)² + (4 - 6)² + (3 - 6)² + (4 - 6)² + (5 - 6)² + (6 - 6)² +
(9 - 6)² + (10 - 6)² + (10 - 6)² + (7 - 6)² + (8 - 6)² + (10 - 6)² +
(4 - 6)² + (5 - 6)² + (6 - 6)² + (3 - 6)² + (7 - 6)² + (5 - 6)²
= 34 + 62 + 16 = 112
Next we measure the variability within samples. We calculate the sum of squared deviations of each of the 18 observations from their respective group means. This sum is called the Sum of Squares Within Groups (or Sum of Squared Errors) and is denoted by SS(Within Groups) (or SSE).
SSE = (2 - 4)² + (4 - 4)² + (3 - 4)² + (4 - 4)² + (5 - 4)² + (6 - 4)² +
(9 - 9)² + (10 - 9)² + (10 - 9)² + (7 - 9)² + (8 - 9)² + (10 - 9)² +
(4 - 5)² + (5 - 5)² + (6 - 5)² + (3 - 5)² + (7 - 5)² + (5 - 5)²
= 10 + 8 + 10 = 28
Now let us consider the group means of 4, 9, and 5. The sum of squared deviations of the group means from the pooled mean of 6 is
(4 - 6)² + (9 - 6)² + (5 - 6)² = 4 + 9 + 1 = 14
However, this sum is not comparable to the sum of squares within groups because the sampling variability of means is less than that of individual measurements. In fact, the mean of a sample of 6 observations has a sampling variance equal to 1/6 the variance of a single observation. Hence, to put the sum of squared deviations of group means on a basis that can be compared with SS(Within Groups), we must multiply it by 6, the number of observations in each sample, to obtain 6 × 14 = 84. This is called the Sum of Squares Between Groups (or Sum of Squares for Treatments) and is denoted by SS(Between Groups) (or SST).
Now we have three sums that can be compared: SS(Between Groups), SS(Within Groups), and SS(Total). They are given in Table 10.9. Observe that addition of the first two sums of squares gives the last sum. This demonstrates what we mean by the allocation of the total variability to the variability due to differences between means of groups and the variability of individuals within groups.

Table 10.9 Sums of squares

SS(Between Groups)    84
SS(Within Groups)     28
SS(Total)             112
In this example we notice that the variability between groups is a large proportion of the total variability. However, we have to adjust the numbers in Table 10.9 in order to take account of the number of pieces of information going into each sum of squares. That is, we want to use the sums of squares to calculate sample variances. The sum of squares between groups is based on 3 deviations of group means about the mean of the combined sample. Therefore its number of degrees of freedom is 3 - 1 = 2, and the sample variance based on this sum of squares, called the Mean Square for Treatments (MST), is

MST = SS(Between Groups)/(3 - 1) = 84/2 = 42

The sum of squares within groups has 18 - 3 = 15 degrees of freedom, and the corresponding Mean Square for Error (MSE) is

MSE = SS(Within Groups)/(18 - 3) = 28/15 = 1.867
The ratio of these two mean squares is

F = MST/MSE = 42/1.867 = 22.50
The fact that MST is 22.5 times MSE seems to indicate that the variability among groups is
much greater than that within groups. However, we know that such a ratio computed for
different triplets of random samples would vary from triplet to triplet, even if the population
means were the same. We must take account of this sampling variability. This is done by
referring to the F-tables depending on the desired significance level as well as on the number of
degrees of freedom of MST, which is 2 here, and the number of degrees of freedom of MSE,
which is 15 here. The value in the F-table for a significance level of .01 is 6.36. Thus we would
consider the calculated ratio of 22.50 as very significant. We conclude that there are real
differences in average reading readiness due to the use of different workbooks.
The results of the computation are set out in Table 10.10.

Table 10.10

Source of Variation   Sum of Squares   Degrees of Freedom   Mean of Squares   F
Between groups        84               2                    42                22.50
Within groups         28               15                   1.867
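These computations are easy to check with a short program. The following minimal Python sketch is our illustration, not part of the original text; the three workbook samples are typed in from the example above.

import numpy as np

# Reading-readiness scores for the three workbooks (Example 10.3).
groups = [np.array([2, 4, 3, 4, 5, 6]),     # workbook 1
          np.array([9, 10, 10, 7, 8, 10]),  # workbook 2
          np.array([4, 5, 6, 3, 7, 5])]     # workbook 3

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()                                        # pooled mean, 6.0

ss_total = ((all_obs - grand_mean) ** 2).sum()                     # 112
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)             # 28
sst = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # 84

k, n = len(groups), len(all_obs)
mst = sst / (k - 1)   # 42.0
mse = sse / (n - k)   # 1.867
print(ss_total, sst, sse, mst, round(mse, 3), round(mst / mse, 2))  # F = 22.5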
In the next sections we will consider the analysis of variance for the general problem of comparing k
population means for three special types of experimental designs. The null hypothesis is

H0: mu_1 = mu_2 = . . . = mu_k

and the alternative hypothesis is that at least two of the treatment means differ.
An analysis of variance provides an easy way to analyze the data from a completely
randomized design. The analysis partitions SS(Total) into two components, SST and SSE.
These two quantities are defined in general terms as follows:
SST = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2

SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2

where n_j is the number of measurements in sample j, \bar{x}_j is the mean of sample j, and
\bar{x} is the overall mean of all n measurements.
Recall that the quantity SST denotes the sum of squares for treatments and measures the
variation explained by the differences between the treatment means. The sum of squares for
error, SSE, is a measure of the unexplained variability, obtained by calculating a pooled
measure of the variability within the k samples. If the treatment means truly differ, then SST
should be large relative to SSE. We compare the two sources of variability by forming
an F statistic:

F = MST/MSE = [SST/(k - 1)] / [SSE/(n - k)]

where n is the total number of measurements. Under certain conditions, the F statistic has a
repeated sampling distribution known as the F-distribution. Recall from Section 9.6 that the
F-distribution depends on nu_1 numerator degrees of freedom and nu_2 denominator degrees of
freedom. For the completely randomized design, F is based on nu_1 = (k - 1) and nu_2 = (n - k)
degrees of freedom. If the computed value of F exceeds the upper critical value F_alpha, we
reject H0 and conclude that at least two of the treatment means differ.
clxiv
H0: mu_1 = mu_2 = . . . = mu_k
Ha: At least two of the treatment means differ
Test statistic: F = MST/MSE
Rejection region: F > F_alpha
where the distribution of F is based on (k - 1) numerator df and (n - k) denominator df, and
F_alpha is the F value found in Table 4 of Appendix C such that P(F > F_alpha) = alpha.
Assumptions: 1. All k population probability distributions are normal.
             2. The k population variances are equal.
             3. The samples from each population are random and independent.
The results of an analysis of variance are usually summarized and presented in an analysis of
variance (ANOVA) table. Such a table shows the sources of variation, their respective degrees
of freedom, sums of squares, mean squares, and the computed F statistic. The results of the
analysis of variance for Example 10.3 are given in Table 10.10, and the general form of the
ANOVA table for a completely randomized design is shown in Table 10.11.
Table 10.11

Source           Sum of Squares   Degrees of Freedom   Mean of Squares       F
Between groups   SST              k - 1                MST = SST/(k - 1)     F = MST/MSE
Within groups    SSE              n - k                MSE = SSE/(n - k)
Total            SS(Total)        n - 1
Example 10.4 Consider the problem of comparing the mean number of children born to
women in 10 provinces numbered from 1 to 10. Numbers of children born to 3448 women
from these provinces are randomly selected from the column headed CEB of Appendix
A. The women selected from the 10 provinces are considered to be the only ones of interest.
This ensures the assumption of equality of the population variances. Now we want
to compare the mean numbers of children born to all women in these provinces, i.e., we
wish to test

H0: mu_1 = mu_2 = . . . = mu_10
ONEWAY
ceb BY province
/STATISTICS DESCRIPTIVES
/MISSING ANALYSIS .
ONEWAY
Descriptives

Province   N      Mean   Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
1          228    2.40   1.55             .10          2.19                 2.60                 0         10
2          323    2.84   2.30             .13          2.59                 3.09                 0         11
3          302    3.15   2.09             .12          2.91                 3.39                 0         12
4          354    2.80   2.00             .11          2.59                 3.01                 0         10
5          412    2.53   1.61             7.93E-02     2.37                 2.68                 0         9
6          366    3.08   1.99             .10          2.88                 3.29                 0         11
7          402    3.26   1.83             9.13E-02     3.08                 3.44                 0         10
8          360    3.45   2.21             .12          3.23                 3.68                 0         11
9          297    3.87   2.66             .15          3.56                 4.17                 0         12
10         403    3.75   2.52             .13          3.51                 4.00                 0         12
Total      3448   3.13   2.15             3.66E-02     3.06                 3.20                 0         12
ANOVA

Children born

Source           Sum of Squares   df     Mean Square   F        Sig.
Between Groups   702.326          9      78.036        17.621   .000
Within Groups    15221.007        3437   4.429
Total            15923.333        3446
From the printout we can see that the SPSS One-Way ANOVA procedure presents the results
in the form of an ANOVA table. The corresponding sums of squares and mean squares are:
SST = 702.326, SSE = 15221.007, MST = 78.036, MSE = 4.429.
The computed value of the test statistic, given under the column heading F, is
F = 17.621,
with nu_1 = 9 degrees of freedom between provinces and nu_2 = 3437 degrees of freedom
within provinces.
To determine whether to reject the null hypothesis

H0: mu_1 = mu_2 = . . . = mu_10

in favor of the alternative that at least two of the means differ, we compare the observed
significance level Sig. = .000 with the chosen alpha. Since it is smaller than any conventional
alpha, we reject H0 and conclude that the mean numbers of children born differ among the 10
provinces.
Table 10.12 Scores in a taste-testing experiment in which three types of can are each rated by
five persons P1-P5. The individual scores are not reproduced here; the treatment (can) sums are
24, 10 and 21, the block (person) sums for P1-P5 are 14, 12, 12, 10 and 7, and the grand total
is 55.
In general terms, we can define a randomized block design as a design in which k
treatments are compared within each of b blocks. Each block contains k matched experimental
units and the k treatments are randomly assigned, one to each of the units within each block.
Table 10.13 shows the pattern of a data set resulting from a randomized block design; it is a
two-way table with single measurements as entries. In the example, people correspond to blocks
and cans to treatments. The observation x_{gj} is called the response to treatment g in block j.
The treatment mean \bar{x}_{g.} estimates the population mean mu_g for treatment g (averaged
out over people). An objective may be to test the hypothesis that treatments make no difference,

H0: mu_1 = mu_2 = . . . = mu_k
Table 10.13

              Block 1   Block 2   ...   Block b
Treatment 1   x11       x12       ...   x1b
Treatment 2   x21       x22       ...   x2b
.             .         .               .
.             .         .               .
Treatment k   xk1       xk2       ...   xkb
Each observation x_{gj} can be written as a sum of meaningful terms by means of the identity

x_{gj} = \bar{x} + (\bar{x}_{g.} - \bar{x}) + (\bar{x}_{.j} - \bar{x}) + (x_{gj} - \bar{x}_{g.} - \bar{x}_{.j} + \bar{x}).

In words, the observation is the overall mean, plus a deviation for the treatment, plus a
deviation for the block, plus a residual. The "residual" is

x_{gj} - [\bar{x} + (\bar{x}_{g.} - \bar{x}) + (\bar{x}_{.j} - \bar{x})],

which is the difference between the observation and

\bar{x} + (\bar{x}_{g.} - \bar{x}) + (\bar{x}_{.j} - \bar{x}),

obtained by taking into account the overall mean, the effect of the gth treatment, and the effect
of the jth block. Algebra shows that the corresponding decomposition is true for sums of squares:
\sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x})^2 = b \sum_{g=1}^{k} (\bar{x}_{g.} - \bar{x})^2 + k \sum_{j=1}^{b} (\bar{x}_{.j} - \bar{x})^2 + \sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x}_{g.} - \bar{x}_{.j} + \bar{x})^2

that is,

SS(Total) = SS(Treatments) + SS(Blocks) + SS(Residuals).
The number of degrees of freedom of SS(Total) is kb - 1 = n - 1: the number of observations
less 1 for the overall mean. The number of degrees of freedom of SS(Treatments) is k - 1: the
number of treatments less 1 for the overall mean. Similarly, the number of degrees of freedom
of SS(Blocks) is b - 1. There remain, as the number of degrees of freedom for SS(Residuals),

kb - 1 - (k - 1) - (b - 1) = (k - 1)(b - 1).
There is a hypothetical model behind the analysis. It assumes that in repeated experiments
the measurement for the gth treatment in the jth block is the sum of a constant
pertaining to the treatment, namely mu_g, a constant pertaining to the jth block, and a random
"error" term with a variance of sigma^2. The mean square for residuals,

MS(Residuals) = SS(Residuals) / [(k - 1)(b - 1)],

is an unbiased estimate of sigma^2 regardless of whether the mu_g's differ (that is, whether
there are true effects due to treatments). If there are no differences among the mu_g's,

MS(Treatments) = SS(Treatments) / (k - 1)

is also an unbiased estimate of sigma^2 (whether or not there are true effects due to blocks). If
there are differences among the mu_g's, then MS(Treatments) will tend to be larger than
sigma^2. One tests H0 by means of

F = MS(Treatments) / MS(Residuals).

When H0 is true, F has an F-distribution based on (k - 1) numerator df and (k - 1)(b - 1)
denominator df. One rejects H0 if F is sufficiently large, that is, if F exceeds F_alpha. Table
10.14 is the analysis of variance table.
Table 10.14

Source       Sum of squares                                                                   Degrees of freedom   Mean square      F
Treatments   b \sum_{g=1}^{k} (\bar{x}_{g.} - \bar{x})^2                                      k - 1                MS(Treatments)   MS(Treatments)/MS(Residuals)
Blocks       k \sum_{j=1}^{b} (\bar{x}_{.j} - \bar{x})^2                                      b - 1                MS(Blocks)       MS(Blocks)/MS(Residuals)
Residuals    \sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x}_{g.} - \bar{x}_{.j} + \bar{x})^2   (k - 1)(b - 1)       MS(Residuals)
Total        \sum_{g=1}^{k} \sum_{j=1}^{b} (x_{gj} - \bar{x})^2 = SS(Total)                   n - 1

For computation the following equivalent formulas are convenient:

SS(Total) = \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj}^2 - \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2

SS(Treatments) = \frac{1}{b} \sum_{g=1}^{k} \left( \sum_{j=1}^{b} x_{gj} \right)^2 - \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2

SS(Blocks) = \frac{1}{k} \sum_{j=1}^{b} \left( \sum_{g=1}^{k} x_{gj} \right)^2 - \frac{1}{kb} \left( \sum_{g=1}^{k} \sum_{j=1}^{b} x_{gj} \right)^2
For the data of Table 10.12 we obtain

\sum_{g} \sum_{j} x_{gj}^2 = 6^2 + 5^2 + ... + 3^2 = 237

\frac{1}{kb} \left( \sum_{g} \sum_{j} x_{gj} \right)^2 = \frac{55^2}{15} = 201.67

SS(Total) = 237 - 201.67 = 35.33

SS(Treatments) = \frac{24^2 + 10^2 + 21^2}{5} - 201.67 = 223.40 - 201.67 = 21.73

SS(Blocks) = \frac{14^2 + 12^2 + 12^2 + 10^2 + 7^2}{3} - 201.67 = 211 - 201.67 = 9.33

SS(Residuals) = 35.33 - 21.73 - 9.33 = 4.27
The analysis of variance table is Table 10.15. From Table 4 in Appendix C, the tabulated value
of F.05 with 2 and 8 df is 4.46. Therefore, we will reject H0 if the calculated value of F exceeds
4.46. Since the computed value of the test statistic, F = 20.40, exceeds 4.46, we have sufficient
evidence to reject the null hypothesis of no difference in metallic taste among the types of can
at alpha = .05.
Table 10.15

Source     Sum of Squares   Degrees of freedom   Mean square   F
Cans       21.73            2                    10.87         20.40
Persons    9.33             4                    2.33          4.38
Residual   4.27             8                    0.533
Total      35.33            14
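Because the individual scores of Table 10.12 are not reproduced above, the Python sketch below uses a score matrix that we constructed to be consistent with the published sums (treatment sums 24, 10, 21; person sums 14, 12, 12, 10, 7; sum of squared scores 237). It therefore reproduces Table 10.15, but it should be read as an illustration of the computation, not as the original data.

import numpy as np

# One 3x5 score matrix consistent with the marginal sums used in the text.
x = np.array([[6, 5, 5, 5, 3],
              [4, 2, 2, 1, 1],
              [4, 5, 5, 4, 3]], dtype=float)

k, b = x.shape
grand = x.mean()
ss_total = ((x - grand) ** 2).sum()                    # 35.33
ss_treat = b * ((x.mean(axis=1) - grand) ** 2).sum()   # 21.73 (cans)
ss_block = k * ((x.mean(axis=0) - grand) ** 2).sum()   # 9.33  (persons)
ss_resid = ss_total - ss_treat - ss_block              # 4.27

ms_treat = ss_treat / (k - 1)               # 10.87
ms_block = ss_block / (b - 1)               # 2.33
ms_resid = ss_resid / ((k - 1) * (b - 1))   # 0.533
print(round(ms_treat / ms_resid, 2), round(ms_block / ms_resid, 2))  # 20.4, 4.38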
The roles of cans and people can be interchanged. To test the hypothesis that there are no
differences in scoring among persons (in the hypothetical population of repeated experiments),
one uses the ratio of MS(Blocks) to MS(Residuals) and rejects the null hypothesis if that ratio
is greater than an F-value for b - 1 and (k - 1)(b - 1) degrees of freedom. The value here of
4.38 is referred to Table 4 of Appendix C with 4 and 8 degrees of freedom, for which the 5%
point is 3.84; it is barely significant.
If one were interested simply in determining whether the first two population means differed, one
would test the null hypothesis that mu_1 = mu_2 at significance level alpha by using a t-test,
rejecting the null hypothesis if

\left| \bar{x}_1 - \bar{x}_2 \right| / \left( s \sqrt{1/n_1 + 1/n_2} \right) > t_{\alpha/2}

where the number of degrees of freedom for the t-value is the number of degrees of freedom for
s. However, now we want to consider each possible difference mu_g - mu_h; that is, we want to
test all the null hypotheses

H_{gh}: mu_g = mu_h, with g != h; g, h = 1, . . . , k.

There are k(k - 1)/2 such hypotheses.
If, indeed, all the mu's were equal, so that there were no real differences, the probability that
any particular one of the pairwise differences in absolute value would exceed the relevant
t-value is alpha. Hence the probability that at least one of them would exceed the t-value would
be greater than alpha. When many differences are tested, the probability that some will appear
to be "significant" is greater than the nominal significance level alpha when all the null
hypotheses are true. How can one eliminate this false significance? It can be shown that, if m
comparisons are to be made and the overall Type I error probability is to be at most alpha, it
is sufficient to use alpha/m for the significance level of the individual tests. By overall Type
I error we mean concluding mu_g != mu_h for at least one pair g, h when actually
mu_1 = mu_2 = . . . = mu_k.
Example 10.5
We illustrate with Example 10.3 (Tables 10.6 and 10.9). Here s^2 = 1.867, based on 15 degrees
of freedom (s = 1.366). Since all the sample sizes are 6, the value with which to compare each
difference \bar{x}_g - \bar{x}_h is t*_{alpha/2} s \sqrt{1/6 + 1/6}, where alpha* = alpha/3.
For alpha = .05 this gives approximately 2.69 x 1.366 x 0.577 = 2.12. The differences
\bar{x}_2 - \bar{x}_1 = 9 - 4 = 5 and \bar{x}_2 - \bar{x}_3 = 9 - 5 = 4 exceed this value and
are significant, but \bar{x}_3 - \bar{x}_1 = 5 - 4 = 1 is not significant. The conclusion is
that mu_2 is different from both mu_1 and mu_3, but mu_1 and mu_3 may be equal; Workbook 2
appears to be superior.
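The alpha/m adjustment of the pairwise tests is easy to automate. The sketch below is our illustration of Example 10.5; s = 1.366, 15 df and n = 6 per group come from the example, and scipy is assumed to be available.

from itertools import combinations
from math import sqrt
from scipy.stats import t

means = {1: 4.0, 2: 9.0, 3: 5.0}    # group means from Example 10.3
s, df, n, alpha = 1.366, 15, 6, 0.05

m = 3                                        # number of pairwise comparisons, k(k-1)/2
tstar = t.ppf(1 - alpha / (2 * m), df)       # about 2.69
threshold = tstar * s * sqrt(1/n + 1/n)      # about 2.12

for g, h in combinations(means, 2):
    diff = abs(means[g] - means[h])
    print(f"|mean{g} - mean{h}| = {diff:.1f}",
          "significant" if diff > threshold else "not significant")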
Confidence Regions

Simultaneous confidence intervals for all the differences mu_g - mu_h are given by

\bar{x}_g - \bar{x}_h - t^*_{\alpha/2} s \sqrt{1/n_g + 1/n_h} < \mu_g - \mu_h < \bar{x}_g - \bar{x}_h + t^*_{\alpha/2} s \sqrt{1/n_g + 1/n_h}

for g != h; g, h = 1, . . . , k, if alpha* = alpha/m and the distribution of t is based on
(n - k) degrees of freedom.
10.10 Summary
This chapter presented an extension of the methods for comparing two population means to
allow for the comparison of more than two means. The completely randomized design uses
independent random samples selected from each of k populations. The comparison of the
population means is made by comparing the variance among the sample means, as measured
by the mean square for treatments (MST), to the variation attributable to differences within the
samples, as measured by the mean square for error (MSE). If the ratio of MST to MSE is large,
we conclude that a difference exists between the means of at least two of the k populations.
We also presented an analysis of variance for a comparison of two or more population means
using matched groups of experimental units in a randomized block design, an extension of the
matched-pairs design. The design not only allows us to test for differences among the treatment
means, but also enables us to test for differences among block means. By testing for
differences among block means, we can determine whether blocking is effective in reducing the
variation present when comparing the treatment means.
Remember that the proper application of these ANOVA techniques requires that certain
assumptions are satisfied. In most applications, the assumptions will not be satisfied exactly.
However, these analysis of variance procedures are flexible in the sense that slight departures
from the assumptions will not significantly affect the analysis or the validity of the resulting
inferences.
10.11 Exercises
10.1. The following counts are given (only the data survive for this exercise):

27   62   241   69   101   Total: 500
10.2. The following contingency table is given:

         Column 1   Column 2   Column 3   Totals
Row 1    14         37         23         74
Row 2    21         32         38         91
Totals   35         69         61         165
a. Calculate the estimated expected cell counts for the contingency table.
b. Calculate the chi-square statistic for the table.
10.3.
A partially completed ANOVA table for a completely randomized design is shown here.

Source           SS     df   MS
Between groups   24.7
Within groups
Total            62.4   34
10.4.
A randomized block design was conducted to compare the mean responses for three
treatments, A, B, and C, in four blocks. The data table (Block x Treatment) is not reproduced
here; a partial summary ANOVA table follows.

Source       SS       df   MS
Treatments   23.167
Blocks       14.250        4.750
Residuals                  .917
Total        42.917

At the 5% level make the F-test of equality of population (treatment) means for the data
in the table.
Chapter 11
Simple linear regression and correlation
CONTENTS
Regression and correlation analyses are based on the relationship or association between two
or more variables.
Definition 11.1
The variable whose value is to be predicted or explained is called the dependent variable. A
variable used to predict or explain the value of the dependent variable is called the
independent variable.
Example 11.1 A farmer may be interested in the relationship between the level of fertilizer x
and the yield of potatoes y. Here the level of fertilizer x is the independent variable and the
yield of potatoes y is the dependent variable.
Example 11.2 A medical researcher may be interested in the bivariate relationship between a
patient's blood pressure x and heart rate y. Here x is the independent variable and y is the
dependent variable.
Example 11.3 Economists might base their predictions of the annual gross domestic product
(GDP) on the final consumption spending within the economy. Then the final consumption
spending is the independent variable, and the GDP would be the dependent variable.
In regression analysis we can have only one dependent variable in our estimating equation.
However, we can use more than one independent variable. We often add independent variables
in order to improve the accuracy of our prediction.
Definition 11.2
If the dependent variable y increases as the independent variable x increases, the
relationship between x and y is called direct. If the dependent variable y decreases as the
independent variable x increases, the relationship is called inverse.
Scatter diagrams
The first step in determining whether there is a relationship between two variables is to
examine the graph of the observed (or known) data, i.e. of the data points.
Definition 11.3
The graph of the data points is called a scatter diagram or scattergram.
Example 11.4 In recent years, physicians have used the so-called diving reflex to reduce
abnormally rapid heartbeats in humans by submerging the patient's face in cold water. A
research physician conducted an experiment to investigate the effects of various cold
temperatures on the pulse rates of ten small children. The results are presented in Table 11.1.
Table 11.1 Temperature of water and pulse rate data

Child:                        1    2    3    4    5    6    7    8    9    10
Temperature of water, x (F):  68   65   70   62   60   55   58   65   69   63
Reduction in pulse, y (beats/minute): this column is only partially recoverable here; the
surviving values are 10, 13, 10 and 10.
The scattergram of the data set in Table 11.1 is depicted in Figure 11.1.

Figure 11.1 Scattergram for the data of Table 11.1 (reduction in pulse, y, against temperature
of water, x)

Figure 11.2 Scattergram of the data of Table 11.1 with a straight line fitted through the data
points
We see that the relationship described by the data points is well described by a straight line.
Thus, we can say that it is a linear relationship. This relationship, as we see, is inverse
because y decreases as x increases.
Example 11.5 To model the relationship between the CO (carbon monoxide) ranking, y, and
the nicotine content, x, of an American-made cigarette, the Federal Trade Commission tested a
random sample of 5 cigarettes. The CO ranking and nicotine content values are given in Table
11.2.

Table 11.2

Cigarette   Nicotine content, x (mg)   CO ranking, y (mg)
1           0.2                        2
2           0.4                        10
3           0.6                        13
4           0.8                        15
5           1.0                        20
The scattergram, with a straight line representing the relationship between nicotine content x
and CO ranking y fitted through it, is depicted in Figure 11.3. From this we see that the
relationship here is direct.

Figure 11.3 Scattergram of CO ranking, y (mg), against nicotine content, x (mg), with the
fitted straight line
A probabilistic model for the relationship is

y = A + Bx + e,

where A and B are unknown parameters of the deterministic (nonrandom) portion of the model.
If we suppose that the points deviate above or below the line of means with expected value
E(e) = 0, then the mean value of y is

E(y) = A + Bx.

Therefore, the mean value of y for a given value of x, represented by the symbol E(y), graphs
as a straight line with y-intercept A and slope B. A graph of the hypothetical line of means,
E(y) = A + Bx, is shown in Figure 11.4.
y = A + Bx + e,

where
y = dependent variable (variable to be modeled, sometimes called the response variable)
x = independent variable (variable used as a predictor of y)
e = random error
A = y-intercept of the line
B = slope of the line

In order to fit a simple linear regression model to a set of data, we must find estimators for
the unknown parameters A and B of the line of means E(y) = A + Bx. Since the sampling
distributions of these estimators depend on the probability distribution of the random error e,
we must first make specific assumptions about its properties.
The fitted line is written as

\hat{y}_i = a + b x_i

and the deviation of the ith value of y from its predicted value is

y_i - \hat{y}_i = y_i - (a + b x_i).

The sum of squared errors is

SSE = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2.

The values of a and b that make SSE minimum are called the least squares estimators of the
population parameters A and B, and the prediction equation \hat{y} = a + bx is called the
least squares line.

Definition 11.4
The least squares line is one that has a smaller SSE than any other straight-line model.
Slope: b = SS_{xy} / SS_{xx},    y-intercept: a = \bar{y} - b\bar{x}

where

SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),   SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,   \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i,   n = sample size.
Example 11.6 Refer to Example 11.5. Find the best-fitting straight line through the sample
data points.
Solution By the least squares method we find the equation of the best-fitting straight line. It
is \hat{y} = -0.3 + 20.5x. The graph of this line is shown in Figure 11.5.
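The fit is easy to reproduce by computer. The sketch below is our illustration (not the book's computation); the data are typed in from Table 11.2.

import numpy as np

# Nicotine content x (mg) and CO ranking y (mg) from Table 11.2.
x = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
y = np.array([2.0, 10.0, 13.0, 15.0, 20.0])

ss_xx = ((x - x.mean()) ** 2).sum()               # 0.4
ss_xy = ((x - x.mean()) * (y - y.mean())).sum()   # 8.2
b = ss_xy / ss_xx                                 # 20.5
a = y.mean() - b * x.mean()                       # -0.3

resid = y - (a + b * x)
s = np.sqrt((resid ** 2).sum() / (len(x) - 2))    # 1.8166, standard error of estimate
print(b, a, round(s, 4))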
11.4 Estimating sigma^2
In most practical situations, the variance sigma^2 of the random error e will be unknown and
must be estimated from the sample data. Since sigma^2 measures the variation of the y values
about the regression line, it seems intuitively reasonable to estimate sigma^2 by dividing the
total error SSE by an appropriate number.
ESTIMATION OF sigma^2

s^2 = \frac{SSE}{\text{degrees of freedom for error}} = \frac{SSE}{n - 2}

where

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

From the following theorem it is possible to prove that s^2 is an unbiased estimator of
sigma^2, that is, E(s^2) = sigma^2.

Theorem 11.1
Let s^2 = SSE/(n - 2). Then, when the assumptions of Section 11.2 are satisfied, the statistic

\chi^2 = \frac{SSE}{\sigma^2} = \frac{(n-2)s^2}{\sigma^2}

has a chi-square distribution with (n - 2) degrees of freedom.
Usually, s is referred to as the standard error of estimate.
Example 11.7 Refer to Example 11.5. Estimate the value of the error variance sigma^2.
Solution Data analysis or statistical software provides procedures or functions for computing
the standard error of estimate s. For example, the function STEYX of MS Excel gives, for the
data of Example 11.5, the result s = 1.816590, so s^2 = 3.3.
Recall that the least squares line estimates the mean value of y for a given value of x. Since s
measures the spread of the distribution of y values about the least squares line, most
observations will lie within 2s of the least squares line.
INTERPRETATION OF s, THE ESTIMATED STANDARD DEVIATION OF e
We expect most of the observed y values to lie within 2s of their respective least
squares predicted values \hat{y}.
1. Under the assumptions in Section 11.2, b will possess a sampling distribution that
is normally distributed.
2. The mean of the least squares estimator b is B, E(b) = B; that is, b is an
unbiased estimator for B.
3. The standard deviation of the sampling distribution of b is

\sigma_b = \frac{\sigma}{\sqrt{SS_{xx}}}, \quad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2.

We will use these results to test hypotheses about and to construct a confidence interval for
the slope B of the population regression line. Since sigma is usually unknown, we use its
estimator s, and instead of \sigma_b = \sigma / \sqrt{SS_{xx}} we use its estimate

s_b = \frac{s}{\sqrt{SS_{xx}}}.
For testing hypotheses about B we first state the null and alternative hypotheses:

H0: B = B0
Ha: B != B0 (or B < B0, or B > B0)

where B0 is the hypothesized value of B. Often one tests whether B = 0 or not, that is,
whether x does or does not contribute information for the prediction of y. The setup of our
test of utility of the model is summarized in the box.
ONE-TAILED TEST
H0: B = 0
Ha: B < 0 (or B > 0)
Test statistic:

t = \frac{b}{s_b} = \frac{b}{s / \sqrt{SS_{xx}}}

Rejection region: t < -t_alpha (or t > t_alpha),
where t_alpha is based on (n - 2) df.

TWO-TAILED TEST
H0: B = 0
Ha: B != 0
Test statistic:

t = \frac{b}{s_b} = \frac{b}{s / \sqrt{SS_{xx}}}

Rejection region: t < -t_{alpha/2} or t > t_{alpha/2},
where t_{alpha/2} is based on (n - 2) df.
Example 11.8 Refer to Example 11.5. At significance level alpha = 0.05, test the utility of
the model, i.e., test

H0: B = 0
Ha: B != 0

Solution With n = 5 and alpha = 0.05, the critical value based on (5 - 2) = 3 df is obtained
from Table 7.4: t_{alpha/2} = t_{0.025} = 3.182. Thus, we will reject H0 if t < -3.182 or
t > 3.182. In order to compute the test statistic we need the values of b, s and SS_{xx}. In
Example 11.6 we computed b = 20.5. From Example 11.7 we know s = 1.82, and we can compute
SS_{xx} = 0.4. Hence, the test statistic is
t = \frac{b}{s / \sqrt{SS_{xx}}} = \frac{20.5}{1.82 / \sqrt{0.4}} = 7.12

Since the calculated t-value is greater than the critical value t_{0.025} = 3.182, we reject
the null hypothesis and conclude that the slope B != 0. At the significance level alpha = 0.05,
the sample data provide sufficient evidence to conclude that nicotine content does contribute
useful information for the prediction of carbon monoxide ranking using the linear model.
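The same test can be carried out by computer. A minimal sketch follows (scipy is assumed to be available; the tiny difference from 7.12 above comes from rounding s to 1.82 in the hand computation).

from math import sqrt
from scipy.stats import t

b, s, ss_xx, n = 20.5, 1.8166, 0.4, 5
sb = s / sqrt(ss_xx)              # estimated standard error of b
tc = b / sb                       # about 7.14
pval = 2 * t.sf(abs(tc), n - 2)   # two-tailed p-value on n - 2 = 3 df
print(round(tc, 2), round(pval, 4))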
Example 11.9 A consumer investigator obtained the following least squares straight-line model
(based on a sample of n = 100 families) relating the yearly food cost y for a family of 4 to
annual income x:

\hat{y} = 467 + 0.26x.

In addition, the investigator computed the quantities s = 1.1 and SS_{xx} = 26. Compute the
observed p-value for a test to determine whether mean yearly food cost y increases as annual
income x increases, i.e., whether the slope of the population regression line B is positive.
Solution The consumer investigator wants to test
H0: B = 0
Ha: B > 0
To compute the observed significance level (p-value) of the test we must first find the
calculated value of the test statistic, t_c. Since b = 0.26, s = 1.1, and SS_{xx} = 26, we have

t_c = \frac{b}{s / \sqrt{SS_{xx}}} = \frac{0.26}{1.1 / \sqrt{26}} = 1.21

The p-value is P(t > t_c) = P(t > 1.21), where the t-distribution is based on
(n - 2) = (100 - 2) = 98 df. Since df > 30 we can approximate the t-distribution with the
z-distribution. Thus, p-value = P(z > 1.21) = .5 - .3869 = .1131. This value is not small, so
there is little evidence that mean yearly food cost increases as annual income increases.
A (1 - alpha)100% CONFIDENCE INTERVAL FOR THE SLOPE B

b \pm t_{\alpha/2} s_b, \quad \text{where } s_b = \frac{s}{\sqrt{SS_{xx}}}
Example 11.10 Find the 95% confidence interval for B in Example 11.8.
Solution For a 95% confidence interval, alpha = 0.05. Therefore, we need the value of
t_{alpha/2} = t_{0.025} based on (5 - 2) = 3 df. In Example 11.8 we found t_{0.025} = 3.182.
Also, we have b = 20.5 and SS_{xx} = 0.4. Thus, a 95% confidence interval for the slope in the
model relating carbon monoxide ranking to nicotine content is

b \pm t_{\alpha/2} \frac{s}{\sqrt{SS_{xx}}} = 20.5 \pm 3.182 \times \frac{1.82}{\sqrt{0.4}} = 20.5 \pm 9.16.

Our interval estimate of the slope parameter B is then 11.34 to 29.66. Since all the values in
this interval are positive, it appears that B is positive and that the mean of y, E(y),
increases as x increases.
Remark From the above we see the complete similarity between the t-statistic for testing
hypotheses about the slope B and the t-statistic for testing hypotheses about the means of
normal populations in Chapter 9, and the similarity of the corresponding confidence intervals.
In each case, the general form of the test statistic is

t = \frac{\text{estimator} - \text{hypothesized value}}{\text{estimated standard error of the estimator}}
Definition 11.5
The Pearson product moment coefficient of correlation (or simply, the coefficient of
correlation) r is a measure of the strength of the linear relationship between two
variables x and y. It is computed (for a sample of n measurements on x and y) as
follows:

r = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}}

where

SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \quad SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.
The coefficient of correlation has the following properties:
i) -1 <= r <= 1.
ii) r and b (the slope of the least squares line) have the same sign.
iii) A value of r near or equal to 0 implies little or no linear relationship between x and y.
The closer r is to 1 or to -1, the stronger the linear relationship between x and y.
Keep in mind that the correlation coefficient r measures the correlation between x values and y
values in the sample, and that a similar linear coefficient of correlation exists for the
population from which the data points were selected. The population correlation coefficient is
denoted by rho. As you might expect, rho is estimated by the corresponding sample statistic r.
Or, rather than estimating rho, we might want to test the hypothesis H0: rho = 0 against
Ha: rho != 0, i.e., test the hypothesis that x contributes no information for predicting y
using the straight-line model against the alternative that the two variables are at least
linearly related. It can be shown that the null hypothesis H0: rho = 0 is equivalent to the
hypothesis H0: B = 0. Therefore, we omit the test of hypothesis for linear correlation.
11.6.1 The coefficient of determination
Another way to measure the contribution of x in predicting y is to consider how much the errors
of prediction of y can be reduced by using the information provided by x.
The sample coefficient of determination is developed from the relationship between two kinds of
variation: the variation of the y values in a data set around:
1. The fitted regression line
2. Their own mean
The term variation in both cases is used in its usual statistical sense to mean the sum of a
group of squared deviations.
The first variation is the variation of the y values around the regression line, i.e., around
their predicted values. This variation is the sum of squares for error (SSE) of the regression
model:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

The second variation is the variation of the y values around their own mean:

SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2
Definition 11.6
The coefficient of determination is

r^2 = \frac{SS_{yy} - SSE}{SS_{yy}}

It is easy to verify that

r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}},

and that this quantity equals the square of the coefficient of correlation r.
Figure 11.6 The explained and unexplained deviations

Here we singled out one observed value of y and showed the total deviation of this y from its
mean \bar{y} (that is, y - \bar{y}), the unexplained deviation y - \hat{y}, and the remaining
explained deviation \hat{y} - \bar{y}. Now consider a whole set of observed y values instead of
only one value. The total variation, i.e., the sum of squared deviations of these points from
their mean, would be

SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.

The unexplained portion of the total variation of these points from the regression line is

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,

and the explained portion is

\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.

It is true that

Total variation = Explained variation + Unexplained variation.
Therefore,

r^2 = \frac{\text{Explained variation}}{\text{Total variation}}

About 100(r^2)% of the total sum of squares of deviations of the sample y-values
about their mean \bar{y} can be explained by (or attributed to) using x to predict y in the
straight-line model.
Example 11.11 Refer to Example 11.5. Calculate the coefficient of determination for the
nicotine content-carbon monoxide ranking data and interpret its value.
Solution By the formulas given in this section we find r^2 = 0.9444. We interpret this value
as follows: the use of nicotine content, x, to predict carbon monoxide ranking, y, with the
least squares line

\hat{y} = -0.3 + 20.5x

accounts for approximately 94% of the total sum of squares of deviations of the five sample CO
rankings about their mean. That is, we can reduce the total sum of squares of our prediction
errors by more than 94% by using the least squares equation, instead of \bar{y}, to predict y.
The estimated standard deviations of \hat{y} (as an estimator of the mean E(y)) and of the
prediction error y - \hat{y} for x = x_p are

\sigma_{\hat{y}} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

\sigma_{(y - \hat{y})} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

where sigma is the square root of sigma^2, the variance of the random error e (see Section
11.2). The true value of sigma will rarely be known. Thus, we estimate sigma by s and calculate
the estimation and prediction intervals as follows.
A (1 - alpha)100% CONFIDENCE INTERVAL FOR THE MEAN VALUE OF y FOR x = x_p

\hat{y} \pm t_{\alpha/2} (\text{estimated std of } \hat{y}), \quad \text{or} \quad \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}

A (1 - alpha)100% PREDICTION INTERVAL FOR AN INDIVIDUAL y FOR x = x_p

\hat{y} \pm t_{\alpha/2} [\text{estimated std of } (y - \hat{y})], \quad \text{or} \quad \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}
Example 11.12 Find a 95% confidence interval for the mean carbon monoxide ranking of all
cigarettes that have a nicotine content of 0.4 milligram. Also, find a 95% prediction interval
for a particular cigarette if its nicotine content is 0.4 mg.
Solution For a nicotine content of 0.4 mg, x_p = 0.4 and the confidence interval for the mean
of y is calculated by the first formula in the box above with s = 1.82, n = 5,
df = n - 2 = 3, t_{0.025} = 3.182, \hat{y} = -0.3 + 20.5 x_p = -0.3 + 20.5(0.4) = 7.9,
\bar{x} = 0.6 and SS_{xx} = 0.4. Hence, we obtain the 95% confidence interval

7.9 \pm 3.182 \times 1.82 \sqrt{\frac{1}{5} + \frac{(0.4 - 0.6)^2}{0.4}} = 7.9 \pm 3.17, i.e., (4.73, 11.07),

and the 95% prediction interval for a particular cigarette

7.9 \pm 3.182 \times 1.82 \sqrt{1 + \frac{1}{5} + \frac{(0.4 - 0.6)^2}{0.4}} = 7.9 \pm 6.60, i.e., (1.30, 14.50).
From Example 11.12 it is important to note that the prediction interval for the carbon monoxide
ranking of an individual cigarette is wider than the corresponding confidence interval for the
mean carbon monoxide ranking. By examining the formulas for the two intervals, we can see that
this will always be true. Additionally, over the range of the sample data, the widths of both
intervals increase as the value of x gets further from \bar{x} (see Figure 11.7).
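The two intervals of Example 11.12 can be reproduced with a few lines of code. This is a minimal sketch under the values given in the example; scipy is assumed to be available.

from math import sqrt
from scipy.stats import t

n, s, ss_xx, x_bar = 5, 1.8166, 0.4, 0.6
a, b, xp, alpha = -0.3, 20.5, 0.4, 0.05

y_hat = a + b * xp                   # 7.9
tval = t.ppf(1 - alpha / 2, n - 2)   # 3.182

half_ci = tval * s * sqrt(1/n + (xp - x_bar)**2 / ss_xx)       # about 3.17
half_pi = tval * s * sqrt(1 + 1/n + (xp - x_bar)**2 / ss_xx)   # about 6.60
print((y_hat - half_ci, y_hat + half_ci))   # confidence interval for E(y)
print((y_hat - half_pi, y_hat + half_pi))   # prediction interval for y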
The data on grain yield and tiller number are given in the following table.

Grain yield, kg/ha   Tillers, no./m2
4,862                160
5,244                175
5,128                192
5,052                195
5,298                238
5,410                240
5,234                252
5,608                282
Step 1 Suppose that the assumptions listed in Section 11.2 are satisfied; we hypothesize a
straight-line probabilistic model for the relationship between the grain yield, y, and the
tillers, x:

y = A + Bx + e.

Step 2 Use the sample data to find the least squares line. For this purpose we calculate

SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \quad SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),

b = \frac{SS_{xy}}{SS_{xx}}, \quad a = \bar{y} - b\bar{x},

which give the least squares line

\hat{y} = 4242 + 4.56x.
The scattergram for the data and the least squares line fitted to the data are depicted in
Figure 11.8.

Figure 11.8 Simple linear model relating Grain Yield to Tiller Number
Step 3 Compute an estimate, s^2, of the variance sigma^2 of the random error e:

s^2 = \frac{SSE}{n - 2}, \quad \text{where } SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.

The computations give s^2 = 16,229.66 and s = 127.39. The value of s implies that most of the 8
observed y values will fall within 2s = 254.78 of their respective predicted values.
Step 4 Check the utility of the hypothesized model, that is, whether x really contributes
information for the prediction of y using the straight-line model. First test the hypothesis
that the slope B is 0, i.e., that there is no linear relationship between the grain yield, y,
and the tillers, x. We test:
H0: B = 0
Ha: B != 0
Test statistic:

t = \frac{b}{s_b} = \frac{b}{s / \sqrt{SS_{xx}}} = \frac{4.56}{127.39 / \sqrt{12541.5}} = 4.004.

Since this value exceeds the critical value t_{0.025} = 2.447 based on n - 2 = 6 df, we
conclude that the slope is nonzero. A 95% confidence interval for the slope is

b \pm t_{\alpha/2} \frac{s}{\sqrt{SS_{xx}}} = 4.56 \pm 2.447 \times \frac{127.39}{\sqrt{12541.5}} = 4.56 \pm 2.78.
The coefficient of correlation for the data is

r = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}}, \quad \text{where } SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2.
Suppose the researchers want to predict the grain yield if the tiller number is 210 per m2,
i.e., x_p = 210. The predicted value is \hat{y} = 4242 + 4.56(210) = 5199.6, and a 95%
prediction interval is

\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 5199.6 \pm 2.447 \times 127.39 \sqrt{1 + \frac{1}{8} + \frac{(210 - 216.75)^2}{12541.5}} = 5199.6 \pm 331.2.
11.9 Summary
In this chapter we have introduced bivariate relationships and shown how to compute the
coefficient of correlation, r, a measure of the strength of the linear relationship between two
variables. We have also presented the method of least squares for fitting a prediction equation
to a data set. This procedure, along with associated statistical tests and estimations, is
called a regression analysis. The steps that we follow in a simple linear regression analysis
are:
1. Hypothesize a probabilistic straight-line model y = A + Bx + e.
2. Make assumptions about the random error component e.
3. Use the method of least squares to estimate the unknown parameters in the deterministic
component, E(y) = A + Bx.
4. Assess the utility of the hypothesized model. Included here are making inferences about the
slope B, and calculating the coefficient of correlation r and the coefficient of
determination r^2.
5. If we are satisfied with the model, use it to estimate the mean y value, E(y), for a given
x value and to predict an individual y value for a specific x value.
11.10 Exercises
1. Consider the seven data points in the table:

x   -5    -3    -1    (the remaining x values are not recoverable here)
y   0.8   1.1   2.5   3.1   5.0   4.7   6.2

a) Construct a scatter diagram for the data. After examining the scattergram, do you think
that x and y are correlated? If correlation is present, is it positive or negative?
b) Find the correlation coefficient r and interpret its value.
c) Find the least squares prediction equation.
d) Calculate SSE for the data and calculate s^2 and s.
e) Test the null hypothesis that the slope B = 0 against the alternative hypothesis that
B != 0. Use alpha = 0.05.
f) Find a 90% confidence interval for the slope B.
2. In fitting a least squares line to n = 22 data points, suppose you computed the following
quantities:

SS_{xx} = 25, SS_{yy} = 17, SS_{xy} = 20, \bar{x} = 2, \bar{y} = 3.
3. A study was conducted to examine the inhibiting properties of the sodium salts of
phosphoric acid on the corrosion of iron. The data shown in the table provide a measure of
corrosion of Armco iron in tap water containing various concentrations of NaPO4 inhibitor:

Concentration of NaPO4, x (parts per million)   Measure of corrosion rate, y
2.50                                            7.68
5.03                                            6.95
7.60                                            6.30
11.60                                           5.75
13.00                                           5.01
19.60                                           1.43
26.20                                           0.93
33.00                                           0.72
40.00                                           0.68
50.00                                           0.65
55.00                                           0.56
--------------------------------------------------------------------------------
Parameter    Estimate    Standard Error   T Value   Prob. Level
--------------------------------------------------------------------------------
Intercept    279.763     116.445          2.40252   0.04301
Slope        0.720119    0.0623473        11.5501   0.00000
--------------------------------------------------------------------------------
Analysis of Variance
--------------------------------------------------------------------------------
Source          Sum of Squares   Df   Mean Square   F-Ratio   Prob. Level
Model           798516.89        1    798516.89     133.4     0.00000
Residual        47885.214        8    5985.652
--------------------------------------------------------------------------------
Total (Corr.)   846402.10        9
Chapter 12
Multiple regression

A quadratic model is often referred to as a second-order linear model, in contrast to a
straight-line or first-order model. If, in addition, we think that the mean time required to
process a job is also related to the size x2 of the job, we could include x2 in the model; for
example, the first-order model in this case is

E(y) = B0 + B1x1 + B2x2.

By contrast, the model

E(y) = A e^{-Bx}

is not a linear model, because E(y) is not a linear function of the unknown model parameters A
and B. Note that by introducing new variables, second-order models may be written in the form
of first-order models. For example, putting x2 = x1^2, the second-order model
E(y) = B0 + B1x1 + B2x1^2 becomes the first-order model E(y) = B0 + B1x1 + B2x2 in the two
variables x1 and x2.
Table 12.1

Data Point   y Value   x1    x2    ...   xk
1            y1        x11   x21   ...   xk1
2            y2        x12   x22   ...   xk2
...          ...       ...   ...   ...   ...
n            yn        x1n   x2n   ...   xkn
We will use the method of least squares and choose estimates of B0, B1, B2, ..., Bk that
minimize

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_{1i} + \dots + b_k x_{ki})]^2.

In matrix notation we write

Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & \dots & x_{k2} \\ \vdots & & & & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix}.

The least squares estimates satisfy the normal equations

(X'X) b = X'Y,

where X' is the transpose of X; hence

b = (X'X)^{-1} X'Y.
Example 12.3 Refer to Example 12.1, relating grain yield, y, to plant height, x1, and tiller
number, x2, by the linear model E(y) = A + B1x1 + B2x2.
Find the least squares estimates of B0, B1, B2. The data are shown in Table 12.2.
Table 12.2

Grain yield, kg/ha (y)   Plant height, cm (x1)   Tiller, no./hill (x2)
5755                     110.5                   14.5
5939                     105.4                   16.0
6010                     118.1                   14.6
6545                     104.5                   18.2
6730                     93.6                    15.4
6750                     84.1                    17.6
6899                     77.8                    17.9
7862                     75.6                    19.4

For these data,

Y = (5755, 5939, 6010, 6545, 6730, 6750, 6899, 7862)',

X = \begin{pmatrix}
1 & 110.5 & 14.5 \\
1 & 105.4 & 16.0 \\
1 & 118.1 & 14.6 \\
1 & 104.5 & 18.2 \\
1 & 93.6 & 15.4 \\
1 & 84.1 & 17.6 \\
1 & 77.8 & 17.9 \\
1 & 75.6 & 19.4
\end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix},

and the least squares estimates are obtained from b = (X'X)^{-1} X'Y.
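The normal equations can be solved directly by computer. The sketch below is our illustration with numpy; the book's numerical answer for this example is not reproduced on this page, so the output is left for the reader to verify.

import numpy as np

# Grain yield y, plant height x1 and tiller number x2 from Table 12.2.
y = np.array([5755, 5939, 6010, 6545, 6730, 6750, 6899, 7862], dtype=float)
x1 = np.array([110.5, 105.4, 118.1, 104.5, 93.6, 84.1, 77.8, 75.6])
x2 = np.array([14.5, 16.0, 14.6, 18.2, 15.4, 17.6, 17.9, 19.4])

X = np.column_stack([np.ones_like(y), x1, x2])   # design matrix with a column of 1s
b = np.linalg.solve(X.T @ X, X.T @ y)            # solves the normal equations (X'X)b = X'Y
print(b)                                         # least squares estimates b0, b1, b2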
12.4 Estimating sigma^2
We recall that the variances of the estimators of all the B parameters and of \hat{y} depend on
the value of sigma^2, the variance of the random error e that appears in the linear model.
Since sigma^2 will rarely be known in advance, we must use the sample data to estimate its
value.

ESTIMATOR OF sigma^2, THE VARIANCE OF e IN A MULTIPLE REGRESSION MODEL

s^2 = \frac{SSE}{\text{degrees of freedom for error}} = \frac{SSE}{n - \text{(number of B parameters in the model)}}

where

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Before making inferences about the B parameters of the multiple linear model, we present some
properties of the least squares estimators b, which serve as the theoretical background for
estimating and testing hypotheses about B.
From Section 12.3 we know that the least squares estimators b are computed by the formula
b = (X'X)^{-1}X'Y. We can rewrite b in the form

b = [(X'X)^{-1}X'] Y.

From this form we see that the components of b, namely b0, b1, ..., bk, are linear functions of
the n normally distributed random variables y1, y2, ..., yn. Therefore, each bi
(i = 0, 1, ..., k) has a normal sampling distribution.
It can be shown that the least squares estimators are unbiased estimators of B0, B1, ..., Bk,
that is, E(bi) = Bi (i = 0, 1, ..., k).
The standard errors and covariances of the estimators are determined by the elements of the
matrix (X'X)^{-1}. Thus, if we denote

(X'X)^{-1} = \begin{pmatrix}
c_{00} & c_{01} & \dots & c_{0k} \\
c_{10} & c_{11} & \dots & c_{1k} \\
\vdots & & & \vdots \\
c_{k0} & c_{k1} & \dots & c_{kk}
\end{pmatrix},

then the standard deviations of the sampling distributions of b0, b1, ..., bk are

\sigma_{b_i} = \sigma \sqrt{c_{ii}} \quad (i = 0, 1, ..., k)

and the covariances are

Cov(b_i, b_j) = c_{ij} \sigma^2 \quad (i \ne j).
For testing hypotheses about the individual parameters we use the statistic

t = \frac{b_i - B_i}{s_{b_i}} = \frac{b_i - B_i}{s\sqrt{c_{ii}}}

where s is an estimate of sigma.

A (1 - alpha)100% CONFIDENCE INTERVAL FOR Bi

b_i \pm t_{\alpha/2}\, s \sqrt{c_{ii}},

where t_{alpha/2} is based on [n - (k + 1)] df.

For testing H0: Bi = 0 the test statistic is

t = \frac{b_i}{\text{estimated standard error of } b_i}

and the setup of the test is summarized in the following box.
ONE-TAILED TEST
H0: Bi = 0
Ha: Bi < 0 (or Bi > 0)
Test statistic:

t = \frac{b_i}{s_{b_i}} = \frac{b_i}{s\sqrt{c_{ii}}}

Rejection region: t < -t_alpha (or t > t_alpha),
where t_alpha is based on [n - (k + 1)] df,
n = number of observations,
k = number of independent variables in the model.

TWO-TAILED TEST
H0: Bi = 0
Ha: Bi != 0
Test statistic:

t = \frac{b_i}{s_{b_i}} = \frac{b_i}{s\sqrt{c_{ii}}}

Rejection region: t < -t_{alpha/2} or t > t_{alpha/2},
where t_{alpha/2} is based on [n - (k + 1)] df,
n = number of observations,
k = number of independent variables in the model.
Example 12.4 An electrical utility company wants to predict the monthly power usage of a
home as a function of the size of the home, based on the model

y = B0 + B1x + B2x^2 + e.

Data are shown in Table 12.3.

Table 12.3

Size of home, x (square feet)   Monthly usage, y (kilowatt-hours)
1290                            1182
1350                            1172
1470                            1264
1600                            1493
1710                            1571
1840                            1711
1980                            1804
2230                            1840
2400                            1956
2390                            1954
Solution We use a computer with the software STATGRAPHICS to do this example. Below is a
part of the printout of the procedure Multiple regression.
Model 1      B            Std. Error   t        Sig.   95% CI Lower Bound   95% CI Upper Bound
(Constant)   -1303.383    415.210      -3.139   .016   -2285.196            -321.570
X            2.498        .461         5.418    .001   1.408                3.588
X2           -4.768E-04   .000         -3.869   .006   -.001                .000
Definition 12.1
The multiple coefficient of determination is

R^2 = 1 - \frac{SSE}{SS_{yy}}

where

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2,

and \hat{y}_i is the predicted value of y_i for the multiple regression model.
From the definition we see that R^2 = 0 implies a complete lack of fit of the model to the
data, while R^2 = 1 implies a perfect fit, with the model passing through every data point. In
general, the larger the value of R^2, the better the model fits the data.
R^2 is a sample statistic that tells how well the model fits the data, and thereby represents a
measure of the utility of the entire model. It can be used to make inferences about the utility
of the model for predicting y values for specific settings of the independent variables.
H0: B1 = B2 = ... = Bk = 0
Ha: At least one Bi != 0
Test statistic:

F = \frac{\text{Mean Square for Model}}{\text{Mean Square for Error}} = \frac{SS(\text{Model})/k}{SSE/[n - (k+1)]} = \frac{R^2/k}{(1 - R^2)/[n - (k+1)]}

Rejection region: F > F_alpha, where F_alpha is the value that locates area alpha in the upper
tail of the F-distribution with nu_1 = k and nu_2 = n - (k + 1);
n = number of observations, k = number of parameters in the model (excluding B0),
R^2 = multiple coefficient of determination.
Example 12.5 Refer to Example 12.4. Test to determine whether the model contributes
information for the prediction of the monthly power usage.
Solution For the electrical usage example, n = 10, k = 2 and n - (k + 1) = 7. At the
significance level alpha = 0.05 we will reject H0: B1 = B2 = 0 if F > F_{0.05} with nu_1 = 2
and nu_2 = 7, i.e., if F > 4.74. From the computer printout (see Figure 12.3) we find that the
computed F is 190.638. Since this value greatly exceeds 4.74, we reject H0 and conclude that at
least one of the model coefficients B1 and B2 is nonzero. Therefore, this F-test indicates that
the second-order model y = B0 + B1x + B2x^2 + e is useful for predicting electrical usage.
Example 12.6 Refer to Example 12.3. Test the utility of the model E(y) = A + B1x1 + B2x2.
Solution From the SPSS printout (Figure 12.4) we see that the F value is 11.356 and the
corresponding observed significance level is 0.014. Thus, at any significance level greater
than 0.014 we reject the null hypothesis and conclude that the linear model
E(y) = A + B1x1 + B2x2 is useful for prediction of the grain yield.
ANOVA

Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   2632048.153      2    1316024.076   11.356   .014
Residual     579455.347       5    115891.069
Total        3211503.500      7
We will use the model to form a confidence interval for the mean E(y) for given values x* of
the independent variables, or a prediction interval for a future value of y for specific x*.
The procedures are shown in the following boxes.

A (1 - alpha)100% CONFIDENCE INTERVAL FOR E(y)

\hat{y} \pm t_{\alpha/2}\, s \sqrt{(x^*)'(X'X)^{-1}x^*}

where \hat{y} = b_0 + b_1 x_1^* + b_2 x_2^* + \dots + b_k x_k^*.

A (1 - alpha)100% PREDICTION INTERVAL FOR y

\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + (x^*)'(X'X)^{-1}x^*}

where \hat{y} = b_0 + b_1 x_1^* + b_2 x_2^* + \dots + b_k x_k^*.
The following table gives, for 20 residential properties, the sale price y, the land value x1,
the value of improvements x2, and the area x3.

Sale price, y   Land value, x1   Improvements value, x2   Area, x3
68900           5960             44967                    1873
48500           9000             27860                    928
55500           9500             31439                    1126
62000           10000            39592                    1265
116500          18000            72827                    2214
45000           8500             27317                    912
38000           8000             29856                    899
83000           23000            47752                    1803
59000           8100             39117                    1204
47500           9000             29349                    1725
40500           7300             40166                    1080
40000           8000             31679                    1529
97000           20000            58510                    2455
45500           8000             23454                    1151
40900           8000             20897                    1173
80000           10500            56248                    1960
56000           4000             20859                    1344
37000           4500             22610                    988
50000           3400             35948                    1076
22400           1500             5779                     962
s^2 = \frac{SSE}{n - (k + 1)}, \quad \text{where } SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
The multiple coefficient of determination is

R^2 = 1 - \frac{SSE}{SS_{yy}}.

You can see in the printout in Figure 12.6 that SSE = 1003491259 (in column Sum of Squares
and row Error) and SS_yy = 9783168000 (in column Sum of Squares and row Total), and R^2 is
given as R-squared = 0.897427. This large value of R^2 indicates that the model provides a good
fit to the n = 20 sample data points.
b) Usefulness of the model
Test H0: B1 = B2 = B3 = 0 (null hypothesis) against Ha: at least one Bi != 0 (alternative
hypothesis).
Test statistic:

F = \frac{\text{Mean Square for Model}}{\text{Mean Square for Error}} = \frac{SS(\text{Model})/k}{SSE/[n - (k+1)]} = \frac{R^2/k}{(1 - R^2)/[n - (k+1)]}

In the printout, F = 46.6620; the observed significance level for this test is 0.0000 (under
the column P-value). This implies that we would reject the null hypothesis at any level, for
example 0.01. Thus, we have strong evidence to reject H0 and conclude that the model is useful
for predicting the sale price of residential properties.
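The F statistic can be recomputed from R^2 alone. The short sketch below is our illustration with the real-estate values (R^2 = 0.897427, n = 20 observations, k = 3 independent variables).

def global_f(r2: float, n: int, k: int) -> float:
    """F statistic for H0: B1 = ... = Bk = 0, computed from R-squared."""
    return (r2 / k) / ((1 - r2) / (n - (k + 1)))

print(round(global_f(0.897427, 20, 3), 2))   # about 46.66, matching the printout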
Model fitting results for: ESTATE.Y
--------------------------------------------------------------------------------
Independent variable   coefficient   std. error    t-value   sig.level
--------------------------------------------------------------------------------
CONSTANT               1470.275919   5746.324583   0.2559    0.8013
ESTATE.X1              0.814490      0.512219      1.5901    0.1314
ESTATE.X2              0.820445      0.211185      3.8850    0.0013
ESTATE.X3              13.528650     6.585680      2.0543    0.0567
--------------------------------------------------------------------------------
R-SQ. (ADJ.) = 0.8782   SE = 7919.482541   MAE = 5009.367657   DurbWat = 1.242
Previously:    0.0000        0.000000           0.000000                 0.000
20 observations fitted, forecast(s) computed for 0 missing val. of dep. var.
(1) Construct a confidence interval for E(y) for particular values of the independent
variables.
Estimate the mean sale price, E(y), for a property with x1 = 15000, x2 = 50000 and x3 = 1800,
using a 95% confidence interval. Substituting these particular values of the independent
variables into the least squares prediction equation yields the predicted value 79061.4. In the
printout reproduced in Figure 12.7, the 95% confidence interval for the mean sale price
corresponding to the given (x1, x2, x3) is (73379.3, 84743.6).
Regression results for ESTATE.Y

Observation   Observed   Fitted    Lower 95% CL   Upper 95% CL
Number        Values     Values    for means      for means
1             68900      68556.7
2             48500      44212.9
3             55500      50235.2
4             62000      59212
5             116500     105834
6             45000      43143.7
7             38000      44643.6
8             83000      83773.6
9             59000      56449.5
10            47500      56216.8
11            40500      54981
12            40000      54662.4
13            97000      98977.1
14            45500      42800.4
15            40900      41000.1
16            80000      82686.9
17            56000      40024.4
18            37000      37052
19            50000      48289.7
20            22400      20447.9
21                       79061.4   73379.3        84743.6
(2) Construct a prediction interval for an individual y. The same regression results, printed
with prediction (forecast) limits, give for observation 21 (x1 = 15000, x2 = 50000, x3 = 1800)
the fitted value 79061.4 with

Lower 95% CL for forecasts: 61333.4,   Upper 95% CL for forecasts: 96789.4.

Note that this prediction interval is wider than the confidence interval for the mean.
For example, for the first-order model

E(y) = 1 + 2x1 - x2,

the graphs of E(y) as a function of x1 for x2 = 0, x2 = 2 and x2 = -3 are depicted in Figure
12.9. When this situation occurs (as it always does for a first-order model), we say that the
relationship between E(y) and any one independent variable does not depend on the value of
the other independent variable(s) in the model; that is, we say that the independent
variables do not interact.

Figure 12.9 Graphs of E(y) = 1 + 2x1 - x2 for x2 = 0, x2 = 2 and x2 = -3 (parallel straight
lines)
However, if the relationship between E(y) and x1 does, in fact, depend on the value of x2 held
fixed, then the first-order model is not appropriate for predicting y. In this case we need
another model that will take into account this dependence. Such a model is illustrated in the
next example.
Example 12.8 Suppose that the mean value E(y) of a response y is related to two quantitative
variables x1 and x2 by the model

E(y) = 1 + 2x1 - x2 + x1x2.

Graph the relationship between E(y) and x1 for x2 = 0, 2 and -3. Interpret the graph.
Solution For fixed values of x2, E(y) is a linear function of x1. Graphs of the straight lines
of E(y) for x2 = 0, 2 and -3 are depicted in Figure 12.10. Note that the slope of each line is
represented by 2 + x2. The effect of adding a term involving the product x1x2 can be seen in
the figure. In contrast to Figure 12.9, the lines relating E(y) to x1 are no longer parallel.
The effect on E(y) of a change in x1 (i.e., the slope) now depends on the value of x2. When
this situation occurs, we say that x1 and x2 interact. The cross-product term, x1x2, is called
an interaction term, and the model E(y) = B0 + B1x1 + B2x2 + B3x1x2 is called an interaction
model with two independent variables.

Figure 12.10 Graphs of E(y) = 1 + 2x1 - x2 + x1x2 for x2 = 0, x2 = 2 and x2 = -3 (nonparallel
lines)
12.11 Summary
In this chapter we have discussed some of the methodology of multiple regression analysis, a
technique for modeling a dependent variable y as a function of several independent variables
x1 , x 2 ,..., x k . The steps employed in a multiple regression analysis are much the same as those
employed in a simple regression analysis:
1. The form of the probabilistic model is hypothesized.
2. The appropriate model assumptions are made.
3. The model coefficients are estimated using the method of least squares.
4. The utility of the model is checked using the overall F-test and t-tests on individual
B parameters.
5. If the model is deemed useful and the assumptions are satisfied, it may be used to make
estimates and to predict values of y to be observed in the future.
12.12 Exercises
1. Suppose you fit the first-order multiple regression model
y = B0 + B1x1 + B2x2 + e
to n = 20 data points and obtain the prediction equation
Grain yield, kg/ha (y)   Nitrogen rate, kg/ha (x)
4878                     0
5506                     30
6083                     60
6291                     90
6361                     120
Independent variable   coefficient   std. error   t-value    sig.level
CONSTANT               4861.457143   47.349987    102.6707   0.0001
x                      26.646190     1.869659     14.2519    0.0049
x*x                    -0.117857     0.014941     -7.8884    0.0157

SE = 50.312168   MAE = 25.440000   DurbWat = 3.426

Analysis of Variance
Source          Sum of Squares   DF   Mean Square   F-Ratio   P-value
Model           1564516          2    782258        309.032   0.0032
Error           5062.63          2    2531.31
Total (Corr.)   1569579          4

R-squared = 0.996775
Chapter 13
Nonparametric statistics
CONTENTS
13.1 Introduction
13.2 The sign test for a single population
13.3 Comparing two populations based on independent random samples: the Wilcoxon rank sum test
13.4 Comparing two populations based on matched pairs: the Wilcoxon signed ranks test
13.5 Comparing populations using a completely randomized design: the Kruskal-Wallis H test
13.6 Rank correlation: Spearman's r_s statistic
13.7 Summary
13.8 Exercises
----------------------------------------------------------------------------------------------------------------
13.1. Introduction
The majority of hypothesis tests (t- and F-tests) discussed so far have made inferences about
population parameters, such as the mean and the proportion. These parametric tests have used
the parametric statistics of samples that came from the population being tested. To formulate
these tests, we made restrictive assumptions about the populations from which we drew our
samples. In each case in Chapter 9, for example, we assumed that our samples either were
large or came from normally distributed populations. But populations are not always normal.
And even if a goodness-of-fit test indicates that a population is approximately normal, we
cannot always be certain we're right, because the test is not 100 percent reliable. Clearly,
there are certain situations in which the use of the normal curve is not appropriate.
Another case in which the t- and F-tests are inappropriate is when the data are not
measurements but can be ranked in order of magnitude. For example, suppose we want to
compare the ease of operation of two types of computer software based on subjective
evaluations by trained observers. Although we cannot give an exact value to the variable "ease
of operation of the software package", we may be able to decide that package A is better than
package B. If packages A and B are evaluated by each of ten observers, we have the standard
problem of comparing the probability distributions for two populations of ratings: one for
package A and one for package B. But the t-test of Chapter 9 would be inappropriate, because
the only data that can be recorded are preferences; that is, each observer decides either that
A is better than B or vice versa.
For these two types of situations statisticians have developed useful techniques called
nonparametric methods or nonparametric statistics. The nonparametric counterparts of the t-
and F-tests compare the relative locations of the probability distributions of the sampled
populations, rather than specific parameters of these populations (such as the means or
variances). Many nonparametric methods use the relative ranks of the sample observations
rather than their actual numerical values.
A large number of nonparametric tests exist, but this chapter will examine only a few of the
better known and more widely used ones.
ONE-TAILED TEST
H0: M = M0
Ha: M > M0 (or Ha: M < M0)
TWO-TAILED TEST
H0: M = M0
Ha: M != M0
Test statistic:
S = number of sample observations greater than M0
(or S = number of sample observations less than M0)
Observed significance level: p-value = P(S >= Sc) for a one-tailed test (doubled for a
two-tailed test), where Sc is the computed value of the test statistic and S has a binomial
distribution with parameters n and p = 0.5.
Rejection region: Reject H0 if alpha > p-value.
Example 13.1 Suppose from a population the following sample is randomly selected:
41 33 43 52 46 37 44 49 53 30.
Do the data provide sufficient evidence to indicate that the median of the population is
greater than 40? Test using alpha = 0.05.
Solution We want to test
H0: M = 40
Ha: M > 40
using the sign test. The test statistic is S = the number of sample observations greater than
40, which here is S = 7 (only 33, 37 and 30 fall below 40). The observed significance level is
p-value = P(S >= 7), where S has a binomial distribution with n = 10 and p = 0.5; this gives
p-value = 0.172. Since alpha = 0.05 is less than the p-value, we cannot reject H0: there is
insufficient evidence that the median exceeds 40.
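The binomial p-value is one line of code. This minimal sketch is our illustration, assuming the one-tailed p-value convention of the box above; scipy is assumed to be available.

from scipy.stats import binom

sample = [41, 33, 43, 52, 46, 37, 44, 49, 53, 30]
m0 = 40
s_obs = sum(1 for v in sample if v > m0)          # 7 observations above 40
p_value = binom.sf(s_obs - 1, len(sample), 0.5)   # P(S >= 7) = 0.172
print(s_obs, round(p_value, 3))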
SIGN TEST BASED ON A LARGE SAMPLE (n >= 10)

ONE-TAILED TEST
H0: M = M0
Ha: M > M0 (or Ha: M < M0)
TWO-TAILED TEST
H0: M = M0
Ha: M != M0
Test statistic:

z = \frac{S - E(S)}{\sigma(S)} = \frac{S - 0.5n}{\sqrt{(0.5)(0.5)n}} = \frac{S - 0.5n}{0.5\sqrt{n}}

where S = number of sample observations greater than M0 (or S = number of sample observations
less than M0).
Rejection region: z > z_alpha for a one-tailed test; |z| > z_{alpha/2} for a two-tailed test.
Example 13.2 Refer to Example 13.1, using the sign test based on the z-statistic.
Solution For this example the software STATGRAPHICS provides a printout based on the 10
observations.
ONE-TAILED TEST
H0: The sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right (or to the left) of
that for population 2
Test statistic: T1 if n1 < n2, or T2 if n2 < n1 (either rank sum if n1 = n2)
Rejection region: T1 >= TU (or T1 <= TL) if T1 is the test statistic; T2 <= TL (or T2 >= TU)
if T2 is the test statistic

TWO-TAILED TEST
H0: The sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right or to the left of
that for population 2
Test statistic: T1 if n1 <= n2; T2 if n2 <= n1. We will denote this rank sum as T.
Rejection region: T >= TU or T <= TL
where TU and TL are obtained from Table 1 of Appendix D.
Example 13.3 Independent random samples were selected from two populations. The data are
shown in Table 13.1. Is there sufficient evidence to indicate that population 1 is shifted to
the right of population 2? Test using alpha = 0.05.

Table 13.1

Sample from Population 1: 17 14 12 16 23 18 10 8 19 22
Sample from Population 2: 10 15 7 6 13 11 12 9 17 14
Solution The ranks of the 20 observations, from lowest to highest, are shown in Table 13.2.
We test
H0: The sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right of that for
population 2
The test statistic is T2 = 78. Examining Table 13.3, we find that the critical values
corresponding to n1 = n2 = 10 are TL = 79 and TU = 131. Therefore, for a one-tailed test at
alpha = 0.025, we will reject H0 if T2 <= TL, i.e., reject H0 if T2 <= 79. Since the observed
value of the test statistic, T2 = 78, is less than 79, we reject H0 and conclude (at
alpha = 0.025) that the probability distribution for population 1 is shifted to the right of
that for population 2.
Table 13.2

Sample from Population 1      Sample from Population 2
Raw data   Rank               Raw data   Rank
17         15.5               10         5.5
14         11.5               15         13
12         8.5                7          2
16         14                 6          1
23         20                 13         10
18         17                 11         7
10         5.5                12         8.5
8          3                  9          4
19         18                 17         15.5
22         19                 14         11.5
T1 = 132                      T2 = 78
Table 13.3 Critical values TL and TU for the Wilcoxon rank sum test

      n1 = 3    n1 = 4    n1 = 5    n1 = 6    n1 = 7    n1 = 8    n1 = 9    n1 = 10
n2    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU
3     5   16    6   18    6   21    7   23    7   26    8   28    8   31    9   33
4     6   18    11  25    12  28    12  32    13  35    14  38    15  41    16  44
5     6   21    12  28    18  37    19  41    20  45    21  49    22  53    24  56
6     7   23    12  32    19  41    26  52    28  56    29  61    31  65    32  70
7     7   26    13  35    20  45    28  56    37  68    39  73    41  78    43  83
8     8   28    14  38    21  49    29  61    39  73    49  87    51  93    54  98
9     8   31    15  41    22  53    31  65    41  78    51  93    63  108   66  114
10    9   33    16  44    24  56    32  70    43  83    54  98    66  114   79  131
Many nonparametric test statistics have sampling distributions that are approximately normal
when n1 and n2 are large. For these situations we can test hypotheses using the large-sample
z-test.

THE WILCOXON RANK SUM TEST FOR LARGE SAMPLES (n1 >= 10 and n2 >= 10)

ONE-TAILED TEST
Ha: The probability distribution for population 1 is shifted to the right (or to the left) of
that for population 2
TWO-TAILED TEST
Ha: The probability distribution for population 1 is shifted to the right or to the left of
that for population 2
Test statistic:

z = \frac{T_1 - \dfrac{n_1 n_2 + n_1(n_1 + 1)}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}

Rejection region: z > z_alpha (or z < -z_alpha) for a one-tailed test;
z < -z_{alpha/2} or z > z_{alpha/2} for a two-tailed test.
Example 13.4 Refer to Example 13.3. Using the above large-sample z-test, check whether
there is sufficient evidence to indicate that population 1 is shifted to the right of
population 2. Test using alpha = 0.05.
Solution We do this example with the help of a computer using STATGRAPHICS. The printout
is given in Figure 13.2.
Comparison of Two Samples
------------------------------------------------------------------------------Sample 1: 17 14 12 16 23 18 10 8 19 22
Sample 2: 10 15 7 6 13 11 12 9 17 14
Test: Unpaired
Average rank of first group = 13.2 based on 10 values.
Average rank of second group = 7.8 based on 10 values.
Large sample test statistic Z = -2.00623
Two-tailed probability of equaling or exceeding Z = 0.0448313
NOTE:
20 total observations.
From the printout we see that the computed test statistic is z_c = -2.00623 and the two-tailed
probability is P(|z| >= |z_c|) = 0.0448313. Therefore, the one-tailed p-value is 0.0224. Hence,
at any significance level alpha >= 0.023 we reject the null hypothesis and conclude that the
probability distribution for population 1 is shifted to the right of that for population 2.
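The rank sum and the large-sample statistic can be recomputed as follows. This sketch uses the plain formula of the box above, so its value (about 2.04) differs slightly from the printout's -2.006, which incorporates a tie correction and an opposite sign convention.

from math import sqrt
from scipy.stats import rankdata

s1 = [17, 14, 12, 16, 23, 18, 10, 8, 19, 22]
s2 = [10, 15, 7, 6, 13, 11, 12, 9, 17, 14]

ranks = rankdata(s1 + s2)     # average ranks for ties
t1 = ranks[:len(s1)].sum()    # T1 = 132
n1, n2 = len(s1), len(s2)

z = (t1 - (n1 * n2 + n1 * (n1 + 1)) / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(t1, round(z, 3))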
ONE-TAILED TEST
H0: The sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right of that for
population 2
Test statistic: T-, the sum of the ranks of the negative differences of the matched pairs
Rejection region: T- <= T0

TWO-TAILED TEST
H0: The sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right or to the left of
that for population 2
Test statistic: T = min(T-, T+), the smaller of the negative and positive rank sums of the
differences
Rejection region: T <= T0
where T0 is the critical value given in Table 13.5. (Differences equal to 0 are eliminated, and
the number n of pairs is reduced accordingly; tied absolute differences receive average ranks.)
Example 13.5 A company wants to compare the quality of a product before (B) and after
(A) introducing a new technology. Each of ten customers rates the quality of each product on a
scale from 1 to 10. The results of the experiment are shown in Table 13.4. Is there sufficient
evidence to indicate that the product after introducing the new technology is rated higher than
the one before? Test using alpha = 0.05.
Table 13.4

Customer   Product A   Product B   Difference (A - B)   |A - B|   Rank of |A - B|
1          6           4           2                    2         5
2          8           5           3                    3         7.5
3          4           5           -1                   1         2
4          9           8           1                    1         2
5          4           1           3                    3         7.5
6          7           9           -2                   2         5
7          6           2           4                    4         9
8          5           3           2                    2         5
9          6           7           -1                   1         2
10         8           2           6                    6         10

T+ = Sum of positive ranks = 46
T- = Sum of negative ranks = 9
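The rank sums of Table 13.4 can be verified with a short sketch (ours, not the book's; zero differences, of which there are none here, would be dropped before ranking).

from scipy.stats import rankdata

a = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]   # after the new technology
b = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]   # before the new technology

d = [ai - bi for ai, bi in zip(a, b) if ai != bi]   # drop zero differences
r = rankdata([abs(di) for di in d])                 # average ranks of |A - B|

t_plus = sum(ri for di, ri in zip(d, r) if di > 0)    # 46
t_minus = sum(ri for di, ri in zip(d, r) if di < 0)   # 9
print(t_plus, t_minus)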
Table 13.5 Critical values T0 for the Wilcoxon signed ranks test

One-tailed alpha   Two-tailed alpha   n=5    n=6    n=7    n=8    n=9    n=10
.05                .10                1      2      4      6      8      11
.025               .05                       1      2      4      6      8
.01                .02                              0      2      3      5
.005               .01                                     0      2      3

One-tailed alpha   Two-tailed alpha   n=11   n=12   n=13   n=14   n=15   n=16
.05                .10                14     17     21     26     30     36
.025               .05                11     14     17     21     25     30
.01                .02                7      10     13     16     20     24
.005               .01                5      7      10     13     16     19

One-tailed alpha   Two-tailed alpha   n=17   n=18   n=19   n=20   n=21   n=22
.05                .10                41     47     54     60     68     75
.025               .05                35     40     46     52     59     66
.01                .02                28     33     38     43     49     56
.005               .01                23     28     32     37     43     49

One-tailed alpha   Two-tailed alpha   n=23   n=24   n=25   n=26   n=27   n=28
.05                .10                83     92     101    110    120    130
.025               .05                73     81     90     98     107    117
.01                .02                62     69     77     85     93     102
.005               .01                55     61     68     76     84     92

For the data of Table 13.4, n = 10 and the one-tailed critical value at alpha = .05 is T0 = 11.
Since the test statistic T- = 9 is less than or equal to 11, we reject H0 and conclude that the
product after introducing the new technology tends to be rated higher than the product before.
The Wilcoxon signed ranks test statistic has a sampling distribution that is approximately normal
when the number n of pairs is large, say n ≥ 25. This large-sample nonparametric matched pairs
test is summarized in the following box.
WILCOXON SIGNED RANKS TEST FOR LARGE SAMPLES (n ≥ 25)

ONE-TAILED TEST
H0: The two sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right of that for population 2
(or population 1 is shifted to the left of population 2)
Rejection region: z > zα (or z < −zα)

TWO-TAILED TEST
H0: The two sampled populations have identical probability distributions
Ha: The probability distribution for population 1 is shifted to the right or to the left of that for population 2
Rejection region: z < −zα/2 or z > zα/2

Test statistic (both tests):

z = [T+ − n(n + 1)/4] / sqrt[n(n + 1)(2n + 1)/24]

where zα and zα/2 are tabulated values given in any table of normal curve areas
(see Table 1 of Appendix C).
Example 13.6 Suppose from each of two populations we select a sample, obtaining 30
matched pairs:

Sample 1: 4 5 5 6 6 4 7 8 6 9 7 4 10 7 6 8 5 4 6 7 9 7 4 6 7 9 6 10 9 5
Sample 2: 7 8 8 9 7 8 5 9 6 8 3 7 5 7 5 8 9 4 6 8 4 6 7 9 10 6 8 5 7 6

Use the Wilcoxon signed ranks test to check whether the probability distributions of the
populations are identical.
Solution Using STATGRAPHICS we obtain the following printout.
[STATGRAPHICS printout for the signed ranks test — NOTE: 30 total pairs.]
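The large-sample test can be reproduced in Python (a sketch assuming the pairing reconstructed above; scipy's default zero_method drops the zero differences and, with ties present, its p-value comes from the normal approximation of the boxed formula):

# Wilcoxon signed ranks test for the 30 matched pairs of Example 13.6.
from scipy.stats import wilcoxon

s1 = [4, 5, 5, 6, 6, 4, 7, 8, 6, 9, 7, 4, 10, 7, 6,
      8, 5, 4, 6, 7, 9, 7, 4, 6, 7, 9, 6, 10, 9, 5]
s2 = [7, 8, 8, 9, 7, 8, 5, 9, 6, 8, 3, 7, 5, 7, 5,
      8, 9, 4, 6, 8, 4, 6, 7, 9, 10, 6, 8, 5, 7, 6]

stat, p = wilcoxon(s1, s2)   # two-sided test of identical distributions
print(stat, p)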
13.5 Kruskal-Wallis H-test for comparing k population probability distributions

The Kruskal-Wallis H-test compares k populations on the basis of independent random samples,
using the rank sum Ri of each sample in the ranking of all n pooled observations. When H0 (the
k populations have identical probability distributions) is true and each sample contains at least 5
measurements, the test statistic

H = [12 / (n(n + 1))] Σi=1..k (Ri² / ni) − 3(n + 1)

will have a sampling distribution that can be approximated by a chi-square distribution with (k −
1) degrees of freedom. Large values of H imply rejection of H0. Therefore, the rejection region
for the test is H > χ²α, where χ²α is the value that locates α in the upper tail of the chi-square
distribution.
The test is summarized in the following box.
KRUSKAL-WALLIS H-TEST FOR COMPARING k POPULATION PROBABILITY
DISTRIBUTIONS

H0: The k population probability distributions are identical
Ha: At least two of the k population probability distributions differ in location

Test statistic:

H = [12 / (n(n + 1))] Σi=1..k (Ri² / ni) − 3(n + 1)

where
  ni = number of measurements in sample i
  Ri = rank sum of sample i, the ranks being computed in the pooled set of all n observations
  n = n1 + n2 + ... + nk

Rejection region: H > χ²α with (k − 1) degrees of freedom
Example 13.7 Independent random samples of three different brands of magnetron tubes were
subjected to stress testing, and the number of hours each operated without repair was recorded.
Although these times do not represent typical lifetimes, they do indicate how well the tubes can
withstand extreme stress. The data are shown in the table below. Experience has shown that the
distributions of lifetimes for manufactured products are usually non-normal.
A     B     C
36    49    71
48    33    31
5     60    140
67    2     59
53    55    42
Use the Kruskal-Wallis H-test to determine whether evidence exists to conclude that the brands
of magnetron tubes tend to differ in length of life under stress. Test using α = 0.05.
Solution The first step in performing the Kruskal-Wallis H-test is to rank the n = 15
observations in the complete data set. The ranks and rank sums for the three samples are shown
in Table 13.6.
Table 13.6
A     RANK      B     RANK      C     RANK
36    5         49    8         71    14
48    7         33    4         31    3
5     2         60    12        140   15
67    13        2     1         59    11
53    9         55    10        42    6
      R1 = 36         R2 = 35         R3 = 49
H = [12 / (n(n + 1))] Σi=1..k (Ri² / ni) − 3(n + 1)
  = [12 / ((15)(16))] [(36)²/5 + (35)²/5 + (49)²/5] − 3(16) = 1.22
The rejection region for the H-test is H > χ²α with df = k − 1 = 3 − 1 = 2. For α = 0.05 and df = 2,
χ²0.05 = 5.99147. Since the computed value H = 1.22 is less than 5.99147, we cannot reject H0:
there is insufficient evidence to indicate a difference in location among the distributions of
lifetimes for the three brands of magnetron tubes.
For this example the STATGRAPHICS printout is given in Figure 13.3. In the printout we see
that Test statistic = 1.22 and Significance level = 0.543351. Therefore, at significance level α =
0.05 we cannot reject the hypothesis H0.
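The same H and significance level can be obtained with scipy (a minimal sketch; since the 15 lifetimes contain no ties, scipy's tie correction leaves H unchanged):

# Kruskal-Wallis H-test for the magnetron tube data of Example 13.7.
from scipy.stats import kruskal

A = [36, 48, 5, 67, 53]
B = [49, 33, 60, 2, 55]
C = [71, 31, 140, 59, 42]

H, p = kruskal(A, B, C)
print(H, p)   # H = 1.22, p about 0.543, matching the printout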
13.6 Spearman's rank correlation coefficient

Several different nonparametric statistics have been developed to measure and to test for
correlation between two random variables. One of these statistics is Spearman's rank
correlation coefficient rs.
The first step in finding rs is to rank the values of each of the variables separately; ties are
treated by averaging the tied ranks. Then rs is computed in exactly the same way as the simple
correlation coefficient r. The only difference is that the values of x and y that appear in the
formula for rs denote the ranks of the raw data rather than the raw data themselves.
Formulas for computing Spearman's rank correlation coefficient

Rank the values for each of the variables and let x and y denote the ranks of a pair
of observations. Then

rs = SSxy / sqrt(SSxx SSyy)

where

SSxx = Σ(x − x̄)², SSyy = Σ(y − ȳ)², SSxy = Σ(x − x̄)(y − ȳ)

When there are no (or only a few) tied ranks, the shortcut formula

rs = 1 − 6Σd² / (n(n² − 1))

may be used, where d is the difference between the ranks of a pair of observations and n is
the number of pairs.
The nonparametric test of hypothesis for rank correlation is shown in the following box.

SPEARMAN'S NONPARAMETRIC TEST FOR RANK CORRELATION

ONE-TAILED TEST
H0: There is no rank correlation between the variables
Ha: The variables are positively (or negatively) rank correlated
Test statistic: rs
Rejection region: rs ≥ r0 (or rs ≤ −r0)

TWO-TAILED TEST
H0: There is no rank correlation between the variables
Ha: The variables are rank correlated
Test statistic: rs
Rejection region: rs ≥ r0 or rs ≤ −r0

where the value of r0 is given in Table 3 of Appendix D.
Example 13.8 A large manufacturing firm wants to determine whether a relationship exists
between the number of work-hours an employee misses per year and the employee's annual
wages (in thousands of dollars). A sample of 15 employees produced the data shown in
Table 13.7.
Table 13.7
EMPLOYEE   HOURS   WAGES
1          49      15.8
2          36      17.5
3          127     11.3
4          91      13.2
5          72      13.0
6          34      14.5
7          155     11.8
8          11      20.2
9          191     10.8
10         10      18.8
11         63      13.8
12         79      12.7
13         43      15.1
14         57      24.2
15         82      13.9
HOURS   RANK   WAGES   RANK   di     di²
49      6      15.8    11     −5     25
36      4      17.5    12     −8     64
127     13     11.3    2      11     121
91      12     13.2    6      6      36
72      9      13.0    5      4      16
34      3      14.5    9      −6     36
155     14     11.8    3      11     121
11      2      20.2    14     −12    144
191     15     10.8    1      14     196
10      1      18.8    13     −12    144
63      8      13.8    7      1      1
79      10     12.7    4      6      36
43      5      15.1    10     −5     25
57      7      24.2    15     −8     64
82      11     13.9    8      3      9
                              Σdi² = 1038
rs = 1 − 6Σd² / (n(n² − 1)) = 1 − 6(1038) / (15(224)) = −0.854

This large negative value of rs implies that a negative correlation exists between work-hours
missed and annual wages in the sample of 15 employees.
To test H0: no correlation exists between work-hours missed and annual wages in the
population, against Ha: work-hours missed and annual wages are negatively correlated, we
use rs as the test statistic and obtain the critical value r0 from Table 3 of Appendix D.
That table gives the critical values r0 for an upper-tailed test, i.e., a test to detect a positive
rank correlation; for our lower-tailed test we therefore reject the null hypothesis in favor of the
alternative if the computed rs statistic is less than or equal to −r0. For α = 0.01 and n = 15,
r0 = 0.623. Since our computed rs = −0.854 ≤ −0.623, we reject H0 and conclude that there is
ample evidence to indicate that work-hours missed decrease as annual wages increase.
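In Python the coefficient can be checked with scipy (a minimal sketch; spearmanr computes rs as the Pearson correlation of the ranks, and its p-value rests on a t approximation, so it need not equal the significance level printed by STATGRAPHICS):

# Spearman rank correlation for the HOURS/WAGES data of Example 13.8.
from scipy.stats import spearmanr

hours = [49, 36, 127, 91, 72, 34, 155, 11, 191, 10, 63, 79, 43, 57, 82]
wages = [15.8, 17.5, 11.3, 13.2, 13.0, 14.5, 11.8, 20.2, 10.8, 18.8,
         13.8, 12.7, 15.1, 24.2, 13.9]

rs, p = spearmanr(hours, wages)
print(rs, p)   # rs about -0.854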
Below we reproduce the STATGRAPHICS printout for our example. In this printout we see that
the rank correlation coefficient between the variable HOURS (work-hours missed) and the
variable WAGES (annual wages) is −0.8536 and the significance level is 0.0014. Since the
observed significance level is very small, it is natural to reject the null hypothesis.
          HOURS         WAGES
HOURS     1.0000        -0.8536
          (   15)       (   15)
          1.0000        0.0014
WAGES     -0.8536       1.0000
          (   15)       (   15)
          0.0014        1.0000
13.7 Summary
We have presented several useful nonparametric techniques for testing the location of a single
population, or for comparing two or more populations. Nonparametric techniques are useful
when the underlying assumptions for their parametric counterparts are not justified or when it is
impossible to assign specific values to the observations. Nonparametric methods provide more
general comparisons of populations than parametric methods, because they compare the
probability distributions of the populations rather than specific parameters.
Rank sums are the primary tools of nonparametric statistics. The Wilcoxon rank sum test can
be used to compare two populations based on independent random samples, and the Wilcoxon
signed ranks test can be used for a matched-pairs experiment. The Kruskal-Wallis H-test is
applied when comparing k populations using a completely randomized design.
13.8 Exercises
1. Suppose you want to use the sign test to test the null hypothesis that the population median
equals 75, i.e., H0: M = 75. Use the table of binomial probabilities to find the observed
significance level (p-value) of the test for each of the following situations:
a) Ha: M > 75, n = 5, S = 2
b) Ha: M ≠ 75, n = 15, S = 9
c) Ha: M < 75, n = 10, S = 7
2. A random sample of 8 observations from a continuous population resulted in the following:
17 16.5 20 18.2 19.6 14.9 21.1 19.4
Is there sufficient evidence to indicate that the population median differs from 20? Test using
α = 0.05.
3. Independent random samples were selected from two populations. The data are shown in
the table.

Sample from population 1:  15 16 13 14 12 17 13
Sample from population 2:  10
a) Use the Wilcoxon rank sum test to determine whether the data provide sufficient evidence
to indicate a shift in the locations of the probability distributions of the sampled
populations. Test using α = 0.05.
b) Do the data provide sufficient evidence to indicate that the probability distribution for
population 1 is shifted to the right of the probability distribution for population 2? Use the
Wilcoxon rank sum test with α = 0.05.
4. The following data show employee rates of defective work before and after a change in the
wage incentive plan. Compare the two sets of data to see if the change lowered the
defective units produced. (Use the Wilcoxon signed rank test for a matched pairs design with
α = 0.01.)

Before:  10 10
After:   10
5. The following table shows sample retail prices for three brands of shoes. Use the Kruskal-
Wallis test to determine if there is any difference among the retail prices of the brands
throughout the country. Use the 0.05 level of significance.
Brand A   Brand B   Brand C
$89       $78       $80
90        93        88
92        81        86
81        87        85
76        89        79
88        71        80
85        90        84
95        96        85
97        82        90
86        85        92
                    100
6. A random sample of seven pairs of observations is recorded on two variables, X and Y.
The data are shown in the table. Use Spearman's nonparametric test for rank correlation to
answer the following:
a) Do the data provide sufficient evidence to conclude that the rank correlation between X
and Y is greater than 0? Test using α = 0.05.
b) Do the data provide sufficient evidence to conclude that the rank correlation between X
and Y is not 0? Test using α = 0.05.
X:  65 57 55 38 29 43 49
Y:  58 61 58 23 34 38 37
7. Below are ratings of aggressiveness (X) and amount of sales in the last year (Y) for eight
salespeople. Is there a significant rank correlation between the two measures? Use the 0.05
significance level.

X:  30 17 35 28 42 25 19 29
Y:  35 31 43 46 50 32 33 42
References
1. Berenson, M.L. and Levine, D.M., Basic Business Statistics: Concepts and Applications, 4th
ed., Englewood Cliffs, NJ: Prentice Hall, 1989.
2. McClave, J.T. and Dietrich, F.H., Statistics, 4th ed., San Francisco: Dellen, 1988.
3. Fahrmeir, L. and Tutz, G., Multivariate Statistical Modelling Based on Generalized Linear
Models, New York: Springer-Verlag, 1994.
4. Gnedenko, B.V., The Theory of Probability, New York: Chelsea Publishing Company, 1962.
5. Goldstein, H. (ed.), Multilevel Statistical Models, London: Edward Arnold, 1995.
6. Iman, R.L. and Conover, W.J., Modern Business Statistics, 2nd ed., New York, NY: John
Wiley & Sons, 1989.
7. Gomez, K.A. and Gomez, A.A., Statistical Procedures for Agricultural Research, John
Wiley & Sons, 1982.
8. Levin, R.I. and Rubin, D.S., Statistics for Management, 5th ed., Englewood Cliffs, NJ:
Prentice Hall, 1991.
9. Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics, W.H. Freeman and
Company, 1989.
10. Mendenhall, W. and Sincich, T., Statistics for the Engineering and Computer Sciences, 2nd
ed., Dellen Publishing Company, 1989.
11. Rosenbaum, P.R., Observational Studies, New York: Springer-Verlag, 1995.
12. STATGRAPHICS Plus, Reference Manual, Manugistics Inc., 1992.
Appendix C
Table 1 Normal Curve Areas
(Each entry is the area under the standard normal curve between 0 and z.)

z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0   .0000  .0040  .0080  .0120  .0160  .0199  .0239  .0279  .0319  .0359
0.1   .0398  .0438  .0478  .0517  .0557  .0596  .0636  .0675  .0714  .0753
0.2   .0793  .0832  .0871  .0910  .0948  .0987  .1026  .1064  .1103  .1141
0.3   .1179  .1217  .1255  .1293  .1331  .1368  .1406  .1443  .1480  .1517
0.4   .1554  .1591  .1628  .1664  .1700  .1736  .1772  .1808  .1844  .1879
0.5   .1915  .1950  .1985  .2019  .2054  .2088  .2123  .2157  .2190  .2224
0.6   .2257  .2291  .2324  .2357  .2389  .2422  .2454  .2486  .2517  .2549
0.7   .2580  .2611  .2642  .2673  .2704  .2734  .2764  .2794  .2823  .2852
0.8   .2881  .2910  .2939  .2967  .2995  .3023  .3051  .3078  .3106  .3133
0.9   .3159  .3186  .3212  .3238  .3264  .3289  .3315  .3340  .3365  .3389
1.0   .3413  .3438  .3461  .3485  .3508  .3531  .3554  .3577  .3599  .3621
1.1   .3643  .3665  .3686  .3708  .3729  .3749  .3770  .3790  .3810  .3830
1.2   .3849  .3869  .3888  .3907  .3925  .3944  .3962  .3980  .3997  .4015
1.3   .4032  .4049  .4066  .4082  .4099  .4115  .4131  .4147  .4162  .4177
1.4   .4192  .4207  .4222  .4236  .4251  .4265  .4279  .4292  .4306  .4319
1.5   .4332  .4345  .4357  .4370  .4382  .4394  .4406  .4418  .4429  .4441
1.6   .4452  .4463  .4474  .4484  .4495  .4505  .4515  .4525  .4535  .4545
1.7   .4554  .4564  .4573  .4582  .4591  .4599  .4608  .4616  .4625  .4633
1.8   .4641  .4649  .4656  .4664  .4671  .4678  .4686  .4693  .4699  .4706
1.9   .4713  .4719  .4726  .4732  .4738  .4744  .4750  .4756  .4761  .4767
2.0   .4772  .4778  .4783  .4788  .4793  .4798  .4803  .4808  .4812  .4817
2.1   .4821  .4826  .4830  .4834  .4838  .4842  .4846  .4850  .4854  .4857
2.2   .4861  .4864  .4868  .4871  .4875  .4878  .4881  .4884  .4887  .4890
2.3   .4893  .4896  .4898  .4901  .4904  .4906  .4909  .4911  .4913  .4916
2.4   .4918  .4920  .4922  .4925  .4927  .4929  .4931  .4932  .4934  .4936
2.5   .4938  .4940  .4941  .4943  .4945  .4946  .4948  .4949  .4951  .4952
2.6   .4953  .4955  .4956  .4957  .4959  .4960  .4961  .4962  .4963  .4964
2.7   .4965  .4966  .4967  .4968  .4969  .4970  .4971  .4972  .4973  .4974
2.8   .4974  .4975  .4976  .4977  .4977  .4978  .4979  .4979  .4980  .4981
2.9   .4981  .4982  .4982  .4983  .4984  .4984  .4985  .4985  .4986  .4986
3.0   .4987  .4987  .4987  .4988  .4988  .4989  .4989  .4989  .4990  .4990
Appendix C
Table 2 Critical Values for Student's t
(The subscript in each column heading denotes the upper-tail area.)

df     t.100   t.050   t.025    t.010    t.005    t.001     t.0005
1      3.078   6.314   12.706   31.821   63.657   318.310   636.620
2      1.886   2.920   4.303    6.965    9.925    22.326    31.598
3      1.638   2.353   3.182    4.541    5.841    10.213    12.924
4      1.533   2.132   2.776    3.747    4.604    7.173     8.610
5      1.476   2.015   2.571    3.365    4.032    5.893     6.869
6      1.440   1.943   2.447    3.143    3.707    5.208     5.959
7      1.415   1.895   2.365    2.998    3.499    4.785     5.408
8      1.397   1.860   2.306    2.896    3.355    4.501     5.041
9      1.383   1.833   2.262    2.821    3.250    4.297     4.781
10     1.372   1.812   2.228    2.764    3.169    4.144     4.587
11     1.363   1.796   2.201    2.718    3.106    4.025     4.437
12     1.356   1.782   2.179    2.681    3.055    3.930     4.318
13     1.350   1.771   2.160    2.650    3.012    3.852     4.221
14     1.345   1.760   2.145    2.624    2.977    3.787     4.140
15     1.341   1.753   2.131    2.602    2.947    3.733     4.073
16     1.337   1.746   2.120    2.583    2.921    3.686     4.015
17     1.333   1.740   2.110    2.567    2.898    3.646     3.965
18     1.330   1.734   2.101    2.552    2.878    3.610     3.922
19     1.328   1.729   2.093    2.539    2.861    3.579     3.883
20     1.325   1.725   2.086    2.528    2.845    3.552     3.850
21     1.323   1.721   2.080    2.518    2.831    3.527     3.819
22     1.321   1.717   2.074    2.508    2.819    3.505     3.792
23     1.319   1.714   2.069    2.500    2.807    3.485     3.767
24     1.318   1.711   2.064    2.492    2.797    3.467     3.745
25     1.316   1.708   2.060    2.485    2.787    3.450     3.725
26     1.315   1.706   2.056    2.479    2.779    3.435     3.707
27     1.314   1.703   2.052    2.473    2.771    3.421     3.690
28     1.313   1.701   2.048    2.467    2.763    3.408     3.674
29     1.311   1.699   2.045    2.462    2.756    3.396     3.659
30     1.310   1.697   2.042    2.457    2.750    3.385     3.646
40     1.303   1.684   2.021    2.423    2.704    3.307     3.551
60     1.296   1.671   2.000    2.390    2.660    3.232     3.460
120    1.289   1.658   1.980    2.358    2.617    3.160     3.373
∞      1.282   1.645   1.960    2.326    2.576    3.090     3.291
Appendix D
Table 1 Critical values of TL and TU for the Wilcoxon Rank Sum Test: Independent
samples

a. Alpha = 0.025 one-tailed; alpha = 0.05 two-tailed

      n1 = 3    n1 = 4    n1 = 5    n1 = 6    n1 = 7    n1 = 8    n1 = 9    n1 = 10
n2    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU
 3     5  16     6  18     6  21     7  23     7  26     8  28     8  31     9   33
 4     6  18    11  25    12  28    12  32    13  35    14  38    15  41    16   44
 5     6  21    12  28    18  37    19  41    20  45    21  49    22  53    24   56
 6     7  23    12  32    19  41    26  52    28  56    29  61    31  65    32   70
 7     7  26    13  35    20  45    28  56    37  68    39  73    41  78    43   83
 8     8  28    14  38    21  49    29  61    39  73    49  87    51  93    54   98
 9     8  31    15  41    22  53    31  65    41  78    51  93    63 108    66  114
10     9  33    16  44    24  56    32  70    43  83    54  98    66 114    79  131

b. Alpha = 0.05 one-tailed; alpha = 0.10 two-tailed

      n1 = 3    n1 = 4    n1 = 5    n1 = 6    n1 = 7    n1 = 8    n1 = 9    n1 = 10
n2    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU    TL  TU
 3     6  15     7  17     7  20     8  22     9  24     9  27    10  29    11   31
 4     7  17    12  24    13  27    14  30    15  33    16  36    17  39    18   42
 5     7  20    13  27    19  36    20  40    22  43    24  46    25  50    26   54
 6     8  22    14  30    20  40    28  50    30  54    32  58    33  63    35   67
 7     9  24    15  33    22  43    30  54    39  66    41  71    43  76    46   80
 8     9  27    16  36    24  46    32  58    41  71    52  84    54  90    57   95
 9    10  29    17  39    25  50    33  63    43  76    54  90    66 105    69  111
10    11  31    18  42    26  54    35  67    46  80    57  95    69 111    83  127
Table 2 Critical values of T0 in the Wilcoxon signed ranks test

         One-tailed α:  0.05    0.025   0.01    0.005
         Two-tailed α:  0.10    0.05    0.02    0.01
n = 5                   1       -       -       -
n = 6                   2       1       -       -
n = 7                   4       2       0       -
n = 8                   6       4       2       0
n = 9                   8       6       3       2
n = 10                  11      8       5       3
n = 11                  14      11      7       5
n = 12                  17      14      10      7
n = 13                  21      17      13      10
n = 14                  26      21      16      13
n = 15                  30      25      20      16
n = 16                  36      30      24      19
n = 17                  41      35      28      23
n = 18                  47      40      33      28
n = 19                  54      46      38      32
n = 20                  60      52      43      37
n = 21                  68      59      49      43
n = 22                  75      66      56      49
n = 23                  83      73      62      55
n = 24                  92      81      69      61
n = 25                  101     90      77      68
n = 26                  110     98      85      76
n = 27                  120     107     93      84
n = 28                  130     117     102     92
n = 29                  141     127     111     100
n = 30                  152     137     120     109
n = 31                  163     148     130     118
n = 32                  175     159     141     128
n = 33                  188     171     151     138
n = 34                  201     183     162     149
n = 35                  214     195     174     160
n = 36                  228     208     186     171
n = 37                  242     222     198     183
n = 38                  256     235     211     195
n = 39                  271     250     224     208
n = 40                  287     264     238     221
n = 41                  303     279     252     234
n = 42                  319     295     267     248
n = 43                  336     311     281     262
n = 44                  353     327     297     277
n = 45                  371     344     313     292
n = 46                  389     361     329     307
n = 47                  408     379     345     323
n = 48                  427     397     362     339
n = 49                  446     415     380     365
n = 50                  466     434     398     373
Table 3 Critical values of Spearman's rank correlation coefficient (one-tailed α)

n     α = 0.05   α = 0.025   α = 0.01   α = 0.005
5     0.900      -           -          -
6     0.829      0.886       0.943      -
7     0.714      0.786       0.893      -
8     0.643      0.738       0.833      0.881
9     0.600      0.683       0.783      0.833
10    0.564      0.648       0.745      0.794
11    0.523      0.623       0.736      0.818
12    0.497      0.591       0.703      0.780
13    0.475      0.566       0.673      0.745
14    0.457      0.545       0.646      0.716
15    0.441      0.525       0.623      0.689
16    0.425      0.507       0.601      0.666
17    0.412      0.490       0.582      0.645
18    0.399      0.476       0.564      0.625
19    0.388      0.462       0.549      0.608
20    0.377      0.450       0.534      0.591
21    0.368      0.438       0.521      0.576
22    0.359      0.428       0.508      0.562
23    0.351      0.418       0.496      0.549
24    0.343      0.409       0.485      0.537
25    0.336      0.400       0.475      0.526
26    0.329      0.392       0.465      0.515
27    0.323      0.385       0.456      0.505
28    0.317      0.377       0.448      0.496
29    0.311      0.370       0.440      0.487
30    0.305      0.364       0.432      0.478
Index
A
Additive rule of probabilities, 4.6
Alternative hypothesis, 8.2
Analysis of variance, 10.4
completely randomized design, 10.6
one-way, 10.6
randomized block design, 10.7
Arithmetic mean, 3.3
Axiomatic construction of the theory of probability, 4.4
B
Bar graph, 2.4
Bayes' formula, 4.6
Bernoulli process, 5.4
Biased estimator, 10.7
Bimodal distribution, 3.3
Binomial probability distribution, 5.4
normal approximation to, 5.8
Bivariate relationships, 11.1
Box plot, 3.7
C
Categorical data, 10.1
Central limit theorem, 6.4
Central tendency, 3.3
Chebyshev's theorem, 3.4
Chi-square distribution, 7.9, 9.6
Chi-square test, 10.1, 10.2
Class
frequency, 2.5, 2.6
interval, 2.6
relative frequency, 2.6
D
Data
grouped, 3.1, 3.8
qualitative, 2.3, 2.4
quantitative, 2.5, 2.6
raw, 3.8
Degree of freedom, 7.3
Dependent variable, 11.1
Descriptive statistics, 1.3
Direct relationship, 11.1
Discrete data, 2.6
Discrete probability distribution, 5.2
Discrete random variable, 5.1, 5.2, 5.3
Dispersion, 3.2, 3.4
Distribution
bimodal, 3.3
binomial, 5.4
chi-square, 7.9
frequency, 2.6
normal, 5.8
Poisson, 5.5
probability, 5.2-5.8
sampling, 6.1, 6.3
standard normal probability, 5.8
Student's, 7.3
E
Empirical Rule, 3.4
Error
of Type I, 8.3
of Type II, 8.3
Estimator
error variance, least squares line, 11.4
error variance, multiple regression, 12.4
Events, 4.1
certain, 4.3
complementary, 4.3
equally likely, 4.4
impossible, 4.3
independent, 4.3
simple, 4.3, 4.4
mutually exclusive, 4.3
non-mutually exclusive, 4.3
Expected value, 5.3, 5.7
Experiment, 4.1
Exponential random variable, 5.9
F
F probability distribution, 9.7, 10.6
F statistic, 9.7
Factor level
Frequency distribution, 2.6, 2.7
Frequency polygon, 2.7
G
General linear model, 12.1
Geometric mean, 3.3
Goodness-of-fit test, 10.2
H
Highly suspect outlier, 2.7
Histogram, 2.7
Hypothesis
alternative, 8.2
null, 8.2
one-tailed, 8.2
two-tailed, 8.2
Hypothesis testing, 8.1
I
Independence, 4.5
Independent events, 4.5
Independent variables, 11.1
Inferential statistics, 1.3
Inner fences, 3.7
Interaction model, 12.8
Intersection of events, 4.1
Interquartile range, 3.5
Inverse relationship, 11.1
L
Least squares
estimates, 11.3
line, 11.3
matrix equation,12.3
method of, 11.3, 12.3
prediction equation, 11.3, 12.3
Level of significance, 8.3
Linear relationship, 11.6
Linear regression model, 11.2, 12.2
Lower quartile, 3.5
M
Matched pairs, 7.6
Mean, 3.3
Median, 3.3
Measure of central tendency, 3.3
Measure of dispersion, 3.4
Measure of location, 3.3
Method of least squares, 11.3, 12.3
Midquartile, 3.5
Mode, 3.3
Model
first-order, 12.1
probabilistic, 11.2, 12.2
quadratic, 12.9
second-order, 12.1
Model building, 12.8, 12.9
Multiple coefficient of determination, 12.6
Multiple regression analysis, 12.1
Multiplication rule for probability, 4.6
Mutually exclusive events, 4.3
N
Nonparametric methods, 13.1-13.7
Kruskal-Wallis test, 13.5
Sign test for a population median, 13.2
Spearman's rank correlation coefficient, 13.6
Wilcoxon rank sum test, 13.3
O
Objective of statistics, 1.1
Ogive, 2.7, 2.8
One-tailed test, 8.2
Outer fences, 3.7
Outlier, 3.7
P
Parameters, 3.2
Percentage relative, 2.6
Pie chart, 2.4
Point estimate, 7.1
Poisson random variable, 5.5
Poisson probability distribution, 5.5
Population, 1.2
Prediction equation, 11.3, 12.3
Prediction interval (regression ), 11.7, 12.7
multiple, 12.7
single, 11.7
Probabilistic model, 11.2, 12.2
Probability
axiomatic definition, 4.4
classical definition, 4.4
conditional, 4.5
statistical definition, 4.4
total, 4.6
unconditional, 4.5
Q
Quadratic model, 12.9
R
Random sample, 6.2
Random sampling, 6.2
Range
interquartile, 3.5
Rank correlation coefficient, 13.6
Rank sum, 13.3
Regression analysis
multiple, 12.1
simple, 11.1
Regression models, 11.2, 12.2
Relative frequency, 2.6
Relative frequency distribution, 2.6
Relative standing, measures of, 3.5
S
Sample space, 4.3
Sampling distribution, 6.1
Scattergram, 11.1
Shape, 3.6
Sign test, 13.2
Signed ranks, 13.4
Significance level, 8.3
Simple linear regression, 11.2
Skewness, 3.6
Spearman's rank correlation coefficient, 13.6
Standard deviation, 3.4, 5.3, 5.6
Standard normal variable, 5.8
Standard score, 3.5
Statistical software packages, 1.5
Statistics
descriptive, 1.3
nonparametric, 13.1
summary, 3.9
Stem and leaf display, 2.5
Straight line model, 11.2
T
t distribution, 7.3
Test of hypotheses
Test statistic, 8.4
Two-tailed test, 8.2
Type I error, 8.3
Type II error, 8.3
U
Unconditional probability, 4.5
Uniform random variable, 5.9
Union of events, 4.3
Upper quartile, 3.5
Utility of model, 11.7, 12.6
V
Variability, measures of, 3.4
Variance, 3.4
Venn diagram, 4.3
W
Wilcoxon rank sum test, 13.3
Wilcoxon signed ranks test, 13.4
Z
z-score, 3.5, 6.4
z statistic, 9.4