Epei
SN
Statistics
1. Introduction to statistics
1.1. Basic notions. We begin with a simple example. There are millions of passenger au-
tomobiles in the United States. What is their average value? It is obviously impractical
to attempt to solve this problem directly by assessing the value of every single car in the
country, adding up all those values, and then dividing by the number of values, one for each car. In
practice the best we can do would be to estimate the average value. A natural way to do
so would be to randomly select some of the cars, say 200 of them, ascertain the value of
each of those cars, and find the average of those 200 values. The set of all those millions of
vehicles is called the population of interest, and the number attached to each one, its value,
is a measurement. The average value is a parameter: a number that describes a charac-
teristic of the population, in this case monetary worth. The set of 200 cars selected from
the population is called a sample, and the 200 numbers, the monetary values of the cars we
selected, are the sample data. The average of the data is called a statistic: a number calcu-
lated from the sample data. This example illustrates the meaning of the following definitions.
Continuing with our example, if the average value of the cars in our sample was 8,357
dollars, then it seems reasonable to conclude that the average value of all cars is about 8,357
dollars. In reasoning this way we have drawn an inference about the population based on
information obtained from the sample. In general, statistics is a study of data: describing properties
of the data, which is called descriptive statistics, and drawing conclusions about a population
of interest from information extracted from a sample, which is called inferential statistics.
Computing the single number 8,357 to summarize the data was an operation of descriptive
statistics; using it to make a statement about the population was an operation of inferential
statistics.
Statistics is a collection of methods for collecting, displaying, analyzing, and drawing con-
clusions from data.
Descriptive statistics is the branch of statistics that involves organizing, displaying, and
describing data.
Inferential statistics is the branch of statistics that involves drawing conclusions about a
population based on information contained in a sample taken from that population.
The measurement made on each element of a sample need not be numerical. In the case of
automobiles, what is noted about each car could be its color, its make, its body type, and
so on. Such data are categorical or qualitative, as opposed to numerical or quantitative data
such as value or age. This is a general distinction.
Definition 1.3. Qualitative data are measurements for which there is no natural numer-
ical scale, but which consist of attributes, labels, or other non-numerical characteristics.
Quantitative data are numerical measurements that arise from a natural numerical scale.
Qualitative data can generate numerical sample statistics. In the automobile example, for
instance, we might be interested in the proportion of all cars that are less than six years
old. In our same sample of 200 cars we could note for each car whether it is less than six
years old or not, which is a qualitative measurement. If 172 cars in the sample are less than
six years old, which is 0.86 or 86%, then we would estimate the parameter of interest, the
population proportion, to be about the same as the sample statistic, the sample proportion,
that is, about 0.86 .
A qualitative variable takes non-numerical values such as hair color, eye color, religion,
favorite movie, gender, and so on. The values of a
qualitative variable do not imply a numerical ordering. Values of the variable “religion” differ
qualitatively; no ordering of religions is implied. Qualitative variables are sometimes referred
to as categorical variables.
Definition 1.5. A frequency distribution for qualitative data lists all categories and the
number of elements that belong to each of the categories.
Exercise 1.1. A sample of 30 employees from large companies was selected, and these em-
ployees were asked how stressful their jobs were. The responses of these employees are recorded
below, where very represents very stressful, somewhat means somewhat stressful, and none
stands for not stressful at all.
somewhat none somewhat very very none
very somewhat somewhat very somewhat somewhat
very somewhat none very none somewhat
somewhat very somewhat somewhat very none
somewhat very very somewhat none somewhat
Solution 1.1. Note that the variable in this example is how stressful is an employee’s job.
This variable is classified into three categories: very stressful, somewhat stressful, and not
stressful at all. We record these categories in the first column of Table 2.4. Then we read
each employee’s response from the given data and mark a tally, denoted by the symbol |,
in the second column of Table 2.4 next to the corresponding category. For example, the first
employee’s response is that his or her job is somewhat stressful. We show this in the frequency
table by marking a tally in the second column next to the category somewhat. Note that the
tallies are marked in blocks of five for counting convenience. Finally, we record the total of
the tallies for each category in the third column of the table. This column is called the column
of frequencies and is usually denoted by f . The sum of the entries in the frequency column
gives the sample size or total frequency. In Table 2.4, this total is 30 , which is the sample
size.
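The tallying procedure of Solution 1.1 can also be carried out programmatically. The following is a minimal Python sketch, not part of the original notes: the variable names are illustrative, and the responses are copied row by row from Exercise 1.1.

```python
from collections import Counter

# Responses of the 30 employees, copied row by row from Exercise 1.1.
responses = """
somewhat none somewhat very very none
very somewhat somewhat very somewhat somewhat
very somewhat none very none somewhat
somewhat very somewhat somewhat very none
somewhat very very somewhat none somewhat
""".split()

# Counter plays the role of the tally column: one count per category.
frequency = Counter(responses)

for category in ("very", "somewhat", "none"):
    print(f"{category:<10}{frequency[category]:>3}")
print(f"{'total':<10}{sum(frequency.values()):>3}")
```

The frequencies come out as 10 (very), 14 (somewhat), and 6 (none), and their sum is the sample size, 30.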
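The relative frequency of a category is obtained by dividing the frequency of that category
by the sum of all frequencies. The percentage for a category is obtained by multiplying the
relative frequency of that category by 100. A percentage distribution lists the percentages
for all categories.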
Exercise 1.2. Determine the relative frequency and percentage distributions for the data of
Exercise 1.1.
Solution 1.2.
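One way to work this exercise is sketched below in Python. It is an illustration rather than the original solution: the frequencies 10, 14, and 6 are assumed to be the result of tallying the 30 responses of Exercise 1.1, and the names are hypothetical.

```python
# Frequencies assumed from tallying the responses of Exercise 1.1.
frequency = {"very": 10, "somewhat": 14, "none": 6}
n = sum(frequency.values())  # sample size

# Relative frequency = frequency / sum of all frequencies.
relative = {category: f / n for category, f in frequency.items()}
# Percentage = relative frequency * 100.
percentage = {category: 100 * r for category, r in relative.items()}

for category in frequency:
    print(f"{category:<10}{relative[category]:>7.3f}{percentage[category]:>7.1f}")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on the arithmetic.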
All of us have heard the adage “a picture is worth a thousand words.” A graphic display
can reveal at a glance the main characteristics of a data set. The bar graph and the pie chart
are two types of graphs that are commonly used to display qualitative data.
Bar graphs. To construct a bar graph (also called a bar chart), we mark the various cat-
egories on the horizontal axis as in Figure 2.1. Note that all categories are represented by
intervals of the same width. We mark the frequencies on the vertical axis. Then we draw
one bar for each category such that the height of the bar represents the frequency of the
corresponding category. We leave a small gap between adjacent bars. The figure below gives
the bar graph for the frequency distribution of Exercise 1.1.
Definition 2.1. A graph made of bars whose heights represent the frequencies of respective
categories is called a bar graph.
The bar graphs for relative frequency and percentage distributions can be drawn simply
by marking the relative frequencies or percentages, instead of the frequencies, on the vertical
axis.
Pie graph. A pie chart is more commonly used to display percentages, although it can be
used to display frequencies or relative frequencies. The whole pie (or circle) represents the
total sample or population. Then we divide the pie into different portions that represent the
different categories.
Definition 2.2. A circle divided into portions that represent the relative frequencies or per-
centages of a population or a sample belonging to different categories is called a pie chart.
As we know, a circle contains 360 degrees. To construct a pie chart, we multiply 360 by
the relative frequency of each category to obtain the degree measure or size of the angle for
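the corresponding category. Using the data of Exercise 1.1, the angles are
360 × 10/30 = 120 degrees for very, 360 × 14/30 = 168 degrees for somewhat, and
360 × 6/30 = 72 degrees for none.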
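The angle computation for a pie chart can be sketched in a few lines of Python. This is an illustrative example only: the frequencies are assumed to be those obtained by tallying Exercise 1.1, and the names are hypothetical.

```python
# Frequencies assumed from tallying the responses of Exercise 1.1.
frequency = {"very": 10, "somewhat": 14, "none": 6}
n = sum(frequency.values())

# Angle of each slice = 360 degrees * relative frequency of the category.
angles = {category: 360 * f / n for category, f in frequency.items()}

for category, angle in angles.items():
    print(f"{category:<10}{angle:>6.1f} degrees")
```

The angles must add up to 360 degrees, since the whole circle represents the whole sample.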
2.1. Organization and grouping quantitative data. The table below gives the weekly
earnings of 100 employees of a large company. The first column lists the classes, which rep-
resent the (quantitative) variable weekly earnings. For quantitative data, an interval that
includes all the values that fall within two numbers—the lower and upper limits—is called
a class. Note that the classes always represent a variable. As we can observe, the classes
are nonoverlapping; that is, each value on earnings belongs to one and only one class. The
second column in the table lists the number of employees who have earnings within each
class. For example, 9 employees of this company earn 801 dollars to 1000 dollars per week.
The numbers listed in the second column are called the frequencies, which give the number
of values that belong to different classes. The frequencies are denoted by f.
For quantitative data, the frequency of a class represents the number of values in the data
set that fall in that class. The table contains six classes. Each class has a lower limit and
an upper limit. The values 801, 1001, 1201, 1401, 1601, and 1801 give the lower limits, and
the values 1000, 1200, 1400, 1600, 1800, and 2000 are the upper limits of the six classes,
respectively. The data presented in the table are an illustration of a frequency distribution table
for quantitative data. Whereas the data that list individual values are called ungrouped data,
the data presented in a frequency distribution table are called grouped data.
Definition 2.3. A frequency distribution for quantitative data lists all the classes and
the number of values that belong to each class. Data presented in the form of a frequency
distribution are called grouped data.
To find the midpoint of the upper limit of the first class and the lower limit of the second
class in the table, we divide the sum of these two limits by 2. Thus, this midpoint is
\[ \frac{1000 + 1001}{2} = 1000.5 \]
The value 1000.5 is called the upper boundary of the first class and the lower boundary of
the second class. By using this technique, we can convert the class limits of the table to
class boundaries, which are also called real class limits.
Definition 2.4. The class boundary is given by the midpoint of the upper limit of one class
and the lower limit of the next class.
The difference between the two boundaries of a class gives the class width. The class width
is also called the class size.
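The conversion from class limits to class boundaries can be sketched as follows. This Python example is illustrative: it uses the six classes of the weekly earnings table, and it assumes that the outer boundaries of the first and last classes are found by extending the limits by the same half gap, which the text does not state explicitly.

```python
# Class limits of the weekly earnings table.
lower_limits = [801, 1001, 1201, 1401, 1601, 1801]
upper_limits = [1000, 1200, 1400, 1600, 1800, 2000]

# Gap between the upper limit of one class and the lower limit of the next.
gap = lower_limits[1] - upper_limits[0]

# A boundary is the midpoint of that gap, i.e. half a gap outside each limit.
# Assumption: the same half gap is applied at the two outer ends.
lower_boundaries = [lower - gap / 2 for lower in lower_limits]
upper_boundaries = [upper + gap / 2 for upper in upper_limits]

# Class width = upper boundary - lower boundary.
widths = [ub - lb for lb, ub in zip(lower_boundaries, upper_boundaries)]

for lb, ub, w in zip(lower_boundaries, upper_boundaries, widths):
    print(f"{lb:>7.1f} to {ub:>7.1f}   width {w:.0f}")
```

Every class in this table has the same width, 200, even though the difference between the limits of a class is only 199.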
Exercise 2.1. The following data give the total number of iPods sold by a mail order
company on each of 30 days. Construct a frequency distribution table.
8 25 11 15 29 22 10 5 17 21
22 13 26 16 18 12 9 26 20 16
23 14 19 23 20 16 27 16 21 14
Solution 2.1. In these data, the minimum value is 5, and the maximum value is 29.
Suppose we decide to group these data using five classes of equal width. Then,
\[ \text{Approximate width of each class} = \frac{29 - 5}{5} = 4.8 \]
Now we round this approximate width to a convenient number, say 5. The lower limit of
the first class can be taken as 5 or any number less than 5. Suppose we take 5 as the lower
limit of the first class. Then our classes will be
5–9, 10–14, 15–19, 20–24, and 25–29.
We record these five classes in the first column of the table below.
Now we read each value from the given data and mark a tally in the second column of the
Table below next to the corresponding class. The first value in our original data set is 8,
which belongs to the 5–9 class. To record it, we mark a tally in the second column next to the
5–9 class. We continue this process until all the data values have been read and entered in the
tally column. Note that tallies are marked in blocks of five for counting convenience. After
the tally column is completed, we count the tally marks for each class and write those numbers
in the third column. This gives the column of frequencies. These frequencies represent the
number of days on which the number of iPods sold fell in each class. For example, on 8 of 30 days,
15 to 19 iPods were sold.
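The construction in Solution 2.1 can be reproduced with a short Python sketch. It is an illustration, not the original solution: the class width of 5 and the first lower limit of 5 are the choices made in the text, and the variable names are hypothetical.

```python
# Number of iPods sold on each of 30 days (Exercise 2.1).
ipods_sold = [
    8, 25, 11, 15, 29, 22, 10, 5, 17, 21,
    22, 13, 26, 16, 18, 12, 9, 26, 20, 16,
    23, 14, 19, 23, 20, 16, 27, 16, 21, 14,
]

num_classes = 5
approx_width = (max(ipods_sold) - min(ipods_sold)) / num_classes  # 4.8
class_width = 5         # approximate width rounded to a convenient number
first_lower_limit = 5   # choice made in the text

# Build the (lower limit, upper limit) pair of each class.
classes = [
    (first_lower_limit + i * class_width,
     first_lower_limit + (i + 1) * class_width - 1)
    for i in range(num_classes)
]

# Frequency of a class = number of values between its limits, inclusive.
frequency = {
    (lower, upper): sum(lower <= x <= upper for x in ipods_sold)
    for lower, upper in classes
}

for (lower, upper), f in frequency.items():
    print(f"{lower:>2}-{upper:<2} {f:>3}")
```

The frequencies are 3, 6, 8, 8, and 5, which sum to the 30 days in the data set.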
Exercise 2.2. Calculate the relative frequencies and percentages for Exercise 2.1.
Solution 2.2.
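A possible way to work this exercise is sketched below in Python. The class frequencies 3, 6, 8, 8, and 5 are assumed to come from tallying the data of Exercise 2.1 into the classes 5-9 through 25-29; the names are illustrative.

```python
# Class labels and frequencies assumed from the tally of Exercise 2.1.
classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
frequencies = [3, 6, 8, 8, 5]
n = sum(frequencies)  # total number of days

# Relative frequency of a class = its frequency / total frequency.
relative = [f / n for f in frequencies]
# Percentage of a class = its relative frequency * 100.
percentages = [100 * r for r in relative]

for label, r, p in zip(classes, relative, percentages):
    print(f"{label:<6}{r:>7.3f}{p:>7.1f}")
```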
2.4. Histogram. A histogram can be drawn for a frequency distribution, a relative frequency
distribution, or a percentage distribution. To draw a histogram, we first mark classes on the
horizontal axis and frequencies (or relative frequencies or percentages) on the vertical axis.
Next, we draw a bar for each class so that its height represents the frequency of that class.
The bars in a histogram are drawn adjacent to each other with no gap between them. A
histogram is called a frequency histogram, a relative frequency histogram, or a percentage
histogram depending on whether frequencies, relative frequencies, or percentages are marked
on the vertical axis.
Definition 2.5. A histogram is a graph in which classes are marked on the horizontal axis
and the frequencies, relative frequencies, or percentages are marked on the vertical axis. The
frequencies, relative frequencies, or percentages are represented by the heights of the bars. In
a histogram, the bars are drawn adjacent to each other.
2.5. Polygon. A polygon is another device that can be used to present quantitative data
in graphic form. To draw a frequency polygon, we first mark a dot above the midpoint of
each class at a height equal to the frequency of that class. This is the same as marking the
midpoint at the top of each bar in a histogram. Next we mark two more classes, one at each
end, and mark their midpoints. Note that these two classes have zero frequencies. In the
last step, we join the adjacent dots with straight lines. The resulting line graph is called a
frequency polygon or simply a polygon. A polygon with relative frequencies marked on the
vertical axis is called a relative frequency polygon. Similarly, a polygon with percentages
marked on the vertical axis is called a percentage polygon.
Definition 2.6. A graph formed by joining the midpoints of the tops of successive
bars in a histogram with straight lines is called a polygon.
For a very large data set, as the number of classes is increased (and the width of classes is
decreased), the frequency polygon eventually becomes a smooth curve. Such a curve is called
a frequency distribution curve or simply a frequency curve. Figure 2.6 shows the frequency
curve for a large data set with a large number of classes.
Consider again Exercise 2.1 about the total number of iPods sold by a company. Suppose
we want to know on how many days the company sold 19 or fewer iPods. Such a question
can be answered by using a cumulative frequency distribution. Each class in a cumulative
frequency distribution table gives the total number of values that fall below a certain value.
The cumulative relative frequencies are obtained by dividing the cumulative frequencies by
the total number of observations in the data set. The cumulative percentages are obtained
by multiplying the cumulative relative frequencies by 100.
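Cumulative frequencies are running totals, so they are easy to sketch in code. The Python example below is illustrative: it assumes the class frequencies 3, 6, 8, 8, and 5 obtained for the iPod data with classes 5-9 through 25-29, and the names are hypothetical.

```python
from itertools import accumulate

# Upper limits and frequencies assumed from the iPod frequency distribution.
upper_limits = [9, 14, 19, 24, 29]
frequencies = [3, 6, 8, 8, 5]
n = sum(frequencies)

# Cumulative frequency = running total of the class frequencies.
cumulative = list(accumulate(frequencies))
# Cumulative relative frequency and cumulative percentage.
cumulative_relative = [c / n for c in cumulative]
cumulative_percentage = [100 * r for r in cumulative_relative]

for upper, c, p in zip(upper_limits, cumulative, cumulative_percentage):
    print(f"{upper:>2} or fewer: {c:>2} days ({p:.1f}%)")
```

In particular, the company sold 19 or fewer iPods on 17 of the 30 days.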
Ogives. When plotted on a diagram, the cumulative frequencies give a curve that is called
an ogive (pronounced o-jive). The figure below gives an ogive for the cumulative frequency
distribution of the table. To draw the ogive, the variable, which is total iPods
sold, is marked on the horizontal axis and the cumulative frequencies on the vertical axis.
Then the dots are marked above the upper boundaries of various classes at the heights equal
to the corresponding cumulative frequencies. The ogive is obtained by joining consecutive
points with straight lines. Note that the ogive starts at the lower boundary of the first class
and ends at the upper boundary of the last class.
Definition 4.2. An ogive is a curve drawn for the cumulative frequency distribution by
joining with straight lines the dots marked above the upper boundaries of classes at heights
equal to the cumulative frequencies of respective classes.
4.1. Stem-and-Leaf Display. A stem and leaf plot, or stem plot, is a technique used to
classify either discrete or continuous variables. A stem and leaf plot is used to organize data
as they are collected. A stem and leaf plot looks something like a bar graph. Each number
in the data is broken down into a stem and a leaf, thus the name.
Definition 4.3. In a stem-and-leaf display of quantitative data, each value is divided into
two portions—a stem and a leaf. The leaves for each stem are shown separately in a display.
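A stem-and-leaf display can be built with a few lines of Python. The sketch below is illustrative and rests on two assumptions not made in the text: it reuses the iPod data of Exercise 2.1, and it takes the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# iPod sales data of Exercise 2.1, reused here as an example.
values = [
    8, 25, 11, 15, 29, 22, 10, 5, 17, 21,
    22, 13, 26, 16, 18, 12, 9, 26, 20, 16,
    23, 14, 19, 23, 20, 16, 27, 16, 21, 14,
]

# Assumption: stem = tens digit, leaf = units digit.
stems = defaultdict(list)
for value in sorted(values):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

# Print every stem in the range, even one that has no leaves.
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```

Because the values are sorted first, the leaves of each stem appear in increasing order, which makes the display read like a sideways bar graph.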
4.3. Mean. The mean, also called the arithmetic mean, is the most frequently used measure
of central tendency. This book will use the words mean and average synonymously. For
ungrouped data, the mean is obtained by dividing the sum of all values by the number of
values in the data set:
\[ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}} \]
The mean calculated for sample data is denoted by x̄ (read as "x bar"), and the mean
calculated for population data is denoted by µ (Greek letter mu). We know from the
discussion in Chapter 2 that the number of values in a data set is denoted by n for a sample
and by N for a population. In Chapter 1, we learned that a variable is denoted by x, and the
sum of all values of x is denoted by Σx. Using these notations, we can write the following
formulas for the mean.
Calculating Mean for Ungrouped Data. The mean for ungrouped data is obtained by dividing
the sum of all values by the number of values in the data set. Thus,
\[ \text{Mean for population data: } \mu = \frac{\Sigma x}{N} \qquad \text{Mean for sample data: } \bar{x} = \frac{\Sigma x}{n} \]
where Σx is the sum of all values, N is the population size, n is the sample size, µ is the
population mean, and x̄ is the sample mean.
Exercise 4.1. The table below lists the total sales (rounded to billions of dollars)
of six U.S. companies for 2008. Find the mean sales of these six companies.
Company                  Total Sales (billions of dollars)
General Motors 149
Wal-Mart Stores 406
General Electric 183
Citigroup 107
Exxon Mobil 426
Verizon Communication 97
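The mean of these six values can be computed directly from the formula for ungrouped data. The following Python sketch is illustrative, with hypothetical variable names; it treats the six companies as a sample, so the result plays the role of x̄.

```python
# Total 2008 sales in billions of dollars (Exercise 4.1).
sales = {
    "General Motors": 149,
    "Wal-Mart Stores": 406,
    "General Electric": 183,
    "Citigroup": 107,
    "Exxon Mobil": 426,
    "Verizon Communication": 97,
}

sum_x = sum(sales.values())   # sum of all values
n = len(sales)                # number of values
mean = sum_x / n              # mean = sum of all values / number of values

print(f"Sum = {sum_x}, n = {n}, mean = {mean:.0f} billion dollars")
```

The sum is 1368, so the mean sales of the six companies is 228 billion dollars.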
4.4. Median.
Definition 4.4. The median is the value of the middle term in a data set that has been
ranked in increasing order.
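As is obvious from the definition of the median, it divides a ranked data set into two equal
parts. The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the median.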
Example 4.1. The following data give the prices (in thousands of dollars) of seven houses
selected from all houses sold last month in a city.
312 257 421 289 526 374 497
Find the median.
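The two-step procedure for the median can be sketched in Python as follows. The function name is hypothetical, and the handling of an even number of values (averaging the two middle terms) follows the rule used in Example 4.2.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ranked = sorted(values)        # Step 1: rank in increasing order
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:                 # Step 2, odd n: the single middle term
        return ranked[middle]
    # Step 2, even n: average of the two middle terms
    return (ranked[middle - 1] + ranked[middle]) / 2


# House prices in thousands of dollars (Example 4.1).
house_prices = [312, 257, 421, 289, 526, 374, 497]
print(median(house_prices))
```

For the seven house prices the ranked data are 257, 289, 312, 374, 421, 497, 526, so the median is 374 thousand dollars.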
Example 4.2. The table below gives the 2008 profits (rounded to billions of dollars) of 12
companies selected from all over the world.
There are 12 values in this data set. Because there is an even number of values in the data
set, the median is given by the average of the two middle values. The two middle values are
the sixth and seventh in the foregoing list of data, and these two values are 12 and 13. The
median, which is given by the average of these two values, is calculated as follows.
\[ \text{Median} = \frac{12 + 13}{2} = 12.5 \]
Thus, the median profit of these 12 companies is 12.5 billion dollars.
4.5. Mode.
Definition 4.5. The mode is the value that occurs with the highest frequency in a data set.
Example 4.3. The following data give the speeds (in miles per hour) of eight cars that were
stopped on I-95 for speeding violations.
77 82 74 81 79 84 74 78
Find the mode.
In this data set, 74 occurs twice, and each of the remaining values occurs only once. Because
74 occurs with the highest frequency, it is the mode. Therefore,
Mode = 74 miles per hour
A major shortcoming of the mode is that a data set may have none or may have more than
one mode, whereas it will have only one mean and only one median. For instance, a data set
with each value occurring only once has no mode. A data set with only one value occurring
with the highest frequency has only one mode. The data set in this case is called unimodal.
A data set with two values that occur with the same (highest) frequency has two modes. The
distribution, in this case, is said to be bimodal. If more than two values in a data set occur
with the same (highest) frequency, then the data set contains more than two modes and it
is said to be multimodal.
Example 4.4. Last year’s incomes of five randomly selected families were 76,150, 95,750,
124,985, 87,490, and 53,740.
Because each value in this data set occurs only once, this data set contains no mode.
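The rules for no mode, one mode, and several modes can be sketched in Python. The function below is an illustration with a hypothetical name; it follows the convention of this section that a data set in which every value occurs only once has no mode, and it assumes a non-empty data set.

```python
from collections import Counter


def modes(values):
    """Return the list of modes of a non-empty data set (empty list if none)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs only once: no mode
    # All values that share the highest frequency.
    return sorted(v for v, c in counts.items() if c == highest)


speeds = [77, 82, 74, 81, 79, 84, 74, 78]               # Example 4.3
incomes = [76150, 95750, 124985, 87490, 53740]          # Example 4.4

print(modes(speeds))
print(modes(incomes))
```

A result with one element corresponds to a unimodal data set, two elements to a bimodal one, and more than two to a multimodal one.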
4.6. Measures of dispersion for ungrouped data. The measures of central tendency,
such as the mean, median, and mode, do not reveal the whole picture of the distribution of
a data set. Two data sets with the same mean may have completely different spreads. The
variation among the values of observations for one data set may be much larger or smaller
than for the other data set. (Note that the words dispersion, spread, and variation have the
same meaning.) Consider the following two data sets on the ages (in years) of all workers
working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
If we do not know the ages of individual workers at these two companies and are told only
that the mean age of the workers at both companies is the same, we may deduce that the
workers at these two companies have a similar age distribution. As we can observe, however,
the variation in the workers’ ages for each of these two companies is very different. As
illustrated in the diagram, the ages of the workers at the second company have a much larger
variation than the ages of the workers at the first company.
Thus, the mean, median, or mode by itself is usually not a sufficient measure to reveal the
shape of the distribution of a data set. We also need a measure that can provide some
information about the variation among data values. The measures that help us learn about
the spread of a data set are called the measures of dispersion. The measures of central
tendency and dispersion taken together give a better picture of a data set than the mea-
sures of central tendency alone. This section discusses three measures of dispersion: range,
variance, and standard deviation.
4.7. Range. The range is the simplest measure of dispersion to calculate. It is obtained by
taking the difference between the largest and the smallest values in a data set.
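As an illustration, the range can be computed for the ages of the workers at the two companies discussed earlier in this section. The Python sketch below uses hypothetical names and is not part of the original notes.

```python
# Ages (in years) of all workers at the two companies.
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]


def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    mean_age = sum(ages) / len(ages)
    print(f"{name}: mean = {mean_age:.0f} years, range = {data_range(ages)} years")
```

Both companies have a mean age of 40 years, but the ranges are 12 and 52 years, which shows how different the spreads are.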
4.8. Variance and standard deviation. The standard deviation is the most-used measure
of dispersion. The value of the standard deviation tells how closely the values of a data set
are clustered around the mean. In general, a lower value of the standard deviation for a data
set indicates that the values of that data set are spread over a relatively smaller range around
the mean. In contrast, a larger value of the standard deviation for a data set indicates that
the values of that data set are spread over a relatively larger range around the mean.
The standard deviation is obtained by taking the positive square root of the variance. The
variance calculated for population data is denoted by σ² (read as sigma squared), and the
variance calculated for sample data is denoted by s². Consequently, the standard deviation
calculated for population data is denoted by σ, and the standard deviation calculated for
sample data is denoted by s. Following are what we will call the basic formulas that are used
to calculate the variance:
\[ \sigma^2 = \frac{\Sigma (x - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\Sigma (x - \bar{x})^2}{n - 1} \]
where σ² is the population variance and s² is the sample variance. The quantity x − µ or
x − x̄ in the above formulas is called the deviation of the x value from the mean. The sum
of the deviations of the x values from the mean is always zero; that is, Σ(x − µ) = 0 and
Σ(x − x̄) = 0.
For example, suppose the midterm scores of a sample of four students are 82, 95, 67, and
92, respectively. Then, the mean score for these four students is
\[ \bar{x} = \frac{82 + 95 + 67 + 92}{4} = 84 \]
The deviations of the four scores from the mean are calculated in Table 3.5. As we can
observe from the table, the sum of the deviations of the x values from the mean is zero; that
is, Σ(x − x̄) = 0. For this reason we square the deviations to calculate the variance and
standard deviation.
x      x − x̄
82 82 − 84 = −2
95 95 − 84 = +11
67 67 − 84 = −17
92 92 − 84 = +8
Σ(x − x̄) = 0
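The table of deviations can be checked numerically. The Python sketch below is illustrative, with hypothetical names: it recomputes the deviations of the four midterm scores, confirms that they sum to zero, and then squares them as the basic formula for the sample variance requires.

```python
# Midterm scores of the sample of four students.
scores = [82, 95, 67, 92]
n = len(scores)
mean = sum(scores) / n

# Deviation of each score from the mean; these always sum to zero.
deviations = [x - mean for x in scores]
# Squaring removes the signs, so the sum no longer cancels.
squared_deviations = [d ** 2 for d in deviations]

# Basic formula for the sample variance: sum of squared deviations / (n - 1).
sample_variance = sum(squared_deviations) / (n - 1)

print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared_deviations))
print(f"sample variance: {sample_variance:.2f}")
```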
Short-Cut Formulas for the Variance and Standard Deviation for Ungrouped Data
\[ \sigma^2 = \frac{\Sigma x^2 - \frac{(\Sigma x)^2}{N}}{N} \quad \text{and} \quad s^2 = \frac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} \]
where σ² is the population variance and s² is the sample variance.
The standard deviation is obtained by taking the positive square root of the variance.
Population standard deviation:
\[ \sigma = \sqrt{\sigma^2} \]
Sample standard deviation:
\[ s = \sqrt{s^2} \]
Exercise 4.2. The following table gives the 2008 market values (rounded to billions of
dollars) of five international companies. Find the variance and standard deviation of these data.
Company                  Market Value (billions of dollars)
PepsiCo 75
Google 107
PetroChina 271
Johnson & Johnson 138
Intel 71
Solution 4.1.
x          x²
75         5625
107        11,449
271        73,441
138        19,044
71         5041
Σx = 662   Σx² = 114,600
Step 1. Calculate Σx. The sum of the values in the first column of Table 3.6 gives the
value of Σx, which is 662 .
Step 2. Find Σx². The value of Σx² is obtained by squaring each value of x and then
adding the squared values. The results of this step are shown in the second column of Table
3.6. Notice that Σx² = 114,600.
Step 3. Determine the variance. Substitute all the values in the variance formula and
simplify. Because the given data are on the market values of only five companies, we use the
formula for the sample variance.
\[ s^2 = \frac{\Sigma x^2 - \frac{(\Sigma x)^2}{n}}{n - 1} = \frac{114{,}600 - \frac{(662)^2}{5}}{5 - 1} = \frac{114{,}600 - 87{,}648.80}{4} = 6737.80 \]
Step 4. Obtain the standard deviation. The standard deviation is obtained by taking the
(positive) square root of the variance.
\[ s = \sqrt{6737.80} = 82.0841 \approx \$82.08 \text{ billion} \]
Thus, the standard deviation of the market values of these five companies is $82.08 billion.
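The four steps above can be verified with a short Python sketch. It is an illustration with hypothetical variable names: it applies the short-cut formula for the sample variance, then cross-checks the result against the basic formula and against `statistics.variance` from the Python standard library, which also divides by n − 1.

```python
import math
import statistics

# 2008 market values in billions of dollars (Exercise 4.2).
market_values = [75, 107, 271, 138, 71]
n = len(market_values)

sum_x = sum(market_values)                    # Step 1: sum of x
sum_x2 = sum(x ** 2 for x in market_values)   # Step 2: sum of x squared

# Step 3: short-cut formula for the sample variance.
shortcut_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Cross-check with the basic formula based on squared deviations.
mean = sum_x / n
basic_variance = sum((x - mean) ** 2 for x in market_values) / (n - 1)

# Step 4: the standard deviation is the positive square root of the variance.
std_dev = math.sqrt(shortcut_variance)

print(f"sum x = {sum_x}, sum x^2 = {sum_x2}")
print(f"s^2 = {shortcut_variance:.2f}, s = {std_dev:.2f} billion dollars")
print(f"library check: {statistics.variance(market_values):.2f}")
```

The basic and short-cut formulas agree up to floating-point rounding, which is why either can be used in hand calculations.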
Remark 2. 1. The values of the variance and the standard deviation are never negative.
That is, the numerator in the formula for the variance should never produce a negative value.
Usually the values of the variance and standard deviation are positive, but if a data set has
no variation, then the variance and standard deviation are both zero. For example, if four
persons in a group are the same age—say, 35 years—then the four values in the data set are
35, 35, 35, 35. If we calculate the variance and standard deviation for these data, their values
are zero. This is because there is no variation in the values of this data set.
2. The measurement units of variance are always the square of the measurement units of the
original data. This is so because the original values are squared to calculate the variance. In
Exercise 4.2, the measurement units of the original data are billions of dollars. However, the
measurement units of the variance are squared billions of dollars, which, of course, does not
make any sense. Thus, the variance of the 2008 market values of these five companies is
6737.80 squared billion dollars. But the measurement units of the standard deviation are the
same as the measurement units of the original data because the standard deviation is obtained
by taking the square root of the variance.
4.9. Measures of dispersion for grouped data. Following are what we will call the basic
formulas used to calculate the population and sample variances for grouped data:
\[ \sigma^2 = \frac{\Sigma f (m - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\Sigma f (m - \bar{x})^2}{n - 1} \]
where σ² is the population variance, s² is the sample variance, and m is the midpoint of a
class. In either case, the standard deviation is obtained by taking the positive square root of
the variance.
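Again, the short-cut formulas are more efficient for calculating the variance and standard
deviation.
Short-Cut Formulas for the Variance and Standard Deviation for Grouped Data
\[ \sigma^2 = \frac{\Sigma m^2 f - \frac{(\Sigma mf)^2}{N}}{N} \quad \text{and} \quad s^2 = \frac{\Sigma m^2 f - \frac{(\Sigma mf)^2}{n}}{n - 1} \]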
Example 4.5. The following data give the frequency distribution of the daily commuting
times (in minutes) from home to work for all 25 employees of a company.
Step 1. Calculate the value of Σmf . To calculate the value of Σmf , first find the midpoint m
of each class (see the third column in Table 3.12) and then multiply the corresponding class
midpoints and class frequencies (see the fourth column). The value of Σmf is obtained by
adding these products. Thus,
Σmf = 535
Step 2. Find the value of Σm²f. To find the value of Σm²f, square each m value and
multiply this squared value of m by the corresponding frequency (see the fifth column in Table
3.12). The sum of these products (that is, the sum of the fifth column) gives Σm²f. Hence,
Σm²f = 14,825
Step 3. Calculate the variance. Because the data set includes all 25 employees of the
company, it represents the population. Therefore, we use the formula for the population
variance:
\[ \sigma^2 = \frac{\Sigma m^2 f - \frac{(\Sigma mf)^2}{N}}{N} = \frac{14{,}825 - \frac{(535)^2}{25}}{25} = \frac{3376}{25} = 135.04 \]
Step 4. Calculate the standard deviation. To obtain the standard deviation, take the
(positive) square root of the variance.
\[ \sigma = \sqrt{\sigma^2} = \sqrt{135.04} = 11.62 \text{ minutes} \]
Thus, the standard deviation of the daily commuting times for these employees is 11.62
minutes.
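The last two steps of this example can be checked with a brief Python sketch. It is illustrative only: because the frequency table itself is not reproduced here, the code starts from the sums quoted in the example (Σmf = 535, Σm²f = 14,825, N = 25) rather than from the class midpoints and frequencies, and the variable names are hypothetical.

```python
import math

# Sums taken from the worked example; the underlying table is not reproduced.
N = 25            # population size (all employees)
sum_mf = 535      # sum of midpoint * frequency
sum_m2f = 14825   # sum of midpoint squared * frequency

# Short-cut formula for the population variance of grouped data.
population_variance = (sum_m2f - sum_mf ** 2 / N) / N

# The standard deviation is the positive square root of the variance.
population_std_dev = math.sqrt(population_variance)

print(f"variance = {population_variance:.2f}")
print(f"standard deviation = {population_std_dev:.2f} minutes")
```

For sample data the same sums would be used, but the final division would be by n − 1 instead of N.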