Epei

Uploaded by

Zeph
Copyright
© © All Rights Reserved

2023/2024

SN
Statistics

1. Introduction to statistics

1.1. Basic notions. We begin with a simple example. There are millions of passenger
automobiles in the United States. What is their average value? It is obviously impractical
to solve this problem directly by assessing the value of every single car in the country,
adding up all those values, and then dividing by the number of values, one for each car. In
practice the best we can do is estimate the average value. A natural way to do so is to
randomly select some of the cars, say 200 of them, ascertain the value of each of those
cars, and find the average of those 200 values. The set of all those millions of vehicles is
called the population of interest, and the number attached to each one, its value, is a
measurement. The average value is a parameter: a number that describes a characteristic of
the population, in this case monetary worth. The set of 200 cars selected from the
population is called a sample, and the 200 numbers, the monetary values of the cars we
selected, are the sample data. The average of the data is called a statistic: a number
calculated from the sample data. This example illustrates the meaning of the following
definitions.

Definition 1.1. A population is any specific collection of objects of interest. A sample is
any subset or subcollection of the population, including the case that the sample consists of
the whole population, in which case it is termed a census.
In statistics, we often rely on a sample — that is, a small subset of a larger set of data — to
draw inferences about the larger set. The larger set is known as the population from which
the sample is drawn.

Definition 1.2. A measurement is a number or attribute computed for each member of a
population or of a sample. The measurements of sample elements are collectively called the
sample data.
A parameter is a number that summarizes some aspect of the population as a whole. A
statistic is a number computed from the sample data.

Continuing with our example, if the average value of the cars in our sample was 8,357
dollars, then it seems reasonable to conclude that the average value of all cars is about
8,357 dollars. In reasoning this way we have drawn an inference about the population based
on information obtained from the sample. In general, statistics is a study of data:
describing properties
of the data, which is called descriptive statistics, and drawing conclusions about a population
of interest from information extracted from a sample, which is called inferential statistics.
Computing the single number 8,357 to summarize the data was an operation of descriptive
statistics; using it to make a statement about the population was an operation of inferential
statistics.
Statistics is a collection of methods for collecting, displaying, analyzing, and drawing con-
clusions from data.
Descriptive statistics is the branch of statistics that involves organizing, displaying, and
describing data.
Inferential statistics is the branch of statistics that involves drawing conclusions about a
population based on information contained in a sample taken from that population.
The measurement made on each element of a sample need not be numerical. In the case of
automobiles, what is noted about each car could be its color, its make, its body type, and
so on. Such data are categorical or qualitative, as opposed to numerical or quantitative data
such as value or age. This is a general distinction.

Definition 1.3. Qualitative data are measurements for which there is no natural numer-
ical scale, but which consist of attributes, labels, or other non-numerical characteristics.
Quantitative data are numerical measurements that arise from a natural numerical scale.

Qualitative data can generate numerical sample statistics. In the automobile example, for
instance, we might be interested in the proportion of all cars that are less than six years
old. In our same sample of 200 cars we could note for each car whether it is less than six
years old or not, which is a qualitative measurement. If 172 cars in the sample are less than
six years old, which is 172/200 = 0.86 or 86%, then we would estimate the parameter of
interest, the population proportion, to be about the same as the sample statistic, the sample
proportion, that is, about 0.86.

Remark 1. An important distinction between variables is between qualitative variables
and quantitative variables. Qualitative variables are those that express a qualitative attribute
such as hair color, eye color, religion, favorite movie, gender, and so on. The values of a
qualitative variable do not imply a numerical ordering. Values of the variable “religion” differ
qualitatively; no ordering of religions is implied. Qualitative variables are sometimes referred
to as categorical variables.

Definition 1.4 (Discrete and continuous variables). Variables such as number of children
in a household are called discrete variables since the
possible scores are discrete points on the scale. For example, a household could have three
children or six children, but not 4.53 children. Other variables such as “time to respond to a
question” are continuous variables since the scale is continuous and not made up of discrete
steps. The response time could be 1.64 seconds, or it could be 1.64237123922121 seconds. Of
course, the practicalities of measurement preclude most measured variables from being truly
continuous.

1.2. Organization and graphing qualitative data.

A sample of 100 students enrolled at a university were asked what they intended to do
after graduation. Forty-four said they wanted to work for private companies/businesses, 16
said they wanted to work for the federal government, 23 wanted to work for state or local
governments, and 17 intended to start their own businesses. The table below lists the types
of employment and the number of students who intend to engage in each type of employment.
In this table, the variable is the type of employment, which is a qualitative variable. The
categories (representing the type of employment) listed in the first column are mutually
exclusive. In other words, each of the 100 students belongs to one and only one of these
categories. The number of students who belong to a certain category is called the frequency
of that category. A frequency distribution exhibits how the frequencies are distributed over
various categories. The table below is called a frequency distribution table or simply a
frequency table.

Type of employment              Number of students
Private companies/businesses    44
Federal government              16
State/local government          23
Own business                    17
Sum                             100

Definition 1.5. A frequency distribution for qualitative data lists all categories and the
number of elements that belong to each of the categories.

Exercise 1.1. A sample of 30 employees from large companies was selected, and these em-
ployees were asked how stressful their jobs were. The responses of these employees are recorded
below, where very represents very stressful, somewhat means somewhat stressful, and none
stands for not stressful at all.
somewhat none somewhat very very none
very somewhat somewhat very somewhat somewhat
very somewhat none very none somewhat
somewhat very somewhat somewhat very none
somewhat very very somewhat none somewhat

Construct a frequency distribution table for these data.

Solution 1.1. Note that the variable in this example is how stressful an employee’s job is.
This variable is classified into three categories: very stressful, somewhat stressful, and not
stressful at all. We record these categories in the first column of Table 2.4. Then we read
each employee’s response from the given data and mark a tally, denoted by the symbol |,
in the second column of Table 2.4 next to the corresponding category. For example, the first
employee’s response is that his or her job is somewhat stressful. We show this in the frequency
table by marking a tally in the second column next to the category somewhat. Note that the
tallies are marked in blocks of five for counting convenience. Finally, we record the total of
the tallies for each category in the third column of the table. This column is called the column
of frequencies and is usually denoted by f. The sum of the entries in the frequency column
gives the sample size or total frequency. In Table 2.4, this total is 30, which is the sample
size.
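
The tallying procedure of Solution 1.1 can be sketched in a few lines of Python. This is only an illustration, not part of the original solution: the variable names are hypothetical, and `collections.Counter` stands in for the manual tally marks.

```python
from collections import Counter

# Responses of the 30 employees from Exercise 1.1, copied row by row.
responses = """
somewhat none somewhat very very none
very somewhat somewhat very somewhat somewhat
very somewhat none very none somewhat
somewhat very somewhat somewhat very none
somewhat very very somewhat none somewhat
""".split()

# Counter plays the role of the tally column: one count per category.
frequency = Counter(responses)
total_frequency = sum(frequency.values())  # the sample size

for category in ("very", "somewhat", "none"):
    print(f"{category:<10}{frequency[category]:>3}")
print(f"{'Sum':<10}{total_frequency:>3}")
```

The printed counts are the entries of the frequency column f, and their sum is the sample size.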

Relative Frequency and Percentage Distributions. The relative frequency of a category
is obtained by dividing the frequency of that category by the sum of all frequencies.
Thus, the relative frequency shows what fractional part or proportion of the total frequency
belongs to the corresponding category. A relative frequency distribution lists the relative
frequencies for all categories.

\[ \text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}} \]

The percentage for a category is obtained by multiplying the relative frequency of that
category by 100. A percentage distribution lists the percentages for all categories.

\[ \text{Percentage} = (\text{Relative frequency}) \times 100 \]

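Exercise 1.2. Determine the relative frequency and percentage distributions for the data of
Exercise 1.1.

Solution 1.2. Tallying the responses of Exercise 1.1 gives frequencies of 10 (very), 14
(somewhat), and 6 (none), with a sum of 30. The relative frequencies are therefore
10/30 ≈ 0.333, 14/30 ≈ 0.467, and 6/30 = 0.200, and the percentages are 33.3%, 46.7%, and
20.0%.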
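
As a rough sketch of the two formulas above, the following Python fragment computes the relative frequency and percentage distributions. It assumes the frequencies 10, 14, and 6 obtained by tallying the responses of Exercise 1.1; the dictionary names are illustrative only.

```python
# Frequencies obtained by tallying the 30 responses of Exercise 1.1.
frequency = {"very": 10, "somewhat": 14, "none": 6}
total = sum(frequency.values())

# Relative frequency = frequency of the category / sum of all frequencies.
relative_frequency = {c: f / total for c, f in frequency.items()}
# Percentage = relative frequency * 100.
percentage = {c: 100 * r for c, r in relative_frequency.items()}

for category in frequency:
    print(f"{category:<10}{relative_frequency[category]:>7.3f}"
          f"{percentage[category]:>7.1f}%")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on the arithmetic.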

2. Graphical Presentation of Qualitative Data

All of us have heard the adage “a picture is worth a thousand words.” A graphic display
can reveal at a glance the main characteristics of a data set. The bar graph and the pie chart
are two types of graphs that are commonly used to display qualitative data.

Bar graphs. To construct a bar graph (also called a bar chart), we mark the various
categories on the horizontal axis as in Figure 2.1. Note that all categories are represented by
intervals of the same width. We mark the frequencies on the vertical axis. Then we draw
one bar for each category such that the height of the bar represents the frequency of the
corresponding category. We leave a small gap between adjacent bars. The figure below gives
the bar graph for the frequency distribution of Exercise 1.1.

Definition 2.1. A graph made of bars whose heights represent the frequencies of respective
categories is called a bar graph.

The bar graphs for relative frequency and percentage distributions can be drawn simply
by marking the relative frequencies or percentages, instead of the frequencies, on the vertical
axis.

Pie graph. A pie chart is more commonly used to display percentages, although it can be
used to display frequencies or relative frequencies. The whole pie (or circle) represents the
total sample or population. Then we divide the pie into different portions that represent the
different categories.

Definition 2.2. A circle divided into portions that represent the relative frequencies or per-
centages of a population or a sample belonging to different categories is called a pie chart.

As we know, a circle contains 360 degrees. To construct a pie chart, we multiply 360 by
the relative frequency of each category to obtain the degree measure or size of the angle for
the corresponding category. Using the data of Exercise 1.1, the angles are
360 × 10/30 = 120 degrees for very, 360 × 14/30 = 168 degrees for somewhat, and
360 × 6/30 = 72 degrees for none.

2.1. Organization and grouping quantitative data. The table below gives the weekly
earnings of 100 employees of a large company. The first column lists the classes, which rep-
resent the (quantitative) variable weekly earnings. For quantitative data, an interval that
includes all the values that fall within two numbers—the lower and upper limits—is called
a class. Note that the classes always represent a variable. As we can observe, the classes
are nonoverlapping; that is, each value on earnings belongs to one and only one class. The
second column in the table lists the number of employees who have earnings within each
class. For example, 9 employees of this company earn 801 dollars to 1000 dollars per week.
The numbers listed in the second column are called the frequencies, which give the number
of values that belong to different classes. The frequencies are denoted by f.

For quantitative data, the frequency of a class represents the number of values in the data
set that fall in that class. The table contains six classes. Each class has a lower limit and
an upper limit. The values 801, 1001, 1201, 1401, 1601, and 1801 give the lower limits, and
the values 1000, 1200, 1400, 1600, 1800, and 2000 are the upper limits of the six classes,
respectively. The data presented in Table are an illustration of a frequency distribution table
for quantitative data. Whereas the data that list individual values are called ungrouped data,
the data presented in a frequency distribution table are called grouped data.

Definition 2.3. A frequency distribution for quantitative data lists all the classes and
the number of values that belong to each class. Data presented in the form of a frequency
distribution are called grouped data.

To find the midpoint of the upper limit of the first class and the lower limit of the second
class in the table, we divide the sum of these two limits by 2. Thus, this midpoint is

\[ \frac{1000 + 1001}{2} = 1000.5 \]

The value 1000.5 is called the upper boundary of the first class and the lower boundary of
the second class. By using this technique, we can convert the class limits of the table to
class boundaries, which are also called real class limits.

Definition 2.4. The class boundary is given by the midpoint of the upper limit of one class
and the lower limit of the next class.

The difference between the two boundaries of a class gives the class width. The class width
is also called the class size.

\[ \text{Class width} = \text{Upper boundary} - \text{Lower boundary} \]

\[ \text{Width of the first class} = 1000.5 - 800.5 = 200 \]

The class midpoint or mark is obtained by dividing the sum of the two limits (or the two
boundaries) of a class by 2.

\[ \text{Class midpoint or mark} = \frac{\text{Lower limit} + \text{Upper limit}}{2} \]

\[ \text{Midpoint of the first class} = \frac{801 + 1000}{2} = 900.5 \]
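
The conversion from class limits to boundaries, widths, and midpoints can be sketched as follows. The class limits are those listed in the text; the outer boundaries 800.5 and 2000.5 are computed under the assumption that imaginary neighbouring classes are separated by the same 1-dollar gap, which agrees with the lower boundary 800.5 used above. All names are illustrative.

```python
# Class limits of the weekly-earnings table, as listed in the text.
lower_limits = [801, 1001, 1201, 1401, 1601, 1801]
upper_limits = [1000, 1200, 1400, 1600, 1800, 2000]

# Gap between the upper limit of one class and the lower limit of the next.
gap = lower_limits[1] - upper_limits[0]  # 1 dollar for these classes

classes = []
for lower, upper in zip(lower_limits, upper_limits):
    lower_boundary = lower - gap / 2      # midpoint with the previous class
    upper_boundary = upper + gap / 2      # midpoint with the next class
    width = upper_boundary - lower_boundary
    midpoint = (lower + upper) / 2
    classes.append((lower_boundary, upper_boundary, width, midpoint))

for row in classes:
    print(row)
```

Each tuple holds the lower boundary, upper boundary, class width, and class midpoint of one class.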

2.2. Constructing a frequency distribution table.

Class width. Although it is not uncommon to have classes of different sizes, most of the time
it is preferable to have the same width for all classes. To determine the class width when all
classes are the same size, first find the difference between the largest and the smallest values
in the data. Then, the approximate width of a class is obtained by dividing this difference
by the number of desired classes.

\[ \text{Approximate class width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}} \]

Exercise 2.1. The following data give the total number of iPods sold by a mail order
company on each of 30 days. Construct a frequency distribution table.
8 25 11 15 29 22 10 5 17 21
22 13 26 16 18 12 9 26 20 16
23 14 19 23 20 16 27 16 21 14

Solution 2.1. In these data, the minimum value is 5, and the maximum value is 29.
Suppose we decide to group these data using five classes of equal width. Then,

\[ \text{Approximate width of each class} = \frac{29 - 5}{5} = 4.8 \]

Now we round this approximate width to a convenient number, say 5. The lower limit of
the first class can be taken as 5 or any number less than 5. Suppose we take 5 as the lower
limit of the first class. Then our classes will be

5–9, 10–14, 15–19, 20–24, and 25–29

We record these five classes in the first column of the Table below.
Now we read each value from the given data and mark a tally in the second column of the
Table below next to the corresponding class. The first value in our original data set is 8,
which belongs to the 5–9 class. To record it, we mark a tally in the second column next to the
5–9 class. We continue this process until all the data values have been read and entered in the
tally column. Note that tallies are marked in blocks of five for counting convenience. After
the tally column is completed, we count the tally marks for each class and write those numbers
in the third column. This gives the column of frequencies. These frequencies represent the
number of days on which the number of iPods sold fell in each class. For example, on 8 of
the 30 days, 15 to 19 iPods were sold.

We can denote the frequencies of the five classes by $f_1$, $f_2$, $f_3$, $f_4$, and $f_5$,
with $f_1 = 3$, $f_2 = 6$, $f_3 = 8$, $f_4 = 8$, and $f_5 = 5$. Then

\[ \sum f = f_1 + f_2 + f_3 + f_4 + f_5 = 30 \]

The number of observations in a sample is usually denoted by $n$. Thus, for the sample data,
$\sum f$ is equal to $n$. The number of observations in a population is denoted by $N$.
Consequently, $\sum f$ is equal to $N$ for population data. Because the data set on the total
iPods sold in the table covers only 30 days, it represents a sample. Therefore, in Table 2.9
we can denote the sum of frequencies by $n$ instead of $\sum f$.
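
The construction in Solution 2.1 can be reproduced with a short Python sketch. It follows the choices made in the solution (five classes, width rounded to 5, first lower limit 5); the variable names are illustrative and not taken from the source.

```python
# Number of iPods sold on each of 30 days (Exercise 2.1).
ipods_sold = [
    8, 25, 11, 15, 29, 22, 10, 5, 17, 21,
    22, 13, 26, 16, 18, 12, 9, 26, 20, 16,
    23, 14, 19, 23, 20, 16, 27, 16, 21, 14,
]

number_of_classes = 5
# (largest value - smallest value) / number of classes
approximate_width = (max(ipods_sold) - min(ipods_sold)) / number_of_classes
width = 5        # approximate width rounded to a convenient number
first_lower = 5  # lower limit chosen for the first class

# Class limits: 5-9, 10-14, 15-19, 20-24, 25-29.
classes = [(first_lower + i * width, first_lower + (i + 1) * width - 1)
           for i in range(number_of_classes)]

# Frequency of a class = number of values between its limits (inclusive).
frequencies = [sum(lower <= x <= upper for x in ipods_sold)
               for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower:>2}-{upper:<3}{f:>3}")
print("Sum  ", sum(frequencies))
```

The sum of the frequencies equals the number of observations, n = 30.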

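Relative Frequency and Percentage distribution.

\[ \text{Relative frequency of a class} = \frac{\text{Frequency of that class}}{\text{Sum of all frequencies}} = \frac{f}{\sum f} \]

\[ \text{Percentage} = (\text{Relative frequency}) \times 100 \]

Exercise 2.2. Calculate the relative frequencies and percentages for Exercise 2.1.

Solution 2.2. Dividing each frequency of Exercise 2.1 by $\sum f = 30$ gives the relative
frequencies 0.100, 0.200, 0.267, 0.267, and 0.167 for the five classes, that is, 10%, 20%,
26.7%, 26.7%, and 16.7%.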

2.3. Graphing grouped data.



2.4. Histogram. A histogram can be drawn for a frequency distribution, a relative frequency
distribution, or a percentage distribution. To draw a histogram, we first mark classes on the
horizontal axis and frequencies (or relative frequencies or percentages) on the vertical axis.
Next, we draw a bar for each class so that its height represents the frequency of that class.
The bars in a histogram are drawn adjacent to each other with no gap between them. A
histogram is called a frequency histogram, a relative frequency histogram, or a percentage
histogram depending on whether frequencies, relative frequencies, or percentages are marked
on the vertical axis.

Definition 2.5. A histogram is a graph in which classes are marked on the horizontal axis
and the frequencies, relative frequencies, or percentages are marked on the vertical axis. The
frequencies, relative frequencies, or percentages are represented by the heights of the bars. In
a histogram, the bars are drawn adjacent to each other.

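Going back to the example in Exercise 2.1, the frequency histogram has five adjacent bars
over the classes 5–9 through 25–29, with heights 3, 6, 8, 8, and 5.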

2.5. Polygon. A polygon is another device that can be used to present quantitative data
in graphic form. To draw a frequency polygon, we first mark a dot above the midpoint of
each class at a height equal to the frequency of that class. This is the same as marking the
midpoint at the top of each bar in a histogram. Next we mark two more classes, one at each
end, and mark their midpoints. Note that these two classes have zero frequencies. In the
last step, we join the adjacent dots with straight lines. The resulting line graph is called a
frequency polygon or simply a polygon. A polygon with relative frequencies marked on the
vertical axis is called a relative frequency polygon. Similarly, a polygon with percentages
marked on the vertical axis is called a percentage polygon.

Definition 2.6 (Polygon). A graph formed by joining the midpoints of the tops of successive
bars in a histogram with straight lines is called a polygon.
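
To make the construction concrete, the following sketch lists the points that a frequency polygon joins for the iPod data grouped into the classes 5–9 through 25–29. The two extra zero-frequency classes at the ends are the ones described in the text; the names `xs`, `ys`, and `points` are illustrative only, and no plotting library is assumed.

```python
# Classes and frequencies from the frequency table of Exercise 2.1.
classes = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]
frequencies = [3, 6, 8, 8, 5]
width = 5

# A dot is marked above the midpoint of each class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Add one zero-frequency class at each end so the polygon meets the axis.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

points = list(zip(xs, ys))
print(points)
```

Joining consecutive points with straight lines gives the frequency polygon; replacing `ys` with relative frequencies or percentages gives the other two kinds of polygon.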

For a very large data set, as the number of classes is increased (and the width of classes is
decreased), the frequency polygon eventually becomes a smooth curve. Such a curve is called
a frequency distribution curve or simply a frequency curve. Figure 2.6 shows the frequency
curve for a large data set with a large number of classes.

3. Numerical descriptive measures

4. Cumulative frequency distribution

Consider again Exercise 2.1 about the total number of iPods sold by a company. Suppose
we want to know on how many days the company sold 19 or fewer iPods. Such a question
can be answered by using a cumulative frequency distribution. Each class in a cumulative
frequency distribution table gives the total number of values that fall below a certain value.

A cumulative frequency distribution is constructed for quantitative data only.

Definition 4.1 (Cumulative frequency distribution). A cumulative frequency distribution
gives the total number of values that fall below the upper boundary of each class.

The cumulative relative frequencies are obtained by dividing the cumulative frequencies by
the total number of observations in the data set. The cumulative percentages are obtained
by multiplying the cumulative relative frequencies by 100.
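
A minimal sketch of these computations for the iPod data, assuming the class frequencies 3, 6, 8, 8, and 5 of the classes 5–9 through 25–29; `itertools.accumulate` produces the running totals, and the variable names are illustrative.

```python
from itertools import accumulate

# Upper limits and frequencies of the classes 5-9, 10-14, ..., 25-29.
upper_limits = [9, 14, 19, 24, 29]
frequencies = [3, 6, 8, 8, 5]

# Running totals of the frequencies give the cumulative frequencies.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]  # total number of observations

cumulative_relative = [c / n for c in cumulative]
cumulative_percentage = [100 * r for r in cumulative_relative]

# Number of days on which 19 or fewer iPods were sold.
days_with_19_or_fewer = cumulative[upper_limits.index(19)]
print(cumulative, days_with_19_or_fewer)
```

The last cumulative frequency always equals the total number of observations, so the last cumulative percentage is 100.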

Ogives. When plotted on a diagram, the cumulative frequencies give a curve that is called
an ogive (pronounced o-jive). The figure below gives an ogive for the cumulative frequency
distribution of the table. To draw the ogive in the figure, the variable, which is total iPods
sold, is marked on the horizontal axis and the cumulative frequencies on the vertical axis.
Then the dots are marked above the upper boundaries of various classes at the heights equal
to the corresponding cumulative frequencies. The ogive is obtained by joining consecutive
points with straight lines. Note that the ogive starts at the lower boundary of the first class
and ends at the upper boundary of the last class.

Definition 4.2. An ogive is a curve drawn for the cumulative frequency distribution by
joining with straight lines the dots marked above the upper boundaries of classes at heights
equal to the cumulative frequencies of respective classes.

4.1. Stem-and-Leaf Display. A stem and leaf plot, or stem plot, is a technique used to
classify either discrete or continuous variables. A stem and leaf plot is used to organize data
as they are collected. A stem and leaf plot looks something like a bar graph. Each number
in the data is broken down into a stem and a leaf, thus the name.

Definition 4.3. In a stem-and-leaf display of quantitative data, each value is divided into
two portions—a stem and a leaf. The leaves for each stem are shown separately in a display.
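
As an illustration, a stem-and-leaf display can be built for the iPod data of Exercise 2.1 by taking the tens digit as the stem and the units digit as the leaf. This choice of stem and leaf is an assumption made for the sketch, and the names are illustrative.

```python
from collections import defaultdict

# Number of iPods sold on each of 30 days (Exercise 2.1).
ipods_sold = [
    8, 25, 11, 15, 29, 22, 10, 5, 17, 21,
    22, 13, 26, 16, 18, 12, 9, 26, 20, 16,
    23, 14, 19, 23, 20, 16, 27, 16, 21, 14,
]

stems = defaultdict(list)
for value in sorted(ipods_sold):
    # Tens digit is the stem, units digit is the leaf.
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Because the values are sorted first, the leaves of each stem appear in increasing order, which is the usual ranked form of the display.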

4.2. Measure of central tendency for ungrouped data.

4.3. Mean. The mean, also called the arithmetic mean, is the most frequently used measure
of central tendency. This book will use the words mean and average synonymously. For
ungrouped data, the mean is obtained by dividing the sum of all values by the number of
values in the data set:

\[ \text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}} \]

The mean calculated for sample data is denoted by $\bar{x}$ (read as "x bar"), and the mean
calculated for population data is denoted by $\mu$ (Greek letter mu). We know from the
discussion in Chapter 2 that the number of values in a data set is denoted by $n$ for a sample
and by $N$ for a population. In Chapter 1, we learned that a variable is denoted by $x$, and the
sum of all values of $x$ is denoted by $\sum x$. Using these notations, we can write the following
formulas for the mean.
Calculating Mean for Ungrouped Data. The mean for ungrouped data is obtained by dividing
the sum of all values by the number of values in the data set. Thus,

\[ \text{Mean for population data: } \mu = \frac{\sum x}{N} \qquad \text{Mean for sample data: } \bar{x} = \frac{\sum x}{n} \]

where $\sum x$ is the sum of all values, $N$ is the population size, $n$ is the sample size, $\mu$ is the
population mean, and $\bar{x}$ is the sample mean.

Exercise 4.1. The table below lists the total sales (rounded to billions of dollars)
of six U.S. companies for 2008.

Company                  Total Sales (billions of dollars)
General Motors           149
Wal-Mart Stores          406
General Electric         183
Citigroup                107
Exxon Mobil              426
Verizon Communication    97
4.4. Median.

Definition 4.4. The median is the value of the middle term in a data set that has been
ranked in increasing order.

As is obvious from the definition of the median, it divides a ranked data set into two equal
parts. The calculation of the median consists of the following two steps:
1. Rank the data set in increasing order.
2. Find the middle term. The value of this term is the median.
Note that if the number of observations in a data set is odd, then the median is given by the
value of the middle term in the ranked data. However, if the number of observations is even,
then the median is given by the average of the values of the two middle terms.

Example 4.1. The following data give the prices (in thousands of dollars) of seven houses
selected from all houses sold last month in a city.
312 257 421 289 526 374 497
Find the median.
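
The two steps above can be sketched in Python for the house prices of Example 4.1. The manual calculation mirrors the text, and the standard-library function `statistics.median` is used only as a cross-check; variable names are illustrative.

```python
import statistics

# Prices (in thousands of dollars) of the seven houses in Example 4.1.
prices = [312, 257, 421, 289, 526, 374, 497]

# Step 1: rank the data set in increasing order.
ranked = sorted(prices)

# Step 2: find the middle term (or average the two middle terms).
n = len(ranked)
middle = n // 2
if n % 2 == 1:
    median = ranked[middle]                              # odd number of values
else:
    median = (ranked[middle - 1] + ranked[middle]) / 2   # even number of values

print(ranked, median, statistics.median(prices))
```

With seven values the middle term is the fourth one in the ranked list.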

Example 4.2. The table below gives the 2008 profits (rounded to billions of dollars) of 12
companies selected from all over the world.

There are 12 values in this data set. Because there is an even number of values in the data
set, the median is given by the average of the two middle values. The two middle values are
the sixth and seventh in the foregoing list of data, and these two values are 12 and 13. The
median, which is given by the average of these two values, is calculated as follows.

\[ \text{Median} = \frac{12 + 13}{2} = 12.5 \]

Thus, the median profit of these 12 companies is 12.5 billion dollars.

4.5. Mode.

Definition 4.5. The mode is the value that occurs with the highest frequency in a data set.

Example 4.3. The following data give the speeds (in miles per hour) of eight cars that were
stopped on I-95 for speeding violations.
77 82 74 81 79 84 74 78
Find the mode.

In this data set, 74 occurs twice, and each of the remaining values occurs only once.
Because 74 occurs with the highest frequency, it is the mode. Therefore,
Mode = 74 miles per hour
A major shortcoming of the mode is that a data set may have none or may have more than
one mode, whereas it will have only one mean and only one median. For instance, a data set
with each value occurring only once has no mode. A data set with only one value occurring
with the highest frequency has only one mode. The data set in this case is called unimodal.
A data set with two values that occur with the same (highest) frequency has two modes. The
distribution, in this case, is said to be bimodal. If more than two values in a data set occur
with the same (highest) frequency, then the data set contains more than two modes and it
is said to be multimodal.

Example 4.4. Last year’s incomes of five randomly selected families were 76,150, 95,750,
124,985, 87,490, and 53,740.
Because each value in this data set occurs only once, this data set contains no mode.
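
The conventions described above (no mode, one mode, several modes) can be sketched with a small helper. The function `modes` is hypothetical and written for this illustration; it returns an empty list when every value occurs only once, following the convention of the text.

```python
from collections import Counter

def modes(values):
    """Return the list of modes of a non-empty data set ([] if there is none)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs only once: the data set has no mode.
        return []
    return [value for value, count in counts.items() if count == highest]

speeds = [77, 82, 74, 81, 79, 84, 74, 78]            # Example 4.3
incomes = [76150, 95750, 124985, 87490, 53740]       # Example 4.4

print(modes(speeds))    # unimodal data set
print(modes(incomes))   # data set with no mode
```

A returned list with two entries corresponds to a bimodal data set, and a longer list to a multimodal one.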

4.6. Measures of dispersion for ungrouped data. The measures of central tendency,
such as the mean, median, and mode, do not reveal the whole picture of the distribution of
a data set. Two data sets with the same mean may have completely different spreads. The
variation among the values of observations for one data set may be much larger or smaller
than for the other data set. (Note that the words dispersion, spread, and variation have the
same meaning.) Consider the following two data sets on the ages (in years) of all workers
working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of the workers at each company is 40 years. If we do not know the ages of
individual workers at these two companies and are told only that the mean age of the
workers at both companies is the same, we may deduce that the workers at these two
companies have a similar age distribution. As we can observe, however, the variation in the
workers’ ages for each of these two companies is very different. As illustrated in the diagram,
the ages of the workers at the second company have a much larger variation than the ages
of the workers at the first company.

Thus, a measure of central tendency by itself is usually not sufficient to reveal the
shape of the distribution of a data set. We also need a measure that can provide some
information about the variation among data values. The measures that help us learn about
the spread of a data set are called the measures of dispersion. The measures of central
tendency and dispersion taken together give a better picture of a data set than the
measures of central tendency alone. This section discusses three measures of dispersion: range,
variance, and standard deviation.

4.7. Range. The range is the simplest measure of dispersion to calculate. It is obtained by
taking the difference between the largest and the smallest values in a data set.

\[ \text{Range} = \text{Largest value} - \text{Smallest value} \]
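
As a small illustration, the range can be computed for the ages of the workers at the two companies introduced in Section 4.6. The helper names `mean` and `value_range` are hypothetical and chosen for this sketch.

```python
# Ages (in years) of all workers at the two companies of Section 4.6.
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

def mean(values):
    return sum(values) / len(values)

def value_range(values):
    # Range = largest value - smallest value
    return max(values) - min(values)

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    print(name, "mean:", mean(ages), "range:", value_range(ages))
```

Both companies have the same mean age, but the ranges differ widely, which is exactly the difference in spread that the mean alone does not show.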

4.8. Variance and standard deviation. The standard deviation is the most-used measure
of dispersion. The value of the standard deviation tells how closely the values of a data set
are clustered around the mean. In general, a lower value of the standard deviation for a data
set indicates that the values of that data set are spread over a relatively smaller range around
the mean. In contrast, a larger value of the standard deviation for a data set indicates that
the values of that data set are spread over a relatively larger range around the mean.

The standard deviation is obtained by taking the positive square root of the variance. The
variance calculated for population data is denoted by $\sigma^2$ (read as sigma squared), and the
variance calculated for sample data is denoted by $s^2$. Consequently, the standard deviation
calculated for population data is denoted by $\sigma$, and the standard deviation calculated for
sample data is denoted by $s$. Following are what we will call the basic formulas that are used
to calculate the variance:

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

where $\sigma^2$ is the population variance and $s^2$ is the sample variance. The quantity $x - \mu$ or
$x - \bar{x}$ in the above formulas is called the deviation of the $x$ value from the mean. The sum
of the deviations of the $x$ values from the mean is always zero; that is, $\sum (x - \mu) = 0$ and
$\sum (x - \bar{x}) = 0$.
For example, suppose the midterm scores of a sample of four students are 82, 95, 67, and
92, respectively. Then, the mean score for these four students is

\[ \bar{x} = \frac{82 + 95 + 67 + 92}{4} = 84 \]

The deviations of the four scores from the mean are calculated in Table 3.5. As we can
observe from the table, the sum of the deviations of the x values from the mean is zero; that
is, $\sum (x - \bar{x}) = 0$. For this reason we square the deviations to calculate the variance and
standard deviation.

x     x − x̄
82    82 − 84 = −2
95    95 − 84 = +11
67    67 − 84 = −17
92    92 − 84 = +8
      Σ(x − x̄) = 0
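
The table of deviations, and the basic formula for the sample variance that motivates it, can be checked with the following sketch. The variance value is not computed in the text; it is included here only to show the basic formula at work, and the names are illustrative.

```python
# Midterm scores of the sample of four students.
scores = [82, 95, 67, 92]

n = len(scores)
mean = sum(scores) / n

# Deviation of each x value from the mean.
deviations = [x - mean for x in scores]
sum_of_deviations = sum(deviations)

# Basic formula: s^2 = sum((x - mean)^2) / (n - 1)
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

print(deviations, sum_of_deviations, sample_variance)
```

Because the positive and negative deviations cancel, their plain sum carries no information about spread, whereas the squared deviations (4, 121, 289, and 64 here) do not cancel.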

Short-Cut Formulas for the Variance and Standard Deviation for Ungrouped
Data

\[ \sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \quad \text{and} \quad s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]

where $\sigma^2$ is the population variance and $s^2$ is the sample variance.
The standard deviation is obtained by taking the positive square root of the variance.

\[ \text{Population standard deviation: } \sigma = \sqrt{\sigma^2} \qquad \text{Sample standard deviation: } s = \sqrt{s^2} \]

Exercise 4.2. The following table gives the 2008 market values (rounded to billions of
dollars) of five international companies.

Company              Market Value (billions of dollars)
PepsiCo              75
Google               107
PetroChina           271
Johnson & Johnson    138
Intel                71

Find the variance and standard deviation for these data.

Solution 4.1.

x      $x^2$
75     5625
107    11,449
271    73,441
138    19,044
71     5041
$\sum x = 662$    $\sum x^2 = 114{,}600$

Step 1. Calculate $\sum x$. The sum of the values in the first column of Table 3.6 gives the
value of $\sum x$, which is 662.

Step 2. Find $\sum x^2$. The value of $\sum x^2$ is obtained by squaring each value of $x$ and then
adding the squared values. The results of this step are shown in the second column of Table
3.6. Notice that $\sum x^2 = 114{,}600$.

Step 3. Determine the variance. Substitute all the values in the variance formula and
simplify. Because the given data are on the market values of only five companies, we use the
formula for the sample variance.

\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{114{,}600 - \frac{(662)^2}{5}}{5 - 1} = \frac{114{,}600 - 87{,}648.80}{4} = 6737.80 \]

Step 4. Obtain the standard deviation. The standard deviation is obtained by taking the
(positive) square root of the variance.

\[ s = \sqrt{6737.80} = 82.0841 \approx \$82.08 \text{ billion} \]

Thus, the standard deviation of the market values of these five companies is $82.08 billion.
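
The four steps of this solution can be retraced in Python. The sketch applies the short-cut formula for the sample variance and, as a cross-check that is not part of the original solution, compares the result with `statistics.variance` from the standard library. Variable names are illustrative.

```python
import math
import statistics

# 2008 market values in billions of dollars (Exercise 4.2).
market_values = [75, 107, 271, 138, 71]
n = len(market_values)

sum_x = sum(market_values)                    # Step 1: sum of x
sum_x2 = sum(x ** 2 for x in market_values)   # Step 2: sum of x squared

# Step 3: short-cut formula for the sample variance.
sample_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Step 4: standard deviation = positive square root of the variance.
sample_std = math.sqrt(sample_variance)

print(sum_x, sum_x2, sample_variance, round(sample_std, 2))
print(statistics.variance(market_values))  # library cross-check
```

The short-cut formula and the library function agree because both divide the sum of squared deviations by n − 1.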

Remark 2.
1. The values of the variance and the standard deviation are never negative.
That is, the numerator in the formula for the variance should never produce a negative value.
Usually the values of the variance and standard deviation are positive, but if a data set has
no variation, then the variance and standard deviation are both zero. For example, if four
persons in a group are the same age, say 35 years, then the four values in the data set are
35 35 35 35. If we calculate the variance and standard deviation for these data, their values
are zero. This is because there is no variation in the values of this data set.
2. The measurement units of variance are always the square of the measurement units of the
original data. This is so because the original values are squared to calculate the variance. In
Exercise 4.2, the measurement units of the original data are billions of dollars. However, the
measurement units of the variance are squared billions of dollars, which, of course, does not
make any sense. Thus, the variance of the 2008 market values of these five companies is
6737.80 squared billion dollars. But the measurement units of the standard deviation are the
same as the measurement units of the original data because the standard deviation is obtained
by taking the square root of the variance.

4.9. Measures of dispersion for grouped data. Following are what we will call the basic
formulas used to calculate the population and sample variances for grouped data:

\[ \sigma^2 = \frac{\sum f (m - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum f (m - \bar{x})^2}{n - 1} \]

where $\sigma^2$ is the population variance, $s^2$ is the sample variance, and $m$ is the midpoint of a
class. In either case, the standard deviation is obtained by taking the positive square root of
the variance.
Again, the short-cut formulas are more efficient for calculating the variance and standard
deviation.
Short-Cut Formulas for the Variance and Standard Deviation for Grouped
Data

\[ \sigma^2 = \frac{\sum m^2 f - \frac{(\sum m f)^2}{N}}{N} \quad \text{and} \quad s^2 = \frac{\sum m^2 f - \frac{(\sum m f)^2}{n}}{n - 1} \]

where $\sigma^2$ is the population variance, $s^2$ is the sample variance, and $m$ is the midpoint of a
class. The standard deviation is obtained by taking the positive square root of the variance.

\[ \text{Population standard deviation: } \sigma = \sqrt{\sigma^2} \qquad \text{Sample standard deviation: } s = \sqrt{s^2} \]

Example 4.5. The following data give the frequency distribution of the daily commuting
times (in minutes) from home to work for all 25 employees of a company.
Calculate the variance and standard deviation.

Solution 4.2.

Daily Commuting Time (minutes)    f     m     mf     $m^2 f$
0 to less than 10                 4     5     20     100
10 to less than 20                9     15    135    2025
20 to less than 30                6     25    150    3750
30 to less than 40                4     35    140    4900
40 to less than 50                2     45    90     4050
                                  N = 25      $\sum mf = 535$    $\sum m^2 f = 14{,}825$

Step 1. Calculate the value of $\sum mf$. To calculate the value of $\sum mf$, first find the midpoint $m$
of each class (see the third column in Table 3.12) and then multiply the corresponding class
midpoints and class frequencies (see the fourth column). The value of $\sum mf$ is obtained by
adding these products. Thus,

\[ \sum mf = 535 \]

Step 2. Find the value of $\sum m^2 f$. To find the value of $\sum m^2 f$, square each $m$ value and
multiply this squared value of $m$ by the corresponding frequency (see the fifth column in Table
3.12). The sum of these products (that is, the sum of the fifth column) gives $\sum m^2 f$. Hence,

\[ \sum m^2 f = 14{,}825 \]

Step 3. Calculate the variance. Because the data set includes all 25 employees of the
company, it represents the population. Therefore, we use the formula for the population
variance:

\[ \sigma^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N} = \frac{14{,}825 - \frac{(535)^2}{25}}{25} = \frac{3376}{25} = 135.04 \]

Step 4. Calculate the standard deviation. To obtain the standard deviation, take the
(positive) square root of the variance.

\[ \sigma = \sqrt{\sigma^2} = \sqrt{135.04} = 11.62 \text{ minutes} \]

Thus, the standard deviation of the daily commuting times for these employees is 11.62
minutes.
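
The grouped-data calculation of Solution 4.2 can be retraced with the sketch below, which applies the short-cut formula for the population variance. Each class is stored as a hypothetical tuple (lower bound, upper bound, frequency); the names are illustrative.

```python
import math

# (lower bound, upper bound, frequency f) for each class of commuting times.
classes = [
    (0, 10, 4),
    (10, 20, 9),
    (20, 30, 6),
    (30, 40, 4),
    (40, 50, 2),
]

N = sum(f for _, _, f in classes)  # population size

# Midpoint m of each class, then the sums of m*f and m^2*f.
sum_mf = sum(((lo + hi) / 2) * f for lo, hi, f in classes)
sum_m2f = sum(((lo + hi) / 2) ** 2 * f for lo, hi, f in classes)

# Short-cut formula for the population variance of grouped data.
variance = (sum_m2f - sum_mf ** 2 / N) / N
std_dev = math.sqrt(variance)

print(N, sum_mf, sum_m2f, variance, round(std_dev, 2))
```

For sample data the only change would be to divide by n − 1 instead of N in the last step of the variance formula.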
