Chapter 2 171

CHAPTER 2
DESCRIPTIVE DATA
SQQS1013 ELEMENTARY STATISTICS
ORGANIZING AND VISUALIZING DATA
Objectives
In this chapter you learn:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Organizing and visualizing a mix of variables.
• The challenge in organizing and visualizing variables.
2.1 INTRODUCTION
Example:
Here is a list of questions asked in a large statistics class and the
data values given by one of the students:
i. What is your sex (m=male, f=female)?
ii. How many hours did you sleep last night?
iii. Randomly pick a letter – S or Q.
iv. What is your height in inches?
v. What’s the fastest you’ve ever driven a car (mph)?
Raw data - Data recorded in the sequence in which they were originally
collected, before being processed or ranked.
Array data - Raw data that are arranged in ascending or descending order.
PRESENTATION OF DATA
• Organizing data creates both tabular and visual summaries.
• Summaries both guide further exploration and sometimes
facilitate decision making.
• Visual summaries enable rapid review of larger amounts of
data and show possible significant patterns.
• Often, the Organize and Visualize steps in DCOVA occur
concurrently.
2.2 PRESENTATION OF QUALITATIVE DATA
2.2.1 Organizing Categorical Data
Categorical data are organized by tallying:
• One categorical variable: summary table.
• Two categorical variables: contingency table.
• A summary table tallies the frequencies or percentages of
items in a set of categories so that you can see differences
between categories.
Table 2.3 Main Reason Young Adults Shop Online
Reason For Shopping Online            Frequency   Percent
Better prices                               555       37%
Avoiding holiday crowds or hassles          435       29%
Convenience                                 270       18%
Better selection                            195       13%
Ships directly                               45        3%
Total                                      1500      100%
Percentage = (Frequency / Total frequency) × 100%
Source: Data extracted and adapted from “Main Reason Young Adults Shop Online?”
USA Today, December 5, 2012, p. 1A.
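The percentage column of Table 2.3 can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative and not part of the course material.

```python
# Frequencies from Table 2.3 (main reason young adults shop online).
frequencies = {
    "Better prices": 555,
    "Avoiding holiday crowds or hassles": 435,
    "Convenience": 270,
    "Better selection": 195,
    "Ships directly": 45,
}

total = sum(frequencies.values())  # total frequency

# Percentage = (frequency / total frequency) * 100
percentages = {reason: 100 * f / total for reason, f in frequencies.items()}

for reason, pct in percentages.items():
    print(f"{reason:<36}{frequencies[reason]:>6}{pct:>7.0f}%")
```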
• A contingency table:
– Helps organize two or more categorical variables.
– Is used to study patterns that may exist between the responses of two or
more categorical variables.
– Cross-tabulates, or tallies jointly, the responses of the categorical
variables.
– For two variables, the tallies for one variable are located in the rows and
the tallies for the second variable are located in the columns.
Example 2.1 - Contingency Table
• A random sample of 400 invoices is drawn.
• Each invoice is categorized as a small, medium, or large amount.
• Each invoice is also examined to identify if there are any errors.
• These data are then organized in the contingency table below.
Table 2.4 Contingency Table Showing Frequency of Invoices Categorized
By Size and The Presence Of Errors
                 No Errors   Errors   Total
Small Amount           170       20     190
Medium Amount          100       40     140
Large Amount            65        5      70
Total                  335       65     400
Contingency Table Based On Percentage Of Overall Total
Each cell of Table 2.4 is divided by the overall total of 400, for example
42.50% = 170 / 400, 25.00% = 100 / 400 and 16.25% = 65 / 400.
                 No Errors   Errors    Total
Small Amount        42.50%    5.00%   47.50%
Medium Amount       25.00%   10.00%   35.00%
Large Amount        16.25%    1.25%   17.50%
Total               83.75%   16.25%   100.0%
83.75% of sampled invoices have no errors and 47.50% of sampled invoices
are for small amounts.
Percentage of Row Totals
Each cell of Table 2.4 is divided by its row total, for example
89.47% = 170 / 190, 71.43% = 100 / 140 and 92.86% = 65 / 70.
                 No Errors   Errors    Total
Small Amount        89.47%   10.53%   100.0%
Medium Amount       71.43%   28.57%   100.0%
Large Amount        92.86%    7.14%   100.0%
Total               83.75%   16.25%   100.0%
Medium invoices have a larger chance (28.57%) of having errors than small
(10.53%) or large (7.14%) invoices.
Percentage Of Column Totals
Each cell of Table 2.4 is divided by its column total, for example
30.77% = 20 / 65.
                 No Errors   Errors    Total
Small Amount        50.75%   30.77%   47.50%
Medium Amount       29.85%   61.54%   35.00%
Large Amount        19.40%    7.69%   17.50%
Total               100.0%   100.0%   100.0%
Of the invoices with errors, 61.54% are of medium size.
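The three percentage tables differ only in the divisor. The sketch below, with illustrative names, recomputes them from the Table 2.4 counts using the overall total, the row totals and the column totals.

```python
# Counts from Table 2.4 (invoice size by presence of errors).
counts = {
    "Small":  {"No errors": 170, "Errors": 20},
    "Medium": {"No errors": 100, "Errors": 40},
    "Large":  {"No errors": 65,  "Errors": 5},
}

grand_total = sum(sum(row.values()) for row in counts.values())
row_totals = {size: sum(row.values()) for size, row in counts.items()}
col_totals = {}
for row in counts.values():
    for col, n in row.items():
        col_totals[col] = col_totals.get(col, 0) + n


def pct(part, whole):
    """Percentage of `whole` made up by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


# Same counts, three different divisors.
overall = {s: {c: pct(n, grand_total) for c, n in row.items()}
           for s, row in counts.items()}
by_row = {s: {c: pct(n, row_totals[s]) for c, n in row.items()}
          for s, row in counts.items()}
by_col = {s: {c: pct(n, col_totals[c]) for c, n in row.items()}
          for s, row in counts.items()}

print(overall["Small"]["No errors"])   # share of all 400 invoices
print(by_row["Medium"]["Errors"])      # share of medium invoices with errors
print(by_col["Medium"]["Errors"])      # share of error invoices that are medium
```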
2.2.2 Visualizing Categorical Data
Categorical data are visualized according to how they were organized:
• Summary table for one variable: bar chart, Pareto chart, pie or doughnut chart.
• Contingency table for two variables: component / multiple bar chart,
doughnut chart.
The Bar Chart
• The bar chart visualizes a categorical variable as a series of bars.
The length of each bar represents either the frequency or percentage
of values for each category. Each bar is separated by a space called a
gap.
[Figure: bar chart of the Table 2.3 percentages: Better prices 37%,
Avoiding holiday crowds or hassles 29%, Convenience 18%,
Better selection 13%, Ships directly 3%.]
The Pie Chart
• The pie chart is a circle broken up into slices that represent
categories. The size of each slice of the pie varies according to
the percentage in each category.
[Figure: pie chart of the Table 2.3 percentages by reason for shopping online.]
The Doughnut Chart
• The doughnut chart is the outer part of a circle broken up into
pieces that represent categories. The size of each piece of the
doughnut varies according to the percentage in each category.
[Figure: Doughnut Chart of Reasons to Shop Online, using the Table 2.3
percentages.]
The Pareto Chart
• Used to portray categorical data (nominal scale).
• A vertical bar chart in which categories are shown in
descending order of frequency.
• A cumulative polygon is shown in the same graph.
• Used to separate the “vital few” from the “trivial many.”
Table 2.5 Ordered Summary Table For Causes Of Incomplete ATM Transactions
Cause                        Frequency   Percent   Cumulative Percent
Warped card jammed                 365    50.41%               50.41%
Card unreadable                    234    32.32%               82.73%
ATM malfunctions                    32     4.42%               87.15%
ATM out of cash                     28     3.87%               91.02%
Invalid amount requested            23     3.18%               94.20%
Wrong keystroke                     23     3.18%               97.38%
Lack of funds in account            19     2.62%              100.00%
Total                              724   100.00%
Source: Data extracted from A. Bhalla, “Don’t Misuse the Pareto Principle,”
Six Sigma Forum Magazine, May 2009, pp. 15–18.
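The ordering and cumulative-percent columns behind a Pareto chart can be sketched as follows. The names are illustrative; only the frequencies come from Table 2.5.

```python
# Frequencies from Table 2.5 (causes of incomplete ATM transactions).
causes = {
    "Warped card jammed": 365,
    "Card unreadable": 234,
    "ATM malfunctions": 32,
    "ATM out of cash": 28,
    "Invalid amount requested": 23,
    "Wrong keystroke": 23,
    "Lack of funds in account": 19,
}

total = sum(causes.values())

# A Pareto chart orders categories by descending frequency.
ordered = sorted(causes.items(), key=lambda item: item[1], reverse=True)

table = []      # rows of (cause, frequency, percent, cumulative percent)
running = 0
for cause, freq in ordered:
    running += freq
    table.append((cause, freq, 100 * freq / total, 100 * running / total))

for cause, freq, pct, cum in table:
    print(f"{cause:<26}{freq:>5}{pct:>9.2f}%{cum:>9.2f}%")
```

The first two rows already account for more than 80% of the incomplete transactions, which is the sense in which they are the “vital few”.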
[Figure: Pareto chart of the Table 2.5 causes, with the “vital few”
categories marked.]
Multiple (Side By Side) Bar Charts
• The side by side bar chart represents the data from a contingency
table.
                 No Errors   Errors    Total
Small Amount        50.75%   30.77%   47.50%
Medium Amount       29.85%   61.54%   35.00%
Large Amount        19.40%    7.69%   17.50%
Total               100.0%   100.0%   100.0%
[Figure: side by side bar chart “Invoice Size Split Out By Errors & No Errors”,
with bars for Small, Medium and Large within Errors and No Errors;
percentage axis from 0.0% to 70.0%.]
Invoices with errors are much more likely to be of
medium size (61.5% vs 30.8% & 7.7%).
Component Bar Charts
• The component bar chart represents the data from a contingency table,
here the same column percentages of invoice size by errors.
[Figure: component (stacked) bar chart “Invoice Size Split Out By Errors &
No Errors”, one bar each for Small Amount, Medium Amount and Large Amount;
percentage axis from 0.00% to 120.00%.]
Invoices with errors are much more likely to be of
medium size (61.5% vs 30.8% & 7.7%).
Doughnut Charts
• A doughnut chart can be used to represent the data from a contingency table,
here the same column percentages of invoice size by errors.
[Figure: doughnut chart “Invoice Size & Errors”, inner ring with errors,
outer ring no errors, segments for Small, Medium and Large.]
Invoices with errors are much more likely to be of
medium size (61.5% vs 30.8% & 7.7%).
EXERCISE 2.1
A recent consumer survey on holiday shopping reveals the following
information on the types of stores at which consumers plan to shop.
Types of Stores                              % of Customers
Stand-alone “big box” stores                             54
Traditional mall                                         61
Local independent stores not in a mall                   35
Strip mall or mini mall                                  25
Town hall mall                                           14
I do not plan to shop at any of these                     9
i. Construct a bar chart for the types of stores customers plan to shop at.
ii. Construct a pie chart for the types of stores customers plan to shop at.
iii. Which type of store do the most customers plan to shop at?
iv. What percentage do the top 2 categories of stores that customers plan to
shop at make up out of the 6 categories of shopping preferences?
v. What percentage of the customers surveyed mentioned that they did not plan
to shop at any of these stores?
2.3 PRESENTATION OF QUANTITATIVE DATA
2.3.1 Organizing Quantitative Data
Numerical data are organized as:
• An ordered array.
• Frequency distributions.
• Cumulative distributions.
Ordered Array
• An ordered array is a sequence of data, in rank order, from the smallest value to the
largest value.
• Shows range (minimum value to maximum value).
• May help identify outliers (unusual observations).
Age of Surveyed College Students
Day Students:   16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students: 18 18 19 19 20 21 23 28 32 33 41 45
Frequency Distribution
• The frequency distribution is a summary table in which the data are
arranged into numerically ordered classes (grouped data).
• You must give attention to:
i. Selecting the appropriate number of class groupings, c, for the table
(Sturges' Rule), where n is the number of observations and log is the
base-10 logarithm:
\[ c = 1 + 3.3 \log n \]
c shall be rounded up or rounded down.
ii. Determining a suitable width, i, of a class grouping, and establishing the
boundaries of each class grouping to avoid overlapping:
\[ i > \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}} = \frac{\text{Range}}{c} \]
i must always be rounded up.
iii. Choosing the starting point of the 1st class: use the smallest value in
the data set.
Example 2.2
Frequency Distribution Example
A manufacturer of insulation randomly selects 20 winter days and records the
daily high temperature.
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Construct a frequency distribution from the data given.
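The first steps of the construction (number of classes, class width, class limits) can be sketched in Python. This sketch assumes whole-number data, a whole-number class width, and c rounded to the nearest whole number; the variable names are illustrative.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

n = len(temps)

# Sturges' rule: c = 1 + 3.3 log10(n), rounded to the nearest whole number.
c_exact = 1 + 3.3 * math.log10(n)
c = round(c_exact)

# Class width: smallest whole number strictly greater than range / c.
data_range = max(temps) - min(temps)
width = math.floor(data_range / c) + 1

# The first class starts at the smallest value; limits are inclusive.
start = min(temps)
classes = [(start + k * width, start + (k + 1) * width - 1) for k in range(c)]

print(f"c = {c_exact:.2f} -> {c} classes, width = {width}")
for lower, upper in classes:
    print(f"{lower} - {upper}")
```

With these choices the class limits come out as 12 - 21, 22 - 31, 32 - 41, 42 - 51 and 52 - 61, the same classes used in the later examples.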
Why Use a Frequency Distribution?
• It condenses the raw data into a more useful form.
• It allows for a quick visual interpretation of the data.
• It enables the determination of the major characteristics of the data set including where
the data are concentrated / clustered.
Frequency Distribution: Tips

• As the size of the data set increases, the impact of alterations in the selection of class
boundaries is greatly reduced.
• When comparing two or more groups with different sample sizes, you must use either a
relative frequency or a percentage distribution.
2.3.2 Visualizing Numerical Data
Numerical data are visualized according to how they were organized:
• Ordered array: stem-and-leaf display.
• Frequency distributions and cumulative distributions: histogram, polygon, ogive.
Stem-and-Leaf Display
• A simple way to see how the data are distributed and where
concentrations of data exist.
• METHOD: Separate the sorted data series into leading digits (the stems)
and the trailing digits (the leaves).
• A stem-and-leaf display organizes data into groups (called stems) so that the values
within each group (the leaves) branch out to the right on each row.
Age of Surveyed College Students
Day Students:   16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students: 18 18 19 19 20 21 23 28 32 33 41 45
Day Students          Night Students
Stem  Leaf            Stem  Leaf
1     67788899        1     8899
2     0012257         2     0138
3     28              3     23
4     2               4     15
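The method can be sketched in Python for two-digit whole numbers, taking the tens digit as the stem and the units digit as the leaf. The helper name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of its leaves (units digits)."""
    rows = defaultdict(str)
    for value in sorted(values):          # sort first so leaves are in order
        stem, leaf = divmod(value, 10)
        rows[stem] += str(leaf)
    return dict(rows)


day = [16, 17, 17, 18, 18, 18, 19, 19, 20, 20, 21, 22, 22, 25, 27, 32, 38, 42]
night = [18, 18, 19, 19, 20, 21, 23, 28, 32, 33, 41, 45]

for label, ages in (("Day Students", day), ("Night Students", night)):
    print(label)
    for stem, leaves in stem_and_leaf(ages).items():
        print(f"{stem} | {leaves}")
```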
The Histogram
• A vertical bar chart of the data in a frequency distribution is called a histogram.
• In a histogram there are no gaps between adjacent bars.
• The class boundaries (or class midpoints) are shown on the horizontal axis.
• The vertical axis is either frequency, relative frequency, or percentage.
• The height of the bars represents the frequency, relative frequency, or percentage.
Class      Frequency   Relative Frequency   Percentage
12 - 21            3                  .15           15
22 - 31            6                  .30           30
32 - 41            5                  .25           25
42 - 51            4                  .20           20
52 - 61            2                  .10           10
Total             20                 1.00          100
[Figure: Histogram: Temperature.]
(In a percentage histogram the vertical axis would be defined to show the
percentage of observations per class.)
The Polygon
• A percentage polygon is formed by having the midpoint of each class
represent the data in that class and then connecting the sequence of
midpoints at their respective class percentages.
• The cumulative percentage polygon, or ogive, displays the variable
of interest along the X axis, and the cumulative percentages along the
Y axis.
• Useful when there are two or more groups to compare.
[Figures: frequency polygon and percentage polygon, useful when comparing
two or more groups.]
Ogive
• An ogive is a curve drawn for the cumulative frequency distribution.
• Two types of ogive:
(1) ogive less than
(2) ogive greater than
• Steps:
– Build a table of cumulative frequency.
– Draw x and y axes. Label x = class boundaries, y = cumulative frequencies.
– Plot graph using the appropriate class boundary.
– Join the 1st appropriate class boundary to the consecutive points.
[Figures: two ogive examples.]
2.3.3 Visualizing Two Numerical Variables
Two numerical variables are visualized with a scatter plot or a
time-series plot.
The Scatter Plot
• Scatter plots are used for numerical data consisting of paired observations
taken from two numerical variables.
• One variable is measured on the vertical axis and the other variable is
measured on the horizontal axis.
• Scatter plots are used to examine possible relationships between two
numerical variables.
Scatter Plot Example
Volume per day   Cost per day
            23            125
            26            140
            29            146
            33            160
            38            167
            42            170
            50            188
            55            195
            60            200
The Time Series Plot
• A time-series plot is used to study patterns in the
values of a numeric variable over time.
• In a time-series plot, the numeric variable is measured on the vertical
axis and the time period is measured on the horizontal axis.
Time Series Plot Example
Number of franchises, 2007 to 2015
Year   Number of Franchises
2007                     43
2008                     54
2009                     60
2010                     73
2011                     82
2012                     95
2013                    107
2014                     99
2015                     95
[Figure: time-series plot of number of franchises (vertical axis, 0 to 120)
against year (horizontal axis, 2007 to 2015).]
EXERCISE 2.2
The histogram below represents scores achieved by 200 job applicants on a
personality profile.
[Figure: relative-frequency histogram of scores with class boundaries
0, 10, 20, 30, 40, 50, 60, 70; the vertical axis (Rel. Freq.) runs from 0.00
to 0.30, and the bars are labelled 0.10 (four classes) and 0.20 (three classes).]
i. What percentage of the job applicants scored between 10 and 20?
ii. What percentage of the job applicants scored below 50?
iii. What is the number of job applicants who scored between 30 and below 60?
iv. What is the number of job applicants who scored 50 or above?
v. 90% of the job applicants scored above or equal to ________.
vi. Half of the job applicants scored below ________.
NUMERICAL DESCRIPTIVE MEASURE
Objectives
In this topic, you learn to:
• Describe the properties of central tendency, variation, and
shape in numerical variables.
• Construct and interpret a boxplot.
Summary
• The central tendency is the extent to which the values of a numerical
variable group around a typical or central value.
• The variation is the amount of dispersion or scattering away from a
central value that the values of a numerical variable show.
• The shape is the pattern of the distribution of values from the lowest
value to the highest value.
2.4 MEASURE OF CENTRAL TENDENCY
2.4.1 MEAN
2.4.1.1 UNGROUPED DATA
• The arithmetic mean (often just called the “mean”) is the most common
measure of central tendency.
• Mean = sum of values divided by the number of values.
• Affected by extreme values (outliers).
• Population mean, μ, if the data come from a population.
• Sample mean, X̄ (pronounced “x-bar”), if the data come from a sample.
For a sample of size n:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
where X_i is the ith observed value and n is the sample size.
EXAMPLE 2.3
• Values 11, 12, 13, 14, 15: Mean = (11 + 12 + 13 + 14 + 15) / 5 = 65 / 5 = 13.
• Values 11, 12, 13, 14, 20: Mean = (11 + 12 + 13 + 14 + 20) / 5 = 70 / 5 = 14.
[Figure: both data sets plotted on a number line from 11 to 20.]
2.4.1.2 GROUPED DATA
\[ \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} \]
where X̄ is pronounced “x-bar”, X_i is the midpoint of the ith class, f_i is
the frequency of the ith class, and the sum of the f_i is the total
frequency (sample size).
EXAMPLE 2.3
a. During a semester, a student took five exams. The population of
exam scores is 78, 83, 92, 68, and 85. Find the mean. (406, 81.2)
b. The following table shows the speeds (in km/h) of 30 cars measured
at a certain checkpoint. Find the mean. (1504, 50.13)
41 53 58 67 33 61 43 45 42 67
39 48 36 47 34 59 57 54 65 69
63 42 60 48 66 30 30 46 52 49
c. The following table presents the daily high temperature recorded by a
manufacturer of insulation for 20 randomly selected winter
days (refer to Example 2.2). Approximate the mean of daily high
temperature. (34.5)
Class      Frequency
12 - 21            3
22 - 31            6
32 - 41            5
42 - 51            4
52 - 61            2
Total             20
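The three answers can be checked with a short Python sketch. The names are illustrative; for part c, each class is represented by its midpoint, so the result is an approximation of the mean.

```python
# a. Ungrouped data: mean = sum of values / number of values.
scores = [78, 83, 92, 68, 85]
mean_scores = sum(scores) / len(scores)

# b. Speeds (km/h) of 30 cars.
speeds = [41, 53, 58, 67, 33, 61, 43, 45, 42, 67,
          39, 48, 36, 47, 34, 59, 57, 54, 65, 69,
          63, 42, 60, 48, 66, 30, 30, 46, 52, 49]
mean_speeds = sum(speeds) / len(speeds)

# c. Grouped data: (lower limit, upper limit, frequency) for each class.
temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5),
                       (42, 51, 4), (52, 61, 2)]
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in temperature_classes)
sum_f = sum(f for _, _, f in temperature_classes)
grouped_mean = sum_fx / sum_f

print(sum(scores), mean_scores)
print(sum(speeds), round(mean_speeds, 2))
print(grouped_mean)
```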
2.4.2 MEDIAN
2.4.2.1 UNGROUPED DATA
• In an ordered array, the median is the “middle” number (50% above, 50%
below).
• Less sensitive than the mean to extreme values: for the values
11, 12, 13, 14, 15 and for the values 11, 12, 13, 14, 20, the median is 13
in both cases.
Procedure for computing the median
1. Arrange the data in ascending order.
2. Find the location of the median when the values are in numerical order
(smallest to largest):
\[ \text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data} \]
– If the number of values is odd, the median is the middle number.
– If the number of values is even, the median is the average of the two middle
numbers.
3. Find the median.

2.4.2.2 GROUPED DATA
• Procedure:
1. Construct the cumulative frequency distribution.
2. Determine the median class.
3. Compute the median:
\[ \text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) i \]
where L_m is the lower boundary of the median class, n is the total
frequency, F is the cumulative frequency before the median class, f_m is
the frequency of the median class, and i is the class width.
EXAMPLE 2.4
a. During a semester, a student took five exams. The population of
exam scores is 78, 83, 92, 68, and 85. Find the median. (83)
b. One of the goals of medical research is to develop treatments that
reduce the time spent in recovery. Eight patients undergo a new
surgical procedure, and the number of days spent in recovery for
each is as follows. Find the median. (17)
c. The following table presents the daily high temperature recorded by a
manufacturer of insulation for 20 randomly selected winter
days (refer to Example 2.2). Approximate the median of daily high
temperature. (33.5)
Class      Frequency
12 - 21            3
22 - 31            6
32 - 41            5
42 - 51            4
52 - 61            2
Total             20
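The grouped-data procedure for part c can be sketched as below. The function name is illustrative, and the sketch assumes whole-number class limits, so that each lower boundary is the lower limit minus 0.5.

```python
def grouped_median(classes, width):
    """Approximate the median from (lower limit, upper limit, frequency) classes.

    Assumes the classes are in ascending order with whole-number limits,
    so the lower class boundary is the lower limit minus 0.5.
    """
    n = sum(f for _, _, f in classes)
    cumulative = 0                        # F: cumulative frequency so far
    for lower, _, f in classes:
        if cumulative + f >= n / 2:       # first class reaching n/2 is the median class
            lower_boundary = lower - 0.5
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f


temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5),
                       (42, 51, 4), (52, 61, 2)]
median = grouped_median(temperature_classes, width=10)
print(median)
```

Here n/2 = 10 falls in the class 32 - 41, so L_m = 31.5, F = 9 and f_m = 5.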
2.4.3 MODE
2.4.3.1 UNGROUPED DATA
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
[Figure: two dot plots, one on a number line from 0 to 14 with Mode = 9,
and one on a number line from 0 to 6 with no mode.]
2.4.3.2 GROUPED DATA
• Determine the class mode (or modal class): the class with the highest
frequency.
• Use the following formula:
\[ \text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) i \]
where L_mo is the lower boundary of the class mode, Δ1 is the difference
between the frequency of the class mode and the frequency of the class
before the class mode, Δ2 is the difference between the frequency of the
class mode and the frequency of the class after the class mode, and i is
the class width.
Approximating mode using histogram
[Figure: histogram of No. of text messages (class boundaries -0.5, 49.5, 99.5,
149.5, 199.5, 249.5, 299.5) against Frequency; the mode read from the
histogram is MODE = 140.]
EXAMPLE 2.5
a. Ten students were asked how many siblings they had. The results,
arranged in order, were
0 1 1 1 1 2 2 3 3 6
Find the mode of this data set. (1)
b. The following table presents the daily high temperature recorded by a
manufacturer of insulation for 20 randomly selected winter
days (refer to Example 2.2). Approximate the mode of daily high
temperature. (29.0)
Class      Frequency
12 - 21            3
22 - 31            6
32 - 41            5
42 - 51            4
52 - 61            2
Total             20
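Part b can be sketched in Python as follows. The function name is illustrative; the sketch assumes whole-number class limits (lower boundary = lower limit minus 0.5) and, as a further assumption not stated in the notes, treats a missing neighbouring class as having frequency 0.

```python
def grouped_mode(classes, width):
    """Approximate the mode from (lower limit, upper limit, frequency) classes."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                        # position of the modal class
    before = freqs[m - 1] if m > 0 else 0              # assumed 0 if no class before
    after = freqs[m + 1] if m < len(freqs) - 1 else 0  # assumed 0 if no class after
    delta1 = freqs[m] - before
    delta2 = freqs[m] - after
    lower_boundary = classes[m][0] - 0.5
    return lower_boundary + delta1 / (delta1 + delta2) * width


temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5),
                       (42, 51, 4), (52, 61, 2)]
mode = grouped_mode(temperature_classes, width=10)
print(mode)
```

The modal class is 22 - 31, so the lower boundary is 21.5, Δ1 = 6 - 3 = 3 and Δ2 = 6 - 5 = 1.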
Which Measure to Choose?
• The mean is generally used, unless extreme
values (outliers) exist.
• The median is often used instead, since the median
is not sensitive to extreme values. For
example, median home prices may be
reported for a region because the median is less sensitive to
outliers.
• In many situations it makes sense to report
both the mean and the median.
Describing the Shape of a Data Set
• The mean and median measure the center of a data set in different
ways. When a data set is symmetric, the mean, median and mode are
equal.
• When a data set is skewed to the right, there are large values in the
right tail. Because the median is resistant while the mean is not, the
mean is generally more affected by these large values. Therefore, for a
data set that is skewed to the right, the mean is often greater than the
median, which is greater than the mode.
• Similarly, when a data set is skewed to the left, the mean is often less
than the median, which is less than the mode.
i. Approximately Symmetric
Shape: Approximately symmetric.
Relationship between the mean, median and mode: Mean, median and mode are
approximately the same.
ii. Skewed to the Right
Shape: Skewed to the right.
Relationship between the mean, median and mode: Mean is noticeably greater
than the median, which is greater than the mode.
iii. Skewed to the Left
Shape: Skewed to the left.
Relationship between the mean, median and mode: Mean is noticeably less than
the median, which is less than the mode.
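Summary of Measures of Central Tendency
Measure   Ungrouped data                        Grouped data
Mean      X̄ = ΣX_i / n                          X̄ = Σf_i X_i / Σf_i
Median    Value at position (n + 1) / 2         L_m + ((n/2 - F) / f_m) i
Mode      Value with highest frequency          L_mo + (Δ1 / (Δ1 + Δ2)) i
          (could be > 1)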
2.5 MEASURE OF POSITION
Measures of position are techniques that divide a set of data into equal groups.
To determine a measure of position, the data must be sorted from lowest to highest. The
different measures of position are percentiles and quartiles.
2.5.1 PERCENTILES
• The mean and median of a data set describe the center of a
distribution (quantitative).
• For some data it is often useful to compute measures of positions
other than the center, to get a more detailed description of the
distribution.
• Percentiles provide a way to do this. Percentiles divide a data set into
hundredths.
• Definition: For a number p between 1 and 99, the pth percentile
separates the lowest p% of the data from the highest (100 – p)%.
UNGROUPED DATA
• First, the data need to be arranged in increasing order.
• To compute the data value corresponding to a given percentile, compute the
position L = (p / 100) × n, where n is the number of values:
– If L is a whole number, then the pth percentile is the average of the number in position L and the number in position (L+1).
– If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position
corresponding to the rounded-up value.
• To compute the percentile corresponding to a given data value, X, compute
Percentile = ((number of values below X + 0.5) / n) × 100:
– Round the result to the nearest whole number.
EXAMPLE 2.6
A teacher gives a 20-point test to 10 students. The scores are shown
here.
18 15 12 6 8 2 3 5 20 10
1. Find the values corresponding to the 25th and 60th percentiles. (5, 11)
2. Find the percentile ranks of scores of 6 and 12. (35, 65)
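The two percentile procedures can be sketched in Python. The formulas L = (p / 100) × n and percentile = (number of values below X + 0.5) / n × 100 are assumptions here: they are standard textbook rules that reproduce the stated answers, but they are not spelled out in the extracted notes. The function names are illustrative.

```python
import math


def percentile_value(data, p):
    """Data value at the pth percentile (1 <= p <= 99), using L = p * n / 100."""
    values = sorted(data)
    L = p * len(values) / 100
    if L.is_integer():
        k = int(L)
        # Whole number: average the values in positions L and L + 1 (1-based).
        return (values[k - 1] + values[k]) / 2
    # Otherwise round L up and take the value in that position.
    return values[math.ceil(L) - 1]


def percentile_rank(data, x):
    """Percentile of the value x, rounded to the nearest whole number."""
    below = sum(1 for value in data if value < x)
    return round(100 * (below + 0.5) / len(data))


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_value(scores, 25), percentile_value(scores, 60))
print(percentile_rank(scores, 6), percentile_rank(scores, 12))
```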
2.5.2 QUARTILES
• There are 3 percentiles that are used more often than the others: the 25th,
the 50th, and the 75th.
• These percentiles divide the data into 4 parts, each of which contains
approximately one quarter of the data.
• Thus, these 3 percentiles are called quartiles.
• The distribution of the values for a numerical variable can be visualized by:
– Computing the quartiles.
– Computing the five-number summary.
– Constructing a boxplot.
2.5.2.1 UNGROUPED DATA
• Quartiles split the ranked data into 4 segments with an equal number
of values per segment.
  25%  |  25%  |  25%  |  25%
       Q1      Q2      Q3
• The first quartile, Q1, is the value for which 25% of the
values are smaller and 75% are larger.
• Q2 is the same as the median (50% of the values are
smaller and 50% are larger).
• Only 25% of the values are greater than the third quartile, Q3; it
separates the lowest 75% of the data from the highest 25%.
• Determining quartiles:
i. Arrange the data in ascending order.
ii. Find the 25th and 75th percentiles, or find the depth of Q1 and Q3.
iii. Determine the values based on the positions.

2.5.2.2 GROUPED DATA
• Recall the procedure for approximating the median using grouped data.
• Determining quartiles:
– Construct the cumulative frequency distribution.
– Determine the quartile class:
– Q1 class: the class containing position n/4.
– Q3 class: the class containing position 3n/4.
– Find the values.
EXAMPLE 2.7
• Following are final exam scores, arranged in increasing order, for 28
students.
58 59 62 64 67 68 69 71 73 74 74 75 76 76
76 77 78 78 78 82 82 84 86 87 87 88 91 97
a. Find the 1st quartile of the scores. (70)
b. Find the 3rd quartile of the scores. (83)
EXAMPLE 2.8
The following table presents the daily high temperature recorded by a manufacturer
of insulation for 20 randomly selected winter days (refer to Example 2.2).
Calculate Q1 and Q3.
Class      Frequency   Cumulative Frequency
12 - 21            3                      3
22 - 31            6                      9
32 - 41            5                     14
42 - 51            4                     18
52 - 61            2                     20
Total             20
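One way to carry out this calculation is to reuse the grouped-median formula with n/4 and 3n/4 in place of n/2. That substitution is an assumption of this sketch, since the notes do not show the grouped quartile formula explicitly; the function name is illustrative, and whole-number class limits are assumed.

```python
def grouped_quantile(classes, width, fraction):
    """Approximate a quantile from (lower limit, upper limit, frequency) classes.

    Assumed formula (analogue of the grouped median):
        lower boundary + (fraction * n - F) / f * width
    where F is the cumulative frequency before the quantile class.
    """
    n = sum(f for _, _, f in classes)
    target = fraction * n
    cumulative = 0
    for lower, _, f in classes:
        if cumulative + f >= target:      # first class reaching the target position
            return (lower - 0.5) + (target - cumulative) / f * width
        cumulative += f


temperature_classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5),
                       (42, 51, 4), (52, 61, 2)]
q1 = grouped_quantile(temperature_classes, width=10, fraction=0.25)
q3 = grouped_quantile(temperature_classes, width=10, fraction=0.75)
print(round(q1, 2), round(q3, 2))
```

Under these assumptions the Q1 class is 22 - 31 (position 5) and the Q3 class is 42 - 51 (position 15).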
Conclusions: Measures of Position
Measurement     Data
Percentiles     Ungrouped, Grouped
1st Quartile    Ungrouped, Grouped
3rd Quartile    Ungrouped, Grouped
2.6 MEASURE OF DISPERSION
Measures of variation: range, interquartile range, variance, standard
deviation, and coefficient of variation.
• Measures of variation give information on the spread, variability, or
dispersion of the data values.
[Figure: two distributions with the same center but different variation.]
2.6.1 THE RANGE
2.6.1.1 UNGROUPED DATA
• Simplest measure of variation.
• Difference between the largest and the smallest values:
Range = Xlargest – Xsmallest
Example:
[Figure: dot plot on a number line from 0 to 14.]
Range = 13 - 1 = 12
2.6.1.2 GROUPED DATA
Class       Frequency
41 – 50             1
51 – 60             3
61 – 70             7
71 – 80            13
81 – 90            10
91 – 100            6
Total              40
Upper boundary of last class = 100.5
Lower boundary of first class = 40.5
Range = 100.5 – 40.5 = 60
Why The Range Can Be Misleading
• Does not account for how the data are distributed.
[Figure: two dot plots on a number line from 7 to 12 with different
distributions; in both, Range = 12 - 7 = 5.]
• Sensitive to outliers.
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
EXAMPLE 2.9
The following table presents the average monthly temperature, in degrees Fahrenheit,
for the cities of San Francisco and St. Louis. Compute the range for each city.
                Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
San Francisco    51   54   55   56   58   60   60   61   63   62   58   52
St. Louis        30   35   44   57   66   75   79   78   70   59   45   35
Source: National Weather Service
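A short Python sketch of the calculation, with illustrative variable names, shows how different the two ranges are.

```python
# Average monthly temperatures (degrees Fahrenheit), January to December.
san_francisco = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]
st_louis = [30, 35, 44, 57, 66, 75, 79, 78, 70, 59, 45, 35]

# Range = largest value - smallest value.
range_sf = max(san_francisco) - min(san_francisco)
range_stl = max(st_louis) - min(st_louis)

print(range_sf, range_stl)
```

San Francisco temperatures vary over 63 - 51 = 12 degrees, while St. Louis temperatures vary over 79 - 30 = 49 degrees.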
2.6.2 INTERQUARTILE RANGE (IQR)
• Quartiles can be used as a rough measurement of variability.
• The interquartile range is the range of the middle 50% of the data.
• The IQR is a measure of variability that is not influenced by outliers or
extreme values.
• Measures like Q1, Q3, and IQR that are not influenced by outliers are
called resistant measures.
• It is defined as the difference between the third quartile and the first
quartile.
IQR = Q3 – Q1
EXAMPLE 2.10
The table below lists the total revenue for the top 12 tourism companies in Malaysia.
109.7 79.9 74.1 121.2 76.4 80.2 82.1 79.4 89.3 98.0 103.5 86.8
Determine the interquartile range of the data.
Ordered data:
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
Answer: Q1 = 79.65, Q3 = 100.75, IQR = 21.1
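The answer can be checked with the percentile rule for ungrouped data. The sketch assumes the position rule L = (p / 100) × n with averaging when L is a whole number; the function name is illustrative.

```python
import math


def value_at_percentile(sorted_values, p):
    """pth percentile of already-sorted data, using L = p * n / 100."""
    L = p * len(sorted_values) / 100
    if L.is_integer():
        k = int(L)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    return sorted_values[math.ceil(L) - 1]


revenue = sorted([109.7, 79.9, 74.1, 121.2, 76.4, 80.2,
                  82.1, 79.4, 89.3, 98.0, 103.5, 86.8])

q1 = value_at_percentile(revenue, 25)   # average of the 3rd and 4th values
q3 = value_at_percentile(revenue, 75)   # average of the 9th and 10th values
iqr = q3 - q1

print(round(q1, 2), round(q3, 2), round(iqr, 2))
```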
2.6.3 VARIANCE
• Although the range is easy to compute, it is not often used in practice. The
reason is that the range involves only two values from the data set: the largest
and the smallest.
• The measures of spread that are most often used are the variance and the
standard deviation, which use every value in the data set.
• When a data set has a small amount of spread, like the San Francisco
temperatures, most of the values will be close to the mean. When a data set has
a larger amount of spread, more of the data values will be far from the mean.
• The variance is a measure of how far the values in a data set are from the
mean, on the average.
• The variance is computed slightly differently for populations and samples.
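Population variance:
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]
where N is the number of observations in the population.
Sample variance:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
where n is the number of observations in the sample.
• In the sample formula, the mean μ is replaced by the sample mean and the
denominator is n – 1 instead of N. The sample variance is denoted by s².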
Sample Variance
Ungrouped data:
• 1st: Find the sum of the values (Σx).
• 2nd: Square each value and find the sum (Σx²).
• 3rd: Calculate the sample variance.
Grouped data:
• 1st: Compute the midpoint (x) for each class.
• 2nd: Multiply the midpoint by the class frequency (fx). Find the sum (Σfx).
• 3rd: Square the midpoint (x²) and multiply by the frequency (fx²), then sum the
values (Σfx²).
• 4th: Calculate the sample variance.
EXAMPLE 2.11
A company that manufactures batteries is testing a new type of battery designed for
laptop computers. They measure the lifetimes, in hours, of six batteries, and the results
are presented in the following table. Find the variance of the lifetimes. (2)
Battery lifetime (hours): 3 4 6 5 4 2
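A minimal Python sketch of the sample variance for this example follows. It uses the definition with the n - 1 denominator and cross-checks against the standard library's `statistics.variance`, which also computes the sample variance.

```python
import math
import statistics

lifetimes = [3, 4, 6, 5, 4, 2]   # battery lifetimes in hours

n = len(lifetimes)
mean = sum(lifetimes) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in lifetimes]
sample_variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance (units: hours).
sample_sd = math.sqrt(sample_variance)

print(mean, sample_variance, round(sample_sd, 3))
print(statistics.variance(lifetimes))   # cross-check with the standard library
```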
EXAMPLE 2.12
No. of text     No. of students    Class midpoint,        fx
messages sent   (frequency, f)     x
0 – 49                       10              24.5      245.0
50 – 99                       5              74.5      372.5
100 – 149                    13             124.5     1618.5
150 – 199                    11             174.5     1919.5
200 – 249                     7             224.5     1571.5
250 – 299                     4             274.5     1098.0
Σfx = 6825
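The grouped-data steps can be carried through in Python. The computational formula s² = (Σfx² - (Σfx)² / n) / (n - 1) used here is an assumption: it is the usual shortcut form matching the listed steps, but it is not written out in the extracted notes. Variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class of text messages sent.
classes = [(0, 49, 10), (50, 99, 5), (100, 149, 13),
           (150, 199, 11), (200, 249, 7), (250, 299, 4)]

n = sum(f for _, _, f in classes)                           # total frequency
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))       # sum of f * x
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints)) # sum of f * x^2

# Assumed shortcut formula for the grouped sample variance.
sample_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sample_sd = sample_variance ** 0.5

print(n, sum_fx, sum_fx2)
print(round(sample_variance, 2), round(sample_sd, 2))
```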
2.6.4 STANDARD DEVIATION
• Because the variance is computed using squared deviations, the units of the
variance are the squared units of the data.
• For example, in the battery lifetime example, the units of the data are hours, and
the units of variance are squared hours.
• In most situations, it is better to use a measure of spread that has the same
units as the data.
• We do this simply by taking the square root of the variance. This quantity is
called the standard deviation.
• The standard deviation of a sample is denoted by s, and the standard deviation
of a population is denoted by σ.
Important properties of standard deviation
• The standard deviation is a measure of variation of all values from the
mean.
• The value of the standard deviation is usually positive (it is never
negative).
• The value of the standard deviation can increase dramatically with the
inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation are the same as the units of the
original data values.
Comparing Standard Deviations
[Figure: two distributions, one with a smaller standard deviation and one
with a larger standard deviation.]
Summary Characteristics
• The more the data are spread out, the greater the range, variance, and standard
deviation.
• The more the data are concentrated, the smaller the range, variance, and
standard deviation.
• If the values are all the same (no variation), all these measures will be zero.
• None of these measures are ever negative.

2.6.5 THE COEFFICIENT OF VARIATION
• Measures relative variation.
• Always in percentage (%).
• Shows variation relative to mean.
• Can be used to compare the variability of two or more
sets of data measured in different units.
\[ CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\% \]
EXAMPLE 2.13 Comparing Coefficients of Variation
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.
• Stock B:
• Stock C:
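As a sketch of the calculation, the coefficient of variation for Stock A can be computed as follows. Only the Stock A figures stated above are used, and the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = (S / X-bar) * 100, expressed as a percentage."""
    return 100 * std_dev / mean


# Stock A: mean price last year $50, standard deviation $5.
cv_stock_a = coefficient_of_variation(std_dev=5, mean=50)
print(f"Stock A: CV = {cv_stock_a:.0f}%")
```

Because the CV is unit-free, the same function can be applied to any other stock's mean and standard deviation to compare relative variability.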
Conclusions: Measures of Dispersion
Measurement                          Data
Range                                Ungrouped, Grouped
Interquartile range, IQR = Q3 – Q1   Ungrouped, Grouped
Variance                             Ungrouped, Grouped
Standard deviation                   Ungrouped, Grouped
2.7 MEASURE OF SKEWNESS/SHAPE
• Describes how data are distributed.
• Two useful shape-related statistics are:
– Skewness: measures the extent to which data values are not symmetrical.
– Kurtosis: measures the peakedness of the curve of the
distribution, that is, how sharply the curve rises approaching the
center of the distribution.
2.7.1 COEFFICIENT OF SKEWNESS
• Measures the extent to which data are not symmetrical.
• To determine the skewness of the data:
– If the value is positive, the data are right skewed.
– If the value is negative, the data are left skewed.
– If the value is 0, the data are symmetric.
                      Left-Skewed     Symmetric       Right-Skewed
Mean and median       Mean < Median   Mean = Median   Median < Mean
Skewness statistic    < 0             0               > 0
2.7.2 KURTOSIS
Measures how sharply the curve rises approaching the center of the distribution.
• Sharper peak than bell-shaped: Kurtosis > 0.
• Bell-shaped: Kurtosis = 0.
• Flatter than bell-shaped: Kurtosis < 0.
The Five Number Summary
The five numbers that help describe the center, spread and shape of data are:
• Xlargest.
• Third Quartile (Q3).
• Median (Q2).
• First Quartile (Q1).
• Xsmallest.
• This summary is more informative when it is displayed on a diagram drawn to
scale.
• A graphic display that accomplishes this is known as a box-and-whiskers display
(boxplot).
Five Number Summary and The Boxplot
• The Boxplot: A graphical display of the data based on the five-number
summary:
Xsmallest -- Q1 -- Median -- Q3 -- Xlargest
Example:
[Figure: boxplot marking Xsmallest, Q1, Median, Q3 and Xlargest, with 25% of
the data in each of the four segments between them.]
Calculating The Interquartile Range
Example:
X minimum   Q1   Median (Q2)   Q3   X maximum
       12   30            45   57          70
Each of the four segments between these values contains 25% of the data.
Interquartile range = 57 – 30 = 27
Five Number Summary: Shape of Boxplots
• If data are symmetric around the median then the box and central
line are centered between the endpoints.
• A boxplot can be shown in either a vertical or horizontal orientation.
Distribution Shape and The Boxplot
[Figure: three boxplots marking Q1, Q2 and Q3 for left-skewed, symmetric and
right-skewed distributions.]
Chapter Summary
In this chapter we covered:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Describing the properties of central tendency, variation, and shape in
numerical variables.
• Constructing and interpreting a boxplot.
