Chapter 2 171
DESCRIPTIVE DATA
SQQS1013 ELEMENTARY STATISTICS
ORGANIZING AND
VISUALIZING DATA
Objectives
In this chapter you learn:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Organizing and visualizing a mix of variables.
• The challenge in organizing and visualizing variables.
2.1 INTRODUCTION
Example:
Here is a list of questions asked in a large statistics class and the
data values given by one of the students:
i. What is your sex (m=male, f=female)?
ii. How many hours did you sleep last night?
iii. Randomly pick a letter – S or Q.
iv. What is your height in inches?
v. What’s the fastest you’ve ever driven a car (mph)?
Raw data - Data recorded in the sequence in which they were originally
collected, before being processed or ranked.
Tallying Data
• One categorical variable: summary table.
• Two categorical variables: contingency table.
• A summary table tallies the frequencies or percentages of
items in a set of categories so that you can see differences
between categories.
• A contingency table is used to study patterns that may exist between
the responses of two or more categorical variables.
– For two variables, the tallies for one variable are located in the rows and
the tallies for the second variable are located in the columns.
Example 2.1-Contingency Table
• A random sample of 400 invoices is drawn.
• Each invoice is categorized as a small, medium, or large amount.
• Each invoice is also examined to identify if there are any errors.
• These data are then organized in the contingency table below.
Table 2.4 Contingency Table Showing Frequency of Invoices Categorized
By Size and The Presence Of Errors
                 No Errors   Errors   Total
Small Amount        170        20      190
Medium Amount       100        40      140
Large Amount         65         5       70
Total               335        65      400
Contingency Table Based On
Percentage Of Overall Total DCOVA
                 No Errors   Errors   Total
Small Amount        170        20      190
Medium Amount       100        40      140
Large Amount         65         5       70
Total               335        65      400
42.50% = 170 / 400
25.00% = 100 / 400
16.25% = 65 / 400
                 No Errors   Errors   Total
Small Amount       42.50%     5.00%   47.50%
Medium Amount      25.00%    10.00%   35.00%
Large Amount       16.25%     1.25%   17.50%
Total              83.75%    16.25%   100.0%
83.75% of sampled invoices have no errors and 47.50% of sampled invoices
are for small amounts.
Contingency Table Based On
Percentage of Row Totals DCOVA
                 No Errors   Errors   Total
Small Amount        170        20      190
Medium Amount       100        40      140
Large Amount         65         5       70
Total               335        65      400
89.47% = 170 / 190
71.43% = 100 / 140
92.86% = 65 / 70
                 No Errors   Errors   Total
Small Amount       89.47%    10.53%   100.0%
Medium Amount      71.43%    28.57%   100.0%
Large Amount       92.86%     7.14%   100.0%
Total              83.75%    16.25%   100.0%
Medium invoices have a larger chance (28.57%) of having errors than
small (10.53%) or large (7.14%) invoices.
Contingency Table Based On
Percentage Of Column Totals DCOVA
                 No Errors   Errors   Total
Small Amount        170        20      190
Medium Amount       100        40      140
Large Amount         65         5       70
Total               335        65      400
50.75% = 170 / 335
30.77% = 20 / 65
                 No Errors   Errors   Total
Small Amount       50.75%    30.77%   47.50%
Medium Amount      29.85%    61.54%   35.00%
Large Amount       19.40%     7.69%   17.50%
Total              100.0%    100.0%   100.0%
There is a 61.54% chance that invoices with errors are of medium size.
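The three percentage tables can be reproduced from the counts in Table 2.4. The following is a minimal sketch in Python, not part of the original slides; the names `counts`, `pct`, `overall`, `by_row`, and `by_col` are illustrative choices.

```python
# Counts from Table 2.4: each row is an invoice size,
# each tuple is (no errors, errors).
counts = {
    "Small": (170, 20),
    "Medium": (100, 40),
    "Large": (65, 5),
}

grand_total = sum(sum(row) for row in counts.values())
col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]


def pct(part, whole):
    """Return part / whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Percentage of the overall total, of each row total, and of each column total.
overall = {s: [pct(c, grand_total) for c in row] for s, row in counts.items()}
by_row = {s: [pct(c, sum(row)) for c in row] for s, row in counts.items()}
by_col = {
    s: [pct(c, col_totals[j]) for j, c in enumerate(row)]
    for s, row in counts.items()
}

for size in counts:
    print(size, overall[size], by_row[size], by_col[size])
```

Each version answers a different question: the overall table describes the whole sample, the row table compares error rates between invoice sizes, and the column table describes the make-up of the invoices with and without errors.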
2.2.2 Visualizing Categorical Data
DCOVA
Visualizing categorical data:
• Summary table for one variable.
• Contingency table for two variables.
Source: Data extracted from A. Bhalla, “Don’t Misuse the Pareto Principle,”
Six Sigma Forum Magazine, May 2009, pp. 15–18.
The Pareto Chart (cont'd) DCOVA
[Figure: Pareto chart with the “Vital Few” categories marked]
Multiple (Side By Side) Bar Charts
The side-by-side bar chart represents the data from a contingency table.
[Figure: side-by-side bar chart "Invoice Size Split Out By Errors & No Errors",
with bars for Small, Medium, and Large invoices on a percentage scale up to 100.00%]
Numerical Data
• Ordered Array
• Frequency Distributions
• Cumulative Distributions
Ordered Array DCOVA
An ordered array is a sequence of data, in rank order, from the smallest value to the
largest value.
• When comparing two or more groups with different sample sizes, you must use either a
relative frequency or a percentage distribution.
2.3.2 Visualizing Numerical Data
DCOVA
Numerical Data
• Ordered Array
• Frequency Distributions and Cumulative Distributions
The class boundaries (or class midpoints) are shown on the horizontal axis.
The height of each bar represents the frequency, relative frequency, or percentage.
The Histogram DCOVA
Class      Frequency   Relative Frequency   Percentage
12 - 21        3              .15               15
22 - 31        6              .30               30
32 - 41        5              .25               25
42 - 51        4              .20               20
52 - 61        2              .10               10
Total         20             1.00              100
[Figure: Histogram: Temperature]
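The relative frequency and percentage columns follow directly from the class frequencies, and the running totals give the cumulative percentages that an ogive plots. A minimal sketch in Python (variable names are illustrative, not from the slides):

```python
# Class frequencies from the temperature table above.
classes = ["12 - 21", "22 - 31", "32 - 41", "42 - 51", "52 - 61"]
freqs = [3, 6, 5, 4, 2]

n = sum(freqs)                                  # total number of observations
rel_freqs = [f / n for f in freqs]              # relative frequency per class
percentages = [100 * f / n for f in freqs]      # percentage per class

# Cumulative percentage: share of observations up to the end of each class.
cum_percentages = []
running = 0
for f in freqs:
    running += f
    cum_percentages.append(100 * running / n)

for row in zip(classes, freqs, rel_freqs, percentages, cum_percentages):
    print(*row)
```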
Ogive
2.3.3 Visualizing Two Numerical Variables
DCOVA
Two Numerical Variables
• Scatter Plot
• Time-Series Plot
The Scatter Plot
DCOVA
Scatter plots are used for numerical data consisting of paired observations
taken from two numerical variables.
One variable is measured on the vertical axis and the other variable is
measured on the horizontal axis.
Year   Number of franchises
2007    43
2008    54
2009    60
2010    73
2011    82
2012    95
2013   107
2014    99
2015    95
[Figure: plot of number of franchises (vertical axis) against year
(horizontal axis, 2007 to 2015)]
EXERCISE 2.2
The histogram below represents scores achieved by 200 job applicants on a
personality profile.
[Figure: relative-frequency histogram; horizontal axis: score, from 0 to 70 in
classes of width 10; vertical axis: Rel.Freq., from 0.00 to 0.30; three bars are
labelled 0.20 and four bars are labelled 0.10]
i. What percentage of the job applicants scored between 10 and 20?
ii. What percentage of the job applicants scored below 50?
iii. How many job applicants scored from 30 to below 60?
iv. How many job applicants scored 50 or above?
v. 90% of the job applicants scored above or equal to ________.
vi. Half of the job applicants scored below ________.
NUMERICAL
DESCRIPTIVE MEASURE
Objectives
In this topic, you learn to:
• Describe the properties of central tendency, variation, and
shape in numerical variables.
• Construct and interpret a boxplot.
Summary DCOVA
The central tendency is the extent to which the values of a numerical
variable group around a typical or central value.
The shape is the pattern of the distribution of values from the lowest
value to the highest value.
2.4 MEASURE OF
CENTRAL TENDENCY
2.4.1 MEAN
2.4.1.1 UNGROUP DATA DCOVA
• The arithmetic mean (often just called the “mean”) is the most common
measure of central tendency.
• Mean = sum of values divided by the number of values.
• Affected by extreme values (outliers).
•Population mean, if data comes from population.
•Sample mean, if data comes from sample.
For a sample of size n:
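$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is the sample size and X_1, X_2, ..., X_n are the observed values.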
EXAMPLE 2.3 DCOVA
[Figure: two dot plots on a scale from 11 to 20; the first has Mean = 13,
the second has Mean = 14]
(11 + 12 + 13 + 14 + 15) / 5 = 65 / 5 = 13
(11 + 12 + 13 + 14 + 20) / 5 = 70 / 5 = 14
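The effect of a single extreme value on the mean can be checked directly. This is a small illustrative sketch in Python; the helper name `sample_mean` is hypothetical and simply applies the formula for a sample of size n.

```python
def sample_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


without_outlier = [11, 12, 13, 14, 15]
with_outlier = [11, 12, 13, 14, 20]   # the largest value is replaced by 20

print(sample_mean(without_outlier))   # 13.0
print(sample_mean(with_outlier))      # 14.0
```

Changing one value out of five moves the mean by a full unit, which is what "affected by extreme values" means in practice.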
2.4.1.2 GROUP DATA DCOVA
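$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n}$$
where X_i is the mid-point of class i, f_i is the frequency of class i, and
the sum of the f_i is the total frequency (the sample size).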
EXAMPLE 2.3
a. During a semester, a student took five exams. The population of
exam scores is 78, 83, 92, 68, and 85. Find the mean. (406, 81.2)
b. The following table shows the speeds (in km/h) of 30 cars measured
at a certain checkpoint. Find the mean. (1504, 50.13)
41 53 58 67 33 61 43 45 42 67
39 48 36 47 34 59 57 54 65 69
63 42 60 48 66 30 30 46 52 49
c. The following table presents the daily high temperature at a
manufacturer of insulation for 20 randomly selected winter
days (refer to Example 2.2). Approximate the mean daily high
temperature. (34.5)
Class Frequency
12 - 21 3
22 - 31 6
32 - 41 5
42 - 51 4
52 - 61 2
Total 20
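Part (c) can be worked through with the grouped-data formula: take each class mid-point, weight it by the class frequency, and divide by the total frequency. A minimal sketch in Python, with illustrative variable names:

```python
# (lower limit, upper limit, frequency) for each temperature class.
classes = [(12, 21, 3), (22, 31, 6), (32, 41, 5), (42, 51, 4), (52, 61, 2)]

# Class mid-point = average of the lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]

n = sum(f for _, _, f in classes)                               # total frequency
sum_fx = sum(f * x for (_, _, f), x in zip(classes, midpoints))  # sum of f * X

grouped_mean = sum_fx / n
print(midpoints)      # [16.5, 26.5, 36.5, 46.5, 56.5]
print(sum_fx, n)      # 690.0 20
print(grouped_mean)   # 34.5
```

The result is an approximation because every observation in a class is treated as if it sat at the class mid-point.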
2.4.2 MEDIAN
2.4.2.1 UNGROUP DATA DCOVA
• In an ordered array, the median is the “middle” number (50% above, 50%
below).
[Figure: two dot plots on a scale from 11 to 20; Median = 13 in both]
Class Frequency
12 - 21 3
22 - 31 6
32 - 41 5
42 - 51 4
52 - 61 2
Total 20
2.4.3 MODE
2.4.3.1 UNGROUP DATA
DCOVA
• Value that occurs most often.
• Not affected by extreme values.
• Used for either numerical or categorical data.
• There may be no mode.
• There may be several modes.
[Figure: two dot plots; the first, on a scale from 0 to 14, has Mode = 9;
the second, on a scale from 0 to 6, has No Mode]
2.4.3.2 GROUP DATA
• Determine class mode (or, modal class) - the class with the highest
frequency.
• Use the following formula Class width
[Figure: histogram of Frequency against No. of text messages, with class
boundaries -0.5, 49.5, 99.5, 149.5, 199.5, 249.5, 299.5; MODE = 140]
EXAMPLE 2.5
a. Ten students were asked how many siblings they had. The results,
arranged in order, were
0 1 1 1 1 2 2 3 3 6
Find the mode of this data set. (1)
b. The following table presents the daily high temperature at a
manufacturer of insulation for 20 randomly selected winter
days (refer to Example 2.2). Approximate the mode of the daily high
temperature. (29.0)
Class Frequency
12 - 21 3
22 - 31 6
32 - 41 5
42 - 51 4
52 - 61 2
Total 20
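The grouped-mode formula itself did not survive in this copy of the slides. The sketch below assumes the commonly taught interpolation formula, mode = L + [d1 / (d1 + d2)] × c, where L is the lower boundary of the modal class, c is the class width, d1 is the modal frequency minus the frequency of the class before it, and d2 is the modal frequency minus the frequency of the class after it. That assumption reproduces the stated answer of 29.0, and gives 139.5 (about 140) for the text-message frequencies tabulated in Example 2.12. The function name `grouped_mode` is illustrative.

```python
def grouped_mode(classes):
    """Approximate the mode from (lower limit, upper limit, frequency) classes.

    Assumes integer class limits, so the lower class boundary is the lower
    limit minus 0.5 and the class width is upper - lower + 1. Also assumes a
    single modal class whose frequency exceeds at least one neighbour.
    """
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                  # position of the modal class
    lo, hi, f = classes[i]
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    lower_boundary = lo - 0.5
    width = hi - lo + 1
    d1 = f - prev_f
    d2 = f - next_f
    return lower_boundary + d1 / (d1 + d2) * width


temperature = [(12, 21, 3), (22, 31, 6), (32, 41, 5), (42, 51, 4), (52, 61, 2)]
text_messages = [(0, 49, 10), (50, 99, 5), (100, 149, 13),
                 (150, 199, 11), (200, 249, 7), (250, 299, 4)]

print(grouped_mode(temperature))     # 29.0
print(grouped_mode(text_messages))   # 139.5
```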
Which Measure to Choose?
DCOVA
The mean is generally used, unless extreme
values (outliers) exist.
The median is often used instead, since it is not sensitive to
extreme values. For example, median home prices may be
reported for a region.
In many situations it makes sense to report
both the mean and the median.
Describing the Shape of a Data Set
• The mean and median measure the center of a data set in different
ways. When a data set is symmetric, the mean, median and mode are
equal.
• When a data set is skewed to the right, there are large values in the
right tail. Because the median is resistant while the mean is not, the
mean is generally more affected by these large values. Therefore, for a
data set that is skewed to the right, the mean is often greater than the
median, which in turn is greater than the mode.
• Similarly, when a data set is skewed to the left, the mean is often less
than the median, which in turn is less than the mode.
i. Approximately Symmetric
Shape: Approximately Symmetric
Relationship Between
the Mean, Median and Mode: Mean, median and mode are approximately the same
ii. Skewed to the Right
Shape: Skewed to the Right
Relationship Between
the Mean, Median and Mode: Mean is noticeably greater than the median, which is
greater than the mode.
iii. Skewed to the Left
Shape: Skewed to the Left
Relationship Between
the Mean, Median and Mode: Mean is noticeably less than the median, which is less
than the mode.
Summary of Measure of Central
Tendency
Data
Measure
Ungrouped Grouped
Mean
Median Med
2.5 MEASURE OF
POSITION
DCOVA
Position
Percentiles Quartiles
Measures of position are techniques that divide a set of data into equal groups.
To determine a measure of position, the data must be sorted from lowest to highest. The
different measures of position are percentiles and quartiles.
2.5.1 PERCENTILES
• The mean and median of a data set describe the center of a
distribution (quantitative).
• For some data it is often useful to compute measures of positions
other than the center, to get a more detailed description of the
distribution.
• Percentiles provide a way to do this. Percentiles divide a data set into
hundredths.
• Definition: For a number p between 1 and 99, the pth percentile
separates the lowest p% of the data from the highest (100 – p)%.
UNGROUPED DATA
• First, the data need to be arranged in increasing order.
• To compute the data value corresponding to a given percentile, first compute the position L = (p / 100) × n, where n is the number of values:
– If L is a whole number, then the pth percentile is the average of the number in position L and the number in position (L+1).
– If L is not a whole number, round it up to the next higher whole number. The pth percentile is the number in the position
corresponding to the rounded-up value.
EXAMPLE 2.6
A teacher gives a 20-point test to 10 students. The scores are shown
here.
18 15 12 6 8 2 3 5 20 10
1. Find the values corresponding to the 25th and 60th percentiles (5, 11).
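The two rules for ungrouped data translate into a short function. This sketch assumes the position is L = (p / 100) × n, which is consistent with the answers (5, 11) given for this example; the function name `percentile` is illustrative.

```python
import math


def percentile(data, p):
    """Return the pth percentile of data using the whole-number / round-up rule."""
    values = sorted(data)                 # data must be in increasing order
    n = len(values)
    position = p * n / 100                # L, a 1-based position (assumed formula)
    if position == int(position):
        k = int(position)
        # L is a whole number: average the values in positions L and L + 1.
        return (values[k - 1] + values[k]) / 2
    # L is not a whole number: round up and take the value in that position.
    return values[math.ceil(position) - 1]


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile(scores, 25))   # 5
print(percentile(scores, 60))   # 11.0
```

For the 25th percentile L = 2.5, which rounds up to position 3 in the sorted scores (2, 3, 5, ...). For the 60th percentile L = 6, so the 6th and 7th values (10 and 12) are averaged.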
2.5.2 QUARTILES
• There are 3 percentiles that are used more often than the others - the 25th,
the 50th, and the 75th .
• These percentiles divide the data into 4 parts, each of which contains
approximately one quarter of the data.
• Thus, these 3 percentiles are called quartiles.
• You can visualize the distribution of the values of a numerical variable by:
– Computing the quartiles.
– Computing the five-number summary.
– Constructing a boxplot.
DCOVA
2.5.2 QUARTILE MEASURES
2.5.2.1 UNGROUPED DATA
• Quartiles split the ranked data into 4 segments with an equal number
of values per segment.
25% 25% 25% 25%
Q1 Q2 Q3
The first quartile, Q1, is the value for which 25% of the
values are smaller and 75% are larger.
Q2 is the same as the median (50% of the values are
smaller and 50% are larger).
Only 25% of the values are greater than the third quartile, Q3, which
separates the lowest 75% of the data from the highest 25%.
• Determining quartiles
i. Arrange data in ascending order
ii. Find 25th and 75th percentiles or find the depth of Q1 and Q3,
EXAMPLE 2.8
The following table presents the daily high temperature at a manufacturer
of insulation for 20 randomly selected winter days (refer to Example 2.2).
Calculate Q1 and Q3.
Total 20
Conclusions: Measures of Positions
Measurement      Ungrouped data   Grouped data
Percentiles
1st Quartile
3rd Quartile
2.6 MEASURE OF
DISPERSION
DCOVA
Variation
Interquartile
Range
Example:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Range = 13 - 1 = 12
2.6.1 THE RANGE
2.6.1.2 GROUP DATA
Class       Frequency
41 – 50         1
51 – 60         3
61 – 70         7
71 – 80        13
81 – 90        10
91 – 100        6
Total          40
Upper bound of last class = 100.5
Lower bound of first class = 40.5
Range = 100.5 – 40.5 = 60
Why The Range Can Be Misleading
DCOVA
Does not account for how the data are distributed.
[Figure: two dot plots on a scale from 7 to 12 with differently distributed
values; Range = 12 - 7 = 5 in both]
Sensitive to outliers.
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
EXAMPLE 2.9
The following table presents the average monthly temperature, in degrees Fahrenheit,
for the cities of San Francisco and St. Louis. Compute the range for each city.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
San Francisco 51 54 55 56 58 60 60 61 63 62 58 52
St. Louis 30 35 44 57 66 75 79 78 70 59 45 35
Source: National Weather Service
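As a quick check of this example, the range of each city is the largest monthly value minus the smallest. A minimal sketch in Python (the helper name `data_range` is illustrative):

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


san_francisco = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]
st_louis = [30, 35, 44, 57, 66, 75, 79, 78, 70, 59, 45, 35]

print(data_range(san_francisco))   # 12
print(data_range(st_louis))        # 49
```

St. Louis temperatures vary far more over the year than those of San Francisco, which the two ranges make visible with a single number each.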
2.6.2 INTERQUARTILE RANGE (IQR)
• Quartiles can be used as a rough measurement of variability.
• The interquartile range is the range of the middle 50% of the data.
• The IQR is a measure of variability that is not influenced by outliers or
extreme values.
• Measures like Q1, Q3, and IQR that are not influenced by outliers are
called resistant measures.
• It is defined as the third quartile minus the first
quartile.
IQR = Q3 – Q1
EXAMPLE 2.10
The table below lists the total revenue for the top 12 tourism companies in Malaysia.
109.7 79.9 74.1 121.2 76.4 80.2 82.1 79.4 89.3 98.0 103.5 86.8
Determine the interquartile range of the data. (79.5, 102.1, 22.6)
74.1 76.4 79.4 79.9 80.2 82.1 86.8 89.3 98.0 103.5 109.7 121.2
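The stated answers (79.5, 102.1, 22.6) are consistent with locating each quartile at depth (n + 1) × p in the ordered data and interpolating between neighbouring values. The depth formula is not shown in this copy of the slides, so the sketch below treats it as an assumption; the function name `quantile_by_depth` is hypothetical.

```python
def quantile_by_depth(sorted_values, p):
    """Value at 1-based depth (n + 1) * p, interpolating between neighbours.

    The depth rule is an assumption chosen to match the example's answers.
    """
    n = len(sorted_values)
    depth = (n + 1) * p
    lower = int(depth)               # position of the value just below the depth
    fraction = depth - lower
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


revenue = sorted([109.7, 79.9, 74.1, 121.2, 76.4, 80.2,
                  82.1, 79.4, 89.3, 98.0, 103.5, 86.8])

q1 = quantile_by_depth(revenue, 0.25)   # depth 3.25
q3 = quantile_by_depth(revenue, 0.75)   # depth 9.75
iqr = q3 - q1
print(round(q1, 1), round(q3, 1), round(iqr, 1))
```

Other quartile conventions exist and can give slightly different values for the same data, so the rule in use should always be stated.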
2.6.3 VARIANCE
• Although the range is easy to compute, it is not often used in practice. The
reason is that the range involves only two values from the data set; the largest
and smallest.
• The measures of spread that are most often used are the variance and the
standard deviation, which use every value in the data set.
• When a data set has a small amount of spread, like the San Francisco
temperatures, most of the values will be close to the mean. When a data set has
a larger amount of spread, more of the data values will be far from the mean.
• The variance is a measure of how far the values in a data set are from the
mean, on the average.
• The variance is computed slightly differently for populations and samples.
Population Sample
• In the sample formula, the population mean μ is replaced
by the sample mean and the
denominator is n – 1 instead of N. The
sample variance is denoted by s².
Sample Variance
Ungrouped:
• 1st:
• 2nd:
• 3rd: Calculate the sample variance
Grouped:
• 1st: Compute the midpoint (x) for each class
• 2nd: Multiply the midpoint by the class frequency (fx). Find the sum (Σfx)
EXAMPLE 2.11
A company that manufactures batteries is testing a new type of battery designed for
laptop computers. They measure the lifetimes, in hours, of six batteries, and the results
are presented in the following table. Find the variance of the lifetimes. (2)
Battery Lifetime 3 4 6 5 4 2
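The battery example can be worked step by step. The sketch below assumes the usual sample-variance definition, the sum of squared deviations from the sample mean divided by n – 1, which matches the description of the sample formula and the stated answer of 2. Python's standard `statistics.variance` is used only as a cross-check.

```python
import statistics

lifetimes = [3, 4, 6, 5, 4, 2]          # battery lifetimes in hours

n = len(lifetimes)
mean = sum(lifetimes) / n               # sample mean: 24 / 6 = 4.0

# Squared deviation of each value from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in lifetimes]

sample_variance = sum(squared_deviations) / (n - 1)   # 10 / 5 = 2.0
sample_std_dev = sample_variance ** 0.5               # square root of the variance

print(mean, sample_variance, round(sample_std_dev, 4))
print(statistics.variance(lifetimes))   # cross-check with the standard library
```

The variance is in squared hours; taking the square root returns a measure of spread in hours.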
EXAMPLE 2.12
No. of text      No. of students     Class midpoint,    fx
messages sent    (frequency, f)      x
0 – 49                10                 24.5            245.0
50 – 99                5                 74.5            372.5
100 – 149             13                124.5           1618.5
150 – 199             11                174.5           1919.5
200 – 249              7                224.5           1571.5
250 – 299              4                274.5           1098.0
Total                                                   6825
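The table gives the fx column and its sum; the remaining steps toward the grouped sample variance are not shown in this copy. The sketch below assumes the standard computational formula s² = [Σfx² – (Σfx)² / n] / (n – 1), with n the total frequency. Variable names are illustrative.

```python
# (class midpoint x, frequency f) from the text-message table.
table = [(24.5, 10), (74.5, 5), (124.5, 13),
         (174.5, 11), (224.5, 7), (274.5, 4)]

n = sum(f for _, f in table)                  # total frequency: 50
sum_fx = sum(f * x for x, f in table)         # 6825.0, as in the table
sum_fx2 = sum(f * x ** 2 for x, f in table)   # sum of f * x squared

grouped_mean = sum_fx / n                     # 136.5

# Assumed computational formula for the grouped sample variance.
grouped_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
grouped_std_dev = grouped_variance ** 0.5

print(n, sum_fx, grouped_mean)
print(round(grouped_variance, 2), round(grouped_std_dev, 2))
```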
2.6.4 STANDARD DEVIATION
• Because the variance is computed using squared deviations, the units of the
variance are the squared units of the data.
• For example, in the battery lifetime example (Example 2.11), the units of the
data are hours, and the units of the variance are squared hours.
• In most situations, it is better to use a measure of spread that has the same
units as the data.
• We do this simply by taking the square root of the variance. This quantity is
called the standard deviation.
• The standard deviation of a sample is denoted s, and the standard deviation
of a population is denoted by σ.
Important properties of standard
deviation
• The standard deviation is a measure of variation of all values from the
mean.
• The value of the standard deviation is usually positive (it is never
negative).
• The value of the standard deviation can increase dramatically with the
inclusion of one or more outliers (data values far away from all others).
• The units of the standard deviation are the same as the units of the
original data values.
Comparing Standard Deviations
The more the data are concentrated, the smaller the range, variance, and
standard deviation.
If the values are all the same (no variation), all these measures will be zero.
$$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$$
EXAMPLE 2.13 Comparing Coefficients of
Variation
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.
• Stock B:
– Mean price last year = $100.
– Standard deviation = $5.
Comparing Coefficients of Variation (con’t)
• Stock A:
– Mean price last year = $50.
– Standard deviation = $5.
• Stock C:
– Mean price last year = $8.
– Standard deviation = $2.
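Applying CV = (S / mean) × 100% to the three stocks shows why the coefficient of variation is useful when means differ. A minimal sketch in Python (the function name is illustrative):

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
cv_c = coefficient_of_variation(2, 8)     # Stock C

print(cv_a, cv_b, cv_c)   # 10.0 5.0 25.0
```

Stocks A and B have the same standard deviation, but B is less variable relative to its price. Stock C has the smallest standard deviation of the three yet the largest variation relative to its mean.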
Conclusions: Measures of
Dispersion
Measurement            Ungrouped        Grouped
Range
Interquartile range    IQR = Q3 – Q1    IQR = Q3 – Q1
Variance
Standard deviation
2.7 MEASURE OF
SKEWNESS/SHAPE
• Describes how data are distributed.
• Two useful shape related statistics are:
– Skewness:
– Measures the extent to which data values are not symmetrical.
– Kurtosis:
– Kurtosis measures the peakedness of the curve of the
distribution—that is, how sharply the curve rises approaching the
center of the distribution.
2.7.1 COEFFICIENT OF SKEWNESS
• To determine the skewness of the data:
Skewness statistic < 0: left-skewed
Skewness statistic = 0: symmetric
Skewness statistic > 0: right-skewed
2.7.2 KURTOSIS
Measures how sharply the curve rises approaching the center of the distribution
• Sharper peak than bell-shaped (Kurtosis > 0)
• Bell-shaped (Kurtosis = 0)
• Flatter than bell-shaped (Kurtosis < 0)
The Five Number Summary
The five numbers that help describe the center, spread and shape of data are:
Xlargest.
Third Quartile (Q3).
Median (Q2).
First Quartile (Q1).
Xsmallest.
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Each of the four segments between these values contains 25% of the data.
Interquartile range = 57 – 30 = 27
Five Number Summary: 12, 30, 45, 57, 70
Shape of Boxplots
DCOVA
• If data are symmetric around the median then the box and central
line are centered between the endpoints.
[Figure: three boxplots, each marked with Q1, Q2, and Q3]
Chapter Summary
In this chapter we covered:
• Organizing categorical variables.
• Organizing numerical variables.
• Visualizing categorical variables.
• Visualizing numerical variables.
• Describing the properties of central tendency, variation, and shape in
numerical variables.
• Constructing and interpreting a boxplot.