
Chapter 3 - Data Description (for student)


SEHH1028

ELEMENTARY STATISTICS
CHAPTER 3
Data Description
In Chapter 3, the procedure for finding a percentile (page 81) is
VERY IMPORTANT. The calculations of quartiles (page 85), the five-number
summary (page 86), boxplots (page 87) and outliers (page 95)
all depend on the calculation of percentiles.
Data Description
3.1 Measures of Central Tendency
3.2 Measures of Variation
3.3 Measures of Position

SEHH1028 Elementary Statistics Page 2


3.1 Measures of Central Tendency
• Central tendency refers to the center of a distribution, or
the most typical case in a population.
• Examples of measures of central tendency
– A typical Hong Kong male is 171cm tall.
– Average number of people per household in Hong Kong is 2.9.
– The average household income in Hong Kong is $25000 per
month.

SEHH1028 Elementary Statistics Page 3


Statistics & Parameters
• A statistic is a characteristic or measure obtained by
using the data values from a sample.

• A parameter is a characteristic or measure obtained by
using all the data values from a specific population.

SEHH1028 Elementary Statistics Page 4


Relationship Between Statistics, Parameter,
Sample & Population
Population: All working adults in Hong Kong
Sample: A random sample of 1000 working adults in Hong Kong
(drawn from the population by sampling)

Parameter (unknown): Average salary of all working adults in
Hong Kong
Statistic: Average salary of the 1000 working adults in the sample
(used to estimate the parameter)

SEHH1028 Elementary Statistics Page 5


Common Measures of Central Tendency
• Mean
• Weighted Mean
• Median
• Mode
• Modal Class
• Midrange

SEHH1028 Elementary Statistics Page 6


Sample Mean & Population Mean
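The sample mean \(\bar{X}\) is equal to the sum of the data values in a sample
divided by the sample size \(n\).
\[\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n}\]
The population mean \(\mu\) is equal to the sum of the data values in a
population divided by the population size \(N\).
\[\mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum X}{N}\]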

SEHH1028 Elementary Statistics Page 7


Example 1 – Sample Mean
The scores of a sample of 9 students in a statistics test
are:
58, 75, 61, 44, 53, 66, 90, 84, 46
Find the sample mean. Correct your answer to 1 decimal
place.
Solution:
\[\bar{X} = \frac{\sum X}{n} = \frac{58 + 75 + 61 + 44 + 53 + 66 + 90 + 84 + 46}{9} = \frac{577}{9} = 64.1111 \approx 64.1 \text{ marks}\]
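The calculation in Example 1 can be checked with a few lines of Python. This is only a minimal sketch; the names `scores` and `sample_mean` are chosen here for illustration and are not part of the course material.

```python
# Scores of the 9 students from Example 1
scores = [58, 75, 61, 44, 53, 66, 90, 84, 46]

def sample_mean(values):
    """Sum of the data values divided by the sample size n."""
    return sum(values) / len(values)

mean_score = sample_mean(scores)
print(round(mean_score, 1))  # 64.1
```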
SEHH1028 Elementary Statistics Page 8
Computation of Mean from Grouped
Frequency Distribution
Suppose the following grouped frequency distribution represents
the no. of cigarettes patients smoked per day. Can we calculate the
sample mean of the 20 patients?
Class boundaries Frequency ( f )
4.5 – 7.5 2
7.5 – 10.5 3
10.5 – 13.5 6
13.5 – 16.5 5
16.5 – 19.5 3
19.5 – 22.5 1
n = 20

Since the raw data of the sample is not available, we cannot use the
sample mean formula shown earlier. We need another method to
calculate sample mean from grouped frequency distribution.

SEHH1028 Elementary Statistics Page 9


Computation of Sample Mean from
Grouped Frequency Distribution
1. Construct a frequency distribution with the following columns.
A | B | C | D
Class limits/boundaries | Frequency (\(f\)) | Class midpoint (\(X_m\)) | \(f \cdot X_m\)

2. Find the class midpoint of each class and place it in column C.
3. Multiply the frequency by the class midpoint for each class and place the
product in column D.
4. Find the sum of column D.
5. Divide the sum of column D by the sum of the frequencies in
column B.
The formula for the sample mean is \(\bar{X} = \dfrac{\sum f \cdot X_m}{n}\), where \(n = \sum f\).


SEHH1028 Elementary Statistics Page 10


Example 2 – Sample Mean & Grouped
Frequency Distribution
Suppose the following grouped frequency distribution represents
the no. of cigarettes patients smoked per day for a sample of 20
patients. Find the sample mean of the grouped frequency
distribution. Correct your answer to 1 decimal place.
Class boundaries Frequency ( f )
4.5 – 7.5 2
7.5 – 10.5 3
10.5 – 13.5 6
13.5 – 16.5 5
16.5 – 19.5 3
19.5 – 22.5 1
n = 20

SEHH1028 Elementary Statistics Page 11


Example 2 - Solution
Step 1: Construct the frequency distribution table.
A B C D
Class boundaries Frequency ( f ) Midpoint ( Xm ) ( f )( Xm )
4.5 – 7.5 2
7.5 – 10.5 3
10.5 – 13.5 6
13.5 – 16.5 5
16.5 – 19.5 3
19.5 – 22.5 1
Total n = 20
Step 2: Find the class midpoints and place them in column C.
Step 3: For each class, multiply the frequency by the class midpoint and
place the product in column D.

SEHH1028 Elementary Statistics Page 12


Example 2 - Solution
After step 2 & 3, you will get the frequency distribution as shown below.
A B C D
Class boundaries Frequency ( f ) Midpoint ( Xm ) ( f )( Xm )
4.5 – 7.5 2 6 2×6=12
7.5 – 10.5 3 9 3×9=27
10.5 – 13.5 6 12 6×12=72
13.5 – 16.5 5 15 5×15=75
16.5 – 19.5 3 18 3×18=54
19.5 – 22.5 1 21 1×21=21
Total \(n = \sum f = 20\), \(\sum f \cdot X_m = 261\)
Step 4: Find the sum of column D as shown above.
Step 5: Divide the sum of column D by the sample size \(n\) to get the sample mean.
\[\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{261}{20} = 13.05 \approx 13.1 \text{ cigarettes}\]

SEHH1028 Elementary Statistics Page 13


Example 2 - Solution
An alternative way to present your solution:

\[\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{(2 \times 6) + (3 \times 9) + \cdots + (1 \times 21)}{20} = \frac{261}{20} = 13.05 \approx 13.1 \text{ cigarettes}\]
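The five steps of Example 2 can be sketched in Python as follows. The function name `grouped_mean` and the list names are assumptions made for this illustration; the class boundaries and frequencies are those of the example.

```python
# Class boundaries and frequencies from Example 2
boundaries = [(4.5, 7.5), (7.5, 10.5), (10.5, 13.5),
              (13.5, 16.5), (16.5, 19.5), (19.5, 22.5)]
frequencies = [2, 3, 6, 5, 3, 1]

def grouped_mean(boundaries, frequencies):
    """Sample mean of a grouped frequency distribution."""
    # Column C: class midpoints
    midpoints = [(lower + upper) / 2 for lower, upper in boundaries]
    # Column D: frequency times midpoint, then its sum
    total = sum(f * xm for f, xm in zip(frequencies, midpoints))
    # Divide by the sum of the frequencies (the sample size n)
    return total / sum(frequencies)

print(grouped_mean(boundaries, frequencies))  # 13.05
```

The unrounded value 13.05 is printed; rounding it to 1 decimal place as in the slide gives 13.1 cigarettes.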

SEHH1028 Elementary Statistics Page 14


Weighted Mean
In the calculation of the mean in the previous section, all data points make
the same contribution.
For a weighted mean, some data points contribute more than
others. The contribution (weight) is usually determined by some
additional factor.
\[\bar{X} = \frac{w_1 X_1 + w_2 X_2 + w_3 X_3 + \cdots + w_n X_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{\sum wX}{\sum w}\]
where \(w_1, w_2, w_3, \ldots, w_n\) are the weights and \(X_1, X_2, X_3, \ldots, X_n\) are
the data values.

SEHH1028 Elementary Statistics Page 15


Example 3 – Weighted Mean
Suppose Mary took 3 subjects in the last semester. The results are
shown in the table below
Subject Grade point (X) No. of credits (w)
Mathematics 4.5 (A+) 3
English 3.0 (B) 4
Chinese 1.0 (D) 5

Find Mary’s grade point average in the last semester. Correct your
answer to 2 decimal places.

Solution:
\[\bar{X} = \frac{\sum wX}{\sum w} = \frac{3 \times 4.5 + 4 \times 3.0 + 5 \times 1.0}{3 + 4 + 5} = \frac{30.5}{12} = 2.5417 \approx 2.54 \text{ points}\]
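As a rough check of Example 3, the weighted mean formula can be written as a short Python function. The names `weighted_mean`, `grade_points` and `credits` are illustrative choices, not notation from the slides.

```python
# Grade points (X) and credits (w) from Example 3
grade_points = [4.5, 3.0, 1.0]
credits = [3, 4, 5]

def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    weighted_total = sum(w * x for w, x in zip(weights, values))
    return weighted_total / sum(weights)

gpa = weighted_mean(grade_points, credits)
print(round(gpa, 2))  # 2.54
```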

SEHH1028 Elementary Statistics Page 16


Exercise 1 – Weighted Mean
An investment portfolio consists of 3 stocks. The percentage of
the 3 stocks in the portfolio and their payoffs are summarized in
the following table. Find the average payoff using the weighted
mean formula. Correct your answer to the nearest integer.
Stock Percent (%) Payoff ($)
A 30 10,000
B 50 3,000
C 20 1,000

Solution:

SEHH1028 Elementary Statistics Page 17


Median
The median is the midpoint of an ordered data array. The
symbol for the median is MD.
If the number of data points is odd, the median is
the single data point in the middle of the data
array.
If the number of data points is even, the median is
the average of the two data points in the middle
of the data array.

SEHH1028 Elementary Statistics Page 18


Example 4 - Median
The heights (in cm) of seven HKCC students are:
170, 165, 162, 180, 173, 177, 168
Find the median.
Solution:
Step 1: Arrange the data in (increasing / ascending) order.
162, 165, 168, 170, 173, 177, 180
Step 2: Select the middle value.
162, 165, 168, 170, 173, 177, 180
Hence, the median height is 170 cm.

SEHH1028 Elementary Statistics Page 19


Example 5 - Median
The scores of ten students in a Mathematics test are:
58, 75, 61, 44, 53, 66, 90, 84, 46, 77
Find the median.
Solution:
Step 1: Arrange the data in (increasing / ascending) order.
44, 46, 53, 58, 61, 66, 75, 77, 84, 90
Step 2: Since the middle point falls halfway between 61 and 66,
the median is found by taking the average of 61 and
66,
i.e. \(MD = \dfrac{61 + 66}{2} = 63.5\) marks
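Examples 4 and 5 cover the odd and even cases of the median rule. A minimal Python sketch of that rule is shown below; the function name `median` is chosen here for illustration (Python's standard library also provides `statistics.median`).

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)           # Step 1: arrange in ascending order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                     # odd count: one middle value
        return ordered[middle]
    # even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

heights = [170, 165, 162, 180, 173, 177, 168]      # Example 4
scores = [58, 75, 61, 44, 53, 66, 90, 84, 46, 77]  # Example 5
print(median(heights))  # 170
print(median(scores))   # 63.5
```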

SEHH1028 Elementary Statistics Page 20


Uses of Median in Labor Statistics
• Have you ever wondered about the effect of setting the minimum
hourly wage in HK?
• The Census and Statistics Department carries out
surveys to monitor salaries in various industries in HK
– See the full report HERE

SEHH1028 Elementary Statistics Page 21


Uses of Median in Labor Statistics

SEHH1028 Elementary Statistics Page 22


Mode
The value that occurs most often in a data set is
called mode.

• If a data set has only one mode, it is called unimodal


• A data set can have more than one mode
– If a data set has two modes, it is called bimodal
– If a data set has more than two modes, it is called multimodal
• When no data value occurs more than once, the data is
said to have no mode

SEHH1028 Elementary Statistics Page 23


Example 6 - Mode
A survey taken from the City X Observatory shows the mean
temperature of each day in June in °C. Find the mode.
25, 23, 23, 24, 26, 27, 25, 24, 22, 24, 25, 27, 27, 26, 26,
25, 24, 24, 22, 23, 25, 25, 23, 23, 22, 22, 22, 23, 23, 22

Solution:
It is helpful to arrange the data in order, although it is not necessary.
22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 24, 24,
24, 24, 24, 25, 25, 25, 25, 25, 25, 26, 26, 26, 27, 27, 27
Since the temperature of 23 degrees occurs 7 times, a frequency
larger than that of any other value, the mode for the data set is 23 °C.
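Counting the frequencies by hand is error-prone for 30 values, so the count in Example 6 can be cross-checked with Python's `collections.Counter`. The helper name `modes` is an assumption for this sketch; it returns a list so that bimodal and multimodal data sets are covered, and an empty list when no value occurs more than once.

```python
from collections import Counter

def modes(values):
    """Return all values with the highest frequency ([] if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no data value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

# Daily mean temperatures in June from Example 6
temperatures = [25, 23, 23, 24, 26, 27, 25, 24, 22, 24, 25, 27, 27, 26, 26,
                25, 24, 24, 22, 23, 25, 25, 23, 23, 22, 22, 22, 23, 23, 22]
print(modes(temperatures))  # [23]
```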

SEHH1028 Elementary Statistics Page 24


Exercise 2 - Mode
Find the mode of the scores of ten students in a Mathematics test:
58, 75, 61, 44, 53, 66, 90, 84, 46, 77

Solution:

Find the mode of the data representing the ages of 10 college
students: 18, 19, 19, 20, 20, 20, 21, 21, 21, 22

Solution:

SEHH1028 Elementary Statistics Page 25


Modal Class
The idea of the mode can be applied to a grouped frequency
distribution. The modal class is the class with the largest
frequency.
Example 7
Find the modal class and the mode for the
frequency distribution of cigarettes
patients smoked per day.
Class boundaries Frequency (f)
4.5 – 7.5 2
7.5 – 10.5 3
10.5 – 13.5 6
13.5 – 16.5 5
16.5 – 19.5 3
19.5 – 22.5 1
Solution:
The modal class is 10.5 – 13.5, since it has the largest frequency. Note: Instead of
using class boundaries, students may also state their answer using class limits.
The mode could also be given as 12 cigarettes per day, the midpoint of the modal class.

SEHH1028 Elementary Statistics Page 26


Midrange
The midrange is a rough estimate of the middle. It is
found by adding the lowest and highest values in the data
set and dividing by 2. The symbol MR is used for the
midrange.
\[MR = \frac{\text{Lowest value} + \text{Highest value}}{2}\]
Example 8
The data below represent the mean temperature of each month in
1999 in °C. Find the midrange.
17.3, 18.7, 20.4, 24.3, 24.9, 28.9, 29.2, 28.3, 27.8, 26.2, 22.2, 16.8

Solution: \(MR = \dfrac{16.8 + 29.2}{2} = 23\) °C
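The midrange in Example 8 only needs the smallest and largest values, which a short Python sketch makes explicit. The names `midrange` and `monthly_means` are illustrative.

```python
# Monthly mean temperatures from Example 8 (in degrees Celsius)
monthly_means = [17.3, 18.7, 20.4, 24.3, 24.9, 28.9,
                 29.2, 28.3, 27.8, 26.2, 22.2, 16.8]

def midrange(values):
    """Average of the lowest and the highest value."""
    return (min(values) + max(values)) / 2

print(round(midrange(monthly_means), 1))  # 23.0
```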

SEHH1028 Elementary Statistics Page 27


Central Tendency & Shape of Distribution

Measures of central tendency only tell
you the horizontal position of a
distribution but NOT the shape. Some
common shapes are shown on
this slide.

[Figure: three distribution shapes]
– Negatively skewed or left-skewed: the “tail” is on the left
– Symmetric
– Positively skewed or right-skewed: the “tail” is on the right

SEHH1028 Elementary Statistics Page 28


Central Tendency & Shape of Distribution
Different measures of central tendency are
influenced differently by the shape of a
distribution. The graphs show the effect of
the shape of a distribution on the values of
the mean, median and mode.

[Figure: positions of the mean, median and mode for three distribution shapes]
– Negatively skewed or left-skewed: Mean < Median < Mode
– Symmetric: the mean, median and mode are at the same position
– Positively skewed or right-skewed: Mode < Median < Mean

Trivia: Did you notice that the ordering of the
mean, median and mode in a negatively skewed
distribution is the same as the alphabetical
ordering of the three words?

SEHH1028 Elementary Statistics Page 29


3.2 Measures of Variation
A single measure of central tendency cannot describe a
distribution effectively as seen in the following histograms
[Figure: two histograms, Boys' Math Score and Girls' Math Score, each with Mean = 50 marks; x-axis: Score (0 to 100), y-axis: Frequency]

Measures of variation can help us describe the
spread of a distribution in addition to its location.

SEHH1028 Elementary Statistics Page 30


Common Measures of Variation
• Range
• Variance
• Standard Deviation
• Coefficient of Variation

SEHH1028 Elementary Statistics Page 31


Range
The range is the highest value minus the lowest value. The
symbol R is used for the range.
\[R = \text{Highest value} - \text{Lowest value}\]
Example 9
The weights (in pounds) of nine football players are given below. Find
the range of the data set.
206, 215, 305, 297, 265, 282, 301, 255, 261

Solution:
The range is \(R = 305 - 206 = 99\) pounds

SEHH1028 Elementary Statistics Page 32


Population Variance (\(\sigma^2\))
The variance is the average of the squared distances of each
value from the mean. The symbol for the population variance is
\(\sigma^2\). The formula for the population variance is:
\[\sigma^2 = \frac{\sum (X - \mu)^2}{N}\]
where
\(X\) = individual value
\(\mu\) = population mean
\(N\) = population size

SEHH1028 Elementary Statistics Page 33


Population Standard Deviation (\(\sigma\))
The standard deviation is the square root of the variance. The
symbol for the population standard deviation is \(\sigma\). The
corresponding formula for the population standard deviation is:

\[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}\]


SEHH1028 Elementary Statistics Page 34


Computational Formulas for
Population Variance and Standard Deviation
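The formulas for population variance and standard deviation in the
definition are not convenient for hand calculation. It is better to use
the following computational formulas.

Population Variance:
\[\sigma^2 = \frac{\sum X^2 - \left[ \left( \sum X \right)^2 / N \right]}{N}\]
Population Standard Deviation:
\[\sigma = \sqrt{\frac{\sum X^2 - \left[ \left( \sum X \right)^2 / N \right]}{N}}\]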


SEHH1028 Elementary Statistics Page 35


Example 10 – Population Variance and
Standard Deviation
The weights (in grams) of a population of insects are
shown below.

5, 7, 9, 11, 13, 15, 17

Find the population variance and standard deviation.
Correct your answers to the nearest integer.

SEHH1028 Elementary Statistics Page 36


Example 10 - Solution
Step 1: Find the population mean.
\[\mu = \frac{\sum X}{N} = \frac{5 + 7 + 9 + 11 + 13 + 15 + 17}{7} = \frac{77}{7} = 11 \text{ g}\]

Step 2: Subtract the population mean from each value and place
the result in column B of the table shown below.

Step 3: Square the results in column B and put the squares in
column C.

SEHH1028 Elementary Statistics Page 37


Example 10 – Solution
Resulting table after steps 2 & 3
A | B | C
\(X\) | \(X - \mu\) | \((X - \mu)^2\)
5 | -6 | 36
7 | -4 | 16
9 | -2 | 4
11 | 0 | 0
13 | 2 | 4
15 | 4 | 16
17 | 6 | 36
Total | | \(\sum (X - \mu)^2 = 112\)

SEHH1028 Elementary Statistics Page 38


Example 10 - Solution
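Step 4: Find the sum of the squares in column C.
\[\sum (X - \mu)^2 = 36 + 16 + 4 + 0 + 4 + 16 + 36 = 112\]

Step 5: Divide the sum by \(N\) to get the population variance.
\[\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{112}{7} = 16 \text{ g}^2\]

Step 6: Take the square root to get the population standard
deviation.
\[\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} = \sqrt{16} = 4 \text{ g}\]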

SEHH1028 Elementary Statistics Page 39


Example 10 – Population Variance and Standard
Deviation (Computational Formula)
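Step 1: Find the sum of the data values.
\[\sum X = 5 + 7 + 9 + 11 + 13 + 15 + 17 = 77\]
Step 2: Find the sum of the squares of the data values.
\[\sum X^2 = 5^2 + 7^2 + 9^2 + 11^2 + 13^2 + 15^2 + 17^2 = 959\]
Step 3: Substitute \(\sum X\) and \(\sum X^2\) into the computational formulas.
\[\sigma^2 = \frac{\sum X^2 - \left[ \left( \sum X \right)^2 / N \right]}{N} = \frac{959 - \left[ (77)^2 / 7 \right]}{7} = 16 \text{ g}^2\]
\[\sigma = \sqrt{\frac{\sum X^2 - \left[ \left( \sum X \right)^2 / N \right]}{N}} = \sqrt{16} = 4 \text{ g}\]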
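Example 10 is solved twice, once with the definitional formula and once with the computational formula. The sketch below, with function names chosen only for illustration, shows both routes in Python and confirms that they agree for this data set.

```python
import math

# Insect weights (in grams) from Example 10, treated as a whole population
weights = [5, 7, 9, 11, 13, 15, 17]

def population_variance(values):
    """Definitional formula: average squared distance from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def population_variance_computational(values):
    """Computational formula: (sum of X^2 - (sum of X)^2 / N) / N."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / n

variance = population_variance(weights)
print(variance)                                    # 16.0
print(population_variance_computational(weights))  # 16.0
print(math.sqrt(variance))                         # 4.0
```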


SEHH1028 Elementary Statistics Page 40


Population Variance vs Sample Variance
• The population variance and standard deviation
formulas introduced in the previous slides are only
applicable if you observe the ENTIRE POPULATION.

• If you draw a random SAMPLE from the population, you
should calculate the sample variance (\(s^2\)) and standard
deviation (\(s\)) instead.

SEHH1028 Elementary Statistics Page 41


Sample Variance
Since the population variance \(\sigma^2\) (a parameter) is usually unknown, we
estimate the value of \(\sigma^2\) using the sample variance \(s^2\) computed
from a random sample drawn from the population. The formula for
the sample variance is

\[s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}\]

where \(\bar{X}\) = sample mean and \(n\) = sample size.

SEHH1028 Elementary Statistics Page 42


Sample Standard Deviation
Analogous to the population standard deviation, the sample standard
deviation is equal to the square root of the sample variance. The
formula for the sample standard deviation is

\[s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}\]

where \(\bar{X}\) = sample mean and \(n\) = sample size.

SEHH1028 Elementary Statistics Page 43


Computational Formulas for
Sample Variance and Standard Deviation
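The formulas for sample variance and standard deviation given
above are not convenient for hand calculation. It is better to use the
following computational formulas.

Sample Variance:
\[s^2 = \frac{\sum X^2 - \left[ \left( \sum X \right)^2 / n \right]}{n - 1}\]
Sample Standard Deviation:
\[s = \sqrt{\frac{\sum X^2 - \left[ \left( \sum X \right)^2 / n \right]}{n - 1}}\]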

SEHH1028 Elementary Statistics Page 44


Example 11 – Sample Variance & Standard
Deviation
The weights (in pounds) of nine football players are given
below.
206, 215, 305, 297, 265, 282, 301, 255, 261

Find the sample variance and standard deviation. Correct
your answers to 1 decimal place.

SEHH1028 Elementary Statistics Page 45


Example 11 - Solution
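Step 1: Find the sum of the data values.
\[\sum X = 206 + 215 + 305 + \cdots + 261 = 2387\]
Step 2: Find the sum of the squares of the data values.
\[\sum X^2 = 206^2 + 215^2 + 305^2 + \cdots + 261^2 = 643391\]
Step 3: Substitute \(\sum X\) and \(\sum X^2\) into the formulas.
\[s^2 = \frac{\sum X^2 - \left[ \left( \sum X \right)^2 / n \right]}{n - 1} = \frac{643391 - \left[ (2387)^2 / 9 \right]}{9 - 1} = 1288.194 \approx 1288.2 \text{ lb}^2\]
\[s = \sqrt{\frac{\sum X^2 - \left[ \left( \sum X \right)^2 / n \right]}{n - 1}} = \sqrt{1288.194} = 35.891 \approx 35.9 \text{ lb}\]

Therefore, the sample variance is 1288.2 lb² and the sample
standard deviation is 35.9 lb.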
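The three steps of Example 11 translate directly into Python. The following is a minimal sketch using the computational formula with the \(n - 1\) denominator; `sample_variance` and `weights_lb` are names made up for this illustration.

```python
import math

# Weights (in pounds) of the nine football players from Example 11
weights_lb = [206, 215, 305, 297, 265, 282, 301, 255, 261]

def sample_variance(values):
    """Computational formula with the (n - 1) denominator."""
    n = len(values)
    sum_x = sum(values)                  # Step 1
    sum_x2 = sum(x * x for x in values)  # Step 2
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)  # Step 3

s2 = sample_variance(weights_lb)
s = math.sqrt(s2)
print(round(s2, 1))  # 1288.2
print(round(s, 1))   # 35.9
```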

SEHH1028 Elementary Statistics Page 46


Summary About Variance and Standard
Deviation
• Population variance (\(\sigma^2\)) and population standard deviation (\(\sigma\))
are population parameters, which are often unknown.

• Sample variance (\(s^2\)) and sample standard deviation (\(s\)) are
sample statistics. They are computed from a random sample and
are often used to estimate the corresponding unknown
population parameters.

• The denominators in \(s^2\) and \(s\) are \((n - 1)\) instead of \(n\). This is
done to ensure \(s^2\) is an unbiased estimator of \(\sigma^2\).

SEHH1028 Elementary Statistics Page 47


Sample Variance and Standard Deviation for
Grouped Frequency Distribution
Similar to the calculation of mean for grouped frequency
distribution, we can calculate the sample variance and
standard deviation for grouped frequency distribution.

SEHH1028 Elementary Statistics Page 48


Computation of Sample Variance & Standard
Deviation from Grouped Frequency Distribution
1. Construct a frequency distribution with the following columns.
A | B | C | D | E
Class | Frequency (\(f\)) | Class midpoint (\(X_m\)) | \(f \cdot X_m\) | \(f \cdot X_m^2\)

2. Multiply the frequency by the class midpoint for each class and place the
products in column D.
3. Multiply the frequency by the square of the class midpoint for each class
and place the products in column E.
4. Find the sums of columns B, D and E.
5. Compute the sample variance \(s^2\) using the formula given below.
\[s^2 = \frac{\sum f \cdot X_m^2 - \left[ \left( \sum f \cdot X_m \right)^2 / n \right]}{n - 1}\]
6. Take the square root to get the sample standard deviation \(s\).

SEHH1028 Elementary Statistics Page 49


Example 12 – Sample Variance & Standard
Deviation for Grouped Frequency Distribution
Suppose the following frequency distribution represents the
no. of cigarettes patients smoked per day for a sample of 20
patients. Find the sample variance and standard deviation of
the frequency distribution. Correct your answers to 1 decimal
place.
Class boundaries Frequency ( f )
4.5 – 7.5 2
7.5 – 10.5 3
10.5 – 13.5 6
13.5 – 16.5 5
16.5 – 19.5 3
19.5 – 22.5 1
n = 20

SEHH1028 Elementary Statistics Page 50


Example 12 – Solution
Step 1: Construct a frequency distribution table as shown below.
A | B | C | D | E
Class boundaries | Frequency (\(f\)) | Class midpoint (\(X_m\)) | \(f \cdot X_m\) | \(f \cdot X_m^2\)
4.5 – 7.5 | 2 | 6 | |
7.5 – 10.5 | 3 | 9 | |
10.5 – 13.5 | 6 | 12 | |
13.5 – 16.5 | 5 | 15 | |
16.5 – 19.5 | 3 | 18 | |
19.5 – 22.5 | 1 | 21 | |

Step 2: Multiply the frequency by the class midpoint for each class
and place the products in column D.
Step 3: Multiply the frequency by the square of the class midpoint
for each class and place the products in column E.

SEHH1028 Elementary Statistics Page 51


Example 12 - Solution
Step 4: Find the sum of column B, D and E.
Below is the frequency distribution after steps 2 to 4.
A | B | C | D | E
Class boundaries | Frequency (\(f\)) | Class midpoint (\(X_m\)) | \(f \cdot X_m\) | \(f \cdot X_m^2\)
4.5 – 7.5 | 2 | 6 | 2×6=12 | 2×6²=72
7.5 – 10.5 | 3 | 9 | 3×9=27 | 3×9²=243
10.5 – 13.5 | 6 | 12 | 6×12=72 | 6×12²=864
13.5 – 16.5 | 5 | 15 | 5×15=75 | 5×15²=1125
16.5 – 19.5 | 3 | 18 | 3×18=54 | 3×18²=972
19.5 – 22.5 | 1 | 21 | 1×21=21 | 1×21²=441
Total | \(n = \sum f = 20\) | | \(\sum f \cdot X_m = 261\) | \(\sum f \cdot X_m^2 = 3717\)

SEHH1028 Elementary Statistics Page 52


Example 12 - Solution
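Step 5: Calculate the sample variance \(s^2\) by substituting the
sums computed in step 4.
\[s^2 = \frac{\sum f \cdot X_m^2 - \left[ \left( \sum f \cdot X_m \right)^2 / n \right]}{n - 1} = \frac{3717 - \left[ (261)^2 / 20 \right]}{20 - 1} = 16.3658 \approx 16.4 \text{ cigarettes}^2\]
Step 6: Take the square root to get the sample standard
deviation.
\[s = \sqrt{16.3658} = 4.0455 \approx 4.0 \text{ cigarettes}\]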
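The grouped calculation in Example 12 can be reproduced with the short Python sketch below. The function name `grouped_sample_variance` is an assumption for illustration; the midpoints and frequencies are those in the table of the example.

```python
import math

# Class midpoints (column C) and frequencies (column B) from Example 12
midpoints = [6, 9, 12, 15, 18, 21]
frequencies = [2, 3, 6, 5, 3, 1]

def grouped_sample_variance(midpoints, frequencies):
    """Sample variance of a grouped frequency distribution."""
    n = sum(frequencies)                                           # sum of column B
    sum_fx = sum(f * xm for f, xm in zip(frequencies, midpoints))  # sum of column D
    sum_fx2 = sum(f * xm ** 2 for f, xm in zip(frequencies, midpoints))  # sum of column E
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

s2 = grouped_sample_variance(midpoints, frequencies)
print(round(s2, 1))             # 16.4
print(round(math.sqrt(s2), 1))  # 4.0
```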

SEHH1028 Elementary Statistics Page 53


Exercise 3
A survey of 40 clothing stores reported the following numbers of
sales held during a randomly selected year.
Number of sales Frequency
11 – 15 3
16 – 20 5
21 – 25 12
26 – 30 9
31 – 35 8
36 – 40 3
Find the following statistics.
(a) Sample mean (b) Modal class
(c) Sample variance (d) Sample standard deviation
For a, c, d, correct your answers to 1 decimal place.
SEHH1028 Elementary Statistics Page 54
Exercise 3 - Solution
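No. of sales | Frequency (\(f\)) | Midpoint (\(X_m\)) | \(f \cdot X_m\) | \(f \cdot X_m^2\)
11 – 15 | 3 | 13 | 39 | 507
16 – 20 | 5 | 18 | 90 | 1620
21 – 25 | 12 | 23 | 276 | 6348
26 – 30 | 9 | 28 | 252 | 7056
31 – 35 | 8 | 33 | 264 | 8712
36 – 40 | 3 | 38 | 114 | 4332
Total | \(n = \sum f = 40\) | | \(\sum f \cdot X_m = 1035\) | \(\sum f \cdot X_m^2 = 28575\)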

Solution
(a) Sample mean =
(b) Modal class =
(c) Sample variance =
(d) Sample standard deviation =

SEHH1028 Elementary Statistics Page 55


Coefficient of Variation
The coefficient of variation is the standard deviation divided by the
mean. The result is expressed as a percentage.

For samples: \(CVar = \dfrac{s}{\bar{X}} \cdot 100\%\)
For populations: \(CVar = \dfrac{\sigma}{\mu} \cdot 100\%\)

The coefficient of variation is used to compare standard deviations
when the units are different for the two variables being compared.

Rounding rule: Coefficients of variation are percentages. You
can round your answer to 2 decimal places.

SEHH1028 Elementary Statistics Page 56


Example 13 – Coefficient of Variation
The sample mean and sample standard deviation of ages and salary
of a sample of workers are summarized in the following table.
Compare the variations of the two variables. Correct your answers
to 2 decimal places.

Statistics | Age | Salary
\(\bar{X}\) | 32 years | $15000
\(s\) | 5 years | $500

Hint:
Calculate and compare the coefficients of variation of the above
two variables.

SEHH1028 Elementary Statistics Page 57


Example 13 – Coefficient of Variation

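Statistics | Age | Salary
\(\bar{X}\) | 32 years | $15000
\(s\) | 5 years | $500

Solution:
For age, \(CVar = \dfrac{s}{\bar{X}} \times 100\% = \dfrac{5}{32} \times 100\% = 15.625\% \approx 15.63\%\)
For salary, \(CVar = \dfrac{s}{\bar{X}} \times 100\% = \dfrac{500}{15000} \times 100\% = 3.3333\% \approx 3.33\%\)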

Since the coefficient of variation is larger for ages, the ages of the workers are
more variable than their salaries.
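The comparison in Example 13 can be sketched in Python as below. Both function names are illustrative. One caveat, stated as an observation about Python rather than about the course: the built-in `round()` rounds exact halves to the nearest even digit, so `round(15.625, 2)` would give 15.62, not the 15.63 shown on the slide. The sketch therefore uses the standard `decimal` module with `ROUND_HALF_UP` to mimic hand rounding.

```python
from decimal import Decimal, ROUND_HALF_UP

def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100

def round_half_up(value, places="0.01"):
    """Round halves upward, as is usual in hand calculation."""
    return Decimal(str(value)).quantize(Decimal(places), rounding=ROUND_HALF_UP)

cvar_age = coefficient_of_variation(5, 32)           # ages: s = 5, mean = 32
cvar_salary = coefficient_of_variation(500, 15000)   # salaries: s = 500, mean = 15000

print(round_half_up(cvar_age))     # 15.63
print(round_half_up(cvar_salary))  # 3.33
print(cvar_age > cvar_salary)      # True: ages are relatively more variable
```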

SEHH1028 Elementary Statistics Page 58


Exercise 4 – Coefficient of Variation
The blood pressure and cholesterol level of a sample of elderly people are
summarized in the following table. Compare the variation of the
two variables. Correct your answers to 2 decimal places.
Statistics | Blood pressure (mmHg) | Cholesterol level (mmol/L)
\(\bar{X}\) | 98 | 6.5
\(s\) | 5 | 1.5

Solution:

SEHH1028 Elementary Statistics Page 59


3.3. Measures of Position
Measures of central tendency and variation are used to
describe the distribution of the entire dataset.

Measures of position are used to locate the relative
position of a data value in the data set. For example,
• How does a student perform relative to other students in
the same class?
• How do the performances of two students from two different
classes compare?

SEHH1028 Elementary Statistics Page 60


Some Common Measures of Position

• Standard score (z score)


• Percentile
• Quartile
• Outlier

SEHH1028 Elementary Statistics Page 61


Standard Score
Standard score (z score) tells us how many standard deviations a
data value is above or below the mean for a specific distribution of
values.
For samples: \(z = \dfrac{X - \bar{X}}{s}\)
For populations: \(z = \dfrac{X - \mu}{\sigma}\)

Since the standard score has no units, it can be used to compare
relative positions of data values from different variables.

Rounding rule: Standard scores have no unit. You can round your
answer to 2 decimal places.

SEHH1028 Elementary Statistics Page 62


Example 14 – Standard Score
Suppose the population mean (\(\mu\)) and standard deviation (\(\sigma\)) of the
retirement ages of residents in a town are 65 and 3 years
respectively, and Susan and Mary retired at ages 61 and 71
respectively. What are the standard scores of their retirement ages?
Correct your answers to 2 decimal places.

Solution:
For Susan, \(z \text{ score} = \dfrac{X - \mu}{\sigma} = \dfrac{61 - 65}{3} = -1.3333 \approx -1.33\)
For Mary, \(z \text{ score} = \dfrac{X - \mu}{\sigma} = \dfrac{71 - 65}{3} = 2.00\)

Interpretation:
Susan’s retirement age is 1.33 standard deviations below the mean (negative z score).
Mary’s retirement age is 2.00 standard deviations above the mean (positive z score).
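The two standard scores in Example 14 can be verified with a one-line function. The name `z_score` and its parameter names are chosen for this sketch only.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Retirement ages from Example 14: population mean 65 years, standard deviation 3 years
z_susan = z_score(61, 65, 3)
z_mary = z_score(71, 65, 3)
print(round(z_susan, 2))  # -1.33
print(round(z_mary, 2))   # 2.0
```

A negative result means the value is below the mean and a positive result means it is above, matching the interpretation given in the example.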

SEHH1028 Elementary Statistics Page 63


Exercise 5 - Standard Score
The table below shows the means and standard deviations of a Statistics
test and an English test.
 | Statistics Test | English Test
Mean (\(\mu\)) | 70 | 66
Standard deviation (\(\sigma\)) | 4 | 8
If Tom got 64 marks in the Statistics test and 62 marks in the English test,
in which subject did Tom get a better relative position? Correct your
answers to 2 decimal places.
Solution:

SEHH1028 Elementary Statistics Page 64


Percentile
Percentiles divide a data set into 100
equal groups.

For any integer P between 1 and 99 (i.e. 1 ≤ P ≤ 99), the
P-th percentile is defined as the value such that P% of the
data values are less than or equal to the value and
(100-P)% of the data values are greater than the value.

SEHH1028 Elementary Statistics Page 65


Percentiles – Graphical Method
• Percentiles can be found by drawing a percentile graph.
• Percentile graph is similar to the cumulative frequency
polygon we learnt in chapter 2, except that the y-axis is
now replaced by cumulative percentages.
• The cumulative percentage of a class in a grouped frequency
distribution is defined by
\[\text{Cumulative \%} = \frac{\text{Cumulative frequency}}{n} \times 100\%\]

SEHH1028 Elementary Statistics Page 66


Example 15 – Percentile Graph
The frequency distribution below summarizes the
scores on an aptitude test for a group of students.
Construct a percentile graph.

Score Frequency
196.5 – 217.5 5
217.5 – 238.5 17
238.5 – 259.5 22
259.5 – 280.5 48
280.5 – 301.5 22
301.5 – 322.5 6

SEHH1028 Elementary Statistics Page 67


Example 15 - Solution
Step 1 – Find the cumulative frequencies.
Score Frequency Cumulative frequency

196.5 – 217.5 5 5
217.5 – 238.5 17 5+17 = 22
238.5 – 259.5 22 5+17+22 = 44
259.5 – 280.5 48 5+17+...+48 = 92
280.5 – 301.5 22 5+17+...+22 = 114
301.5 – 322.5 6 5+17+...+6 = 120

SEHH1028 Elementary Statistics Page 68


Example 15 - Solution
Step 2 – Find the cumulative percentages.
Score | Frequency (\(f\)) | Cumulative frequency | Cumulative percent
196.5 – 217.5 | 5 | 5 | 5/120 × 100% ≈ 4%
217.5 – 238.5 | 17 | 22 | 22/120 × 100% ≈ 18%
238.5 – 259.5 | 22 | 44 | 44/120 × 100% ≈ 37%
259.5 – 280.5 | 48 | 92 | 92/120 × 100% ≈ 77%
280.5 – 301.5 | 22 | 114 | 114/120 × 100% = 95%
301.5 – 322.5 | 6 | 120 | 120/120 × 100% = 100%
Total | \(n = \sum f = 120\) | |

SEHH1028 Elementary Statistics Page 69


Example 15 - Solution
Step 3 – Use the class boundaries for the x-axis and
cumulative percent for the y-axis
[Figure: Percentile Graph for Aptitude Test Scores of DSE Students; x-axis: Score (class boundaries from 175.5 to 322.5), y-axis: Cumulative Percentage (0 to 100)]

SEHH1028 Elementary Statistics Page 70


Example 16 – Using Percentile Graph
Use the percentile graph constructed in Example 15 to
find
(a) the 30th percentile of the score;
(b) the percentile rank (P) that corresponds to a score of
280.5.

SEHH1028 Elementary Statistics Page 71


Example 16 - Solution
[Figure: Percentile Graph for Aptitude Test Scores of DSE Students, with a cumulative percentage of 30 read across to a score of about 251, and the score 280.5 read up to a cumulative percentage of 77; x-axis: Score, y-axis: Cumulative Percentage]
(a) The 30th percentile is approximately 251.
(b) The score 280.5 corresponds to the 77th percentile.

SEHH1028 Elementary Statistics Page 72


Exercise 6 - Percentile
The hourly wages ($) of 30 private tutors are summarized below.
Class limits Frequency
46 – 54 3
55 – 63 12
64 – 72 10
73 – 81 3
82 – 90 1
91 – 99 0
100 – 108 1
Total 30

(a) Construct a percentile graph.


(b) Find the hourly wages that correspond to 35th, 65th and 85th
percentiles.
(c) Find the percentile ranks of the hourly wages $50, $64 & $70.

SEHH1028 Elementary Statistics Page 73


Exercise 6 - Solution
Construct the grouped frequency distribution for the data
Class Class Frequency Cumulative Cumulative
limits boundaries frequency percent
46 – 54 45.5 – 54.5 3 3 10
55 – 63 54.5 – 63.5 12 15 50
64 – 72 63.5 – 72.5 10 25 83
73 – 81 72.5 – 81.5 3 28 93
82 – 90 81.5 – 90.5 1 29 97
91 – 99 90.5 – 99.5 0 29 97
100 - 108 99.5 – 108.5 1 30 100

SEHH1028 Elementary Statistics Page 74


Exercise 6 - Solution
(a) [Figure: Percentile Graph of Hourly Wages of Private Tutors; x-axis: Hourly Wages in $ (class boundaries from 45.5 to 108.5), y-axis: Cumulative Percentage (0 to 100)]

SEHH1028 Elementary Statistics Page 75


Exercise 6 - Solution
(b) The 35th percentile corresponds to an hourly wage of _______.
The 65th percentile corresponds to an hourly wage of _______.
The 85th percentile corresponds to an hourly wage of _______.

(c) An hourly wage of $50 corresponds approximately to the _____ percentile.


An hourly wage of $64 corresponds approximately to the _____ percentile.
An hourly wage of $70 corresponds approximately to the _____ percentile.

SEHH1028 Elementary Statistics Page 76


Percentiles – Computational Method
• Disadvantages of graphical method
– It is a lot of work to construct the graph
– It is not accurate to look up percentiles and percentile ranks
from the graph
• If what we have is the raw data, we can compute
percentiles and percentile ranks directly without
constructing a graph

SEHH1028 Elementary Statistics Page 77


Formula for Finding Percentile Rank
Given a data value \(X\), the corresponding percentile rank
can be computed by the following formula
\[\text{Percentile Rank} = \frac{(\text{number of values less than } X) + 0.5}{n} \times 100\%\]

where \(n\) is the number of data values in the data set.

Round your answer to the nearest integer percentage.

SEHH1028 Elementary Statistics Page 78


Example 17 – Percentile Rank
Below are the ages of a group of randomly selected
visitors at an airport. Find the percentile rank of the
individual of age 28. Round the answer to the nearest
integer percentage.
12 28 35 42 47 49 50
Solution:
Number of data values less than 28 = 1 (i.e. 12)
Total number of data values (n) = 7
\[\text{Percentile rank of 28} = \frac{(\text{number of data values less than 28}) + 0.5}{n} \times 100\% = \frac{1 + 0.5}{7} \times 100\% = 21.429\% \approx 21\%\]
Therefore, the individual of age 28 is OLDER than 21% of the selected visitors at the
airport.

SEHH1028 Elementary Statistics Page 79


Example 18 – Percentile Rank
For the test scores shown below, find the percentile rank
of the score 35. Round the answer to the nearest integer
percentage.
12 28 35 42 47 49 50
Solution:
Number of data values less than 35 = 2 (i.e. 12 & 28)
Total number of data values (n) = 7
\[\text{Percentile rank of 35} = \frac{(\text{number of data values less than 35}) + 0.5}{n} \times 100\% = \frac{2 + 0.5}{7} \times 100\% = 35.714\% \approx 36\%\]
Therefore, the student with a score of 35 did better than 36% of the class.
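Examples 17 and 18 apply the same percentile rank formula to the same seven values, which makes them a convenient pair for a small Python check. The function name `percentile_rank` is an assumption for this sketch.

```python
def percentile_rank(values, x):
    """(number of values less than x + 0.5) / n * 100, as a percentage."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

data = [12, 28, 35, 42, 47, 49, 50]
print(round(percentile_rank(data, 28)))  # 21  (Example 17)
print(round(percentile_rank(data, 35)))  # 36  (Example 18)
```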

SEHH1028 Elementary Statistics Page 80


Procedure for Finding Percentile
Use the following procedure to find the p-th percentile of a data set
with n data values.
Step 1 – Sort the data in ascending order (from lowest to highest).
Step 2 – Compute \(c\) using the formula \(c = \dfrac{n \cdot p}{100}\).
Step 3a – If \(c\) is not a whole number, round it up to the next
integer. The p-th percentile is the \(c\)-th data value in
the sorted data.
Step 3b – If \(c\) is a whole number, the p-th percentile is the
average of the \(c\)-th and \((c + 1)\)-th data values in the sorted
data.
Note: For Steps 3a and 3b, you only work on ONE of these two steps.
Note to students: The above procedure for finding a percentile is VERY IMPORTANT.
The calculations of quartiles (page 85), the five-number summary (page 86), boxplots (page 87)
and outliers (page 95) are all related to percentiles.

SEHH1028 Elementary Statistics Page 81


Example 19 – Percentile
(c is not a whole number)
Find the 60th percentile of the following data set
containing the scores of 7 students.
12 28 35 42 47 49 50
Solution:
Total number of data values (n) = 7
We want to find the 60th percentile, therefore p = 60.

Step 1 – Sort the data in ascending order: 12, 28, 35, 42, 47, 49, 50
Step 2 – Compute \(c = \dfrac{n \cdot p}{100} = \dfrac{7 \times 60}{100} = 4.2\)
Step 3a – Since \(c\) is not a whole number, we round it up to the next
integer (i.e. \(c = 5\)).
The 60th percentile is equal to 47 marks (i.e. the 5th value in the sorted data set).

SEHH1028 Elementary Statistics Page 82


Example 20 – Percentile
(c is a whole number)
Find the 60th percentile for the following data set
containing the scores of 10 students.
18 15 12 6 8 2 3 5 20 10
Solution:
Total number of data values (n) = 10
We want to find the 60th percentile, therefore p = 60.

Step 1 – Sort the data in ascending order: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Step 2 – Compute c = (n × p) / 100 = (10 × 60) / 100 = 6
Step 3b – Since c is a whole number, the 60th percentile will be the average of the
6th and the 7th data values in the sorted data
Therefore 60th percentile = (10 + 12) / 2 = 11 marks.

SEHH1028 Elementary Statistics Page 83


Uses of Percentiles in Labor Statistics
• Other than median hourly wages in HK, Census and
Statistics Department also tracks the percentiles of
hourly wages in HK
– See the full report HERE

SEHH1028 Elementary Statistics Page 84


Quartile
Quartiles divide the data set into 4 groups. They are
denoted by Q1, Q2 and Q3.

Smallest data value | 25% | Q1 | 25% | MD (Q2) | 25% | Q3 | 25% | Largest data value

Q1 is equivalent to 25th percentile.
Q2 is equivalent to 50th percentile (i.e. median).
Q3 is equivalent to 75th percentile.

SEHH1028 Elementary Statistics Page 85


Five-Number Summary
The distribution of a data set can be summarized using
the five-number summary. It consists of the following 5
numbers.
– The lowest data value of the data set (i.e. minimum)
– Q1 (i.e. the first quartile)
– Q2 (i.e. the median)
– Q3 (i.e. the third quartile)
– The highest data value of the data set (i.e. maximum)
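
As a rough sketch, the five-number summary can be computed in Python by reusing the percentile procedure from page 81. The function names percentile and five_number_summary are illustrative, and p is assumed to be an integer strictly between 0 and 100. The sample data are the reaction times used in Example 21.

```python
def percentile(data, p):
    """p-th percentile by the page 81 procedure (assumes integer p with 0 < p < 100)."""
    values = sorted(data)
    whole, remainder = divmod(len(values) * p, 100)
    if remainder:                     # c is not whole: take the rounded-up position
        return values[whole]
    return (values[whole - 1] + values[whole]) / 2   # c is whole: average two values


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


times = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
print(five_number_summary(times))  # (30, 47, 83.5, 164, 296)
```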

SEHH1028 Elementary Statistics Page 86


Boxplot
A boxplot can be constructed from five-number summary
to give people a graphical view of the data distribution.
Procedure to construct a boxplot:
Step 1 – Find the five-number summary of the data set
Step 2 – Draw a horizontal axis with a scale such that it includes the
maximum and minimum of the data set
Step 3 – Draw a box whose vertical sides go through Q1 and Q3, and
draw a vertical line through the median
Step 4 – Draw a line from the minimum value to the left side of the
box and a line from the maximum data value to the right
side of the box

SEHH1028 Elementary Statistics Page 87


Example 21 – Five-Number Summary
The data below are the reaction times (in ms, i.e.
milliseconds) of 10 children in a reaction time test.
89 47 164 296 30 215 138 78 48 39

Find the five-number summary and construct a boxplot
for the above data set.

SEHH1028 Elementary Statistics Page 88


Example 21 - Solution
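Sort the data in ascending order
30 39 47 48 78 89 138 164 215 296
(Low = 30, Q1 = 47, MD lies between 78 and 89, Q3 = 164, High = 296)

Five-number summary
Low = 30 ms
Q1 = 47 ms
MD = 83.5 ms
Q3 = 164 ms
High = 296 ms

Boxplot (horizontal axis: reaction time in ms): the box runs from 47 to 164 with the median line at 83.5; the lines extend from the box to 30 on the left and to 296 on the right.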

SEHH1028 Elementary Statistics Page 89


Example 22 - Quartiles
The data below are the amounts of gold (in grams)
contained in 14 pieces of rock.

7 18 17 29 18 4 27 30 2 4 10 21 5 8

Find Q1, Q2 and Q3 for the above data set.

SEHH1028 Elementary Statistics Page 90


Example 22 - Solution
Sort the data in ascending order
2, 4, 4, 5, 7, 8, 10, 17, 18, 18, 21, 27, 29, 30

For Q1 (i.e. 25th percentile), c = (14 × 25) / 100 = 3.5
Therefore Q1 = 4th data value = 5 g

For Q2 (i.e. 50th percentile), c = (14 × 50) / 100 = 7
Therefore Q2 = average of 7th and 8th data values = (10 + 17) / 2 = 13.5 g

For Q3 (i.e. 75th percentile), c = (14 × 75) / 100 = 10.5
Therefore Q3 = 11th data value = 21 g

SEHH1028 Elementary Statistics Page 91


Describing the Shape of a Distribution
The distribution is approximately symmetric if
– the median is near the center of the boxplot; and
– the lines are about the same length in the boxplot

SEHH1028 Elementary Statistics Page 92


Describing the Shape of a Distribution
The distribution is positively skewed (or right skewed) if
– the median falls to the left of the center of the boxplot; or
– the right line in the boxplot is longer than that on the left

SEHH1028 Elementary Statistics Page 93


Describing the Shape of a Distribution
The distribution is negatively skewed (left skewed) if
– the median falls to the right of the center of the box; or
– the left line in the boxplot is longer than that on the right

SEHH1028 Elementary Statistics Page 94


Outlier
An outlier is an extremely high or an extremely low data
value when compared with the rest of the data values.

Some causes of outliers


• Measurement or observational error
• Errors introduced during data processing
• The data value may come from a subject that is not in the
defined population
• The data value may be a legitimate value that occurred by
chance

SEHH1028 Elementary Statistics Page 95


Procedure for Identifying Outliers
Step 1 – Compute the 1st and 3rd quartiles (i.e. Q1 and Q3)
Step 2 – Compute the interquartile range (IQR) using the
formula IQR = Q3 − Q1
Step 3 – Compute the lower bound (L) for outlier detection.
L = Q1 − (1.5 × IQR)
Step 4 – Compute the upper bound (U) for outlier detection.
U = Q3 + (1.5 × IQR)
Step 5 – Data values that are less than L or greater than U
are potential outliers
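
The five steps above can be sketched in Python as follows. The function names percentile and find_outliers are hypothetical, and the quartiles are computed with the page 81 percentile procedure (assuming integer p strictly between 0 and 100). The sample data come from Example 23.

```python
def percentile(data, p):
    """p-th percentile by the page 81 procedure (assumes integer p with 0 < p < 100)."""
    values = sorted(data)
    whole, remainder = divmod(len(values) * p, 100)
    if remainder:                     # c is not whole: take the rounded-up position
        return values[whole]
    return (values[whole - 1] + values[whole]) / 2   # c is whole: average two values


def find_outliers(data):
    """Return (lower bound, upper bound, list of potential outliers)."""
    q1 = percentile(data, 25)         # Step 1
    q3 = percentile(data, 75)
    iqr = q3 - q1                     # Step 2
    lower = q1 - 1.5 * iqr            # Step 3
    upper = q3 + 1.5 * iqr            # Step 4
    outliers = [x for x in data if x < lower or x > upper]   # Step 5
    return lower, upper, outliers


lower, upper, outliers = find_outliers([16, 18, 22, 19, 3, 21, 17, 20])
print(lower, upper, outliers)  # 10.5 26.5 [3]
```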

SEHH1028 Elementary Statistics Page 96


Example 23 - Outliers
Check the following data set for outliers.
16 18 22 19 3 21 17 20

Solution:
Step 1 – Sort the data in ascending order: 3, 16, 17, 18, 19, 20, 21, 22.
For Q1, c = (8 × 25) / 100 = 2, so Q1 = (16 + 17) / 2 = 16.5.
For Q3, c = (8 × 75) / 100 = 6, so Q3 = (20 + 21) / 2 = 20.5.
Step 2 – IQR = Q3 − Q1 = 20.5 − 16.5 = 4
Step 3 – L = Q1 − (1.5 × IQR) = 16.5 − (1.5 × 4) = 10.5
Step 4 – U = Q3 + (1.5 × IQR) = 20.5 + (1.5 × 4) = 26.5
Step 5 – The data value 3 is less than the lower bound L, therefore 3 can be
considered a potential outlier.

SEHH1028 Elementary Statistics Page 97


Exercise 7 - Outliers
Check the following data set for outliers.
14 18 27 26 19 13 5 25

Solution:
Step 1 – Sort the data in ascending order: 5, 13, 14, 18, 19, 25, 26, 27.
Q1 =            and Q3 =
Step 2 – IQR = Q3 − Q1 =
Step 3 – L = Q1 − (1.5 × IQR) =
Step 4 – U = Q3 + (1.5 × IQR) =
Step 5 –

SEHH1028 Elementary Statistics Page 98


Outliers in Action
Outlier detection techniques are
often used to identify abnormal
growth in children.

The chart on the right is a
typical growth chart for
identifying children who are
overweight or underweight. The full article can be accessed using
this link.

SEHH1028 Elementary Statistics Page 99


Chapter Summary
• Relationship between population, sample, parameter and
statistic
• Measures of central tendency
– Mean, weighted mean, median, mode, modal class and mid-range
– Calculation of sample mean from grouped frequency distribution
• Measures of variation
– Variance, standard deviation, range, coefficient of variation
– Differences between σ², s², σ and s
– Calculation of sample variance and standard deviation from grouped
frequency distribution

SEHH1028 Elementary Statistics Page 100


Chapter Summary
• Measures of position
– Standard score
– Percentile, percentile rank
• Graphical method & computation formula
– Quartile and outlier
– Five-number summary and boxplot
– Describing the shape of a distribution
• Symmetric, positively skewed (right skewed), negatively skewed (left
skewed)

SEHH1028 Elementary Statistics Page 101
