STPDF2 - Descriptive Statistics
Descriptive Statistics
MPS Department | FEU Institute of Technology
Subtopic 2
OBJECTIVES
Data Presentation
Measures of Central tendency
Measures of Dispersion
Measures of Position
Descriptive statistics consists of the collection, organization,
summarization, and presentation of data.
• Collect data (image source: https://docplayer.es/50880990-Preparacion-de-propuestas-en-horizonte-puerto-real-18-de-junio-de-2015.html)
• e.g., Survey
• Present data
• e.g., Tables and graphs
• Summarize data
• e.g., Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Descriptive Statistics
• Describes the important characteristics of a set of data.
• Organize, present, and summarize data:
1. Graphically
2. Numerically
“Shape, Center, and Spread”
• Center: A representative or average value that indicates where the
middle of the data set is located.
2. Grouped Frequency Distribution: for data sets with many different values,
which are grouped together into classes.
Ungrouped (raw data):
6 2 3 5 5
5 5 7 4 3
4 5 4 5 6
5 1 6 2 6
6 6 6 6 4
4 5 4 5 3
5 5 7 6 5

Grouped (value, frequency):
3    5
4    9
5    18
6    12
7    3
Frequency Histogram
• A bar graph that represents the frequency distribution.
• The horizontal scale is quantitative and measures the data values.
• The vertical scale measures the frequencies of the classes.
• Consecutive bars must touch.
[Figure: histogram sketch; vertical axis: frequency, horizontal axis: data values]
Ex. Peas per Pod

Peas per pod    Freq, f
1               1
2               2
3               5
4               9
5               18
6               12
7               3

[Figure: frequency histogram "Number of Peas in a Pod"; horizontal axis: Number of Peas (1 to 7), vertical axis: Frequency, f (0 to 20)]
Relative Frequency Distribution
• Shows the portion or percentage of the data that falls in a particular
class.
$\text{relative frequency} = \frac{\text{class frequency}}{\text{sample size}} = \frac{f}{n}$
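As a minimal sketch of the formula above, the relative frequencies for the "Peas per Pod" table can be computed as follows. The variable names are illustrative, not part of the slides.

```python
# Frequency table from the "Peas per Pod" example: value -> frequency f.
freq = {1: 1, 2: 2, 3: 5, 4: 9, 5: 18, 6: 12, 7: 3}

n = sum(freq.values())  # sample size

# Relative frequency of each class = f / n.
rel_freq = {value: f / n for value, f in freq.items()}

for value, f in freq.items():
    print(f"{value} peas: f = {f:2d}, relative frequency = {rel_freq[value]:.2f}")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check on the arithmetic.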
Relative Frequency Histogram
• Has the same shape and the same horizontal scale as the corresponding
frequency histogram.
• The vertical scale is marked with relative frequencies, not frequencies.
• For data sets with many different values.
• Groups data into 5-20 classes of equal width.
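• Lower class limits (LCL): the smallest numbers that can actually belong to
the different classes.
• Upper class limits (UCL): the largest numbers that can actually belong to
the different classes.
• Class width: the difference between two consecutive lower class limits
(or two consecutive upper class limits).
• Class midpoints: the value halfway between the LCL and the UCL of a class:
$\text{midpoint} = \frac{(\text{lower class limit}) + (\text{upper class limit})}{2}$
• Class boundaries: the value halfway between a UCL and the next LCL.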
$\text{class width} = \frac{\text{range}}{\text{number of classes}}$
• Round up to the next convenient number.
4. Find the class limits.
• Choose the first LCL: use the minimum data entry or something smaller that is
convenient.
• Find the remaining LCLs: add the class width to the lower limit of the preceding
class.
• Find the UCLs: Remember that classes must cover all data values and cannot
overlap.
5. Find the frequencies for each class. (You may add a tally column first
and make a tally mark for each data value in the class).
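The construction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are integers, the first lower class limit is the minimum entry, and "round up to the next convenient number" is taken to mean the next whole number. The function name and the data set are hypothetical.

```python
import math

def grouped_frequency(data, num_classes):
    """Build (lower limit, upper limit, frequency) rows for integer data."""
    lo, hi = min(data), max(data)
    # Class width = range / number of classes, rounded up to a whole number.
    width = math.ceil((hi - lo) / num_classes)
    # If the range divides evenly, the last class would miss the maximum,
    # so widen each class by one unit.
    if lo + width * num_classes <= hi:
        width += 1
    rows = []
    for k in range(num_classes):
        lcl = lo + k * width      # lower class limit
        ucl = lcl + width - 1     # upper class limit (classes do not overlap)
        f = sum(1 for x in data if lcl <= x <= ucl)
        rows.append((lcl, ucl, f))
    return rows

# Hypothetical data set (not from the slides).
data = [12, 15, 21, 7, 30, 18, 25, 9, 14, 22, 27, 11, 19, 16, 24]
rows = grouped_frequency(data, 5)
for lcl, ucl, f in rows:
    print(f"{lcl:2d} - {ucl:2d}: {f}")
```

The frequencies should add up to the number of data entries; if they do not, the classes overlap or fail to cover the data.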
Symmetric
• Data is symmetric if the left half of its histogram is roughly a mirror
image of its right half.
Skewed
• Data is skewed if it is not symmetric and if it extends more to one side
than the other.
Uniform
• Data is uniform if it is equally distributed (on a histogram, all the bars
are the same height or approximately the same height).
[Figure: example histogram shapes; panels: Symmetric, Uniform]
Image source: https://mikerogerstrg.wordpress.com/2015/01/23/outliers-escaping-average-and-becoming-great/
• A value that represents a typical, or central, entry of a data set.
• Most common measures of central tendency:
• Mean
• Median
• Mode
The sum of all the data entries divided by the number of entries.
• Population mean: $\mu = \frac{\sum x}{N}$
• Sample mean: $\bar{x} = \frac{\sum x}{n}$
The weighted mean is a type of mean that is calculated by multiplying
the weight (or probability) associated with a particular event or
outcome with its associated quantitative outcome and then summing
all the products together.
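A short sketch of the weighted mean follows. Dividing by the sum of the weights is an added generalization; when the weights are probabilities that sum to 1, it leaves the result unchanged. The function name and the scores are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical course grade: three scores with component weights
# (not from the slides).
scores = [80, 90, 70]
weights = [0.5, 0.3, 0.2]
grade = weighted_mean(scores, weights)
print(grade)
```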
• The value that lies in the middle of the data when the data set is
arranged in order from lowest to highest.
• Measures the center of an ordered data set by dividing it into two equal
parts.
• The median is often denoted $\tilde{x}$.
• If the data set has an
• odd number of entries: the median is the middle data entry.
Example: 2 5 6 11 13, so $\tilde{x} = 6$.
• even number of entries: the median is the mean of the two middle data entries.
Example: 2 5 6 7 11 13, so $\tilde{x} = \frac{6 + 7}{2} = 6.5$.
• The data entry that occurs with the greatest frequency.
• If no entry is repeated the data set has no mode.
• If two entries occur with the same greatest frequency, each entry is a
mode (bimodal).
• Example: 1 2 3 6 7 8 9 10 has no mode.
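The three measures can be checked with Python's standard `statistics` module. The `modes` helper below is a hypothetical function written to follow the slide's convention that a data set with no repeated entry has no mode.

```python
import statistics

odd = [2, 5, 6, 11, 13]
even = [2, 5, 6, 7, 11, 13]

mean_odd = statistics.mean(odd)        # sum of entries / number of entries
median_odd = statistics.median(odd)    # middle entry of the ordered data
median_even = statistics.median(even)  # mean of the two middle entries

def modes(data):
    """Return all modes; an empty list if no entry is repeated."""
    counts = {}
    for x in data:
        counts[x] = counts.get(x, 0) + 1
    top = max(counts.values())
    if top == 1:
        return []  # no entry repeats, so there is no mode
    return [x for x, c in counts.items() if c == top]

print(mean_odd, median_odd, median_even)
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))  # no repeated entry
print(modes([1, 2, 2, 3, 3]))            # two modes (bimodal)
```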
https://slideplayer.com/slide/10513276/
All three measures describe an “average”. Choose the one that best
represents a “typical” value in the set.
• Mean:
• The most familiar average.
• A reliable measure because it takes into account every entry of a data set.
• May be greatly affected by outliers or skew.
• Median:
• A common average.
• Not as affected by skew or outliers.
• Mode: May be used when one value repeats far more often than any other.
• The shape of your data and the existence of any outliers may help you
choose the best average.
Image source: http://chandra-silitonga.blogspot.com/2017/09/contoh-soal-menghitung-mean-median-dan.html
• Quartiles divide the ordered distribution into four equal parts
or subgroups.
• Deciles divide the ordered distribution into ten equal parts or
subgroups.
• Percentiles divide the ordered distribution into one hundred
equal parts or subgroups.
$Q_k = \left[\frac{k(n+1)}{4}\right]^{\text{th}} \text{ item (observation)}$
• Compute quartiles for the data given: 25, 18, 30, 8, 15,
5, 10, 35, 40, 45
• Arrange the data: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
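• $Q_1 = \left[\frac{1(10+1)}{4}\right]^{\text{th}} \text{ item} = 2.75^{\text{th}} \text{ item}$
• $Q_1 = 2^{\text{nd}} \text{ item} + 0.75\,(3^{\text{rd}} \text{ item} - 2^{\text{nd}} \text{ item})$
• $Q_1 = 8 + 0.75(10 - 8) = 8 + 0.75(2) = 8 + 1.5 = 9.5$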
$D_k = \left[\frac{k(n+1)}{10}\right]^{\text{th}} \text{ item (observation)}$
• Compute 𝑫𝟑 for the data given: 25, 18, 30, 8, 15, 5, 10,
35, 40, 45
• Arrange the data: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
$D_3 = \frac{3(10+1)}{10} = \cdots$
$P_k = \left[\frac{k(n+1)}{100}\right]^{\text{th}} \text{ item (observation)}$
• Compute 𝑷𝟕𝟓 for the data given: 25, 18, 30, 8, 15, 5, 10,
35, 40, 45
• Arrange the data: 5, 8, 10, 15, 18, 25, 30, 35, 40, 45
$P_{75} = \frac{75(n+1)}{100} = \frac{75(10+1)}{100} = \cdots$
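The three position formulas differ only in the divisor (4, 10, or 100), so one helper can cover all of them. The sketch below assumes linear interpolation between neighbouring items, as in the $Q_1$ example. The function name is hypothetical, and clamping positions that fall outside the data is an added assumption.

```python
def position_quantile(data, k, parts):
    """k-th quantile when the data is split into `parts` parts,
    located at the k(n + 1)/parts-th item of the sorted data."""
    xs = sorted(data)
    n = len(xs)
    pos = k * (n + 1) / parts   # 1-based position, may be fractional
    whole = int(pos)            # item just below the position
    frac = pos - whole          # fractional part used for interpolation
    if whole < 1:
        return xs[0]            # assumption: clamp to the smallest entry
    if whole >= n:
        return xs[-1]           # assumption: clamp to the largest entry
    lower, upper = xs[whole - 1], xs[whole]
    return lower + frac * (upper - lower)

data = [25, 18, 30, 8, 15, 5, 10, 35, 40, 45]
q1 = position_quantile(data, 1, 4)   # matches the worked Q1 example
q2 = position_quantile(data, 2, 4)   # second quartile, same as the median
print(q1, q2)
```

Use `parts=4` for quartiles, `parts=10` for deciles, and `parts=100` for percentiles.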
Grouped Data
The mean may often be confused with the median, mode or range. The
mean is the arithmetic average of a set of values, or distribution.
Example: The following table gives the frequency distribution of the number
of orders received each day during the past 50 days at the office of a mail-order
company. Calculate the mean.

Number of orders    f
10 – 12             4
13 – 15             12
16 – 18             20
19 – 21             14
                    n = 50
Solution:

Number of orders    f     x     fx
10 – 12             4     11    44
13 – 15             12    14    168
16 – 18             20    17    340
19 – 21             14    20    280
                    n = 50      Σfx = 832

x is the class midpoint: the sum of the class limits divided by 2.

$\bar{x} = \frac{\sum fx}{n} = \frac{832}{50} = 16.64$
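The same calculation can be sketched in Python. The tuple layout (lower limit, upper limit, frequency) is just one convenient choice for illustration.

```python
# Mail-order example: (lower class limit, upper class limit, frequency f).
classes = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

n = sum(f for _, _, f in classes)

# x is the class midpoint; sum f * x over all classes.
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

grouped_mean = sum_fx / n
print(n, sum_fx, grouped_mean)
```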
A median is described as the numerical value separating the higher half of a
sample, a population, or a probability distribution from the lower half.
Step 1: Construct the cumulative frequency distribution.
Step 2: Decide which class contains the median. The median class is the first
class whose cumulative frequency is at least n/2.
Step 3: Find the median by using the following formula:
$\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$
Where:
n = the total frequency
L_m = the lower boundary of the median class
F = the cumulative frequency before the median class
f_m = the frequency of the median class
i = the class width
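A minimal sketch of the three steps, applied to the mail-order table from the grouped mean example. It assumes integer class limits with a gap of 1 between classes, so each class boundary lies 0.5 beyond its limit. The function name is hypothetical.

```python
def grouped_median(classes):
    """classes: list of (lower limit, upper limit, frequency) for integer data."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                     # F: cumulative frequency before the class
    for lo, hi, f in classes:
        if cumulative + f >= n / 2:    # first class with cumulative freq >= n/2
            lower_boundary = lo - 0.5  # L_m (assumes a gap of 1 between classes)
            width = (hi + 0.5) - lower_boundary  # i, the class width
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no classes given")

orders = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]
median_orders = grouped_median(orders)
print(median_orders)
```

Here n/2 = 25, the cumulative frequencies are 4, 16, 36, 50, and so the median class is 16 – 18 with F = 16 and f_m = 20.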
Example: Based on the grouped data below, find the Interquartile Range
Time to travel to work Frequency
1 – 10 8
11 – 20 14
21 – 30 12
31 – 40 9
41 – 50 7
Solution:
1st Step: Construct the cumulative frequency distribution
Time to travel Frequency Cumulative
to work Frequency
1 – 10 8 8
11 – 20 14 22
21 – 30 12 34
31 – 40 9 43
41 – 50 7 50
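2nd Step: Determine Q1 and Q3

Class Q1: $\frac{n}{4} = \frac{50}{4} = 12.5$
Class Q1 is the 2nd class.

$Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i$

Therefore,
$Q_1 = 10.5 + \left(\frac{12.5 - 8}{14}\right) 10 = 13.7143$

Class Q3: $\frac{3n}{4} = \frac{3(50)}{4} = 37.5$
Class Q3 is the 4th class.

$Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i$

Therefore,
$Q_3 = 30.5 + \left(\frac{37.5 - 34}{9}\right) 10 = 34.3889$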
Interquartile Range
IQR = Q3 – Q1
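The quartile formulas share one pattern: find the first class whose cumulative frequency reaches a target (n/4 or 3n/4), then interpolate inside it. The sketch below implements that pattern. The function name is hypothetical, and it assumes integer class limits with a gap of 1 between classes.

```python
def grouped_quantile(classes, fraction):
    """Quantile of grouped integer data; fraction = 0.25 for Q1, 0.75 for Q3."""
    n = sum(f for _, _, f in classes)
    target = fraction * n
    cumulative = 0                     # F: cumulative frequency before the class
    for lo, hi, f in classes:
        if cumulative + f >= target:   # first class that reaches the target
            lower_boundary = lo - 0.5  # assumes a gap of 1 between classes
            width = (hi + 0.5) - lower_boundary
            return lower_boundary + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("no classes given")

# Time to travel to work: (lower limit, upper limit, frequency).
travel = [(1, 10, 8), (11, 20, 14), (21, 30, 12), (31, 40, 9), (41, 50, 7)]
q1 = grouped_quantile(travel, 0.25)
q3 = grouped_quantile(travel, 0.75)
iqr = q3 - q1
print(round(q1, 4), round(q3, 4), round(iqr, 4))
```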
$\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
Where:
i is the class width
Δ1 is the difference between the frequency of the modal class and the
frequency of the class before the modal class
Δ2 is the difference between the frequency of the modal class and the
frequency of the class after the modal class
Lmo is the lower boundary of the modal class
Based on the grouped travel-time data above, find the mode.
Solution:
Based on the table,
$L_{mo} = 10.5$, $\Delta_1 = 14 - 8 = 6$, $\Delta_2 = 14 - 12 = 2$, and $i = 10$.
$\text{Mode} = 10.5 + \left(\frac{6}{6 + 2}\right) 10 = 10.5 + 7.5 = 18.0$
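A sketch of the grouped mode calculation follows. It assumes integer class limits with a gap of 1 between classes, and it treats a missing neighbouring class as having frequency 0. The function name is hypothetical.

```python
def grouped_mode(classes):
    """classes: list of (lower limit, upper limit, frequency) for integer data."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))       # index of the modal class
    lo, hi, f = classes[m]
    before = freqs[m - 1] if m > 0 else 0               # class before the modal class
    after = freqs[m + 1] if m < len(freqs) - 1 else 0   # class after the modal class
    d1 = f - before                   # first difference in the mode formula
    d2 = f - after                    # second difference in the mode formula
    lower_boundary = lo - 0.5         # assumes a gap of 1 between classes
    width = (hi + 0.5) - lower_boundary
    return lower_boundary + d1 / (d1 + d2) * width

travel = [(1, 10, 8), (11, 20, 14), (21, 30, 12), (31, 40, 9), (41, 50, 7)]
mode_travel = grouped_mode(travel)
print(mode_travel)  # 10.5 + (6 / (6 + 2)) * 10 = 18.0
```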
Another important characteristic of quantitative data is how much the
data varies, or is spread out.
The four most common methods of measuring spread are:
1. Range
2. Mean Absolute Deviation
3. Quartile Deviation
4. Standard Deviation and Variance
• The difference between the maximum and minimum data entries in the
set.
• The data must be quantitative.
Range = (Max. data entry) – (Min. data entry)
The wait time to see a bank teller is studied at 2 banks.
• Note: The range is easy to compute, but only uses 2 values. Do the
following 2 sets vary the same?
• Set A: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
• Set B: 1, 10, 10, 10, 10, 10, 10, 10, 10, 10
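A quick check of the two sets illustrates the point. The range is the same for both, while a measure that uses every entry (here the sample standard deviation from Python's `statistics` module, introduced later in these notes) tells them apart.

```python
import statistics

set_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
set_b = [1, 10, 10, 10, 10, 10, 10, 10, 10, 10]

def data_range(data):
    """Range = (max. data entry) - (min. data entry)."""
    return max(data) - min(data)

range_a = data_range(set_a)
range_b = data_range(set_b)
sd_a = statistics.stdev(set_a)  # uses every entry, not just two
sd_b = statistics.stdev(set_b)

print(range_a, range_b)
print(round(sd_a, 2), round(sd_b, 2))
```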
Measures the typical amount data deviates from the mean.
Sample Variance, $s^2$:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Sample Standard Deviation, $s$:
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
1. Find the mean of the sample data set: $\bar{x} = \frac{\sum x}{n}$
$s = \sqrt{s^2}$
• Round to one more decimal than the data.
• Don’t round until the end.
• Include the appropriate units.
Wait time, x (in min)    Deviation: x – x̄    Squares: (x – x̄)²
6.6
6.8
7.5
7.7
7.9
Σx = 36.5                                     Σ(x – x̄)² =

$\bar{x} = \frac{\sum x}{n} = \frac{36.5}{5} = 7.3$ min
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
$s = \sqrt{s^2}$
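The table columns can be filled in step by step with a short script. This is a sketch of the procedure, with illustrative variable names. Rounding is left to the final print, in line with the advice not to round until the end.

```python
import math

wait_times = [6.6, 6.8, 7.5, 7.7, 7.9]  # in minutes

n = len(wait_times)
mean = sum(wait_times) / n                  # x-bar
deviations = [x - mean for x in wait_times] # x - x-bar
squares = [d ** 2 for d in deviations]      # (x - x-bar)^2
variance = sum(squares) / (n - 1)           # sample variance s^2
std_dev = math.sqrt(variance)               # sample standard deviation s

for x, d, sq in zip(wait_times, deviations, squares):
    print(f"{x:4.1f}  {d:5.2f}  {sq:5.2f}")
print(f"s^2 = {variance:.3f} min^2, s = {std_dev:.2f} min")
```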
                      Sample Statistics    Population Parameters
Mean                  x̄                    µ
Standard Deviation    s                    σ
Variance              s²                   σ²
Note: Unlike x̄ and µ, the formulas for s and σ are not mathematically the
same:
Sample Standard Deviation
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population Standard Deviation
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
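Python's standard `statistics` module provides both versions: `stdev` divides by n − 1 and `pstdev` divides by N. The sketch below reuses the five wait times purely as illustration. Treating them as a whole population is an assumption made only to show the difference.

```python
import statistics

data = [6.6, 6.8, 7.5, 7.7, 7.9]
n = len(data)

s = statistics.stdev(data)       # sample formula: divides by n - 1
sigma = statistics.pstdev(data)  # population formula: divides by N

print(round(s, 4), round(sigma, 4))
```

For the same data, the sample value is always the larger of the two, because its denominator is smaller.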
• Standard deviation is a measure of the typical amount an entry deviates
from the mean.
• The more the entries are spread out, the greater the standard deviation.
https://www.robinsonschools.com/unit2/images/users/dforbes/Stats/Stats_Notes_2.4.pdf
The gas mileage of 2 cars is sampled over various conditions:
Use a calculator to find the mean and standard deviation for each to
justify your choice.
How does “s” show how much the data varies?
Three methods:
1. Range Rule of Thumb
2. Chebyshev’s Theorem
3. The Empirical Rule
Alternatively, if the range is known, you can use the range rule to estimate the
standard deviation:
$s \approx \frac{\text{Range}}{4}$
• A sample of women’s heights has a mean of 64 inches and a standard
deviation of 2.5 inches. Using the range rule, “most” women fall within
what heights?
• What would be an “unusual” height?
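One way to work this example is sketched below. It assumes the usual form of the range rule of thumb, in which "most" values lie within 2 standard deviations of the mean, consistent with the estimate of s as Range/4. Both function names are hypothetical.

```python
def usual_interval(mean, s):
    """Range rule of thumb: most values lie within 2 standard deviations."""
    return mean - 2 * s, mean + 2 * s

def estimate_std(max_value, min_value):
    """Range rule estimate of the standard deviation: range / 4."""
    return (max_value - min_value) / 4

low, high = usual_interval(64, 2.5)   # women's heights, in inches
print(low, high)

# Going the other way: the interval's range recovers the standard deviation.
print(estimate_std(high, low))
```

Under this assumption, a height outside the printed interval would count as "unusual".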
The sample of Exam Scores used in the class handout had a mean of 73.6.
Which of the following is most likely the standard deviation of the sample?