Descriptive Statistics: Making Sense of Data
Statistics
MAKING SENSE OF DATA
Introduction
• Aritra Kumar Sinha
• M.Sc (Statistics), Hindu College, DU (2008)
• 12+ years in predictive analytics and data science
• Love to drive to Himalayan destinations and explore offbeat locations
• LinkedIn: HTTPS://WWW.LINKEDIN.COM/IN/ARITRA-KUMAR-SINHA-A37186A/
• YouTube channel: HTTPS://YOUTU.BE/MMG_OUIRML8
Learning Objectives
What is it?
Descriptive Measures or Summary Measures enable us to characterize the data and draw insights about the overall behaviour of the data
Why Learn?
Descriptive Measures are used in the EDA stage of Data Science projects, from Missing Values to Outlier Treatments
How to Learn?
We will go through the following topics to chart out our learning journey on Descriptive Statistics
• Central Tendency
• Dispersion
• Symmetry
What are Descriptive Measures
Descriptive Measures help describe the overall behaviour of the data:
• Quartiles
• Median
• Variance
• Mode
• Standard Deviation
Mean - Concept
Definition
• Simple Arithmetic Mean is calculated as the sum of all datapoints divided by the number of datapoints:
$$\mu \equiv \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Weighted Mean is calculated by summing the products of the datapoints and their weights, then dividing by the sum of the weights:
$$\mu \equiv \bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
Advantage and Disadvantages
• Advantage: Simple interpretation of the central behaviour of the data
• Disadvantage: Affected by extreme values
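The weighted mean formula can be sketched in a few lines of Python. The helper name `weighted_mean` and the scores and frequencies below are hypothetical, chosen only to illustrate the formula; they are not part of the Class A data.

```python
def weighted_mean(values, weights):
    """Return sum(f_i * x_i) / sum(f_i) for datapoints x_i and weights f_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(f * x for f, x in zip(weights, values)) / total_weight


# Hypothetical example: three distinct scores and how often each occurs.
scores = [100, 110, 120]
frequencies = [2, 5, 3]

# (2*100 + 5*110 + 3*120) / (2 + 5 + 3) = 1110 / 10
print(weighted_mean(scores, frequencies))  # 111.0
```

When every weight equals 1, the weighted mean reduces to the simple arithmetic mean.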
Mean – Numerical Example
Data Set
Class A IQ of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Calculation
$$\mu \equiv \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1437}{13} = 110.538$$
Result
The mean IQ of the Class A students is 110.538, which is the central IQ level of the class.
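As a quick check, the same calculation can be reproduced in Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from statistics import mean

# Class A IQ of 13 students, as listed in the data set.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

total = sum(class_a_iq)   # sum of all datapoints
n = len(class_a_iq)       # number of datapoints
manual_mean = total / n   # arithmetic mean by the definition

print(total, n)                    # 1437 13
print(round(manual_mean, 3))       # 110.538
print(round(mean(class_a_iq), 3))  # 110.538, same result via statistics.mean
```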
Median - Concept
Definition
• The middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves
• When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it
Advantage and Disadvantages
• Advantage: The median is unaffected by outliers, making it a better measure of central tendency, better describing the “typical person” than the mean when data are skewed.
• Disadvantage: If the recorded values for a variable form a symmetric distribution, the median and mean are identical
Median – Numerical Example
Data Set
Class A IQ of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Calculation
Sort the IQ data in ascending order:
89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140
Result
The median IQ of the Class A students is 109, with six cases above and six cases below.
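The sort-and-pick-the-middle procedure can be sketched with Python's standard library. The second half is a hypothetical experiment, not part of the slides: it replaces the top score with an extreme value to show that the median stays put while the mean moves.

```python
from statistics import mean, median

class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

ordered = sorted(class_a_iq)
middle_index = len(ordered) // 2   # 13 values, so index 6 is the 7th value
print(ordered[middle_index])       # 109
print(median(class_a_iq))          # 109

# Hypothetical outlier: replace the highest score (140) with 400.
with_outlier = [400 if x == 140 else x for x in class_a_iq]
print(median(with_outlier))          # 109, unchanged
print(round(mean(with_outlier), 3))  # 130.538, pulled up from 110.538
```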
Mode - Concept
Definition
• Mode is defined as the most frequent datapoint
• When data is grouped and for every datapoint we calculate the frequency, we can then observe the mode of the data
Advantage and Disadvantages
• Advantage: The mode conveys the “most likely” experience of the data
• Disadvantage: If the recorded values for a variable form a symmetric distribution, the median, mean and mode will be the same
Mode – Numerical Example
Data Set
Class A IQ of 13 Students (new data):
102, 115, 128, 109, 131, 89, 98, 109, 140, 119, 93, 97, 109
Calculation
Sort the IQ data in ascending order:
89, 93, 97, 98, 102, 109, 109, 109, 115, 119, 128, 131, 140
Result
The modal IQ of the Class A students is 109
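Counting frequencies and picking the most frequent value can be sketched as follows, using the new Class A data. The variable names are illustrative.

```python
from collections import Counter
from statistics import mode, multimode

# Class A IQ of 13 students (new data).
class_a_iq_new = [102, 115, 128, 109, 131, 89, 98, 109, 140, 119, 93, 97, 109]

# Frequency of every datapoint, then the single most frequent one.
frequency = Counter(class_a_iq_new)
print(frequency.most_common(1))   # [(109, 3)]

print(mode(class_a_iq_new))       # 109
print(multimode(class_a_iq_new))  # [109], a list because ties are possible
```

`multimode` is worth knowing when several values share the highest frequency; `mode` returns only one of them.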
Range - Concept
Definition
• The spread, or the distance between the highest and the lowest values in the data, is called the Range
• Range = Highest Value – Lowest Value
Advantage and Disadvantages
• Advantage: Gives a sense of the overall width of the data
• Disadvantage: It does not give a sense of dispersion about a central value or a measure
Range – Numerical Example
Data Set
Class A IQ of 13 Students:
102, 115, 128, 109, 131, 89, 98, 109, 140, 119, 93, 97, 109
Class B IQ of 13 Students:
127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109
Calculation
Range of IQ in Class A = 140 – 89 = 51
Range of IQ in Class B = 162 – 80 = 82
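The two range calculations can be reproduced with a one-line helper. The function name `value_range` is illustrative.

```python
def value_range(data):
    """Range = highest value - lowest value."""
    return max(data) - min(data)


class_a_iq = [102, 115, 128, 109, 131, 89, 98, 109, 140, 119, 93, 97, 109]
class_b_iq = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

print(value_range(class_a_iq))  # 140 - 89 = 51
print(value_range(class_b_iq))  # 162 - 80 = 82
```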
Quartile - Concept
Definition
• A quartile is the value that marks one of the divisions that breaks a series of values into four equal parts.
• The interquartile range is the distance or range between the 25th percentile and the 75th percentile
Advantage of Quartiles
• The concept is used to interpret the model performance and build custom cut offs and rules
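Quartiles and the interquartile range can be sketched with `statistics.quantiles`, here applied to the original Class A data. Note that several quartile conventions exist; the standard library offers an "exclusive" method (the default) and an "inclusive" one, and they can give different cut points on small samples, so results may differ from other tools.

```python
from statistics import quantiles

class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(class_a_iq, n=4)  # default method="exclusive"
iqr = q3 - q1                            # interquartile range: Q3 - Q1
print(q1, q2, q3, iqr)

# The "inclusive" convention treats the data as the whole population.
q1_inc, q2_inc, q3_inc = quantiles(class_a_iq, n=4, method="inclusive")
print(q1_inc, q2_inc, q3_inc, q3_inc - q1_inc)
```

In both conventions the second quartile equals the median of the data.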
Variance - Concept
Definition
Variance of Population:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
where $\mu$ is the mean.
• The smaller the variance, the closer the individual scores are to the mean.
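A direct implementation of the population variance formula makes the last point concrete. The two small lists below are hypothetical, picked so that both have mean 10 but very different spread.

```python
def population_variance(data):
    """Mean of squared deviations from the mean: sum((x_i - mu)^2) / n."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n


# Hypothetical scores: same mean (10), different spread.
tight = [9, 10, 11]
spread = [0, 10, 20]

print(population_variance(tight))   # small: scores sit close to the mean
print(population_variance(spread))  # large: scores sit far from the mean
```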
Variance – Numerical Example
Data Set
Class A IQ of 13 Students:
102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110
Step 1: Calculate the Mean
$\bar{X}$ = 110.54
Step 2: Sum of Squares
SS = (102 − 110.54)² + (115 − 110.54)² + (128 − 110.54)² + (109 − 110.54)² + (131 − 110.54)² + (89 − 110.54)² + (98 − 110.54)² + (106 − 110.54)² + (140 − 110.54)² + (119 − 110.54)² + (93 − 110.54)² + (97 − 110.54)² + (110 − 110.54)² = 2891.23
Step 3: Calculate Variance
• SS/n = 2891.23/13 = 222.40 (variance for a population)
• SS/(n − 1) = 2891.23/12 = 240.94 (variance for a sample)
Standard Deviation - Concept
Definition
• Standard deviation is the square root of the variance.
Population: $$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
Sample: $$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$$
• For the Class A IQ data, the sample standard deviation is $\sqrt{240.94}$ = 15.52
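The sum of squares, both variances, and both standard deviations can be checked against Python's standard library. This is a sketch on the 13 Class A values from the data set (102, 115, 128, ...); the variable names are illustrative.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# Sum of squared deviations from the mean (SS).
x_bar = mean(class_a_iq)
ss = sum((x - x_bar) ** 2 for x in class_a_iq)
n = len(class_a_iq)

print(round(ss, 2))                     # SS
print(round(ss / n, 2))                 # population variance, SS / n
print(round(ss / (n - 1), 2))           # sample variance, SS / (n - 1)

# The same quantities from the statistics module.
print(round(pvariance(class_a_iq), 2))  # population variance
print(round(variance(class_a_iq), 2))   # sample variance
print(round(pstdev(class_a_iq), 2))     # population standard deviation
print(round(stdev(class_a_iq), 2))      # sample standard deviation
```

The `p`-prefixed functions divide by n (population); the unprefixed ones divide by n − 1 (sample).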
[Figure: a symmetric distribution and a skewed distribution, each with its mean and median marked]
Box Plot - Concept
Definition
• Graphical way to represent nearly all the descriptive statistics in one go
• A box plot shows: upper and lower quartiles, mean, median, range, and outliers (1.5 IQR rule)
[Figure: box plot with labelled values 162, Mean = 110.5, 106.5, 96.5 and 82. IQR = 27; there is no outlier.]
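The numbers a box plot draws can be computed without any plotting library. The sketch below is an illustration on the original Class A IQ data, not the dataset behind the figure; the helper name `box_plot_summary` is hypothetical, and the quartiles use the standard library's default ("exclusive") convention, so values from other tools may differ slightly.

```python
from statistics import mean, median, quantiles


def box_plot_summary(data):
    """Collect the statistics a box plot displays, flagging outliers
    with the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # anything below this is an outlier
    upper_fence = q3 + 1.5 * iqr   # anything above this is an outlier
    return {
        "min": min(data),
        "q1": q1,
        "median": median(data),
        "mean": mean(data),
        "q3": q3,
        "max": max(data),
        "iqr": iqr,
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
summary = box_plot_summary(class_a_iq)
for name, value in summary.items():
    print(name, value)
```

An empty `outliers` list means every datapoint lies within the fences.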
Practice Questions
1. Here are 7 scores in an Algebra test:
15, 18, 21, 22, 26, 28, 28
Compute: Mean, Median, Q1, Q3, Mode