Descriptive Statistics: Making Sense of Data


Introduction
• ARITRA KUMAR SINHA

• M.SC (STATISTICS), HINDU COLLEGE, DU (2008)

• 12+ YEARS IN PREDICTIVE ANALYTICS AND DATA SCIENCE

• DOMAINS – MARKETING, CREDIT RISK, SALES AND AUTO POLICY CLAIMS ACROSS BANKING AND INSURANCE DOMAINS

• LOVE TO DRIVE TO HIMALAYAN DESTINATIONS AND EXPLORE OFFBEAT LOCATIONS

• LINKEDIN: HTTPS://WWW.LINKEDIN.COM/IN/ARITRA-KUMAR-SINHA-A37186A/

• YOUTUBE CHANNEL - HTTPS://YOUTU.BE/MMG_OUIRML8
Learning Objectives
What is it? Descriptive Measures or Summary Measures enable us to characterize the data and draw insights about the overall behaviour of the data

Why Learn? Descriptive Measures are used in the EDA stage of Data Science Projects, from Missing Values to Outlier Treatments

How to Learn? We will go through the following topics to chart out our learning journey on Descriptive Statistics
• Central Tendency
• Dispersion
• Symmetry
What are Descriptive Measures
Descriptive Measures help describe the overall behaviour of the data

Central Tendency: Mean, Median, Mode

Dispersion: Range, Quartiles, Variance, Standard Deviation

Symmetry: Skewness
Mean - Concept
Definition

• Simple Arithmetic Mean is calculated as the sum of all datapoints divided by the number of datapoints

$\mu \equiv \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$

• Weighted Mean is calculated by summing up the products of the datapoints and their weights, then dividing by the sum of the weights

$\mu \equiv \bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$

Advantage and Disadvantages

• Advantage : Simple interpretation of the central behaviour of the data

• Disadvantage : Affected by extreme values
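As a minimal sketch of the two mean formulas, the Python below computes a simple and a weighted mean. The function names and the sample values and weights are hypothetical, chosen only for illustration; they do not come from the slides.

```python
from typing import Sequence


def arithmetic_mean(values: Sequence[float]) -> float:
    """Sum of all datapoints divided by the number of datapoints."""
    if len(values) == 0:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of weight * datapoint, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(f * x for f, x in zip(weights, values)) / total_weight


# Hypothetical illustration data (not from the slides).
values = [10.0, 20.0, 30.0]
weights = [1.0, 2.0, 3.0]

simple = arithmetic_mean(values)
weighted = weighted_mean(values, weights)
print(f"simple mean = {simple:.2f}, weighted mean = {weighted:.2f}")
```

When every weight is equal, the weighted mean reduces to the simple mean.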
Mean – Numerical Example
Data Set: Class A IQ of 13 Students
102.00 115.00
128.00 109.00
131.00 89.00
98.00 106.00
140.00 119.00
93.00 97.00
110.00

Calculation
$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1437}{13} = 110.538$

Result
The mean IQ of the Class A Students is 110.538
110.538 is the central IQ Level of the Class
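The calculation on this slide can be reproduced with a few lines of Python. This is only a sketch; the variable names are illustrative, and the data are the 13 Class A IQ values listed on the slide.

```python
# Class A IQ of 13 students, as listed on the slide.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

total = sum(class_a_iq)   # numerator: sum of all datapoints
n = len(class_a_iq)       # denominator: number of datapoints
mean_iq = total / n

print(f"sum = {total}, n = {n}, mean = {mean_iq:.3f}")
# prints: sum = 1437, n = 13, mean = 110.538
```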
Median - Concept
Definition

• The middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves

• When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it

Advantage and Disadvantages

• Advantage : The median is largely unaffected by outliers, so it describes the “typical person” better than the mean when data are skewed.

• Disadvantage : If the recorded values for a variable form a symmetric distribution, the median and mean are identical, so the median adds nothing beyond the mean
Median – Numerical Example
Data Set: Class A IQ of 13 Students
102.00 115.00
128.00 109.00
131.00 89.00
98.00 106.00
140.00 119.00
93.00 97.00
110.00

Calculation
Sort the IQ data in Ascending Order:
89.00, 93.00, 97.00, 98.00, 102.00, 106.00, 109.00, 110.00, 115.00, 119.00, 128.00, 131.00, 140.00

Result
The median IQ of the Class A Students is 109
Six cases above and Six cases below
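The sort-and-pick-the-middle procedure can be sketched in Python as follows. The manual steps mirror the slide; `statistics.median` from the standard library is shown only as a cross-check, and the variable names are illustrative.

```python
from statistics import median

# Class A IQ of 13 students, as listed on the slide.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

ranked = sorted(class_a_iq)          # step 1: sort in ascending order
n = len(ranked)
middle = n // 2
if n % 2 == 1:
    manual_median = ranked[middle]   # odd count: the single middle value
else:
    # even count: average of the two middle values
    manual_median = (ranked[middle - 1] + ranked[middle]) / 2

below = sum(1 for x in ranked if x < manual_median)
above = sum(1 for x in ranked if x > manual_median)

print(f"median = {manual_median}, cases below = {below}, cases above = {above}")
print(f"statistics.median agrees: {median(class_a_iq) == manual_median}")
```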
Mode - Concept
Definition

• Mode is defined as the most frequent datapoint

• When the data is grouped and the frequency of every datapoint is counted, the mode is the datapoint with the highest frequency

Advantage and Disadvantages

• Advantage : The mode conveys the “most likely” experience of the data

• Disadvantage : If the recorded values for a variable form a symmetric, single-peaked distribution, the mean, the median and the mode will be the same
Mode – Numerical Example
Data Set: Class A IQ of 13 Students – new Data
102.00 115.00
128.00 109.00
131.00 89.00
98.00 109.00
140.00 119.00
93.00 97.00
109.00

Calculation
Sort the IQ data in Ascending Order:
89.00, 93.00, 97.00, 98.00, 102.00, 109.00, 109.00, 109.00, 115.00, 119.00, 128.00, 131.00, 140.00

Result
The modal IQ of the Class A Students is 109
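Counting the frequency of each datapoint, as the Mode concept slide describes, can be sketched with `collections.Counter`. The variable names are illustrative; the data are the "new Data" values listed on this slide.

```python
from collections import Counter

# Class A IQ of 13 students, new data as listed on the slide.
class_a_iq_new = [102, 115, 128, 109, 131, 89, 98, 109, 140, 119, 93, 97, 109]

frequencies = Counter(class_a_iq_new)      # datapoint -> number of occurrences
highest = max(frequencies.values())        # the largest frequency observed
# A data set can have several modes, so collect every value that ties.
modes = sorted(v for v, count in frequencies.items() if count == highest)

print(f"mode(s) = {modes}, frequency = {highest}")
```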
Range - Concept
Definition

• The Spread or the distance between the highest and the lowest values in the data is called Range

• Range = Highest Value – Lowest Value

Advantage and Disadvantages

• Advantage : Gives a sense of overall width of data

• Disadvantage : It does not give a sense of dispersion about a central value or a measure
Range – Numerical Example
Data Set: Class A IQ of 13 Students
102.00 115.00
128.00 109.00
131.00 89.00
98.00 109.00
140.00 119.00
93.00 97.00
109.00

Data Set: Class B IQ of 13 Students
127.00 162.00
131.00 103.00
96.00 111.00
80.00 109.00
93.00 87.00
120.00 105.00
109.00

Calculation
Range of IQ in Class A = 140 – 89 = 51
Range of IQ in Class B = 162 – 80 = 82
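A short Python sketch of the range calculation for both classes follows. The helper name `value_range` is hypothetical; the data are the values listed on this slide.

```python
def value_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)


# IQ values as listed on the slide.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 109, 140, 119, 93, 97, 109]
class_b_iq = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

range_a = value_range(class_a_iq)
range_b = value_range(class_b_iq)

print(f"Range of IQ in Class A = {max(class_a_iq)} - {min(class_a_iq)} = {range_a}")
print(f"Range of IQ in Class B = {max(class_b_iq)} - {min(class_b_iq)} = {range_b}")
```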
Quartile - Concept
Definition

• A quartile is the value that marks one of the divisions that breaks a series of values into four equal parts.

• The median is a quartile and divides the cases in half.

• 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.

• 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.

Advantage of Quartiles

• The concept is used to interpret the model performance and build custom cut offs and rules

• The knowledge of quartiles gives us a way to treat Outliers with minimal loss of variation
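As a hedged illustration, the quartiles of the 13 Class A IQ values from the Mean example can be computed with `statistics.quantiles` from the Python standard library. Tools use different interpolation conventions for quartiles, so the sketch shows both methods that function offers; the slides do not say which convention the author used.

```python
from statistics import quantiles

# Class A IQ of 13 students, from the Mean example.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# n=4 asks for the three cut points that split the data into four parts.
# "inclusive" treats the data as the whole population (min and max included).
q1_inc, q2_inc, q3_inc = quantiles(class_a_iq, n=4, method="inclusive")

# "exclusive" (the default) treats the data as a sample of a larger population.
q1_exc, q2_exc, q3_exc = quantiles(class_a_iq, n=4, method="exclusive")

print(f"inclusive: Q1 = {q1_inc}, median = {q2_inc}, Q3 = {q3_inc}")
print(f"exclusive: Q1 = {q1_exc}, median = {q2_exc}, Q3 = {q3_exc}")
```

Both conventions agree on the median (109) but differ on Q1 and Q3, which is why quartiles reported by different tools may not match exactly.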
Inter Quartile Range- Concept
Definition

• The interquartile range is the distance or range between the 25th percentile and the 75th percentile

• The IQR from the below data is 750 – 250 = 500

[Figure: number line marked 0, 250, 500, 750, 1000; each of the four intervals holds 25% of cases]
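The 750 – 250 = 500 result can be checked with a small Python sketch. It assumes, purely for illustration, that the cases are spread evenly over the integers 0 to 1000; the slide does not state the underlying data.

```python
from statistics import quantiles

# Assumption for illustration: cases spread evenly over 0..1000.
data = list(range(0, 1001))

# The "inclusive" method keeps the minimum and maximum as the 0th and 100th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1   # distance between the 25th and 75th percentiles

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```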


Range – Numerical Example
Data Set: Class A IQ of 13 Students
102.00 115.00
128.00 106.00
131.00 89.00
98.00 109.00
140.00 119.00
93.00 97.00
110.00

Data Set: Class B IQ of 13 Students
127.00 162.00
131.00 103.00
96.00 111.00
80.00 109.00
93.00 87.00
120.00 105.00
109.00

Calculation
Range of IQ in Class A = 140 – 89 = 51
Range of IQ in Class B = 162 – 80 = 82
Variance - Concept
Definition

• A measure of the spread of the recorded values on a variable. A measure of dispersion.

Variance of Population: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$

• The smaller the variance, the closer the individual scores are to the mean.

[Figure: distributions with the Mean marked]
Variance – Numerical Example
Data Set: Class A IQ of 13 Students
102.00 115.00
128.00 106.00
131.00 89.00
98.00 109.00
140.00 119.00
93.00 97.00
110.00

Step 1: Calculate the Mean
$\bar{X}$ = 110.54

Step 2: Sum of Squares
(102 – 110.54)² + (115 – 110.54)² +
(128 – 110.54)² + (106 – 110.54)² +
(131 – 110.54)² + (89 – 110.54)² +
(98 – 110.54)² + (109 – 110.54)² +
(140 – 110.54)² + (119 – 110.54)² +
(93 – 110.54)² + (97 – 110.54)² +
(110 – 110.54)² = SS = 2891.23

Step 3: Calculate Variance
SS/n = Variance for a population = 2891.23 / 13 = 222.40
SS/(n – 1) = Variance for a sample = 2891.23 / 12 = 240.94
Standard Deviation - Concept
Definition

• A measure of the typical deviation of observations or datapoints from the mean

SD of Population: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$

SD of Sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$

Standard Deviation from the last example
= Square Root (240.94)
= 15.52

A student’s IQ typically deviates from the average IQ by about 15.52 points.
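The three variance steps and the standard deviation can be recomputed directly from the 13 listed Class A IQ values. The Python below is a minimal sketch with illustrative variable names; it uses the unrounded mean, so the last decimal can differ slightly from a hand calculation with a rounded mean.

```python
import math

# Class A IQ of 13 students, as listed in the Variance example.
class_a_iq = [102, 115, 128, 106, 131, 89, 98, 109, 140, 119, 93, 97, 110]

n = len(class_a_iq)

# Step 1: mean
mean_iq = sum(class_a_iq) / n

# Step 2: sum of squared deviations from the mean (SS)
sum_of_squares = sum((x - mean_iq) ** 2 for x in class_a_iq)

# Step 3: variance, dividing by n for a population or n - 1 for a sample
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# Standard deviation is the square root of the variance
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(f"mean = {mean_iq:.2f}, SS = {sum_of_squares:.2f}")
print(f"population variance = {population_variance:.2f}, SD = {population_sd:.2f}")
print(f"sample variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
```

The standard library's `statistics.pvariance`, `statistics.variance`, `statistics.pstdev` and `statistics.stdev` compute the same four quantities.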


Skewness - Concept
Definition

• Data is said to be skewed if it is not evenly distributed about the Mean

[Figure: a Symmetric distribution and a Skewed distribution, each with its Mean and Median marked]
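One way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5th power of the second central moment. The slides do not give a formula, so this choice is an assumption; the sketch applies it to the Class A IQ values from the Mean example and also compares the mean with the median.

```python
from statistics import median

# Class A IQ of 13 students, from the Mean example.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

n = len(class_a_iq)
mean_iq = sum(class_a_iq) / n
median_iq = median(class_a_iq)

# Second and third central moments (population form, dividing by n).
m2 = sum((x - mean_iq) ** 2 for x in class_a_iq) / n
m3 = sum((x - mean_iq) ** 3 for x in class_a_iq) / n

# Moment coefficient of skewness (assumed definition, not from the slides):
# 0 for perfectly symmetric data, positive when the right tail is longer.
skewness = m3 / m2 ** 1.5

print(f"mean = {mean_iq:.2f}, median = {median_iq}, skewness = {skewness:.2f}")
```

For this data the mean sits above the median and the coefficient is positive, which points to a mild right skew.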
Box Plot - Concept
Definition

• Graphical way to represent nearly all the descriptive statistics in one go
• A box-plot shows:
  Upper and lower quartiles
  Mean
  Median
  Range
  Outliers (points more than 1.5 × IQR beyond the quartiles)

[Figure: Box Plot with the values 162, 106.5, 96.5 and 82 marked; Mean = 110.5; IQR = 27; there is no outlier.]
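The 1.5 × IQR outlier rule can be sketched in Python as below, using the Class A IQ values from the Mean example. The quartiles come from `statistics.quantiles` with the "inclusive" method, which is an assumed convention; quartile conventions differ between tools, so these quartiles need not match the ones drawn in the slide's figure.

```python
from statistics import quantiles

# Class A IQ of 13 students, from the Mean example.
class_a_iq = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# Quartiles under the "inclusive" convention (an assumption, see lead-in).
q1, q2, q3 = quantiles(class_a_iq, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything further than 1.5 * IQR beyond a quartile is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in class_a_iq if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = [{lower_fence}, {upper_fence}], outliers = {outliers}")
```

A plotting library would draw the box from Q1 to Q3 with a line at the median; the fences decide which points are plotted individually as outliers.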
Practice Questions
1 Here are 7 scores in an Algebra test:
15, 18, 21, 22, 26, 28, 28
Compute: Mean, Median, Q1, Q3, Mode

2 145, 136, 198, 115, 128, 156
Compute: Variance, Standard Deviation

3 In two units of a company, unit one has 650 employees with a monthly salary of $2750 and unit two has 700 employees with a monthly salary of $2500. What is the combined arithmetic mean?

4 The number of observations is 30 and the arithmetic mean is 15. What is the sum of all values?
