Topic II Part II
If the distribution is symmetric, the mean and the median provide very
similar information!
Quick Exercise
Find median and mean, and compare your results
Position Q1 = (16 + 1)/4 = 4.25 → 4
Q1 = 12800
Position Q2 (median) = (16 + 1)/2 = 8.5 → take the mean of the 8th and 9th observations
Q2 = (19600 + 20900)/2 = 20250
Position Q3 = 3(16 + 1)/4 = 12.75 → 13
Q3 = 36000
Example II
Example with an odd number of observations (n = 25): share of women among all members of
parliament for 25 selected countries (2016)
 i  Country               % women
 1  Japan                 11.6
 2  Korea (Republic of)   16.3
 3  United States         19.5
 4  Ireland               19.9
 5  Singapore             23.9
 6  France                25.7
 7  Israel                26.7
 8  United Kingdom        26.7
 9  Slovenia              27.7
10  Canada                28.3
11  Luxembourg            28.3
12  Switzerland           28.9
13  Italy                 30.1
14  Austria               30.3
15  Australia             30.5
16  New Zealand           31.4
17  Netherlands           36.4
18  Germany               36.9
19  Denmark               37.4
20  Spain                 38
21  Norway                39.6
22  Iceland               41.3
23  Finland               41.5
24  Belgium               42.4
25  Sweden                43.6
Position Q1 = (25 + 1)/4 = 6.5 → 7
Position Q2 = (25 + 1)/2 = 13
Position Q3 = 3(25 + 1)/4 = 19.5 → 19
Remember: positions ending in 0.5 must be rounded up in the case of Q1 and rounded down in
the case of Q3
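The position rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are made up for this example, and positions that do not end in 0.5 are assumed to be rounded to the nearest integer, which is consistent with 4.25 → 4 and 12.75 → 13 in the Quick Exercise. Statistical software may use a different quartile convention and return slightly different values.

```python
import math

# Share of women in parliament (%), sorted ascending, from the Example II table.
shares = [11.6, 16.3, 19.5, 19.9, 23.9, 25.7, 26.7, 26.7, 27.7, 28.3,
          28.3, 28.9, 30.1, 30.3, 30.5, 31.4, 36.4, 36.9, 37.4, 38.0,
          39.6, 41.3, 41.5, 42.4, 43.6]


def q1_position(n):
    """Position of Q1: (n + 1)/4, rounded to nearest, with .5 rounded up."""
    return math.floor((n + 1) / 4 + 0.5)


def q3_position(n):
    """Position of Q3: 3(n + 1)/4, rounded to nearest, with .5 rounded down."""
    return math.ceil(3 * (n + 1) / 4 - 0.5)


def median(sorted_values):
    """Value at position (n + 1)/2; a .5 position averages the two neighbours."""
    pos = (len(sorted_values) + 1) / 2
    lower = sorted_values[math.floor(pos) - 1]  # positions are 1-based
    upper = sorted_values[math.ceil(pos) - 1]
    return (lower + upper) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the position rules above."""
    data = sorted(values)
    n = len(data)
    return data[q1_position(n) - 1], median(data), data[q3_position(n) - 1]


print(quartiles(shares))
```

For the table above this picks the observations at positions 7, 13 and 19 (Israel, Italy and Denmark).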
Outliers
An outlier is an extreme observation that falls outside the overall
pattern of the data
Characteristics:
• Problematic values because they tend to distort the analysis of the whole
data set (e.g. mean)
• An outlier can appear either because of genuine variability in the measurement or
because of some experimental error. In the first case, the existence of the
outlier is fully justified, while in the second case we want to detect and
eliminate such data
• Ultimately, whether to delete data depends entirely on your own criteria!
Outliers
How can we detect outliers?
To find the outliers of a data set, compute the interquartile range:
• IQR = Q3 - Q1
Values inside the interval from Q1 - 1.5 x IQR to Q3 + 1.5 x IQR are not considered
outliers; values outside it are
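As a rough illustration, the 1.5 x IQR rule can be applied to the Example II data (share of women in parliament). The helper names `outlier_fences` and `find_outliers` are hypothetical, and Q1 and Q3 are taken at positions 7 and 19, as computed in Example II.

```python
# Share of women in parliament (%), sorted ascending, from the Example II table.
shares = [11.6, 16.3, 19.5, 19.9, 23.9, 25.7, 26.7, 26.7, 27.7, 28.3,
          28.3, 28.9, 30.1, 30.3, 30.5, 31.4, 36.4, 36.9, 37.4, 38.0,
          39.6, 41.3, 41.5, 42.4, 43.6]

# Q1 and Q3 sit at positions 7 and 19 of the sorted data (1-based positions).
q1 = shares[7 - 1]
q3 = shares[19 - 1]


def outlier_fences(q1, q3, k=1.5):
    """Return the lower and upper limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall outside the two limits."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


low, high = outlier_fences(q1, q3)
print(low, high)                      # roughly 10.65 and 53.45
print(find_outliers(shares, q1, q3))  # empty list: no outliers in this data set
```

The smallest value (Japan, 11.6) lies just above the lower limit, so no country is flagged as an outlier.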
Example (cont)
Any outliers?
Five Summary Numbers
The set of median, Q1, Q3, Min and Max is often referred to as the "five-number
summary"
[Figure: box plot on a vertical axis from 0 to 14, marking Q1, Q2 and Q3, the min and max values (w/o outliers), and the outliers as separate points]
Variance and Standard Deviation
Both are measures of dispersion with respect to the mean: the larger the measure,
the larger the dispersion of the data.
$Var(x) = \sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
$sd(x) = \sigma_x = \sqrt{Var(x)}$
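Step #1: compute the mean of the values 1, 3, 5, 7, 9
(1+3+5+7+9)/5=5
Step #2: for each value, square the difference to the mean
(1-5)^2=16 (3-5)^2=4 (5-5)^2=0 (7-5)^2=4 (9-5)^2=16
Step #3: add up the squared differences and divide by n-1
Var(x)=(16+4+0+4+16)/(5-1)=10
Step #4: take the square root of the variance
sd(x)=√10≈3.16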
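The same computation can be checked with a few lines of Python. This is a minimal sketch: `sample_variance` is a name chosen for this example, and the result is compared with `statistics.variance` from the standard library, which also divides by n - 1.

```python
import math
import statistics

values = [1, 3, 5, 7, 9]


def sample_variance(xs):
    """Sum of squared differences to the mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


var = sample_variance(values)
sd = math.sqrt(var)

print(var)                          # 10.0
print(round(sd, 2))                 # 3.16
print(statistics.variance(values))  # the standard library gives the same variance
```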
Variance and Sd
1. Why do we define the variance this way if we want to measure dispersion (with respect
to the mean)?
• Squaring makes each term positive so that values above the mean do not cancel
values below the mean
• We want to know how spread out the data values are around the mean on average,
regardless of whether the differences are negative or positive
• Remark: Squaring adds more “weight” to the large differences
• Important: the sd is measured in the units of the variable and the variance in squared
units (not great when comparing different variables)
Coefficient of variation: $V = \sigma_x / \bar{x}$
Threshold:
• If V is below 0.3, the data are considered not very dispersed
• If V is above 0.3, the data are considered dispersed
Disadvantages:
• When the mean of a variable is zero, V cannot be calculated
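A short Python sketch of the coefficient of variation and the 0.3 threshold follows. The function names are illustrative only, and treating a value of exactly 0.3 as dispersed is an assumption, since that boundary case is not specified here.

```python
import statistics


def coefficient_of_variation(xs):
    """V = sd / mean, using the sample standard deviation (n - 1)."""
    mean = statistics.mean(xs)
    if mean == 0:
        # The disadvantage noted above: V is undefined for a zero mean.
        raise ValueError("V cannot be calculated when the mean is zero")
    return statistics.stdev(xs) / mean


def describe_dispersion(v, threshold=0.3):
    """Apply the 0.3 rule of thumb (exactly 0.3 counts as dispersed: an assumption)."""
    return "not very dispersed" if v < threshold else "dispersed"


v = coefficient_of_variation([1, 3, 5, 7, 9])
print(round(v, 3), describe_dispersion(v))
```

Because V divides by the mean, it has no units, which is what makes it usable for comparing variables measured on different scales.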
Take-home Example
Example: compute the coefficient of variation for each
dataset
$Skew = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)(n - 2)\, s^3}$
I will not ask you to compute skewness by hand. However, you must know
how to interpret the resulting measure
Skewness
How to interpret the skewness coefficient
• If skewness is less than −1 or greater than +1, the distribution is highly
skewed
• If skewness is between −1 and −0.5 or between +0.5 and +1, the
distribution is moderately skewed
• If skewness is between −0.5 and +0.5, the distribution is approximately
symmetric
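The interpretation rules can be sketched in Python as follows. The sketch assumes the sample-adjusted skewness coefficient n Σ(x_i - mean)^3 / ((n - 1)(n - 2) s^3), with s the sample standard deviation. The function names are made up for this example, and the treatment of the exact boundary values 0.5 and 1 is an assumption. Statistical packages may use a slightly different skewness formula, so their output can differ a little from this one.

```python
import math


def skewness(xs):
    """Sample-adjusted skewness: n * sum((x - mean)^3) / ((n-1)(n-2) s^3)."""
    n = len(xs)
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n * sum((x - mean) ** 3 for x in xs) / ((n - 1) * (n - 2) * s ** 3)


def interpret_skewness(skew):
    """Classify a skewness value using the three ranges listed above."""
    size = abs(skew)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:  # boundary handling is an assumption
        return "moderately skewed"
    return "approximately symmetric"


for data in ([1, 3, 5, 7, 9], [1, 1, 1, 1, 10]):
    g = skewness(data)
    print(data, round(g, 3), interpret_skewness(g))
```

The first data set is perfectly symmetric around 5, so its skewness is 0. The second has one large value on the right, which gives a positive skewness above 1.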
$KRT = \frac{\sum_{i=1} (x_i - \bar{x})^4 \, n_i}{n S^4} - 3$
• Again: I am not going to ask for Kurtosis computations, but you must know
how to interpret it.
Kurtosis
1. A mesokurtic distribution (KRT close to 0) has a peak and tails similar to those of a
normal distribution
2. A leptokurtic distribution (KRT > 0) has a higher peak and "fatter" tails than a
normal distribution
3. A platykurtic distribution (KRT < 0) is flatter than a normal distribution, with
thinner tails
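For illustration only, the classification can be sketched in Python. The sketch makes several assumptions: each observation counts once (ungrouped data), S is the standard deviation computed with n in the denominator, and KRT is the average fourth-power difference to the mean divided by S^4, minus 3. The tolerance `tol` used to decide what counts as "close to 0" is a hypothetical choice, not a value given in this material. Statistical packages may report kurtosis without subtracting 3, in which case the normal benchmark is 3 rather than 0.

```python
def excess_kurtosis(xs):
    """KRT = sum((x - mean)^4) / (n * S^4) - 3, with S^2 computed over n (assumption)."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / n
    if s2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return sum((x - mean) ** 4 for x in xs) / (n * s2 ** 2) - 3


def interpret_kurtosis(krt, tol=0.5):
    """Classify KRT; `tol` is a hypothetical cut-off for 'close to 0'."""
    if abs(krt) <= tol:
        return "mesokurtic"
    return "leptokurtic" if krt > 0 else "platykurtic"


flat = [1, 3, 5, 7, 9]                      # evenly spread values
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # most values at the centre, two far out

for data in (flat, peaked):
    k = excess_kurtosis(data)
    print(round(k, 2), interpret_kurtosis(k))
```

The evenly spread data come out platykurtic (KRT of about -1.3), while the data with most values at the centre and two extreme values come out leptokurtic (KRT of 2).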
Take-home Example:
Load into Stata the dataset we used in Seminar 1. Study the skewness
and kurtosis of the numerical variables.
Hint: sum, det (short for summarize, detail) is the appropriate command