Topic II Part II


Mean vs Median

Some remarks about the mean:

• Takes into consideration all values in the dataset
• Best measure of central tendency for symmetrically distributed data
• Highly sensitive to extreme values among the data (outliers)

Some remarks about the median:

• Depends on the ordering of the data: only the middle value(s) matter, not
the magnitude of the other values
• Is less sensitive than the mean to outliers

If the distribution is symmetric, the mean and the median provide very
similar information!
Quick Exercise
Find the median and the mean, and compare your results

Data set: {1, 3, 5, 5, 4, 6, 9, 14, 21, 22}
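One way to check the exercise is a short Python sketch using the standard `statistics` module. The variable names are illustrative and not part of the course material.

```python
from statistics import mean, median

# Data set from the exercise (it does not need to be sorted beforehand).
data = [1, 3, 5, 5, 4, 6, 9, 14, 21, 22]

data_mean = mean(data)      # sum of the values divided by the number of values
data_median = median(data)  # sorts internally; averages the two middle values when n is even

print(f"mean = {data_mean}, median = {data_median}")
```

The mean (9) is well above the median (5.5) because the large values 21 and 22 pull the mean upwards while leaving the median unchanged.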


Measures of Dispersion
We have several instruments to measure dispersion:

• Variance, standard deviation and coefficient of variation (dispersion
with respect to the mean)
• Quartiles (similar idea to the median)


Quartiles
We divide the ordered data into 4 intervals (each containing the same
number of observations) and obtain:
1. The first or lower quartile (Q1): the value that leaves one-fourth of the
observations below it and three-fourths above it (it is higher than 25% of
the observations)
2. The second quartile (Q2 = median)
3. The third or upper quartile (Q3): the value that leaves three-fourths of
the observations below it and one-fourth above it (it is higher than 75% of
the observations)
Quartiles
To compute the 1st and 3rd quartiles we follow a process similar to the one
used to find the median.

1) Find the position
2) If the position ends in .25 → round down / if the position ends in .75 →
round up (if it ends in .5 → round up for Q1 and round down for Q3)

How to find the position?

• 1st quartile: (n+1)/4
• 3rd quartile: 3(n+1)/4
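The position rule can be sketched as a small Python function. This is only an illustration of the convention used in these slides (including the treatment of positions ending in .5); the function name `quartile_position` is an assumption, and statistical software often uses other conventions.

```python
def quartile_position(n: int, k: int) -> tuple:
    """Return (raw position, rounded 1-based position) for quartile k (1 or 3).

    Rounding follows the slide convention:
    .25 -> down, .75 -> up, .5 -> up for Q1 and down for Q3.
    """
    if k not in (1, 3):
        raise ValueError("k must be 1 or 3")
    raw = k * (n + 1) / 4
    whole = int(raw)
    frac = raw - whole  # always 0, 0.25, 0.5 or 0.75, all exact in floating point
    round_up = frac == 0.75 or (frac == 0.5 and k == 1)
    return raw, whole + 1 if round_up else whole


for n in (16, 25):
    print(n, quartile_position(n, 1), quartile_position(n, 3))
```

The rounded position is 1-based, so it refers to the observation at that rank in the ordered data.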
Example
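Example with an even number of observations (n = 16): MBA tuition fees in Spain

$$\text{Position of } Q_1 = \frac{16 + 1}{4} = 4.25 \rightarrow 4, \qquad Q_1 = 12800$$

$$\text{Position of } Q_2 \text{ (median)} = \frac{16 + 1}{2} = 8.5 \rightarrow \text{take the mean}, \qquad Q_2 = \frac{19600 + 20900}{2} = 20250$$

$$\text{Position of } Q_3 = \frac{3(16 + 1)}{4} = 12.75 \rightarrow 13, \qquad Q_3 = 36000$$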
Example II
Example with an odd number of observations (n = 25): share of women among all
members of parliament for 25 selected countries (2016)

 i   Country               % women
 1   Japan                 11.6
 2   Korea (Republic of)   16.3
 3   United States         19.5
 4   Ireland               19.9
 5   Singapore             23.9
 6   France                25.7
 7   Israel                26.7
 8   United Kingdom        26.7
 9   Slovenia              27.7
10   Canada                28.3
11   Luxembourg            28.3
12   Switzerland           28.9
13   Italy                 30.1
14   Austria               30.3
15   Australia             30.5
16   New Zealand           31.4
17   Netherlands           36.4
18   Germany               36.9
19   Denmark               37.4
20   Spain                 38
21   Norway                39.6
22   Iceland               41.3
23   Finland               41.5
24   Belgium               42.4
25   Sweden                43.6

$$\text{Position of } Q_1 = \frac{25 + 1}{4} = 6.5 \rightarrow 7, \qquad Q_1 = 26.7 \text{ (Israel)}$$

$$\text{Position of } Q_2 = \frac{25 + 1}{2} = 13, \qquad Q_2 = 30.1 \text{ (Italy)}$$

$$\text{Position of } Q_3 = \frac{3(25 + 1)}{4} = 19.5 \rightarrow 19, \qquad Q_3 = 37.4 \text{ (Denmark)}$$

Remember: positions ending in .5 must be rounded up in the case of Q1 and
rounded down in the case of Q3.
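The lookup in this example can be reproduced with a few lines of Python. This is a sketch: the list `shares` is copied from the table, and the positions are the ones computed in the example. For comparison, it also calls `statistics.quantiles`, which interpolates between neighbouring observations instead of rounding the position, so its Q1 and Q3 need not match the slide rule exactly.

```python
from statistics import quantiles

# Share of women in parliament (%), in the same (already sorted) order as the table.
shares = [11.6, 16.3, 19.5, 19.9, 23.9, 25.7, 26.7, 26.7, 27.7, 28.3,
          28.3, 28.9, 30.1, 30.3, 30.5, 31.4, 36.4, 36.9, 37.4, 38.0,
          39.6, 41.3, 41.5, 42.4, 43.6]

# 1-based positions from the example: 6.5 -> 7, 13, and 19.5 -> 19.
q1, q2, q3 = shares[7 - 1], shares[13 - 1], shares[19 - 1]
print("slide rule:  ", q1, q2, q3)

# Default method of statistics.quantiles: position (n+1)*p with linear interpolation.
interpolated = quantiles(shares, n=4)
print("interpolated:", interpolated)
```

Both approaches agree on the median here, since position 13 is a whole number; they differ for Q1 and Q3 only because of how the half positions are handled.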
Outliers
An outlier is an extreme observation that falls outside the overall
pattern of the data

Characteristics:
• Outliers are problematic because they tend to distort the analysis of the
whole data set (e.g. the mean)
• They can appear either because of the natural variability of the variable
being measured, or because of some experimental error. In the first case,
the existence of the outlier is fully justified, while in the second case we
want to detect and eliminate such data
• Ultimately, whether to delete data depends on your own judgment!
Outliers
How can we detect outliers?
To find the outliers of a data set, compute the interquartile range:
• IQR = Q3 - Q1

A value in the data set is an outlier if either:

• It is smaller than Q1 - 1.5 × IQR
• It is greater than Q3 + 1.5 × IQR

Values between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR are not considered outliers
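The detection rule can be sketched in Python as follows. The function names `iqr_fences` and `find_outliers` are illustrative, the quartiles are the ones from the MBA tuition example (Q1 = 12800, Q3 = 36000), and the list of fees is hypothetical, made up only to show the rule at work.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) limits outside of which a value is an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1: float, q3: float) -> list:
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


low, high = iqr_fences(12800, 36000)
print(low, high)

# Hypothetical fees, not the actual MBA data.
example_fees = [9000, 20250, 45000, 75000]
print(find_outliers(example_fees, 12800, 36000))
```

With these quartiles the lower limit is negative, so in practice only very large fees could be flagged as outliers.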
Example (cont)

We previously obtained Q1 = 12800 and Q3 = 36000.

IQR = Q3 - Q1 = 36000 - 12800 = 23200

Therefore, an observation will be an outlier if it is smaller than
Q1 - 1.5 × IQR = 12800 - 34800 = -22000 or larger than
Q3 + 1.5 × IQR = 36000 + 34800 = 70800

Any outliers?
Five Summary Numbers
The set of median, Q1, Q3, Min and Max is often referred to as the "five
summary numbers" (or five-number summary)
[Figure: box plot marking Q1, Q2 and Q3, the minimum and maximum values
(w/o outliers), and the outliers]
Variance and Standard Deviation
Both are measures of dispersion with respect to the mean: the larger the
measure, the larger the dispersion of the data.

$$Var(x) = \sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

$$sd(x) = \sigma_x = \sqrt{Var(x)}$$

Translation: to compute the variance, we need to 1) obtain the mean,
2) square the difference between each value of the variable and the mean,
and 3) sum everything and divide by n-1
Example
Data: {1,3,5,7,9}
Step #1: compute the mean (=5)

Step #2: for each value, square the difference to the mean
(1-5)^2=16 (3-5)^2=4 (5-5)^2=0 (7-5)^2=4 (9-5)^2=16

Step #3: sum everything and divide by n-1

Var(x)=(16+4+0+4+16)/(5-1)=10
sd(x)=√10≈3.16
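The three steps can be mirrored in Python. This is a sketch with illustrative variable names; the last lines cross-check the manual computation against `statistics.variance` and `statistics.stdev`, which also divide by n-1.

```python
from math import isclose, sqrt
from statistics import stdev, variance

data = [1, 3, 5, 7, 9]
n = len(data)

# Step 1: compute the mean.
mean_x = sum(data) / n

# Step 2: square the difference between each value and the mean.
squared_diffs = [(x - mean_x) ** 2 for x in data]

# Step 3: sum everything and divide by n-1.
var_x = sum(squared_diffs) / (n - 1)
sd_x = sqrt(var_x)

print(mean_x, squared_diffs, var_x, round(sd_x, 2))

# Cross-check with the standard library (sample variance, denominator n-1).
assert isclose(var_x, variance(data))
assert isclose(sd_x, stdev(data))
```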
Variance and Sd
1. Why do we define the variance this way if we want to measure dispersion (with respect
to the mean)?

• Squaring makes each term positive, so that values above the mean do not cancel
values below the mean
• We want to know how spread out the data values are around the mean on average,
regardless of whether the deviations are negative or positive
• Remark: squaring gives more "weight" to large differences
• Important: the sd is measured in the units of the variable and the variance in
squared units, so both depend on the scale of the variable (not great when
comparing different variables)

2. What does it mean if the standard deviation is zero?


Coefficient of Variation
The advantage of the coefficient of variation is that it has no units. Therefore, it is useful for
comparing the dispersion of different variables.

$$V = \frac{\sigma_x}{\bar{x}}$$

Threshold:
• If V is below 0.3, we consider the data not very dispersed
• If V is above 0.3, the data are considered dispersed

Disadvantages:
• When the mean of a variable is zero, V cannot be calculated
Take-home Example
Example: compute the coefficient of variation for each dataset

• Dataset A {0,0,14,14} (sol 1.15)
• Dataset B {0,6,8,14} (sol 0.82)
• Dataset C {6,6,8,8} (sol 0.16)
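The solutions can be checked with a short Python sketch. The function name `coefficient_of_variation` is illustrative; it uses the sample standard deviation (denominator n-1), consistent with the variance definition used in these slides.

```python
from statistics import mean, stdev


def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean (undefined when the mean is 0)."""
    m = mean(values)
    if m == 0:
        raise ValueError("V cannot be calculated when the mean is zero")
    return stdev(values) / m


datasets = {
    "A": [0, 0, 14, 14],
    "B": [0, 6, 8, 14],
    "C": [6, 6, 8, 8],
}

results = {name: coefficient_of_variation(values) for name, values in datasets.items()}
for name, v in results.items():
    print(f"Dataset {name}: V = {v:.2f}")
```

All three datasets have the same mean (7), so the differences in V come entirely from how spread out the values are around it.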
Measures of shape: skewness
We introduce the concept and computation of skewness for unimodal data (just one peak)

Positively skewed variable
• The "tail" is on the right
• Mean > Median

Our data is symmetric (not skewed) if the mean and the median agree

Negatively skewed variable
• The tail is on the left
• Median > Mean
Skewness
The difference between the mean and the median is a first indicator of
(a)symmetry, although not a very good one

The coefficient of skewness (asymmetry) is computed as:

$$Skew = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}$$

I will not ask you to compute skewness by hand. However, you must know
how to interpret the resulting measure
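Although it is not computed by hand here, the formula is easy to evaluate with software. The following Python sketch assumes that s is the sample standard deviation (denominator n-1); the function name `sample_skewness` is illustrative.

```python
from statistics import mean, stdev


def sample_skewness(values) -> float:
    """Skewness coefficient: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    m = mean(values)
    s = stdev(values)  # assumption: s is the sample standard deviation
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


symmetric = [1, 3, 5, 7, 9]
right_tail = [1, 3, 5, 5, 4, 6, 9, 14, 21, 22]  # data set from the quick exercise

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
```

The symmetric data set gives a value of (essentially) zero, while the quick-exercise data, whose mean lies above its median, gives a positive value.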
Skewness
How to interpret the skewness coefficient
• If skewness is less than −1 or greater than +1, the distribution is highly
skewed
• If skewness is between −1 and −0.5 or between +0.5 and +1, the
distribution is moderately skewed
• If skewness is between −0.5 and +0.5, the distribution is approximately
symmetric

• IMPORTANT: negative skewness values are associated with tails to the left,
and positive skewness values are associated with tails to the right
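These interpretation rules can be written as a small helper. This is a sketch: the function name `interpret_skewness` is illustrative, and the treatment of values exactly equal to ±0.5 or ±1 is an assumption, since the rules above do not say which category those boundary values belong to.

```python
def interpret_skewness(skew: float) -> str:
    """Classify a skewness coefficient using the thresholds 0.5 and 1."""
    size = abs(skew)
    if size <= 0.5:  # assumption: exactly 0.5 counts as approximately symmetric
        return "approximately symmetric"
    level = "highly" if size > 1 else "moderately"  # assumption: exactly 1 is moderate
    tail = "right" if skew > 0 else "left"
    return f"{level} skewed, tail to the {tail}"


for value in (-1.4, -0.7, 0.1, 0.8, 2.3):
    print(value, "->", interpret_skewness(value))
```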
Measures of Shape: Kurtosis
Kurtosis
Measures how concentrated the data are in the tails of their distribution by
comparison to the normal distribution

• The coefficient of kurtosis is computed as:

$$KRT = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4 \, n_i}{n S^4} - 3$$

• Again: I am not going to ask for Kurtosis computations, but you must know
how to interpret it.
Kurtosis
1. Mesokurtic data (KRT close to 0): the tails are similar to those of a normal distribution
2. Leptokurtic data (KRT > 0): the distribution of values has a high peak and
"fat" tails (heavier than those of a normal distribution)
3. Platykurtic data (KRT < 0): the distribution is flatter than a normal
distribution, with thinner tails
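For reference, the kurtosis coefficient can be evaluated with a short Python sketch. It rests on two assumptions that the slides do not state: every observation is listed individually (so each frequency n_i equals 1), and S is the standard deviation computed with denominator n. The function name `excess_kurtosis` is illustrative.

```python
def excess_kurtosis(values) -> float:
    """Fourth central moment divided by S^4, minus 3 (0 for a normal distribution)."""
    n = len(values)
    m = sum(values) / n
    # Assumption: S^2 uses denominator n; with n-1 the result changes slightly.
    s_squared = sum((x - m) ** 2 for x in values) / n
    if s_squared == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth_moment = sum((x - m) ** 4 for x in values) / n
    return fourth_moment / s_squared ** 2 - 3


evenly_spread = [1, 3, 5, 7, 9]
krt = excess_kurtosis(evenly_spread)
print(krt, "platykurtic" if krt < 0 else "leptokurtic" if krt > 0 else "mesokurtic")
```

Evenly spaced data such as {1,3,5,7,9} have no pronounced peak or heavy tails, so the coefficient comes out negative (platykurtic).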
Take-home Example:
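Load into Stata the dataset we used in Seminar 1. Study the skewness
and kurtosis of the numerical variables.
Hint: sum, det (short for summarize, detail) is the appropriate command.
Note that Stata reports kurtosis without subtracting 3, so a normal
distribution has a kurtosis of 3 rather than 0.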
