Reading 2 - Organising, Visualising and Describing Data


Quantitative Methods

Organizing, Visualizing and Describing Data
Study Session 1

Reading No – 2
Version 2022
Learning Outcome Statements
The candidate should be able to:
• a. identify and compare data types;
• b. describe how data are organized for quantitative analysis;
• c. interpret frequency and related distributions;
• d. interpret a contingency table;
• e. describe ways that data may be visualized and evaluate uses of specific visualizations;
• f. describe how to select among visualization types;
• g. calculate and interpret measures of central tendency;
• h. evaluate alternative definitions of mean to address an investment problem;
• i. calculate quantiles and interpret related visualizations;
• j. calculate and interpret measures of dispersion;
• k. calculate and interpret target downside deviation;
• l. interpret skewness;
• m. interpret kurtosis;
• n. interpret correlation between two variables.

www.mentormecareers.com
All Rights Reserved -Mentor Me Careers
LOS a: Data Types
Data types are discussed first because, in later readings, the choice of statistical technique depends on the type of data being analyzed. For example, consider structured vs unstructured data: unstructured data has to be processed before it can be analyzed or used in a process.

Data Types
• Numerical
   - Continuous: future value (FV) of an investment
   - Discrete: number of compounding
• Categorical
   - Nominal: GICS industry classification
   - Ordinal: ratings from 1 star (worst performance) to 5 stars (best performance)
• Cross sectional: inflation rates in January in Asian countries
• Time series: daily closing price of Infosys stock for December 2021
• Structured vs unstructured: stock market data is structured; social media posts, credit transactions, etc. are unstructured

Los b: Organizing data for analysis
Organizing Raw Data

One Dimensional
Date          NIFTY 50
30/03/2016    1000
01/04/2016    997
04/04/2016    1003
05/04/2016    983
06/04/2016    984
07/04/2016    976
08/04/2016    977
11/04/2016    992
12/04/2016    997
13/04/2016    1015
18/04/2016    1023
20/04/2016    1023

Two Dimensional
NIFTY 50 summary statistics, one column per period (period labels on the slide: 2016-2021, 01/04/2016, 01/04/2017, 01/04/2018)
St Dev     18%     18%     20%
Returns    15%     15%     13%
Sharpe     0.41    0.42    0.27

Neither layout is inherently the right one; the choice depends on the analysis objective. For example, if we only want to analyze NIFTY 50 returns, a one-dimensional layout does the job. In practice, however, you would rarely stop there: you would compare the returns across time periods and calculate other related statistics alongside them.

Los c: Summarizing data using tabular format
Index Data - NIFTY. Spreadsheet Summarizing Data: use this template to see it in a spreadsheet.

Data Set
Date          Closing Price   % Change
20/05/2021    14906.05
21/05/2021    15175.3         1.8%
24/05/2021    15197.7         0.1%
25/05/2021    15208.45        0.1%
26/05/2021    15301.45        0.6%
27/05/2021    15337.85        0.2%
28/05/2021    15435.65        0.6%
31/05/2021    15582.8         1.0%
01/06/2021    15574.85        -0.1%
02/06/2021    15576.2         0.0%
03/06/2021    15690.35        0.7%
04/06/2021    15670.25        -0.1%
07/06/2021    15751.65        0.5%
08/06/2021    15740.1         -0.1%
09/06/2021    15635.35        -0.7%
10/06/2021    15737.75        0.7%
11/06/2021    15799.35        0.4%
14/06/2021    15811.85        0.1%
15/06/2021    15869.25        0.4%
16/06/2021    15767.55        -0.6%
17/06/2021    15691.4         -0.5%
18/06/2021    15683.35        -0.1%

Find Data Extremes
Max    1.8%
Min    -0.665%

Summarize Returns
1: Create a range from the minimum data point to the maximum. We created ranges 0.5% apart; there is no particular rule for this.
2: Count the observations between the ranges (absolute frequency).
3: Divide the absolute count by the total observations (relative frequency).
4: Cumulate the absolute count on every row (cumulative absolute frequency).
5: Divide the cumulative absolute count by the total observations (cumulative relative frequency).

Bin (upper limit)   Abs   Relative    Cum Absolute   Cum Relative Frequency
-0.67%              1     0.047619    1              0.047619048
-0.17%              2     0.095238    3              0.142857143
0.33%               9     0.428571    12             0.571428571
0.83%               7     0.333333    19             0.904761905
1.33%               1     0.047619    20             0.952380952
1.83%               1     0.047619    21             1
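The five steps can be sketched in Python as a rough check on the table. This is only an illustration, not the spreadsheet's method: it assumes each bin is labelled by its upper limit and that the first bin starts at the minimum observation, which is the reading that reproduces the counts shown. The names `BIN_WIDTH`, `N_BINS` and `bin_index` are made up for this example.

```python
# Closing prices from the data set above (NIFTY, 20/05/2021 to 18/06/2021).
closes = [
    14906.05, 15175.3, 15197.7, 15208.45, 15301.45, 15337.85, 15435.65,
    15582.8, 15574.85, 15576.2, 15690.35, 15670.25, 15751.65, 15740.1,
    15635.35, 15737.75, 15799.35, 15811.85, 15869.25, 15767.55, 15691.4,
    15683.35,
]

# Daily % change: today's close relative to the previous close.
changes = [(b - a) / a for a, b in zip(closes, closes[1:])]

BIN_WIDTH = 0.005  # 0.5% apart, as on the slide; an arbitrary choice
N_BINS = 6

# Step 1 (assumption): bins are labelled by upper limit, starting at the minimum.
upper_limits = [min(changes) + i * BIN_WIDTH for i in range(N_BINS)]


def bin_index(x):
    """Return the index of the first bin whose upper limit is >= x."""
    for i, limit in enumerate(upper_limits):
        if x <= limit:
            return i
    raise ValueError("value above the last bin")


# Step 2: absolute frequency per bin.
abs_freq = [0] * N_BINS
for x in changes:
    abs_freq[bin_index(x)] += 1

# Steps 3 to 5: relative, cumulative absolute and cumulative relative frequency.
total = len(changes)
rel_freq = [c / total for c in abs_freq]
cum_abs = [sum(abs_freq[: i + 1]) for i in range(N_BINS)]
cum_rel = [c / total for c in cum_abs]

for row in zip(upper_limits, abs_freq, rel_freq, cum_abs, cum_rel):
    print("{:7.2%} {:3d} {:8.6f} {:3d} {:8.6f}".format(*row))
```

Changing `BIN_WIDTH` changes the shape of the resulting distribution, which is why the choice of range deserves some thought even though no rule fixes it.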
Los e: Visualising the Tabular Data
Plotting the same table of absolute frequency and relative frequency will help us visualise
this data
[Figure: Histogram with Frequency Polygon. Horizontal axis: return bins from -0.67% to 1.83%. The histogram bars show absolute frequency and the frequency polygon shows relative frequency. A companion chart plots Cum Relative Frequency over the same bins.]
Reference Spreadsheet
LOS d. Interpret a contingency table-1
The frequency table summarized a single variable, the stock returns of NIFTY. What if we summarized the market capitalisation of all companies listed on the NYSE according to sector?

Number of companies by sector and market cap
Sector                        Greater than 100 Bn   Between 50 Bn and 100 Bn   Less than 50 Bn   Total
Industrials                   3                     0                          7                 10
Health Care                   4                     6                          8                 18
Information Technology        5                     3                          6                 14
Consumer Discretionary        1                     2                          10                13
Utilities                     0                     0                          6                 6
Financials                    2                     5                          8                 15
Materials                     0                     0                          5                 5
Consumer Staples              1                     0                          3                 4
Real Estate                   0                     1                          4                 5
Energy                        1                     0                          5                 6
Telecommunication Services    1                     0                          1                 2
Total                         18                    17                         63                98

Marginal Frequency
The total of a row or column. For example, adding up all Health Care companies across every market-cap bucket gives a marginal frequency of 18.

Joint Frequency
The count in a single cell. For example, the Consumer Discretionary sector has the highest number of companies with a market cap of less than 50 Bn (10). It is called joint because both sector and market cap are used to categorise the count.
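As a cross-check on the totals, the marginal frequencies can be recomputed from the joint frequencies. The sketch below is illustrative only; the dictionary layout and variable names are assumptions made for this example, and the counts are copied from the table.

```python
# Joint frequencies from the table above, per sector:
# (greater than 100 Bn, between 50 Bn and 100 Bn, less than 50 Bn)
joint = {
    "Industrials": (3, 0, 7),
    "Health Care": (4, 6, 8),
    "Information Technology": (5, 3, 6),
    "Consumer Discretionary": (1, 2, 10),
    "Utilities": (0, 0, 6),
    "Financials": (2, 5, 8),
    "Materials": (0, 0, 5),
    "Consumer Staples": (1, 0, 3),
    "Real Estate": (0, 1, 4),
    "Energy": (1, 0, 5),
    "Telecommunication Services": (1, 0, 1),
}

# Marginal frequency per sector: sum across the market-cap buckets.
row_totals = {sector: sum(counts) for sector, counts in joint.items()}

# Marginal frequency per market-cap bucket: sum down each column.
col_totals = [sum(col) for col in zip(*joint.values())]

# Grand total: every company counted once.
grand_total = sum(row_totals.values())

# Each bucket's share of all companies (relative marginal frequency).
col_shares = [c / grand_total for c in col_totals]

print(row_totals["Health Care"], col_totals, grand_total)
```

Summing the row totals and summing the column totals must give the same grand total, which is a quick way to catch a mistyped cell.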

LOS d. Interpret a contingency table-2
Expressing the table from the previous slide as % or relative frequency. Each market-cap column is expressed as a share of its column total; the Total column and the Total row are shares of all companies.

Relative Frequency
Sector                        Greater than 100 Bn   Between 50 Bn and 100 Bn   Less than 50 Bn   Total
Industrials                   17%                   0%                         11%               10%
Health Care                   22%                   35%                        13%               18%
Information Technology        28%                   18%                        10%               14%
Consumer Discretionary        6%                    12%                        16%               13%
Utilities                     0%                    0%                         10%               6%
Financials                    11%                   29%                        13%               15%
Materials                     0%                    0%                         8%                5%
Consumer Staples              6%                    0%                         5%                4%
Real Estate                   0%                    6%                         6%                5%
Energy                        6%                    0%                         8%                6%
Telecommunication Services    6%                    0%                         2%                2%
Total                         18%                   17%                        64%               100%

Confusion Matrix
Recasting the same kind of table around an outcome or quality label is presented here as a confusion matrix (strictly, a confusion matrix tabulates predicted against actual classes). In the table below, companies with an expected 2017 margin of less than 20% are counted as bad performance, and those with more than 20% as good performance.

Margin (2017 E)               Less than 20%         More than 20%
Sector                        Bad Performance       Good Performance    Total
Industrials                   5.00                  5.00                10.00
Health Care                   0.00                  14.00               14.00
Information Technology        14.00                 13.00               27.00
Consumer Discretionary        0.00                  4.00                4.00
Utilities                     6.00                  4.00                10.00
Financials                    0.00                  11.00               11.00
Materials                     5.00                  2.00                7.00
Consumer Staples              0.00                  3.00                3.00
Real Estate                   5.00                  4.00                9.00
Energy                        0.00                  1.00                1.00
Telecommunication Services    2.00                  1.00                3.00
Total                         37.00                 62.00               99.00
Los e: Visualising this data
[Figure: Number of Companies in Various Sectors (bar chart; one bar per sector, count axis 0 to 20)]
[Figure: Stacked Bar (one bar per sector, split into Large Cap, Mid Cap and Small Cap; count axis 0 to 20)]
[Figure: Clustered Bar (Small Cap, Mid Cap and Large Cap bars side by side for each sector; count axis 0 to 12)]
Reference Spreadsheet


Los e: Visualising this data
Word Cloud: used to highlight the key words in a body of text.
Source: CFA curriculum

Line Charts: useful for showing trends in a clear and concise way.
[Figure: Index Performance 2016-2021 (line chart). Series: NIFTY 50, Nifty 100, MCap 100, S-Cap 100. Date axis from 30/03/2016 to 30/03/2020; value axis 0 to 2500.]
Los e: Visualising this data
Bubble charts convey more pieces of information than a line chart. In this case the chart shows, for a particular company:
• Revenue
• EPS profit or EPS loss

Source: CFA curriculum
Los e: Visualising this data
[Figure: Gold ETF Vs Equity MF. Series: ETF, Franklin. Date axis from 22-Sep-17 to 23-Jul-21; return axis from -30% to 20%.]

Scatter Plots: used to show the relationship between two variables over a time frame. Random data may not be useful to plot because it might not show any relationship. How closely or widely the points are spread might indicate the strength of the relationship.

Notice the relationship between gold and equity: most of the time they have an inverse or negative relationship. That is the benefit of using a scatter plot.
Los e: Visualising this data
Heat Map

This summarizes the same classified tabular data, but ordered from highest to lowest and colour-coded.

Los f: How to select the visual type
Bonus Practical Tip:
Always decide the exploration or presentation using these 3 steps:
1. What - what to present: what exactly do you want to conclude?
2. Whom - whom to present to: this makes a big difference to the readability of the data (a CEO vs an employee).
3. How - the chart type.

What to explore or present?
• Relationship: Scatter, Scatter Matrix, Heat Map
• Distribution
   - Numerical: Histogram, Frequency Polygon, Cumulative Distribution
   - Categorical: Bar Chart, Tree Chart, Heat Map
   - Unstructured: Word Cloud
• Comparison
   - Categories: Bar, Tree, Heat Map
   - Time: Line, Bubble Line
Los g: Measures of central tendency

Our Usual Daily Questions
• Average house price
• Average score in an exam
• Average salary
• Average expenses
Los g:The fundamentals of central tendency
Score in English: 12, 15, 20, 20, 40, 45

• Numeric (add and divide): Average = 25.33
• Location (middle location): Median = 20
• Frequency (most repeating value): Mode = 20
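The three measures for the English scores can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are chosen for illustration.

```python
from statistics import mean, median, mode

scores = [12, 15, 20, 20, 40, 45]

average = mean(scores)     # numeric: add the scores and divide by the count
middle = median(scores)    # location: mean of the two middle values when the count is even
most_common = mode(scores) # frequency: the most repeated value

print(round(average, 2), middle, most_common)
```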

Los g: Why are we Interested In the centre?
Consider that you are evaluating a mutual fund investment scheme. Which one of the questions below would you be interested in asking?
1. What was the highest return for a particular month?
2. Or what has been the average return?
Let's take the example on this sheet: Franklin Templeton MF.
After the analysis we concluded that the monthly return is:
Average 0.58%    Median 1.02%

• Does the central location itself give you any information?
• What if we were to find how much each observation fluctuates from the centre? That is risk, and that is how we get into the concepts of standard deviation, mean absolute deviation and downside deviation.
• We are always interested in the average fluctuation, because it drives the volatility of our capital.
• Imagine investing in a fund whose returns are good at 21% annually, but just when you were planning to buy a house the fund experiences a -70% fluctuation and you have no option but to wait for the markets to normalise again. This is why we are learning statistics, and this is why managing risk is more important than targeting returns.
Los g: Arithmetic Mean Properties
Data Set    A     B
1           20    20
2           25    0      (outlier)
3           30    30
4           32    32
5           34    34
6           40    40
7           45    100    (outlier)
AM          32    37
Median      32    32

The outliers in B have a high effect on the average (37 instead of 32), while the median stays at 32.

Handling these outliers
1. Do nothing and use the same data
   • Usually done when the data is correct
2. Delete all the outliers
   • This is called a trimmed mean
   • Usually used in sports competitions
3. Replace the outliers with normalised data
   • This is called a winsorized mean
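Options 2 and 3 can be sketched on data set B as follows. This is a minimal illustration that assumes exactly one value at each end is treated as an outlier (the parameter `k` is hypothetical); in practice the cut-off is a judgment call or a stated percentage.

```python
from statistics import mean

data_b = sorted([20, 0, 30, 32, 34, 40, 100])
k = 1  # assumption: treat one value at each end as an outlier

# Option 1: do nothing.
plain_mean = mean(data_b)

# Option 2, trimmed mean: drop the k lowest and k highest values.
trimmed = data_b[k:-k]
trimmed_mean = mean(trimmed)

# Option 3, winsorized mean: replace the outliers with the nearest remaining value.
winsorized = [trimmed[0]] * k + trimmed + [trimmed[-1]] * k
winsorized_mean = mean(winsorized)

print(round(plain_mean, 2), trimmed_mean, round(winsorized_mean, 2))
```

Both adjusted means land much closer to the median of 32 than the plain mean does, which is the point of the two techniques.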

Los g: Median
Data Set    ODD   EVEN
1           20    20
2           25    25
3           30    30
4           32    32
5           34    34
6           40    40
7           45    45
8                 50
Median      32    33

ODD (7 observations), mid location
• Location = 4th observation = 8/2, or (N + 1)/2

EVEN (8 observations), mid location
• 4th observation = 8 observations/2, or N/2
• 5th observation = (8 observations + 2)/2, or (N + 2)/2
• Take the mean of the two values: (32 + 34)/2 = 33

You might also have noticed that the median is not affected by extreme values, because it is based on finding a location rather than performing arithmetic on every value.
LOS g: The Mode
Mode is the most frequently occurring value. A data set might have:
• One mode - Unimodal
• Two modes - Bimodal
• Three modes - Trimodal

For stock returns or asset returns there might be no specific modal value, but once the returns are grouped into a histogram, a bin can be the mode. In the NIFTY histogram from Los c, the modal bin is the one ending at 0.33%, with 9 observations.

[Figure: Histogram with Frequency Polygon. Horizontal axis: return bins from -0.67% to 1.83%; bars show absolute frequency and the polygon shows relative frequency.]
Los g: Weighted Mean
Often used in portfolio analysis and in financial analysis. The method weights each value by its share of the total.
For example: if we invested in three asset classes in different proportions, then the average return is the sum of each proportion multiplied by its return.

Assets           Allocation (A)   Returns (B)   A x B
Gold             15%              9%            1.3500%
Equity           60%              15%           9.0000%
Bonds            25%              12%           3.0000%
Weighted Mean                                   13.35% (about 13.4%)
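The same calculation as a short Python sketch, using the allocations and returns from the table. The dictionary layout is just one convenient way to hold the data.

```python
# (allocation, return) per asset class, from the table above
portfolio = {
    "Gold": (0.15, 0.09),
    "Equity": (0.60, 0.15),
    "Bonds": (0.25, 0.12),
}

# The weights should cover the whole portfolio.
total_weight = sum(w for w, _ in portfolio.values())

# Weighted mean: sum of allocation x return.
weighted_mean = sum(w * r for w, r in portfolio.values())

print(f"total weight {total_weight:.0%}, weighted mean {weighted_mean:.2%}")
```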

Los g: Geometric Mean
Often used for time series return calculations. The geometric mean works on a multiplicative principle rather than an additive one. Intuitively, the GM first compounds the periodic returns and then uses the rate expression to find the average rate of compounding.

Year   Returns   Future Value (of 1)   Expression
0                1
1      4%        1.04                  1 x (1 + 4%) = 1.04
2      3%        1.0712                1.04 x (1 + 3%) = 1.0712
3      5%        1.12476               1.0712 x (1 + 5%) = 1.12476

Or simplifying: 1 x (1 + 4%)(1 + 3%)(1 + 5%) = 1.12476
Rate = (FV/PV)^(1/n) - 1 = 1.12476^(1/3) - 1
AM 4%    GM 3.9968%

• Note that a negative return (let's say -3%) cannot be used directly. It enters the calculation as 1 + (-3%) = 0.97, because the value becomes 97% of the previous value.
• Always remember that the GM will be less than or equal to the AM, because of the compounding effect. The two are equal only when every period has the same return.
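A minimal Python sketch of the compounding logic, using the three returns from the table. It assumes Python 3.8 or later for `math.prod`.

```python
import math

returns = [0.04, 0.03, 0.05]

# Compound the growth factors (1 + R) to get the future value of 1 invested.
future_value = math.prod(1 + r for r in returns)

# Rate expression: (FV / PV) ** (1 / n) - 1, with PV = 1.
geometric_mean = future_value ** (1 / len(returns)) - 1
arithmetic_mean = sum(returns) / len(returns)

print(f"FV={future_value:.5f} AM={arithmetic_mean:.4%} GM={geometric_mean:.4%}")

# A negative return enters as a growth factor below 1, for example -3% becomes 0.97.
with_loss = [0.04, -0.03, 0.05]
gm_with_loss = math.prod(1 + r for r in with_loss) ** (1 / len(with_loss)) - 1
```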

Los g: Harmonic Mean
This measure of central tendency is particularly useful for rates or ratios. For example, look at the harmonic mean of the P/E ratios of Nifty.

P/E Ratios - Nifty                 Inverse Value (1/x)
25                                 4%
35                                 3%
40                                 3%
20                                 5%
Sum of inverse                     14%
HM (no. of obs / sum of inverse)   27.86

What the inverse does is the exact opposite of a weighted mean: the higher the value, the lower its proportion. Had you calculated the AM, you would have got 30, and had you calculated the weighted mean (each P/E weighted by its own value), you would have got about 32.
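The harmonic mean from the table can be checked both by hand and with the standard library. One assumption is flagged in the code: the weighted mean of about 32 quoted on the slide is reproduced here by weighting each P/E by its own value, which is an inference from the numbers rather than something the slide states.

```python
from statistics import harmonic_mean, mean

pe_ratios = [25, 35, 40, 20]

# Manual route, as on the slide: number of observations / sum of inverses.
hm_manual = len(pe_ratios) / sum(1 / x for x in pe_ratios)

# Standard-library route for comparison.
hm_library = harmonic_mean(pe_ratios)

am = mean(pe_ratios)

# Assumption: the "weighted mean" of about 32 weights each P/E by its own value.
value_weighted = sum(x * x for x in pe_ratios) / sum(pe_ratios)

print(round(hm_manual, 2), am, round(value_weighted, 2))
```

The ordering harmonic mean < arithmetic mean < value-weighted mean shows how the inverse weighting pulls the average towards the smaller ratios.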

Summarizing Means
Means
• Additive Principle: Arithmetic Mean (includes outliers)
   - With weights: Weighted Mean
   - With inverse weights: Harmonic Mean (extreme outliers)
• Multiplicative Principle: Geometric Mean (compounding)
• Frequency: Mode
• Location: Median

Legend: Variation, Principle, Use
Los i: Quartiles, Quintiles, Deciles & Percentiles
It is easier for statisticians to talk about the location of data by dividing the distribution into a specific number of parts. It is critical to understand the logic here rather than memorise it.

Location
2 Parts      Median
4 Parts      Quartiles
5 Parts      Quintiles
10 Parts     Deciles
100 Parts    Percentiles
Example: Percentile
1. Find the 10th & 90th Percentile
If we look at this intuitively, using the same logic as the median: when the data points are divided into 100 parts and ordered in ascending fashion, we should be able to identify a percentile just by looking at the cumulative observations.

[Figure: ordered data divided into 100 parts, with the 10th percentile and the 90th percentile marked. Source: CFA Curriculum]
Example: Quintiles or Quartiles
Quintiles
Find the 1st, 2nd & 3rd quintile. Quintile means 5 parts, so the data, in ascending order, is divided into 5 parts of 20 observations each. Which means:
1. 1st Quintile: the lowest 20 observations, or -0.432 or lower
2. 2nd Quintile: -0.070 or lower
3. 3rd Quintile: 0.173 or lower

Quartile
Find the 1st & 3rd quartile. Quartile means 4 parts, so the data, in ascending order, is divided into 4 parts of 25 observations each. Which means:
1. 1st Quartile: the lowest 25 observations, or -0.293 or lower
2. 3rd Quartile: 0.460 or lower
3. Interquartile range: the difference between the 3rd and 1st quartile, 0.460 - (-0.293) = 0.753%
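To make the location logic concrete, here is a small Python sketch. It assumes the common textbook location formula L = (n + 1) x y / 100 with linear interpolation between neighbouring observations; spreadsheets and libraries may use slightly different conventions. The data set and the function name `percentile` are hypothetical.

```python
def percentile(sorted_values, y):
    """Value at the y-th percentile using the location L = (n + 1) * y / 100."""
    n = len(sorted_values)
    location = (n + 1) * y / 100          # 1-based position, may be fractional
    location = min(max(location, 1), n)   # clamp to the available observations
    lower = int(location)                 # whole part of the position
    fraction = location - lower
    if lower == n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    # Interpolate linearly between the two neighbouring observations.
    return below + fraction * (above - below)


data = sorted([5, 1, 9, 3, 7, 2, 8, 4, 6])  # hypothetical observations

q1 = percentile(data, 25)      # 1st quartile
median = percentile(data, 50)  # 2nd quartile
q3 = percentile(data, 75)      # 3rd quartile
iqr = q3 - q1                  # interquartile range

print(q1, median, q3, iqr)
```

Quintiles and deciles follow by passing multiples of 20 or 10 as `y`.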
Source: CFA Curriculum
Los i: Visualising Quartiles
[Figure: box and whisker plot. Labels from top to bottom: Highest; Upper Boundary (Q3); Median and Mean inside the box; Lower Boundary (Q1); Lowest. The box spans the interquartile range.]

Bins   Trading Days (cumulative)   Lower   Higher   No of observations   Quartiles
-8%    3                           -26%    -8%      3                    1
-4%    6                           -8%     -4%      3                    1
-2%    9                           -4%     -2%      3                    1
-1%    12                          -2%     -1%      3                    2
0%     15                          -1%     0%       3                    2
0%     18                          0%      0%       3                    2
2%     21                          0%      2%       3                    3
2%     24                          2%      2%       3                    3
4%     27                          2%      4%       3                    3
5%     30                          4%      5%       3                    4
8%     33                          5%      8%       3                    4
14%    36                          8%      14%      3                    4

Source spreadsheet
Remember that in a spreadsheet you can directly select the data set and create the box and whisker plot, but the point is to understand what it means.
CFA Curriculum Question

1:The fourth quintile return for the MSCI World Index is closest to:
A 20.65%.
B 26.03%.
C 27.37%.
2:For Year 6–Year 10, the mean absolute deviation of the MSCI World Index total
returns is closest to:
A 10.20%.
B 12.74%.
C 16.40%.
Solution
1. B 26.03%.
2. A 10.20%

Los j: Dispersion
Although the CFA text covers this in small parts, the concept is best understood as a whole with one data set. We will try to make this as easy and intuitive as possible. Please refer to the sheet below for the calculations.
Source spreadsheet
Below we only show the principle behind a deviation measure, not necessarily the method itself. Once you understand the principle, the method is easy to replicate.

Dispersion
• Range: the distance between the highest and lowest point.
• Multiplicative, based on (Observation - Mean): summing all of the differences leads to 0 because the negative distances cancel the positive ones, so we square the distance to remove the effect of the negative distance.
• Arithmetic, based on (Observation - Mean): summing all of the differences leads to 0 because of the negative distances, so we ignore the sign (absolute difference).
Los j: Calculating Deviations- MAD
Step 1: Calculate the mean.
Step 2: Subtract the mean from each value (distance from mean).
Step 3: Remove the signs (ignore sign).
Step 4: Add up and divide by the number of observations - 1 (because we are dealing with a sample).

Distance from mean (36 observations, rounded): -26%, -9%, -9%, -6%, -5%, -4%, -4%, -3%, -2%, -2%, -2%, -2%, -1%, -1%, -1%, -1%, -1%, 0%, 0%, 1%, 1%, 1%, 2%, 2%, 2%, 3%, 3%, 3%, 4%, 5%, 5%, 5%, 7%, 9%, 12%, 13%. The distances sum to 0.00.
Ignoring the sign, -26% becomes 26%, -9% becomes 9%, and so on; positive distances are unchanged.

Although the CFA text does not mention N-1 here (its formula divides by N), what we showed is what is done in the practical world.

Source spreadsheet
Los j: Calculating Deviations- ST DEV
Step 1: Calculate the mean.
Step 2: Subtract the mean from each value (distance from mean).
Step 3: Square the distance.
Step 4: Take the square root of the sum of squared distances divided by N-1.

The distances from the mean are the same 36 observations used for MAD, from -26% to 13%, and they sum to 0.00. After rounding, only the largest distances show a visible squared distance: -26% squares to 7%, each -9% to 1%, 9% to 1%, 12% to 1% and 13% to 2%; every other squared distance rounds to 0%. The squared distances sum to 15%.

• The square root is taken to remove the effect of the squaring in step 3.
• The square root applies to the whole ratio, that is, to both the sum of squared distances and N-1.
• If we do not take the square root, what we have is the variance.

Source spreadsheet
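The steps for both deviation measures can be sketched in Python. The return series below is hypothetical, not the spreadsheet data, and the helper name `mean_absolute_deviation` is made up for this example. The divisor for MAD is left as a parameter so that both the N-1 convention used in these slides and the divide-by-N textbook version can be compared.

```python
import math
import statistics

returns = [0.04, -0.02, 0.07, 0.01, -0.05, 0.03]  # hypothetical sample

n = len(returns)
mean = sum(returns) / n                       # Step 1: calculate the mean
distances = [r - mean for r in returns]       # Step 2: distance from the mean


def mean_absolute_deviation(dists, divisor):
    """Step 3 and 4 for MAD: ignore the sign, add up, divide."""
    return sum(abs(d) for d in dists) / divisor


mad_sample = mean_absolute_deviation(distances, n - 1)  # convention used in the slides
mad_textbook = mean_absolute_deviation(distances, n)    # divide-by-N version

# Standard deviation: square the distances, sum, divide by n - 1, take the root.
variance = sum(d ** 2 for d in distances) / (n - 1)
std_dev = math.sqrt(variance)

# Cross-check against the standard library's sample standard deviation.
library_std_dev = statistics.stdev(returns)

print(f"MAD={mad_sample:.4%} variance={variance:.6f} st dev={std_dev:.4%}")
```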
Los k: Coefficient of Variation
This is a quick detour to a future LOS, since it makes sense here. When we do not take the square root of the squared distance from the mean, the result is called variance, but in practice variance has very little use on its own, mainly because it is not intuitive. What we really want to know is the dispersion relative to the mean; the keyword is relative dispersion.
In other words, how much risk per unit of return.
Coefficient of variation = Standard Deviation / Mean Value (or Benchmark value)

Hence in our previous example the CV would be: 6.60%/0.56% = 11.79 times

In simple street language, we would say that for each 1% of return we are taking 11.79 times as much risk. Whether this is good or bad depends on the comparison. Another fund with a CV of, let's say, 6 times would be considered better than this.
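A small sketch of the ratio, using the rounded standard deviation and mean quoted in the example. The comparison fund and the function name are hypothetical.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Risk per unit of return: standard deviation divided by the mean."""
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean_value


fund_cv = coefficient_of_variation(0.0660, 0.0056)  # rounded inputs from the example
other_fund_cv = 6.0                                 # hypothetical comparison fund

# The lower CV means less risk taken per unit of return.
better = "other fund" if other_fund_cv < fund_cv else "this fund"
print(round(fund_cv, 2), better)
```

The guard against a zero mean matters in practice: the ratio becomes unstable when the mean return is close to zero.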

Los j: Relationship between Arithmetic and Geometric Means

All you need to know is that the larger the variance of the returns, the larger the difference between the arithmetic mean and the geometric mean.
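The relationship can be illustrated with two hypothetical return series that share the same arithmetic mean of 5% but differ in variance. This is a sketch, assuming Python 3.8 or later for `math.prod`; the series and names are invented for the example.

```python
import math
import statistics


def geometric_mean_return(returns):
    """Compound the growth factors, then take the n-th root and subtract 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


low_var = [0.04, 0.05, 0.06]     # hypothetical, arithmetic mean 5%
high_var = [-0.20, 0.05, 0.30]   # hypothetical, arithmetic mean 5%

gap_low = statistics.mean(low_var) - geometric_mean_return(low_var)
gap_high = statistics.mean(high_var) - geometric_mean_return(high_var)

for name, series, gap in (("low", low_var, gap_low), ("high", high_var, gap_high)):
    print(f"{name}: variance={statistics.variance(series):.4f} AM-GM gap={gap:.4%}")
```

The high-variance series shows a much larger gap between the two means, even though both series average 5% arithmetically.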

Los k: Issues with St Dev
Standard deviation, due to its arithmetic principle, penalises you even when you generate higher returns.

Date           Returns-1   Returns-2
03/04/2020     90%         -9%
02/08/2019     -8%         -8%
04/10/2018     -8%         -8%
02/03/2020     -6%         -6%
05/03/2018     -4%         -4%
03/02/2020     -4%         -4%
07/10/2020     -3%         -3%
04/09/2019     -2%         -2%
02/04/2018     -2%         -2%
02/05/2019     -1%         -1%
02/07/2019     -1%         -1%
02/07/2018     -1%         -1%
04/02/2019     -1%         -1%
01/06/2018     -1%         -1%

St Dev         25%         3%
Mean Returns   3%          -4%

Returns-1: higher returns, but risk is high. Returns-2: low returns, but better risk.

Hence, even if your performance is otherwise identical to Returns-2 and you perform extremely well in a single period, the fund manager will appear worse on a risk-adjusted basis. We could argue: what is wrong with that? Maybe such single extreme values do signify risk, even when they are positive.
Los k: Downside Deviation
1. Decide on the target return. In this case, let's say we are worried about returns below -3%.
2. Calculate the squared distance between the returns and the target return. Only returns below the target count; all other observations contribute 0.
3. Sum the squared distances, divide by the number of observations - 1 and take the square root.

Date             Returns-1   Returns-2   (1 - Target R)^2   (2 - Target R)^2
03/04/2020       90%         -9%         0.00%              0.36%
02/08/2019       -8%         -8%         0.28%              0.28%
04/10/2018       -8%         -8%         0.27%              0.27%
02/03/2020       -6%         -6%         0.07%              0.07%
05/03/2018       -4%         -4%         0.01%              0.01%
03/02/2020       -4%         -4%         0.00%              0.00%
07/10/2020       -3%         -3%         0.00%              0.00%
04/09/2019       -2%         -2%         0.00%              0.00%
02/04/2018       -2%         -2%         0.00%              0.00%
02/05/2019       -1%         -1%         0.00%              0.00%
02/07/2019       -1%         -1%         0.00%              0.00%
02/07/2018       -1%         -1%         0.00%              0.00%
04/02/2019       -1%         -1%         0.00%              0.00%
01/06/2018       -1%         -1%         0.00%              0.00%

Downside Dev                             2.14%              2.68%
Target Returns                           -3%                -3%

Now Fund Manager 1 looks less risky. Whether this is a convenient manipulation is debatable.
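The target downside deviation can be sketched in Python using the rounded returns from the table and the -3% target. Because the table shows rounded returns, the results (roughly 2.2% and 2.7%) presumably differ slightly from the spreadsheet's 2.14% and 2.68%. The function name is chosen for illustration.

```python
import math

TARGET = -0.03  # target return from the slide

returns_1 = [0.90, -0.08, -0.08, -0.06, -0.04, -0.04, -0.03,
             -0.02, -0.02, -0.01, -0.01, -0.01, -0.01, -0.01]
returns_2 = [-0.09] + returns_1[1:]  # identical except for the first period


def target_downside_deviation(returns, target):
    """Root of the summed squared shortfalls below target, divided by n - 1."""
    # Returns at or above the target contribute a shortfall of zero.
    shortfalls = [min(r - target, 0.0) for r in returns]
    return math.sqrt(sum(s ** 2 for s in shortfalls) / (len(returns) - 1))


dd_1 = target_downside_deviation(returns_1, TARGET)
dd_2 = target_downside_deviation(returns_2, TARGET)

print(f"Returns-1: {dd_1:.2%}  Returns-2: {dd_2:.2%}")
```

Unlike the standard deviation, the 90% gain in Returns-1 adds nothing to the measure, so the first series now ranks as the less risky one.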
CFA Curriculum Question
1: The arithmetic mean return over the 10 years is closest to:
A 2.97%.
B 3.00%.
C 3.33%.
2: The geometric mean return over the 10 years is closest to:
A 2.94%.
B 2.97%.
C 3.00%.
3: The harmonic mean return over the 10 years is closest to:
A 2.94%.
B 2.97%.
C 3.00%.
4: The standard deviation of the 10 years of returns is closest to:
A 2.40%.
B 2.53%.
C 7.58%.
5: The target semideviation of the returns over the 10 years, if the target is 2%, is closest to:
A 1.42%.
B 1.50%.
C 2.01%.

Solutions
1. B is correct: 3.00%.
2. B is correct: 2.9717%.
3. A is correct: 2.9442%.
4. B is correct: 2.5276%.
5. B is correct: 1.50%.
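The ten annual returns behind these questions are not reproduced on the slide. As a rough sketch of how the five statistics are computed, the Python below uses an assumed set of ten returns (hypothetical input, chosen so that the results line up with the solutions listed above). Geometric and harmonic means are computed on 1 + return, and the target semideviation uses the n - 1 divisor.

```python
import math
import statistics

# Assumed annual returns in percent; NOT given in this section, illustrative only
returns_pct = [4.5, 6.0, 1.5, -2.0, 0.0, 4.5, 3.5, 2.5, 5.5, 4.0]
n = len(returns_pct)
growth = [1 + r / 100 for r in returns_pct]  # 1 + return for each year

# 1. Arithmetic mean: sum of observations divided by n
arithmetic = sum(returns_pct) / n

# 2. Geometric mean: n-th root of the product of (1 + r), minus 1
geometric = (math.prod(growth) ** (1 / n) - 1) * 100

# 3. Harmonic mean: n divided by the sum of reciprocals of (1 + r), minus 1
harmonic = (n / sum(1 / g for g in growth) - 1) * 100

# 4. Sample standard deviation (n - 1 divisor)
std_dev = statistics.stdev(returns_pct)

# 5. Target semideviation with a 2% target: only returns below the target count
target = 2.0
semidev = math.sqrt(
    sum((r - target) ** 2 for r in returns_pct if r < target) / (n - 1)
)

print(f"arithmetic {arithmetic:.4f}%  geometric {geometric:.4f}%  harmonic {harmonic:.4f}%")
print(f"std dev {std_dev:.4f}%  target semideviation {semidev:.4f}%")
```

Whenever the returns vary, the three means are ordered harmonic < geometric < arithmetic, which is the pattern in answers 1 to 3.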

LOS l, m: Normal Distribution
Standard deviation and mean give us a direct metric of risk, but they won't tell us how the returns are distributed. Look at the distribution of NIFTY 50 returns since 1997: the returns appear roughly normally distributed, or symmetrical.

[Figure: NIFTY 50 Return Distribution. Histogram with returns from -21% to 19% on the x-axis and frequency from 0 to 4000 on the y-axis; annotation: "Mean Area". Source: spreadsheet]
LOS l, m: Non-Symmetrical
The everyday world won't look as perfectly skewed as shown in the CFA Curriculum, but possibly more like the distribution below of a commodities trading hedge fund (CTA).

[Figure: CTA - Hedge Fund. Histogram with returns from -6% to 6% plus a "More" bin on the x-axis and frequency from 0 to 30 on the y-axis. Annotations: "Frequent small losses"; "Large losses: infrequent but more than normal"; "Large gains or outliers, which are again not normal". Source: spreadsheet]

Mean 0.407%
Median 0.140%
Mode -0.160%
LOS l, m: Interpreting Skewness
A perfectly symmetrical distribution has a skewness of 0. However, we can clearly see that such a world probably does not exist: NIFTY 50 and CTA both have skewness. In fact, CTA has positive skewness and NIFTY has negative skewness. Calculating skewness is not part of the curriculum; in the real world, a spreadsheet provides it as a built-in function (for example, SKEW). The signs tell us that there are times when the CTA can give us some extraordinary returns on the positive side, and NIFTY can give us some extraordinary negative returns in a single day.

CTA Nifty
0.174697 -0.1355182

Source: spreadsheet
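For readers who want to see what such a spreadsheet function does, here is a minimal Python sketch of sample skewness using the common bias-adjusted formula n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations. That this is the exact formula behind the figures above is an assumption, and the three small data sets are hypothetical.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness; needs at least 3 observations."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with the n - 1 divisor
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical return series (percent), for illustration only
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_tail = [-1.0, -1.0, -1.0, -1.0, 8.0]   # frequent small losses, one big gain
left_tail = [-x for x in right_tail]         # frequent small gains, one big loss

for name, data in (("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)):
    print(f"{name}: skewness {sample_skewness(data):+.4f}")
```

The symmetric series comes out at zero, the series with one large gain is positively skewed, and its mirror image is negatively skewed by the same amount.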
LOS l, m: Interpreting Kurtosis

No need to worry: you don't have to calculate this or remember the equation. Just know what kurtosis is and its types, which is why this slide does not go further. In our data set:
1. Nifty has a kurtosis of 9: fat-tailed.
2. Hedge Fund has a kurtosis of -2.48: thin-tailed.
Since kurtosis itself cannot be negative, these figures are presumably excess kurtosis (kurtosis minus 3), which is what spreadsheet functions typically report.

A direct question could be: what is the kurtosis of a normal distribution? The answer is 3. Please don't lose this easy one on the exam.
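Although the calculation is not required, a short Python sketch may make the fat-tailed versus thin-tailed distinction concrete. It uses the common bias-adjusted sample excess kurtosis formula; whether the slide's figures were produced with exactly this formula is an assumption, and both data sets are hypothetical.

```python
import math


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis; needs at least 4 observations."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with the n - 1 divisor
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical data, for illustration only
thin_tailed = [0.0, 0.0, 1.0, 1.0]              # no observations far from the mean
fat_tailed = [0.0] * 18 + [-10.0, 10.0]         # mostly quiet, two extreme outcomes

excess_thin = sample_excess_kurtosis(thin_tailed)
excess_fat = sample_excess_kurtosis(fat_tailed)
print(f"thin-tailed: excess {excess_thin:.2f}, kurtosis {excess_thin + 3:.2f}")
print(f"fat-tailed:  excess {excess_fat:.2f}, kurtosis {excess_fat + 3:.2f}")
```

A normal distribution has excess kurtosis of 0 (kurtosis of 3); positive excess kurtosis indicates fat tails and negative excess kurtosis indicates thin tails.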

CFA Curriculum Question
1) The coefficient of variation is closest to:
A 0.02.
B 0.42.
C 2.41.

2) This distribution is best described as:
A negatively skewed.
B having no skewness.
C positively skewed.

3) Compared to the normal distribution, this sample’s distribution is best described as having tails of the distribution with:
A less probability than the normal distribution.
B the same probability as the normal distribution.
C more probability than the normal distribution.

Solution
1. B is correct. The coefficient of variation is the ratio of the standard deviation to the arithmetic average, or $\sqrt{0.001723} / 0.09986 = 0.416$.
2. C is correct. The skewness is positive, so it is right-skewed (positively skewed).
3. C is correct. The excess kurtosis is positive, indicating that the distribution is “fat-tailed”; therefore, there is more probability in the tails of the distribution relative to the normal distribution.
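The arithmetic in solution 1 can be checked quickly in Python. This sketch assumes that 0.001723 is the sample variance and 0.09986 is the arithmetic average; that reading is inferred from the fact that it reproduces 0.416, since the underlying data table is not shown in this section.

```python
import math

variance = 0.001723  # assumed to be the sample variance of returns
mean = 0.09986       # assumed to be the arithmetic average return

# Coefficient of variation = standard deviation / mean
cv = math.sqrt(variance) / mean
print(f"CV = {cv:.3f}")
```

Under the same reading, the two wrong options correspond to variance / mean (about 0.02) and mean / standard deviation (about 2.41).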

LOS n: How Things Move Together - Correlation
Again, the LOS says interpret, not calculate, but calculating correlation in a spreadsheet in the real world is very simple. At the same time, look at the equations; hopefully they are intuitive too.

Covariance
$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
Here we find each value's distance from its mean and multiply it by the corresponding distance in the other data set.

Why do we multiply?
Because multiplication preserves the direction of the relationship: the product is positive when both values lie on the same side of their means and negative when they lie on opposite sides.

Correlation
$r_{XY} = \frac{s_{XY}}{s_X s_Y}$
We then divide the covariance (strength of relationship) by the product of the two assets' standard deviations.

Correlation ranges between -1 and 1:
-1 means the two move in perfectly opposite directions
1 means the two move perfectly together
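The two steps described above (multiply paired deviations from the mean, then scale by the standard deviations) can be sketched in plain Python. The function names and the small price series are hypothetical and chosen only to hit the two extremes of the range.

```python
import math


def sample_covariance(xs, ys):
    """Average product of paired deviations from the means (n - 1 divisor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))  # covariance with itself = variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)


# Hypothetical series, for illustration only
asset_a = [1.0, 2.0, 3.0, 4.0]
asset_b = [2.0, 4.0, 6.0, 8.0]   # moves exactly with asset_a
asset_c = [8.0, 6.0, 4.0, 2.0]   # moves exactly opposite to asset_a

print(f"cov(a, b)  = {sample_covariance(asset_a, asset_b):.4f}")
print(f"corr(a, b) = {correlation(asset_a, asset_b):+.2f}")
print(f"corr(a, c) = {correlation(asset_a, asset_c):+.2f}")
```

Covariance depends on the units of the data, while correlation is unit-free and always falls between -1 and 1.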
LOS n: Use of Correlation

[Figures: Sector Correlation Matrix India; Index Correlation India]

Simply put, imagine you are an investor optimising your portfolio and you are faced with a couple of questions.
1. Your current portfolio has 100% equity assets. In the event of a recession, are you comfortable with the entire portfolio going one way: down?
2. In such cases, would you consider an asset with lower correlation to equity, like gold or real estate?
3. In your portfolio, you realise you have 90% bank stocks. Would you consider adding stocks that have lower correlation with bank stocks or general market conditions, such as pharma?
Hope you get the idea of the use of correlation in investments.

Note that correlation doesn't mean causation, that is, one variable affecting the other. Also be aware of something called spurious correlation:
1. Correlation because of chance.
2. Correlation because of a calculation that mixes each of the two variables with a third one. For example, you are trying to find the correlation between dividends and total assets, but once you divide each by market cap there seems to be a correlation, because of market cap's correlation with each.
3. Correlation that is indirect, where each variable is related to a third one rather than to the other. It might mislead you into concluding there is diversification when in fact there is not.
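The second kind of spurious correlation (dividing two variables by a common third one) is easy to reproduce with simulated data. In the sketch below, all three series are hypothetical random numbers that are independent of each other by construction; the variable names simply echo the dividends, total assets and market cap example.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Three independent, simulated series (hypothetical units)
dividends = [random.uniform(1, 2) for _ in range(n)]
total_assets = [random.uniform(1, 2) for _ in range(n)]
market_cap = [random.uniform(0.5, 5) for _ in range(n)]

# Raw series: no real relationship, so the correlation is close to zero
raw_corr = correlation(dividends, total_assets)

# Dividing both by the same third variable manufactures a strong correlation
dividends_per_cap = [d / m for d, m in zip(dividends, market_cap)]
assets_per_cap = [a / m for a, m in zip(total_assets, market_cap)]
ratio_corr = correlation(dividends_per_cap, assets_per_cap)

print(f"raw correlation:   {raw_corr:+.3f}")
print(f"ratio correlation: {ratio_corr:+.3f}")
```

The ratios appear strongly related only because both share the same denominator, not because dividends and total assets move together.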
Summary
• Data can be defined as a collection of numbers, characters, words, and text—as well as images, audio, and
video—in a raw or organized format to represent facts or information.
• From a statistical perspective, data can be classified as numerical data and categorical data. Numerical data
(also called quantitative data) are values that represent measured or counted quantities as a number.
Categorical data (also called qualitative data) are values that describe a quality or characteristic of a group of
observations and usually take only a limited number of values that are mutually exclusive.
• Numerical data can be further split into two types: continuous data and discrete data. Continuous data can
be measured and can take on any numerical value in a specified range of values. Discrete data are numerical
values that result from a counting process and therefore are limited to a finite number of values.
• Categorical data can be further classified into two types: nominal data and ordinal data. Nominal data are
categorical values that are not amenable to being organized in a logical order, while ordinal data are
categorical values that can be logically ordered or ranked.

• Based on how they are collected, data can be categorized into three types: cross-sectional, time series, and
panel. Time-series data are a sequence of observations for a single observational unit on a specific variable
collected over time and at discrete and typically equally spaced intervals of time. Cross-sectional data are a
list of the observations of a specific variable from multiple observational units at a given point in time. Panel
data are a mix of time-series and cross-sectional data that consists of observations through time on one or
more variables for multiple observational units.
• Based on whether or not data are in a highly organized form, they can be classified into structured and
unstructured types. Structured data are highly organized in a pre-defined manner, usually with repeating
patterns. Unstructured data do not follow any conventionally organized forms; they are typically alternative
data as they are usually collected from unconventional sources.
• Raw data are typically organized into either a one-dimensional array or a two-dimensional rectangular array
(also called a data table) for quantitative analysis.
• A frequency distribution is a tabular display of data constructed either by counting the observations of a
variable by distinct values or groups or by tallying the values of a numerical variable into a set of numerically
ordered bins. Frequency distributions permit us to evaluate how data are distributed.


• The relative frequency of observations in a bin (interval or bucket) is the number of observations in the bin
divided by the total number of observations. The cumulative relative frequency cumulates (adds up) the
relative frequencies as we move from the first bin to the last, thus giving the fraction of the observations
that are less than the upper limit of each bin.
• A contingency table is a tabular format that displays the frequency distributions of two or more categorical
variables simultaneously. One application of contingency tables is for evaluating the performance of a
classification model (using a confusion matrix). Another application of contingency tables is to investigate a
potential association between two categorical variables by performing a chi-square test of independence.
• Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing
understanding and for gaining insights into the data.
• A histogram is a bar chart of data that have been grouped into a frequency distribution. A frequency polygon
is a graph of frequency distributions obtained by drawing straight lines joining successive midpoints of bars
representing the class frequencies.
• A bar chart is used to plot the frequency distribution of categorical data, with each bar representing a
distinct category and the bar’s height (or length) proportional to the frequency of the corresponding
category. Grouped bar charts or stacked bar charts can present the frequency distribution of multiple
categorical variables simultaneously.
• A tree-map is a graphical tool to display categorical data. It consists of a set of colored rectangles to represent distinct groups, and the area of each rectangle is proportional to the value of the corresponding group.
Additional dimensions of categorical data can be displayed by nested rectangles.
• A word cloud is a visual device for representing textual data, with the size of each distinct word being proportional to the frequency with which it appears in the given text.
• A line chart is a type of graph used to visualize ordered observations and often to display the change of data series over time. A bubble line chart is a special type of line chart that uses varying-sized bubbles as data
points to represent an additional dimension of data.
• A scatter plot is a type of graph for visualizing the joint variation in two numerical variables. It is constructed by drawing dots to indicate the values of the two variables plotted against the corresponding axes. A scatter
plot matrix organizes scatter plots between pairs of variables into a matrix format to inspect all pairwise relationships between more than two variables in one combined visual.
• A heat map is a type of graphic that organizes and summarizes data in a tabular format and represents it using a color spectrum. It is often used in displaying frequency distributions or visualizing the degree of
correlation among different variables.
• The key consideration when selecting among chart types is the intended purpose of visualizing data (i.e., whether it is for exploring/presenting distributions or relationships or for making comparisons).
• A population is defined as all members of a specified group. A sample is a subset of a population.
• A parameter is any descriptive measure of a population. A sample statistic (statistic, for short) is a quantity computed from or used to describe a sample.
• Sample statistics—such as measures of central tendency, measures of dispersion, skewness, and kurtosis—help with investment analysis, particularly in making probabilistic statements about returns.
• Measures of central tendency specify where data are centered and include the mean, median, and mode (i.e., the most frequently occurring value).
• The arithmetic mean is the sum of the observations divided by the number of observations. It is the most frequently used measure of central tendency.
• The median is the value of the middle item (or the mean of the values of the two middle items) when the items in a set are sorted into ascending or descending order. The median is not influenced by extreme values
and is most useful in the case of skewed distributions.
• The mode is the most frequently observed value and is the only measure of central tendency that can be used with nominal data. A distribution may be unimodal (one mode), bimodal (two modes), trimodal (three
modes), or have even more modes.
• A portfolio’s return is a weighted mean return computed from the returns on the individual assets, where the weight applied to each asset’s return is the fraction of the portfolio invested in that asset.
• The geometric mean, $X_G$, of a set of observations $X_1, X_2, \ldots, X_n$, is $X_G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n}$, with $X_i \ge 0$ for $i = 1, 2, \ldots, n$. The geometric mean is especially important in reporting compound growth rates for time-series data. The geometric mean will always be less than the arithmetic mean whenever there is variance in the observations.

• The harmonic mean, XH , is a type of weighted mean in which an observation’s weight is inversely proportional to its magnitude.
• Quantiles—such as the median, quartiles, quintiles, deciles, and percentiles— are location parameters that divide a distribution into halves, quarters, fifths, tenths, and hundredths, respectively.
• A box and whiskers plot illustrates the interquartile range (the “box”) as well as a range outside of the box that is based on the interquartile range, indicated by the “whiskers.”
• Dispersion measures—such as the range, mean absolute deviation (MAD), variance, standard deviation, target downside deviation, and coefficient of variation— describe the variability of
outcomes around the arithmetic mean.
• The range is the difference between the maximum value and the minimum value of the dataset. The range has only a limited usefulness because it uses information from only two observations.
• The MAD for a sample is the average of the absolute deviations of observations from the mean, $\text{MAD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$, where $\bar{X}$ is the sample mean and n is the number of observations in the sample.
• The variance is the average of the squared deviations around the mean, and the standard deviation is the positive square root of variance. In computing sample variance ($s^2$) and sample standard
deviation (s), the average squared deviation is computed using a divisor equal to the sample size minus 1.
• The target downside deviation, or target semideviation, is a measure of the risk of being below a given target. It is calculated as the square root of the average squared deviations from the target,
but it includes only those observations below the target (B), or $s_{\text{Target}} = \sqrt{\sum_{\text{for all } X_i \le B} \frac{(X_i - B)^2}{n - 1}}$.
• The coefficient of variation, CV, is the ratio of the standard deviation of a set of observations to their mean value. By expressing the magnitude of variation among observations relative to their
average size, the CV permits direct comparisons of dispersion across different datasets. Reflecting the correction for scale, the CV is a scale-free measure (i.e., it has no units of measurement).
• Skew or skewness describes the degree to which a distribution is asymmetric about its mean. A return distribution with positive skewness has frequent small losses and a few extreme gains
compared to a normal distribution. A return distribution with negative skewness has frequent small gains and a few extreme losses compared to a normal distribution. Zero skewness indicates a
symmetric distribution of returns.
• Kurtosis measures the combined weight of the tails of a distribution relative to the rest of the distribution. A distribution with fatter tails than the normal distribution is referred to as fat-tailed
(leptokurtic); a distribution with thinner tails than the normal distribution is referred to as thin-tailed (platykurtic). Excess kurtosis is kurtosis minus 3, since 3 is the value of kurtosis for all normal
distributions.
• The correlation coefficient is a statistic that measures the association between two variables. It is the ratio of covariance to the product of the two variables’ standard deviations. A positive
correlation coefficient indicates that the two variables tend to move together, whereas a negative coefficient indicates that the two variables tend to move in opposite directions. Correlation does
not imply causation, simply association. Issues that arise in evaluating correlation include the presence of outliers and spurious correlation.
