Empirical Data Analysis in Accounting and Finance
Pattern recognition
Financial classification models
Financial distress prediction
Causality models
Association between accounting data and financial market reactions
Causality patterns on international financial markets
Time series modelling and prediction
Optimization models, e.g.
Portfolio optimization
Product mix optimization
A typical process for empirical data analysis
Define the test problem
Collect data
Databases for financial data, e.g. market data, financial statements, interest rates, exchange rates
Surveys for opinion data
To access data from the Voitto database, map the network drive z:\\cdrom\voitto with your user name: AKADEMI\user_name
Select the analysis method
Check that the data are suitable for the selected method
Different methods make different assumptions about the properties of the data, e.g. approximate normality
A typical process for empirical data analysis ...
Normal distributions with different mean values (m ∈ {0;1;3}, s = 1)
[Figure: density curves over [-3, 3]]
Normal distributions with different mean and standard deviation
[Figure: densities of N(0, 1) and N(2, 3)]
[Figure: histograms of observed vs. expected frequencies]
Percentage of observations within {1;2;3} standard deviations from the mean
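The rule of thumb behind this slide (roughly 68%, 95%, and 99.7% of a normal sample within 1, 2, and 3 standard deviations) can be checked numerically. A minimal sketch using simulated normal data; the sample size and seed are illustrative choices, not from the slides:

```python
# Empirical check of the 68-95-99.7 rule: the share of observations within
# k standard deviations of the mean, for a simulated normal sample.
import numpy as np

rng = np.random.default_rng(42)          # seed chosen for reproducibility
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

m, s = x.mean(), x.std()
shares = {k: float(np.mean(np.abs(x - m) <= k * s)) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print(f"within {k} sd: {shares[k]:.3f}")
```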
The multivariate normal distribution
[Figure: contour plot of a bivariate normal density, x and y in [-4, 4]]
The multivariate Gaussian mixture density with correlated processes
[Figure: contour plot of the mixture density, x and y in [-4, 4]]
Problems with empirical data
Most empirical data sets fail to satisfy the basic assumption of normality
Typical problems are
Skewness of the data
Leptokurtosis (thick tails)
Outliers
The quality of the data may be improved, for example, by
Different transformations
Taking logarithms of the data
Taking square roots of the data
Differencing (for time series data)
Removing the outliers
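The transformations listed above can be sketched as follows; the lognormal sample and the 3-standard-deviation outlier rule are illustrative assumptions, not from the slides:

```python
# Illustrative versions of the transformations listed above, applied to a
# right-skewed (lognormal) sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)   # positive, right-skewed

log_x = np.log(x)        # log transform: here it yields exactly normal data
sqrt_x = np.sqrt(x)      # milder transform for positive data
diff_x = np.diff(x)      # differencing, for time-series data

# Outlier removal: drop observations more than 3 standard deviations
# from the mean
m, s = x.mean(), x.std()
trimmed = x[np.abs(x - m) <= 3 * s]
print(f"removed {len(x) - len(trimmed)} outliers out of {len(x)}")
```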
Skewness
A measure of the asymmetry of the probability distribution
Skewness = 0 for a symmetric distribution
Negative skewness (left-skewed pdf): the left tail is longer; the mass of the distribution is concentrated on the right of the figure
Positive skewness (right-skewed pdf): the right tail is longer; the mass of the distribution is concentrated on the left of the figure
Skewness…
Skewness: $\gamma_1 = \dfrac{\mu_3}{\sigma^3} = \dfrac{E[(X-\mu)^3]}{\left(E[(X-\mu)^2]\right)^{3/2}}$
Kurtosis: $\beta_2 = \dfrac{\mu_4}{\sigma^4} = \dfrac{E[(X-\mu)^4]}{\left(E[(X-\mu)^2]\right)^{2}}$
Excess kurtosis
Kurtosis for a normal distribution equals 3
So that the normal distribution has kurtosis zero, kurtosis is usually reported as excess kurtosis (as in statistical packages such as SPSS and Excel):
$\gamma_2 = \dfrac{\mu_4}{\sigma^4} - 3$
Distributions with zero excess kurtosis, e.g. normal distributions, are called mesokurtic
Leptokurtosis
A distribution with positive excess kurtosis is called
leptokurtic
A leptokurtic distribution has a more acute peak
around the mean and fatter tails
Many financial data series (for example, stock returns)
have leptokurtic distributions
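These shape measures are easy to compute with scipy; the simulated t(10) series below is an illustrative stand-in for a leptokurtic stock-return series (sample sizes and seed are assumptions):

```python
# Sample skewness and excess kurtosis; scipy.stats.kurtosis returns excess
# kurtosis by default (fisher=True), matching the convention above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal = rng.normal(size=50_000)
returns = rng.standard_t(df=10, size=50_000)   # leptokurtic stand-in for
                                               # a stock-return series
skew_n, kurt_n = stats.skew(normal), stats.kurtosis(normal)
skew_r, kurt_r = stats.skew(returns), stats.kurtosis(returns)
print(f"normal: skew = {skew_n:.3f}, excess kurtosis = {kurt_n:.3f}")
print(f"t(10):  skew = {skew_r:.3f}, excess kurtosis = {kurt_r:.3f}")
# t(10) has true excess kurtosis 6 / (10 - 4) = 1, so kurt_r should be
# clearly positive while kurt_n stays near zero
```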
Platykurtosis
A distribution with negative excess kurtosis is called platykurtic
A platykurtic distribution has a lower, wider
peak and thinner tails
Testing for the normality of a data set
[Figure: histogram of a leptokurtic, left-skewed sample over [-3, 3]]
[Figure: observed cdf vs. the N(0,1) cdf]
The observed cdf of the Canadian data plotted against the normal cdf with the observed m = 0.034 and s = 0.573 (tested with the SPSS K-S test)
[Figure: observed cdf vs. the N(0.034, 0.57) cdf]
The Kolmogorov-Smirnov statistic
The empirical distribution function for n observations is
$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{X_i \le x}$
where $I_{X_i \le x}$ is the indicator function, equal to 1 if $X_i \le x$ and 0 otherwise.
The Kolmogorov–Smirnov statistic for a given cumulative distribution function F(x) is
$D_n = \sup_x |F_n(x) - F(x)|$
where $\sup_x$ is the supremum over x.
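These definitions can be computed directly and cross-checked against scipy; the sample, its size, and the hypothesized N(0,1) are illustrative assumptions:

```python
# Direct computation of F_n and D_n against a hypothesized N(0,1), with a
# cross-check against scipy's implementation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.sort(rng.normal(size=500))
n = len(x)

# F_n jumps by 1/n at each order statistic, so the supremum is attained at
# a sample point: compare F(x_(i)) with both i/n and (i-1)/n.
F = stats.norm.cdf(x)
d_plus = np.max(np.arange(1, n + 1) / n - F)
d_minus = np.max(F - np.arange(0, n) / n)
D_n = max(d_plus, d_minus)

D_scipy = stats.kstest(x, "norm").statistic
print(f"D_n = {D_n:.5f}, scipy = {D_scipy:.5f}")
```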
Kolmogorov-Smirnov test
The goodness-of-fit test or the Kolmogorov–Smirnov
test is constructed by using the critical values of the
Kolmogorov distribution
The null hypothesis is rejected at level α if
$\sqrt{n}\,D_n > K_\alpha$
where $K_\alpha$ is found from
$\Pr(K \le K_\alpha) = 1 - \alpha$
If F is continuous, then under the null hypothesis $\sqrt{n}\,D_n$ converges to the Kolmogorov distribution, which does not depend on F
Kolmogorov distribution
The Kolmogorov distribution is the distribution of the random variable
$K = \sup_{t\in[0,1]} |B(t)|$
where $B(t)$ is the Brownian bridge. Its cdf is
$\Pr(K \le z) = 1 - 2\sum_{k=1}^{\infty}(-1)^{k-1} e^{-2k^2z^2} = \frac{\sqrt{2\pi}}{z}\sum_{k=1}^{\infty} e^{-(2k-1)^2\pi^2/(8z^2)}$
One-sample Kolmogorov-Smirnov test (SPSS output) for the Canadian data (Can):
N                                    752
Normal parameters(a,b)   Mean        0.0340
                         Std. dev.   0.57275
Most extreme differences Absolute    0.077
                         Positive    0.047
                         Negative    -0.077
Kolmogorov-Smirnov Z                 2.103
Asymp. sig. (2-tailed)               0.000
a. Test distribution is Normal.
b. Calculated from data.
The data are not normal (sig. < 0.05)
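An analogous test can be run in Python; the shifted-lognormal sample below is only a stand-in for the Canadian series (just n = 752 comes from the slide). Note that estimating m and s from the data makes the reported p-value approximate (strictly, the Lilliefors correction would apply):

```python
# One-sample K-S test against a normal with parameters estimated from the
# data, mimicking the SPSS output above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.lognormal(mean=0.0, sigma=0.5, size=752) - 1.0   # skewed sample

m, s = data.mean(), data.std(ddof=1)
D, p = stats.kstest(data, "norm", args=(m, s))
Z = np.sqrt(len(data)) * D       # the "Kolmogorov-Smirnov Z" SPSS reports

print(f"D = {D:.3f}, Z = {Z:.3f}, asymp. sig. = {p:.4f}")
print("data not normal" if p < 0.05 else "cannot reject normality")
```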
Shapiro-Wilk test
To work reliably, the Kolmogorov-Smirnov test requires a relatively large sample (> 2 000 observations)
For sample sizes below 2 000, the Shapiro-Wilk test should be applied instead
The Shapiro-Wilk test statistic is
$W = \dfrac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
where
$x_{(i)}$ is the ith order statistic, i.e., the ith-smallest number in the sample
$\bar{x} = (x_1 + \ldots + x_n)/n$ is the sample mean
Shapiro-Wilk test
The constants $a_i$ are given by
$(a_1, \ldots, a_n) = \dfrac{m^T V^{-1}}{(m^T V^{-1} V^{-1} m)^{1/2}}$
where $m = (m_1, \ldots, m_n)^T$ is the vector of expected values of the order statistics of an i.i.d. standard normal sample and $V$ is the covariance matrix of those order statistics
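In practice the constants are handled by the statistical package; a small-sample sketch with scipy.stats.shapiro, where both samples and their size (n = 50, well below 2 000) are illustrative:

```python
# Shapiro-Wilk test on two small samples: one normal, one clearly skewed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
normal_sample = rng.normal(size=50)
skewed_sample = rng.exponential(size=50)

W1, p1 = stats.shapiro(normal_sample)
W2, p2 = stats.shapiro(skewed_sample)
print(f"normal sample:      W = {W1:.4f}, p = {p1:.4f}")
print(f"exponential sample: W = {W2:.4f}, p = {p2:.4f}")
```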
χ²-distributions with {5;10;20} degrees of freedom
[Figure: density curves over [0, 45]]
F-distribution
A ratio of two χ²-distributed variables, each divided by its degrees of freedom
Characterized by two degrees-of-freedom parameters: F(d1 > 0, d2 > 0)
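The ratio construction can be verified by simulation; the degrees of freedom and the sample size below are illustrative choices:

```python
# Simulation check that (U/d1)/(V/d2), with independent U ~ chi2(d1) and
# V ~ chi2(d2), follows an F(d1, d2) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
d1, d2 = 5, 10
u = rng.chisquare(d1, size=200_000)
v = rng.chisquare(d2, size=200_000)
ratio = (u / d1) / (v / d2)

# K-S distance between the simulated ratios and scipy's F(d1, d2)
D, p = stats.kstest(ratio, "f", args=(d1, d2))
print(f"K-S distance to F({d1},{d2}): D = {D:.4f}")
```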
F-distributions with different degrees of freedom
[Figure: densities of F(3,150), F(10,20), and F(50,10)]
Student’s t-distribution
A probability distribution that arises in the problem of estimating the mean of a normally distributed population when the sample size is small
The basis of the popular Student's t-tests for
The statistical significance of the difference
between two sample means
Confidence intervals for the difference between two
population means
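Both uses can be sketched with scipy; the two groups, their sizes, their means, and the 95% level are illustrative assumptions:

```python
# Two-sample t-test for the difference of means, plus a pooled-variance
# 95% confidence interval for that difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(loc=0.0, scale=1.0, size=20)   # small samples: the regime
b = rng.normal(loc=1.5, scale=1.0, size=20)   # where the t-distribution
                                              # matters
t_stat, p = stats.ttest_ind(a, b)             # equal-variance t-test
print(f"t = {t_stat:.3f}, p = {p:.4f}")

# 95% CI for mean(a) - mean(b) with the pooled variance estimate
n1, n2 = len(a), len(b)
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
tcrit = stats.t.ppf(0.975, df=n1 + n2 - 2)
diff = a.mean() - b.mean()
print(f"95% CI: [{diff - tcrit * se:.3f}, {diff + tcrit * se:.3f}]")
```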
t-distribution with 3 degrees of freedom
[Figure: density over [-3, 3]]
t-distribution with 3 degrees of freedom compared to a standard normal distribution
[Figure: t(3) and N(0,1) densities over [-3, 3]]
t-distributions with {3;10;100} degrees of freedom
[Figure: densities over [-3, 3]]