ViscoVery SOM (Clustering)
ViscoVery SOM (Clustering)
ViscoVery SOM (Clustering)
contents
contents
preface
1
introduction
1.1
1.2
contents
iv
1.2.1
1.2.2
1.2.3
1.3
2
2.1
viii
1
Overview
Research domain
Bond ratings
Financial data and ratings
Self-Organizing Maps
4
4
4
5
Research topics
credit ratings
2.1.1
2.1.2
2.1.3
2.1.4
iv
9
10
10
11
12
13
2.2
The S & P credit rating process
2.2.1
Process steps
15
15
2.3
18
18
18
19
20
Summary
23
2.3.1
2.3.2
2.3.3
2.3.4
2.4
3
3.1
3.2
3.3
3.4
3.5
self-organizing maps
25
3.1.1
3.1.2
3.1.3
Knowledge discovery
Introduction
Knowledge discovery process
Description and prediction
26
26
26
27
3.2.1
3.2.2
3.2.3
3.2.4
28
28
29
30
30
3.3.1
3.3.2
3.3.3
Classification techniques
Linear regression
Ordered logit
Artificial neural networks
31
31
31
32
3.4.1
3.4.2
33
33
34
3.5.1
3.5.2
3.5.3
3.5.4
SOM projection
The self-organization process
A two dimensional example
Mathematical description
A three dimensional example
36
36
37
39
40
3.6
SOM visualization and clustering
3.6.1
Maps
3.6.2
Map quality
3.6.3
Clusters
3.6.4
Cluster quality
3.6.5
Map settings
41
41
43
45
47
47
3.7
50
50
52
3.8
55
3.9
Summary
57
3.7.1
3.7.2
4
4.1
contents
descriptive analysis
4.1.1
4.1.2
59
60
60
64
4.2
Clustering companies
4.2.1
Creating suitable maps
4.2.2
Intermediate results
4.2.3
Results
67
67
68
70
4.3
72
72
73
75
4.3.1
4.3.2
4.3.3
4.4
Sensitivity analysis
4.4.1
Cluster coincidence plots
4.4.2
Results
78
78
79
4.5
Benchmark
Principal Component Analysis
Results
Comparison with SOM
82
82
82
84
Summary
85
vi
4.5.1
4.5.2
4.5.3
4.6
5
5.1
5.2
classification model
87
5.1.1
5.1.2
5.1.3
5.1.4
5.1.5
Model set-up
Training and prediction
Data
The prediction process
Ratings distribution
Measuring performance
88
88
88
89
89
91
5.2.1
5.2.2
5.2.3
5.2.4
Model construction
Initial model
Variable reduction
Sensitivity analysis
Results
95
95
96
99
102
5.3
5.4
5.5
5.3.1
5.3.2
5.3.3
5.3.4
Model validation
Comparison with constant prediction
Comparison with random prediction
Classifications per rating class
Equalized ratings distribution
104
104
104
106
106
5.4.1
5.4.2
5.4.3
Benchmark
Linear regression
Ordered logit
Results & comparison with SOM
108
108
108
109
5.5.1
5.5.2
5.5.3
Out-of-sample test
Results for test set
Results for older historical periods
Linking spreads
112
112
113
114
Summary
117
5.6
conclusions
6.1
Conclusions
120
6.2
Further research
122
bibliography
appendix
119
123
125
126
II
128
III
131
IV
134
138
VI
Descriptive analysis
140
VII
Classification model
163
vii
preface
This masters thesis forms the conclusion to my study of Econometrics, with specialization Business Oriented
Computer Science, at the Erasmus University of Rotterdam.
Quantitative Research (QR) department of the Rotterdam based asset-manager Robeco Group. My time at the
Robeco Group has been very enjoyable, and the combination of practical research and writing at the same time
has proven to be a very relaxed and sure way of writing a thesis. I can recommend this to everyone in the final
stage of his or her study.
This thesis is targeted at readers from two different scientific areas (computer science and financial
econometrics), so some concepts are treated more extensively than first may seem necessary. Considerable
time was also spent making this thesis into an attractive package, but at all times have I striven to keep looks
preface
viii
1 introduction
In chapter 1 we introduce the main problem and the research topics for this thesis. Paragraph 1 gives a brief
overview of the problem setting and paragraph 2 describes the domain of research. Paragraph 3 reviews the
central question and several sub-questions to be answered in the remainder of this thesis.
1.1 Overview
When you lend a sum of money to someone you will most likely first estimate the probability of not being paid
back. A correct assessment of this probability is based on the (observed) trustworthiness of the person in
question and on your knowledge of her or his financial situation.
When investors lend money to companies or governments it is often in the form of a bond; a freely tradable loan
issued by the borrowing company.
creditworthiness of the issuing company, based on its financial statement (balance sheet and income account)
and on expectations of future economic development.
Most buyers of bonds do not have the resources to perform this type of difficult and time-consuming research.
Fortunately so-called rating agencies exist who specialize in assessing the creditworthiness of a company. The
resulting credit or bond rating is a measure for the risk of the company not being able to pay an interest
payment or redemption of its issued bond. Furthermore, the amount of interest paid on the bond is dependent
on this expected chance of default1. A higher rating implies a less riskful environment to invest your money and
1 introduction
less interest is received on the bond. A lower rating implies a more riskful company and the company has to pay
Unfortunately not all companies have been rated yet. Rating agencies are also often slow to adjust their ratings
to new important information on a company or its environment. And sometimes different rating agencies assign
different ratings to the same company. In all of these situations we would like to be able to make our own
assessment of creditworthiness of the company, using the same criteria as the rating agencies. The resulting
measure of creditworthiness should be comparable to the rating issued by the rating agencies.
This is more difficult than it first may seem, because the rating process is somewhat of a black box. Rating
agencies closely guard their rating process; they merely state that financial and qualitative factors are taken
into account when assigning ratings to companies. We would like to try to open this black box by describing the
relationship between the financial statement of a company and its assigned credit rating. This might enable us
to say how much of a companys rating is affected by the qualitative analysis performed by the rating agency.
And the found knowledge could of course be used to classify companies that have not yet been rated or recently
have substantially changed.
Several techniques have been developed for these kind of analyses. We will focus on a less common technique
called Self-Organizing Maps, which is a combination of a projection and a clustering algorithm. Its main
advantages are the insightful visualizations of large datasets and its flexibility.
Ov e rvi e w
3
1 introduction
4
these companies. Therefore, our research will be conducted using only U.S. based
companies.
lawsuits), and what is the economic outlook for the sector as a whole? We will treat
the credit rating process extensively in chapter 2, but suffice it to say that the contribution of qualitative factors
to the rating is unclear. We can clarify the relationship between financial data and credit ratings using
quantitative techniques like the Self-Organizing Map and indirectly give an assessment of the contribution of
qualitative factors.
Financial statement data on most US companies is available in huge databases from datasources like Compustat
and WorldScope. The information in these databases could help us gain a better understanding of the
relationship between financial information and bond ratings. It might even provide us with a means to correctly
predict bond ratings, based on the stored financial data alone. However, transforming the stored data into
knowledge is no trivial task.
R es e a rc h d o m a i n
5
What are credit ratings and how is the credit rating process structured?
An analysis of the Standard & Poor credit rating process gives us a better understanding of the relation between
credit ratings and financial statement data.
2.
What are Self-Organizing Maps and how can they aid in exploring relationships in large data sets?
Before we can trust the results inferred from the SOM maps we first have to understand how the SOM gives a
1 introduction
6
view on the underlying data. We provide an in-depth review of the algorithm itself and a guide on how to
interpret the generated results.
3.
Is it possible to find a logical clustering of companies, based on the financial statements of these
companies?
First we would like to know if companies are discernible based on financial statement data alone.
4.
If such a clustering is found, does this clustering coincide with levels of creditworthiness of the companies
in a cluster?
We then compare the found clustering with the distribution of the ratings over the companies to determine to
what extent they coincide.
5.
Is it possible to classify companies in rating classes using only financial statement data?
Using previously found knowledge we set up a model specifically suited to the task of classifying new
companies using financial statement data.
This thesis is divided into several chapters. Chapter 1 contains the introduction, a description of the research
domain and this overview of the research topics. In chapter 2 we give a theoretical treatment of the credit rating
process and in chapter 3 we provide an in-depth review of Self-Organizing Maps. Chapter 4 discusses the
descriptive analysis after which chapter 5 focuses on the classification model. In chapter 6 we draw our
conclusions and present some suggestions for further research.
R es e arc h t opic s
7
2 credit ratings
This chapter provides a background on credits and credit ratings. Question 1 from the introduction is answered:
1.
What are credit ratings and how is the credit rating process structured?
Paragraph 1 addresses the theoretical foundations of credits and credit ratings. Paragraph 2 reviews the rating
process of Standard & Poors, a well-known rating agency. Paragraph 3 evaluates the key financial ratios
applicable to the economic sector under scrutiny in this thesis, Consumer Cyclicals.
Characteristics
2 credit ratings
10
Each bond has certain characteristics, which fully describe the bond. The bond has to be redeemed on a fixed
date, called the maturity date. Bonds with original maturities longer than a year are considered long-term, all
bonds with maturities up to one year are considered short-term. Each period a certain interest percentage has
to be paid in the form of the coupon. Often this percentage is fixed, but sometimes this percentage is
dependent on the market interest rate (the coupon is floating). Other variations on the standard bond include
sinking redemptions (periodically a part of the bond is redeemed), callable bonds (at certain dates the issuer
has the right to prematurely redeem the bond), and of course special combinations leading to more exotic
variants.
Value
The value of a bond depends largely on the coupon percentage and the current market interest rate. If the
market interest rate rises, then the value of the bond lowers. The coupon percentage is fixed, and investors
would rather buy a new bond with a coupon that is more in-line with the current market interest rate.
If the
market interest rate declines, then the value of the bond rises. Investors would rather buy our bond than new
bonds with lower interest rates.
The value of the bond is determined in the market, by the forces of supply and demand. Using the market price
the current yield of the bond can be calculated. This is the internal discount factor needed when discounting all
future cash flows of the bond (coupon payments and redemption payment) to represent the current price. This
yield is often used when comparing bonds, as it is based on the current price of the bond, thus taking the
coupon, the market interest rate and other factors into account. The difference between two yields is referred to
as a spread. A whole range of appropriate government bonds (combining to a government curve3) is most often
used as the benchmark, so the spread of the bond then means the difference between the yield of the bond and
the yield of a comparable government bond on the government curve.
Default
Another factor influencing the value of the bond is the default risk associated with the bond. When an issuer is
unable to meet one of the payments with respect to a specific bond, we say that the issuer has defaulted on
the bond. This does not necessarily mean that the issuer has gone bankrupt, a missed or delayed interest
payment also counts as a default. If the issuer settles the payment a few days later the issuer has recovered.
How the spread is influenced by the default risk is explained in the next paragraph.
2.1.2 Credits
We use the term credits for all bonds not issued by central governments in their own currency. All bonds
issued by companies (also known as corporate bonds) are good examples of credits. Governments in emerging
markets often issue their bonds denominated in U$, so these are credits too. All credits are inherently riskier
than issues by stable governments in developed markets like the United States or the Netherlands: Because the
company or the unstable government is prone to financial problems we cannot say with 100% certainty that all
payments on the bond will be fulfilled.
Credit spread
When investors buy credits, they want something in return for the extra risk involved; this is known as the credit
spread. Some extra yield is received to compensate for the default risk. Naturally, this credit spread is larger
when the risk of default is larger. Conversely, the credit spread is smaller when the perceived risk is lower.
Some credits are more eligible for repayment than others (they are senior to other issues from the same
issuer). Also sometimes issues are secured by e.g. a parent company. As this all reduces the risk involved with
the credit, the accompanying credit spread also narrows.
The credit spread is only part of the difference between the yield of a bond and a comparable government bond.
Other factors are the liquidity of the bond (large issues are more easily traded than smaller issues) and the
inclusion of the bond in a bond index (bonds included in indices composed by e.g. J.P. Morgan are more indemand by investors and thus more valuable).
11
Rating agencies
A rating agency, of which S&P is one of the best known examples, assesses the relevant factors relating to the
creditworthiness of the issuer. These include the quantitative factors like the profitability of the company and
the amount of outstanding debt, but also the qualitative factors like skill of management and economic
expectations for the company. The whole analysis is then condensed into a letter rating5. Standard & Poors
and Moodys both have been rating bonds for almost a century and are the leading rating agencies right now.
Other reputable rating institutions are Fitch and Duff & Phelps.
Ratings interpretation
The types of assigned ratings are comparable for most agencies, and for S&P and Moodys there is a direct
Table 2-1 Credit ratings and interpretation
2 credit ratings
S&P
AAA
AA+
AA
AAA+
A
ABBB+
BBB
BBBBB+
BB
BBB+
B
BCCC+
CCC
CCCCC
C
D
12
Moodys
Aaa
Aa1
Aa2
Aa3
A1
A2
A3
Baa1
Baa2
Baa3
Ba1
Ba2
Ba3
B1
B2
B3
Caa
Interpretation
Highest quality
High quality
Ca
D
Bankruptcy filed
Defaulted
relation6. The ratings for S&P and for Moodys and the accompanying interpretation is shown in table 2-1.
The letter rating is sometimes augmented by a + or - (for S&P) or a 1, 2, or 3 (for Moodys). These indicate
sub-levels of creditworthiness within a specific rating class. The difference between the sub-levels is called a
notch, so an A+ and an A- rating differ two notches. In practice, this is also used over the rating classes, so a
B+ and an A rating are also said to differ two notches.
Comparing ratings
The differences between regions, countries or even economic sectors can be so large that it is difficult to arrive
at a certain rating when using the same criteria. To make comparisons possible, the rating agencies use
different criteria and special risk characteristics for companies in different sectors and different countries or
regions.
A good example is the qualitative assessment of an industrial company; business fundamentals then include
technological change, labour unrest or regulatory actions. For a financial institution we would be looking at the
reputation of the institution and the quality of the outstanding debt.
Please refer to paragraph 2.2 for a comprehensive review of the credit rating process of Standard & Poors.
Cantor, R. and Packer, F., 1994, page 6.
Figure 2-1 shows the default rates corresponding to Moodys rating classes for 19997. As is to be expected, the
lower rating classes have corresponding higher default rates.
Default rates for 1999
12
10
8
6
4
2
B3
B2
B1
Ba
Ba
a3
a2
a1
Ba
Ba
Ba
A3
Ba
A2
A1
Aa
Aa
Aa
Aa
2 credit ratings
Credits with an assigned rating from AAA to BBB- are known as investment grade credits. Lower rated issues are
14
relatively wide, thus providing an interesting investment opportunity. This is even more so after finding an
known as speculative grade credits, high yield issues or junk bonds. The spreads on these high yield issues are
average recovery rate of 42%8 (for every U$ 100 worth of defaults on average U$ 42 recovers).
Sometimes fundmanagers are restricted to purchasing investment grade issues, to avoid speculative
investments. However, the absolute default rates do not remain stable over the years. For example, restricting
the fund managers to purchase at least BBB- grade issues does not guarantee lower than 1% default rates.
7
8
The
qualitative analysis is most extensively treated and thus most emphasized. The descriptive analysis in chapter
4 will try to uncover whether this depiction reflects the actual rating practice of S&P.
meet
rating
issuer
rating
committee
meeting
issue(r)
surveil-
rating
lance
appeals
process
Figure 2-2 The Standard & Poor's credit rating process
Request rating
Companies themselves often approach Standard & Poors to request a rating. In addition to this, it is S&Ps
policy to rate any public corporate debt issue larger than U$ 50 million, with or without request from the issuer.
Basic research
When the rating is requested a team of analysts is gathered. The analysts working at S&P each have their own
sector specialty, covering all risk categories in the sector.
The appropriate analysts are chosen and a lead analyst is assigned, who is responsible for the conduct of the
rating process.
Some basic research is conducted, based on publicly available information and based on information received
from the company prior to the meeting with the management10. The information requested prior to the meeting
should contain:
-
five years of audited annual financial statements (balance sheet and profits and losses account),
the last several interim financial statements (this is mostly applicable to US companies, as they are
required by law to provide quarterly financial statements),
As some of this may be sensitive information, S&P has a strict policy of confidentiality on all the information
obtained in a non-public fashion. Any published rationale on the realization of the assigned rating only contains
publicly available information.
2 credit ratings
16
This meeting covers the operating and financial plans of the company and the
management policies but it is also a qualitative assessment of management itself. The meeting is scheduled
well in advance so ample time for preparation is given.
The specific topics discussed at the meeting are:
-
an overview of the major business segments, including operating statistics and comparisons with
competitors and industry norms,
managements projections, including income and cash flow statements and balance sheets, together with
the underlying market and operating assumptions,
10
So called public information ratings are the exception to this rule; they are solely based on the annual publicly available financial
statement.
Standard & Poors does not base its rating on the issuers financial projections, but uses them to indicate how
the management assesses potential problems and future economic developments.
an analysis of the nature of the companys business and its operating environment,
a financial analysis,
After a discussion about the rating recommendation and the facts supporting it the committee votes on the
recommendation. The issuer is notified of the rating and the major considerations supporting it. An appeal is
possible (the issuer could possibly provide new information), but there is no guarantee that the committee will
alter its decision.
Surveillance
The rated issues and issuers are being monitored on an ongoing basis.
developments are reviewed and often a meeting with the management is scheduled annually.
If these
developments might lead to a rating change, this will be made known using the CreditWatch listings. A more
thorough analysis is performed, after which the rating committee again convenes and decides on the rating
change.
17
2 credit ratings
18
liquidity ratios measure the ease with which a company can acquire cash,
In addition to these financial ratios a few other classes of variables can be observed to characterize a company:
!
stability variables measure the stability of the company over time in terms of size and income,
Although financial ratios provide a means to quickly compare companies, some caution should be taken when
using them. Companies often use different accounting standards, so two comparable companies can have very
different values for certain ratios just because of different ways of valuing the items on the balance sheet.
Furthermore, companies often want to present an as favourable as possible image, known as window dressing.
This also leads to ratios not fully representing the real financial state of the company.
Liabilities
+ total short term debt
+ accounts payable
+ other current liabilities
+ income taxes payable
total current liabilities
Fina nc ia l stat em e nt a na ly si s
19
Leverage ratios
Financial leverage is created when firms borrow money. To measure this leverage, a number of ratios are
2 credit ratings
20
available.
Debt ratio
(long term debt) / (long term debt + equity + minority interest)
Debt-equity ratio
This can be measured in several ways, two of which are:
(long term debt) / (equity)
and
(long term debt) / (total capital)
Net gearing
(total liabilities cash) / (equity)
Profitability ratios
Profitability ratios measure the profits of a company in proportion to its assets.
Return on equity
This measures the income the firm was able to generate for its shareholders11.
(net income) / (average equity)
Return on total assets
(earnings before interest and taxes) / (total assets)
Operating income / sales
(operating income before depreciation) / (sales)
Net profit margin
(net income) / (total sales)
Size variables
These measure the size of a company.
Total assets
The total assets of the company.
Market value
Price per share * number of shares outstanding.
Stability variables
Stability variables measure the stability of the company over time in terms of size and income.
Coefficient of variation of net income
(standard deviation of net income over 5 years) / (mean of net income over 5 years)
Coefficient of variation of total assets
(standard deviation of total assets over 5 years) / (mean of total assets over 5 years)
Market variables
Market variables are used to assess the value investors assign to a company.
11
Note the use of the average of the equity (at the beginning and the end of the quarter). Averages are often used when comparing
flow data (net income) with snapshot data.
Fina nc ia l stat em e nt a na ly si s
21
2 credit ratings
22
12
2.4 Summary
In this chapter we have reviewed some theoretical aspects of bonds and credits before exploring the ratings
domain. The credits we are most interested in are bonds issued by companies (corporate bonds). We have seen
the direct relation between creditworthiness, default probability and spread of a credit. If the perceived
creditworthiness is better, then the assigned rating will be higher and the default probability will be lower. The
difference in yield with a similar government bond (also known as the spread) will be subsequently lower.
The different process steps of the Standard & Poors credit rating process emphasize the qualitative analysis
performed by the agency. The quantitative analysis, based on financial statement data, is just a single step in
the process. In the remainder of this thesis we will try to uncover whether actual rating practice reflects this
depiction of matters, using the described financial ratios. These ratios form a means to summarize the balance
sheet and income statement of a company and to compare the financial statements of different companies.
S um m a r y
23
3 self-organizing maps
Chapter 3 reviews the Self-Organizing Map and its place in the knowledge discovery process. To provide a
background for the SOM we will briefly discuss some related techniques before examining the Self-Organizing
Map algorithm. Altogether this answers question 2 from the introduction:
2.
What are Self-Organizing Maps and how can they aid in exploring relationships in large data sets?
Paragraph 1 describes the knowledge discovery process. Paragraph 2 describes some projection and clustering
methods related to SOM.
classification model of chapter 5. The remainder of the chapter is dedicated to an explanation of SOM and
guidelines for the use of SOM.
Data
Target data
preprocessing
Preprocessed
data
3 self-organizing maps
26
and all the steps necessary to extract knowledge from databases is known as
transformation
Transformed
data
visualization
Patterns
Maps
interpretation
evaluation
Creating a target data set based on the available data, the knowledge of
the underlying domain and the goals of the research.
Knowledge
Pre-processing this data to account for extreme values and missing values.
Mining the data so distinct patterns become available for interpretation and evaluation. In this thesis we
will focus on visualization techniques, whereby specific patterns can be found in the resulting maps.
13
Computer scientists use the term data-mining in a positive context (extracting previously unknown knowledge from large databases),
econometricians use the term data-mining in a negative context (manipulating data and the used technique to support specific
conclusions). This sometimes leads to confusion about the intended meaning.
14
Fayyad, U.M., et al, 1996, chapter 2.
Interpreting and evaluating these maps, often repeating one or more steps of the process.
Descriptive
knowledge discovery tries to correctly represent the data in a compact form. The new representation implicitly
or explicitly shows relationships in the data. Not so obvious relationships emerge, thus attributing to a greater
knowledge of the underlying domain. Obvious relationships are of course visible too, strengthening the image
one has of the data based on preliminary research. Common used techniques are projection and clustering
algorithms.
Predictive knowledge discovery is used to complement values for one or more characteristics (or variables) of
observations in the data set. This is often in the form of a classification problem: A data set with known class
memberships is used to build a model, and this model is used to predict the class membership for new
observations. Common used techniques are linear regression based classifiers like ordered logit and artificial
neural networks.
Of course this division is not strict. Some of the algorithms are combinations of techniques, and often the
descriptive techniques are used as an intermediate step in large investigations. The output of the descriptive
Kno w le dg e di sco ve r y
analysis then may serve as input for some of the prediction algorithms.
In the following sections we will highlight some of the available projection, clustering, and classification
techniques. The Self-Organizing Map, treated extensively in the remainder of the chapter, is actually a neural
network combining regression, projection and clustering!
27
Principal component analysis (PCA) is a commonly used linear projection method. The PCA technique tries to
capture the intrinsic dimensionality of the data by finding the directions in which the data displays the greatest
variance. Often the data is stretched in one or more
directions and has an intrinsic lower dimensionality
3 self-organizing maps
28
in
components.
the
data
are
called
These
principal
15
29
17
18
( x1 + y1 )2 +( x2 + y2 )2 +...+( xn + yn )2
The
Merging methods work bottom-up, starting with each case in a separate cluster. Clusters having the least intercluster distance d are merged, often the Euclidean distance is used for d. An example clustering of car brands is
shown in figure 3-5.
3 self-organizing maps
30
19
x i c m if m 1 < y i
where cj denotes class j, and j denotes the threshold for class j.
The class thresholds and the coefficients in the regression equation can be simultaneously estimated using
Maximum Likelihood Estimators. More information on the ordered logit model can be found in Fok20.
3 self-organizing maps
32
training
phase
observations
of
the
required
S el f Or g a n iz i ng Ma p s
33
Figure 3-7 Picture of the homunculus in the brain, drawn by Wilder Penfield
3.4.2 Overview
The self-organizing map algorithm involves two steps. The first step projects the observations, the second step
clusters the projected observations.
Projection
The first step of the algorithm involves projecting the observations onto a 2 dimensional, flexible grid composed
of neurons or nodes. The grid is stretched and bended through the input space to form an as good as possible
representation of the data. The projection on this grid is a generalization of simple projection (on the flat
surface) and projection using Principal Component Analysis (PCA).
Simple projection simply projects the datapoints on the flat surface defined by the x and y axes. Projection
using principal components is more advanced than simple projection (reflecting the intrinsic dimensionality of
the data), but is still limited because the observations are projected on a flat plane. The flat plane is aligned
according to the axes defined by the two directions inhibiting the largest variance of the data. The projection
part of the SOM algorithm (also known as the self-organization process) can be thought of as a non-linear
generalization of PCA21. The plane onto which the observations are projected can stretch and bend through the
3 self-organizing maps
34
input space thus more thoroughly capturing the distribution of the observations in the input space.
The first two types of projections are often too restricted to fully capture the irregularities of the data. The three
dimensional example in Figure 3-7shows this more clearly. The data is clustered in three distinct segments of
the cube, simple projection projects the observations on the bottom of this cube (left picture). The flat plane
Figure 3-7 Plane of projection using the X-Y plane, using PCA and using SOM
shown in the middle picture is aligned along the first two principal components of the data. A projection on this
surface gives a better representation of relative distances in the data set. The rightmost picture shows the
flexible, bended and stretched grid used for SOM projection. By following the form of the data an even more
21
accurate representation of relative distances in the data set is given. How the SOM achieves this projection is
extensively treated in paragraph 3.5.
Clustering
The flexible grid, onto which the observations have been projected, is (for convenient output viewing) returned
to a normal, unstretched flat plane and displayed as the map. The form of the grid in the input space remains
fixed. The local ordering of the sample is preserved; neighbouring observations in the input space will be
neighbouring observations on the map.
A bottom-up clustering method is used to cluster the projected observations: starting with each observation in
a separate cluster, 2 clusters are merged if their relative distance (e.g. Euclidean distance) in the input space is
smallest and if they are adjacent in the map. The number of shown clusters varies with the specific step of the
algorithm we want to see. One step later in the algorithm means one less cluster shown (another cluster has
merged), one step earlier means one more cluster shown.
Cluster are clear separations of the input space, so observations can only be member of one cluster (the clusters
do not overlap). The clustering algorithm is discussed in paragraph 3.6.
S el f Or g a n iz i ng Ma p s
35
3 self-organizing maps
36
input space: Input vectors situated nearby each other in the input space will be assigned to nearby (or even the
same) neurons. They are thus placed nearby each other on the associated map grid points, this is known as
'topology preservation'. The neurons are not fixed in the input space; they move at each iteration of the
algorithm, thereby stretching and bending the grid, to better accommodate the distribution of the input vectors.
Alternative interpretation
We can also interpret the self-organization process as a form of regression. Normal or parametrized regression
tries to fit a line or curve through the observations using a presupposed form of the underlying function. Only
the compression (coefficient) and the height (constant) of the curve are adjusted.
Instead of being tied to a fixed functional form the neural network of the SOM can vary over a whole class of
functions to make an as good as possible approximate match of the data. However, the freedom of the network
to find a functional form is restricted by the interconnections between the neurons, and the final form of the grid
is fixed. Therefore it is called semi-parametrized regression.
When a priori a specific form of the function is not taken into account at all, we would refer to it as nonparametrized regression. Some kind of functional form can be found using representative reference vectors
(e.g. averages over sets of observations), but these vectors do not influence each other, so any form is
achievable.
S O M p ro j ec tio n
37
The winning neuron, most closely resembling the current observation, is identified. The most common
used measure of likeliness is the Euclidean distance.
2.
We now say that the current observation is projected on the winning neuron and this winning neuron is
adjusted to more closely resemble the current observation. The neurons are interconnected, so some of
the neighbours (in the grid) of the winning neuron will also be adjusted. The farther away from the winning
neuron in the output grid, the less adjustment is made to the neuron.
To stabilize the algorithm a function is included that reduces the adjustments to the neurons over the performed
iterations. Appendix II contains a complete example showing the adjustments in each iteration .
The final fit, after several iterations of the algorithm,
is shown in figure 3-10. The observations have been
projected on the neurons and the neurons have been
adjusted to form best representations of these
3 self-organizing maps
38
populated areas.
The output map is identical to the shown grid; the
observations are projected on the same neurons
(gridpoints). The interconnections in the input space
also hold true in the map: Neighbouring units in the
input space are neighbouring units in the map. The
whole available range of visualizations will be
treated in paragraph 3.6.
The winning neuron mc(t) most closely resembling the current observation x(t) is selected (c denotes the
winning and i denotes the current neuron):
S O M p ro j ec tio n
x (t ) mc (t ) = min x (t ) mi (t ) .
i
2.
39
The adjustment is monotonically decreasing as the number of iterations increases. This is controlled by the
learning rate factor (t) ( 0 < (t) < 1 ), which is usually defined as a linearly decreasing function over the
iterations. The neighbours of the winning neuron are also adjusted, but the adjustment is decreasing as the
distance from the winning neuron in the output grid increases.
neighbourhood function hci(t). Different specific forms of the neighbourhood function can be found in paragraph
3.6.5.
Multiple stages
Instead of performing the self-organization process only once, the map is trained in multiple stages (also called
epochs) in which the algorithm reiterates over all the observations in the train set. This way the map converges
to a more stable situation while improving statistical accuracy. Any differences in initialization or ordering of the
observations are also cancelled out.
The Viscovery implementation (see paragraph 3.6) uses a specific method called batch training to accelerate
the train process while keeping the same results. For more information on the batch train process please refer
to Deboeck22.
3 self-organizing maps
40
The distribution of the neurons after the self-organization process is shown in Figure 3-12. The network, still a 2
dimensional lattice, has curved and stretched to form an as good as possible fit to the original data. The
neurons are concentrated in those areas of the input space containing the most observations. The largest
separation occurs between the cluster of observations in the bottom half of the cube and the two clusters of
observations in the upper half of the cube.
22
24
3.6.1 Maps
The visible output of the algorithm consists of the map, which is an unstretched, flattened representation of the
grid in the input space. Observations mapped to a specific neuron in the input space appear on the same
specific neuron (grid point) in the map. Neighbouring observations in the input space are neighbouring
observations on the map.
The map has several manifestations:
-
U-matrix: to view relative distances between neurons (in the input space).
It is important to remember that for each map manifestation the distribution of observations over the map does
not change. We are looking at the same map, but each time different information is shown.
space
to
the
output
map,
distance
Eudaptics, 1999.
24
In addition to this, the intuitive interface and the ability to work with Excel files make it an attractive package.
25
The clusters and specific clustering algorithms will be treated in paragraph 3.6.3.
differences between the neurons in the input space translate to darker colours in the map.
The U-matrix for the earlier used three dimensional example is shown in figure 3-13. The implicit clustering is
visible as groups of neurons having almost equal colour separated by nodes with distinctly different colours. In
this U-matrix one very clear cluster at the right of the map can be found. The two clusters at the left half,
separated in the middle, are less clear. This agrees with the placement of the three clusters of observations, as
can be checked in figure 3-12.
Component planes
A component plane is a manifestation of the map whereby the values for only one of the variables (a
component) are shown. In this way the distribution of this separate variable over the map can easily be
inspected. When comparing two different component planes of the same map highly correlated variables would
stand out because of the likeliness of their component planes. Components not contributing much to the
distribution of the observations show a more random pattern in their component planes, they are only
contributing noise to the clustering.
Often a display of the U-matrix surrounded by the component planes of all the variables is created. Figure 3-14
3 self-organizing maps
42
shows such a display for our three-dimensional example. The three component planes represent the X, Y and Z
variables.
The display shows that no two variables are highly correlated. The right cluster is characterized by small values
for all variables. The top-left cluster is characterized by high values for X and Z, the bottom-left cluster displays
high values for Y and Z. This also agrees with the placement of the clusters of observations in Figure X.
Data representation
The data representation accuracy is most often
measured using the average quantization error
d (x, m
x
3 self-organizing maps
44
The
quantization errors. A good map shows low and equally distributed quantization errors (Figure 3-16).
representation
accuracy
is
the
3.6.3 Clusters
It is left to the user to find any clustering of observations based on the U-matrix and the component planes. This
so-called implicit clustering can be complemented with other clustering techniques to find an explicit clustering.
Most software implementations of the Self-Organizing Map do not incorporate any explicit clustering
algorithms. The Viscovery SOMine package includes up to three different clustering methods.
The clustering algorithm frees the user from the difficult task of identifying clusters in the U-matrix. However,
by altering parameters of the clustering algorithm the number of shown clusters may vary. The user still has to
select the most adequate clustering based on all available information.
The three clustering methods implemented in Viscovery SOMine are Ward's clustering, SOM single linkage and
a combination of these two, called SOM-Ward. Instead of directly clustering the original observations these
algorithms perform a clustering on the neurons (grid points) in the map, on which the observations are
projected. As these neurons form 'best representations' for the observations in the input space there is no
qualitative difference. The clustering of the observations can be found by retrieving the projected observations
for each neuron in each cluster.
Distance measure
Two of the implemented clustering algorithms make use of a specific distance measure, called the Ward
distance. It is defined as:
Ward distance d x,y =
nx ny
nx + ny
mean x mean y
nx
ny
1
2
3
4
5
10
9
8
7
6
mean x is the vector with averages over all components of the neurons in
cluster x, also known as the cluster centroid. Distances between clusters with
an evenly distributed number of neurons are enlarged in comparison with
distances between clusters with an uneven distribution of the numbers of
neurons (see table 3-1). This accelerates the merging of stray small clusters.
n x ny
nx + ny
0.91
1.64
2.18
2.55
2.73
Ward's clustering
This is one of the classic bottom-up methods. It starts with all the neurons in a separate cluster, in each step
merging the clusters having the least Ward distance. This distance is calculated without taking the ordering of
3 self-organizing maps
the map into account, only distances between neurons in the input space are used. When the found clustering
46
warranting the inclusion in one cluster, but the grid may be bended through the input space in such a way that
is shown on the map, the clusters may appear disconnected: In the input space the neurons are close-by
the neurons are far apart on the map.
SOM-Ward
This clustering method is essentially the same as Ward's clustering, but this time the ordering of the neurons on
the map is taken into account. Only clusters that are direct neighbours in the map can be merged together to
form a larger cluster. The SOM-Ward clustering technique is primarily used in our research. An example of
SOM-Ward clustering (using the same 3 dimensional data set) is shown in figure 3-18.
Shown clusters
The number of shown clusters may vary
according to user specified settings. For
the Ward and the SOM-Ward clustering this
implies fixing the formed clusters at a
specific step of the clustering algorithm.
For the SOM single linkage clustering this
means setting the threshold, generating
more or less separators and clusters.
For each specific clustering algorithm a quantitative quality measure for the current clustering is calculated. For
SOM-Ward clusters this cluster indicator subtracts the observed distance levels at all steps in the cluster
algorithm from an exponential increasing standard distance level. If this deviation at the next clustering step
(from c to c-1 clusters) is more positive than the deviation at the current clustering step (from c+1 to c clusters)
then the current cluster configuration is better.
(c )
Cluster Indicator I (c ) =
1 100 if I (c ) > 0 , else I (c ) = 0
1
+
(
)
c
where (c ) = d (c ) c
, d (c ) is the SOM-Ward distance for the step in the algorithm from c to c-1 clusters, 3 c
number of neurons.
The is the coefficient found by linear regression through the [(ln(c), ln(d(c)))] data points, where 2 c
number of neurons. This exponential curve conforms to the observed standard exponential increase of
distance levels as the number of clusters decreases.
I(c) = 0 if c = 1, c = 2 or if d(c+1) > d(c); a smaller than normal distance level at the next clustering step (from c
to c-1 clusters) than the distance level at the current clustering step (from c+1 to c clusters) means a worse
clustering with c clusters.
Number of neurons
One of the main settings to choose when training a map is
the number of output neurons.
A small number of neurons (smaller than the total number of
observations in the train set) means a more general fit is
made. The map is better at generalizing and is less sensitive
to noise in the data.
3 self-organizing maps
48
Although the
The number of neurons should be chosen in proportion to the trust one places in his or her data: If a lot of noise
is to be expected, then a relatively small number of neurons should be chosen. If the distribution of the sample
data very closely resembles the underlying distribution of the population, then a relatively large number of
neurons can be initialized. The extra neurons then warrant a more refined representation of the data by the
network.
Initialization
Instead of random initialization one often uses linear initialization. Both can be used, but linear initialization
provides a better starting point for the organization of the map. The map is often linear initialized along the
axes provided by the first two principal components of the data set.
(t ) =
(B + t )
where A and B are constants. Earlier and later samples will now be taken into account with approximately
similar average weights26.
The neighbourhood function often has the Gaussian form
r r 2
i
j
,
hij (t ) = exp
2
2 (t )
where ri denotes the place of this neuron in the map and (t) is some monotonically decreasing function over
the iterations. Sometimes a simpler form of the neighbourhood function is used, e.g. the bubble function which
just denotes a fixed set of neurons around the winning neuron (in the map). The Gaussian form ensures a global
best ordering of the map (the quantization error arrives at a global minimum instead of a local minimum) 27.
26
27
3.7.1 Description
When a map has been created the user has to evaluate the map, determine a good clustering and possibly
improve on the clustering so that a clear understanding of the underlying data set emerges.
Determining a good clustering is a non-trivial task. Of course the variables used for map creation have to be
suitable for the research setting. Then each specific setting for the used clustering algorithm renders a different
number of clusters visible. The map quality measures and the quantitative cluster quality measure form a
starting point for determining a good clustering. It is up to the expert user to choose a clustering suitable for
the task at hand, specifically by taking any domain knowledge into account.
3 self-organizing maps
50
variables used in the creation of the map. Removing a variable is warranted only under certain conditions, if
these conditions hold then the variable does not contribute much to the generated map and can safely be
removed:
-
With or without the variable the distribution of the companies over the map remains equal.
With or without the variable the clustering remains the same (same size and same characteristics in terms
of individual variables).
The component plane of the variable shows a random distribution (Figure 3-21). The component only adds
noise to the formation of the map, it does not contribute to the distribution of companies over the map. For
instance, this could happen when the variance of the normalized variable is significantly lower than the
variance of the other normalized variables.
The component plane of the variable bears a close resemblance with the component plane of another
variable (Figure 3-21). The variables are then highly correlated (not necessarily in a linear fashion). The
dependent variable does not contribute to the distribution of companies over the map, because the same
information is already contained in the other variable.
The distribution of the high and low values of the component plane does not coincide with one or more
specific clusters (Figure 3-22). A strong characterization of the clusters (regarding this variable) can not be
given. It is most likely that the variable does not contribute to the clustering, so we choose to remove the
variable.
51
Examples
In appendix III and IV two examples of descriptive SOM use can be found, one on a medical domain and the
second on a data based marketing domain. Chapter 4 also uses SOM in a descriptive way to evaluate the link
between credit ratings and financial ratios.
3.7.2 Prediction
The SOM can be used to predict values for any of the variables of new observations. We are then not so much
interested in the found clustering as we are in the form of the neural network in the input space. The final form
of the neural network is found using the self-organization process (this is a form of semi-parametric regression),
and remains fixed.
The network can now be used to predict values of one or more variables for previously unknown observations,
just as we would use the regression line to predict values for new observations in a standard linear regression
model. It is also possible to do this for observations used to create the map, but this would of course lead to an
artificially good prediction.
The values for specific variables are predicted as follows:
1.
A neighbourhood of K neurons of the current new observation is determined. The user can set K.
2.
A weighted average of the variable is taken over this neighbourhood, where close-by neurons are weighted
more strongly.
3 self-organizing maps
52
When K is set to 1 the prediction is based on only one neuron, also called the 'best matching unit'.
Model set-up
To correctly assess the prediction power of the model we have to make a distinction between an in-sample train
set and an out-of-sample test set. The test set is reserved for a final assessment of prediction capabilities after
the model has been constructed. When creating and iteratively improving the model we want to test the
prediction power of the created map without using the test set. We also do not want to use the same
observations as used for training the map, so we have to make an extra distinction in the in-sample data set: a
train set to train the map, and a validation set to tune its parameters.
a relatively short time span. When using all the variables for map creation, and then subsequently removing
variables not contributing much to the prediction power, we can be certain that all contributing combinations
are found. Unfortunately this strategy is more time consuming.
neuron the new observation (green) is matched in the one-dimensional final map (the distance to either neuron
is equal).
When using the target variable as a train variable, the map is created using two dimensions (figure 3-24). The
placement of the neurons shifts, it is much clearer that the new observation matches the rightmost neuron.
Remember that we do not have the value of the target variable for the new observation, so we can still only use
the x-dimension to determine the best matching unit for this new observation.
3 self-organizing maps
54
Of course this particular example only illustrates one possible outcome of using the target variable as a train
variable. A deeper investigation into the effects of this technique lies outside the scope of this thesis.
Examples
An example of the use of SOM as a prediction model can be found in chapter 5: Financial ratios are used to
classify companies according to creditworthiness.
28
S O M q ue sti on s a nd an sw e r s
A: It is not really the grid in the input space that is flattened and unstretched, rather a direct representation of
this grid in 2 dimensions. Each neuron in the input space directly corresponds with a grid point in the 2
dimensional map.
Q: Is there a chance of overfitting the neural network when using a large number of neurons (larger than the
number of observations)?
A: This depends on your definition of overfitting. The SOM algorithm includes automatic 'dampening' functions
in the form of the learning rate factor and the neighbourhood function. When using a large number of neurons
the network more precisely represents the underlying dataset, some would consider this overfitting. However,
thanks to the dampening functions the neurons are not completely attracted by the specific observations.
Q: Does the order in which the observations are being processed by the self-organization process make any
difference for the final results?
A: No, because instead of processing the observations just once, often multiple iterations are used. Together
with the used dampening functions the map converges to a stable form.
55
3 self-organizing maps
56
3.9 Summary
Chapter 3 covered the theoretical foundations of SOM. We viewed the place of Self-Organizing Maps in the
knowledge discovery process, and we described some projection, clustering and classification techniques
related to SOM. The SOM is a combination of non-linear projection and hierarchical clustering, driven by a
simple feed forward neural network. The observations are projected on a flexible grid of neurons that stretches
and bends to accommodate to the distribution of the data in the input space. After the network has found its
final form, it is displayed in a flattened state as a map. The observations projected on this map are then
clustered, according to similarity of the used variables.
A Self-Organizing Map can be used in two ways: As a descriptive analysis tool, and as a prediction model. For
use in a descriptive setting the map display and the clustering is most important. Visually comparing the
clusters and other parts of the SOM display provides a good and insightful overview of the underlying data set.
When deploying the SOM as a prediction model, we are more interested in the distribution of the companies
over the map (or equivalently, the form of the map) than the clustering. The SOM then functions as a semiparametric (possibly non-linear) regression model.
S um m a r y
57
4 descriptive analysis
The paragraphs in chapter 4 form an account of our descriptive analysis, using the SOM as a visual exploration
tool. We answer question 3 and 4 from the introduction:
3.
Is it possible to find a logical clustering of the companies, based on the financial statements of these
companies?
4.
If such a clustering is found, does this clustering coincide with levels of creditworthiness of the companies
in a cluster?
Paragraph 1 covers the basic data analysis. Paragraph 2 explores the possibility of clustering companies based
on financial data. In paragraph 3 we then compare the found clustering with the credit ratings of the clustered
companies. Paragraph 4 reviews the performed sensitivity analysis and in paragraph 5 we benchmark the SOM
results to a principal components analysis.
Financial ratios
Our preliminary selection in chapter 2 yielded several types of financial ratios for industrial companies. We can
distinguish the following kinds of ratios or variables:
-
Interest coverage ratios: these measure the extent to which the earnings of a company cover debt or
interest.
4 descriptive analysis
60
Leverage ratios: these measure the financial leverage created when firms borrow money.
Profitability ratios: profitability ratios measure the profits of a company in proportion to its assets.
Stability variables: stability variables measure the stability of the company over time in terms of size and
income.
Market variables: market variables are used to assess the value investors assign to a company.
Leverage
Profitability
Size
Stability
Market
Name
EBIT interest coverage
EBITDA interest coverage
EBIT / total debt
Debt ratio
Debt-equity 1
Debt-equity 2
Net gearing
Return on equity
Return on total assets
Operating income / sales
Net profit margin
Total assets
Market value
Coefficient of variation of net
income
Coefficient of variation of total
assets
Coefficient of variation of
earnings forecasts FY1
Market beta relative to NYSE
Earnings per share
Description
(earnings before interest and taxes) / (interest expenses)
(earnings before interest, taxes, depreciation and amortization)
/ (interest expenses)
(earnings before interest and taxes) / (total debt)
(long term debt) / (long term debt + equity + minority interest)
(long term debt) / (equity)
(long term debt) / (total capital)
(total liabilities cash) / (equity)
(net income) / (average equity)
(earnings before interest and taxes) / (total assets)
(operating income before depreciation) / (sales)
(net income) / (total sales)
The total assets of the company
Price per share * number of shares outstanding
(standard deviation of net income over 5 years)
/ (mean of net income over 5 years)
(standard deviation of total assets over 5 years)
/ (mean of total assets over 5 years)
(standard deviation of forecasts fiscal year 1 over analysts)
/ (mean of forecasts fiscal year 1 over analysts)
snapshot taken on the last trading day of the quarter
(earnings applicable to common stock)
/ (total number of shares)
Bas ic dat a a na ly si s
61
Rating classification
The ratings are classified according to the S&P rating classification scale. These
letter ratings have been transformed to numerical rating codes; the
transformation is shown in table 4-2. The rating code represents an ordering
between the different rating classes. A higher rating corresponds to a lower
default risk, a lower rating corresponds to a higher default risk. Note that this
numerical scale also seems to imply rating classes of equal width, which not
necessarily has to be true. At this point we choose to use this particular
transformation because we do not know the exact width of the rating classes.
Furthermore, because the SOM can also model non-linear relationships this is
less of a disadvantage than it first may seem.
Publication lag
Rating agencies try to react as soon as possible on any news that may affect the
rating of a company.
4 descriptive analysis
This
publication lag is often 3 to 6 months. Thus the rating for the fourth quarter of
1998 is based on financial figures for the second or third quarter of 1998. Taking
62
this into account we downloaded the ratings with a 2-quarter offset; the figures
of the fourth quarter of 1998 were matched with the ratings of the second
Rating Code
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
quarter of 1999.
Except for defaults, the ratings do not change twice within two quarters. Actually, they do not change much at
all (generally less than once per year). This is consistent with the general policy of rating agencies to keep the
ratings as stable as possible.
Company universe
The constructed universe consists of US companies with at least one Standard & Poor's credit rating in the
evaluated time period. This time period starts at the first quarter of 1994 and ends at the fourth quarter of 1998.
The total number of companies in our universe is 1677.
Sector classification
Each company is classified according to the S&P sector classification29. The distribution of companies over S&P
sectors is shown in the following figure:
Sector partitioning
350
300
250
200
Companies
150
100
50
on
Ba
si
c
M
at
su
er
m
ia
er
ls
C
C
on
y
cl
su
i
ca
m
ls
er
St
ap
H
le
ea
s
lth
C
ar
e
En
er
gy
Fi
na
C
ap ncia
ita
ls
C
lG
om
m
Te ood
un
s
ch
ic
no
at
lo
io
gy
n
Se
rv
ic
es
U
Tr
til
an
i
sp ties
or
ta
tio
n
We have chosen to perform our analysis on a per sector basis, because of the following reasons:
-
Companies in different sectors can display very different values for the same variable (the long-term debt of
a bank will on average be much larger than the long term debt of a steel factory).
Often a variable is not applicable to companies in one sector but very applicable to companies in another
(a bank does not have a raw materials inventory like a steel factory has).
Data provider
Most of the data was downloaded from the Compustat Quarterly historical database, available through the
Factset datavendor. The Compustat database is actually owned by Standard & Poor's, this leads us to believe
29
We do not use the new MSCI sector classification because that classification scheme does not yet cover the whole company
universe.
Bas ic dat a a na ly si s
63
that we at least partially use the same data for our model as S&P uses for their rating decisions. Forecasts are
available from the IBES database, a well-known data provider for earnings forecasts.
For each variable the underlying components were downloaded (instead of downloading just the ratio when
available).
This guarantees a ratio calculated according to our specifications and it facilitates checking
individual values.
4 descriptive analysis
64
robust estimators.
Other calculated statistics include the minimum, the maximum, the number of not-availables, the percentage of
values greater (smaller) than the mean plus (minus) 3 times the standard deviation and the percentage of values
greater (smaller) than the median plus (minus) 3 times the median standard deviation. Finally, to conduct some
tests on the normality of the variables, we calculate the skewness, the kurtosis and the Jarque-Bera statistic31.
These tests show that a few extreme values are greatly offsetting the distribution per variable. The mean and
the median differ a lot, as do the standard deviation and the median standard deviation.
Stability of variables over time
To test the stability of variables over time we have evaluated the summary statistics per variable for all
downloaded quarters32. We can only justify the merging of cross-sections (to enlarge the data set leading to
higher statistical significance) when the characteristics of the data do not significantly change over time. This
holds true for the median and the median standard deviation; differences between periods are small. When
evaluating the mean and the standard deviation, great differences between periods can be detected for almost
all variables. This once again confirms our observation of extreme values off-setting the distributions.
30
A complete description of the median standard deviation can be found in the appendix V.
A complete description of these tests can be found in appendix V.
32
These figures are not displayed here but are available upon request.
31
Missing Values
The data contains a substantial amount of missing values per variable, as is often the case with financial data.
For the same company we often see a sequence of missing values. This is due to the fact that some variables
consist of one or more of the same components. If one of these components is missing, the derived variables
can not be calculated.
Extreme values
The scatter plots show that almost all variables contain one or more extreme values. We checked these extreme
values by decomposing the variable and checking the values for the underlying components. This tells us that
most extreme values represent accurate values. For example, in the "Total Assets" variable some companies
are several times as big as most other companies. Often the same companies arise in different time periods as
extreme values, this is also an indication that the values are structurally higher or lower (thus correct) instead of
mere data errors. Therefore we should explicitly not call them outliers.
Coping with extreme values
The found extreme values present us with some problems. We do not want to lose this information, but we also
want to capture all of the information contained in the 'normal' range of the variable. When not removing the
extreme values it might result in a loss of resolution in this range.
To cope with the extreme values we proceed in the following way: For every variable we calculate a cut-off, so
that approximately 2,5 percent of the observations is situated above the median plus this cut-off or below the
median minus this cut-off. Then the observations having values larger (smaller) than the median plus (minus)
the cut-off are replaced by this upper (lower) value. The histograms in figure A-13 in appendix VI clearly exhibit
much more evenly distributed variables after the cut-off. For comparison the found model can at a later time be
tested using non-edited data.
Bas ic dat a a na ly si s
65
Correlation
To find possible collinearities between variables it is insightful to calculate a correlation matrix; the matrix is
displayed in table A-3 in appendix VI. The correlations higher than a best-practice value of 0.7033 are highlighted
in the matrix. For linear models these high correlations can lead to problems involved with multi-collinearity. To
circumvent these problems we will test our models with different sets of variables, every time selecting other
combinations of non multi-collinear variables.
Transformation
The histograms and the summary statistics show rather skewed size variables. Therefore these are transformed
using a logarithmic transformation with a base 10 logarithm.
Other transformations are unnecessary. The Viscovery program automatically re-scales the values per variable
so they are comparable when creating the map. This way no specific preference (that could influence the
forming of the map) is shown for any of the variables. For convenient output viewing these values are scaled
back to their original values.
4 descriptive analysis
66
33
configurations; the differences between companies in separate clusters are larger when the cluster indicator is
higher.
The user is presented with a choice between several possible clusterings, for each possibility displaying the
number of clusters and the value of the cluster indicator. The user has to combine this quantitative measure
with available information like summary statistics per cluster, the distribution of individual variables over the
overall clustering (using component planes) and specific domain knowledge. In the end the found clustering
should accurately reflect key (and not necessarily linear) relationships between clusters and individual
variables.
34
The used clustering algorithms and the cluster indicator are explained in chapter 3
C lu st e rin g c om pa ni es
67
4 descriptive analysis
68
EBITDA interest coverage with EBIT interest coverage and with EBIT / total debt.
The net gearing variable shows a strange correlation with debt-equity ratio 2. For extreme values they are
negatively correlated, while we would expect a positive correlation. The raw data tells us that this only happens
when a company has negative equity36. Due to the definition of net gearing this variable increases when equity
decreases, until equity turns negative. The value for the variable then turns negative which shows up on the
map as a relatively low value.
This also holds true for the debt-equity ratio 1. As both variables try to convey the same message as the debt
equity ratio 2 and in view of the lesser contribution to the overall clustering we see no need to keep them.
Adjustments
We removed the following variables: EBIT interest coverage, EBIT / total debt, Debt ratio, Debt-equity ratio 1,
Net gearing and Log market value.
35
The correlation matrix can function as an extra confirmation, but due to the necessarily linear nature of the correlation coefficient it
might not capture all dependencies between the variables.
36
Negative equity occurs when a company has to compensate for so many losses that the sum of its reserves and stockholders equity
turns negative.
Second iteration
In the second iteration a new map was drawn, after performing the adjustments of the first iteration. The map
can be found in appendix VI as figure A-15.
Inferred relations
At this point the map is formed using four profitability ratios, whereas we are using just one interest coverage
ratio, one leverage ratio, one size variable, two stability variables and two market risk variables. This way we
are giving extra weight to the profitability measure of a company. For this analysis we would like to have a
representative overview of the financial statement of each company, an extra weight for profitability is
undesirable. Therefore we remove two profitability ratios: return on equity and the net profit margin. The
return on equity variable approximates the return on assets variable but contributes less to the overall
clustering. Likewise, the net profit margin (net income / total sales) mimics the operating income / sales
variable, but contributes less to the overall clustering.
On this map the EPS variable shows a correlation with the return on assets variable, and is thus candidate for
removal. The beta variable does not contribute much to the clustering at all, and therefore we will remove it.
Adjustments
C lu st e rin g c om pa ni es
We removed the following variables: Return on equity, Net profit margin, EPS and Beta.
Third iteration
In the third iteration the final map is created based on the following variables:
-
Debt-equity ratio 2
Return on assets
As in the previous iterations we do not use the S & P senior unsecured debt rating when creating the map. The
map is displayed in the appendix VI as figure A-17.
69
4.2.3 Results
The final clustering is shown in figure 4-2. The inter-cluster distance indicator is highest for the shown eight
clusters, so on a quantitative basis this clustering is a good starting point. When evaluating the component
planes we notice for each variable a concentration of extreme high values in one or two clusters and a
concentration of extreme low values in one or two clusters. An even distribution of the high and low values over
the map would mean that this specific variable is only adding noise to the clustering. As this is clearly not the
case, we are confident that the found clustering is adequate for this data set.
4 descriptive analysis
70
The summary statistics37 per cluster re-enforce this image. Per cluster the following statistics are calculated:
-
For each variable the mean, minimum, maximum and standard deviation
The companies are evenly distributed over the clusters, and the statistics for the variables per cluster differ
enough to make meaningful characterizations of the clusters.
37
When visually inspecting the map and the distribution of individual variables over the map the following
characterization of the clusters can be made (in order of descending creditworthiness):
C 2 - Healthy companies with high interest coverage, low leverage, high profitability, very stable
companies and low perceived market risk. Remarkable: these are not always the biggest companies.
C 4 - Large stable companies with a high profit margin. Remarkable: not so high interest coverage.
C1, C3 and C8 - Average companies with no real outstanding features.
C 5 - Small companies with low interest coverage and high leverage. Remarkable: a stable coefficient of
variation of net income, these companies do not grow much.
C 6 - Underperformers: very low interest coverage, very low or even negative profitability, negative
earnings forecasts.
C 7 - Unstable companies: very unstable and a very high perceived market risk.
C lu st e rin g c om pa ni es
71
superimposing
the
clusters
on
the
ratings
4 descriptive analysis
72
whereas
underperformers
and
unstable
Starting from the same assumptions we can most likely attribute the observed rating diversity within each
cluster to external, non-financial factors. According to its financial statement the company belongs in a certain
cluster, but an unknown external factor contributed to a higher or lower rating.
A poor fit is a fit where the ratings are randomized over the map. When randomizing the ratings we take the
original form of the distribution of the ratings into account. Ratings occurring less frequently on the
original map still do not appear very often. We have created this situation by shuffling the existing ratings
over the companies. It is portrayed in figure 4-5 a.
For this specific research a perfect fit would be when the model solely based on financial ratios perfectly
describes the dataset. The found clustering makes a perfect distinction between companies with different
levels of creditworthiness, and companies with exactly the same level of creditworthiness are perfectly
alike. Each cluster only contains companies with an equal rating. This purely hypothetical situation is
shown in figure 4-5 c.
The observed fit (when forcing the map to display 22 clusters) is shown in figure 4-5 b.
a
b
Figure 4-5 Rating mapping from poor to perfect
Visually the three different scenarios are very distinguishable, this can be mathematically verified using the
cluster coefficient of determination.
4 descriptive analysis
74
2
=
R cluster
SSC
SSE
= 1
SST
SST
Ratings distribution
60
50
40
30
ratings
20
(because
sum
to
otherwise
zero)
for
these
each
0
1
company.
N
SSE clusters =
(ri rcluster (i ) )2
i =1
N 1
,
38
10
9 10 11 12 13 14 15 16 17 18 19 20 21 22
where r i is the rating of company i and r cluster (i ) is the average rating of the cluster company i belongs to. We
are trying to estimate the variance based on a sample of N companies (instead of the whole population),
therefore we divide by N - 1.
2
is a measure for the fit of the ratings mapping to the current clustering, when keeping the number of
R cluster
2
would indicate that the ratings mapping is poor (a high residual variance of
clusters constant. A small R cluster
2
the ratings within each cluster), a high R cluster
would indicate that the ratings mapping is good (a small residual
variance of the ratings within each cluster). The number of clusters must be suitably chosen, otherwise an
2
artificially high R cluster
can easily be obtained by using a lot of clusters.
4.3.3 Results
Fit of observed mapping
We have found the current clustering using the self-organizing map and a fixed set of financial ratios, without
using the S&P ratings. So when we assume the following:
-
the used financial ratios are a representative financial characterization of the companies in this sector,
2
represents the variance of the ratings that can be explained by a model based solely on
then the R cluster
financial ratios. The residual variance can then most likely be attributed to the qualitative factors rating
2
agencies take into account when assigning a rating to a company. Table 3 shows the found values of R cluster
for
a poor fit, an observed fit, a good fit and a perfect fit.
The perfect situation can only occur when ratings are solely influenced by financial ratios. We know that rating
agencies also take qualitative factors into account when determining a rating. As of yet we do not know the
precise contribution of the qualitative factors, but we will try to simulate them using deviations from the real
ratings based on a standard normal distribution (mean 0 and standard deviation 2 notches, or one rating class).
This is the good fit also displayed in table 4-3.
2
for the observed mapping indicates that
The R cluster
Mapping
2
R cluster
Poor
0.07
Observed
0.61
Good
0.95
Perfect
1
other 40% to (amongst others) the qualitative analysis performed by the rating agency. The difference between
2
the R cluster
for the observed mapping and for our good scenario also indicates that the influence of the
qualitative analysis reaches further than a simple one or two notches adjustment of the assigned rating.
the effects of this decision. We calculate the fit for the 2-quarter lag, a 1-quarter lag and a 0-quarter lag. The
results are shown in table 4-4.
The fit marginally improves when
using a smaller lag period. As a lag
period of 2 quarters is better
justifiable we will continue to use a
2-quarter lag
0.61
1-quarter lag
0.64
0-quarter lag
0.67
Perfect
1
Drawbacks
4 descriptive analysis
76
2
Some of the drawbacks of the R cluster
figure are:
We have to assume that the data is free of errors and representative for the underlying domain.
We have to assume that the clustering found using SOM is representative of the underlying dataset.
2
The R cluster
can only be compared for maps with approximately equal clusters. More clusters reduce the
2
variance of the ratings within clusters and thus improve the R cluster
. This is equivalent to using more
2
variables in a standard regression model: The R cluster
can only improve, more of the variance will be
Alternative use
This last drawback can actually be used to indicate a suitable number of clusters for the current map and the
2
associated ratings. When the R cluster
does not improve after selecting an extra cluster to view, then the last
cluster division did not contribute to a better division of the ratings. Before and after the cluster division the
2
is the same, so the ratings in that specific cluster are even distributed and the first situation is (locally)
R cluster
optimal.
This is not unlike the way we use the eigenvalues in PCA to determine the intrinsic dimensionality of the data.
As less as possible intrinsic dimensions are selected that reasonably capture the variance in the dataset. Where
PCA compares the variance of the principal components with the variance of the dataset, our method compares
the variance of the clusters with the variance of the variable we want to see explained (in this case the rating).
This differs in that with PCA the dataset is used to calculate the principal components, while we do not use the
ratings to determine a suitable clustering. We should thus not try to find a number of clusters that fully explains
the variance in the ratings, as it may well be that the variance can not be totally explained. We can however use
2
the rate of improvement of the R cluster
on a sensible range of number of clusters to determine how good the
mapping really is.
2
If the R cluster
is plotted against the
100
90
80
70
60
%
50
40
classes is 22).
We clearly see a
30
20
10
0
1
10
12
14
16
18
20
22
24
# Clusters
Our initial choice of 8 clusters seems to be reasonably good; most of the explainable variance in the ratings is
captured, no direct improvement can be found when using one extra cluster and it is still possible to easily infer
relationships from the maps.
When using 14 clusters almost all possibly explainable variance of the clusters has been captured. This directly
corresponds with the distribution of the ratings over the companies. Most companies are concentrated in but 14
of the 22 rating classes.
The results of the qualitative analysis performed on the first map should match results of the qualitative
analysis performed on the second map.
The distribution of the companies over the map should locally stay the same.
When the inferred relationships from the two maps are alike then the maps are qualitatively equal. To show
that the local distribution of companies over the map does not change we make use of cluster coincidence plots.
78
The cluster
Iteration 2
4 descriptive analysis
5
4
3
The cluster
Iteration 3
long as the main bulk of the cluster is concentrated in one cluster in the previous iteration there is nothing to
worry about.
4.4.2 Results
Two kinds of sensitivity analysis can be distinguished. The tests on sensitivity of the algorithm aim to show that
the qualitative results stay the same, regardless of the chosen settings for map creation. The tests on data
sensitivity try to show that the results remain equal, regardless of any of the specific choices we made in each
step of the knowledge discovery process.
Ordering of samples
Using a different ordering of the companies generates exactly the same map. We tried using a random and an
inverted order of the companies, both show that the algorithm is stable.
Number of neurons
Maps built using 100, 250 and 1000 neurons are displayed in figures A-19 to A-23 in appendix VI. For each map
the results of the qualitative analysis remain unchanged, so we are inclined to say that the maps strongly match
each other. The cluster coincidence plots, displayed below the maps, re-enforce this image. As they show a
few relatively big bubbles we are confident that the local distribution of companies stays the same.
Eliminating variables
During the course of the analysis we gradually reduce the number of variables from 18 to 8. Although the maps
are not exactly the same, we did not lose much important information when discarding the spurious variables.
The results of the qualitative analysis stay the same, some relationships are even clearer.
S en si t i v i t y a na l ysi s
79
Figures A-14 and A-15 in appendix VI show that cluster coincidence for iteration 1 vs. iteration 2 and for iteration
2 vs. iteration 3 is high. Companies that appear in one cluster in a previous iteration appear in a single or just a
few clusters in the current iteration.
Cluster
coincidence (shown in figure A-26) is high; this shows that the distribution of the companies over the maps
approximately stays the same.
4 descriptive analysis
80
quarter) we expect to find the same clustering. The resulting map is shown in figure A-27 in appendix VI. Using
companies from all four quarters of 1998 (1098 data points) generates a map with the same global ordering of
the clusters as the previous map. Although the map is rotated 90 degrees compared to the original map, the
same relations can be inferred.
Because the merged map contains approximately 4 times as much companies as our final map, we can not
compute the cluster coincidence between these maps. But we can compute the cluster coincidence between our
final map and the placement on the merged map of companies from one specific quarter, this is shown in figure
A-28 in appendix VI. Cluster coincidence is high when comparing only the fourth quarter companies of the
merged map with the final map (found in iteration 3). Cluster coincidence however slightly deteriorates when
comparing companies from the older cross-sections (quarters 3, 2 and 1) with the final map.
In all years the companies with high interest coverage, low leverage, high profitability and high stability
represent healthy companies.
The relative placement of the clusters on the map does not change, indicating the same global ordering of
the companies in the input space. Healthy and large, stable companies are situated near each other, as are
unstable companies and underperformers.
The characteristics for two specific clusters seem to have significantly changed from 1996 to 1997:
-
The large and stable companies show a medium return on assets in the years 1994, 1995 and 1996. In 1997
and 1998 these companies show a high return on assets.
The small companies show a high return on assets in the years 1994 through 1998, and a low return on
assets in 1997.
Table 4-5 shows that the return on assets of large companies significantly improved when comparing 1996 and
1997, and this remained high for 1998. It also shows that the return on assets of small companies significantly
worsened in 1997.
Standard
&
Poors
did
not
changed
their
Cluster
1998
1997
1996
1995
1994
0.036
0.021
0.024
0.022
mean SP rating
criteria
small
0.01
0.028
0.024
0.034
mean SP rating
8.15
8.68
10.66
8.857
S en si t i v i t y a na l ysi s
81
4.5 Benchmark
4.5.1 Principal Component Analysis
To provide a means for comparison we will rework part of our analysis using the principal components
technique, previously discussed in chapter 3. The PCA technique tries to capture the intrinsic dimensionality of
the data by finding the directions in which the data displays the greatest variance. The data can then be
projected on the plane spanned by the first two of these directions.
Data
We once again use the data set containing company values for the fourth quarter of 1998, and all the financial
ratios gathered at the start of our original analysis (eighteen in total). The used software package is XLStat, a
Microsoft Excel add-in containing numerous statistical analysis tools.
Missing values
For a complete principal components analysis no values may be missing from the data. All records containing at
4 descriptive analysis
least one missing value are deleted. Missing values are quite common for financial statement data; for our
82
problem (e.g. insert averages for the missing values), but at this stage we accept the lesser significance of the
analysis this means deleting 158 records out of 287! Normally we would try to find a work-around for the
results. The Self-Organizing Map technique does not have this drawback, the SOM algorithm uses as much of
the available data as possible to create the map.
Correlations matrix
First step in the analysis is the creation of a correlations matrix, displayed in table A-5 in appendix VI. Based on
this matrix the application finds the uncorrelated principal components and corresponding eigenvalues (table A6), which are equal to the variances of the principal components. The total population variance due to each
principal component can be calculated using these eigenvalues (this is shown in table A-7 in appendix VI).
4.5.2 Results
The first eight principal components cover 86% of the variance in the data. Furthermore, the correlations
between original variables and the principal components lose their significance after the first eight principal
components (the correlations do not exceed 60% anymore). We therefore conclude that the linear relations in
the data set can adequately be described using just the first eight principal components. The dimensionality of
the data set has been reduced from 18 original variables to 8 principal components.
The principal components are shown in table 4-6. The characterization of each principal component is based on
the variables having the highest correlations per component.
Table 4-6 Principal components
Principal
component
1
Covered variance
/ cumulative
0.28 / 0.28
Characterization
Variables
0.13 / 0.41
Leverage
0.12 / 0.53
Leverage
4
5
0.08 / 0.61
0.08 / 0.69
Profitability
Size
6
7
8
0.07 / 0.76
0.05 / 0.81
0.05 / 0.86
Market
Stability
Perceived risk
Correlation with
principal component
0.92
0.90
0.90
0.84
0.83
0.80
0.72
0.66
0.63
0.69
0.63
0.81
0.66
0.76
The PCA has grouped the original financial ratios according to the broad classification described in the
beginning of this chapter. Leverage has been divided over two principal components (2 and 3), but the division
is the same as the one found in the self-organizing map analysis. Debt-equity ratio 1 and net gearing are highly
correlated, and so are debt-equity ratio 2 and debt ratio.
Another similarity with SOM is the reduction of the number of variables from 18 to 8. The exact set of variables
found using PCA is slightly different.
-
PCA selected the market variable (beta) where SOM selected another stability variable (c.o.v. total assets).
And where SOM of course selected only original variables, most of the variables found using PCA are
combinations of other variables. We could simplify this by removing all but one of the highly correlated
variables within a principal component.
Be nch ma rk
83
Visualization
Clustering
4
2
0
-2
-4
-6
PCA could be
-8
-10
4 descriptive analysis
10
-5
84
both be used to reduce the dimensionality of large data sets, the found compressed data sets are very much
alike. Using SOM is especially advantageous when one suspects non-linear relationships in the data, as PCA
can not handle these. The added visualization and clustering techniques of the used SOM implementation
provide an insight in the data that PCA lacks.
Table 4-7 Comparison of SOM versus PCA
Software package
User friendliness
Missing values allowed
Dimensionality reduction
Spread of variables over variable
classes
Found relationships
Projection of observations
Added value of projection
Clustering of observations
SOM
Viscovery SOMine 3
high
yes
from 18 to 8
broad
PCA
XLStat (MS Excel add-in)
high
no
from 18 to 8
broad
linear
On flat plane spanned by first two
principal components
low
no
high
yes
4.6 Summary
In this chapter we have used the knowledge discovery process and specifically the Self-Organizing Map
technique to perform a descriptive analysis of the credit rating domain. We started with a basic data analysis, to
get a general feel of the data and already making some important decisions. A single sector (Consumer
Cyclicals) from the available universe of US companies was selected, and for each company we computed the
financial ratios already mentioned in chapter 2. Next to these figures we also downloaded the Standard &
Poors credit ratings. The size variables were log transformed and we used a cut-off for all variables, to take
care of extreme values.
We then proceeded to create a SOM clustering of the observations, using only financial statement data to train
the map. A clustering was found, whereby the clusters can be characterized by the average values for the
financial ratios of the companies in the cluster. Furthermore, when comparing the distribution of the S&P
ratings over the companies in the clusters with the characterizations of the clusters there appears to be a
positive correlation: Companies in the Healthy cluster received high ratings, whereas companies in the
Underperformers cluster received low ratings.
We have made this visual coincidence somewhat more quantifiable using the cluster coefficient of
determination. Our descriptive model based on financial statement data alone explains about 60% of the
variance in the ratings. If we presume the SOM model to be accurate and if the data does not contain any major
errors, then we could possibly attribute the other 40% to the qualitative analysis performed by S&P.
Tests on sensitivity of the algorithm (specific settings of the SOM during training) and sensitivity of the data
(different cross-sections) show that the model and the found results are stable. An analysis performed using
Principal Components Analysis gives similar results, but without the benefit of the insightful visualizations and
clusterings specific to SOM.
S um m a r y
85
5 classification model
Chapter 5 describes our efforts to build a classification model based on financial statement data. Question 5
from the introduction will be answered:
5.
Is it possible to classify companies in rating classes using only financial statement data?
Paragraph 1 describes the general model set-up. Then the construction of our SOM model is extensively
reviewed in paragraph 2. The model is validated in paragraph 3, and in the next paragraph we compare the
SOM model with our two benchmark models, linear regression and ordered logit. The final out-of-sample test is
conducted for all three models in paragraph 5.
The sample of companies in the Consumer Cyclicals sector is randomly divided in a train and validation set
(in-sample) and a test set (out-of-sample).
2.
The map is trained using the train set, then the ratings are predicted for the validation set. This in-sample
training and validating is repeated (using different settings and variables) until we are satisfied with the
found model. The test set is reserved for the final out-of-sample test.
3.
The predicted ratings are compared with the real ratings and several measures of likeliness are computed.
The relative prediction error (or classification performance) of the model can thus be ascertained.
In the following paragraphs we will more thoroughly review each of the model steps.
5 classification model
5.1.2 Data
Equivalent to our descriptive analysis in chapter 4 the used sample consists of companies from the Consumer
88
Cyclicals sector. We use exactly the same data, so we do not need to repeat our basic data analysis. This time
we merge the 8 cross-sections of years 1997 and 1998 (4 quarters in each year) to gain a larger sample, which
should improve the statistical accuracy of the model.
descriptive analysis.
We randomly divide the sample into a train, a validation and a test set. The train set covers approximately half
of all the companies, whereas the validation and test sets each cover a quarter of all the companies. The train
and validation set together form our in-sample dataset, while the test set is reserved for our out-of-sample test.
The train set is used to train the map. We then use the validation set to predict the ratings for the companies in
this set and compare them with the real ratings. Iteratively different settings and sets of variables are tested,
each time re-training the map on the train set and predicting ratings for the validation set. The test set is
reserved for the final out-of sample test, and is not used until we are completely satisfied with the found model.
When dividing the sets we make sure that multiple instances of the same company (from multiple crosssections) remain in the same set. We are thus assured that the map does not base the prediction for a company
solely on a previous or later instance of the same company. We also make sure that all classes are as good as
possible represented in all three sets. This is not always possible because of the very few companies in some
classes. Please refer to paragraph 5.1.4 for more information on the ratings distribution and the corresponding
implications.
39
40
Please refer to chapter 2 for an extensive treatment of the SOM algorithm and the prediction process.
As the rating of each neuron is updated when training the map this also leads to deviations from integer values.
Mod e l s et - up
89
BBB- to a BB+ company. This might not be true. However, the non-linear relationships in the SOM model
compensate for these restrictions, making it possible to represent unequal class widths.
By carefully examining the ratings distribution of all the companies, and per
train, validation and test set, we can get a clearer picture of what results to
expect. The rating distributions are shown in figures 5-1.
The histogram of the overall ratings distribution shows that certain rating
classes are under-represented: Only a few defaults occur (0.36 percent or 7
companies), and no C, AA+ or AAA companies are selected in our universe. The
contribution of the CC to B-, AA- and AA rating classes is very low, so effectively
only the classes from B through A+ (or 8 through 18) are correctly represented.
The average rating is BB+ or 12.
The same image holds true for the train, validation and test set alone.
Additionally, the validation set exhibits an under-representation of the BB+
class and an over-representation of the B+ class. The test set shows an under-
5 classification model
90
Rating Code
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
Now we have a more direct goal: we want to predict the S&P rating as good as possible based on all available
information. The descriptive analysis supported the observation that the ratings contain extra information next
to an assessment of credit risk purely based on financial ratios. This information could be beneficial to our
model, so we should somehow represent it in our model. We achieve this by using the rating as a train variable
during map training. The companies are then clustered on financial ratios and qualitative information. The
rationale behind this kind of semi-supervised training is covered in paragraph 3.7.2.
All
Train
250
350
300
200
250
150
200
Frequency
150
Frequency
100
100
50
50
0
0
1
11
13
15
17
19
21
11
Validation
13
15
17
19
21
Test
100
100
90
90
80
80
70
70
60
60
Frequency
50
91
Frequency
50
40
40
30
30
20
20
10
Mod e l s et - up
10
0
1
11
13
15
17
19
21
11
13
15
17
19
21
Figure 5-1 Ratings distributions for all companies, train set, validation set and test set.
Success ratio
An obvious criterion for the classification error of a model is the success ratio: The percentage of the validation
set for which the map predicts ratings within a specified maximum number of notches deviation from the real
rating. The SOM algorithm necessarily predicts non-integer ratings so we have to convert the predicted ratings
to an integer valued scale. Rounding the numbers to the closest integer number most easily does this. We can
now compute success ratios for 0 notches
deviation, 1 notch deviation, 2 notches deviation,
Success ratio
100
90
70
60
%
80
Per # notches
50
Cumulative
40
30
20
10
0
5 classification model
92
10
MAD =
n =1
~
Rn
,
~
where R n is the real rating for company n, R n is the predicted rating for company n and N is the total number
of companies in the sample. This shows how much the predicted ratings deviate from the real ratings, without
stressing extreme deviations (as opposed to a measure like the standard deviation).
The coefficient of determination or R2 is an often used measure for the performance of a linear regression model.
It shows the variance in the predictions of the model that can be explained by the variance in the variables. A
perfectly classifying model is characterized by an R2 of 1. The non-linearity of the SOM model prohibits us from
directly calculating this R2, but we can calculate a simulated R2 by assuming a linear model to have generated
the found results.
20
15
Prediction
Ratings
Linear (Ratings)
10
y = 0.71x + 3.61
R2 = 0.66
0
0
10
15
20
Real
Statistical validation
Mod e l s et - up
The R describes the fit of the predicted ratings to the real ratings. This is used as a measure for the
performance of the model. In normal linear regression models the coefficient of a variable represents the
contribution of this variable to the model result. The validity of the model is verified by statistically testing that
this contribution was not a chance occurrence.
When we assume (amongst others) the residual errors of the regression to be normal distributed around zero,
then the ratio of the coefficient and its standard regression error follow a students t distribution41. We can test
whether the coefficient is unequal to zero and the contribution of this variable is statistically significant. If we
use a zero coefficient as our null hypothesis, we can reject this hypothesis when the observed t-value exceeds a
certain threshold. For an often used significance of 5% this threshold is 1.96.
For non-linear models it is not possible to directly relate the individual components of the model to the
prediction. We can compute the t-value of the coefficient in our assumed linear model between predictions and
real ratings, but the resulting high significance is in our case rather trivial. It is easy to see that a strong
relationship between predicted and real ratings exists, and the large number of observations only serves to
strengthen this relationship. The contribution of individual variables remains clouded.
41
93
To verify the validity of the model as a whole we can compare its performances with a number of nave and
random models. A good model will perform better than these models, regardless of chosen settings.
5 classification model
94
emphasizing the less-frequently occurring extreme ratings. Finally we will compare the best model with a
number of suitable random models, a constant prediction model and our benchmark models; linear regression
and ordered logit.
Every time when we evaluate a model we not only look at absolute scores, but also at the practical usability of
the used variables and settings. We do not want a model that is perfectly tuned (overfitted) to the current
validation set, but we want a robust model that is easy to grasp and use.
We use a two-year or eight-quarter historical period, as this proved to be the longest period for which no
important changes occurred.
A two quarter publication lag is used (the time between the realization and the publishing of the financial
figures).
The number of used neurons (1000) is approximately equal to the number of used samples in the train set
(1200 companies), this proved to be adequate for describing the data set in our previous research.
The original 18 financial ratios and the S&P ratings are used to train the model.
Model performance
The performance of our initial model is shown in figure 5-4 and in table 5-2.
Success ratio
100
20
90
80
70
Per # notches
50
Cumulative
40
Prediction
15
60
Ratings
Linear (Ratings)
10
30
y = 0.71x + 3.61
R2 = 0.66
20
10
0
0
10
10
15
20
Real
Figure 5-4 Success ratios and ratings plot for our initial model
5 classification model
96
MAD
1.59
Success ratio
0
1
0.23
0.57
R
2
0.76
0.66
at most 2 notches.
A model with fewer variables presents a less chaotic display making it easier to
understand the model and the relationships between the variables and the predicted rating. Such a model is
also easier to use for future classifications, as less additional data has to be downloaded and transformed or
checked.
As presented in chapter 2, the variables have been grouped in six major classes:
-
Interest coverage ratios: these measure the extent to which the earnings of a company cover debt or
interest.
Leverage ratios: these measure the financial leverage created when firms borrow money.
Profitability ratios: profitability ratios measure the profits of a company in proportion to its assets.
Stability variables: stability variables measure the stability of the company over time in terms of size and
income.
Market variables: market variables are used to assess the value investors assign to a company.
We aim to select the most promising variables in each class. We therefore try all major combinations in each
class, while keeping all other variables and settings equal. The variables performing best (classification
performance of the model) and conveying the most information (component plane of the variable) are selected.
We do not try all possible combinations of variables, as this would mean trying 218 - 1 or about 262143
combinations.
The selected variables in each class are shown in table 5-3 . In table A-8 in appendix VII a full overview of the
model scores per variable combination can be found.
For some classes another variable or
combination
of
performs better.
variables
actually
Selected variable
EBITDA interest coverage
Debt ratio
Return on equity
Log total assets
Coefficient of variation of total assets
Coefficient of variation of forecasts year 1
The size variable contributes considerably to an accurate account of the creditworthiness of companies.
This is visually evident from the component planes, as the ratings and size component planes are most
alike (shown in figure 5-5). When we leave out the size variable the prediction results significantly drop.
2.
The market variable does not contribute much to the form of the map, as is shown in figure 5-5. Leaving out
this variable improves the prediction results for all possible performance measures, what leads us to
believe that the market variable only adds noise to the prediction.
Table 5-4 Model performances when removing variable classes and relative importance
of each variable class
Model
(used classes)
All
without interest coverage
without leverage
without profitability
without size
without stability
without market
MAD
1.74
1.90
1.75
1.73
2.13
1.68
1.66
Success ratio
0
1
0.24
0.56
0.21
0.53
0.22
0.51
0.22
0.51
0.19
0.49
0.23
0.56
0.26
0.59
R
2
0.73
0.71
0.74
0.74
0.67
0.74
0.74
0.59
0.54
0.62
0.60
0.41
0.64
0.61
Relative
Importance
n.a.
++
+
+
++++
+
-
5 classification model
98
Figure 5-5 Component planes for S&P Rating, Log total assets and Coefficient of variation of Forecasts
In both analyses a broad selection is made, to represent all financial aspects of a company. Within a variable
class the choices for specific financial ratios may slightly differ, and the market variable class is not represented
anymore in our prediction analysis. One possible explanation for these differences is the different variable
selection procedure: In our descriptive analysis at we have selected the ratios based solely on their contribution
to a good clustering, without taking the ratings into account. Now we are directly relating the financial ratios to
S&P ratings containing quantitative and qualitative information, leading to different choices.
History length: The length of history or the number of quarterly cross-sections to use for training of the
model.
Prediction neighbourhood K: The size of the neighbourhood to take into account when predicting values
from the map.
Using ratings as a train variable: The contribution of the extra qualitative information in the ratings.
The classification performances for all evaluated model parameter settings are displayed in table 5-6. The
ultimately selected settings are displayed in red.
Table 5-6 Classification performances for different parameter settings
Model
5 classification model
100
MAD
Success ratio
0
1
1.49
1.63
1.66
1.79
0.35
0.26
0.26
0.23
0.56
0.56
0.59
0.57
0.75
0.74
0.74
0.79
0.69
0.61
0.61
0.56
1.50
1.60
1.66
1.68
0.26
0.25
0.26
0.26
0.65
0.58
0.59
0.55
0.78
0.76
0.74
0.76
0.67
0.64
0.61
0.62
1.66
1.57
1.48
1.43
1.42
1.44
0.26
0.26
0.28
0.28
0.31
0.29
0.59
0.62
0.63
0.63
0.64
0.64
0.74
0.74
0.77
0.77
0.78
0.77
0.61
0.65
0.67
0.68
0.69
0.69
1.50
1.66
1.88
0.27
0.26
0.20
0.64
0.59
0.51
0.78
0.74
0.72
0.65
0.61
0.55
History length
A longer historical period means a larger sample and a statistically more sound model. But a too long historical
period could also obscure some relationships in the data because of changed environments for companies and
changed measures for extending ratings to companies.
Initially a two year history was used (the eight quarterly cross-sections of 1997 and 1998 merged). Using only
one quarter of data generates a better classifying model than using two full years (or eight quarters) of data.
Because of the higher statistical significance (due to the larger sample) we do not change our initial two-year
historical period.
Number of neurons
The number of neurons should be tuned to the way we want to use the SOM. A large number of neurons (>=
number of samples) gives a more accurate description of the data. A smaller number of neurons (<< number of
samples) produces a more general map, which predicts better in a multitude of cases.
As the number of observations in our sample equals 1200, we have used about 1000 neurons to train the map in
our initial model. When we try maps using 250, 500 and 2000 neurons, we find that less detail (250 - 500
neurons) builds a better generalizing map.
Prediction neighbourhood K
In the previous models we always used the single best matching neuron for the current company to extract a
prediction from the map. A common variant of the prediction algorithm is to take a weighted average of the
rating over the K neighbouring neurons (in the input space). The less nearby neurons in the neighbourhood
contribute less to the prediction, in a linear fashion. There is a correspondence between choosing a larger K for
prediction and a smaller number of neurons when training the map. Both have the effect of generalizing the
predicted values for the ratings, so for clarity in our final model we should choose one of the two methods to
enhance the generalizing capabilities of the map, and not use both.
The results in Table X show that the generalizing effect of using larger Ks gives better classifications. We opt to
only use the K neighbourhood as a generalizing instrument, as this is more flexible than using less neurons.
The map still accurately represents the underlying dataset (no information is lost), while we can vary the
generality of the predictions.
contribute to a better clustering of the companies resulting in better classifications. The additional information
in the ratings is contradicting the information contained in the financial ratios.
If we do not use the ratings when clustering the companies we can be sure that only financial information is
taken into account when classifying a company. The assigned rating is an average rating for companies in
similar financial situations. If we do use the rating as a train variable, then the clustering is based on financial
information and the qualitative information as expressed in the ratings. Some companies, financially speaking
belonging to a cluster of e.g. AA rated companies, are clustered with BBB rated companies because of a rating
downgrade based on qualitative factors. The average values for the financial ratios for these BBB rated
companies are off-set by these AA companies in disguise, leading to worsened predictions for true BBB
companies. If the clustering is more dependent on the ratings (larger weight for the ratings variable during
training), then this effect is more noticeable in worsened classification performances.
The qualitative up- and downgrades of companies are not systematic for companies in certain financial
situations, they are more random-like.
systematic then they would be expressed in a higher or lower rating average for companies in these financial
situations. Using the ratings as a train variable is only profitable when we are certain that the additional
5 classification model
102
information in the rating does not contradict the information contained in the other variables. More generally
speaking we should restrict the use of the target variable as a train variable to these models where the target
variable is not contradicting the other variables of the model.
5.2.4 Results
Based on our analyses we can summarize our model so far:
-
Debt ratio,
Return on equity,
The size of the map is 1000 neurons, or approximately as much as the number of observations.
The model results are shown in figure 5-6 and table 5-7.
The performance figures show that most of the classification performance of the model can be captured using
just a subset of five of the original eighteen variables. The performance loss is relatively small. After adjusting
some of the parameters the performance of the model clearly improves. For most parameters the initially
chosen settings were adequate, the greatest performance increase is found in the larger prediction
neighbourhood size and in not using the ratings as a train variable.
Absolute deviation
100
20
90
80
70
Prediction
15
60
Per # notches
50
Cumulative
40
Ratings
Linear (Ratings)
10
30
y = 0.63x + 4.50
R2 = 0.71
20
10
0
0
10
Figure 5-6 Success ratios and ratings plot for SOM model
MAD
1.59
1.66
1.40
Success ratio
0
1
0.23
0.57
0.26
0.59
0.29
0.64
10
Real
R
2
0.76
0.74
0.81
0.66
0.61
0.71
15
20
103
first
comparison
involves
constant
Success ratio
90
80
70
60
5 classification model
Per # notches
50
Cumulative
40
30
20
10
0
0
10
104
42
Success ratio
100
20
90
80
15
60
Per # notches
50
Real
70
Cumulative
40
Ratings
10
Linear (Ratings)
y = -0.02x + 12.26
R2 = 0.00
30
20
10
0
10
10
15
20
Prediction
The figures show that the random model performs poorly. To verify that we did not just happened to stumble
upon a relatively bad predicting random model we have simulated 100 random models (this is also known as
bootstrapping). The histogram of the Mean Absolute Deviation for these models is shown in figure 5-9. The
MAD of the models seems to be normal distributed around a mean of 4.11.
105
20
18
16
25
20
15
Frequency
10
14
12
10
8
6
Frequency
4
2
0
5
0
3,78 3,85 3,91 3,98 4,05 4,11 4,18 4,25 4,31 4,38 More
3,11 3,13 3,14 3,16 3,17 3,19 3,20 3,22 3,23 3,25 More
Figure 5-9 Distribution of MAD for 100 random models with and without averaging
We have not yet accounted for the averaging the SOM uses when predicting ratings, so we repeat the
simulations using an average over 50 predictions as the predicted rating. The results are also shown in figure 59. The MAD of the models converges on the same MAD as the constant prediction (3.20). Furthermore, the
spread of the MAD is much smaller.
Comparing the MAD of our SOM model (1.40) with the distribution of the random models shows that it is highly
unlikely that we have struck upon a good model by chance.
The main bulk of the sample has an average rating, so the model will be best fitted for these kind of
companies. Classifications of extreme rated companies will always be biased towards the average, which is
higher for low ratings and lower for high ratings.
2.
Lower ratings are difficult to classify too low, as there are hardly any lower classes. Vice versa the same
holds for high ratings.
Table 5-8 Performance comparison
5 classification model
106
Model
MAD
SOM
Constant
Random
Equalized SOM
1.40
3.19
4.11
1.46
Success ratio
0
1
0.29
0.64
0.03
0.18
0.10
0.27
0.22
0.60
R
2
0.81
0.33
0.41
0.76
0.71
0.00
0.71
A comparison of the deviation per class before and after equalization is given in figure 5-10. The average
prediction errors are more constant, especially for the positive deviations. The classification bias seems to have
been removed for the middle classes, at the expense of larger positive and negative peaks. The overall
classification performance slightly deteriorates, as is displayed in table 5-8.
12
10
8
6
4
2
0
-2
-4
-6
-8
-10
-12
max
min
pos
neg
11
13
15
17
19
21
rating class
# notches
# notches
max
min
pos
neg
11
13
15
17
19
21
rating class
5.4 Benchmark
We use two other models as a benchmark for the classification results of our SOM model. The first is a standard
linear regression model, the second is a more advanced technique called ordered logit.
5 classification model
108
43
More information on the ordered logit model can be found In Chapter 2 and in Fok, D., 1999. We would like to thank Dennis for the
use of his ordered logit application and for his help with the interpretation of the results.
SOM variable
Imp
Interest
coverage
Leverage
Profitability
EBITDA interest
coverage
Debt ratio
Return on equity
++
Size
Stability
Market
+
+
+++
+
+
Linear regression
variable
EBITDA interest
coverage
Debt ratio
Return on total
assets
Operating income
/ sales
Net profit margin
Log total assets
Coef
0.68
9.94
-0.45
0.24
-6.81
3.44
0.24
3.64
-0.23
2.06
-3.71
34.51
-0.45
0.13
-0.17
-8.24
2.21
-3.09
Ordered Logit
variable
EBITDA interest
coverage
Debt ratio
Return on total
assets
Operating income
/ sales
Net profit margin
Log total assets
Coef
0.63
7.62
-0.65
0.31
-7.94
4.28
0.33
4.51
-0.13
2.19
-2.07
26.48
-0.45
-8.43
Beta
-0.18
-3.16
Selected variables
The variables selected in the linear regression or ordered logit analysis do not substantially differ from the
variables selected in the SOM analysis. Furthermore, the relative importance, coefficients and signs are similar
in all three models.
The linear regression / ordered logit variable combination for the Profitability class has also been investigated
in our SOM analysis, but did not lead to better performances. Likewise the Market class was dropped from the
SOM model, the small coefficient in our linear regression and ordered logit analysis confirms this.
The net profit margin has a negative sign in the linear regression and ordered logit model, while we would
expect a positive sign. This is probably due to the somewhat flawed definition of the variable as was observed
in our SOM descriptive analysis. This is one of the reasons why we opted not to use this variable in the SOM
model. The relative small coefficient shows that the variable does not contribute much to the linear regression
and ordered logit model, either.
Be nch ma rk
109
Rating scale
The linear regression model presumes an ordered rating scale with equal class widths. The ordered logit model
only presumes an ordered rating scale, the boundaries and thus the class widths are estimated. The estimated
boundaries for the ordered logit model are displayed in table A-9 in appendix VII.
The estimated scale shows that all classes are of approximately equal width, they do not differ much from the
class widths in the linear regression model.
Performance
The performance of SOM, linear regression and ordered logit are compared in table 5-10, deviations per class
are shown in figure 5-11. The performances for all three models are similar, especially in the middle classes.
The classification bias is present in all three models.
110
12
10
8
6
4
2
0
-2
-4
-6
-8
-10
-12
max
min
pos
neg
11
13
15
17
19
21
# notches
5 classification model
# notches
max
min
pos
neg
rating class
# notches
12
10
8
6
4
2
0
-2
-4
-6
-8
-10
-12
max
min
pos
neg
11
13
11
13
rating class
15
17
19
21
rating class
Figure 5-11 Deviation per class for SOM, linear regression and ordered logit in-sample
15
17
19
21
The good scores for the linear regression model indicate that the possible non-linearities in the data might not
be that influential at all. Also the conversion of the letter rating scale into an (equally spaced) numerical scale
does not seem to cause much problems. This may be due to the large number of classes (22) giving a close
approximation to a continuous scale. The similarities in deviations for linear regression and ordered logit
strengthen the image that the ordered logit model approximates a pure linear model.
Table 5-10 Performance comparison
Model
MAD
SOM
Linear regression
Ordered logit
1.40
1.52
1.44
Success ratio
0
1
0.29
0.64
0.27
0.59
0.28
0.64
R
2
0.81
0.82
0.83
0.71
0.65
0.67
Be nch ma rk
111
5 classification model
112
validation set were also derived, so we would not expect large qualitative differences between these sets.
During the re-estimation of the linear regression model the Coefficient of variation of forecasts variable was
Table 5-11 Out-of-sample performances
Model
MAD
SOM out-of-sample
SOM in-sample
Linear regression out-of-sample
Linear regression in-sample
Ordered logit out-of-sample
Ordered logit in-sample
1.48
1.40
1.48
1.52
1.38
1.44
Success ratio
0
1
0.25
0.60
0.29
0.64
0.21
0.59
0.27
0.59
0.28
0.60
0.28
0.64
R
2
0.82
0.81
0.84
0.82
0.85
0.83
0.64
0.71
0.65
0.65
0.66
0.67
removed from the model because of a too small t-value. Likewise the ordered-logit algorithm removed the Net
profit margin variable. In view of our earlier comments on this variable with respect to the in-sample models
this comes as no surprise.
three models, larger errors seem to be somewhat magnified by the SOM. We again notice striking similarities
between linear regression and ordered logit.
Deviation per class for Linear regression
12
10
8
6
4
2
0
-2
-4
-6
-8
-10
-12
max
min
pos
neg
11
13
15
17
19
# notches
# notches
21
max
min
pos
neg
rating class
11
13
15
17
19
21
rating class
# notches
Out- of- sa mp l e t es t
max
min
113
pos
neg
11
13
15
17
19
21
rating class
Figure 5-12 Deviations per class for SOM, linear regression and ordered logit out-of-sample
MAD
1996
1995
1994
1.46
1.56
1.78
Success ratio
0
1
0.25
0.65
0.23
0.59
0.24
0.54
R
2
0.86
0.82
0.76
0.66
0.67
0.53
environment has apparently too much changed for the model to still be accurate. This indicates that we can not
keep the model constant. We should periodically retrain the map to be sure that the SOM incorporates the
latest insights and valuations of the companies in specific financial situations.
5 classification model
114
much lag as the rating agency, then we should see a higher than average spread when our model rates the
bond lower than S&P. Vice versa, if our model assigns a higher rating to a bond then the spread should be
smaller than average.
The following analysis is a first attempt to model this relationship. No definite conclusions should be drawn
from these results. We should also keep in mind that the market is not always correct. It is possible that rating
agencies uncover previously unknown information during their qualitative analysis and that the spread (the
market) reacts upon the resulting rating change.
Data
We use bond data from Lehman Brothers, a well known broker and dataprovider, selected from their universe of
bond indices. The senior unsecured bonds are all chosen from the Consumer Cyclicals sector, and from all
possible rating classes. The bonds are linked to our data using the CUSIP code, a code containing a general part
identifying companies and a specific part identifying individual bonds.
Two problems immediately arise regarding the linking of individual bonds to companies:
44
1.
Lehman Brothers uses another definition for the Consumer Cyclicals sector. Some of the companies
belonging to Consumer Cyclicals according to our definition are not included in the LB universe, and some
of the companies in the LB Consumer Cyclicals sector belong to other sectors in our universe.
2.
For a lot of bonds the company has merged with other companies or has otherwise disappeared, while the
bond still exists for the old company. The CUSIP codes for the bond and the company do not match, even
when the same company is involved.
This results in 50% of the bonds not being matched to earlier downloaded
33.5%
9.3%
5.1%
2.3%
0.4%
49.4%
Original sector
Consumer Cyclicals
Consumer Staples
Financials
Capital Goods
Technology
Unmatched
Option adjusted spread: The spread of the bond, adjusted for specific bond types like callables. The
spreads were downloaded for the last day of the fourth quarter in 1998.
S&P rating
Pre-processing
We have grouped the bonds according to maturity into buckets of 1 to 5 years and 5 to 10 years. The outer rating
classes (1 to 7 and 18 to 22) have been removed, because the predictions for these classes are severely biased.
Within the buckets we have standardized45 the spreads per rating class, thus making a comparison over the
classes possible. These standardized spreads are compared with the deviation of the predicted from the real
rating.
If the predicted rating is a better measure for the risk perceived by the market, then the spread should be
proportional to the deviation of the predicted rating from the real rating. A lower than average spread, or
negative standardized spread, should be accompanied by a higher predicted rating (a positive rating deviation).
A higher than average spread, or positive standardized spread, should be accompanied by a lower predicted
45
Standardization means subtracting the average of the spreads in the rating class from the current spread and dividing by the
standard deviation of the spreads in the rating class.
Out- of- sa mp l e t es t
115
rating (a negative rating deviation). An average spread, or a zero standardized spread, should be accompanied
by an equal predicted rating (no rating deviation).
In a scatterplot the data would be distributed as a line from the upper left quadrant to the lower right quadrant,
with a possible concentration of datapoints around zero.
Results
The scatter plots for the two maturity buckets are shown in figure 5-13. The sought-after relationship does not
show in the figures, the points seem to be randomly scattered in the plot. A negative or positive standardized
spread is as much correctly predicted as it is not.
Spread deviation vs rating deviation (5-10 yrs)
5 classification model
-4.00
-3.00
-2.00
-1.00
0
0.00
-2
1.00
2.00
3.00
4.00
-4
116
4
2
-4.00
-3.00
-2.00
-1.00
0
0.00
-2
1.00
2.00
3.00
-4
-6
-6
Spread deviation
Rating deviation
Rating deviation
-8
y = -0.0665x - 0.2597
R2 = 0.0014
Spread deviation
y = -0.0649x + 0.0677
R2 = 0.0006
5.6 Summary
In this chapter we have constructed a classification model, using the same data as in our descriptive analysis of
chapter 4. Before arriving at the final model, several variations have been tested. Each variation is evaluated
using three different performance measures; the success ratio, the mean absolute deviation and the R2.
First we created our initial model.
parameters, after each change testing the effect on model performance. In the final model the original 18
variables have been reduced to 5 variables without sacrificing too much classification performance.
To validate the model we have compared it with a constant predicting model and with 100 random predicting
models. The comparison with our benchmark models, linear regression and ordered logit, provide another way
to test the validity of the model. The in-sample and out-of-sample tests show comparable results for all the
models. The selected variables are similar, and so are the classification results.
S um m a r y
117
6 conclusions
In this chapter we draw our conclusions. The central question from chapter 1 is revisited, and the answers to the
sub-questions (given in the previous chapters) are summarized. Finally some directions for further research are
given.
6.1 Conclusions
In this thesis we tried to answer the following central question:
In what way can we use Self-Organizing Maps to explore the relationship between financial statement data
and credit ratings?
We have broken down this question into five sub-questions, each answered in separate chapters of this thesis.
1.
What are credit ratings and how is the credit rating process structured?
Chapter 2 provided a theoretical background on bonds, credits and credit ratings. We have seen that the credit
rating is basically an opinion on creditworthiness of the credit issuer, often a company or government. The
6 conclusions
120
number of defaults in each rating class shows that the credit rating is a good measure of creditworthiness.
The credit rating is determined by two main factors: A quantitative analysis of the balance sheet and income
account, and a qualitative analysis concerning the management of the company, the economic expectations of
the sector and other non-quantitative elements that could affect creditworthiness. A review of the Standard &
Poors credit rating process shows that rating agencies put much more weight on the qualitative analysis than
on the quantitative analysis.
2.
What are Self-Organizing Maps and how can they aid in exploring relationships in large data sets?
In chapter 3 we showed that Self-Organizing Maps are an innovative way to visualize the used data set. The
observations in the input space are projected on a surface (a neural network) that can stretch and bend to better
accommodate the distribution of the data in the input space. This projection is then visualized as a twodimensional map, surrounded by components representing the used variables. Additionally, the observations
are clustered according to likeliness of the underlying variables.
The full SOM display contributes to a better understanding of the underlying domain. Relationships between
variables, linear and non-linear, are clearly visible. The clustering of the mapped observations provides an
insight into the likeliness of the observations. The SOM thus functions as a descriptive tool. The SOM can also
be used as a prediction model. The found clustering is then of less importance; we use the stretched and
bended map surface as a form of non-linear regression.
3.
Is it possible to find a logical clustering of companies, based on the financial statements of these
companies?
4.
If such a clustering is found, does this clustering coincide with levels of creditworthiness of the companies
in a cluster?
These two questions were answered in the descriptive analysis of chapter 4. Using a selection of financial ratios
we created a Self-Organizing Map display of the US companies in sector Consumer Cyclicals. Qualitative
information was not taken into account when creating this SOM. The resulting display showed that a clustering
of companies based on financial ratios is very well possible. The found segmentation grouped companies in an
intuitively logical manner. Furthermore, when we compared the clustering with actual credit rating levels we
found a strong relation. Approximately 60 % of the variance in the ratings was matched by the found clustering.
If we assume the used variables and the model to be correct, then we might attribute the 40% residual variance
to be due to the qualitative analysis performed by S&P.
Co nc l u sio ns
121
5.
Is it possible to classify companies in rating classes using only financial statement data?
In chapter 5 we used the results of our descriptive analysis to construct a classification model. The final results
show that it is possible to predict credit ratings, but only to a certain extent. About 80% of the companies in the
sample are classified with an error of at most two notches. Once again it is most likely that this is due to the
qualitative analysis, which can not be duplicated by a quantitative model.
Even if the model is not suitable for classifying companies exactly correct, the model still gives an improved
insight into the credit rating process. The qualitative analysis is less important than S&P might lead us to
believe and the most important variables determining creditworthiness are size and interest coverage.
Furthermore, the classification of the model can be used as a first approximation when the real credit rating is
currently not available. The stability of the performances when comparing different techniques show that we
have found a stable model for this sector, in which the selected variables are most important.
6 conclusions
122
On a more computer science related note, it would be interesting to further explore and explain the use of semisupervised learning (using the ratings as a train variable) with the Self-Organizing Map. A comparison with
normal supervised learning would give an insight into the special characteristics of semi-supervised learning
and effects on model performance.
7 bibliography
Bishop, C.M., 1995. Neural networks for Pattern Recognition, New York, United States of America, Oxford
University Press Inc. .
Brealey, R.A. and Myers, S.C., 1991. Principles of corporate finance, Fourth edition, United States of America,
McGraw-Hill Inc. .
Cantor, R. and Packer, F. The credit rating industry, FRBNY Quarterly Review / Summer-Fall 1994
Deboeck, G., 1998. Visual explorations in finance with Self-Organizing Maps, London, Great Britain, SpringerVerlag.
Fabozzi, F.J., 1993. Fixed Income Mathematics, Revised edition, Chicago, Illinois, United States of America,
Probus Publishing Company.
Fayyad, U.M., Piatetsky-Shapiro, G. , Smyth, P. and Uthurusamy, R., 1996. "Advances in knowledge discovery
and data mining, Menlo Park, California, United States of America, American Association for Artificial
Intelligence Press.
Fok, D., 1999. Risk profile analysis of Rabobank investors, masters thesis, Erasmus University of Rotterdam.
Geluk, I. en Van der Hart, J., 1996. Cursus econometrie in de praktijk, Rotterdam, Robeco Group.
Greene, W.H., 1997. Econometric Analysis, Third edition, Upper Saddle River, New Jersey, United States of
America, Prentice-Hall.
Johnson, R.A. and Wichern, D.W., 1992. Applied Multivariate Statistical Analysis, Third edition, Englewood
Cliffs, New Jersey, United States of America, Prentice-Hall.
www.cis.hut.fi/~sami/thesis/
thesis_tohtml.html.
Kohonen, T., 1997. Self-Organizing Maps, Second Edition, Heidelberg, Germany, Springer-Verlag.
bibliography
124
Moodys, 2000. Historical default rates of corporate bond issuers, 1920-1999, www.moodys.com.
appendix
Overview
Artificial neural networks are conceptual models of networks consisting of small simple processing elements
called neurons. These neurons are inspired by real neurons (figure A-1) and the way they function in the human
brain: The neurons are interconnected (hence the term network), the output of a neuron serves as one of the
inputs for another neuron. The output of each neuron is based on its inputs using some kind of mathematical
function, all neurons operate in parallel.
Neural networks originated in artificial intelligence research, which in the sixties and seventies produced 'expert
systems' that somehow failed to capture certain key elements of human intelligence. These expert systems
were based on a model of the high-level reasoning processes. Thus some of the research focused on mimicking
the lower level structure in the brain, hoping that this would yield better results.
Of course the human brain is much
more complex than the relatively
appendix
126
our
neural
'artificial'
is
networks
often
(the
omitted)
word
use
brain. Therefore neural networks should not be regarded as accurate representations of (parts of) the brain but
more as a general model of the behavioural processes in the brain.
Engineers use neural networks for signal processing and automatic control.
Cognitive scientists view neural networks as a way to model thinking and consciousness (higher brain
functions).
Neuro-physiologists use neural networks to model sensory systems, memory and motorics (medium
level brain functions).
Ar tifici al n eu ra l n et wo rk s
127
II
10
6
Observations
5
Neurons
0
0
10
Iteration 1
appendix
128
We first identify the winning neuron that most closely resembles the first observation. This is of course neuron
a. The placement of this neuron in the input space is adjusted to match observation 1 (figure A-3). We now say
that observation 1 is mapped to neuron a.
10
10
Observations
Neurons
Neurons
Observations
1
a
a
1
0
0
10
10
Besides the winning neuron the neighbours of the winning neuron are also adjusted, but to a lesser degree
depending on the neighbourhood function. This is shown in figure A-4.
Iteration 2 through 5
Figures A-5 through A-8 show how the neurons are updated to match each of the observations . The learning
rate factor reduces the adjustments for the later iterations.
10
10
9
3
b
c
6
6
c
Observations
Neurons
a
Observations
Neurons
0
0
10
10
10
9
8
b
129
Observations
Neurons
Observations
Neurons
c
4
10
10
Output map
The final map after the adjustments in iteration 5 is shown in Figure A-8. The associated output map is shown in
figure A-9. This output map is a representation of the neural network in the input space. The neurons and their
associated observations are displayed, but the absolute distance information is lost. This is later reintroduced
by colour coding the map.
Output
Neurons
1
a
2,3
b
4,5
c
appendix
130
III
In this example the SOM is used as a descriptive tool in the medical domain. In medical research the sample
size is often necessarily small, so statistical inference is more difficult. We show how the SOM can visually aid
in providing a good understanding of the data at hand.
Data
A Self-Organizing Map is used to display the characteristics of persons for whom rectal muscle sizes were
measured using ultrasound images. The scans were performed at the Academic Hospital of Maastricht by Dr.
Regina Beets-Tan for her PhD research.
The sample consists of a group of 60 test subjects, 46 females and 14 males. The age varies from 19 to 72, and
some of the women have given birth while others have not. The test subjects are chosen in such a way that no
bias towards age or number of births is present in the sample.
For each test subject the following five muscles in the rectal area were measured:
-
Longitudinal muscle
Perineal body
For each test subject the following additional information was recorded:
-
Sex
Age
Weight
Length
Training
The SOM is created using the five muscle sizes as train variables. The additional information is not used to train
the map. We can now answer the following two questions:
1.
2.
Do these clusters coincide with groupings based on sex, age, number of births, weight or length?
The Self-Organizing Map is displayed in Figure A-10. The five train variables are displayed at the bottom
(Internal sphincter, Longitudinal muscle, External sphincter, Total sphincter thickness, Perineal body). The
clusters and independent variables are displayed above (in the order Sex, Age, Partus x, Weight and Length) .
appendix
132
The SOM has formed 7 clusters, purely based on rectal muscle sizes. More important than the exact boundaries
of the clusters are the overall relationships we can infer from this display.
Inferred relationships
Looking at the Sex component we clearly see most of the males grouped together. This means that based on
measurements of the rectal muscles we should be able to distinguish a man from a woman. Furthermore, when
comparing the Age component with the Sex component it becomes clear that mostly younger males
participated. A third fact about the male test subjects is captured in the Perineal body component, the values
are low for all clusters containing males, and high for almost all females. The Perineal body can not be
measured for males, hence the difference. In a more extensive analysis we would have to correct for this
difference.
The Partus x component reveals a clustering of women not having given any births at all. These women are
characterized by a small internal sphincter and a small longitudinal muscle. The relative random pattern for
age, weight and size in this cluster confirms that this relationship holds for all women in the sample, not just the
young or small ones. As the weight and size components show a definite resemblance, we know that they are
correlated. This is what we would expect, taller people are heavier than short people.
Conclusions
For medical appliances the Self-Organizing Map can serve as a valuable tool to enhance the understanding of
the underlying domain. The dataset is accurately represented, the relatively small sample size does not form an
insurmountable problem. However, the inferred relationships have to be used with care.
IV
The Self-Organizing Map is especially suitable for use in the marketing domain. Large data sets containing
(often non-linear) customer data are more and more common for corporations of all sizes. Finding relationships
in these databases and using them to optimize the relationship with the customers is known as Customer
Relationship Management.
In this example we will show how customers of the Rabobank can be grouped according to their investment
preferences, as expressed in a short survey. We then compare these expressed preferences with some real
characteristics of their investment portfolios over the previous year.
Data
The sample consists of 1000 investing customers of the Rabobank. Each customer has been asked to fill in a
survey, consisting of 24 questions. For each question five answers are possible, ranging from Fully disagree,
Disagree, Disagree nor agree, Agree, to Fully agree. The full questionnaire can be found at the end of
this chapter.
appendix
134
Next to these questions we also recorded the number of transactions, the use of the Internet or the Rabo
Orderlijn (direct telephone contact with a Rabobank broker), the age, the total size of investments, and the
length of the relationship between the customer and the Rabobank.
The use of the Internet or the Rabo Orderlijn is used by the Rabobank marketing department as a dependence
variable: Clients are considered independent if they have used the Internet or the Rabo Orderlijn at least once
to make a transaction.
The size of investments and number of transactions variables have been log
transformed, to equalize the sometimes large differences in total invested assets or number of transactions.
Training
The SOM is created using the answers to the 24 questions as train variables. The additional information was not
used to train the map. We can now answer the following questions:
1.
Is it possible to make clusters of customers (a customer segmentation) with different investment profiles
based on answers from the survey?
2.
Does the observed behaviour of customers (number of transactions and independence) coincide with these
investment profiles?
The SOM is displayed in figure A-11. None of the train variables are displayed, the components that are
displayed are the Log size of investments, Log number of transactions, Relationship length, Age and
Independence.
S O M ex a m p le : C u sto m e r s e g m en t a t i o n
135
Inferred relationships
If we only take the answers given at the survey into account the customers can be segmented into three distinct
groups. The most independent customers are located in the bottom left corner of the map. They say to have an
active investment style, and they do not need advice from the bank before making a decision. The somewhat
less independent customers are situated in the upper left and the middle of the map. Although they clearly
want to make their own decisions, they still seem to benefit from consultation with the bank. The most
dependent group can be found in the right portion of the map. Before making any investment decision they
would like to receive advice from their financial advisor at the bank. They also want to be kept up-to-date on
their portfolio and on the current events of the market.
The independence variable fits reasonably well to the formed clusters. The customers with a dependent
investment style have almost never used the Internet or Rabo Orderlijn, as was to be expected. For the
somewhat less independent customers we see that some have used the Internet and some have not. For the
independent customers we would expect this to be somewhat higher, there are apparently some customers who
say to act independently but never really do (when not using the Internet or Rabo Orderlijn an advisor always
comes into play).
The size variable shows that the richest customers are in general dependent. Logically they are also the older
customers, as can be seen from the age component plane.
transactions component reveals that no direct relation with the found investment profiles can be made.
Independent investors do not necessarily perform more transactions. The relationship length is also randomly
distributed over the map, but shows at some points a correlation with the independence variable: Customers
that have had longer relationships with the Rabobank have never used the Internet or the Rabo Orderlijn.
Conclusions
The SOM gives an attractive overview of the Rabobank customer sample. Based on the answers from the
survey, the customers can roughly be divided into three groups, each having distinct investment preferences.
appendix
136
Although easy to measure, we can unfortunately not use the number of transactions to determine the group
membership of a customer. We can be relatively sure that dependent investors do not use the Internet or the
Rabo Orderlijn.
Survey questions
1.
2.
3.
4.
5.
There are so many investment possibilities that it is hard to keep track of them all.
6.
7.
Investing is fun.
8.
9.
S O M ex a m p le : C u sto m e r s e g m en t a t i o n
137
med(|x-med(x)|)
st.dev.(x)
appendix
50%
138
66%
Figure A-12 The median standard deviation in relation to the standard deviation
On a normal distributed variable the area within plus or minus one standard deviation encompasses 2/3 of the
distribution. The boundary of the area encompassing 1/2 of the distribution lies at plus or minus one median
(according to the definition of the median). The ratio of this median to the standard deviation (on a normal
distributed variable) is 0.6745. To convert the previously found measure to one comparable with a standard
deviation we multiply it with 1/0.6745. Thus
Skewness
This measures the asymmetry of a distribution. It is defined as
1
T
S=
( x t x )3
t =1
where T is the number of observations in the sample. For symmetric distributions the skewness is 0.
Kurtosis
The kurtosis measures the thickness of the tails of the distribution. It is defined as
1
T
K =
( x t x )4
t =1
A normal distributed variable has a kurtosis of 3. The kurtosis we calculated is the excess kurtosis, this is the
kurtosis - 3. Values greater than 10 give rise to suspicions of non-normality.
Jarque-Bera
The final test on normality is the Jarque-Bera test. The statistic is given by
T 2 1 2
[S + K ]
6
4
where S is the skewness and K is the excess kurtosis. We can say a variable is normal distributed with 95%
confidence when the result of the statistic 5.99 (2 distribution with 2 degrees of freedom).
St ati stic a l m ea s ur e s an d t es t s
139
VI
Descriptive analysis
return on
assets
return on
equity
net gearing
debt-equity
ratio 2
debt-equity
ratio 1
debt ratio
5.07
6.73
0.20
0.61
0.79
0.56
1.76
-0.09
0.03
median
2.39
3.51
0.05
0.56
0.97
0.50
2.02
0.03
0.03
stdev
8.94
10.27
1.84
0.91
12.73
0.35
19.45
1.69
0.04
medstdev
2.71
3.10
0.06
0.26
1.01
0.27
1.44
0.04
0.02
minimum
-4.65
-4.38
-0.29
-10.59
-164.48
0.01
-237.55
-19.62
-0.16
maximum
86.50
98.50
31.30
6.02
75.34
2.52
148.33
8.23
0.25
282
282
288
229
289
268
292
286
289
12
12
65
26
0.014
0.011
0.00
0.01
0.01
0.01
0.01
0.01
0.01
#NA
> |3stdev|
140
EBIT / total
debt
mean
count
appendix
EBITDA
interest
coverage
variable
EBIT interest
coverage
Table A-1 Summary statistics per variable before cut-off, fourth quarter 1998 (continued on next page)
> |3mstdev|
0.12
0.12
0.12
0.05
0.15
0.04
0.18
0.13
0.05
skewness
4.87
4.83
16.82
-7.10
-7.37
1.98
-5.60
-9.07
1.335
kurtosis
33.81
33.13
284.45
105.62
105.99
6.80
93.45
104.07
12.01
4461.47
4887.81
106999.38
2851.11
35949.60
6.51
54740.28
3828.38
1.57
J.-Bera
Table A-2 Summary statistics per variable after cut-off, fourth quarter 1998 (continued on next page)
mean
4.58
6.19
0.09
0.63
1.16
0.55
2.26
0.02
0.03
0.03
median
2.39
3.51
0.05
0.56
0.97
0.50
2.02
0.03
stdev
6.15
7.13
0.13
0.40
5.61
0.31
8.24
0.23
0.03
medstdev
2.71
3.10
0.06
0.26
1.01
0.27
1.44
0.04
0.02
minimum
-4.65
-4.38
-0.29
-1.03
-24.35
0.01
-33.86
-1.01
-0.05
maximum
26.79
31.39
0.61
2.14
26.29
1.59
37.89
1.07
0.10
282
282
288
229
289
268
292
286
289
12
12
65
26
> |3stdev|
0.04
0.04
0.03
0.03
0.04
0.03
0.04
0.05
> |3mstdev|
0.12
0.12
0.12
0.05
0.15
0.04
0.18
0.13
0.05
skewness
2.03
1.98
2.12
1.29
-0.17
1.16
-0.20
-0.35
0.35
kurtosis
4.48
3.93
6.21
4.91
12.89
1.78
10.61
13.61
1.33
40.87
40.75
1.44
2.75
182.06
0.57
177.96
8.27
0.01
count
#NA
J.-Bera
eps
beta
cov forecasts
market value
total assets
net profit
margin
op inc / sales
variable
mean
0.16
-0.06
4283.61
3985.04
0.74
0.33
-0.01
1.21
0.23
median
0.13
0.00
1377.87
782.85
0.78
0.25
0.02
1.16
0.24
stdev
0.15
0.67
20717.12
14354.96
7.29
0.27
0.64
0.59
1.36
medstdev
0.10
0.01
1455.80
991.75
1.13
0.20
0.02
0.51
0.48
minimum
-0.55
-10.39
69.27
0.71
-33.18
0.02
-5.17
-0.58
-16.16
maximum
0.86
0.77
257389.00
191264.00
34.74
1.25
3.67
2.93
6.32
count
268
294
294
266
294
294
220
264
274
26
28
74
30
20
> |3stdev|
0.01
0.01
0.01
0.01
0.03
0.02
0.02
0.01
> |3mstdev|
0.05
0.24
0.11
0.22
0.20
0.07
0.18
0.01
0.06
skewness
0.90
-13.10
11.23
9.82
-0.25
1.52
-4.28
0.15
-6.57
kurtosis
4.92
197.25
130.68
115.98
12.52
1.92
J.-Bera
0.99
1272.23
1.89E+08
1.1E+08
227.11
mean
0.16
-0.01
2476.56
2705.75
median
0.13
0.00
1377.87
782.85
stdev
0.15
0.17
3352.74
medstdev
0.10
0.01
1455.80
minimum
-0.38
-0.83
maximum
0.64
count
#NA
#NA
> |3stdev|
42.57
0.50
79.65
0.67 288.06
0.04
1716.30
0.74
0.33
0.02
1.21
0.29
0.78
0.25
0.02
1.16
0.24
4515.35
7.29
0.27
0.28
0.58
0.73
991.75
1.13
0.20
0.02
0.51
0.48
69.27
0.71
-33.18
0.02
-1.34
-0.36
-2.17
0.77
15935.84
20617.87
34.74
1.25
1.38
2.68
2.65
268
294
294
266
294
294
220
264
274
26
28
74
30
20
0.02
0.03
0.04
0.04
0.03
0.02
0.04
0.02
> |3mstdev|
0.04
0.24
0.11
0.22
0.20
0.07
0.18
0.06
skewness
0.78
-2.38
2.57
2.63
-0.25
1.52
-0.81
0.16
0.08
kurtosis
2.67
16.50
6.71
6.81
12.52
1.92
16.52
0.29
2.77
J.-Bera
0.26
9.48
47511.06
63562.72
227.11
0.67
15.85
0.021
0.85
De sc r ipt iv e a na ly si s
141
eps
beta
CoV forecasts
market value
total assets
op inc / sales
return on assets
return on equity
net gearing
debt-equity ratio 2
debt-equity ratio 1
debt ratio
1.00 0.98 0.92 -0.34 -0.05 -0.40 -0.01 0.21 0.67 0.15 0.18 0.25 0.41 -0.08 -0.20 0.01 0.10 0.48
1.00 0.88 -0.38 -0.06 -0.44 -0.02 0.19 0.60 0.14 0.16 0.28 0.43 -0.07 -0.19 0.02 0.10 0.44
1.00 -0.32 -0.05 -0.34 0.00 0.26 0.76 0.16 0.22 0.18 0.42 -0.06 -0.19 0.05 0.10 0.49
1.00 -0.10 0.81 -0.15 -0.07 -0.09 0.07 -0.14 -0.18 -0.23 -0.11 -0.06 -0.04 -0.20 -0.30
1.00 -0.02 0.97 0.15 -0.05 0.00 0.11 0.01 -0.05 0.01 0.05 0.12 0.09 0.01
1.00 -0.06 -0.15 0.02 0.12 -0.11 -0.27 -0.32 -0.02 0.03 0.03 -0.23 -0.14
1.00 0.14 -0.02 0.01 0.12 0.07 0.01 0.02 0.02 0.18 0.10 0.09
return on equity
1.00 0.28 0.07 0.15 0.08 0.06 0.00 -0.06 -0.08 0.19 0.21
return on assets
1.00 0.38 0.39 0.09 0.25 -0.03 -0.20 0.10 0.05 0.68
appendix
op inc / sales
142
total assets
market value
Table A-4 Summary statistics per cluster for final map (continued on next page)
C1
82
C2
59
C3
48
C4
47
C5
29
C6
13
C8
8
C7
8
27.89
20.07
16.33
15.99
9.86
4.42
2.72
2.72
0.01
0.03
0.04
0.02
0.04
0.07
0.08
0.07
10.64
1
18
2.74
14.72
9
20
2.88
10.54
8
14
1.58
14.68
10
18
2.02
9.15
7
13
1.38
7.55
1
12
2.68
9
1
14
3.43
9
7
12
1.5
17.64
4.99
31.39
8.20
3.72
0.89
17.20
2.72
5.44
1.29
11.58
2.74
2.39
0.01
9.33
1.81
-1.04
-3.56
0.43
1.31
6.59
1.92
24.96
7.24
0.53
-4.38
5.10
3.39
Matching
records
Matching
records (%)
Average
quant. error
SP rating
Mean
Minimum
Maximum
Std.
deviation
debt-equity ratio 2
Mean
Minimum
Maximum
Std.
deviation
0.50
0.06
1.33
0.22
0.32
0.03
0.80
0.15
0.64
0.27
1.56
0.21
0.45
0.09
0.87
0.19
1.16
0.76
1.59
0.27
0.54
0.02
0.98
0.28
0.61
0.42
0.96
0.19
0.47
0.01
0.71
0.26
return on assets
Mean
Minimum
Maximum
Std.
deviation
0.02
-0.02
0.05
0.02
0.06
0.03
0.10
0.02
0.03
-0.01
0.10
0.02
0.02
0.00
0.06
0.01
0.04
-0.02
0.10
0.03
-0.02
-0.05
-0.01
0.02
0.03
0.01
0.07
0.02
-0.01
-0.05
0.02
0.03
op inc / sales
Mean
Minimum
Maximum
Std.
deviation
0.09
-0.12
0.25
0.06
0.17
0.04
0.38
0.07
0.27
0.06
0.61
0.13
0.25
0.02
0.64
0.19
0.16
0.00
0.37
0.09
-0.06
-0.28
0.21
0.12
0.11
0.05
0.24
0.06
-0.02
-0.38
0.17
0.17
De sc r ipt iv e a na ly si s
143
C1
log total assets
Mean
Minimum
Maximum
Std.
deviation
144
C3
C4
C5
C6
C8
C7
2.88
2.12
3.48
0.33
3.39
2.40
4.20
0.43
3.08
2.16
4.20
0.39
3.65
2.60
4.20
0.33
2.52
1.84
3.71
0.40
2.59
1.87
3.36
0.41
2.73
2.19
3.59
0.47
2.99
2.41
4.00
0.54
1.25
-9.62
17.79
3.69
0.81
-17.86
5.38
2.91
1.98
-15.77
19.02
4.33
0.66
-8.19
4.27
1.77
0.82
-8.02
13.93
5.06
-3.37
-15.45
2.49
4.82
-25.92
-33.18
-12.65
8.67
21.14
-3.04
34.74
15.48
0.24
0.02
0.75
0.15
0.20
0.05
0.50
0.11
0.72
0.13
1.25
0.27
0.23
0.02
0.62
0.13
0.19
0.02
0.57
0.15
0.42
0.11
0.83
0.24
0.20
0.04
0.53
0.16
0.91
0.36
1.25
0.30
cov forecasts
Mean
Minimum
Maximum
Std.
deviation
0.07
-0.29
1.33
0.20
0.02
0.00
0.11
0.02
0.04
-0.55
0.50
0.15
0.01
-0.40
0.09
0.07
0.32
0.01
1.38
0.49
-1.20
-1.34
-0.85
0.20
0.12
0.01
0.40
0.16
0.29
-0.04
1.38
0.46
appendix
C2
eps
beta
cov forecasts
log market
value
net profit
margin
op inc / sales
return on
assets
return on
equity
net gearing
debt-equity
ratio 2
0.98 0.93 -0.35 -0.09 -0.34 -0.04 0.24 0.78 0.23 0.13 0.26 0.44 -0.12 -0.25 0.01 -0.02 0.59
debt-equity
ratio 1
EBITDA interest
coverage
1.00
debt ratio
ebit interest
coverage
EBIT int. cov.
1.00 0.89 -0.39 -0.10 -0.38 -0.05 0.22 0.72 0.20 0.10 0.30 0.46 -0.11 -0.24 0.03 -0.01 0.52
1.00 -0.30 -0.09 -0.28 -0.03 0.28 0.86 0.22 0.13 0.18 0.35 -0.06 -0.27 0.02 0.01 0.62
1.00 0.02 0.96 -0.04 0.11 -0.07 0.11 0.02 -0.11 -0.24 -0.18 0.03 -0.09 -0.05 -0.29
debt-equity ratio 1
-0.24 -0.16 -0.12 -0.03 -0.19 -0.24 -0.04 0.03 0.16 0.14 -0.09
debt-equity ratio 2
1.00 0.07
-0.01 -0.03 0.14 0.02 -0.18 -0.29 -0.17 0.03 -0.02 -0.05 -0.24
net gearing
1.00
-0.24 -0.10 -0.12 -0.03 -0.14 -0.17 -0.03 -0.05 0.23 0.13 0.03
return on equity
1.00 0.27 0.08 0.10 0.18 0.14 -0.10 -0.03 -0.12 0.14 0.14
return on assets
1.00 0.44 0.30 0.13 0.30 -0.07 -0.29 0.03 -0.03 0.74
op inc / sales
net profit margin
log total assets
log market value
cov net income
cov total assets
cov forecasts
beta
eps
De sc r ipt iv e a na ly si s
145
Value
5.097 2.274 2.176 1.453 1.407 1.298 0.942 0.859 0.795 0.612 0.508 0.235 0.142 0.092 0.050 0.032 0.019 0.010
10
11
12
13
14
15
16
17
18
% of variance 28.3
12.6
12.1
8.1
7.8
7.2
5.2
4.8
4.4
3.4
2.8
1.3
0.8
0.5
0.3
0.2
0.1
0.1
Cumulative % 28.3
41
53
61.1
68.9
76.1
81.4
86.1
90.6
94
96.8
98.1
98.9
99.4
99.7
99.8
99.9
100
14
15
16
17
18
0.01
0.08
appendix
146
Principal
components
EBIT int. cov.
10
11
12
13
0.90 0.16 -0.02 -0.17 -0.06 0.02 0.08 0.02 -0.18 -0.07 -0.22 -0.05 -0.16 -0.05 -0.04 -0.02 -0.01 -0.06
0.90 0.16 0.13 -0.17 -0.18 0.04 0.15 0.00 -0.06 -0.02 -0.10 -0.02 0.11
0.11
0.16
debt ratio
0.84 -0.01 0.39 -0.02 -0.16 0.01 0.18 -0.06 -0.01 0.09
0.10
0.04
0.06
debt-equity ratio 1
-0.17 0.83 0.28 -0.16 0.35 -0.01 0.03 -0.04 0.14 -0.09 0.04
0.02
0.02
-0.05 0.02
debt-equity ratio 2
-0.25 0.80 0.29 -0.19 0.34 0.04 0.03 -0.09 0.11 -0.17 -0.02 0.00
0.00
0.04
net gearing
-0.42 -0.32 0.72 -0.25 0.08 -0.13 0.27 0.01 -0.06 0.11
-0.06 0.09
return on equity
-0.42 -0.45 0.66 -0.29 0.08 -0.11 0.23 0.05 0.01 0.05
0.06
return on assets
0.27 -0.09 0.50 0.63 0.16 0.12 -0.33 0.02 0.11 -0.12 -0.11 -0.30 0.02
op inc / sales
0.43 -0.30 -0.32 -0.16 0.69 -0.17 0.03 0.12 0.13 0.12
0.04
0.59 -0.27 -0.32 -0.08 0.63 -0.09 0.03 0.01 -0.03 0.03
0.01
-0.05 0.16
-0.18 0.02
0.03
0.01
0.00
-0.05 0.14 0.01 -0.19 0.05 0.81 -0.11 0.05 0.11 0.50
-0.01 0.00
0.00
0.00
0.00
-0.07 0.08 -0.30 0.47 -0.06 -0.02 0.66 -0.12 0.43 0.07
0.00
0.00
0.00
0.00 0.37 0.15 0.40 0.06 -0.15 0.14 0.76 -0.21 0.15
0.01
0.07
0.00
-0.01 0.00
0.01
0.00
0.00
0.31 -0.36 0.09 -0.28 -0.13 0.34 -0.03 0.42 0.49 -0.37 0.06
0.07
0.00
-0.03 0.00
-0.01 0.01
0.00
cov forecasts
0.37 -0.28 0.48 0.46 0.33 0.24 -0.03 -0.23 -0.08 -0.06 -0.06 0.32
0.01
beta
-0.32 -0.09 -0.20 0.09 0.17 0.59 0.41 0.00 -0.40 -0.27 0.23
-0.10 0.00
eps
0.72 0.16 0.18 0.13 -0.12 -0.08 0.05 -0.11 0.16 0.17
0.92 0.15 0.05 -0.18 -0.09 0.03 0.08 -0.01 -0.16 -0.07 -0.16 -0.05 -0.10 -0.04 -0.02 0.01
0.54
0.20
0.02
0.00
0.00
0.04
-0.01
0.00
-0.08 0.00
-0.01 0.00
0.00
-0.01
-0.03 0.11
0.01
0.00
0.01
0.00
0.00
-0.01 0.00
0.00
0.02
0.00
0.01
0.67
0.61
0.54
0.48
0.41
0.35
0.28
0.22
0.15
0.09
0.02
0
1.
1.
1.
1.
1.
0.
0.
0.
0.
0.
0.
0.
35
28
20
13
05
98
90
83
75
68
60
53
50
38
26
14
02
90
78
66
53
41
29
17
M
or
e
1.
1.
1.
1.
1.
0.
0.
0.
0.
0.
0.
0.
05
.0
.1
0.
-0
-0
-0.05
net gearing
-0.11
200
45
debt-equity ratio 1
-0.18
800
0.
36.48
33.19
29.90
26.61
23.32
20.03
16.74
13.45
10.16
6.87
3.58
0.29
-3.00
-6.29
-9.58
-12.87
-16.16
-19.45
-22.74
-26.03
-29.32
-0.24
20
0
38
40
200
-0.31
60
400
30
80
600
0.
100
800
0.
120
1000
-0.37
1200
23
140
0.
160
1400
15
180
1600
0.
1800
.3
07
100
-0.44
1000
5
200
100
.4
300
200
-0.50
1400
.5
400
300
-0
500
400
-0.57
1600
-0
600
500
0.
0
-0
300
600
00
400
.0
28.83
26.18
23.54
20.89
18.25
15.60
12.96
10.31
7.67
5.02
2.38
-0.27
-2.91
-5.56
-8.20
-10.85
-13.50
-16.14
-18.79
-21.43
-24.08
700
0.
0.55
0.50
0.45
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
-0.04
-0.09
-0.14
-0.19
-0.24
-0.29
-0.34
-0.39
-0.44
700
-0
25.21
22.69
20.18
17.67
15.15
12.64
10.13
7.61
5.10
2.58
0.07
-2.44
-4.96
-7.47
-9.98
-12.50
-15.01
-17.53
-20.04
-22.55
-25.07
700
-0.63
34.37
31.02
27.66
24.31
20.96
17.60
14.25
10.90
7.54
4.19
0.84
-2.52
-5.87
-9.23
-12.58
-15.93
-19.29
-22.64
-25.99
-29.35
-32.70
Figure A-13 Histograms per variable after cut-off (continued on next two pages)
EBITDA interest coverage
250
debt ratio
600
500
200
150
200
100
100
50
debt-equity ratio 2
1200
return on equity
1200
1000
800
600
600
400
400
200
De sc r ipt iv e a na ly si s
147
46
10
03
96
89
81
18
M
or
e
1.
1.
1.
0.
0.
0.
74
67
50
0.
30
0.
40
100
More
24479.92
23191.51
21903.09
20614.68
19326.26
18037.84
16749.43
15461.01
14172.60
12884.18
11595.77
10307.35
9018.94
7730.52
total assets
60
150
0.68
0.61
0.54
0.47
0.40
0.33
0.26
0.19
0.12
0.05
-0.02
-0.09
-0.16
-0.23
-0.30
-0.36
-0.43
-0.50
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
63
57
52
46
41
35
30
24
19
13
08
02
.0
.0
.1
.2
.2
.3
.3
.4
0.
-0
-0
-0
-0
-0
-0
return on assets
0.
53
200
-0.57
0.
250
45
0.
100
6442.11
600
38
300
0.
800
400
5153.69
500
3865.28
600
31
800
0.
900
2576.86
200
24
400
0.
600
16
1000
-0.64
1200
-0
-0
1288.45
-0.71
0.08
0.07
0.07
0.06
0.06
0.05
0.05
0.04
0.03
0.03
0.02
0.02
0.01
0.00
0.00
-0.01
-0.01
-0.02
-0.03
-0.03
50
0.
0.03
232.39
206.04
179.69
153.35
127.00
100.66
74.31
47.96
21.62
-4.73
-31.07
-57.42
-83.77
-110.11
-136.46
-162.80
-189.15
-215.50
-0.04
100
09
02
26862.58
25521.11
24179.64
22838.17
21496.70
20155.23
18813.76
17472.29
16130.83
14789.36
13447.89
12106.42
10764.95
9423.48
8082.01
6740.54
5399.07
4057.60
2716.14
-241.84
-268.19
150
0.
33.20
148
1374.67
appendix
0.
3.
1
-2 8
9.
1
-2 8
5.
1
-2 9
1.
1
-1 9
7.
2
-1 0
3.
20
-9
.2
1
-5
.2
1
-1
.2
1
2.
78
6.
78
10
.7
14 7
.7
18 7
.7
22 6
.7
26 6
.7
30 5
.7
5
M
or
e
-3
250
350
op inc / sales
200
300
250
200
150
100
50
0
1600
800
1400
1200
1000
800
600
400
0
200
0
1400
market value
700
1200
1000
200
400
200
0
70
60
50
20
10
0
46
The histograms for coefficient of net income, coefficient of total assets and coefficient of forecasts are slightly different because they
only represent companies from the Consumer Cyclicals sector in stead of all the sectors.
2.95
2.67
2.39
2.11
1.84
1.56
1.28
1.01
0.73
0.45
0.17
-0.10
-0.38
-0.66
-0.93
-1.21
-1.49
-1.77
-2.04
-2.32
-2.60
99
80
60
41
21
2.
2.
2.
1.
1.
1.
1.
1.
1.
46
29
12
96
79
62
45
29
12
95
78
62
45
28
cov forecasts
0.
0.
0.
0.
0.
11
.0
0.
-0
40
.2
-0
60
20
.3
40
.5
.7
60
-0
80
-0
-0
19
M
or
e
1.
0.
0.
0.
0.
0.
02
.1
.3
.5
.7
.9
.1
.3
0.
-0
-0
-0
-0
-0
-1
-1
120
180
beta
100
160
140
120
100
80
20
0
600
eps
500
400
300
200
100
De sc r ipt iv e a na ly si s
149
appendix
150
Figure A-14 Self organizing map of sector consumer cyclicals 1998 fourth quarter, iteration 1
De sc r ipt iv e a na ly si s
151
Figure A-15 Self organizing map of sector consumer cyclicals 1998 fourth quarter, iteration 2
Cluster coincidence
9
8
iteration 1
7
6
5
4
3
2
1
0
0
iteration 2
appendix
152
Figure A-17 Self organizing map of sector consumer cyclicals 1998 fourth quarter, iteration 3
Cluster coincidence
8
7
Iteration 2
6
5
4
3
2
1
0
0
Iteration 3
De sc r ipt iv e a na ly si s
153
Figure A-19 Sensitivity analysis: using 100 neurons
Clustercoincidence
8
7
500 neurons
6
5
4
3
2
1
0
0
100 neurons
Figure A-20 Cluster coincidence of 100 neuron som vs 500 neuron som
appendix
154
8
7
500 neurons
6
5
4
3
2
1
0
0
250 neurons
Figure A-22 Cluster coincidence of 250 neuron som vs 500 neuron som
De sc r ipt iv e a na ly si s
155
Figure A-23 Sensitivity analysis: using 1000 neurons
Clustercoincidence
8
7
500 neurons
6
5
4
3
2
1
0
0
1000 neurons
Figure A-24 Cluster coincidence of 1000 neuron som vs 500 neuron som
appendix
156
8
7
edited data
6
5
4
3
2
1
0
0
non-edited data
Figure A-26 Cluster coincidence of non-edited data som vs edited data som
De sc r ipt iv e a na ly si s
157
Figure A-27 Sensitivity analysis: merging all four cross-sections of 1998
Cluster coincidence
1998C4
1998C4
Cluster coincidence
1
0
158
Cluster coincidence
Cluster coincidence
1998C4
1998C4
appendix
1
0
0
0
Figure A-28 Cluster coincidence of the separate quarters in the merged som of 1998 vs the final map.
De sc r ipt iv e a na ly si s
Figure A-29 Sensitivity analysis: using 1997 data
159
appendix
160
Figure A-30 Sensitivity analysis: using 1996 data
De sc r ipt iv e a na ly si s
161
Figure A-31 Sensitivity analysis: using 1995 data
appendix
162
Figure A-32 Sensitivity analysis: using 1994 data
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
1.59
1.64
1.40
1.59
1.65
1.45
1.49
1.47
1.59
1.49
1.64
1.61
1.52
1.48
1.41
1.56
1.51
1.53
1.49
1.66
1.48
1.53
1.51
1.60
1.49
1.49
1.52
1.48
1.56
0.23
0.23
0.29
0.23
0.19
0.27
0.29
0.30
0.25
0.29
0.24
0.22
0.24
0.29
0.25
0.22
0.27
0.23
0.25
0.26
0.23
0.25
0.24
0.26
0.26
0.25
0.27
0.28
0.24
0.57
0.56
0.62
0.56
0.55
0.62
0.62
0.57
0.54
0.62
0.54
0.57
0.57
0.62
0.63
0.56
0.62
0.60
0.61
0.57
0.61
0.57
0.56
0.58
0.61
0.60
0.58
0.58
0.59
0.76
0.77
0.84
0.79
0.80
0.80
0.84
0.80
0.80
0.84
0.77
0.80
0.80
0.84
0.83
0.79
0.79
0.80
0.82
0.77
0.79
0.80
0.82
0.77
0.79
0.80
0.82
0.82
0.78
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
success ratio
max 2 notches (%)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
success ratio
max 1 notch (%)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
success ratio
max 0 notches (%)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
MAD
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Eps
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Beta
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
CoV forecasts
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Return on assets
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Return on equity
Net gearing
+
+
+
+
+
+
+
+
Debt-equity ratio 2
Debt-equity ratio 1
Debt ratio
+
+
Model results
Variables
0.66
0.63
0.72
0.65
0.65
0.70
0.70
0.69
0.67
0.69
0.64
0.65
0.69
0.68
0.72
0.69
0.67
0.68
0.68
0.62
0.70
0.67
0.70
0.64
0.68
0.69
0.65
0.68
0.67
164
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
0.24
0.27
0.25
0.21
0.20
0.25
0.24
0.21
0.22
0.22
0.19
0.23
0.26
0.59
0.58
0.56
0.56
0.63
0.55
0.56
0.53
0.51
0.51
0.49
0.56
0.59
0.80
0.82
0.80
0.81
0.81
0.78
0.73
0.71
0.74
0.74
0.67
0.74
0.74
1.50
1.46
1.53
1.55
1.52
1.55
1.74
1.90
1.75
1.73
2.13
1.68
1.66
+
+
+
+
+
+
success ratio
max 2 notches (%)
+
+
+
+
+
+
success ratio
max 1 notch (%)
+
+
+
+
+
+
+
+
+
+
success ratio
max 0 notches (%)
+
+
+
+
+
+
Eps
+
+
+
+
+
+
Beta
+
+
+
+
+
+
CoV forecasts
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Net gearing
+
+
+
+
+
+
Return on assets
+
+
+
+
Debt-equity ratio 2
Debt ratio
+
+
+
+
+
+
+
+
Return on equity
+
+
+
+
+
+
+
+
+
+
+
MAD
appendix
+
+
+
+
+
+
+
Model results
Debt-equity ratio 1
+
+
+
+
+
+
Variables
0.70
0.72
0.68
0.69
0.69
0.68
0.59
0.54
0.62
0.60
0.41
0.64
0.61
-7.59
-7.46
-6.65
-6.34
-6.06
-4.60
-2.70
-1.62
10
-0.83
11
-0.32
12
0.39
13
1.13
14
1.97
15
2.48
16
3.41
17
4.39
18
4.86
19
20