Classification
Editors: D. Michie, D.J. Spiegelhalter, C.C. Taylor
February 17, 1994
Contents

1 Introduction
1.1 INTRODUCTION
1.2 CLASSIFICATION
1.3 PERSPECTIVES ON CLASSIFICATION
1.3.1 Statistical approaches
1.3.2 Machine learning
1.3.3 Neural networks
1.3.4 Conclusions
1.4 THE STATLOG PROJECT
1.4.1 Quality control
1.4.2 Caution in the interpretations of comparisons
1.5 THE STRUCTURE OF THIS VOLUME

2 Classification
2.1 DEFINITION OF CLASSIFICATION
2.1.1 Rationale
2.1.2 Issues
2.1.3 Class definitions
2.1.4 Accuracy
2.2 EXAMPLES OF CLASSIFIERS
2.2.1 Fisher's linear discriminants
2.2.2 Decision tree and Rule-based methods
2.2.3 k-Nearest-Neighbour
2.3 CHOICE OF VARIABLES
2.3.1 Transformations and combinations of variables
2.4 CLASSIFICATION OF CLASSIFICATION PROCEDURES
2.4.1 Extensions to linear discrimination
2.4.2 Decision trees and Rule-based methods
6 Neural Networks
6.1 INTRODUCTION
6.2 SUPERVISED NETWORKS FOR CLASSIFICATION
6.2.1 Perceptrons and Multi Layer Perceptrons
6.2.2 Multi Layer Perceptron structure and functionality
6.2.3 Radial Basis Function networks
6.2.4 Improving the generalisation of Feed-Forward networks
6.3 UNSUPERVISED LEARNING
6.3.1 The K-means clustering algorithm
6.3.2 Kohonen networks and Learning Vector Quantizers
6.3.3 RAMnets
6.4 DIPOL92
6.4.1 Introduction
6.4.2 Pairwise linear regression
6.4.3 Learning procedure
6.4.4 Clustering of classes
6.4.5 Description of the classification procedure
10 Analysis of Results
10.1 INTRODUCTION
10.2 RESULTS BY SUBJECT AREAS
10.2.1 Credit datasets
10.2.2 Image datasets
10.2.3 Datasets with costs
10.2.4 Other datasets
10.3 TOP FIVE ALGORITHMS
10.3.1 Dominators
10.4 MULTIDIMENSIONAL SCALING
10.4.1 Scaling of algorithms
10.4.2 Hierarchical clustering of algorithms
10.4.3 Scaling of datasets
10.4.4 Best algorithms for datasets
10.4.5 Clustering of datasets
10.5 PERFORMANCE RELATED TO MEASURES: THEORETICAL
10.5.1 Normal distributions
10.5.2 Absolute performance: quadratic discriminants
11 Conclusions
11.1 INTRODUCTION
11.1.1 User's guide to programs
11.2 STATISTICAL ALGORITHMS
11.2.1 Discriminants
11.2.2 ALLOC80
11.2.3 Nearest Neighbour
11.2.4 SMART
11.2.5 Naive Bayes
11.2.6 CASTLE
11.3 DECISION TREES
11.3.1 AC2 and NewID
11.3.2 C4.5
11.3.3 CART and IndCART
11.3.4 Cal5
11.3.5 Bayes Tree
11.4 RULE-BASED METHODS
11.4.1 CN2
11.4.2 ITrule
11.5 NEURAL NETWORKS
11.5.1 Backprop
11.5.2 Kohonen and LVQ
11.5.3 Radial basis function neural network
11.5.4 DIPOL92
11.6 MEMORY AND TIME
11.6.1 Memory
11.6.2 Time
11.7 GENERAL ISSUES
11.7.1 Cost matrices
11.7.2 Interpretation of error rates
11.7.3 Structuring the results
11.7.4 Removal of irrelevant attributes
12 Knowledge Representation
12.1 INTRODUCTION
12.2 LEARNING, MEASUREMENT AND REPRESENTATION
12.3 PROTOTYPES
12.3.1 Experiment 1
12.3.2 Experiment 2
12.3.3 Experiment 3
12.3.4 Discussion
12.4 FUNCTION APPROXIMATION
12.4.1 Discussion
12.5 GENETIC ALGORITHMS
12.6 PROPOSITIONAL LEARNING SYSTEMS
12.6.1 Discussion
12.7 RELATIONS AND BACKGROUND KNOWLEDGE
12.7.1 Discussion
12.8 CONCLUSIONS
1 Introduction
D. Michie (1), D. J. Spiegelhalter (2) and C. C. Taylor (3)
(1) University of Strathclyde, (2) MRC Biostatistics Unit, Cambridge and (3) University
of Leeds
1.1 INTRODUCTION
The aim of this book is to provide an up-to-date review of different approaches to classification, compare their performance on a wide range of challenging data-sets, and draw conclusions on their applicability to realistic industrial problems.
Before describing the contents, we first need to define what we mean by classification, give some background to the different perspectives on the task, and introduce the European Community StatLog project whose results form the basis for this book.
1.2 CLASSIFICATION
The task of classification occurs in a wide range of human activity. At its broadest, the term could cover any context in which some decision or forecast is made on the basis of currently available information, and a classification procedure is then some formal method for repeatedly making such judgments in new situations. In this book we shall consider a more restricted interpretation. We shall assume that the problem concerns the construction of a procedure that will be applied to a continuing sequence of cases, in which each new case must be assigned to one of a set of pre-defined classes on the basis of observed attributes or features. The construction of a classification procedure from a set of data for which the true classes are known has also been variously termed pattern recognition, discrimination, or supervised learning (in order to distinguish it from unsupervised learning or clustering in which the classes are inferred from the data).
Contexts in which a classification task is fundamental include, for example, mechanical procedures for sorting letters on the basis of machine-read postcodes, assigning individuals to credit status on the basis of financial and other personal information, and the preliminary diagnosis of a patient's disease in order to select immediate treatment while awaiting definitive test results. In fact, some of the most urgent problems arising in science, industry and commerce can be regarded as classification or decision problems using complex and often very extensive data.
Address for correspondence: MRC Biostatistics Unit, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge CB2 2SR, U.K.
We note that many other topics come under the broad heading of classification. These include problems of control, which is briefly covered in Chapter 13.
1.3 PERSPECTIVES ON CLASSIFICATION
As the book's title suggests, a wide variety of approaches has been taken towards this task. Three main historical strands of research can be identified: statistical, machine learning and neural network. These have largely involved different professional and academic groups, and emphasised different issues. All groups have, however, had some objectives in common. They have all attempted to derive procedures that would be able:
to equal, if not exceed, a human decision-maker's behaviour, but have the advantage of consistency and, to a variable extent, explicitness,
to handle a wide variety of problems and, given enough data, to be extremely general,
to be used in practical settings with proven success.
1.3.2 Machine learning

1.3.3 Neural networks
The field of Neural Networks has arisen from diverse sources, ranging from the fascination
of mankind with understanding and emulating the human brain, to broader issues of copying
human abilities such as speech and the use of language, to the practical commercial,
scientific, and engineering disciplines of pattern recognition, modelling, and prediction.
The pursuit of technology is a strong driving force for researchers, both in academia and
industry, in many fields of science and engineering. In neural networks, as in Machine
Learning, the excitement of technological progress is supplemented by the challenge of
reproducing intelligence itself.
A broad class of techniques can come under this heading, but, generally, neural networks
consist of layers of interconnected nodes, each node producing a non-linear function of its
input. The input to a node may come from other nodes or directly from the input data.
Also, some nodes are identified with the output of the network. The complete network
therefore represents a very complex set of interdependencies which may incorporate any
degree of nonlinearity, allowing very general functions to be modelled.
In the simplest networks, the output from one node is fed into another node in such a
way as to propagate messages through layers of interconnecting nodes. More complex
behaviour may be modelled by networks in which the final output nodes are connected with
earlier nodes, and then the system has the characteristics of a highly nonlinear system with
feedback. It has been argued that neural networks mirror to a certain extent the behaviour
of networks of neurons in the brain.
Neural network approaches combine the complexity of some of the statistical techniques
with the machine learning objective of imitating human intelligence: however, this is done
at a more unconscious level and hence there is no accompanying ability to make learned
concepts transparent to the user.
1.3.4 Conclusions
The three broad approaches outlined above form the basis of the grouping of procedures used
in this book. The correspondence between type of technique and professional background
is inexact: for example, techniques that use decision trees have been developed in parallel
both within the machine learning community, motivated by psychological research or
knowledge acquisition for expert systems, and within the statistical profession as a response
to the perceived limitations of classical discrimination techniques based on linear functions.
Similarly strong parallels may be drawn between advanced regression techniques developed
in statistics, and neural network models with a background in psychology, computer science
and artificial intelligence.
It is the aim of this book to put all methods to the test of experiment, and to give an
objective assessment of their strengths and weaknesses. Techniques have been grouped
according to the above categories. It is not always straightforward to select a group: for
example some procedures can be considered as a development from linear regression, but
have strong affinity to neural networks. When deciding on a group for a specific technique,
we have attempted to ignore its professional pedigree and classify according to its essential
nature.
guidelines for parameter selection were not violated, and also gave some information on
the ease-of-use for a non-expert in the domain. Unfortunately, these guidelines were not
followed for the radial basis function (RBF) algorithm which for some datasets determined
the number of centres and locations with reference to the test set, so these results should be
viewed with some caution. However, it is thought that the conclusions will be unaffected.
1.4.2 Caution in the interpretations of comparisons
There are some strong caveats that must be made concerning comparisons between techniques in a project such as this.
First, the exercise is necessarily somewhat contrived. In any real application, there should be an iterative process in which the constructor of the classifier interacts with the expert in the domain, gaining understanding of the problem and any limitations in the data, and receiving feedback as to the quality of preliminary investigations. In contrast, StatLog datasets were simply distributed and used as test cases for a wide variety of techniques, each applied in a somewhat automatic fashion.
(StatLog is ESPRIT project 5170: Comparative testing and evaluation of statistical and logical learning algorithms on large-scale applications to classification, prediction and control.)
Second, the results obtained by applying a technique to a test problem depend on three
factors:
1.
2.
3.
In Appendix B we have described the implementations used for each technique, and the
availability of more advanced versions if appropriate. However, it is extremely difficult to
control adequately the variations in the background and ability of all the experimenters in
StatLog, particularly with regard to data analysis and facility in tuning procedures to give
their best. Individual techniques may, therefore, have suffered from poor implementation
and use, but we hope that there is no overall bias against whole classes of procedure.
1.5 THE STRUCTURE OF THIS VOLUME
The present text has been produced by a variety of authors, from widely differing backgrounds, but with the common aim of making the results of the StatLog project accessible
to a wide range of workers in the fields of machine learning, statistics and neural networks,
and to help the cross-fertilisation of ideas between these groups.
After discussing the general classication problem in Chapter 2, the next 4 chapters
detail the methods that have been investigated, divided up according to broad headings of
Classical statistics, modern statistical techniques, Decision Trees and Rules, and Neural
Networks. The next part of the book concerns the evaluation experiments, and includes
chapters on evaluation criteria, a survey of previous comparative studies, a description of
the data-sets and the results for the different methods, and an analysis of the results which
explores the characteristics of data-sets that make them suitable for particular approaches:
we might call this machine learning on machine learning. The conclusions concerning
the experiments are summarised in Chapter 11.
The final chapters of the book broaden the interpretation of the basic classification problem. The fundamental theme of representing knowledge using different formalisms is discussed with relation to constructing classification techniques, followed by a summary of current approaches to dynamic control now arising from a rephrasing of the problem in terms of classification and learning.
2 Classification
R. J. Henery
University of Strathclyde
2.1.1 Rationale
There are many reasons why we may wish to set up a classification procedure, and some
of these are discussed later in relation to the actual datasets used in this book. Here we
outline possible reasons for the examples in Section 1.2.
1. Mechanical classification procedures may be much faster: for example, postal code reading machines may be able to sort the majority of letters, leaving the difficult cases to human readers.
2. A mail order firm must take a decision on the granting of credit purely on the basis of information supplied in the application form: human operators may well have biases, i.e. may make decisions on irrelevant information and may turn away good customers.
Address for correspondence: Department of Statistics and Modelling Science, University of Strathclyde,
Glasgow G1 1XH, U.K.
3. In the medical field, we may wish to avoid the surgery that would be the only sure way of making an exact diagnosis, so we ask if a reliable diagnosis can be made on purely external symptoms.
4. The Supervisor (referred to above) may be the verdict of history, as in meteorology or stock-exchange transactions or investment and loan decisions. In this case the issue is one of forecasting.
2.1.2 Issues
There are also many issues of concern to the would-be classifier. We list below a few of
these.
Accuracy. There is the reliability of the rule, usually represented by the proportion
of correct classifications, although it may be that some errors are more serious than
others, and it may be important to control the error rate for some key class.
Speed. In some circumstances, the speed of the classifier is a major issue. A classifier
that is 90% accurate may be preferred over one that is 95% accurate if it is 100 times
faster in testing (and such differences in time-scales are not uncommon in neural
networks for example). Such considerations would be important for the automatic
reading of postal codes, or automatic fault detection of items on a production line for
example.
Comprehensibility. If it is a human operator that must apply the classification procedure, the procedure must be easily understood, or else mistakes will be made in applying the rule. It is important, also, that human operators believe the system. An oft-quoted example is the Three-Mile Island case, where the automatic devices correctly recommended a shutdown, but this recommendation was not acted upon by the human operators, who did not believe that the recommendation was well founded. A similar story applies to the Chernobyl disaster.
At one extreme, consider the naive 1-nearest neighbour rule, in which the training set is searched for the nearest (in a defined sense) previous example, whose class is then assumed for the new case. This is very fast to learn (no time at all!), but is very slow in practice if all the data are used (although if you have a massively parallel computer you
might speed up the method considerably). At the other extreme, there are cases where it is very useful to have a quick-and-dirty method, possibly for eyeball checking of data, or for providing a quick cross-checking on the results of another procedure. For example, a bank manager might know that the simple rule-of-thumb "only give credit to applicants who already have a bank account" is a fairly reliable rule. If she notices that the new assistant (or the new automated procedure) is mostly giving credit to customers who do not have a bank account, she would probably wish to check that the new assistant (or new procedure) was operating correctly.
and Width. We have available fifty pairs of measurements of each variety from which to learn the classification rule.
2.2.1 Fisher's linear discriminants
This is one of the oldest classification procedures, and is the most commonly implemented in computer packages. The idea is to divide sample space by a series of lines in two dimensions, planes in 3-D and, generally, hyperplanes in many dimensions. The line dividing two classes is drawn to bisect the line joining the centres of those classes; the direction of the line is determined by the shape of the clusters of points. For example, to
differentiate between Versicolor and Virginica, the rule compares Petal Width with a linear function of Petal Length: if Petal Width falls below the discriminant line then the example is classified as Versicolor, and if it falls above the line it is classified as Virginica.
Fisher's linear discriminants applied to the Iris data are shown in Figure 2.1. Six of the observations would be misclassified.
[Figure 2.1: the Iris data plotted as Petal Width against Petal Length, showing the Setosa, Versicolor and Virginica samples and the linear discriminant line separating Versicolor from Virginica.]
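For illustration only, a discriminant of this kind can be computed with standard software. The short Python sketch below uses the scikit-learn library (not the software used for the StatLog trials) to fit a linear discriminant to the Versicolor and Virginica samples, using the two petal measurements, and to count the misclassified training observations.

# A sketch only: a linear discriminant between Versicolor and
# Virginica from the petal measurements, via scikit-learn.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
keep = iris.target != 0                  # drop Setosa, keep the other two varieties
X = iris.data[keep][:, 2:4]              # petal length and petal width
y = iris.target[keep]

lda = LinearDiscriminantAnalysis().fit(X, y)
print("misclassified training points:", (lda.predict(X) != y).sum())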
2.2.2 Decision tree and Rule-based methods
One class of classification procedures is based on recursive partitioning of the sample space.
Space is divided into boxes, and at each stage in the procedure, each box is examined to
see if it may be split into two boxes, the split usually being parallel to the coordinate axes.
An example for the Iris data follows.
If Petal Length < 2.65 then Setosa.
If Petal Length > 4.95 then Virginica.
If 2.65 < Petal Length < 4.95 then:
    if Petal Width < 1.65 then Versicolor;
    if Petal Width > 1.65 then Virginica.
The resulting partition is shown in Figure 2.2. Note that this classification rule has three mis-classifications.
[Figure 2.2: the partition of the (Petal Length, Petal Width) plane produced by the decision tree, with regions labelled Setosa, Versicolor and Virginica; the Petal Width split is shown as a dotted line.]
Weiss & Kapouleas (1989) give an alternative classification rule for the Iris data that is very directly related to Figure 2.2. Their rule can be obtained from Figure 2.2 by continuing the dotted line to the left, and can be stated thus:
Notice that this rule, while equivalent to the rule illustrated in Figure 2.2, is stated more
concisely, and this formulation may be preferred for this reason. Notice also that the rule is
ambiguous if Petal Length < 2.65 and Petal Width > 1.65. The quoted rules may be made
unambiguous by applying them in the given order, and they are then just a re-statement of
the previous decision tree. The rule discussed here is an instance of a rule-based method:
such methods have very close links with decision trees.
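For illustration, the decision tree above can be written directly as a short Python function; the thresholds 2.65, 4.95 and 1.65 are those quoted in the text, and the function name is chosen here purely for the example.

# The Iris decision tree of Section 2.2.2, written out as an explicit rule.
def classify_iris(petal_length, petal_width):
    if petal_length < 2.65:
        return "Setosa"
    if petal_length > 4.95:
        return "Virginica"
    # intermediate petal lengths: split on petal width
    if petal_width < 1.65:
        return "Versicolor"
    return "Virginica"

print(classify_iris(1.4, 0.2))   # Setosa
print(classify_iris(4.5, 1.5))   # Versicolor
print(classify_iris(5.8, 2.2))   # Virginica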
2.2.3 k-Nearest-Neighbour
We illustrate this technique on the Iris data. Suppose a new Iris is to be classified. The idea is that it is most likely to be near to observations from its own proper population. So we look at the five (say) nearest observations from all previously recorded Irises, and classify
the observation according to the most frequent class among its neighbours. In Figure 2.3, the new observation is marked on the plot, and the nearest observations lie within the circle centred on it. The apparent elliptical shape is due to the differing horizontal and vertical scales, but the proper scaling of the observations is a major difficulty of this method. This is illustrated in Figure 2.3, where an observation at the marked point would be classified as Virginica since it has Virginica among its nearest neighbours.
[Figure 2.3: the Iris data with a new observation marked and a circle enclosing its five nearest neighbours; the neighbourhood contains Virginica observations.]
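The procedure, and the scaling difficulty mentioned above, can be sketched in a few lines of Python. The tiny training set and the choice of rescaling each attribute by its standard deviation are illustrative assumptions, not part of the original analysis.

# k-nearest-neighbour classification by majority vote, as a sketch.
import numpy as np

def knn_classify(X, y, x_new, k=5, scale=None):
    X = np.asarray(X, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    if scale is None:
        scale = X.std(axis=0)                    # one possible rescaling of the attributes
    d = np.sqrt((((X - x_new) / scale) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]                  # indices of the k closest training points
    labels, counts = np.unique(np.asarray(y)[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most frequent class among the neighbours

# Illustrative training data: (petal length, petal width) and the variety.
X = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.2, 1.3), (5.8, 2.2), (6.1, 2.4)]
y = ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"]
print(knn_classify(X, y, (4.4, 1.4), k=3))       # Versicolor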
combination to use. For example, in the Iris data, the product of the variables Petal Length
and Petal Width gives a single attribute which has the dimensions of area, and might be
labelled as Petal Area. It so happens that a decision rule based on the single variable Petal
Area is a good classifier with only four errors:
If Petal Area < 2.0 then Setosa.
If 2.0 < Petal Area < 7.4 then Versicolor.
If Petal Area > 7.4 then Virginica.
This tree, while it has one more error than the decision tree quoted earlier, might be preferred
on the grounds of conceptual simplicity as it involves only one concept, namely Petal
Area. Also, one less arbitrary constant need be remembered (i.e. there is one less node or
cut-point in the decision trees).
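As a sketch, the derived attribute and the associated rule can be written as follows; the thresholds are those quoted above and the function name is illustrative.

# A single derived attribute: Petal Area = Petal Length x Petal Width.
def classify_by_petal_area(petal_length, petal_width):
    area = petal_length * petal_width
    if area < 2.0:
        return "Setosa"
    elif area < 7.4:
        return "Versicolor"
    else:
        return "Virginica"

print(classify_by_petal_area(1.4, 0.2))   # area 0.28, classified as Setosa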
2. An implicit or explicit criterion for separating the classes: we may think of an underlying input/output relation that uses observed attributes to distinguish a random individual from each class.
3. The cost associated with making a wrong classification.
Most techniques implicitly confound components and, for example, produce a classification rule that is derived conditional on a particular prior distribution and cannot easily be adapted to a change in class frequency. However, in theory each of these components may be individually studied and then the results formally combined into a classification rule. We shall describe this development below.
2.5.1 Prior probabilities and the Default rule

Let the prior probability of class A_i be π_i. It is always possible to use the no-data rule: classify any new observation as a fixed class A_d, irrespective of the attributes of the example. This no-data or default rule may even be adopted in practice if the cost of gathering the data is too high. Thus, banks may give credit to all their established customers for the sake of good customer relations: here the cost of gathering the data is the risk of losing customers. The default rule relies only on knowledge of the prior probabilities, and clearly the decision rule that has the greatest chance of success is to allocate every new observation to the most frequent class. However, if some classification errors are more serious than others we adopt the minimum risk (least expected cost) rule, and the chosen class is that with the least expected cost (see below).
2.5.2 Separating classes

Suppose we are able to observe data x on an individual, and that we know the probability distribution of x within each class A_i to be P(x | A_i). Then for any two classes A_i and A_j the likelihood ratio P(x | A_i) / P(x | A_j) provides the theoretical optimal form for discriminating the classes on the basis of data x. The majority of techniques featured in this book can be thought of as implicitly or explicitly deriving an approximate form for this likelihood ratio.
2.5.3 Misclassification costs

Suppose that the cost of allocating an example of class A_j to class A_d is c(d, j). The expected cost of the default rule "allocate every new observation to class A_d" is then

    κ_d = Σ_j π_j c(d, j).

The Bayes minimum cost rule chooses that class that has the lowest expected cost. To see the relation between the minimum error and minimum cost rules, suppose the cost of misclassifications to be the same for all errors and zero when a class is correctly identified, i.e. suppose that c(d, j) = c for d ≠ j and c(d, j) = 0 for d = j. Then the expected cost is

    κ_d = Σ_j π_j c(d, j) = c Σ_{j ≠ d} π_j = c (1 − π_d),

and the minimum cost rule is to allocate to the class with the greatest prior probability.
Misclassification costs are very difficult to obtain in practice. Even in situations where it is very clear that there are very great inequalities in the sizes of the possible penalties or rewards for making the wrong or right decision, it is often very difficult to quantify them. Typically they may vary from individual to individual, as in the case of applications for credit of varying amounts in widely differing circumstances. In one dataset we have assumed the misclassification costs to be the same for all individuals. (In practice, credit-granting companies must assess the potential costs for each applicant, and in this case the classification algorithm usually delivers an assessment of probabilities, and the decision is left to the human operator.)
2.6 BAYES RULE

Write P(A_i | x) for Prob(class A_i given x), the probability that an example with attribute vector x belongs to class A_i. If we wish to use a minimum cost rule, we must first calculate the expected costs of the various decisions conditional on the given information x. Now, when decision A_d is made for examples with attributes x, a cost of c(d, j) is incurred for class A_j examples and these occur with probability P(A_j | x). As the probabilities P(A_j | x) depend on x, so too will the decision rule. So too will the expected cost κ_d(x) of making decision A_d:

    κ_d(x) = Σ_j c(d, j) P(A_j | x).
In the special case of equal misclassification costs, the minimum cost rule is to allocate to the class with the greatest posterior probability.
When Bayes theorem is used to calculate the conditional probabilities P(A_i | x) for the classes, we refer to them as the posterior probabilities of the classes. Then the posterior probabilities P(A_i | x) are calculated from a knowledge of the prior probabilities π_i and the conditional probabilities P(x | A_i) of the data x for each class A_i. Thus, for class A_i suppose that the probability of observing data x is P(x | A_i). Bayes theorem gives the posterior probability P(A_i | x) for class A_i as:

    P(A_i | x) = π_i P(x | A_i) / Σ_j π_j P(x | A_j).
S ` 3 H
P
The divisor is common to all classes, so we may use the fact that
is proportional
to
. The class
with minimum expected cost (minimum risk) is therefore that
for which
S 3 a YX 3 F
` P
S 3 g Y7EERBp 3 F
` P XS t46P
3
is a minimum.
Assuming now that the attributes have continuous distributions, the probabilities above become probability densities. Suppose that observations drawn from population A_i have probability density function f_i(x) = P(x | A_i), and that the prior probability that an observation belongs to class A_i is π_i. Then Bayes theorem computes the probability that an observation x belongs to class A_i as

    P(A_i | x) = π_i f_i(x) / Σ_j π_j f_j(x).

The minimum error rule allocates x to the class A_i for which π_i f_i(x) is a maximum, and the minimum cost rule allocates x to the class A_d for which Σ_j c(d, j) π_j f_j(x) is a minimum.
Consider the problem of discriminating between just two classes A_i and A_j. Assuming as before that c(i, i) = c(j, j) = 0, we should allocate to class A_i if

    c(i, j) π_j f_j(x) < c(j, i) π_i f_i(x),

or equivalently

    f_i(x) / f_j(x) > (π_j / π_i) (c(i, j) / c(j, i)),

which shows the pivotal role of the likelihood ratio, which must be greater than the ratio of prior probabilities times the relative costs of the errors. We note the symmetry in the above expression: changes in costs can be compensated by changes in prior to keep constant the threshold that defines the classification rule; this facility is exploited in some techniques, although for more than two groups this property only exists under restrictive assumptions (see Breiman et al., page 112).
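As a numerical sketch of the minimum cost rule, the expected cost of each possible decision can be computed from the priors, the class-conditional densities at the observed x, and the cost matrix, and the cheapest decision chosen. All the numbers below are invented purely for illustration.

# Bayes minimum expected cost decision for one observation (illustrative numbers only).
import numpy as np

priors = np.array([0.6, 0.3, 0.1])        # pi_j for three classes
densities = np.array([0.5, 2.0, 1.5])     # f_j(x) evaluated at the observed x
cost = np.array([[0.0, 1.0, 5.0],         # cost[d, j] = c(d, j): decide d, true class j
                 [1.0, 0.0, 5.0],
                 [1.0, 1.0, 0.0]])

posterior = priors * densities
posterior /= posterior.sum()              # P(A_j | x) by Bayes theorem
expected_cost = cost @ posterior          # kappa_d(x) = sum_j c(d, j) P(A_j | x)
print("posterior probabilities:", posterior)
print("expected costs:", expected_cost)
print("decision: class", expected_cost.argmin() + 1)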
2.6.1 Bayes rule in statistics
Rather than deriving P(A_i | x) via Bayes theorem, we could also use the empirical frequency version of Bayes rule, which, in practice, would require prohibitively large amounts of data. However, in principle, the procedure is to gather together all examples in the training set that have the same attributes (exactly) as the given example, and to find class proportions among these examples. The minimum error rule is to allocate to the class A_i with highest posterior probability.
Unless the number of attributes is very small and the training dataset very large, it will be necessary to use approximations to estimate the posterior class probabilities. For example,
one way of finding an approximate Bayes rule would be to use not just examples with
attributes matching exactly those of the given example, but to use examples that were near
the given example in some sense. The minimum error decision rule would be to allocate
to the most frequent class among these matching examples. Partitioning algorithms, and
decision trees in particular, divide up attribute space into regions of self-similarity: all
data within a given box are treated as similar, and posterior class probabilities are constant
within the box.
Decision rules based on Bayes rules are optimal - no other rule has lower expected
error rate, or lower expected misclassication costs. Although unattainable in practice,
they provide the logical basis for all statistical algorithms. They are unattainable because
they assume complete information is known about the statistical distributions in each class.
Statistical procedures try to supply the missing distributional information in a variety of
ways, but there are two main lines: parametric and non-parametric. Parametric methods
make assumptions about the nature of the distributions (commonly it is assumed that the
distributions are Gaussian), and the problem is reduced to estimating the parameters of
the distributions (means and variances in the case of Gaussians). Non-parametric methods
make no assumptions about the specific distributions involved, and are therefore described,
perhaps more accurately, as distribution-free.
2.7 REFERENCE TEXTS
There are several good textbooks that we can recommend. Weiss & Kulikowski (1991)
give an overall view of classification methods in a text that is probably the most accessible
to the Machine Learning community. Hand (1981), Lachenbruch & Mickey (1975) and
Kendall et al. (1983) give the statistical approach. Breiman et al. (1984) describe CART,
which is a partitioning algorithm developed by statisticians, and Silverman (1986) discusses
density estimation methods. For neural net approaches, the book by Hertz et al. (1991) is
probably the most comprehensive and reliable. Two excellent texts on pattern recognition
are those of Fukunaga (1990), who gives a thorough treatment of classification problems,
and Devijver & Kittler (1982) who concentrate on the k-nearest neighbour approach.
A thorough treatment of statistical procedures is given in McLachlan (1992), who also
mentions the more important alternative approaches. A recent text dealing with pattern
recognition from a variety of perspectives is Schalkoff (1992).
3 Classical Statistical Methods
J. M. O. Mitchell
University of Strathclyde
3.1 INTRODUCTION
This chapter provides an introduction to the classical statistical discrimination techniques
and is intended for the non-statistical reader. It begins with Fisher's linear discriminant,
which requires no probability assumptions, and then introduces methods based on maximum
likelihood. These are linear discriminant, quadratic discriminant and logistic discriminant.
Next there is a brief section on Bayes rules, which indicates how each of the methods
can be adapted to deal with unequal prior probabilities and unequal misclassification costs.
Finally there is an illustrative example showing the result of applying all three methods to
a two class and two attribute problem. For full details of the statistical theory involved the
reader should consult a statistical text book, for example (Anderson, 1958).
The training set will consist of n examples drawn from q known classes. (Often q will be 2.) The values of p numerically-valued attributes will be known for each of the n examples, and these form the attribute vector x = (x_1, ..., x_p). It should be noted that these methods require numerical attribute vectors, and also require that none of the values is missing. Where an attribute is categorical with two values, an indicator is used, i.e. an attribute which takes the value 1 for one category, and 0 for the other. Where there are more than two categorical values, indicators are normally set up for each of the values. However there is then redundancy among these new attributes and the usual procedure is to drop one of them. In this way a single categorical attribute with b values is replaced by b − 1 attributes whose values are 0 or 1. Where the attribute values are ordered, it may be acceptable to use a single numerical-valued attribute. Care has to be taken that the numbers used reflect the spacing of the categories in an appropriate fashion.
Address for correspondence: Department of Statistics and Modelling Science, University of Strathclyde,
Glasgow G1 1XH, U.K.
a least-squares sense; the second is by Maximum Likelihood (see Section 3.2.3). We will
give a brief outline of these approaches. For a proof that they arrive at the same solution,
we refer the reader to McLachlan (1992).
3.2.1 Linear discriminants by least squares
Fisher's linear discriminant (Fisher, 1936) is an empirical method for classification based purely on attribute vectors. A hyperplane (line in two dimensions, plane in three dimensions, etc.) in the p-dimensional attribute space is chosen to separate the known classes as well as possible. Points are classified according to the side of the hyperplane that they fall on. For example, see Figure 3.1, which illustrates discrimination between two digits, with the continuous line as the discriminating hyperplane between the two populations.
This procedure is also equivalent to a t-test or F-test for a significant difference between the mean discriminants for the two samples, the t-statistic or F-statistic being constructed to have the largest possible value.
More precisely, in the case of two classes, let m, m_1, m_2 be respectively the means of the attribute vectors overall and for the two classes. Suppose that we are given a set of coefficients a_1, ..., a_p and let us call the particular linear combination of attributes

    y(x) = Σ_j a_j x_j

the discriminant between the classes. We wish the discriminants for the two classes to differ as much as possible, and one measure for this is the difference y(m_1) − y(m_2) between the mean discriminants for the two classes divided by the standard deviation of the discriminants, s_y say, giving the following measure of discrimination:

    ( y(m_1) − y(m_2) ) / s_y.
The coefficients a_1, ..., a_p that maximise this measure of discrimination are proportional to S^{-1}(m_1 − m_2), where S is the pooled covariance matrix of the attributes,

    S = (1 / (n − q)) Σ_i (X_i − m_i)' (X_i − m_i),

in which X_i is the matrix of attribute values for the examples of class i and m_i is the p-dimensional row-vector of their attribute means. The summation is over all the classes, and the divisor n − q is chosen to make the pooled covariance matrix unbiased. For invertibility the attributes must be linearly independent, which means that no attribute may be an exact linear combination of other attributes. In order to achieve this, some attributes may have to be dropped. Moreover no attribute can be constant within each class. Of course an attribute which is constant within each class but not overall may be an excellent discriminator and is likely to be utilised in decision tree algorithms. However it will cause the linear discriminant algorithm to fail. This situation can be treated by adding a small positive constant to the corresponding diagonal element of
the pooled covariance matrix, or by adding random noise to the attribute before applying
the algorithm.
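For two classes the computation can be sketched in a few lines of Python using plain NumPy; the simulated data below are purely illustrative. The coefficients are obtained from the pooled covariance matrix and the two class means, and the boundary passes through the mid-point of the mean discriminants.

# Fisher's linear discriminant for two classes: a = S^{-1}(m1 - m2),
# with S the pooled (unbiased) covariance matrix.  Illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))     # class 1 sample
X2 = rng.normal([2.0, 1.0], 1.0, size=(50, 2))     # class 2 sample

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X1) + len(X2) - 2)

a = np.linalg.solve(S, m1 - m2)                    # discriminant coefficients
threshold = a @ (m1 + m2) / 2                      # mid-point of the two mean discriminants

def classify(x):
    return 1 if a @ x > threshold else 2

print(classify(m1), classify(m2))                  # 1 2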
In order to deal with the case of more than two classes Fisher (1938) suggested the use
of canonical variates. First a linear combination of the attributes is chosen to minimise
the ratio of the pooled within class sum of squares to the total sum of squares. Then
further linear functions are found to improve the discrimination. (The coefficients in
these functions are the eigenvectors corresponding to the non-zero eigenvalues of a certain
matrix.) In general there will be min(q − 1, p) canonical variates. It may turn out that only
a few of the canonical variates are important. Then an observation can be assigned to the
class whose centroid is closest in the subspace defined by these variates. It is especially
useful when the class means are ordered, or lie along a simple curve in attribute-space. In
the simplest case, the class means lie along a straight line. This is the case for the head
injury data (see Section 9.4.1), for example, and, in general, arises when the classes are
ordered in some sense. In this book, this procedure was not used as a classier, but rather
in a qualitative sense to give some measure of reduced dimensionality in attribute space.
Since this technique can also be used as a basis for explaining differences in mean vectors
as in Analysis of Variance, the procedure may be called manova, standing for Multivariate
Analysis of Variance.
A basic assumption in much of what follows is that the attribute vector for an example of a given class follows a multivariate normal distribution, with probability density function

    f(x) = (2π)^{-p/2} |Σ|^{-1/2} exp( −(1/2) (x − μ)' Σ^{-1} (x − μ) ),    (3.1)
where μ is a p-dimensional vector denoting the (theoretical) mean for a class and Σ, the (theoretical) covariance matrix, is a p × p (necessarily positive definite) matrix. The (sample) covariance matrix that we saw earlier is the sample analogue of this covariance matrix, which is best thought of as a set of coefficients in the pdf or a set of parameters for the distribution. This means that the points for the class are distributed in a cluster centered at μ of ellipsoidal shape described by Σ. Each cluster has the same orientation and spread though their means will of course be different. (It should be noted that there is in theory no absolute boundary for the clusters but the contours for the probability density function have ellipsoidal shape. In practice occurrences of examples outside a certain ellipsoid will be extremely rare.) In this case it can be shown that the boundary separating two classes, defined by equality of the two pdfs, is indeed a hyperplane and it passes through the mid-point of the two centres. Its equation is

    ( x − (1/2)(μ_1 + μ_2) )' Σ^{-1} (μ_1 − μ_2) = 0,    (3.2)
where μ_i denotes the population mean for class A_i. However in classification the exact distribution is usually not known, and it becomes necessary to estimate the parameters for the distributions. With two classes, if the sample means are substituted for μ_1 and μ_2 and the pooled sample covariance matrix for Σ, then Fisher's linear discriminant is obtained. With more than two classes, this method does not in general give the same results as Fisher's discriminant.
Allocating an observation to the class with the greatest value of π_i f_i(x) leads, under the normal assumption with common covariance matrix, to a linear discriminant for each class A_i of the form

    log π_i − (1/2) μ_i' Σ^{-1} μ_i + μ_i' Σ^{-1} x,

though these can be simplified by subtracting the coefficients for the last class. The above formulae are stated in terms of the (generally unknown) population parameters μ_i, Σ and π_i. To obtain the corresponding plug-in formulae, substitute the corresponding sample estimators: m_i for μ_i; S for Σ; and p_i for π_i, where p_i is the sample proportion of class A_i examples.
3.3.1
The quadratic discriminant function is most simply defined as the logarithm of the appropriate probability density function, so that one quadratic discriminant is calculated for each class. The procedure used is to take the logarithm of the probability density function and to substitute the sample means and covariance matrices in place of the population values, giving the so-called plug-in estimates. Taking the logarithm of Equation (3.1), allowing for differing prior class probabilities π_i and dropping the constant term that is the same for every class, we obtain

    g_i(x) = log π_i − (1/2) log |Σ_i| − (1/2) (x − μ_i)' Σ_i^{-1} (x − μ_i)

as the quadratic discriminant for class A_i. Here it is understood that the suffix i refers to the sample of values from class A_i.
In classification, the quadratic discriminant is calculated for each class and the class with the largest discriminant is chosen. To find the a posteriori class probabilities explicitly, the exponential is taken of the discriminant and the resulting quantities normalised to sum to unity (see Section 2.6). Thus the posterior class probabilities P(A_i | x) are given by

    P(A_i | x) = exp( g_i(x) ) / Σ_j exp( g_j(x) ).
Once again, the above formulae are stated in terms of the unknown population parameters μ_i, Σ_i and π_i. To obtain the corresponding plug-in formulae, substitute the corresponding sample estimators: m_i for μ_i; S_i for Σ_i; and p_i for π_i, where p_i is the sample proportion of class A_i examples.
Many statistical packages allow for quadratic discrimination (for example, MINITAB
has an option for quadratic discrimination, SAS also does quadratic discrimination).
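The plug-in quadratic discriminant can also be sketched directly in Python with NumPy; the simulated two-class data below are illustrative only. One discriminant value is computed for each class, the largest is chosen, and exponentiating and normalising the values gives the posterior probabilities.

# Plug-in quadratic discriminants g_i(x) and the resulting posteriors (a sketch).
import numpy as np

def qda_fit(samples):
    """samples: a list of (n_i x p) arrays, one array per class."""
    n_total = sum(len(X) for X in samples)
    return [(X.mean(axis=0), np.cov(X, rowvar=False), len(X) / n_total) for X in samples]

def qda_scores(params, x):
    g = []
    for mean, cov, prior in params:
        d = x - mean
        g.append(np.log(prior) - 0.5 * np.linalg.slogdet(cov)[1]
                 - 0.5 * d @ np.linalg.solve(cov, d))
    g = np.array(g)
    post = np.exp(g - g.max())
    return g, post / post.sum()

rng = np.random.default_rng(1)
classes = [rng.normal(0.0, 1.0, size=(40, 2)), rng.normal(2.0, 2.0, size=(40, 2))]
g, post = qda_scores(qda_fit(classes), np.array([1.0, 1.0]))
print("chosen class:", g.argmax() + 1, " posterior probabilities:", post)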
A compromise between the linear and quadratic discriminants is to smooth the estimated class covariance matrices towards the pooled covariance matrix, using regularised estimates of the form

    (1 − λ_i) S_i + λ_i S,
where S_i is the class sample covariance matrix and S is the pooled covariance matrix. When λ_i is zero, there is no smoothing and the estimated class covariance matrix is just the ith sample covariance matrix S_i. When the λ_i are unity, all classes have the same covariance matrix, namely the pooled covariance matrix S. Friedman (1989) makes the value of λ_i smaller for classes with larger numbers n_i of observations.
The other parameter is a (small) constant term that is added to the diagonals of the
covariance matrices: this is done to make the covariance matrix non-singular, and also has
the effect of smoothing out the covariance matrices. As we have already mentioned in
connection with linear discriminants, any singularity of the covariance matrix will cause
problems, and as there is now one covariance matrix for each class the likelihood of such
a problem is much greater, especially for the classes with small sample sizes.
This two-parameter family of procedures is described by Friedman (1989) as regularised discriminant analysis. Various simple procedures are included as special cases: ordinary linear discriminants, quadratic discriminants, and a minimum Euclidean distance rule all correspond to particular values of the two parameters.
This type of regularisation has been incorporated in the Strathclyde version of Quadisc. Very little extra programming effort is required. However, it is up to the user, by trial and error, to choose the values of the two parameters. Friedman (1989) gives various shortcut methods for reducing the amount of computation.
The logistic discrimination procedure models the log odds of class A_1 relative to class A_2 as a linear function of the attributes:

    log [ P(A_1 | x) / P(A_2 | x) ] = α + β' x,

where α and the p-dimensional vector β are the parameters of the model that are to be estimated. The case of normal distributions with equal covariance is a special case of this, for which the parameters are functions of the prior probabilities, the class means and the common covariance matrix. However the model covers other cases too, such as that where the attributes are independent with values 0 or 1. One of the attractions is that the discriminant scale covers all real numbers. A large positive value indicates that class A_1 is likely, while a large negative value indicates that class A_2 is likely.
In practice the parameters α and β are estimated by maximum likelihood. The model implies that, given attribute values x, the conditional class probabilities for classes A_1 and A_2 take the forms:

    P(A_1 | x) = exp(α + β' x) / ( 1 + exp(α + β' x) ),
    P(A_2 | x) = 1 / ( 1 + exp(α + β' x) ),

respectively.
Given independent samples from the two classes, the conditional likelihood for the parameters α and β is defined to be

    L(α, β) = Π_{x in A_1 sample} P(A_1 | x)  ×  Π_{x in A_2 sample} P(A_2 | x),

and the parameter estimates are the values that maximise this likelihood. They are found by iterative methods, as proposed by Cox (1966) and Day & Kerridge (1967). Logistic models
belong to the class of generalised linear models (GLMs), which generalise the use of linear regression models to deal with non-normal random variables, and in particular to deal with binomial variables. In this context, the binomial variable is an indicator variable that counts whether an example is class A_1 or not. When there are more than two classes, one class is taken as a reference class, and there are q − 1 sets of parameters for the odds of each class relative to the reference class. To discuss this case, we abbreviate the notation for α_i + β_i' x to the simpler β_i' x. For the remainder of this section, therefore, x is a (p + 1)-dimensional vector with leading term unity, and the leading term in β_i corresponds to the constant α_i.
Again, the parameters are estimated by maximum conditional likelihood. Given attribute values x, the conditional class probability for class A_i, where i ≠ q, and the conditional class probability for the reference class A_q take the forms:

    P(A_i | x) = exp(β_i' x) / ( 1 + Σ_{j ≠ q} exp(β_j' x) ),
    P(A_q | x) = 1 / ( 1 + Σ_{j ≠ q} exp(β_j' x) ),
respectively. Given independent samples from the q classes, the conditional likelihood for the parameters is defined to be

    L(β_1, ..., β_{q−1}) = Π_i Π_{x in A_i sample} P(A_i | x).
Once again, the parameter estimates are the values that maximise this likelihood.
In the basic form of the algorithm an example is assigned to the class for which the
posterior is greatest if that is greater than 0, or to the reference class if all posteriors are
negative.
More complicated models can be accommodated by adding transformations of the
given attributes, for example products of pairs of attributes. As mentioned in Section 3.1, when categorical attributes with b (> 2) values occur, it will generally be necessary to convert them into b − 1 binary attributes before using the algorithm, especially if the
categories are not ordered. Anderson (1984) points out that it may be appropriate to
include transformations or products of the attributes in the linear function, but for large
datasets this may involve much computation. See McLachlan (1992) for useful hints. One
way to increase complexity of model, without sacricing intelligibility, is to add parameters
in a hierarchical fashion, and there are then links with graphical models and Polytrees.
Many statistical packages (GLIM, Splus, Genstat) now include a generalised linear model (GLM) function, enabling logistic regression to be programmed easily, in two or three lines of code. The procedure is to define an indicator variable for class A_1 occurrences. The indicator variable is then declared to be a binomial variable with the logit link function, and generalised regression performed on the attributes. We used the package Splus for this purpose. This is fine for two classes, and has the merit of requiring little extra programming effort. For more than two classes, the complexity of the problem increases substantially, and, although it is technically still possible to use GLM procedures, the programming effort is substantially greater and much less efficient.
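The same idea carries over to other packages; as an illustration only, the sketch below uses the Python statsmodels library (rather than the Splus code used in the project) to declare a class indicator as a binomial variable with the logit link and regress it on the attributes. The simulated data are invented for the example.

# Logistic regression as a GLM with a binomial family and logit link (a sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                                           # attribute values
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)    # class indicator

result = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
print(result.params)                                                    # estimates of alpha and beta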
The maximum likelihood solution can be found via a Newton-Raphson iterative procedure, as it is quite easy to write down the necessary derivatives of the likelihood (or, equivalently, the log-likelihood). The simplest starting procedure is to set the β coefficients to zero except for the leading coefficients α_i, which are set to the logarithms of the numbers in the various classes: i.e. α_i = log n_i, where n_i is the number of class A_i examples. This ensures that the values of the discriminants are those of the linear discriminant after the first iteration. Of course, an alternative would be to use the linear discriminant parameters as starting values. In subsequent iterations, the step size may occasionally have to be reduced, but usually the procedure converges in about 10 iterations. This is the procedure we adopted where possible.
However, each iteration requires a separate calculation of the Hessian, and it is here that the bulk of the computational work is required. The Hessian is a square matrix with (q − 1)(p + 1) rows, and each term requires a summation over all the observations in the whole dataset (although some saving can be achieved using the symmetries of the Hessian). Thus there are of order n(q − 1)²(p + 1)² computations required to find the Hessian matrix at each iteration. In the KL digits dataset (see Section 9.3.2), for example, q = 10 and p = 40, so the number of operations required in each iteration is extremely large. In such
cases, it is preferable to use a purely numerical search procedure, or, as we did when
the Newton-Raphson procedure was too time-consuming, to use a method based on an
approximate Hessian. The approximation uses the fact that the Hessian for the zeroth
order iteration is simply a replicate of the design matrix (cf. covariance matrix) used by
the linear discriminant rule. This zero-order Hessian is used for all iterations. In situations
where there is little difference between the linear and logistic parameters, the approximation
is very good and convergence is fairly fast (although a few more iterations are generally
required). However, in the more interesting case that the linear and logistic parameters
are very different, convergence using this procedure is very slow, and it may still be quite
far from convergence after, say, 100 iterations. We generally stopped after 50 iterations:
although the parameter values were generally not stable, the predicted classes for the data
were reasonably stable, so the predictive power of the resulting rule may not be seriously
affected. This aspect of logistic regression has not been explored.
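For the two-class case the Newton-Raphson iteration described above can be sketched as follows in plain NumPy; this is an illustrative implementation, not the Fortran program used for the trials. Each step forms the Hessian by a summation over all the observations and solves a linear system for the update.

# Newton-Raphson iteration for two-class logistic discrimination (a sketch).
import numpy as np

def logistic_newton(X, y, iterations=10):
    X = np.column_stack([np.ones(len(X)), X])        # leading column of 1s for the constant
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.sum() / (len(y) - y.sum()))   # start from the log odds of the class sizes
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted P(A_1 | x)
        W = p * (1.0 - p)                            # weights
        hessian = X.T @ (W[:, None] * X)             # summation over all observations
        gradient = X.T @ (y - p)
        beta += np.linalg.solve(hessian, gradient)
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(int)
print(logistic_newton(X, y))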
The final program used for the trials reported in this book was coded in Fortran, since the Splus procedure had prohibitive memory requirements. Availability of the Fortran code is described in Appendix B.
3.6 EXAMPLE
As illustration of the differences between the linear, quadratic and logistic discriminants,
we consider a subset of the Karhunen-Loeve version of the digits data later studied in this
book. For simplicity, we consider only the digits 1 and 2, and to differentiate between
them we use only the first two attributes (40 are available, so this is a substantial reduction
in potential information). The full sample of 900 points for each digit was used to estimate
the parameters of the discriminants, although only a subset of 200 points for each digit is
plotted in Figure 3.1 as much of the detail is obscured when the full set is plotted.
3.6.1 Linear discriminant
Also shown in Figure 3.1 are the sample centres of gravity (marked by a cross). Because
there are equal numbers in the samples, the linear discriminant boundary (shown on the
diagram by a full line) intersects the line joining the centres of gravity at its mid-point. Any
new point is classified as a 1 if it lies below the line (i.e. is on the same side as the centre of the 1s). In the diagram, there are 18 2s below the line, so they would be misclassified.
3.6.2 Logistic discriminant
The logistic discriminant procedure usually starts with the linear discriminant line and then
adjusts the slope and intersect to maximise the conditional likelihood, arriving at the dashed
line of the diagram. Essentially, the line is shifted towards the centre of the 1s so as to
reduce the number of misclassified 2s. This gives 7 fewer misclassified 2s (but 2 more misclassified 1s) in the diagram.
3.6.3 Quadratic discriminant
The quadratic discriminant starts by constructing, for each sample, an ellipse centred on
the centre of gravity of the points. In Figure 3.1 it is clear that the distributions are of
different shape and spread, with the distribution of 2s being roughly circular in shape
and the 1s being more elliptical. The line of equal likelihood is now itself an ellipse (in
general a conic section) as shown in the Figure. All points within the ellipse are classied
28
[Ch. 3
as 1s. Relative to the logistic boundary, i.e. in the region between the dashed line and the
ellipse, the quadratic rule misclassifies an extra 7 1s (in the upper half of the diagram) but correctly classifies an extra 8 2s (in the lower half of the diagram). So the performance of the quadratic classifier is about the same as the logistic discriminant in this case, probably
due to the skewness of the 1 distribution.
Fig. 3.1: Decision boundaries for the three discriminants: quadratic (curved); linear (full line); and
logistic (dashed line). The data are the first two Karhunen-Loeve components for the digits 1 and
2.
4 Modern Statistical Techniques
R. Molina (1), N. Pérez de la Blanca (1) and C. C. Taylor (2)
(1) University of Granada and (2) University of Leeds
4.1 INTRODUCTION
In the previous chapter we studied the classification problem, from the statistical point of view, assuming that the form of the underlying density functions (or their ratio) was known. However, in most real problems this assumption does not necessarily hold. In this chapter we examine distribution-free (often called nonparametric) classification procedures that can be used without assuming that the form of the underlying densities is known.
Recall that q, n and p denote the number of classes, of examples and of attributes, respectively. Classes will be denoted by A_1, ..., A_q, and the attribute values for an example will be denoted by the p-dimensional vector x = (x_1, ..., x_p).
The Bayesian approach for allocating observations to classes has already been outlined in Section 2.6. It is clear that to apply the Bayesian approach to classification we have to estimate P(x | A_j) and π_j, or P(A_j | x). Nonparametric methods to do this job will be discussed in this chapter. We begin in Section 4.2 with kernel density estimation; a close
relative to this approach is the k-nearest neighbour (k-NN) which is outlined in Section 4.3.
Bayesian methods which either allow for, or prohibit dependence between the variables
are discussed in Sections 4.5 and 4.6. A final section deals with promising methods
which have been developed recently, but, for various reasons, must be regarded as methods
for the future. To a greater or lesser extent, these methods have been tried out in the
project, but the results were disappointing. In some cases (ACE), this is due to limitations
of size and memory as implemented in Splus. The pruned implementation of MARS in
Splus (StatSci, 1991) also suffered in a similar way, but a standalone version which also
does classication is expected shortly. We believe that these methods will have a place in
classification practice, once some relatively minor technical problems have been resolved.
As yet, however, we cannot recommend them on the basis of our empirical trials.
Address for correspondence: Department of Computer Science and AI, Facultad de Ciencias, University of Granada, 18071 Granada, Spain
4.2 DENSITY ESTIMATION
If $\mathbf{x}$ is a point drawn from a density $f$ and $R$ is a small region containing $\mathbf{x}$, then the probability that a sample falls in $R$ is
$$P = \int_R f(\mathbf{x}')\,d\mathbf{x}' \approx f(\mathbf{x})\,V,$$
where $V$ is the volume enclosed by $R$. This leads to the following procedure to estimate the density at $\mathbf{x}$. Let $V_n$ be the volume of $R_n$, $k_n$ be the number of samples falling in $R_n$, and $\hat{f}_n(\mathbf{x})$ the estimate of $f(\mathbf{x})$ based on a sample of size $n$; then
$$\hat{f}_n(\mathbf{x}) = \frac{k_n / n}{V_n}. \qquad (4.1)$$
Fixing the volume $V_n$ and counting the samples that fall in $R_n$ leads to the Parzen window estimate. Take $R_n$ to be a $p$-dimensional hypercube of side $h_n$ centred at $\mathbf{x}$, so that $V_n = h_n^p$, and define the window function
$$\phi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2, \; j = 1, \ldots, p \\ 0 & \text{otherwise.} \end{cases} \qquad (4.2)$$
The number of samples falling in the hypercube is then $k_n = \sum_{i=1}^{n}\phi\bigl((\mathbf{x}-\mathbf{x}_i)/h_n\bigr)$, and the estimate can be written as an average function of $\mathbf{x}$ and the samples $\mathbf{x}_i$,
$$\hat{f}_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{V_n}\,\phi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{x}, \mathbf{x}_i, h),$$
where the $K(\mathbf{x}, \mathbf{x}_i, h)$ are kernel functions. For instance, we could use, instead of the Parzen window defined above, the Normal kernel
$$K(\mathbf{x}, \mathbf{x}_i, h) = \frac{1}{(2\pi)^{p/2}\,h^{p}}\exp\!\left\{-\frac{1}{2h^{2}}\sum_{j=1}^{p}(x_j - x_{ij})^2\right\}. \qquad (4.3)$$
Before going into details about the kernel functions we use in the classification problem, and about the estimation of the smoothing parameter $h$, we briefly comment on the mean behaviour of $\hat{f}_n(\mathbf{x})$. We have
$$E\bigl[\hat{f}_n(\mathbf{x})\bigr] = \int K(\mathbf{x}, \mathbf{y}, h)\, f(\mathbf{y})\, d\mathbf{y},$$
so the expected value of the estimate is a smoothed version of the true density. The kernels we use factorise across the attributes,
$$K(\mathbf{x}, \mathbf{x}_i, h) = C(h)\prod_{j=1}^{p} K_j(x_j, x_{ij}, h),$$
with $K_j$ indicating the kernel function component of the $j$th attribute and $C(h)$ being not dependent on $\mathbf{x}$. It is very important to note that, as stressed by Aitchison & Aitken (1976), this factorisation does not imply the independence of the attributes for the density we are estimating.
It is clear that kernels could have a more complex form and that the smoothing parameter
could be coordinate dependent. We will not discuss in detail that possibility here (see
McLachlan, 1992 for details). Some comments will be made at the end of this section.
The kernels we use depend on the type of variable: a separate kernel component is defined for continuous variables, for binary variables, for variables taking nominal values, and for ordinal variables.
For the above expressions we can see that in all cases the kernel component can be written in a common form, governed by a measure of the disagreement between $x_j$ and $x_{ij}$ whose definition depends on the type of variable. The problem is that, since we want to use the same smoothing parameter for all variables, we have to normalise them. To do so the disagreement measure is rescaled, depending on the type of variable, using quantities computed from the training sample, namely the number of examples for which attribute $j$ takes a given value and the sample mean of the $j$th attribute. With this selection the attributes contribute comparably on average, so we can understand the above process as rescaling all the variables to the same scale.
For discrete variables the range of the smoothness parameter is a bounded interval: one extreme leads to the uniform distribution and the other to a one-point distribution.
The smoothing parameter itself is chosen by maximising the jackknife (leave-one-out) estimate of the likelihood,
$$CV(h) = \prod_{i=1}^{n} \hat{f}_{(i)}(\mathbf{x}_i \mid h),$$
where $\hat{f}_{(i)}$ denotes the density estimate computed with $\mathbf{x}_i$ itself omitted from the sample. This criterion makes the smoothness data dependent, leads to an algorithm for an arbitrary dimensionality of the data, and satisfies consistency requirements, as discussed by Aitchison & Aitken (1976).
An extension of the above model is to make the smoothing parameter dependent on the $k$th nearest neighbour distance to $\mathbf{x}_i$, so that we have a $\lambda_i$ for each sample point. This gives rise to the so-called variable kernel model. An extensive description of this model was first given by Breiman et al. (1977). This method has promising results, especially when lognormal or skewed distributions are estimated. The kernel width $\lambda_i$ is thus proportional to the $k$th nearest neighbour distance of $\mathbf{x}_i$, denoted by $d_{ik}$, i.e. $\lambda_i = h\,d_{ik}$. We take for $d_{ik}$ the Euclidean distance measured after standardisation of all variables. The proportionality factor $h$ is (inversely) dependent on $k$. The smoothing value is now determined by two parameters, $h$ and $k$; $h$ can be thought of as an overall smoothing parameter, while $k$ defines the variation in smoothness of the estimated density over the different regions. If, for example, $k$ is small, the smoothness will vary locally, while for larger values the smoothness tends to be constant over large regions, roughly approximating the fixed kernel model.
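A minimal sketch of how such per-point bandwidths could be computed (the constant of proportionality h and the neighbour count k are supplied by the user; the function name is ours):

```python
import numpy as np

def variable_kernel_bandwidths(samples, k, h):
    """lambda_i = h * d_ik, where d_ik is the k-th nearest neighbour distance
    of sample i, measured after standardising every variable."""
    z = (samples - samples.mean(axis=0)) / samples.std(axis=0)
    # pairwise Euclidean distances on the standardised scale
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)              # ignore the distance to self
    d_ik = np.sort(d, axis=1)[:, k - 1]      # k-th nearest neighbour distance
    return h * d_ik

# Example: bandwidths for 100 two-dimensional points with k = 10, h = 0.5.
rng = np.random.default_rng(1)
lam = variable_kernel_bandwidths(rng.normal(size=(100, 2)), k=10, h=0.5)
```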
We use a Normal distribution for the kernel component, so that for each sample point
$$K(\mathbf{x}, \mathbf{x}_i, \lambda_i) = \frac{1}{(2\pi)^{p/2}\,\lambda_i^{p}}\exp\!\left\{-\frac{1}{2\lambda_i^{2}}\sum_{j=1}^{p}(x_j - x_{ij})^2\right\}.$$
To optimise over $h$ and $k$ the jackknife modification of the maximum likelihood method can again be applied. However, for the variable kernel this leads to a more difficult two-dimensional optimisation problem of the likelihood function $CV(h, k)$, with one continuous parameter ($h$) and one discrete parameter ($k$).
Silverman (1986, Sections 2.6 and 5.3) studies the advantages and disadvantages of this approach. He also proposes another method to estimate the smoothing parameters in a variable kernel model (see Silverman, 1986 and McLachlan, 1992 for details).
The algorithm we mainly used in our trials to classify by density estimation is ALLOC80 by Hermans et al. (1982) (see Appendix B for source).
4.2.1 Example
We illustrate the kernel classifier with some simulated data, which comprise 200 observations from a standard Normal distribution (class 1, say) and 100 (in total) values from an equal mixture of two Normal distributions (class 2). The resulting estimates can then be used as a basis for classifying future observations to one or other class. Various scenarios are given in Figure 4.1, where a black segment indicates that observations will be allocated to class 2, and otherwise to class 1. In this example we have used equal priors for the 2 classes (although they are not equally represented), and hence allocations are based on maximum estimated likelihood. It is clear that the rule will depend on the smoothing parameters, and can result in very disconnected sets. In higher dimensions these segments will become regions, with potentially very nonlinear boundaries, and possibly disconnected, depending on the smoothing parameters used. For comparison we also draw the population probability densities, and the true decision regions, in Figure 4.1 (top); these are still disconnected but very much smoother than some of those constructed from the kernels.
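A sketch of how data of this kind could be simulated and allocated by maximum estimated likelihood with equal priors (the mixture components used for class 2 below are illustrative placeholders, since the exact parameters quoted in the original are not reproduced here):

```python
import numpy as np

def kde_1d(x, samples, h):
    """One-dimensional Normal-kernel density estimate evaluated at points x."""
    u = (x - samples[:, None]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / (h * np.sqrt(2 * np.pi)), axis=0)

rng = np.random.default_rng(2)
class1 = rng.normal(0.0, 1.0, 200)                       # standard Normal
# class 2: an equal mixture of two Normals (illustrative parameters only)
class2 = np.concatenate([rng.normal(-2.0, 0.5, 50), rng.normal(2.0, 0.5, 50)])

grid = np.linspace(-4, 4, 401)
h = 0.3
# equal priors, so allocate to class 2 wherever its estimated density is larger
alloc_class2 = kde_1d(grid, class2, h) > kde_1d(grid, class1, h)
```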
Fig. 4.1: Classification regions for the kernel classifier (bottom) with true probability densities (top). The smoothing parameters quoted in (A)-(D) are the values of $h$ used in Equation (4.3) for class 1 and class 2, respectively.
4.3 K-NEAREST NEIGHBOUR
Instead of fixing the volume $V$ and counting the number of samples that fall inside it, we can fix the number of samples $k$ and let the region grow until it contains the $k$ nearest neighbours of $\mathbf{x}$. This gives the k-nearest neighbour density estimate
$$\hat{f}(\mathbf{x}) = \frac{k/n}{V_k(\mathbf{x})},$$
where $V_k(\mathbf{x})$ is the volume of the smallest region centred at $\mathbf{x}$ that contains the $k$ nearest samples. Applying the estimate separately within each class and combining with the prior probabilities gives estimated posterior probabilities $p(A_j \mid \mathbf{x}) \approx k_j / k$, where $k_j$ is the number of the $k$ nearest neighbours that belong to class $A_j$. Allocating $\mathbf{x}$ to the class with the largest estimated posterior probability yields the familiar k-nearest neighbour classification rule: assign $\mathbf{x}$ to the class that is most frequent among its $k$ nearest neighbours.
4.3.1 Example
(Figure: glucose area (vertical axis) plotted against relative weight (horizontal axis) for the training data, with each patient labelled by class 1, 2 or 3.)
The figure shows that glucose area (y-axis) is more useful in separating the three classes, and that class 3 is easier to distinguish than classes 1 and 2. A new patient, whose condition is supposed unknown, is assigned the same classification as his nearest neighbour on the graph. The distance, as measured to each point, needs to be scaled in some way to take account of different variability in the different directions. In this case the patient is classified as being in class 2, and is classified correctly.
The decision regions for the nearest neighbour are composed of piecewise linear boundaries, which may be disconnected regions. These regions are the union of Dirichlet cells; each cell consists of points which are nearer (in an appropriate metric) to a given observation than to any other. For this data we have shaded each cell according to the class of its centre, and the resulting decision regions are shown in Figure 4.3.
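A minimal sketch of the nearest-neighbour rule as described here, with each variable rescaled to unit standard deviation before distances are computed (the names are ours):

```python
import numpy as np

def nearest_neighbour_classify(x_new, X_train, y_train):
    """1-NN rule: scale each attribute, then copy the class of the
    closest training point in Euclidean distance."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    z_train = (X_train - mu) / sd
    z_new = (x_new - mu) / sd
    i = np.argmin(((z_train - z_new) ** 2).sum(axis=1))
    return y_train[i]

# Example with two attributes (e.g. relative weight and glucose area).
X = np.array([[0.9, 450.0], [1.1, 900.0], [1.0, 1300.0]])
y = np.array([3, 2, 1])
print(nearest_neighbour_classify(np.array([1.05, 950.0]), X, y))   # -> 2
```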
(Figure 4.3: the nearest-neighbour decision regions for these data, formed from Dirichlet cells shaded by the class of their centres, plotted on the same axes of glucose area against relative weight.)
4.4 PROJECTION PURSUIT CLASSIFICATION
Projection pursuit models the class probabilities through smooth functions of a small number of linear combinations (projections) of the attributes. The conditional probability of each class is approximated by a sum of ridge functions,
$$p(A_j \mid \mathbf{x}) \approx \bar{p}_j + \sum_{m=1}^{M} \beta_{jm}\,\phi_m\!\bigl(\boldsymbol{\alpha}_m^{T}\mathbf{x}\bigr), \qquad j = 1, \ldots, q,$$
where the coefficients $\beta_{jm}$, the projection directions $\boldsymbol{\alpha}_m$ and the smooth functions $\phi_m$ are parameters of the model and are estimated by least squares. The fitted terms are normalised so that the most important term has unit importance (note that the variance of all the $\phi_m$ is one). The starting point for the minimisation of the largest model is given by a stagewise model with the same number of terms (see Friedman & Stuetzle, 1981 and StatSci, 1991 for a very precise description of the process).
The sequence of solutions generated in this manner is then examined by the user and a final model is chosen according to the guidelines above.
The algorithm we used in the trials to classify by projection pursuit is SMART (see Friedman, 1984 for details, and Appendix B for availability).
4.4.1 Example
This method is illustrated using a 5-dimensional dataset with three classes relating to chemical and overt diabetes. The data can be found in dataset 36 of Andrews & Herzberg (1985) and were first published in Reaven & Miller (1979). The SMART model can be examined by plotting the smooth functions in the two projected data co-ordinates, whose printed projection coefficients are +0.0045, -0.0213, +0.0010, -0.0044, 0.9998, -0.0065, -0.0001, +0.0005 and -0.0008.
These are given in Figure 4.4, which also shows the class values given by the projected points of the selected training data (100 of the 145 patients). The remainder of the model chooses the values of the coefficients $\beta_{jm}$ to obtain a linear combination of the functions which can then be used to model the conditional probabilities. In this example we get the coefficient values -0.05, -0.33, -0.40, 0.34, 0.46 and -0.01.
(Figure 4.4: the two fitted smooth functions f1 and f2 plotted against the projected points of the training data, with each point labelled by its class 1, 2 or 3.)
The remaining 45 patients were used as a test data set, and for each class the unscaled conditional probability can be obtained using the relevant coefficients for that class. These are shown in Figure 4.5, where we have plotted the predicted value against only one of the projected co-ordinate axes. It is clear that if we choose the model (and hence the class) to maximise this value, then we will choose the correct class each time.
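A sketch of this classification step under the model form described above: given fitted projection directions, ridge functions and coefficients, each class score is an offset plus a linear combination of the ridge functions evaluated at the projected point, and the predicted class maximises that score (all fitted quantities below are illustrative placeholders, not the values reported in the text):

```python
import numpy as np

def smart_classify(x, alphas, ridge_fns, betas, offsets):
    """score_j = offset_j + sum_m betas[j, m] * phi_m(alpha_m . x);
    return the index of the class with the largest score."""
    z = np.array([f(a @ x) for a, f in zip(alphas, ridge_fns)])
    return int(np.argmax(offsets + betas @ z))

# Illustrative two-term model for a 5-attribute, 3-class problem.
alphas = [np.array([0.0, 0.0, 1.0, 0.0, 0.0]),
          np.array([0.7, 0.0, 0.0, 0.7, 0.0])]
ridge_fns = [np.tanh, lambda t: t ** 2]                  # placeholder smooth functions
betas = np.array([[0.2, -0.1], [-0.3, 0.4], [0.1, 0.0]]) # placeholder coefficients
offsets = np.array([1 / 3, 1 / 3, 1 / 3])
print(smart_classify(np.array([0.9, 0.5, 1.5, 1.0, 0.9]), alphas, ridge_fns, betas, offsets))
```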
4.5 NAIVE BAYES
All the nonparametric methods described so far in this chapter suffer from the requirement that all of the sample must be stored. Since a large number of observations is needed to obtain good estimates, the memory requirements can be severe.
In this section we will make independence assumptions, to be described later, among the variables involved in the classification problem. In the next section we will address the problem of estimating the relations between the variables involved in a problem and display such relations by means of a directed acyclic graph.
The naive Bayes classifier is obtained as follows. We assume that the joint distribution of the attributes, given the class, factorises into the product of the univariate marginals, $f(\mathbf{x} \mid A_j) = \prod_k f(x_k \mid A_j)$; these marginals and the prior probabilities are estimated from the training sample, and a new observation is allocated to the class maximising $\pi_j \prod_k f(x_k \mid A_j)$. The implementation used in our trials comes from the IND package of machine learning algorithms, IND 1.0, by Wray Buntine (see Appendix B for availability).
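A minimal sketch of a naive Bayes classifier of this kind for discrete attributes, using frequency counts with a small amount of smoothing (the smoothing scheme and names are our own choices, not those of IND):

```python
import numpy as np
from collections import defaultdict

class NaiveBayes:
    """Discrete naive Bayes: allocate to the class maximising P(A_j) * prod_k P(x_k | A_j)."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors = {c: np.mean([yi == c for yi in y]) for c in self.classes}
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)
        for xi, yi in zip(X, y):
            for k, v in enumerate(xi):
                self.counts[(yi, k)][v] += 1
            self.totals[yi] += 1
        return self

    def predict(self, x):
        def score(c):
            s = np.log(self.priors[c])
            for k, v in enumerate(x):
                n_kv = self.counts[(c, k)][v] + 1          # add-one smoothing
                s += np.log(n_kv / (self.totals[c] + 2))   # denominator assumes binary attributes
            return s
        return max(self.classes, key=score)

# Example with two binary attributes.
X = [(0, 1), (1, 1), (0, 0), (1, 0)]
y = ["a", "a", "b", "b"]
print(NaiveBayes().fit(X, y).predict((0, 1)))   # -> 'a'
```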
Fig. 4.5: Projected test data with conditional probabilities for three classes. Class 1 (top), Class 2 (middle), Class 3 (bottom).
4.6 CAUSAL NETWORKS
Let $V$ be a set of variables, each with a finite set of possible states. Typical elements of $V$ are denoted $v$, and over the states of the variables in $V$ we have a probability distribution $P$.

Definition 1. Let $G = (V, E)$ be a directed acyclic graph (DAG). For each $v \in V$, let $\Pi(v)$ be the set of all parents of $v$ and $D(v)$ be the set of all descendants of $v$. Furthermore, for $v$ let $A(v)$ be the set of variables in $V$ excluding $v$ and $v$'s descendants. Then, if for every subset $W \subseteq A(v)$, $W$ and $v$ are conditionally independent given $\Pi(v)$, the pair $(G, P)$ is called a causal or Bayesian network.

There are two key results establishing the relations between a causal network $(G, P)$ and $P$. The proofs can be found in Neapolitan (1990).
The first theorem establishes that if $(G, P)$ is a causal network, then $P$ can be written as
$$P(V) = \prod_{v \in V} P\bigl(v \mid \Pi(v)\bigr).$$
Thus, in a causal network, if one knows the conditional probability distribution of each variable given its parents, one can compute the joint probability distribution of all the variables in the network. This obviously can reduce the complexity of determining the distribution enormously.
The theorem just established shows that if we know that a DAG and a probability distribution constitute a causal network, then the joint distribution can be retrieved from the conditional distribution of every variable given its parents. This does not imply, however, that if we arbitrarily specify a DAG and conditional probability distributions of every variable given its parents we will necessarily have a causal network. This inverse result can be stated as follows.
Let $V$ be a set of finite sets of alternatives (we are not yet calling the members of $V$ variables since we do not yet have a probability distribution) and let $G = (V, E)$ be a DAG. In addition, for $v \in V$ let $\Pi(v)$ be the set of all parents of $v$, and let a conditional probability distribution of $v$ given $\Pi(v)$ be specified for every event in $\Pi(v)$, that is, we have a probability distribution $P(v \mid \Pi(v))$. Then a joint probability distribution $P$ of the vertices in $V$ is uniquely determined by
$$P(V) = \prod_{v \in V} P\bigl(v \mid \Pi(v)\bigr),$$
and $(G, P)$ constitutes a causal network.
As an illustration, suppose that a small set of variables, together with the conditional probability of each variable given its parents, has been elicited from an expert.
Then the structure of our knowledge-base is represented by the DAG in Figure 4.6. This structure, together with quantitative knowledge of the conditional probability of every variable given all possible parent states, defines a causal network that can be used as a device to perform efficient (probabilistic) inference: absorbing knowledge about variables as it arrives, seeing the effect on the other variables of one variable taking a particular value, and so on. See Pearl (1988) and Lauritzen & Spiegelhalter (1988).
So, once a causal network has been built, it constitutes an efficient device to perform probabilistic inference. However, there remains the previous problem of building such a network, that is, to provide the structure and conditional probabilities necessary for characterizing the network. A very interesting task is then to develop methods able to learn the net directly from raw data, as an alternative to the method of eliciting opinions from the experts.
In the problem of learning graphical representations, it could be said that the statistical community has mainly worked in the direction of building undirected representations: chapter 8 of Whittaker (1990) provides a good survey on selection of undirected graphical representations up to 1990 from the statistical point of view. The program BIFROST (Højsgaard et al., 1992) has been developed, very recently, to obtain causal models. A second literature on model selection devoted to the construction of directed graphs can be found in the social sciences (Glymour et al., 1987; Spirtes et al., 1991) and the artificial intelligence community (Pearl, 1988; Herkovsits & Cooper, 1990; Cooper & Herkovsits, 1991; and Fung & Crawford, 1991).
In this section we will concentrate on methods to build a simplified kind of causal structure, polytrees (singly connected networks): networks where no more than one path exists between any two nodes. Polytrees are directed graphs which do not contain loops in the skeleton (the network without the arrows), and they allow an extremely efficient local propagation procedure.
Before describing how to build polytrees from data, we comment on how to use a polytree in a classification problem. In any classification problem we have a set of variables $x_1, \ldots, x_p$ that (possibly) have influence on a distinguished classification variable $c$. The problem is, given a particular instantiation of these variables, to predict the value of $c$, that is, to classify this particular case in one of the possible categories of $c$. For this task, we need a set of examples and their correct classification, acting as a training sample. In this context, we first estimate from the training sample a network (polytree) structure displaying the causal relationships among the variables $x_1, \ldots, x_p, c$; next, in propagation mode, given a new case with unknown classification, we instantiate and propagate the available information, showing the most likely value of the classification variable $c$.
It is important to note that this classifier can be used even when we do not know the value of all the variables in $X$. Moreover, the network shows the variables in $X$ that directly have influence on $c$: in fact the parents of $c$, the children of $c$ and the other parents of the children of $c$ (the knowledge of these variables makes $c$ independent of the rest of the variables in $X$) (Pearl, 1988). So the rest of the network could be pruned, thus reducing the complexity and increasing the efficiency of the classifier. However, since the process of building the network does not take into account the fact that we are only interested in classifying, we should expect as a classifier a poorer performance than other classification-oriented methods. However, the built networks are able to display insights into the classification problem that other methods lack. We now proceed to describe the theory to build polytree-based representations for a general set of variables $x_1, \ldots, x_p$.
Assume that the distribution $P(\mathbf{x})$ of the discrete-valued variables (which we are trying to estimate) can be represented by some unknown polytree $T$, that is, $P(\mathbf{x})$ has the form
$$P(\mathbf{x}) = \prod_{i} P\bigl(x_i \mid \Pi(x_i)\bigr),$$
where $\Pi(x_i)$ is the (possibly empty) set of direct parents of the variable $x_i$ in $T$, and the parents of each variable are mutually independent. So we are aiming at simpler representations than the one displayed in Figure 4.6. The skeleton of the graph involved in that example is not a tree.
Then, according to the key results seen at the beginning of this section, we have a causal network $(T, P)$, where $T$ is a polytree. We will assume that $P$ is nondegenerate, meaning that there exists a connected DAG that displays all the dependencies and independencies embedded in $P$.
It is important to keep in mind that a naive Bayes classifier (Section 4.5) can be represented by a polytree, more precisely a tree in which each attribute node has the class variable as a parent.
The first step in the process of building a polytree is to learn the skeleton. To build the skeleton we use the following result: the skeleton can be recovered as the maximum weight spanning tree over the variables, where each pair $(x_i, x_j)$ is weighted by its mutual information
$$I(x_i, x_j) = \sum_{x_i, x_j} P(x_i, x_j)\,\log\frac{P(x_i, x_j)}{P(x_i)\,P(x_j)},$$
estimated from the training sample.
Having found the skeleton of the polytree we move on to find the directionality of the branches. To recover the directions of the branches we use the following facts: nondegeneracy implies that for any pair of variables $(x_i, x_j)$ that do not have a common descendent we have $I(x_i, x_j) > 0$; whereas for a pair of variables $(x_i, x_j)$ whose only connection in the skeleton passes through a common child $x_k$ (a head-to-head pattern $x_i \rightarrow x_k \leftarrow x_j$) we have
$$I(x_i, x_j) = 0 \quad \text{and} \quad I(x_i, x_j \mid x_k) > 0, \qquad (4.7)$$
where the conditional dependency measure is
$$I(x_i, x_j \mid x_k) = \sum_{x_i, x_j, x_k} P(x_i, x_j, x_k)\,\log\frac{P(x_i, x_j \mid x_k)}{P(x_i \mid x_k)\,P(x_j \mid x_k)}.$$
Taking all these facts into account we can recover the head-to-head patterns, (4.7), which are the really important ones. The rest of the branches can be assigned any direction as long as we do not produce more head-to-head patterns. The algorithm to direct the skeleton can be found in Pearl (1988).
The program to estimate causal polytrees used in our trials is CASTLE (CAusal STructures From Inductive LEarning). It has been developed at the University of Granada for the ESPRIT project StatLog (Acid et al. (1991a); Acid et al. (1991b)). See Appendix B for availability.
4.6.1 Example
We now illustrate the use of the Bayesian learning methodology in a simple model, the digit recognition in a calculator.
Digits are ordinarily displayed on electronic watches and calculators using seven horizontal and vertical lights in on-off configurations (see Figure 4.7). We number the lights as shown in Figure 4.7. We take $\mathbf{Z} = (C, Z_1, \ldots, Z_7)$ to be an eight-dimensional vector, where $C$ denotes the digit, $C \in \{0, 1, \ldots, 9\}$, and, when fixing $C$ to a particular digit, the remaining $(Z_1, \ldots, Z_7)$ is a seven-dimensional vector of zeros and ones, with $Z_i = 1$ if the light in position $i$ is on for that digit and $Z_i = 0$ otherwise.
We generate examples from a faulty calculator. The data consist of outcomes from the random vector $\mathbf{Z} = (C, Z_1, \ldots, Z_7)$, where $C$ is the class label, the digit, and assumes the values in $\{0, 1, \ldots, 9\}$ with equal probability, and the $Z_1, \ldots, Z_7$ are zero-one variables. Given the value of $C$, the $Z_i$ are each independently equal to the value that the correct display of that digit would give, with some fixed probability, and are in error with the complementary probability. Our aim is to build up the polytree displaying the (in)dependencies in $\mathbf{Z}$.
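A sketch of such a data generator; the segment-numbering convention (top, top-left, top-right, middle, bottom-left, bottom-right, bottom) and the error probability of 0.1 are our assumptions, not necessarily those of Figure 4.7 or of the original experiment:

```python
import numpy as np

# Seven-segment patterns for digits 0-9 (1 = light on), under our numbering.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1), 1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1), 3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0), 5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1), 7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1), 9: (1, 1, 1, 1, 0, 1, 1),
}

def faulty_calculator_sample(n, error_prob=0.1, seed=0):
    """Draw n outcomes of Z = (C, Z_1, ..., Z_7): a uniformly random digit C,
    with each light independently flipped with probability error_prob."""
    rng = np.random.default_rng(seed)
    digits = rng.integers(0, 10, size=n)
    lights = np.array([SEGMENTS[d] for d in digits])
    flips = rng.random(lights.shape) < error_prob
    return np.column_stack([digits, np.where(flips, 1 - lights, lights)])

data = faulty_calculator_sample(400)   # 400 samples, as in the text
```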
We generate four hundred samples of this distribution and use them as a learning sample. After reading in the sample, estimating the skeleton and directing the skeleton, the polytree estimated by CASTLE is the one shown in Figure 4.8. CASTLE then tells us what we had expected: each of the light variables is connected only to the class variable, the digit.
Finally, we examine the predictive power of this polytree. The posterior probabilities of each digit given some observed patterns are shown in Figure 4.9.
(Figure 4.9: the posterior probabilities of each digit for a number of observed light patterns.)
4.7 OTHER RECENT APPROACHES
4.7.1 ACE
The ACE (Alternating Conditional Expectations) algorithm of Breiman & Friedman (1985) estimates transformations $\theta(y)$ of a response and $\phi(x)$ of a predictor that minimise the expected squared error
$$e^{2}(\theta, \phi) = \frac{E\bigl[\theta(y) - \phi(x)\bigr]^{2}}{E\bigl[\theta^{2}(y)\bigr]}. \qquad (4.8)$$
The minimisation alternates two conditional expectation steps, $\phi(x) = E[\theta(y) \mid x]$ and $\theta(y) = E[\phi(x) \mid y]$ (suitably normalised): one begins with
some starting functions and alternates these two steps until convergence. With multiple predictors $x_1, \ldots, x_p$, ACE seeks to minimise
$$e^{2} = E\Bigl[\theta(y) - \sum_{j=1}^{p}\phi_j(x_j)\Bigr]^{2}. \qquad (4.9)$$
Each $\phi_k(x_k)$ is estimated in turn, with the other functions given, yielding the solution
$$\phi_k(x_k) = E\Bigl[\theta(y) - \sum_{j \neq k}\phi_j(x_j)\,\Big|\,x_k\Bigr], \qquad k = 1, \ldots, p. \qquad (4.10)$$
Next $\theta(y)$ is estimated, with $\phi_1, \ldots, \phi_p$ given, as
$$\theta(y) = \frac{E\bigl[\sum_{j=1}^{p}\phi_j(x_j) \mid y\bigr]}{\bigl\|E\bigl[\sum_{j=1}^{p}\phi_j(x_j) \mid y\bigr]\bigr\|}. \qquad (4.11)$$
This constitutes one iteration of the algorithm, which terminates when an iteration fails to decrease $e^{2}$.
ACE places no restriction on the type of each variable. The transformation functions assume values on the real line, but their arguments may assume values on any set, so ordered real, ordered and unordered categorical, and binary variables can all be incorporated in the same regression equation. For categorical variables, the procedure can be regarded as estimating optimal scores for each of their values.
For use in classification problems, the response is replaced by a categorical variable representing the class labels, $A_j$. ACE then finds the transformations that make the relationship of $\theta(y)$ to the $\phi_j(x_j)$ as linear as possible.
4.7.2 MARS
The MARS (Multivariate Adaptive Regression Spline) procedure (Friedman, 1991) is
based on a generalisation of spline methods for function fitting. Consider the case of only one predictor variable, $x$. An approximating $q$th order regression spline function $\hat{f}(x)$ is obtained by dividing the range of $x$ values into $K + 1$ disjoint regions separated by $K$ points called knots. The approximation takes the form of a separate $q$ degree polynomial in each region, constrained so that the function and its $q - 1$ derivatives are continuous. Each $q$ degree polynomial is defined by $q + 1$ parameters, so there are a total of $(K + 1)(q + 1)$ parameters to be adjusted to best fit the data. Generally the order of the spline is taken to be low ($q \le 3$). Continuity requirements place $q$ constraints at each knot location, making a total of $Kq$ constraints.
While regression spline fitting can be implemented by directly solving this constrained minimisation problem, it is more usual to convert the problem to an unconstrained optimisation by choosing a set of basis functions that span the space of all $q$th order spline functions (given the chosen knot locations) and performing a linear least squares fit of the response on this basis function set. In this case the approximation takes the form
$$\hat{f}(x) = \sum_{k} a_k B_k(x), \qquad (4.12)$$
where the values of the expansion coefficients $a_k$ are unconstrained and the continuity constraints are intrinsically embodied in the basis functions $B_k(x)$. One such basis, the truncated power basis, is comprised of the functions
$$\{x^{j}\}_{j=0}^{q}, \qquad \{(x - t_k)_{+}^{q}\}_{k=1}^{K}, \qquad (4.13)$$
where $t_1, \ldots, t_K$ are the knot locations and the truncated power functions are defined by
$$(x - t_k)_{+}^{q} = \begin{cases} (x - t_k)^{q} & x > t_k \\ 0 & \text{otherwise.} \end{cases} \qquad (4.14)$$
The flexibility of the regression spline approach can be enhanced by incorporating an automatic knot selection strategy as part of the data fitting process. A simple and effective strategy for automatically selecting both the number and locations for the knots was described by Smith (1982), who suggested using the truncated power basis in a numerical minimisation of the least squares criterion
$$\sum_{i=1}^{n}\Bigl[y_i - \sum_{j=0}^{q} b_j x_i^{j} - \sum_{k=1}^{K} a_k (x_i - t_k)_{+}^{q}\Bigr]^{2} \qquad (4.15)$$
jointly over the coefficients and the knot locations $t_1, \ldots, t_K$.
MARS implements a forward/backward stepwise selection strategy. The forward selection begins with only the constant basis function $B_0(\mathbf{x}) = 1$ in the model. In each iteration we consider adding two terms to the model,
$$B_m(\mathbf{x})\,(x_j - t)_{+} \quad \text{and} \quad B_m(\mathbf{x})\,(t - x_j)_{+}, \qquad (4.16)$$
where $B_m$ is one of the basis functions already chosen, $x_j$ is one of the predictor variables not represented in $B_m$, and $t$ is a knot location on that variable. The two terms of this form which cause the greatest decrease in the residual sum of squares are added to the model. The forward selection process continues until a relatively large number of basis functions is included, in a deliberate attempt to overfit the data. The backward pruning procedure, standard stepwise linear regression, is then applied with the basis functions representing the stock of variables. The best fitting model is chosen with the fit measured by a cross-validation criterion.
MARS is able to incorporate variables of different type: continuous, discrete and categorical.
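A sketch of one step of such a forward selection over hinge (truncated linear, q = 1) terms: every combination of existing basis function, variable and candidate knot is tried, and the reflected pair that most reduces the residual sum of squares under a least squares fit is kept. The function names, the exhaustive knot search, and the omission of the restriction that the variable must not already appear in the chosen basis function are our simplifications:

```python
import numpy as np

def rss_of_fit(B, y):
    """Residual sum of squares of a least squares fit of y on the columns of B."""
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    r = y - B @ coef
    return float(r @ r)

def forward_step(X, y, basis):
    """Try adding the reflected pair B_m(x)*(x_j - t)_+ and B_m(x)*(t - x_j)_+
    for every existing basis function, variable and knot; keep the best pair."""
    n, p = X.shape
    B = np.column_stack([b(X) for b in basis])
    best = (rss_of_fit(B, y), None)
    for bm in basis:
        for j in range(p):
            for t in np.unique(X[:, j]):
                def left(Z, j=j, t=t, bm=bm):  return bm(Z) * np.maximum(Z[:, j] - t, 0.0)
                def right(Z, j=j, t=t, bm=bm): return bm(Z) * np.maximum(t - Z[:, j], 0.0)
                rss = rss_of_fit(np.column_stack([B, left(X), right(X)]), y)
                if rss < best[0]:
                    best = (rss, (left, right))
    return basis + list(best[1]) if best[1] is not None else basis

# Start from the constant basis function and take one forward step.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(100, 3))
y = np.maximum(X[:, 0] - 0.2, 0) + 0.1 * rng.normal(size=100)
basis = [lambda Z: np.ones(len(Z))]
basis = forward_step(X, y, basis)
```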
5
Machine Learning of Rules and Trees
C. Feng (1) and D. Michie (2)
(1) The Turing Institute and (2) University of Strathclyde
This chapter is arranged in three sections. Section 5.1 introduces the broad ideas underlying the main rule-learning and tree-learning methods. Section 5.2 summarises the specific characteristics of algorithms used for comparative trials in the StatLog project. Section 5.3 looks beyond the limitations of these particular trials to new approaches and emerging principles.
5.1 RULES AND TREES FROM DATA: FIRST PRINCIPLES
5.1.1
In a 1943 lecture (for text see Carpenter & Doran, 1986) A. M. Turing identified Machine Learning (ML) as a precondition for intelligent systems. A more specific engineering expression of the same idea was given by Claude Shannon in 1953, and that year also saw the first computational learning experiments, by Christopher Strachey (see Muggleton, 1993). After steady growth ML has reached practical maturity under two distinct headings: (a) as a means of engineering rule-based software (for example in expert systems) from sample cases volunteered interactively and (b) as a method of data analysis whereby rule-structured classifiers for predicting the classes of newly sampled cases are obtained from a training set of pre-classified cases. We are here concerned with heading (b), exemplified by Michalski and Chilausky's (1980) landmark use of the AQ11 algorithm (Michalski & Larson, 1978) to generate automatically a rule-based classifier for crop farmers.
Rules for classifying soybean diseases were inductively derived from a training set of 290 records. Each comprised a description in the form of 35 attribute-values, together with a confirmed allocation to one or another of 15 main soybean diseases. When used to
Addresses for correspondence: Cao Feng, Department of Computer Science, University of Ottawa, Ottawa, K1N 6N5, Canada; Donald Michie, Academic Research Associates, 6 Inveralmond Grove, Edinburgh EH4 6RA, U.K.
This chapter confines itself to a subset of machine learning algorithms, i.e. those that output propositional classifiers. Inductive Logic Programming (ILP) uses the symbol system of predicate (as opposed to propositional) logic, and is described in Chapter 12.
classify 340 or so new cases, machine-learned rules proved to be markedly more accurate than the best existing rules used by soybean experts.
As important as a good fit to the data is a property that can be termed 'mental fit'. As statisticians, Breiman and colleagues (1984) see data-derived classifications as serving two purposes: (1) to predict the response variable corresponding to future measurement vectors as accurately as possible; (2) to understand the structural relationships between the response and the measured variables. ML takes purpose (2) one step further. The soybean rules were sufficiently meaningful to the plant pathologist associated with the project that he eventually adopted them in place of his own previous reference set. ML requires that classifiers should not only classify but should also constitute explicit concepts, that is, expressions in symbolic form meaningful to humans and evaluable in the head.
We need to dispose of confusion between the kinds of computer-aided descriptions which form the ML practitioner's goal and those in view by statisticians. Knowledge-compilations, meaningful to humans and evaluable in the head, are available in Michalski & Chilausky's paper (their Appendix 2), in Shapiro & Michie (1986, their Appendix B), in Shapiro (1987, his Appendix A), and in Bratko, Mozetic & Lavrac (1989, their Appendix A), among other sources. A glance at any of these computer-authored constructions will suffice to show their remoteness from the main-stream of statistics and its goals. Yet ML practitioners increasingly need to assimilate and use statistical techniques.
Once they are ready to go it alone, machine learned bodies of knowledge typically need little further human intervention. But a substantial synthesis may require months or years of prior interactive work, first to shape and test the overall logic, then to develop suitable sets of attributes and definitions, and finally to select or synthesize voluminous data files as training material. This contrast has engendered confusion as to the role of human interaction. Like music teachers, ML engineers abstain from interaction only when their pupil reaches the concert hall. Thereafter abstention is total, clearing the way for new forms of interaction intrinsic to the pupil's delivery of what has been acquired. But during the process of extracting descriptions from data the working method of ML engineers resembles that of any other data analyst, being essentially iterative and interactive.
In ML the knowledge orientation is so important that data-derived classifiers, however accurate, are not ordinarily acceptable in the absence of mental fit. The reader should bear this point in mind when evaluating empirical studies reported elsewhere in this book. StatLog's use of ML algorithms has not always conformed to purpose (2) above. Hence the reader is warned that the book's use of the phrase 'machine learning' in such contexts is by courtesy and convenience only.
The Michalski-Chilausky soybean experiment exemplifies supervised learning,
given: a sample of input-output pairs of an unknown class-membership function,
required: a conjectured reconstruction of the function in the form of a rule-based expression human-evaluable over the domain.
Note that the function's output-set is unordered (i.e. consisting of categoric rather than numerical values) and its outputs are taken to be names of classes. The derived function-expression is then a classifier. In contrast to the prediction of numerical quantities, this book confines itself to the classification problem and follows a scheme depicted in Figure 5.1.
Constructing ML-type expressions from sample data is known as concept learning.
(Figure 5.1: the supervised learning scheme, in which a learning algorithm produces classification rules from training data, and the rules are then applied to testing data.)
The first such learner was described by Earl Hunt (1962). This was followed by Hunt, Marin & Stone's (1966) CLS. The acronym stands for Concept Learning System. In ML, the requirement for user-transparency imparts a bias towards logical, in preference to arithmetical, combinations of attributes. Connectives such as 'and', 'or', and 'if-then' supply the glue for building rule-structured classifiers, as in the following englished form of a rule from Michalski and Chilausky's soybean study.
if
leaf malformation is absent and stem is abnormal and internal discoloration is black
then
Diagnosis is CHARCOAL ROT
Example cases (the training set or learning sample) are represented as vectors of attribute-values paired with class names. The generic problem is to find an expression that predicts the classes of new cases (the test set) taken at random from the same population. Goodness of agreement between the true classes and the classes picked by the classifier is then used to measure accuracy. An underlying assumption is that either training and test sets are randomly sampled from the same data source, or full statistical allowance can be made for departures from such a regime.
Symbolic learning is used for the computer-based construction of bodies of articulate expertise in domains which lie partly at least beyond the introspective reach of domain experts. Thus the above rule was not of human expert authorship, although an expert can assimilate it and pass it on. To ascend an order of magnitude in scale, KARDIO's comprehensive treatise on ECG interpretation (Bratko et al., 1989) does not contain a single rule of human authorship. Above the level of primitive descriptors, every formulation was data-derived, and every data item was generated from a computable logic of heart/electrocardiograph interaction. Independently constructed statistical diagnosis systems are commercially available in computer-driven ECG kits, and exhibit accuracies in the 80%-90% range. Here the ML product scores higher, being subject to error only if the initial logical model contained flaws. None have yet come to light. But the difference that illuminates the distinctive nature of symbolic ML concerns mental fit. Because of its mode of construction, KARDIO is able to support its decisions with insight into causes.
Statistically derived systems do not. However, developments of Bayesian treatments ini-
(Figure key: + ML expected to do well; (+) ML expected to do well, marginally; (-) ML expected to do poorly, marginally.)
54
[Ch. 5
hypothesis language respectively of Section 12.2. Under (ii) we consider in the present chapter the machine learning of if-then rule-sets and of decision trees. The two kinds of language are interconvertible, and group themselves around two broad inductive inference strategies, namely specific-to-general and general-to-specific.
5.1.2
Michalski's AQ11 and related algorithms were inspired by methods used by electrical engineers for simplifying Boolean circuits (see, for example, Higonnet & Grea, 1958). They exemplify the specific-to-general approach, and typically start with a maximally specific rule for assigning cases to a given class, for example to the class MAMMAL in a taxonomy of vertebrates. Such a 'seed', as the starting rule is called, specifies a value for every member of the set of attributes characterizing the problem, for example
Rule 1.123456789
if skin-covering = hair, breathing = lungs, tail = none, can-fly = y, reproduction = viviparous, legs = y, warm-blooded = y, diet = carnivorous, activity = nocturnal
then MAMMAL.
We now take the reader through the basics of specific-to-general rule learning. As a minimalist tutorial exercise we shall build a MAMMAL-recogniser.
The initial rule, numbered 1.123456789 in the above, is so specific as probably to be capable only of recognising bats. Specificity is relaxed by dropping attributes one at a time, thus:
Rule 1.23456789
if breathing = lungs, tail = none, can-fly = y, reproduction = viviparous, legs = y, warm-blooded = y, diet = carnivorous, activity = nocturnal
then MAMMAL;
Rule 1.13456789
if skin-covering = hair, tail = none, can-fly = y, reproduction = viviparous, legs = y, warm-blooded = y, diet = carnivorous, activity = nocturnal
then MAMMAL;
Rule 1.12456789
if skin-covering = hair, breathing = lungs, can-fly = y, reproduction = viviparous, legs = y, warm-blooded = y, diet = carnivorous, activity = nocturnal
then MAMMAL;
Rule 1.12356789
if skin-covering = hair, breathing = lungs, tail = none, reproduction = viviparous, legs = y, warm-blooded = y, diet = carnivorous, activity = nocturnal
then MAMMAL;
Rule 1.12346789
if skin-covering = hair, breathing = lungs, tail = none, can-fly = y, legs = y, warm-blooded = y, diet = carnivorous, activity = nocturnal
then MAMMAL;
and so on for all the ways of dropping a single attribute, followed by all the ways of dropping two attributes, three attributes etc. Any rule which includes in its cover a negative example, i.e. a non-mammal, is incorrect and is discarded during the process. The cycle terminates by saving a set of shortest rules covering only mammals. As a classifier, such a set is guaranteed correct, but cannot be guaranteed complete, as we shall see later.
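A sketch of this specific-to-general search: starting from a seed rule (a full attribute-value description of one positive example), conditions are dropped in all ways, and the shortest rules that cover no negative example are kept. This is a brute-force enumeration, adequate only for tiny attribute sets, and the toy examples below are our own:

```python
from itertools import combinations

def covers(rule, example):
    """A rule (a dict of attribute tests) covers an example if every test matches."""
    return all(example.get(a) == v for a, v in rule.items())

def shortest_consistent_rules(seed, negatives):
    """Drop attributes from the seed in all ways; return the shortest rules
    that cover no negative example."""
    attrs = list(seed)
    for size in range(1, len(attrs) + 1):          # smallest rules first
        found = []
        for kept in combinations(attrs, size):
            rule = {a: seed[a] for a in kept}
            if not any(covers(rule, neg) for neg in negatives):
                found.append(rule)
        if found:
            return found
    return [seed]

seed = {"skin-covering": "hair", "breathing": "lungs", "can-fly": "y",
        "reproduction": "viviparous", "warm-blooded": "y"}
negatives = [{"skin-covering": "feathers", "breathing": "lungs", "can-fly": "y",
              "reproduction": "oviparous", "warm-blooded": "y"},      # a bird
             {"skin-covering": "scales", "breathing": "gills", "can-fly": "n",
              "reproduction": "oviparous", "warm-blooded": "n"}]      # a fish
print(shortest_consistent_rules(seed, negatives))
# -> [{'skin-covering': 'hair'}, {'reproduction': 'viviparous'}]
```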
In the present case the terminating set has the single-attribute description:
Rule 1.1
if skin-covering = hair
then MAMMAL;
The process now iterates using a new seed for each iteration, for example:
Rule 2.123456789
if skin-covering = none, breathing = lungs, tail = none, can-fly = n, reproduction = viviparous, legs = n, warm-blooded = y, diet = mixed, activity = diurnal
then MAMMAL;
leading to the following set of shortest rules:
Rule 2.15
if skin-covering = none, reproduction = viviparous
then MAMMAL;
Rule 2.17
if skin-covering = none, warm-blooded = y
then MAMMAL;
Rule 2.67
if legs = n, warm-blooded = y
then MAMMAL;
Rule 2.57
if reproduction = viviparous, warm-blooded = y
then MAMMAL;
Of these, the first covers naked mammals. Amphibians, although uniformly naked, are oviparous. The second has the same cover, since amphibians are not warm-blooded, and birds, although warm-blooded, are not naked (we assume that classification is done on adult forms). The third covers various naked marine mammals. So far, these rules collectively contribute little information, merely covering a few overlapping pieces of a large patchwork. But the last rule at a stroke covers almost the whole class of mammals. Every attempt at further generalisation now encounters negative examples. Dropping warm-blooded causes the rule to cover viviparous groups of fish and of reptiles. Dropping viviparous causes the rule to cover birds, unacceptable in a mammal-recogniser. But it also has the effect of including the egg-laying mammals Monotremes, consisting of the duck-billed platypus and two species of spiny ant-eaters. Rule 2.57 fails to cover these, and is thus an instance of the earlier-mentioned kind of classifier that can be guaranteed correct, but cannot be guaranteed complete. Conversion into a complete and correct classifier is not an option for this purely specific-to-general process, since we have run out of permissible generalisations. The construction of Rule 2.57 has thus stalled in sight of the finishing line.
But linking two or more rules together, each correct but not complete, can effect the desired result. Below we combine the rule yielded by the first iteration with, in turn, the first and the second rule obtained from the second iteration:
Rule 1.1
if skin-covering = hair
then MAMMAL;
Rule 2.15
if skin-covering = none, reproduction = viviparous
then MAMMAL;
Rule 1.1
if skin-covering = hair
then MAMMAL;
Rule 2.17
if skin-covering = none, warm-blooded = y
then MAMMAL;
These can equivalently be written as disjunctive rules:
if
skin-covering = hair
or
skin-covering = none, reproduction = viviparous
then
MAMMAL;
and
if
skin-covering = hair
or
skin-covering = none, warm-blooded = y
then
MAMMAL;
In rule induction, following Michalski, an attribute-test is called a selector, a conjunction of selectors is a complex, and a disjunction of complexes is called a cover. If a rule is true of an example we say that it covers the example. Rule learning systems in practical use qualify and elaborate the above simple scheme, including by assigning a prominent role to general-to-specific processes. In the StatLog experiment such algorithms are exemplified by CN2 (Clark & Niblett, 1989) and ITrule. Both generate decision rules for each class in turn, for each class starting with a universal rule which assigns all examples to the current class. This rule ought to cover at least one of the examples belonging to that class. Specialisations are then repeatedly generated and explored until all rules consistent with the data are found. Each rule must correctly classify at least a prespecified percentage of the examples belonging to the current class. As few as possible negative examples, i.e. examples in other classes, should be covered. Specialisations are obtained by adding a condition to the left-hand side of the rule.
CN2 is an extension of Michalski's (1969) algorithm AQ with several techniques to process noise in the data. The main technique for reducing error is to minimise the Laplacian error estimate
$$\frac{j + c - 1}{k + j + c},$$
where $k$ is the number of examples classified correctly by a rule, $j$ is the number classified incorrectly, and $c$ is the total number of classes.
ITrule produces rules of the form 'if ... then ... with probability ...'. This algorithm contains probabilistic inference through the J-measure, which evaluates its candidate rules. The J-measure is a product of prior probabilities for each class and the cross-entropy of class values conditional on the attribute values. ITrule cannot deal with continuous numeric values. It needs accurate evaluation of prior and posterior probabilities. So when such information is not present it is prone to misuse. Detailed accounts of these and other algorithms are given in Section 5.2.
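A small sketch of this Laplacian error estimate, as it might be used to compare candidate rules (variable names follow the text):

```python
def laplace_error(k, j, c):
    """Expected error of a rule covering k correct and j incorrect examples,
    in a c-class problem: (j + c - 1) / (k + j + c)."""
    return (j + c - 1) / (k + j + c)

# A rule covering 20 examples of its class and 2 others, in a 3-class problem,
# is preferred to one covering 3 and 0:
print(laplace_error(20, 2, 3))   # 0.16
print(laplace_error(3, 0, 3))    # ~0.33
```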
5.1.3 Decision trees
(Tree diagram: a root node testing skin-covering? with branches feathers, scales, hair and none; the none branch leads to a further test viviparous? with branches yes and no; leaves are labelled MAMMAL or NOT-MAMMAL.)
Fig. 5.3: Translation of a mammal-recognising rule (Rule 2.15, see text) into tree form. The attribute-values that figured in the rule-sets built earlier are here set larger in bold type. The rest are tagged with NOT-MAMMAL labels.
In common with CN2 and ITrule but in contrast to the specific-to-general earlier style of Michalski's AQ family of rule learning, decision-tree learning is general-to-specific. In illustrating with the vertebrate taxonomy example we will assume that the set of nine attributes is sufficient to classify without error all vertebrate species into one of MAMMAL, BIRD, AMPHIBIAN, REPTILE, FISH. Later we will consider elaborations necessary in underspecified or in inherently noisy domains, where methods from statistical data analysis enter the picture.
As shown in Figure 5.4, the starting point is a tree of only one node that allocates all cases in the training set to a single class. In the case that a mammal-recogniser is required, this default class could be NOT-MAMMAL. The presumption here is that in the population there are more of these than there are mammals.
Unless all vertebrates in the training set are non-mammals, some of the training set of cases associated with this single node will be correctly classified and others incorrectly; in the terminology of Breiman and colleagues (1984), such a node is 'impure'. Each available attribute is now used on a trial basis to split the set into subsets. Whichever split minimises the estimated impurity of the subsets which it generates is retained, and the cycle is repeated on each of the augmented tree's end-nodes.
Numerical measures of impurity are many and various. They all aim to capture the degree to which expected frequencies of belonging to given classes (possibly estimated, for
Note that this schema is general enough to include multi-class trees, raising a tactical problem in approaching the taxonomic material. Should we build in turn a set of yes/no recognizers, one for mammals, one for birds, one for reptiles, etc., and then daisy-chain them into a tree? Or should we apply the full multi-class procedure to the data wholesale, risking a disorderly scattering of different class labels along the resulting tree's perimeter? If the entire tree-building process is automated, as for the later standardised comparisons, the second regime is mandatory. But in interactive decision-tree building there is no generally correct answer. The analyst must be guided by context, by user-requirements and by intermediate results.
(Diagram: an empty attribute-test node labelled NOT-MAMMAL. If no misclassifications occur, the leaf is confirmed (solid lines) and the procedure EXITs. If misclassifications occur, an attribute is chosen for splitting the set, and for each candidate attribute (skin-covering?, breathing?, tail?, and so on) a purity measure is calculated from tabulations of the number of MAMMALs and NOT-MAMMALs in each subset.)
Fig. 5.4: First stage in growing a decision tree from a training set. The single end-node is a candidate to be a leaf, and is here drawn with broken lines. It classifies all cases to NOT-MAMMAL. If correctly, the candidate is confirmed as a leaf. Otherwise available attribute-applications are tried for their abilities to split the set, saving for incorporation into the tree whichever maximises some chosen purity measure. Each saved subset now serves as a candidate for recursive application of the same split-and-test cycle.
and
if (skin-covering = hair)
then MAMMAL
if (skin-covering = feathers)
then NOT-MAMMAL
if (skin-covering = scales)
then NOT-MAMMAL
if (skin-covering = none)
then if (warm-blooded = y)
then MAMMAL
else NOT-MAMMAL
Step 5: EXIT
Fig. 5.5: Illustration, using the MAMMAL problem, of the basic idea of decision-tree induction.
Either way, the crux is the idea of refining T into subsets of cases that are, or seem to be heading towards, single-class collections of cases. This is the same as the earlier described search for purity. Departure from purity is used as the 'splitting criterion', i.e. as the basis on which to select an attribute to apply to the members of a less pure node for partitioning it into purer sub-nodes. But how to measure departure from purity? In practice, as noted by Breiman et al., overall misclassification rate is not sensitive to the choice of a splitting rule, as long as it is within a reasonable class of rules. For a more general consideration of splitting criteria, we first introduce the case where total purity of nodes is not attainable: i.e. some or all of the leaves necessarily end up mixed with respect to class membership. In these circumstances the term 'noisy data' is often applied. But we must remember that noise (i.e. irreducible measurement error) merely characterises one particular form of inadequate information. Imagine the multi-class taxonomy problem under the condition that skin-covering, tail, and viviparous are omitted from the attribute set. Owls and bats, for example, cannot now be discriminated. Stopping rules based on complete purity have then to be replaced by something less stringent.
5.1.5
One method, not necessarily recommended, is to stop when the purity measure exceeds some threshold. The trees that result are no longer strictly 'decision trees' (although for brevity we continue to use this generic term), since a leaf is no longer guaranteed to contain a single-class collection, but instead a frequency distribution over classes. Such trees are known as 'class probability trees'. Conversion into classifiers requires a separate mapping from distributions to class labels. One popular but simplistic procedure says 'pick the candidate with the most votes'. Whether or not such a plurality rule makes sense depends in each case on (1) the distribution over the classes in the population from which the training set was drawn, i.e. on the priors, and (2) differential misclassification costs. Consider two errors: classifying the shuttle main engine as ok to fly when it is not, and classifying it as not ok when it is. Obviously the two costs are unequal.
Use of purity measures for stopping, sometimes called 'forward pruning', has had mixed results. The authors of two of the leading decision tree algorithms, CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), independently arrived at the opposite philosophy, summarised by Breiman and colleagues as 'Prune instead of stopping. Grow a tree that is much too large and prune it upward ...'. This is sometimes called 'backward pruning'. These authors' definition of 'much too large' requires that we continue splitting until each terminal node
either is pure,
or contains only identical attribute-vectors (in which case splitting is impossible),
or has fewer than a pre-specified number of distinct attribute-vectors.
Approaches to the backward pruning of these 'much too large' trees form the topic of a later section. We first return to the concept of a node's purity in the context of selecting one attribute in preference to another for splitting a given node.
5.1.6 Splitting criteria
Readers accustomed to working with categorical data will recognise in Figure 5.4 cross-tabulations reminiscent of the contingency tables of statistics. For example, it only requires completion of the column totals of the second tabulation to create the standard input to a two-by-two $\chi^2$ test. The hypothesis under test is that the distribution of cases between MAMMALs and NOT-MAMMALs is independent of the distribution between the two breathing modes. A possible rule says that the smaller the probability obtained by applying a $\chi^2$ test to this hypothesis, then the stronger the splitting credentials of the attribute 'breathing'. Turning to the construction of multi-class trees rather than yes/no concept-recognisers, an adequate number of fishes in the training sample would, under almost any purity criterion, ensure early selection of 'breathing'. Similarly, given adequate representation of reptiles, tail = long would score highly, since lizards and snakes account for 95% of living reptiles. The corresponding 5 x 3 contingency table would have the form given in Table 5.1. On the hypothesis of no association, the expected numbers in the cells can be got from the marginal totals; thus, for example,
$$\text{expected}(\text{MAMMAL, long}) = \frac{\text{total}(\text{MAMMAL}) \times \text{total}(\text{long})}{N},$$
where $N$ is the total in the training set. Then
$$\sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
is distributed as $\chi^2$, with degrees of freedom equal to $(5 - 1)\times(3 - 1)$, i.e. 8 in this case.
Suppose, however, that the tail variable were not presented in the form of a categorical attribute with three unordered values, but rather as a number, as the ratio, for example, of the length of the tail to that of the combined body and head. Sometimes the first step is to apply some form of clustering method or other approximation. But virtually every algorithm then selects, from all the dichotomous segmentations of the numerical scale meaningful for a given node, that segmentation that maximises the chosen purity measure over classes.
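A sketch of this selection of a dichotomous segmentation: each threshold between adjacent observed values is tried, and the one giving the purest pair of subsets is kept. The Gini index used here is only one of the many impurity measures mentioned above, and the example data are our own:

```python
import numpy as np
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return the threshold on a numeric attribute minimising the
    size-weighted impurity of the two subsets it creates."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_score = None, float("inf")
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        t = (v[i] + v[i - 1]) / 2
        score = (i * gini(y[:i]) + (len(v) - i) * gini(y[i:])) / len(v)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

tail_ratio = [0.05, 0.1, 0.6, 0.8, 0.9, 1.2]
cls = ["MAMMAL", "MAMMAL", "REPTILE", "REPTILE", "REPTILE", "REPTILE"]
print(best_split(tail_ratio, cls))   # threshold 0.35 separates the two classes
```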
With suitable refinements, the CHAID decision-tree algorithm (CHi-squared Automatic Interaction Detection) uses a splitting criterion such as that illustrated with the foregoing contingency table (Kass, 1980). Although not included in the present trials, CHAID enjoys widespread commercial availability through its inclusion as an optional module in the SPSS statistical analysis package.
Other approaches to such tabulations as the above use information theory. We then enquire 'what is the expected gain in information about a case's row-membership from knowledge of its column-membership?'. Methods and difficulties are discussed by Quinlan (1993). The reader is also referred to the discussion in Section 7.3.3, with particular reference to mutual information.
A related, but more direct, criterion applies Bayesian probability theory to the weighing
of evidence (see Good, 1950, for the classical treatment) in a sequential testing framework
(Wald, 1947). Logarithmic measure is again used, namely log-odds or plausibilities
5.1.7
CART's, and C4.5's, pruning starts with growing a tree that is 'much too large'. How large is 'too large'? As tree-growth continues and end-nodes multiply, the sizes of their associated samples shrink. Probability estimates formed from the empirical class-frequencies at the leaves accordingly suffer escalating estimation errors. Yet this only says that overgrown trees make unreliable probability estimators. Given an unbiased mapping from probability estimates to decisions, why should their performance as classifiers suffer?
Performance is indeed impaired by overfitting, typically more severely in tree-learning than in some other multi-variate methods. Figure 5.6 typifies a universally observed relationship between the number of terminal nodes (x-axis) and misclassification rates (y-axis). Breiman et al., from whose book the figure has been taken, describe this relationship as 'a fairly rapid initial decrease followed by a long, flat valley and then a gradual increase ... In this long, flat valley, the minimum is almost constant except for up-down changes well within the ±1 SE range.' Meanwhile the performance of the tree on the training sample (not shown in the Figure) continues to improve, with an increasingly over-optimistic error rate usually referred to as the 'resubstitution' error. An important lesson that can be drawn from inspection of the diagram is that large simplifications of the tree can be purchased at the expense of rather small reductions of estimated accuracy.
Overfitting is the process of inferring more structure from the training sample than is justified by the population from which it was drawn. Quinlan (1993) illustrates the seeming paradox that an overfitted tree can be a worse classifier than one that has no information at all beyond the name of the dataset's most numerous class.
This effect is readily seen in the extreme example of random data in which the class of each case is quite unrelated to its attribute values. I constructed an artificial dataset of this kind with ten attributes, each of which took the value 0 or 1 with equal probability. The class was also binary, yes with probability 0.25 and no with probability 0.75. One thousand randomly generated cases were split into a training set of 500 and a test set of 500. From this data, C4.5's initial tree-building routine
Fig. 5.6: A typical plot of misclassification rate against different levels of growth of a fitted tree. Horizontal axis: no. of terminal nodes. Vertical axis: misclassification rate measured on test data.
produces a nonsensical tree of 119 nodes that has an error rate of more than 35% on the test cases ... For the random data above, a tree consisting of just the leaf no would have an expected error rate of 25% on unseen cases, yet the elaborate tree is noticeably less accurate. While the complexity comes as no surprise, the increased error attributable to overfitting is not intuitively obvious.
To explain this, suppose we have a two-class task in which a case's class is inherently indeterminate, with a proportion p of the cases belonging to the majority class (here no). If a classifier assigns all such cases to this majority class, its expected error rate is clearly 1 − p. If, on the other hand, the classifier assigns a case to the majority class with probability p and to the other class with probability 1 − p, its expected error rate is the sum of the probability that a case belonging to the majority class is assigned to the other class, p × (1 − p), and the probability that a case belonging to the other class is assigned to the majority class, (1 − p) × p, which comes to 2 × p × (1 − p). Since p is at least 0.5, this is generally greater than 1 − p, so the second classifier will have a higher error rate. Now, the complex decision tree bears a close resemblance to this second type of classifier. The tests are unrelated to class so, like a symbolic pachinko machine, the tree sends each case randomly to one of the leaves. ...
Quinlan points out that the probability of reaching a leaf labelled with class C is the same as the relative frequency of C in the training data, and concludes that the tree's expected error rate for the random data above is 2 × 0.25 × 0.75, or 37.5%, quite close to the observed value.
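The effect is easy to reproduce. The sketch below is an assumption-laden re-creation of the random-data setup just described (it is not Quinlan's code): it compares a majority-class predictor with one that assigns classes in proportion to their frequencies, whose error rate approaches 2 × p × (1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)
p_no = 0.75                                   # majority-class probability, as in the text

# Test cases whose class is unrelated to any attribute (True = "no").
y_test = rng.random(500) < p_no

# Classifier 1: always predict the majority class "no".
err_majority = np.mean(y_test != True)

# Classifier 2: predict "no" with probability p_no, "yes" otherwise,
# mimicking a tree whose leaves merely reflect the class frequencies.
y_pred = rng.random(500) < p_no
err_frequency = np.mean(y_test != y_pred)

print(f"majority-class error     ~ {err_majority:.2f}  (expected {1 - p_no:.2f})")
print(f"frequency-matching error ~ {err_frequency:.2f}  (expected {2 * p_no * (1 - p_no):.3f})")
```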
Given the acknowledged perils of overtting, how should backward pruning be applied
to a too-large tree? The methods adopted for CART and C4.5 follow different philosophies,
and other decision-tree algorithms have adopted their own variants. We have now reached
the level of detail appropriate to Section 5.2, in which specific features of the various tree
and rule learning algorithms, including their methods of pruning, are examined. Before
proceeding to these candidates for trial, it should be emphasized that their selection was
necessarily to a large extent arbitrary, having more to do with the practical logic of coordinating a complex and geographically distributed project than with judgements of merit
or importance. Apart from the omission of entire categories of ML (as with the genetic and ILP algorithms discussed in Chapter 12), particular contributions to decision-tree learning
should be acknowledged that would otherwise lack mention.
First a major historical role, which continues today, belongs to the Assistant algorithm
developed by Ivan Bratko's group in Slovenia (Cestnik, Kononenko and Bratko, 1987).
Assistant introduced many improvements for dealing with missing values, attribute splitting and pruning, and has also recently incorporated the m-estimate method (Cestnik and
Bratko, 1991; see also Dzeroski, Cesnik and Petrovski, 1993) of handling prior probability
assumptions.
Second, an important niche is occupied in the commercial sector of ML by the XpertRule
family of packages developed by Attar Software Ltd. Facilities for large-scale data analysis
are integrated with sophisticated support for structured induction (see for example Attar,
1991). These and other features make this suite currently the most powerful and versatile
facility available for industrial ML.
5.2 STATLOG'S ML ALGORITHMS
5.2.1 C4.5
The reader should be aware that the two versions of C4.5 used in the StatLog trials differ in
certain respects from the present version which was recently presented in Quinlan (1993).
The version on which accounts in Section 5.1 are based is that of the radical upgrade,
described in Quinlan (1993).
5.2.2
NewID
NewID is a decision tree algorithm similar to C4.5. Like C4.5, NewID inputs a set of examples, a set of attributes and a class. Its output is a decision tree, which performs (probabilistic) classification. Unlike C4.5, NewID does not perform windowing. Thus its core procedure is simpler:
1. if all the examples at the current node belong to the same class, stop and label the node with that class;
2. otherwise, evaluate each remaining attribute with the evaluation function and split the examples on the attribute with the best score;
3. apply the same procedure recursively to each subset of examples produced by the split.
The termination condition is simpler than that of C4.5, i.e. it terminates when the node contains all examples in the same class. This simple-minded strategy tries to overfit the training
data and will produce a complete tree from the training data. NewID deals with empty
leaf nodes as C4.5 does, but it also considers the possibility of clashing examples. If the
set of (untested) attributes is empty it labels the leaf node as CLASH, meaning that it is
impossible to distinguish between the examples. In most situations the attribute set will
not be empty. So NewID discards attributes that have been used, as they can contribute no
more information to the tree.
For classification problems, where the class values are categorical, the evaluation function of NewID is the information gain function. It does a similar 1-level lookahead to determine the best attribute to split on using a greedy search. It also handles numeric attributes in the same way as C4.5 does, using the attribute subsetting method.
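For concreteness, here is a minimal sketch of an information-gain evaluation function of the kind described; it is an illustration, not NewID's actual implementation, and the toy data are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, class_of):
    """Gain from splitting `examples` (a list of dicts) on `attribute`."""
    before = entropy([class_of(e) for e in examples])
    n = len(examples)
    after = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [class_of(e) for e in examples if e[attribute] == v]
        after += len(subset) / n * entropy(subset)
    return before - after

# Toy usage with hypothetical examples:
data = [{"tail": "long", "cls": "lizard"}, {"tail": "long", "cls": "lizard"},
        {"tail": "short", "cls": "snake"}, {"tail": "absent", "cls": "snake"}]
print(information_gain(data, "tail", lambda e: e["cls"]))   # 1.0 bit
```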
Numeric class values
NewID allows numeric class values and can produce a regression tree. For each split, it aims to reduce the spread of class values in the subsets introduced by the split, instead of trying to gain the most information. Formally, for each ordered categorical attribute it considers the possible splits and chooses the one that minimises the sum, over the subsets produced by the split, of the standard deviation σ of the class values of the examples falling in each subset; the criterion is computed over the original example set E and moderated by a user-tunable constant.
Missing values
There are two types of missing values in NewID: unknown values and dont-care values.
During the training phase, if an example of class has an unknown attribute value, it is
split into fractional examples for each possible value of that attribute. The fractions of
the different values sum to 1. They are estimated from the numbers of examples of the
same class with a known value of that attribute.
Consider an attribute A with values a1 and a2. There are 9 examples at the current node in class c with values for A: 6 a1, 2 a2 and 1 missing (?). Naively, we would split the ? in the ratio 6 to 2 (i.e. 75% a1 and 25% a2). However, the Laplace criterion gives a better estimate of the expected ratio of a1 to a2 using the formula:
    prob(a1) = (number of examples in class c with A = a1 + 1) / (number of examples in class c with a known value of A + number of values of A) = (6 + 1) / (8 + 2) = 0.7,

and similarly prob(a2) = (2 + 1) / (8 + 2) = 0.3, so the missing value is split in the ratio 0.7 to 0.3.
5.2.3 AC2
A central concern is the dialog and interaction of the system with the user. The user interacts with the system via a graphical interface. This interface consists of graphical editors, which enable the user to define the domain, to interactively build the data base, and to go through the hierarchy of classes and the decision tree.
The system can be viewed as an extension of a tree induction algorithm that is essentially the same as NewID. Because of its user interface, it allows a more natural manner of interaction with a domain expert, the validation of the trees produced, and the test of their accuracy and reliability. It also provides a simple, fast and cheap method to update the rule and data bases. It produces, from data and known rules (trees) of the domain, either a decision tree or a set of rules designed to be used by an expert system.
5.2.4 CART and IndCART
CART, Classification and Regression Tree, is a binary decision tree algorithm (Breiman et
al., 1984), which has exactly two branches at each internal node. We have used two different
implementations of CART: the commercial version of CART and IndCART, which is part
of the Ind package (also see Naive Bayes, Section 4.5). IndCART differs from CART as
described in Breiman et al. (1984) in using a different (probably better) way of handling
missing values, in not implementing the regression part of CART, and in the different
pruning settings.
Evaluation function for splitting
The evaluation function used by CART is different from that in the ID3 family of algorithms.
Consider the case of a problem with two classes where a node has 100 examples, 50 from each class: the node has maximum impurity. If a split could be found that divided the data into one
subgroup of 40:5 and another of 10:45, then intuitively the impurity has been reduced. The
impurity would be completely removed if a split could be found that produced sub-groups
50:0 and 0:50. In CART this intuitive idea of impurity is formalised in the GINI index for the current node t:

    Gini(t) = 1 − Σ_i p_i²,

where p_i is the probability of class i in t. For each possible split the impurity of the subgroups is summed and the split with the maximum reduction in impurity chosen.
For ordered and numeric attributes, CART considers all possible splits in the sequence of values: for n distinct values of the attribute, there are n − 1 splits. For categorical attributes CART examines all possible binary splits, which is the same as attribute subsetting used for C4.5; for n values of the attribute, there are 2^(n−1) − 1 such splits. At each node CART searches through the attributes one by one. For each attribute it finds the best split. Then it compares the best single splits and selects the best attribute of the best splits.
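The following sketch illustrates the GINI index and the search over the n − 1 candidate splits of an ordered attribute; it is a toy illustration, not CART's implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_ordered_split(values, labels):
    """Best threshold for an ordered attribute, scored by the weighted
    Gini impurity of the two subgroups it produces."""
    n = len(values)
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):          # n - 1 candidate cut points
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_ordered_split(values, labels))               # (3.5, 0.0)
```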
It is a two-stage method. Considering the first stage, let T be a decision tree used to classify the n examples in the training set E, and let M be the set of misclassified examples, of size m. If L(T) is the number of leaves in T, the cost complexity of T for some parameter α is

    R_α(T) = R(T) + α × L(T),

where R(T) = m/n is the error estimate of T. If we regard α as the cost for each leaf, R_α(T) is a linear combination of its error estimate and a penalty for its complexity. If α is small the penalty for having a large number of leaves is small and the minimising tree will be large. As α increases, the minimising subtree will decrease in size. Now suppose we convert some subtree of T to a leaf, giving a new tree T' that misclassifies m' more examples but contains L(T) − L(T') fewer leaves. The cost complexity of T' is the same as that of T if

    α = m' / ( n × ( L(T) − L(T') ) ).
This critical value of α is computed for every candidate subtree, and the subtree with the smallest value is replaced by a leaf; repeating the process as α increases produces a nested sequence of progressively smaller trees. In the second stage one tree in this sequence is selected: the candidates are evaluated on an independent test set or by cross-validation, and the tree with the lowest estimated error (or the smallest tree whose error lies within one standard error of that minimum) is chosen.
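A minimal sketch of the cost-complexity arithmetic above, with invented counts (not the CART or IndCART code):

```python
def cost_complexity(misclassified, n, leaves, alpha):
    """R_alpha(T) = R(T) + alpha * number of leaves, with R(T) = misclassified / n."""
    return misclassified / n + alpha * leaves

def critical_alpha(extra_errors, n, leaves_removed):
    """Value of alpha at which collapsing a subtree (misclassifying `extra_errors`
    more examples while removing `leaves_removed` leaves) leaves the cost
    complexity unchanged."""
    return extra_errors / (n * leaves_removed)

# Hypothetical tree: 500 training examples, 20 misclassified, 30 leaves.
# Collapsing one subtree adds 4 errors and removes 5 leaves.
alpha = critical_alpha(extra_errors=4, n=500, leaves_removed=5)
print(alpha)                                          # 0.0016
print(cost_complexity(20, 500, 30, alpha))            # equals...
print(cost_complexity(24, 500, 25, alpha))            # ...this, by construction
```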
5.2.5 Cal5
Cal5 is especially designed for continuous and ordered discrete valued attributes, though
an added sub-algorithm is able to handle unordered discrete valued attributes as well.
Let the examples be described by the available attributes, so that each example is a point in the feature space. CAL5 separates the examples into areas, each represented by a subset of the samples, in which some class c occurs with a probability p(c) > S, where S is a decision threshold. Similar to other decision tree methods, only class areas bounded by hyperplanes parallel to the axes of the feature space are possible.
Evaluation function for splitting
The tree is constructed sequentially, starting with one attribute and branching with other attributes recursively if no sufficient discrimination of classes can be achieved. That is, if at a node no decision for a class according to the above formula can be made, a branch formed with a new attribute is appended to the tree. If this attribute is continuous, a discretisation, i.e. intervals corresponding to qualitative values, has to be used.
Let the current node be a certain non-leaf node in the tree construction process. At first the attribute with the best local discrimination measure at this node has to be determined. For that, two different methods can be used (controlled by an option): a statistical and an entropy measure, respectively. The statistical approach works without any knowledge about the result of the desired discretisation; for continuous attributes it uses the quotient measure of Meyer-Brötz & Schürmann (1970).
On the axis of the chosen attribute the examples reaching the node are ordered and collected into intervals, and for each interval the relative frequencies of the classes occurring in it are estimated, together with a confidence interval for each class probability at a prescribed confidence level. Two hypotheses are then tested for each interval:

H1: the confidence interval for the probability of some class lies completely above the decision threshold S; i.e. H1 is true if the complete confidence interval lies above the predefined threshold;

H2: for every class the confidence interval lies completely below S; i.e. this hypothesis is true if for each class the complete confidence interval is less than the threshold.

Now the following meta-decision on the dominance of a class in an interval can be defined:
1. if H1 holds for some class, that class dominates in the interval, which is closed as a leaf labelled with the dominant class (action (a));
2. if H2 holds, no class dominates in the interval, and the interval is branched further by appending a new attribute (action (b));
3. if neither H1 nor H2 holds, no decision can yet be made, and the interval is enlarged by merging in neighbouring examples on the axis until one of the hypotheses can be decided.
Merging
Adjacent intervals with the same class label can be merged. The resultant intervals yield the leaf nodes of the decision tree. The same rule is applied for adjacent intervals where no class dominates and which contain identical remaining classes due to the following elimination procedure. A class c is removed from an interval I if the inequality

    p(c | I) < 1 / m(I)

is satisfied, where m(I) is the total number of different class labels occurring in I (i.e. a class will be omitted if its probability in I is less than the value of an assumed uniform distribution over all classes occurring in I). These resultant intervals yield the intermediate nodes in the construction of the decision tree, for which further branching will be performed.
Every intermediate node becomes the start node for a further iteration step, repeating the steps described above. The algorithm stops when all intermediate nodes have terminated. Note that a majority decision is made at a node if, because of too small a number of examples, no estimation of probability can be done.
Discrete unordered attributes
To distinguish between the different types of attributes the program needs a special input
vector. The algorithm for handling unordered discrete valued attributes is similar to that described above, apart from the interval construction. Instead of intervals,
discrete points on the axis of the current attribute have to be considered. All examples with
the same value of the current discrete attribute are related to one point on the axis. For
each point the hypotheses H1 and H2 will be tested and the corresponding actions (a) and
(b) performed, respectively. If neither H1 nor H2 is true, a majority decision will be made.
This approach also allows handling mixed (discrete and continuous) valued attributes.
Probability threshold and confidence
As can be seen from the above, two parameters affect the tree construction process: the first is the predefined threshold S for accepting a class at a node, and the second is a predefined confidence level. If the conditional probability of a class exceeds the threshold, the tree is pre-pruned at that node. The choice of S should depend on the training (or pruning) set and determines the accuracy of the approximation of the class hyperplane, i.e. the admissible error rate. The higher the degree of overlapping of class regions in the feature space, the lower the threshold has to be to get a reasonable classification result.
Therefore by selecting the value of S the accuracy of the approximation and simultaneously the complexity of the resulting tree can be controlled by the user. In addition to a constant threshold, the algorithm allows the threshold to be chosen in a class dependent manner, taking into account different costs for misclassification of different classes. In other words the influence of a given cost matrix can be taken into account during training, if the different costs for misclassification can be reflected by a class dependent threshold vector.
One approach has been adopted by CAL5:
1. every column j of the cost matrix is summed, giving s_j;
2. the threshold S_max of the class whose column sum is largest (s_max) has to be chosen by the user, as in the case of a constant threshold;
3. the other thresholds are computed by the formula S_j = S_max × (s_j / s_max)^w, where w is an adjustable parameter.
From experience, w should be set to one. Thus all values of the class dependent thresholds are proportional to their corresponding column sums of the cost matrix, which can be interpreted as a penalty measure for misclassification into those classes.
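A small sketch of this construction, assuming a cost matrix whose columns correspond to predicted classes and using the proportionality rule above (the exponent w, the base threshold and the example costs are illustrative):

```python
import numpy as np

def class_thresholds(cost_matrix, s_max_threshold, w=1.0):
    """Class-dependent decision thresholds proportional to the column sums
    of the cost matrix (raised to the power w, set to 1 by default)."""
    col_sums = cost_matrix.sum(axis=0)
    s_max = col_sums.max()
    return s_max_threshold * (col_sums / s_max) ** w

# Hypothetical 3-class cost matrix (rows: true class, columns: predicted class).
costs = np.array([[0, 1, 4],
                  [1, 0, 2],
                  [3, 2, 0]])
print(class_thresholds(costs, s_max_threshold=0.9))   # one threshold per class
```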
Compared with the threshold, the confidence level for estimating the class probabilities has an inversely proportional effect: the more demanding the required quality of estimation, the worse the ability to separate intervals, since the algorithm is forced to construct large intervals in order to obtain sufficient statistics.
A suitable approach for automatically choosing the threshold and the confidence level is not available. Therefore a program that varies the two parameters over default ranges in fixed steps is used to find the best parameter combination, i.e. that which gives the minimum cost (or error rate, respectively) on a test set. However, this procedure may be computationally expensive in relation to the number of attributes and the size of the data set.
5.2.6
Bayes tree
This is a Bayesian approach to decision trees that is described by Buntine (1992), and is
available in the IND package. It is based on a full Bayesian approach: as such it requires the specification of prior class probabilities (usually based on empirical class proportions),
and a probability model for the decision tree. A multiplicative probability model for the
probability of a tree is adopted. Using this form simplifies the problem of computing tree
probabilities, and the decision to grow a tree from a particular node may then be based on
the increase in probability of the resulting tree, thus using only information local to that
node. Of all potential splits at that node, that split is chosen which increases the posterior
probability of the tree by the greatest amount.
Post-pruning is done by using the same principle, i.e. choosing the cut that maximises
the posterior probability of the resulting tree. Of all those tree structures resulting from
pruning a node from the given tree, choose that which has maximum posterior probability.
An alternative to post-pruning is to smooth class probabilities. As an example is
dropped down the tree, it goes through various nodes. The class probabilities of each node
visited contribute to the final class probabilities (by a weighted sum), so that the final class probabilities inherit probabilities evaluated higher up the tree. This stabilises the class
probability estimates (i.e. reduces their variance) at the expense of introducing bias.
Costs may be included in learning and testing via a utility function for each class (the
utility is the negative of the cost for the two-class case).
5.2.7 CN2
This algorithm of Clark and Niblett's was sketched earlier. It aims to modify the basic AQ algorithm of Michalski in such a way as to equip it to cope with noise and other complications in the data. In particular, during its search for good complexes CN2 does not automatically remove from consideration a candidate that is found to include one or more negative examples. Rather it retains a set of complexes in its search that is evaluated
statistically as covering a large number of examples of a given class and few of other classes.
Moreover, the manner in which the search is conducted is general-to-specic. Each trial
specialisation step takes the form of either adding a new conjunctive term or removing a
disjunctive one. Having found a good complex, the algorithm removes those examples it
covers from the training set and adds the rule if <complex> then predict <class> to the
end of the rule list. The process terminates for each given class when no more acceptable
complexes can be found.
Clark & Niblett's (1989) CN2 algorithm has the following main features: 1) the dependence on specific training examples during search (a feature of the AQ algorithm) is removed; 2) it combines the efficiency and ability to cope with noisy data of decision-tree learning with the if-then rule form and flexible search strategy of the AQ family; 3) it contrasts with other approaches to modify AQ to handle noise in that the basic AQ algorithm itself is generalised rather than patched with additional pre- and post-processing techniques; and 4) it produces both ordered and unordered rules.
CN2 inputs a set of training examples E and outputs a set of rules called the rule list. The
core of CN2 is the procedure as follows, but it needs to use a sub-procedure to return the
value of best cpx:
1. Let rule list be the empty list;
2. Let best cpx be the best complex found from E;
3. If best cpx is nil or E is empty, then stop and return rule list;
4. Remove the examples covered by best cpx from E and add the rule 'if best cpx then class = c' to the end of rule list, where c is the most common class of the examples covered by best cpx; re-enter at step (2).
This procedure produces ordered rules. CN2 also produces a set of unordered rules, using a slightly different procedure: to produce unordered rules, the above procedure is repeated for each class in turn, and in step 4 only the positive examples are removed.
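A sketch of the ordered-rule loop in steps 1-4 above (the complex finder is passed in as a stand-in for the sub-procedure described next; this is an illustration, not the CN2 source):

```python
from collections import Counter

def cn2_ordered(examples, find_best_complex, class_of):
    """Outer loop of CN2 for ordered rules.

    `find_best_complex(examples)` should return a predicate over examples,
    or None when no acceptable complex can be found.
    """
    rule_list = []
    remaining = list(examples)
    while remaining:
        best_cpx = find_best_complex(remaining)
        if best_cpx is None:
            break
        covered = [e for e in remaining if best_cpx(e)]
        if not covered:
            break
        majority = Counter(class_of(e) for e in covered).most_common(1)[0][0]
        rule_list.append((best_cpx, majority))          # "if best_cpx then class = majority"
        remaining = [e for e in remaining if not best_cpx(e)]
    return rule_list

# Toy usage: examples are dicts, a complex is a test on one attribute.
data = [{"weather": "wet", "cls": "stay"}, {"weather": "dry", "cls": "go"},
        {"weather": "wet", "cls": "stay"}]
finder = lambda ex: ((lambda e: e["weather"] == "wet")
                     if any(e["weather"] == "wet" for e in ex) else None)
print(len(cn2_ordered(data, finder, lambda e: e["cls"])))   # 1 rule found
```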
The procedure for finding the best complex is as follows:
1. Let the set star contain only the empty complex and let best cpx be nil;
2. Let selectors be the set of all possible selectors;
3. If star is empty, then return the current best cpx;
4. Specialise all complexes in star, giving newstar, the set of conjunctions x ∧ y with x in star and y in selectors, and remove all complexes in newstar that are either in star (i.e. the unspecialised ones) or are null (i.e. self-contradictory);
5. For every complex C in newstar, if C is statistically significant (in significance) when tested on E and better than (in goodness) best cpx according to user-defined criteria when tested on E, then replace the current value of best cpx by C; remove the worst complexes from newstar until the size of newstar is below the user-defined maximum; set star to newstar and re-enter at step (3).
As can be seen from the algorithm, the basic operation of CN2 is that of generating a complex (i.e. a conjunction of attribute tests) which covers (i.e. is satisfied by) a subset of the training examples. This complex forms the condition part of a production rule 'if condition then class = c', where c is the most common class in the (training) examples which satisfy the condition. The condition is a conjunction of selectors, each of which represents a test on the values of an attribute, such as weather = wet. The search proceeds in both AQ and CN2 by repeatedly specialising candidate complexes until one which covers a large number of examples of a single class and few of other classes is located. Details of each search are outlined below.
CN2 uses one of several goodness criteria, selected by the user, to order the rules. To test significance, CN2 uses the entropy statistic. This is given by

    2 Σ_i f_i log( f_i / e_i ),

where f_i is the observed frequency of class i among the examples covered by the rule and e_i is the expected frequency of class i if the rule selected examples at random.
Missing values
Similar to NewID, CN2 can deal with unknown or don't-care values. During rule generation, a similar policy of handling unknowns and don't-cares is followed: unknowns are split into fractional examples and don't-cares are duplicated.
Each rule produced by CN2 is associated with a set of counts which corresponds to
the number of examples, covered by the rule, belonging to each class. Strictly speaking,
for the ordered rules the counts attached to rules when writing the rule set should be those
encountered during rule generation. However, for unordered rules, the counts to attach are
generated after rule generation in a second pass, following the execution policy of splitting
an example with unknown attribute value into equal fractions for each value rather than the
Laplace-estimated fractions used during rule generation.
When normally executing unordered rules without unknowns, for each rule which fires, the class distribution (i.e. distribution of training examples among classes) attached to the rule is collected. These are then summed. Thus a training example satisfying two rules with attached class distributions [8,2] and [0,1] has an expected distribution [8,3], which results in the first class being predicted, or class probabilities of 8/11 and 3/11 if probabilistic classification is desired. The built-in rule executer follows the first strategy (the example is classed simply as the first class).
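The voting arithmetic can be sketched as follows, using the [8,2] and [0,1] distributions from the text:

```python
def combine_rule_votes(distributions):
    """Sum the class distributions attached to all firing rules and normalise."""
    total = [sum(col) for col in zip(*distributions)]
    s = sum(total)
    return total, [c / s for c in total]

counts, probs = combine_rule_votes([[8, 2], [0, 1]])
print(counts)   # [8, 3] -> first class predicted
print(probs)    # [0.727..., 0.272...] i.e. 8/11 : 3/11
```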
With unordered CN2 rules, an attribute test whose value is unknown in the training example causes the example to be examined. If the attribute has three values, 1/3 of the example is deemed to have passed the test and thus the final class distribution is weighted by 1/3 when collected. A similar rule later will again cause 1/3 of the example to pass the test. A don't-care value is always deemed to have passed the attribute test in full (i.e. with weight 1). The normalisation of the class counts means that an example with a don't-care can only count as a single example during testing, unlike NewID where it may count as representing several examples.
With ordered rules, a similar policy is followed, except that after a rule has fired, absorbing, say, 1/3 of the testing example, only the remaining 2/3 is sent down the remainder of the rule list. The first rule will cause 1/3 class frequency to be collected, but a second similar rule will cause 2/3 × 1/3 class frequency to be collected. Thus the fraction of the example gets less and less as it progresses down the rule list. A don't-care value always
passes the attribute test in full, and thus no fractional example remains to propagate further
down the rule list.
Numeric attributes and rules
For numeric attributes, CN2 will partition the values into two subsets and test which subset each example belongs to. The drawback with a naive implementation of this is that the number of evaluations required grows exponentially with the number of attribute values. Breiman et al. (1984) proved that in the special case where there are two class values it is possible to find an optimal split with a number of comparisons that is only linear in the number of values. In the general case heuristic methods must be used.
The AQ algorithm produces an unordered set of rules, whereas the version of the CN2
algorithm used in StatLog produces an ordered list of rules. Unordered rules are on the
whole more comprehensible, but also require that they are qualified with some numeric
confidence measure to handle any clashes which may occur. With an ordered list of rules,
clashes cannot occur as each rule in the list is considered to have precedence over all
subsequent rules.
Relation between CN2 and AQ
There are several differences between these two algorithms; however, it is possible to show
that strong relationships exist between the two, so much so that simple modifications of the
CN2 system can be introduced to enable it to emulate the behaviour of the AQ algorithm.
See Michalski & Larson (1978).
AQ searches for rules which are completely consistent with the training data, whereas
CN2 may prematurely halt specialisation of a rule when no further rules above a certain
threshold of statistical significance can be generated via specialisation. Thus, the behaviour
of AQ in this respect is equivalent to setting the threshold to zero.
When generating specialisations of a rule, AQ considers only specialisations which exclude a specific negative example from the coverage of a rule, whereas CN2 considers all specialisations. However, specialisations generated by CN2 which don't exclude any negative examples will be rejected, as they do not contribute anything to the predictive accuracy of the rule. Thus, the two algorithms search the same space in different ways.
Whereas published descriptions of AQ leave open the choice of evaluation function to use during search, the published norm is that of the number of correctly classified examples divided by the total examples covered. The original CN2 algorithm uses entropy as its evaluation function. To obtain a synthesis of the two systems, the choice of evaluation function can be selected by the user when starting the system.
AQ generates order-independent rules, whereas CN2 generates an ordered list of rules.
To modify CN2 to produce order-independent rules requires a change to the evaluation
function, and a change to the way examples are removed from the training set between
iterations of the complex-finding algorithm. The basic search algorithm remains unchanged.
5.2.8 ITrule
Goodman & Smyth's (1989) ITrule algorithm uses a function called the J-measure to rank
the hypotheses during decision rule construction. Its output is a set of probability rules,
which are the most informative selected from the possible rules depending on the training
data.
The algorithm iterates through each attribute (including the class attribute) value in turn to build rules. It keeps a ranked list of the best rules determined to that point of the algorithm's execution; the length of the list is the size of the beam search. The J-measure of the lowest-ranked rule in the list is used as the running minimum to determine whether a new rule should be inserted into the rule list. For each attribute value the algorithm must find all possible conditions to add to the left hand side of an over-general rule to specialise it. Or it may decide to drop a condition to generalise an over-specific rule. The rules considered are those limited by the minimum running J-measure value, which prevents the algorithm from searching a large rule space.
Three points should be noted. First, ITrule produces rules for each attribute value. So it can also capture the dependency relationships between attributes, between attributes and classes and between class values. Secondly, ITrule not only specialises existing rules but also generalises them if the need arises. Specialisation is done through adding conditions to the left hand side of the rule and generalisation is done through dropping conditions. Finally, ITrule only deals with categorical examples, so numeric attributes generally need to be converted into discrete values.
Evaluation function: the J-measure
Let X be an attribute with values in the set {x1, ..., xn} and let Y be an attribute with values in {y1, ..., ym}. The J-measure is a method for calculating the information content of attribute X given the value of attribute Y = y. Its core quantity is

    j(X; Y = y) = Σ_x p(x | y) log( p(x | y) / p(x) ),

where p(x | y) is the conditional probability of X = x given Y = y and p(x) is the a priori probability of X = x. These can normally be estimated from the (conditional) relative frequencies of the values of X. When the distribution is uniform and the data set is sufficiently large, such estimates can be reasonably accurate. The ITrule algorithm instead uses a maximum entropy estimator, in which the observed (conditional) event counts are combined with the parameters of an initial density estimate.
The average information content is then defined as

    J(X; Y = y) = p(y) × j(X; Y = y).

This holds because it takes into account the fact that the probabilities of the other values of Y are zero. The first term, p(y), can be interpreted as a measure of the simplicity of the hypothesis that X is related to Y = y. The second term, j(X; Y = y), is equal to the cross-entropy of the variable X under the condition that it is dependent on the event Y = y. Cross-entropy is known to measure the goodness of fit between two distributions; see Goodman & Smyth (1989).
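A minimal sketch of the j-measure and J-measure as reconstructed above, using plain relative-frequency estimates rather than ITrule's maximum entropy estimator; the probabilities are invented:

```python
import math

def j_measure(p_x_given_y, p_x):
    """j(X; Y=y) = sum_x p(x|y) * log2(p(x|y) / p(x)), the cross-entropy term."""
    return sum(pc * math.log2(pc / pp)
               for pc, pp in zip(p_x_given_y, p_x) if pc > 0)

def J_measure(p_y, p_x_given_y, p_x):
    """Average information content J(X; Y=y) = p(y) * j(X; Y=y)."""
    return p_y * j_measure(p_x_given_y, p_x)

# Hypothetical rule "if Y = y then ...": prior over X and posterior given Y = y.
p_x = [0.5, 0.3, 0.2]
p_x_given_y = [0.8, 0.1, 0.1]
print(J_measure(p_y=0.25, p_x_given_y=p_x_given_y, p_x=p_x))
```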
Rule searching strategy
ITrule performs both generalisation and specialisation. It starts with a model driven strategy
much like CN2. But its rules all have probability attached from the beginning. So a universal
rule will be of the form 'if the condition part is empty (always true) then X = x', with the prior probability p(x) attached. Such a rule is then refined step by step: conditions are added to specialise it, or dropped to generalise it, and a refined rule is retained only when the change in its J-measure is favourable; namely, the increase in simplicity is sufficiently compensated for by the decrease in cross-entropy.
training set gives fewest false positives. The one with most false positives is the last to be
applied. By that time some of the false-positive errors that it could have made have been
pre-empted by other rule-sets. Finally a default class is chosen to which all cases which do
not match any rule are to be assigned. This is calculated from the frequency statistics of
such left-overs in the training set. Whichever class appears most frequently among these
left-overs is selected as the default classification.
Rule-structured classifiers generated in this way turn out to be smaller, and better in mental fit, than the trees from which the process starts. Yet accuracy is found to be fully preserved when assayed against test data. A particularly interesting feature of Quinlan's (1993) account, for which space allows no discussion here, is his detailed illustration of the Minimum Description Length (MDL) Principle, according to which the storage costs of rulesets and of their exceptions are expressed in a common information-theoretic coinage. This is used to address a simplification problem in building rule-sets that is essentially
similar to the regulation of the pruning process in decision trees. The trade-off in each case
is between complexity and predictive accuracy.
5.3.2
Table 5.2: The six attributes encode a position according to the scheme: a1 = file(WK); a2 = rank(WK); a3 = file(WR); a4 = rank(WR); a5 = file(BK); a6 = rank(BK).
ID no.   a1   a2   a3   a4   a5   a6   class
1         7    8    1    7    6    8   yes
2         6    5    8    4    6    8   no
3         2    3    3    5    8    7   no
4         2    2    5    7    5    1   yes
..       ..   ..   ..   ..   ..   ..   ...
n-1       2    7    5    3    2    3   yes
n         7    1    5    4    3    6   no
sub-descriptions, such as the crucial same-file and same-rank relation between White rook and Black king. Whenever one of these relations holds it is a good bet that the position is illegal.
The gain in classification accuracy is impressive, yet no amount of added training data could inductively refine the above excellent bet into a certainty. The reason again lies with persisting limitations of the description language. To define the cases where the classifier's use of samefile(WR, BK) and samerank(WR, BK) lets it down, one needs to say that this happens if and only if the WK is between the WR and BK. Decision-tree learning, with the attribute-set augmented as described, can patch together subtrees to do duty for samefile and samerank. But an equivalent feat for a sophisticated three-place relation such as 'between' is beyond the expressive powers of an attribute-value propositional-level language. Moreover, the decision-tree learners' constructions were described above as very successful on purely operational grounds of accuracy relative to the restricted amount of training material, i.e. successful in predictivity. In terms of descriptivity the trees, while not as opaque as those obtained with primitive attributes only, were still far from constituting intelligible theories.
5.3.3 Inherent limits of propositional-level learning
Construction of theories of high descriptivity is the shared goal of human analysts and of ML. Yet the propositional level of ML is too weak to fully solve even the problem here illustrated. The same task, however, was proved to be well within the powers (1) of Dr. Jane Mitchell, a gifted and experienced human data analyst on the academic staff of Strathclyde University, and (2) of a predicate-logic ML system belonging to the Inductive Logic Programming (ILP) family described in Chapter 12. The two independently obtained theories were complete and correct. One theory-discovery agent was human, namely a member of the academic staff of a University Statistics department. The other was an ILP learner based on Muggleton & Feng's (1990) GOLEM, with Closed World Specialization enhancements (Bain, private communication). In essentials the two theories closely approximated to the one shown below in the form of four if-then rules. These are here given in English after back-interpretation into chess terms. Neither of the learning agents had any knowledge of the meaning of the task, which was simply presented as in Table 5.2. They did not know that it had anything to do with chess, nor even with objects placed on plane surfaces. The background knowledge given to the ILP learner was similar
In essence the four rules state that a position is illegal if the White rook and the Black king occupy the same square, if the two kings stand on adjacent squares, or if the rook and the Black king share a file or a rank with no piece directly between them.
Construction of this theory requires certain key sub-concepts, notably that of 'directly between'. Definitions were invented by the machine learner, using lower-level concepts such as less-than as background knowledge. Directly-between holds among the three co-ordinate pairs if either the first co-ordinates are all equal and the second co-ordinates are in ascending or descending progression, or the second co-ordinates are all equal and the first co-ordinates show the progression. Bain's ILP package approached the relation piece-wise, via invention of between-file and between-rank. The human learner doubtless came ready-equipped with some at least of the concepts that the ML system had to invent. None the less, with unlimited access to training data and the use of standard statistical analysis and tabulation software, the task of theory building still cost two days of systematic work. Human learners given hours rather than days constructed only partial theories, falling far short even of operational adequacy (see also Muggleton, Bain, Hayes-Michie & Michie, 1989).
Bain's new work has the further interest that learning takes place incrementally, by successive refinement, a style sometimes referred to as non-monotonic. Generalisations made in the first pass through training data yield exceptions when challenged with new data. As exceptions accumulate they are themselves generalised over, to yield sub-theories which qualify the main theory. These refinements are in turn challenged, and so forth to any desired level.
The KRK illegality problem was originally included in StatLog's datasets. In the interests of industrial relevance, artificial problems were not retained except for expository purposes. No connection, however, exists between a dataset's industrial importance and its intrinsic difficulty. All of the ML algorithms tested by StatLog were of propositional type. If descriptive adequacy is a desideratum, none can begin to solve the KRK-illegal problem. It would be a mistake, however, to assume that problems of complex logical structure do not occur in industry. They can be found, for example, in trouble-shooting complex circuitry (Pearce, 1989), in inferring biological activity from specifications of macromolecular structure in the pharmaceutical industry (see the last section of Chapter 12) and in many other large combinatorial domains. As Inductive Logic Programming matures and assimilates techniques from probability and statistics, industrial need seems set to explore these more powerful ML description languages.
6
Neural Networks
R. Rohwer (1), M. Wynne-Jones (1) and F. Wysotzki (2)
(1) Aston University and (2) Fraunhofer-Institute
6.1 INTRODUCTION
The field of Neural Networks has arisen from diverse sources, ranging from the fascination of mankind with understanding and emulating the human brain, to broader issues of copying human abilities such as speech and the use of language, to the practical commercial, scientific, and engineering disciplines of pattern recognition, modelling, and prediction.
For a good introductory text, see Hertz et al. (1991) or Wasserman (1989).
Linear discriminants were introduced by Fisher (1936) as a statistical procedure for classification. Here the space of attributes can be partitioned by a set of hyperplanes, each defined by a linear combination of the attribute variables. A similar model for logical processing was suggested by McCulloch & Pitts (1943) as a possible structure bearing similarities to neurons in the human brain, and they demonstrated that the model could be used to build any finite logical expression. The McCulloch-Pitts neuron (see Figure 6.1) consists of a weighted sum of its inputs, followed by a non-linear function called the activation function, originally a threshold function. Formally,

    y = 1 if Σ_i w_i x_i ≥ θ, and y = 0 otherwise,   (6.1)

where x_i are the inputs, w_i the weights and θ the threshold.
Other neuron models are quite widely used, for example in Radial Basis Function
networks, which are discussed in detail in Section 6.2.3.
Networks of McCulloch-Pitts neurons for arbitrary logical expressions were hand-crafted, until the ability to learn by reinforcement of behaviour was developed in Hebb's book The Organisation of Behaviour (Hebb, 1949). It was established that the functionality of neural networks was determined by the strengths of the connections between neurons; Hebb's learning rule prescribes that if the network responds in a desirable way to
a given input, then the weights should be adjusted to increase the probability of a similar
Address for correspondence: Dept. of Computer Science and Applied Mathematics, Aston University,
Birmingham B4 7ET, U.K.
response to similar inputs in the future. Conversely, if the network responds undesirably to
an input, the weights should be adjusted to decrease the probability of a similar response.
A distinction is often made, in pattern recognition, between supervised and unsupervised
learning. The former describes the case where the training data, measurements on the surroundings, are accompanied by labels indicating the class of event that the measurements represent, or more generally a desired response to the measurements. This is the more usual case in classification tasks, such as those forming the empirical basis of this book. The
supervised learning networks described later in this chapter are the Perceptron and Multi
Layer Perceptron (MLP), the Cascade Correlation learning architecture, and Radial Basis
Function networks.
Unsupervised learning refers to the case where measurements are not accompanied by
class labels. Networks exist which can model the structure of samples in the measurement,
or attribute space, usually in terms of a probability density function, or by representing the
data in terms of cluster centres and widths. Such models include Gaussian mixture models
and Kohonen networks.
Once a model has been made, it can be used as a classifier in one of two ways. The first is to determine which class of pattern in the training data each node or neuron in the model responds most strongly to, most frequently. Unseen data can then be classified according to
the class label of the neuron with the strongest activation for each pattern. Alternatively, the
Kohonen network or mixture model can be used as the first layer of a Radial Basis Function
network, with a subsequent layer of weights used to calculate a set of class probabilities.
The weights in this layer are calculated by a linear one-shot learning algorithm (see Section
6.2.3), giving radial basis functions a speed advantage over non-linear training algorithms
such as most of the supervised learning methods. The first layer of a Radial Basis Function
network can alternatively be initialised by choosing a subset of the training data points to
use as centres.
6.2 SUPERVISED NETWORKS FOR CLASSIFICATION
It might seem more natural to use a percentage misclassification error measure in classification problems, but the total squared error has helpful smoothness and differentiability properties. Although the total squared error was used for training in the StatLog trials, percentage misclassification in the trained networks was used for evaluation.
6.2.1 Perceptrons and Multi Layer Perceptrons
The activation of the McCulloch-Pitts neuron has been generalised to the form

    y = f( Σ_i w_i x_i ),   (6.2)

where the hard threshold is replaced by a general activation function f.
Fig. 6.2: Structure of the standard two-layer perceptron. The input vector enters at the input nodes, passes through a layer of weights (usually non-linear) to the hidden nodes, and through a further layer of weights (linear or non-linear) to the output nodes, which produce the output vector.
6.2.2 Multi Layer Perceptron structure and functionality
Figure 6.2 shows the structure of a standard two-layer perceptron. The inputs form the
input nodes of the network; the outputs are taken from the output nodes. The middle layer
of nodes, visible to neither the inputs nor the outputs, is termed the hidden layer, and unlike
the input and output layers, its size is not fixed. The hidden layer is generally used to make
a bottleneck, forcing the network to make a simple model of the system generating the data,
with the ability to generalise to previously unseen patterns.
The operation of this network is specified by

    y_j = f( Σ_k w'_{jk} f( Σ_i w_{ki} x_i ) ),   (6.3)

which specifies how the input pattern vector x is mapped into the output pattern vector y, via the hidden pattern vector h, in a manner parameterised by the two layers of weights w and w'. The univariate functions f are typically each set to the logistic sigmoid

    f(u) = 1 / ( 1 + e^(−u) ).   (6.4)

When the hidden layer forms a bottleneck within
the network, these internal variables form a linear or non-linear principal component
representation of the attribute space. If the data has noise added that is not an inherent
part of the generating system, then the principal component network acts as a filter of the
lower-variance noise signal, provided the signal to noise ratio of the data is sufficiently
high. This property gives MLPs the ability to generalise to previously unseen patterns, by
modelling only the important underlying structure of the generating system. The hidden
nodes can be regarded as detectors of abstract features of the attribute space.
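A small sketch of the mapping (6.3)-(6.4) with randomly chosen weights (the layer sizes here are arbitrary assumptions):

```python
import numpy as np

def sigmoid(u):
    """Logistic activation, equation (6.4)."""
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward(x, w_hidden, w_output):
    """Two-layer perceptron mapping of equation (6.3):
    hidden = f(w_hidden @ x), output = f(w_output @ hidden)."""
    hidden = sigmoid(w_hidden @ x)
    return sigmoid(w_output @ hidden), hidden

rng = np.random.default_rng(1)
x = rng.normal(size=4)                     # 4 input attributes
w_hidden = rng.normal(size=(3, 4))         # 3 hidden nodes
w_output = rng.normal(size=(2, 3))         # 2 output nodes (one per class)
y, h = mlp_forward(x, w_hidden, w_output)
print(y)
```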
Universal Approximators and Universal Computers
In the Multilayer Perceptron (MLP) such as the two-layer version in Equation (6.3), the
output-layer node values are functions of the input-layer node values (and the weights). It can be shown (Funahashi, 1989) that the two-layer MLP can approximate an arbitrary
continuous mapping arbitrarily closely if there is no limit to the number of hidden nodes.
In this sense the MLP is a universal function approximator. This theorem does not imply
that more complex MLP architectures are pointless; it can be more efficient (in terms of
the number of nodes and weights required) to use different numbers of layers for different
problems. Unfortunately there is a shortage of rigorous principles on which to base a
choice of architecture, but many heuristic principles have been invented and explored.
Prominent among these are symmetry principles (Lang et al., 1990; Le Cun et al., 1989)
and constructive algorithms (Wynne-Jones, 1991).
The MLP is a feedforward network, meaning that the output vector y is a function of the input vector x and some parameters w; we can say

    y = F(x; w)   (6.5)

for some vector function F given in detail by (6.3) in the 2-layer case. It is also possible to define a recurrent network by feeding the outputs back to the inputs, so that the network is applied iteratively, the output at one step forming part of the input to the next.
Training adjusts the weights so that the outputs match a set of target vectors t^p associated with the training inputs x^p, most simply by minimising the total squared error

    E(w) = Σ_p Σ_j ( y_j^p − t_j^p )²,   (6.6)

where y^p are the output values obtained by substituting the inputs x^p for x in (6.3). If the fit is perfect, E = 0; otherwise E > 0.
Probabilistic interpretation of MLP outputs
If there is a one-to-many relationship between the inputs and targets in the training data, then it is not possible for any mapping of the form (6.5) to perform perfectly. It is straightforward to show (Bourlard & Wellekens, 1990) that if a probability density p(t, x) describes the data, then the minimum of (6.6) is attained by the map taking x to the average target

    F(x) = ∫ t p(t | x) dt.   (6.7)

Any given network might or might not be able to approximate this mapping well, but when trained as well as possible it will form its best possible approximation to this mean. Many commonly-used error measures in addition to (6.6) share this property (Hampshire & Pearlmutter, 1990).
Usually classification problems are represented using one-out-of-N output coding. One output node is allocated for each class, and the target vector for an example is all 0s except for a 1 on the node indicating the correct class. In this case, the value computed by the ith output node can be directly interpreted as the probability that the input pattern belongs to class i. Collectively the outputs express the conditional distribution of the classes given the input. This not only provides helpful insight, but also provides a principle with which neural network models can be combined with other probabilistic models (Bourlard & Wellekens, 1990).
The probabilistic interpretation of the output nodes leads to a natural error measure for classification problems. Given that the value y_i^p output by the ith output node for the pth training input is interpreted as the probability that the example belongs to class i, and 1 − y_i^p as the probability that it does not, the probability of the entire collection of training outputs is

    Π_p Π_i ( y_i^p )^( t_i^p ) ( 1 − y_i^p )^( 1 − t_i^p ),   (6.8)

and its negative logarithm, the cross-entropy, is

    E = − Σ_p Σ_i [ t_i^p log y_i^p + ( 1 − t_i^p ) log( 1 − y_i^p ) ].   (6.9)

Therefore the cross-entropy can be used as an error measure instead of a sum of squares (6.6). It happens that its minimum also lies at the average target (6.7), so the network outputs can still be interpreted probabilistically, and furthermore the minimisation of cross-entropy is equivalent to maximisation of the likelihood of the training data in classification problems.
(The cross-entropy (6.9) has this interpretation when an input can simultaneously be a member of any number of classes, and membership of one class provides no information about membership of another. If an input can belong to one and only one class, then the simple entropy, obtained by dropping the terms involving 1 − t_i^p, should be used.)
Training by gradient descent rests on the observation that, for an infinitesimal change Δw in the weights, the change in the error is given by the linear approximation

    E(w + Δw) ≈ E(w) + Δw · ∂E/∂w.   (6.10)

The error is therefore reduced by repeatedly stepping in the direction of the negative gradient,

    Δw = −η ∂E/∂w,   (6.11)

where the step size η is a small positive parameter. Choosing η is awkward: too small a value makes progress slow, while too large a value causes the error to fluctuate erratically from one step to the next. A
popular heuristic is to use a moving average of the gradient vector in order to find a systematic tendency. This is accomplished by adding a momentum term to (6.11), involving a parameter α:

    Δw = −η ∂E/∂w + α Δw_old.

Here Δw_old refers to the most recent weight change.
These methods offer the benefit of simplicity, but their performance depends sensitively on the parameters η and α (Toolenaere, 1990). Different values seem to be appropriate for different problems, and for different stages of training in one problem. This circumstance has given rise to a plethora of heuristics for adaptive variable step size algorithms (Toolenaere, 1990; Silva & Almeida, 1990; Jacobs, 1988).
Second-Order methods
The underlying difficulty in first order gradient based methods is that the linear approximation (6.10) ignores the curvature of E. This can be redressed by extending (6.10) to the quadratic approximation

    E(w + Δw) ≈ E(w) + Δw · g + (1/2) Δw · H · Δw,

where g = ∂E/∂w is the gradient and H is the matrix with components ∂²E/∂w_i ∂w_j, called the Hessian (or the inverse Hessian, depending on conventions). The change Δw = −H⁻¹ g brings w to a stationary point of this quadratic form. This may be a minimum, maximum, or saddle point. If it is a minimum, then a step in that direction seems a good idea; if not, then a positive or negative step (whichever has a negative projection on the gradient) in the direction −H⁻¹ g is at least not unreasonable.
Therefore a large class of algorithms has been developed involving the conjugate gradient.
Most of these algorithms require explicit computation or estimation of the Hessian H. The number of components of H is roughly half the square of the number of components of w, so for large networks involving many weights, such algorithms lead to impractical computer memory requirements. But one algorithm, generally called the conjugate gradient algorithm, or the memoryless conjugate gradient algorithm, does not. This algorithm maintains an estimate of the conjugate direction without directly representing H.
The conjugate gradient algorithm uses a sequence of linesearches, one-dimensional searches for the minimum of E, starting from the most recent estimate of the minimum and searching in the direction of the current estimate of the conjugate gradient. Linesearch algorithms are comparatively easy because the issue of direction choice reduces to a binary choice. But because the linesearch appears in the inner loop of the conjugate gradient algorithm, efficiency is important. Considerable effort therefore goes into it, to the extent that the linesearch is typically the most complicated module of a conjugate gradient implementation. Numerical round-off problems are another design consideration in linesearch implementations, because the conjugate gradient is often nearly orthogonal to the gradient, making the variation of E along the conjugate gradient especially small.
The update rule for the conjugate gradient direction is

    d_new = −g + β d_old,   (6.12)

where g = ∂E/∂w is the current gradient, d_old is the previous direction, and

    β = g · ( g − g_old ) / ( g_old · g_old ).   (6.13)
(This is the Polak-Ribiere variant; there are others.) Somewhat intricate proofs exist which
show that if E were purely quadratic in w, d were initialised to the gradient, and the linesearches were performed exactly, then d would converge on the conjugate gradient and w would converge on the minimum of E after as many iterations of (6.12) as there are components of w. In practice good performance is often obtained on much more general functions using very imprecise linesearches. It is necessary to augment (6.13) with a rule to reset d to the negative gradient whenever d becomes too nearly orthogonal to the gradient for progress to continue.
An implementation of the conjugate gradient algorithm will have several parameters controlling the details of the linesearch, and others which define exactly when to reset d to the negative gradient. But unlike the step size and momentum parameters of the simpler methods, the performance of the conjugate gradient method is relatively insensitive to its parameters if they are set within reasonable ranges. All algorithms are sensitive to the process for selecting initial weights, and many other factors which remain to be carefully isolated.
Gradient calculations in MLPs
It remains to discuss the computation of the gradient ∂E/∂w in the case of an MLP neural network model with an error measure such as (6.6). The calculation is conveniently organised as a back propagation of error (Rumelhart et al., 1986; Rohwer & Renals, 1988). For a network with a single layer of hidden nodes, the calculation proceeds by propagating node output values forward from the input layer to the output layer for each training example, and then propagating quantities related to the output errors backwards through a linearised version of the network. Products of the forward-propagated activations and the backward-propagated error quantities then give the gradient: the error quantity at an output node is the difference between its output and its target multiplied by the derivative of its activation function; the error quantity at a hidden node is obtained by passing the output-layer error quantities back through the output-layer weights and again multiplying by the derivative of the activation function; and the derivative of E with respect to any weight is the product of the error quantity at the weight's destination node and the activation at its source node, summed over the training examples.
This network architecture was used in the work reported in this book.
6.2.3 Radial Basis Function networks
The radial basis function network consists of a layer of units performing linear or non-linear
functions of the attributes, followed by a layer of weighted connections to nodes whose
outputs have the same form as the target vectors. It has a structure like an MLP with one
hidden layer, except that each node of the hidden layer computes an arbitrary function of
the inputs (with Gaussians being the most popular), and the transfer function of each output
node is the trivial identity function. Instead of synaptic strengths the hidden layer has
parameters appropriate for whatever functions are being used; for example, Gaussian widths
and positions. This network offers a number of advantages over the multi layer perceptron
under certain conditions, although the two models are computationally equivalent.
These advantages include a linear training rule once the locations in attribute space
of the non-linear functions have been determined, and an underlying model involving
localised functions in the attribute space, rather than the long-range functions occurring in
perceptron-based models. The linear learning rule avoids problems associated with local
minima; in particular it provides enhanced ability to make statements about the accuracy of
the attribute variables themselves. There are a few subtleties, however, which are discussed here. Let φ_j^p be the output of the jth radial basis function on the pth example. The output computed at target node t is

    y_t^p = Σ_j w_tj φ_j^p,   (6.16)

using the output-layer weights w_tj. Let the error at target node t be the difference between this output and the desired target value for node t on example p, and let E be the total squared error summed over target nodes and examples. The weights which minimise E are those for which the derivative of E with respect to each weight vanishes, which amounts to requiring the outputs (6.16) to reproduce the targets as nearly as possible for every example and every target node. Unless the number of radial basis functions equals the number of examples, this is a rectangular system. In general an exact solution does not exist, but the optimal solution in the least-squares sense is given by the pseudo-inverse Φ⁺ (Kohonen, 1989) of Φ, the matrix with elements φ_j^p:

    W = T Φ⁺,   (6.22)

where T is the matrix of targets and W the matrix of weights w_tj.
6.2.4 Improving the generalisation of Feed-Forward networks
Networks whose size and connectivity are adapted during training, either by adding nodes or by pruning, can show improved generalisation
ability for statistical problems. Cascade Correlation (Fahlman & Lebi` re, 1990) is an
e
example of such a network algorithm, and is described below.
Pruning has been carried out on networks in three ways. The first is a heuristic approach based on identifying which nodes or weights contribute little to the mapping. After these have been removed, additional training leads to a better network than the original. An alternative technique is to include terms in the error function, so that weights tend to zero under certain circumstances. Zero weights can then be removed without degrading the network performance. This approach is the basis of regularisation, discussed in more detail below. Finally, if we define the sensitivity of the global network error to the removal of a weight or node, we can remove the weights or nodes to which the global error is least sensitive. The sensitivity measure does not interfere with training, and involves only a small amount of extra computational effort. A full review of these techniques can be found in Wynne-Jones (1991).
Cascade Correlation: A Constructive Feed-Forward network
Cascade Correlation is a paradigm for building a feed-forward network as training proceeds in a supervised mode (Fahlman & Lebière, 1990). Instead of adjusting the weights in a fixed architecture, it begins with a small network, and adds new hidden nodes one by one, creating a multi-layer structure. Once a hidden node has been added to a network, its input-side weights are frozen and it becomes a permanent feature-detector in the network, available for output or for creating other, more complex feature detectors in later layers. Cascade correlation can offer reduced training time, and it determines the size and topology of networks automatically.
Cascade correlation combines two ideas: first the cascade architecture, in which hidden nodes are added one at a time, each using the outputs of all others in addition to the input nodes, and second the maximisation of the correlation between a new unit's output and the residual classification error of the parent network. Each node added to the network may be of any kind. Examples include linear nodes which can be trained using linear algorithms, threshold nodes such as single perceptrons where simple learning rules such as the Delta rule or the Pocket Algorithm can be used, or non-linear nodes such as sigmoids or Gaussian functions requiring Delta rules or more advanced algorithms such as Fahlman's Quickprop (Fahlman, 1988a, 1988b). Standard MLP sigmoids were used in the StatLog trials.
At each stage in training, each node in a pool of candidate nodes is trained on the
residual error of the parent network. Of these nodes, the one whose output has the greatest
correlation with the error of the parent is added permanently to the network. The quantity maximised in this scheme is $S$, the sum over all output units of the magnitude of the correlation (or, more precisely, the covariance) between $V_p$, the candidate unit's value, and $E_{p,o}$, the residual error observed at output unit $o$ for example $p$. $S$ is defined by:

$$ S = \sum_o \Big| \sum_p (V_p - \bar V)(E_{p,o} - \bar E_o) \Big| \qquad (6.24) $$

where $\bar V$ and $\bar E_o$ denote averages over the training patterns. The partial derivative of $S$ with respect to each of the candidate unit's incoming weights $w_i$ is

$$ \frac{\partial S}{\partial w_i} = \sum_{p,o} \sigma_o\, (E_{p,o} - \bar E_o)\, f'_p\, I_{i,p}, $$

where $\sigma_o$ is the sign of the correlation between the candidate's value and the output $o$, $f'_p$ is the derivative for pattern $p$ of the candidate unit's activation function with respect to the sum of its inputs, and $I_{i,p}$ is the input the candidate unit receives for pattern $p$.
The partial derivatives are used to perform gradient ascent to maximise $S$. When $S$ no longer improves in training for any of the candidate nodes, the best candidate is added to the network, and the others are scrapped.
In benchmarks on a toy problem involving classification of data points forming two interlocked spirals, cascade correlation is reported to be ten to one hundred times faster than conventional back-propagation of error derivatives in a fixed architecture network. Empirical tests on a range of real problems (Yang & Honavar, 1991) indicate a speedup of one to two orders of magnitude with minimal degradation of classification accuracy. These results were only obtained after many experiments to determine suitable values for the many parameters which need to be set in the cascade correlation implementation. Cascade correlation can also be implemented in computers with limited precision (Fahlman, 1991b), and in recurrent networks (Hoehfeld & Fahlman, 1991).
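The candidate-selection step can be sketched in a few lines of NumPy. The sketch below only scores a pool of candidates against the parent network's residual errors using the covariance measure of (6.24); it assumes the candidate outputs and residuals have already been computed, and the function names are placeholders.

```python
import numpy as np

def candidate_score(V, E):
    """Cascade-correlation score of equation (6.24).

    V : (n_patterns,)            candidate unit's value on each pattern
    E : (n_patterns, n_outputs)  residual error of the parent network
    Returns the sum over output units of |covariance(V, E_o)| (unnormalised).
    """
    Vc = V - V.mean()
    Ec = E - E.mean(axis=0)
    return float(np.abs(Vc @ Ec).sum())

def best_candidate(candidate_values, E):
    """Index of the candidate whose output covaries most with the residual error."""
    scores = [candidate_score(V, E) for V in candidate_values]
    return int(np.argmax(scores))
```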
Bayesian regularisation
In recent years the formalism of Bayesian probability theory has been applied to the
treatment of feedforward neural network models as nonlinear regression problems. This
has brought about a greatly improved understanding of the generalisation problem, and
some new techniques to improve generalisation. None of these techniques were used in
the numerical experiments described in this book, but a short introduction to this subject is
provided here.
A reasonable scenario for a Bayesian treatment of feedforward neural networks is to presume that each target training data vector was produced by running the corresponding input training vector through some network and corrupting the output with noise from a stationary source. The network involved is assumed to have been drawn from a probability distribution over networks, which is to be estimated. The most probable network in this distribution can be used as the optimal classifier, or a more sophisticated average over the distribution can be used. (The latter technique is marginalisation (MacKay, 1992a).)
The notation used here for probability densities is somewhat cavalier. In discussions involving several probability density functions, the notation should distinguish one density function from another, and further notation should be used when such a density is indicated at a particular point; for example, $P_w$ can designate the density function over weights, and $P_w(w)$ would designate this density at the particular point $w$, which confusingly and insignificantly has the same name as the label index of $P_w$. However, a tempting opportunity to choose names which introduce this confusion will arise in almost every instance that a density function is mentioned, so we shall not only succumb to the temptation, but furthermore adopt the common practice of writing $P(w)$ when $P_w(w)$ is meant, in order to be concise. Technically, this is an appalling case of using a function argument name (which is ordinarily arbitrary) to designate the function.
The Bayesian analysis is built on a probabilistic interpretation of the error measure used
in training. Typically, as in Equations (6.6) or (6.9), it is additive over input-output pairs:

$$ E(w; D) = \sum_p e\big( y(x^p; w),\, t^p \big), \qquad (6.25) $$

where $D$ denotes the training data, $x^p$ and $t^p$ the input and target vectors of the $p$th example, and $y(x^p; w)$ the output of the network with weights $w$. This additivity, together with the assumption of a stationary noise source, gives the error a probabilistic interpretation as a likelihood,

$$ P(D \mid w) = \frac{e^{-\beta E(w; D)}}{Z_D(\beta)}, \qquad (6.26,\ 6.27) $$

with normalisation constant

$$ Z_D(\beta) = \int e^{-\beta E(w; D)}\, dD, \qquad (6.28) $$

an integral over all possible target training data sets of the size under consideration. If $e$ in (6.25) is a function only of $y - t$, as is (6.6), then $Z_D$ turns out to be independent of $w$, a result which is useful later. The only common form of $e$ which does not have this form is the cross-entropy (6.9). But this is normally used in classification problems, in which case (6.9) and (6.8) together justify the assumption that $P(D \mid w)$ depends only on the network outputs, and imply for (6.28) that $\beta = 1$ and $Z_D = 1$, so $Z_D$ is still independent of $w$.
Density (6.27) can also be derived from somewhat different assumptions using a
maximum-entropy argument (Bilbro & van den Bout, 1992). It plays a prominent role
in thermodynamics, and thermodynamics jargon has drifted into the neural networks literature partly in consequence of the analogies it underlies.
The probability of the weights given the data, $P(w \mid D)$, is of greater interest than the probability of the data given the weights, $P(D \mid w)$ (the likelihood), but unfortunately the additivity argument does not go through for this. Instead, Bayes' rule

$$ P(w \mid D) = \frac{P(D \mid w)\, P(w)}{P(D)} \qquad (6.29) $$

is used, together with a prior distribution over the weights, written in terms of a regularising function $W(w)$ as

$$ P(w) = \frac{e^{-\alpha W(w)}}{Z_W(\alpha)}, \qquad (6.30) $$
where $Z_W(\alpha)$ is given by normalisation. (There is a further technicality: the integral (6.28) over target data must be with respect to uniform measure, which may not always be reasonable.)
Assembling all the pieces, the posterior probability of the weights given the data is

$$ P(w \mid D) = \frac{e^{-\beta E(w; D) - \alpha W(w)}}{Z}, \qquad (6.31) $$

provided that (6.28) does not depend on $w$. This ensures that the denominator of (6.31) does not depend on $w$, so the usual training process of minimising the regularised error $\beta E(w; D) + \alpha W(w)$ finds the maximum of $P(w \mid D)$.
The Bayesian method helps with one of the most troublesome steps in the regularisation approach to obtaining good generalisation: deciding the values of the regularisation parameters. The ratio $\alpha/\beta$ expresses the relative importance of smoothing and data-fitting, which deserves to be decided in a principled manner. The Bayesian Evidence formalism provides a principle and an implementation. It can be computationally demanding if used precisely, but there are practicable approximations.
The Evidence formalism simply assumes a prior distribution over the regularisation parameters, and sharpens it using Bayes' rule:

$$ P(\alpha, \beta \mid D) = \frac{P(D \mid \alpha, \beta)\, P(\alpha, \beta)}{P(D)}. \qquad (6.32) $$

The factor $P(D \mid \alpha, \beta)$, the probability of the data given the regularisation parameters, is the evidence; choosing $\alpha$ and $\beta$ to maximise it provides the principled setting of the regularisation parameters referred to above.
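As a small illustration of how the posterior (6.31) is maximised in practice, the following sketch performs one gradient step on the regularised error. The quadratic regulariser $W(w) = \tfrac{1}{2}|w|^2$ (weight decay), the fixed values of alpha and beta, the learning rate and the names are all assumptions made for illustration; none of this was used in the StatLog experiments.

```python
import numpy as np

def map_weight_update(w, grad_E, alpha, beta, learning_rate=0.01):
    """One gradient-descent step on beta*E(w) + alpha*W(w), with W(w) = 0.5*|w|^2.

    Minimising this regularised error corresponds to maximising the posterior
    (6.31); the ratio alpha/beta controls smoothing versus data-fitting.
    """
    grad_total = beta * grad_E + alpha * w   # derivative of beta*E + 0.5*alpha*|w|^2
    return w - learning_rate * grad_total
```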
6.3 UNSUPERVISED LEARNING

6.3.1 The K-means clustering algorithm
The principle of clustering requires a representation of a set of data to be found which offers a model of the distribution of samples in the attribute space. The K-means algorithm (for example, Krishnaiah & Kanal, 1982) achieves this quickly and efficiently as a model with a fixed number of cluster centres, determined by the user in advance. The cluster centres are initially chosen from the data, and each centre forms the code vector for the patch of the input space in which all points are closer to that centre than to any other. This division of the space into patches is known as a Voronoi tessellation. Since the initial allocation of centres may not form a good model of the probability distribution function (PDF) of the input space, there follows a series of iterations where each cluster centre is moved to the mean position of all the training patterns in its tessellation region.
A generalised variant of the K-means algorithm is the Gaussian Mixture Model, or Adaptive K-means. In this scheme, Voronoi tessellations are replaced with soft transitions from one centre's receptive field to another's. This is achieved by assigning a variance to each centre, thereby defining a Gaussian kernel at each centre. These kernels are mixed
together by a set of mixing weights to approximate the PDF of the input data, and an efficient algorithm exists to calculate iteratively a set of mixing weights, centres, and variances for the centres (Dubes & Jain, 1976, and Wu & Chan, 1991). While the number of centres for these algorithms is fixed in advance in more popular implementations, some techniques are appearing which allow new centres to be added as training proceeds (Wynne-Jones, 1992 and 1993).

Fig. 6.4: K-Means clustering: within each patch the centre is moved to the mean position of the patterns.
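The basic iteration (assign patterns to their nearest centre, then move each centre to the mean of its patch) can be sketched as follows. The fixed number of iterations, the random choice of initial centres from the data and the function names are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, n_iter=100, rng=None):
    """Plain K-means: each centre is moved to the mean of its Voronoi patch."""
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # initial centres chosen from the data
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each pattern to its nearest centre (Voronoi tessellation).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centre to the mean position of the patterns in its patch.
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```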
6.3.2 Kohonen networks and Learning Vector Quantizers
Kohonen's network algorithm (Kohonen, 1984) also provides a Voronoi tessellation of the
input space into patches with corresponding code vectors. It has the additional feature that
the centres are arranged in a low dimensional structure (usually a string, or a square grid),
such that nearby points in the topological structure (the string or grid) map to nearby points
in the attribute space. Structures of this kind are thought to occur in nature, for example in
the mapping from the ear to the auditory cortex, and the retinotopic map from the retina to
the visual cortex or optic tectum.
In training, the winning node of the network, which is the nearest node in the input
space to a given training pattern, moves towards that training pattern, while dragging with
its neighbouring nodes in the network topology. This leads to a smooth distribution of the
network topology in a non-linear subspace of the training data.
Vector Quantizers that conserve topographic relations between centres are also particularly useful in communications, where noise added to the coded vectors may corrupt the
representation a little; the topographic mapping ensures that a small change in code vector
is decoded as a small change in attribute space, and hence a small change at the output.
These models have been studied extensively, and recently unified under the framework of Bayes theory (Luttrell, 1990, 1993).
Although it is fundamentally an unsupervised learning algorithm, the Learning Vector Quantizer can be used as a supervised vector quantizer, where network nodes have class labels associated with them. The Kohonen Learning Rule is used when the winning node represents the same class as a new training pattern, while a difference in class between
the winning node and a training pattern causes the node to move away from the training
pattern by the same distance. Learning Vector Quantizers are reported to give excellent
performance in studies on statistical and speech data (Kohonen et al., 1988).
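A minimal sketch of the supervised update just described follows: the winning node moves towards a training pattern of its own class and away from a pattern of a different class by the same amount. The topological neighbourhood dragging of the Kohonen network is omitted here, and the learning rate and names are assumptions.

```python
import numpy as np

def lvq1_step(codebook, node_labels, x, x_class, lr=0.05):
    """One LVQ update on a codebook of labelled nodes for pattern x of class x_class."""
    winner = ((codebook - x) ** 2).sum(axis=1).argmin()        # nearest node in input space
    direction = 1.0 if node_labels[winner] == x_class else -1.0
    codebook[winner] += direction * lr * (x - codebook[winner])  # move towards or away
    return winner
```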
6.3.3 RAMnets
One of the oldest practical neurally-inspired classification algorithms is still one of the best. It is the n-tuple recognition method introduced by Bledsoe & Browning (1959) and Bledsoe (1961), which later formed the basis of a commercial product known as Wisard (Aleksander et al., 1984). The algorithm is simple. The patterns to be classified are bit strings of a given length. Several (let us say $N$) sets of bit locations are selected randomly. These are the n-tuples. The restriction of a pattern to an n-tuple can be regarded as an n-bit number which constitutes a feature of the pattern. A pattern is classified as belonging to the class for which it has the most features in common with at least 1 pattern in the training data.
To be precise, the class assigned to an unclassified pattern $u$ is

$$ \hat c(u) = \mathop{\rm argmax}_c \sum_{i=1}^{N} m_{c\,i\,\alpha_i(u)}, \qquad (6.33) $$

where $\alpha_i(u)$ is the $i$th feature of $u$, the n-bit number obtained by restricting $u$ to the $i$th n-tuple, and the memory contents are

$$ m_{c\,i\,f} = \Theta\Big( \sum_{v \in T_c} \delta_{f,\,\alpha_i(v)} \Big), \qquad (6.34,\ 6.35) $$

with $T_c$ the set of training patterns of class $c$, $\delta$ the Kronecker delta and $\Theta(x) = 1$ for $x > 0$ and $0$ otherwise. Thus $m_{c\,i\,f}$ is set if any pattern of $T_c$ has feature $f$ and unset otherwise. Recognition is accomplished by tallying the set bits in the RAMs of each class at the addresses given by the features of the unclassified pattern.
RAMnets are impressive in that they can be trained faster than MLPs or radial basis function networks by orders of magnitude, and often provide comparable results. Experimental comparisons between RAMnets and other methods can be found in Rohwer & Cressy (1989).
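The whole method fits comfortably in a short sketch. Binary patterns are assumed to be NumPy arrays of 0s and 1s, and the tuple count, tuple size and function names are illustrative choices rather than details from the text.

```python
import numpy as np

def make_tuples(n_bits, n_tuples, tuple_size, rng=None):
    """Randomly select the bit locations for each n-tuple."""
    rng = np.random.default_rng(rng)
    return [rng.choice(n_bits, size=tuple_size, replace=False) for _ in range(n_tuples)]

def features(pattern, tuples):
    """Restriction of a bit pattern to each n-tuple, read as an n-bit number."""
    return [int("".join(str(b) for b in pattern[t]), 2) for t in tuples]

def train_ramnet(patterns, classes, tuples, n_classes, tuple_size):
    """Set a RAM bit for every (class, tuple, feature) seen in the training data."""
    ram = np.zeros((n_classes, len(tuples), 2 ** tuple_size), dtype=bool)
    for pattern, c in zip(patterns, classes):
        for i, f in enumerate(features(pattern, tuples)):
            ram[c, i, f] = True
    return ram

def classify(pattern, ram, tuples):
    """Tally the set bits addressed by the pattern's features, for each class."""
    scores = [sum(ram[c, i, f] for i, f in enumerate(features(pattern, tuples)))
              for c in range(ram.shape[0])]
    return int(np.argmax(scores))
```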
6.4 DIPOL92
This is something of a hybrid algorithm, which has much in common with both logistic
discrimination and some of the nonparametric statistical methods. However, for historical
reasons it is included here.
6.4.1 Introduction
DIPOL92 is a learning algorithm which constructs an optimised piecewise linear classifier by a two step procedure. In the first step the initial positions of the discriminating hyperplanes are determined by pairwise linear regression. To optimise these positions in relation to the misclassified patterns an error criterion function is defined. This function is then minimised by a gradient descent procedure for each hyperplane separately. As an option in the case of nonconvex classes (e.g. if a class has a multimodal probability distribution) a clustering procedure decomposing the classes into appropriate subclasses can be applied. (In this case DIPOL92 is really a three step procedure.)
Seen from a more general point of view DIPOL92 is a combination of a statistical part (regression) with a learning procedure typical of artificial neural nets. Compared with most neural net algorithms, an advantage of DIPOL92 is the possibility of determining the number and initial positions of the discriminating hyperplanes (corresponding to neurons) a priori, i.e. before learning starts. Using the clustering procedure this is true even in the case that a class has several distinct subclasses. There are many relations and similarities between statistical and neural net algorithms, but a systematic study of these relations is still lacking.
Another distinguishing feature of DIPOL92 is the introduction of Boolean variables (signs of the normals of the discriminating hyperplanes) for the description of class regions on a symbolic level, and the use of these variables in the decision procedure. This way additional layers of hidden units can be avoided.
DIPOL92 has some similarity with the MADALINE-system (Widrow, 1962), which is also a piecewise linear classification procedure. But instead of applying a majority function for class decision on the symbolic level (as in the case of MADALINE), DIPOL92 uses more general Boolean descriptions of class and subclass segments, respectively. This extends considerably the variety of classification problems which can be handled.
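As a concrete illustration of the first (regression) step mentioned in the Introduction, here is a minimal sketch of fitting an initial hyperplane for every pair of classes by least squares. The +1/-1 target coding, the intercept handling and the names are assumptions made for this sketch; this is not ISoft's or the StatLog implementation of DIPOL92.

```python
import numpy as np
from itertools import combinations

def pairwise_hyperplanes(X, y, classes):
    """Initial hyperplane for every pair of classes by linear (least-squares)
    regression of a +1/-1 target on the attributes."""
    planes = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        Xp = np.hstack([np.ones((mask.sum(), 1)), X[mask]])   # prepend an intercept column
        tp = np.where(y[mask] == a, 1.0, -1.0)                # +1 for class a, -1 for class b
        w, *_ = np.linalg.lstsq(Xp, tp, rcond=None)
        planes[(a, b)] = w     # w[0] + w[1:] . x = 0 is the initial hyperplane
    return planes
```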
6.4.2 Pairwise linear regression
For each pair of classes (or, after clustering, of subclasses) an initial discriminating hyperplane is computed by linear regression, giving a decision function of the form $f(x) = w_0 + w_1 x_1 + \cdots + w_n x_n$ whose sign indicates on which side of the hyperplane a pattern lies.

6.4.3 Learning procedure
To improve each hyperplane, a criterion function is defined which sums, over all patterns misclassified by that hyperplane, a penalty that grows with the pattern's distance from the hyperplane and is weighted by the corresponding misclassification cost.
This means that costs are included explicitly in the learning procedure, which consists of minimising the criterion function with respect to the weights $w_0, w_1, \ldots, w_n$ by a gradient descent algorithm for each decision surface successively.

6.4.4 Clustering of classes
If a class is suspected of being multimodal it can first be decomposed into subclasses. Starting from an initial partition of the class, the mean vector of each cluster and a clustering criterion function $F$ (based on the distances of the patterns from their cluster means)
is calculated. Patterns are moved from one cluster to another if such a move will improve the criterion function $F$. The mean vectors and the criterion function are updated after each pattern move. Like hillclimbing algorithms in general, these approaches guarantee local but not global optimisation. Different initial partitions and sequences of the training patterns can lead to different solutions. In the case of clustering, the number of two-class problems increases correspondingly.
We note that by the combination of the clustering algorithm with the regression technique the number and initial positions of discriminating hyperplanes are fixed a priori (i.e. before learning) in a reasonable manner, even in the case that some classes have multimodal distributions (i.e. consist of several subclasses). Thus a well known bottleneck of artificial neural nets can at least be partly avoided.
6.4.5 Description of the classification procedure
A new pattern is assigned to a class (or subclass) according to the signs of the decision functions, $\mathrm{sign}(f(x))$, of the discriminating hyperplanes, using the Boolean descriptions of the class and subclass regions introduced above.
7 Methods for Comparison
R. J. Henery
University of Strathclyde (Address for correspondence: Department of Statistics and Modelling Science, University of Strathclyde, Glasgow G1 1XH, U.K.)

procedure of Efron (1983). These three methods of estimating error rates are now described briefly.
7.1.1 Train-and-Test
The essential idea is this: a sample of data (the training data) is given to enable a classification rule to be set up. What we would like to know is the proportion of errors made by this rule when it is up-and-running, and classifying new observations without the benefit of knowing the true classifications. To do this, we test the rule on a second independent sample of new observations (the test data) whose true classifications are known but are not told to the classifier. The predicted and true classifications on the test data give an unbiased estimate of the error rate of the classifier. To enable this procedure to be carried out from a given set of data, a proportion of the data is selected at random (usually about 20-30%) and used as the test data. The classifier is trained on the remaining data, and then tested on the test data. There is a slight loss of efficiency here as we do not use the full sample to train the decision rule, but with very large datasets this is not a major problem. We adopted this procedure when the number of examples was much larger than 1000 (and allowed the use of a test sample of size 1000 or so). We often refer to this method as one-shot train-and-test.
7.1.2 Cross-validation
For moderate-sized samples, the procedure we adopted was cross-validation. In its most elementary form, cross-validation consists of dividing the data into $k$ subsamples. Each sub-sample is predicted via the classification rule constructed from the remaining $k-1$ subsamples, and the estimated error rate is the average error rate from these $k$ subsamples. In this way the error rate is estimated efficiently and in an unbiased way. The rule finally used is calculated from all the data. The leave-one-out method of Lachenbruch & Mickey (1968) is of course $k$-fold cross-validation with $k$ equal to the number of examples. Stone (1974) describes cross-validation methods for giving unbiased estimates of the error rate. A practical difficulty with the use of cross-validation in computer-intensive methods such as neural networks is the $k$-fold repetition of the learning cycle, which may require much computational effort.
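A minimal sketch of the procedure follows; it assumes NumPy arrays and takes the learning algorithm as a pair of placeholder callables (`train` and `predict`), which are assumptions of the sketch rather than anything prescribed by the text.

```python
import numpy as np

def cross_validated_error(X, y, train, predict, k=10, rng=None):
    """k-fold cross-validation estimate of the error rate.

    train(X, y)       -> model
    predict(model, X) -> predicted classes
    """
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        errors.append(np.mean(predict(model, X[test_idx]) != y[test_idx]))
    # The rule finally used would be built from all the data; the average
    # fold error estimates its error rate.
    return float(np.mean(errors))
```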
7.1.3 Bootstrap
The more serious objection to cross-validation is that the error estimates it produces are too scattered, so that the confidence intervals for the true error-rate are too wide. The bootstrap procedure gives much narrower confidence limits, but the penalty paid is that the estimated error-rates are optimistic (i.e. are biased downwards). The trade-off between bias and random error means that, as a general rule, the bootstrap method is preferred when the sample size is small, and cross-validation when the sample size is large. In conducting a comparative trial between methods on the same dataset, the amount of bias is not so important so long as the bias is the same for all methods. Since the bootstrap represents the best way to reduce variability, the most effective way to conduct comparisons in small datasets is to use the bootstrap. Since it is not so widely used in classification trials as perhaps it should be, we give an extended description here, although it must be admitted
that we did not use the bootstrap in any of our trials as we judged that our samples were large enough to use either cross-validation or train-and-test.
In statistical terms, the bootstrap is a non-parametric procedure for estimating parameters generally and error-rates in particular. The basic idea is to re-use the original dataset (of size $n$) to obtain new datasets also of size $n$ by re-sampling with replacement. See Efron (1983) for the definitive introduction to the subject and Crawford (1989) for an application to CART. Breiman et al. (1984) note that there are practical difficulties in applying the bootstrap to decision trees.
In the context of classification, the bootstrap idea is to replicate the whole classification experiment a large number of times and to estimate quantities like bias from these replicate experiments. Thus, to estimate the error rate in small samples (of size $n$ say), a large number $B$ of bootstrap samples are created, each sample being a replicate (randomly chosen) of the original sample. That is, a random sample of size $n$ is taken from the original sample by sampling with replacement. Sampling with replacement means, for example, that some data points will be omitted (on average about 37% of data will not appear in the bootstrap sample). Also, some data points will appear more than once in the bootstrap sample. Each bootstrap sample is used to construct a classification rule which is then used to predict the classes of those original data that were unused in the training set (so about 37% of the original data will be used as test set). This gives one estimate of the error rate for each bootstrap sample. The average error rates over all bootstrap samples are then combined to give an estimated error rate for the original rule. See Efron (1983) and Crawford (1989) for details. The main properties of the bootstrap have been summarised by Efron (1983) as follows.
Properties of cross-validation and bootstrap
Efron (1983) gives the following properties of the bootstrap as an estimator of error-rate. By taking $B$ very large (Efron recommends approximately 200), the statistical variability in the average error rate is small, and for small sample size $n$, this means that the bootstrap will have very much smaller statistical variability than the cross-validation estimate.
The bootstrap and cross-validation estimates are generally close for large sample sizes, and the ratio between the two estimates approaches unity as the sample size tends to infinity.
The bootstrap and cross-validation methods tend to be closer for smoother cost-functions than the 0-1 loss-function implicit in the error rates discussed above. However the bootstrap may be biased, even for large samples.
The effective sample size is determined by the number in the smallest classification group. Efron (1983) quotes a medical example with n = 155 cases, but primary interest centres on the 33 patients that died. The effective sample size here is 33.
For large samples, group-wise cross-validation may give better results than the leave-one-out method, although this conclusion seems doubtful.
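A minimal sketch of the resampling scheme described above follows: each rule is built on a bootstrap resample and tested on the examples left out of that resample (roughly 37% of the data). The plain averaging of the left-out error rates, the placeholder `train`/`predict` callables and the default B are assumptions of this sketch, and Efron's refinements (such as the .632 estimator) are not shown.

```python
import numpy as np

def bootstrap_error(X, y, train, predict, B=200, rng=None):
    """Average error rate over B bootstrap resamples, each rule being tested
    on the examples omitted from its own resample."""
    rng = np.random.default_rng(rng)
    n = len(X)
    errors = []
    for _ in range(B):
        boot = rng.integers(0, n, size=n)           # sample of size n with replacement
        out = np.setdiff1d(np.arange(n), boot)      # roughly 37% of examples act as test set
        if len(out) == 0:
            continue
        model = train(X[boot], y[boot])
        errors.append(np.mean(predict(model, X[out]) != y[out]))
    return float(np.mean(errors))
```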
parameter will indicate what the best choice of parameter should be. However, the error rate corresponding to this choice of parameter is a biased estimate of the error rate of the classification rule when tested on unseen data. When it is necessary to optimise a parameter in this way, we recommend a three-stage process for very large datasets: (i) hold back 20% as a test sample; (ii) of the remainder, divide into two, with one set used for building the rule and the other for choosing the parameter; (iii) use the chosen parameter to build a rule for the complete training sample (containing 80% of the original data) and test this rule on the test sample.
Thus, for example, Watkins (1987) gives a description of cross-validation in the context of testing decision-tree classification algorithms, and uses cross-validation as a means of selecting better decision trees. Similarly, in this book, cross-validation was used by Backprop in finding the optimal number of nodes in the hidden layer, following the procedure outlined above. This was done also for the trials involving Cascade. However, cross-validation runs involve a greatly increased amount of computational labour, increasing the learning time $k$-fold, and this problem is particularly serious for neural networks.
In StatLog, most procedures had a tuning parameter that can be set to a default value, and where this was possible the default parameters were used. This was the case, for example, with the decision trees: generally no attempt was made to find the optimal amount of pruning, and accuracy and mental fit (see Chapter 5) are thereby sacrificed for the sake of speed in the learning process.
1. Training Phase. The most elementary functionality required of any learning algorithm is to be able to take data from one file, file1 (by assumption file1 contains known classes), and create the rules. In addition:
- (Optionally) The resulting rules (or parameters defining the rule) may be saved to another file, file3;
- (Optionally) A cost matrix (in file2 say) can be read in and used in building the rules.
2. Testing Phase. The algorithm can read in the rules and classify unseen data, in the following sequence:
- Read in the rules or parameters from the training phase (either passed on directly from the training phase if that immediately precedes the testing phase, or read from the file file3);
- Read in a set of unseen data from a file file4 with true classifications that are hidden from the classifier;
- (Optionally) Read in a cost matrix from a file file5 (normally file5 = file2) and use this cost matrix in the classification procedure;
- (Optionally) Output the classifications to a file file6;
- If true classifications were provided in the test file file4, output to file file7 a confusion matrix whose rows represent the true classifications and whose columns represent the classifications made by the algorithm.
The two steps above constitute the most basic element of a comparative trial, and we describe this basic element as a simple Train-and-Test (TT) procedure. All algorithms used in our trials were able to perform the Train-and-Test procedure.
7.2.1 Cross-validation
To follow the cross-validation procedure, it is necessary to build an outer loop of control procedures that divide up the original file into its component parts and successively use each part as test file and the remaining part as training file. Of course, the cross-validation procedure results in a succession of mini-confusion matrices, and these must be combined to give the overall confusion matrix. All this can be done within the Evaluation Assistant shell provided the classification procedure is capable of the simple Train-and-Test steps above. Some more sophisticated algorithms may have a cross-validation procedure built in, of course, and if so this is a distinct advantage.
7.2.2 Bootstrap
The use of the bootstrap procedure makes it imperative that combining of results, files etc. is done automatically. Once again, if an algorithm is capable of simple Train-and-Test, it can be embedded in a bootstrap loop using Evaluation Assistant (although perhaps we should admit that we never used the bootstrap in any of the datasets reported in this book).
7.2.3 Evaluation Assistant
Evaluation Assistant is a tool that facilitates the testing of learning algorithms on given
datasets and provides standardised performance measures. In particular, it standardises
timings of the various phases, such as training and testing. It also provides statistics
describing the trial (mean error rates, total confusion matrices, etc. etc.). It can be obtained
from J. Gama of the University of Porto. For details of this, and other publicly available
software and datasets, see Appendices A and B. Two versions of Evaluation Assistant exist:
- Command version (EAC)
- Interactive version (EAI)
The command version of Evaluation Assistant (EAC) consists of a set of basic commands
that enable the user to test learning algorithms. This version is implemented as a set of
C-shell scripts and C programs.
The interactive version of Evaluation Assistant (EAI) provides an interactive interface
that enables the user to set up the basic parameters for testing. It is implemented in C and
the interactive interface exploits X windows. This version generates a customised version
of some EAC scripts which can be examined and modied before execution.
Both versions run on a SUN SPARCstation and other compatible workstations.
7.3 CHARACTERISATION OF DATASETS
An important objective is to investigate why certain algorithms do well on some datasets and not so well on others. This section describes measures of datasets which may help to explain our findings. These measures are of three types: (i) very simple measures such as the number of examples; (ii) statistically based, such as the skewness of the attributes; and (iii) information theoretic, such as the information gain of attributes. We discuss information theoretic measures in Section 7.3.3. There is a need for a measure which indicates when decision trees will do well. Bearing in mind the success of decision trees in image segmentation problems, it seems that some measure of multimodality might be useful in this connection.
Some algorithms have built in measures which are given as part of the output. For example, CASTLE measures the Kullback-Leibler information in a dataset. Such measures are useful in establishing the validity of specific assumptions underlying the algorithm and, although they do not always suggest what to do if the assumptions do not hold, at least they give an indication of internal consistency.
The measures should continue to be elaborated and refined in the light of experience.
7.3.1 Simple measures
The following descriptors of the datasets give very simple measures of the complexity or size of the problem. Of course, these measures might advantageously be combined to give other measures more appropriate for specific tasks, for example by taking products, ratios or logarithms.
Number of observations, $N$
This is the total number of observations in the whole dataset. In some respects, it might seem more sensible to count only the observations in the training data, but this is generally a large fraction of the total number in any case.
Number of attributes, $p$
The total number of attributes in the data as used in the trials. Where categorical attributes were originally present, these were converted to binary indicator variables.
Number of classes, $q$
The total number of classes represented in the entire dataset.
Number of binary attributes, Bin.att
The total number of attributes that are binary (including categorical attributes coded as indicator variables). By definition, the remaining $p - $Bin.att attributes are numerical (either continuous or ordered) attributes.
number of binary attributes, and if this is so, the skewness and kurtosis are directly related to each other. However, the statistical measures in this section are generally defined only for continuous attributes. Although it is possible to extend their definitions to include discrete and even categorical attributes, the most natural measures for such data are the information theoretic measures discussed in Section 7.3.3.
Test statistic for homogeneity of covariances
The covariance matrices are fundamental in the theory of linear and quadratic discrimination detailed in Sections 3.2 and 3.3, and the key in understanding when to apply one and not the other lies in the homogeneity or otherwise of the covariances. One measure of the lack of homogeneity of covariances is the geometric mean ratio of standard deviations of the populations of individual classes to the standard deviations of the sample, and is given by SD.ratio (see below). This quantity is related to a test of the hypothesis that all populations have a common covariance structure, i.e. to the hypothesis

$$ H_0 : \Sigma_1 = \Sigma_2 = \cdots = \Sigma_q, $$

which can be tested via Box's $M$ test statistic:

$$ M = \gamma \sum_{i=1}^{q} (n_i - 1) \log \big| S_i^{-1} S \big|, $$

where

$$ \gamma = 1 - \frac{2p^2 + 3p - 1}{6(p+1)(q-1)} \Big( \sum_i \frac{1}{n_i - 1} - \frac{1}{n - q} \Big), $$

and $S_i$ and $S$ are the unbiased estimators of the $i$th sample covariance matrix and the pooled covariance matrix respectively. This statistic has an asymptotic $\chi^2$ distribution with $\tfrac{1}{2}p(p+1)(q-1)$ degrees of freedom, and the approximation is good if each $n_i$ exceeds 20, and if $p$ and $q$ are both much smaller than every $n_i$.
In datasets reported in this volume these criteria are not always met, but the $M$ statistic can still be computed, and used as a characteristic of the data. The $M$ statistic can be re-expressed as the geometric mean ratio of standard deviations of the individual populations to the pooled standard deviations, via the expression

$$ \mathrm{SD.ratio} = \exp\Big\{ \frac{M}{p \sum_i (n_i - 1)} \Big\}. $$

The SD.ratio is strictly greater than unity if the covariances differ, and is equal to unity if and only if the M-statistic is zero, i.e. all individual covariance matrices are equal to the pooled covariance matrix.
In every dataset that we looked at the $M$ statistic is significantly different from zero, in which case the SD.ratio is significantly greater than unity.
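A sketch of the computation, following the reconstruction above, is given below. The exact form of the small-sample correction factor follows the standard Box's M formulation and is an assumption of the sketch, as are the function name and the use of NaN-free numeric data with nonsingular class covariance estimates.

```python
import numpy as np

def sd_ratio(X, y):
    """Box's M statistic and the SD.ratio measure for class-labelled data X, y."""
    classes = np.unique(y)
    n, p = X.shape
    q = len(classes)
    S_pooled = np.zeros((p, p))
    log_dets, dfs = [], []
    for c in classes:
        Xc = X[y == c]
        S_c = np.cov(Xc, rowvar=False)             # unbiased class covariance
        S_pooled += (len(Xc) - 1) * S_c
        log_dets.append(np.linalg.slogdet(S_c)[1])
        dfs.append(len(Xc) - 1)
    S_pooled /= (n - q)
    dfs = np.array(dfs, dtype=float)
    gamma = 1 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (q - 1)) * \
            (np.sum(1.0 / dfs) - 1.0 / (n - q))
    # M = gamma * sum_i (n_i - 1) * log(|S| / |S_i|)
    M = gamma * ((n - q) * np.linalg.slogdet(S_pooled)[1] - np.sum(dfs * np.array(log_dets)))
    return M, float(np.exp(M / (p * dfs.sum())))
```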
If corr.abs is near unity, there is much redundant information in the attributes and some procedures, such as logistic discriminants, may have technical problems associated with this. Also, CASTLE, for example, may be misled substantially by fitting relationships to the attributes, instead of concentrating on getting right the relationship between the classes and the attributes.
Canonical discriminant correlations
Assume that, in $p$ dimensional space, the sample points from one class form clusters of roughly elliptical shape around its population mean. In general, if there are $q$ classes, the means lie in a $q-1$ dimensional subspace. On the other hand, it happens frequently that the classes form some kind of sequence, so that the population means are strung out along some curve that lies in $k$ dimensional space, where $k < q-1$. The simplest case of all occurs when $k = 1$ and the population means lie along a straight line. Canonical discriminants are a way of systematically projecting the mean vectors in an optimal way to maximise the ratio of between-mean distances to within-cluster distances, successive discriminants being orthogonal to earlier discriminants. Thus the first canonical discriminant gives the best single linear combination of attributes that discriminates between the populations. The second canonical discriminant is the best single linear combination orthogonal to the first, and so on. The success of these discriminants is measured by the canonical correlations. If the first canonical correlation is close to unity, the means lie very nearly along a straight line. If the $k$th canonical correlation is near zero, the means lie in $k-1$ dimensional space.
Proportion of total variation explained by first k (=1,2,3,4) canonical discriminants
This is based on the idea of describing how the means for the various populations differ in attribute space. Each class (population) mean defines a point in attribute space, and, at its simplest, we wish to know if there is some simple relationship between these class means, for example, if they lie along a straight line. The sum of the first $k$ eigenvalues of the canonical discriminant matrix divided by the sum of all the eigenvalues represents the proportion of total variation explained by the first $k$ canonical discriminants. The total variation here is the trace of the canonical discriminant matrix, $\sum_i \lambda_i$. We calculate, as fractk, the values of

$$ \mathrm{fract}k = \frac{\lambda_1 + \cdots + \lambda_k}{\sum_i \lambda_i}, \qquad k = 1, 2, 3, 4. $$

This gives a measure of collinearity of the class means. When the classes form an ordered sequence, for example soil types might be ordered by wetness, the class means typically lie along a curve in low dimensional space. The $\lambda$'s are the squares of the canonical correlations. The significance of the $\lambda$'s can be judged from the $\chi^2$ statistics produced by manova. This representation of linear discrimination, which is due to Fisher (1936), is discussed also in Section 3.2.
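For concreteness, the fractk measures can be computed from the eigenvalues of the within/between-class scatter problem, as in the following sketch. The scatter-matrix formulation, the assumption of a nonsingular within-class scatter matrix and the function name are choices made here for illustration.

```python
import numpy as np

def fract_k(X, y, k_values=(1, 2, 3, 4)):
    """Proportion of total variation explained by the first k canonical discriminants."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))   # within-class scatter
    B = np.zeros((p, p))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        W += diff.T @ diff
        m = (Xc.mean(axis=0) - overall_mean)[:, None]
        B += len(Xc) * (m @ m.T)
    eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1]
    total = eigvals.sum()
    return {k: float(eigvals[:k].sum() / total) for k in k_values}
```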
Departure from normality
The assumption of multivariate normality underlies much of classical discrimination procedures. But the effects of departures from normality on the methods are not easily or
clearly understood. Moreover, in analysing multiresponse data, it is not known how robust classical procedures are to departures from multivariate normality. Most studies on
robustness depend on simulation studies. Thus, it is useful to have measures for verifying
the reasonableness of assuming normality for a given dataset. If available, such a measure
would be helpful in guiding the subsequent analysis of the data to make it more normally
distributed, or suggesting the most appropriate discrimination method. Andrews et al.
Sec. 7.3]
Characterisation of datasets
115
(1973), whose excellent presentation we follow in this section, discuss a variety of methods
for assessing normality.
With multiresponse data, the possibilities for departure from joint normality are many
and varied. One implication of this is the need for a variety of techniques with differing
sensitivities to the different types of departure and to the effects that such departures have
on the subsequent analysis.
Of great importance here is the degree of commitment one wishes to make to the
coordinate system for the multiresponse observations. At one extreme is the situation
where the interest is completely confined to the observed coordinates. In this case, the
marginal distributions of each of the observed variables and conditional distributions of
certain of these given certain others would be the objects of interest.
At the other extreme, the class of all nonsingular linear transformations of the variables
would be of interest. One possibility is to look at all possible linear combinations of the
variables and nd the maximum departure from univariate normality in these combinations
(Machado, 1983). Mardia et al. (1979) give multivariate measures of skewness and kurtosis that are invariant to affine transformations of the data: critical values of these statistics for small samples are given in Mardia (1974). These measures are difficult to compare across datasets with differing dimensionality. They also have the disadvantage that they do not reduce to the usual univariate statistics when the attributes are independent.
Our approach is to concentrate on the original coordinates by looking at their marginal
distributions. Moreover, the emphasis here is on a measure of non-normality, rather than on
a test that tells us how statistically significant is the departure from normality. See Ozturk
& Romeu (1992) for a review of methods for testing multivariate normality.
Univariate skewness and kurtosis
The usual measure of univariate skewness (Kendall et al., 1983) is $\sqrt{b_1}$, which is the ratio of the mean cubed deviation from the mean to the cube of the standard deviation:

$$ \sqrt{b_1} = \frac{\mathrm{E}\,[(X - \mu)^3]}{\sigma^3}, $$

although, for test purposes, it is usual to quote the square of this quantity, $b_1$. Another measure is defined via the ratio of the fourth moment about the mean to the fourth power of the standard deviation:

$$ b_2 = \frac{\mathrm{E}\,[(X - \mu)^4]}{\sigma^4}. $$

The quantity $b_2 - 3$ is generally known as the kurtosis of the distribution. However, we will refer to $b_2$ itself as the measure of kurtosis: since we only use this measure relative to other measurements of the same quantity within this book, this slight abuse of the term kurtosis may be tolerated. For the normal distribution, the measures are $\sqrt{b_1} = 0$ and $b_2 = 3$, and we will say that the skewness is zero and the kurtosis is 3, although the usual definition of kurtosis gives a value of zero for a normal distribution.
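The two sample statistics are computed directly from the data, as in this short sketch (the function names are placeholders):

```python
import numpy as np

def skewness(x):
    """Mean cubed deviation from the mean over the cube of the standard deviation."""
    d = x - x.mean()
    return float(np.mean(d**3) / np.std(x)**3)

def kurtosis(x):
    """Fourth moment about the mean over the fourth power of the standard deviation
    (equal to 3 for a normal distribution, as used in the text)."""
    d = x - x.mean()
    return float(np.mean(d**4) / np.std(x)**4)
```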
Mean skewness and kurtosis
Denote the skewness statistic for attribute $i$ in population (class) $j$ by $\mathrm{skew}_{ij}$. As a single measure of skewness for the whole dataset, we quote the mean of the absolute value of $\mathrm{skew}_{ij}$, averaged over all attributes and over all populations. For a normal population this measure is zero: for uniform and exponential variables, the theoretical values are zero and 2 respectively. Similarly, we find
the corresponding mean of the absolute value of the kurtosis statistic over all attributes and populations as a single measure of kurtosis for the dataset.

7.3.3 Information theoretic measures
Entropy of an attribute
The entropy $H(X)$ of a discrete random variable $X$ is defined as

$$ H(X) = -\sum_i \pi_i \log \pi_i, $$

where $\pi_i$ is the probability that $X$ takes on the $i$th value. Conventionally, logarithms are
to base 2, and entropy is then said to be measured in units called "bits" (binary information
units). In what follows, all logarithms are to base 2. The special cases to remember are:
Equal probabilities (uniform distribution). The entropy of a discrete random variable is maximal when all the $\pi_i$ are equal. If there are $k$ possible values for $X$, the maximal entropy is $\log k$.
Continuous variable with given variance. Maximal entropy is attained for normal variables, and this maximal entropy is $\tfrac{1}{2}\log(2\pi e \sigma^2)$.
In the context of classification schemes, the point to note is that an attribute that does not vary at all, and therefore has zero entropy, contains no information for discriminating between classes.
The entropy of a collection of attributes is not simply related to the individual entropies, but, as a basic measure, we can average the entropy over all the attributes and take this as a global measure of entropy of the attributes collectively. Thus, as a measure of entropy of the attributes we take the entropy $H(X_i)$ averaged over all attributes $X_1, \ldots, X_p$:

$$ \bar{H}(X) = \frac{1}{p} \sum_{i=1}^{p} H(X_i). $$

Entropy of class, $H(C)$
The class entropy is defined in the same way as

$$ H(C) = -\sum_i \pi_i \log \pi_i, $$

where $\pi_i$ is the prior probability for class $C_i$. Entropy is related to the average length of a
variable length coding scheme, and there are direct links to decision trees (see Jones, 1979
for example). Since class is essentially discrete, the class entropy $H(C)$ has maximal value when the classes are equally likely, so that $H(C)$ is at most $\log q$, where $q$ is the number of classes. A useful way of looking at the entropy $H(C)$ is to regard $2^{H(C)}$ as an effective number of classes.
Joint entropy of class and attribute, $H(C, X)$
The joint entropy $H(C, X)$ of two variables $C$ and $X$ is a measure of total entropy of the combined system of variables, i.e. the pair of variables $(C, X)$. If $p_{ij}$ denotes the joint probability of observing class $C_i$ and the $j$-th value of attribute $X$, the joint entropy is defined to be:

$$ H(C, X) = -\sum_{ij} p_{ij} \log p_{ij}. $$
This is a simple extension of the notion of entropy to the combined system of variables.
Mutual information of class and attribute, $M(C, X)$
The mutual information $M(C, X)$ of two variables $C$ and $X$ is a measure of common information or entropy shared between the two variables. If the two variables are independent, there is no shared information, and the mutual information $M(C, X)$ is zero. If $p_{ij}$ denotes the joint probability of observing class $C_i$ and the $j$-th value of attribute $X$, if the marginal probability of class $C_i$ is $\pi_i$, and if the marginal probability of attribute $X$ taking on its $j$-th value is $p_j$, then the mutual information is defined to be (note that there is no minus sign):

$$ M(C, X) = \sum_{ij} p_{ij} \log \frac{p_{ij}}{\pi_i\, p_j}. $$

Equivalently, $M(C, X) = H(C) + H(X) - H(C, X)$. As with the entropy, we take as our measure the average mutual information of class and attribute, $\bar{M}(C, X)$, obtained by averaging $M(C, X_i)$ over all the attributes $X_1, \ldots, X_p$.
This average mutual information gives a measure of how much useful information about
classes is provided by the average attribute.
Mutual information may be used as a splitting criterion in decision tree algorithms, and
is preferable to the gain ratio criterion of C4.5 (Pagallo & Haussler, 1990).
Equivalent number of attributes, EN.attr
The information required to specify the class is $H(C)$, and no classification scheme can be completely successful unless it provides at least $H(C)$ bits of useful information. This information is to come from the attributes taken together, and it is quite possible that the useful information $M(C, X)$ of all attributes together (here $X$ stands for the vector of attributes $X_1, \ldots, X_p$) is greater than the sum of the individual informations $\sum_i M(C, X_i)$. However, in the simplest (but most unrealistic) case that all attributes are independent, we would have

$$ M(C, X) = \sum_i M(C, X_i). $$

In this case the attributes contribute independent bits of useful information for classification purposes, and we can count up how many attributes would be required, on average, by taking the ratio between the class entropy $H(C)$ and the average mutual information $\bar{M}(C, X)$. Of course, we might do better by taking the attributes with highest mutual information, but the assumption of independent useful bits of information is very dubious in any case, so this simple measure is probably quite sufficient:

$$ \mathrm{EN.attr} = \frac{H(C)}{\bar{M}(C, X)}. $$
Noise-signal ratio, NS.ratio
This measures the amount of non-useful information in the attributes, and is defined as

$$ \mathrm{NS.ratio} = \frac{\bar{H}(X) - \bar{M}(C, X)}{\bar{M}(C, X)}. $$

High values of this measure
imply a dataset that contains much irrelevant information (noise). Such datasets could be condensed considerably without affecting the performance of the classifier, for example by removing irrelevant attributes, by reducing the number of discrete levels used to specify the attributes, or perhaps by merging qualitative factors. The notation NS.ratio denotes the Noise-Signal-Ratio. Note that this is the reciprocal of the more usual Signal-Noise-Ratio (SNR).
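The information-theoretic measures above are straightforward to compute from counts, as in the following sketch. It assumes class labels and attribute values that are already discretised and integer-coded; the function names and the base-2 logarithm convention follow the text, everything else is an assumption of the sketch.

```python
import numpy as np

def entropy(values):
    """Entropy (base 2) of a discrete variable given as an array of values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def joint_entropy(c, x):
    """Entropy of the pair (C, X), for integer-coded values."""
    pairs = np.stack([np.asarray(c), np.asarray(x)], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(c, x):
    """M(C, X) = H(C) + H(X) - H(C, X)."""
    return entropy(c) + entropy(x) - joint_entropy(c, x)

def dataset_measures(C, attributes):
    """EN.attr and NS.ratio from a class vector and a list of discretised attributes."""
    HC = entropy(C)
    H_bar = np.mean([entropy(x) for x in attributes])
    M_bar = np.mean([mutual_information(C, x) for x in attributes])
    return {"EN.attr": HC / M_bar, "NS.ratio": (H_bar - M_bar) / M_bar}
```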
Irrelevant attributes
The mutual information $M(C, X)$ between class $C$ and attribute $X$ can be used to judge if attribute $X$ could, of itself, contribute usefully to a classification scheme. Attributes with small values of $M(C, X)$ would not, by themselves, be useful predictors of class. In this context, interpreting the mutual information as a deviance statistic would be useful, and we can give a lower bound to statistically significant values for mutual information. Suppose that attribute $X$ and class $C$ are, in fact, statistically independent, and suppose that $X$ has $l$ distinct levels. Assuming further that the sample size $N$ is large, then it is well known that the deviance statistic $2N\,M(C, X)$ is approximately equal to the chi-square statistic for testing the independence of attribute and class (for example Agresti, 1990). Therefore $2N\,M(C, X)$ has an approximate $\chi^2$ distribution with $(q-1)(l-1)$ degrees of freedom, and order of magnitude calculations indicate that the mutual information contributes significantly (in the hypothesis testing sense) if its value exceeds $(q-1)(l-1)/N$, where $q$ is the number of classes, $N$ is the number of examples, and $l$ is the number of discrete levels for the attribute.
In our measures, $l$ is the number of levels for integer or binary attributes, and for continuous attributes we chose $l$ so that, on average, there were about 10 observations per cell in the two-way table of attribute by class, but occasionally the number of levels for so-called continuous attributes was less than this. If we adopt a critical level for the $\chi^2$ distribution as twice the number of degrees of freedom, for the sake of argument, we obtain an approximate critical level for the mutual information as $(q-1)(l-1)/N$. With our chosen value of $l$, this is of order 0.1 for continuous attributes.
We have not quoted any measure of this form, as almost all attributes are relevant in this
sense (and this measure would have little information content!). In any case, an equivalent
measure would be the difference between the actual number of attributes and the value of
EN.attr.
These measures take no account of any lack of independence, and are therefore very crude approximations to reality. There are, however, some simple results concerning the multivariate normal distribution, for which the entropy is

$$ H = \tfrac{1}{2} \log\big( (2\pi e)^p\, |\Sigma| \big), $$

where $|\Sigma|$ is the determinant of the covariance matrix of the variables. Similar results hold for mutual information, and there are then links with the statistical measures elaborated in Section 7.3.2. Unfortunately, even if such measures were used for our datasets, most datasets are so far from normality that the interpretation of the resulting measures would be very questionable.
7.4 PRE-PROCESSING
Usually there is no control over the form or content of the vast majority of datasets. Generally, they are already converted from whatever raw data was available into some suitable format, and there is no way of knowing if the manner in which this was done was consistent, or perhaps chosen to fit in with some pre-conceived type of analysis. In some datasets, it is very clear that some very drastic form of pre-processing has already been done; see Section 9.5.4, for example.
7.4.1 Missing values
Some algorithms (e.g. Naive Bayes, CART, CN2, Bayes Tree, NewID, C4.5, Cal5, AC2) can deal with missing values, whereas others require that the missing values be replaced. The procedure Discrim was not able to handle missing values, although this can be done in principle for linear discrimination for certain types of missing value. In order to get comparable results we settled on a general policy of replacing all missing values. Where an attribute value was missing it was replaced by the global mean or median for that attribute. If the class value was missing, the whole observation was omitted. Usually, the proportion of cases with missing information was very low. As a separate exercise it would be of interest to learn how much information is lost (or gained) in such a strategy by those algorithms that can handle missing values.
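The replacement policy itself is a one-liner per attribute, as in the following sketch; the use of NaN to mark missing values and the function name are assumptions made here, not part of the StatLog file format.

```python
import numpy as np

def replace_missing(X, use_median=False):
    """Replace missing attribute values (coded as NaN) by the global mean or
    median of that attribute, as in the general policy described above."""
    X = np.array(X, dtype=float, copy=True)
    for j in range(X.shape[1]):
        col = X[:, j]
        fill = np.nanmedian(col) if use_median else np.nanmean(col)
        col[np.isnan(col)] = fill
    return X
```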
Unfortunately, there are various ways in which missing values might arise, and their treatment is quite different. For example, a clinician may normally use the results of a blood-test in making a diagnosis. If the blood-test is not carried out, perhaps because of faulty equipment, the blood-test measurements are missing for that specimen. A situation that may appear similar results from doing measurements on a subset of the population, for example only doing pregnancy tests on women, where the test is not relevant for men (and so is missing for men). In the first case, the measurements are missing at random, and in the second the measurements are structured, or hierarchical. Although the treatment of these two cases should be radically different, the necessary information is often lacking. In at least one dataset (technical), it would appear that this problem arises in a very extreme manner, as it would seem that missing values are coded as zero, and that a large majority of observations is zero.
7.4.2 Feature selection and extraction
reduction process was performed in advance of the trials. Again it is of interest to note
which algorithms can cope with the very large datasets. There are several ways in which
data reduction can take place. For example, the Karhunen-Loeve transformation can be used with very little loss of information; see Section 9.6.1 for an example. Another way
of reducing the number of variables is by a stepwise procedure in a Linear Discriminant
procedure, for example. This was tried on the Cut50 dataset, in which a version Cut20
with number of attributes reduced from 50 to 20 was also considered. Results for both these
versions are presented, and make for an interesting paired comparison: see the section on
paired comparisons for the Cut20 dataset in Section 10.2.2.
In some datasets, particularly image segmentation, extra relevant information can be
included. For example, we can use the prior knowledge that examples which are neighbours are likely to have the same class. A dataset of this type is considered in Section
9.6.5 in which a satellite image uses the fact that attributes of neighbouring pixels can give
useful information in classifying the given pixel.
Especially in an exploratory study, practitioners often combine attributes in an attempt to increase the descriptive power of the resulting decision tree/rules etc. For example, it might be conjectured that it is the sum of two attributes ($x_1 + x_2$) that is important rather than each attribute separately. Alternatively, some ratios are included such as $x_1 / x_2$.
In our trials we did not introduce any such combinations. On the other hand, there existed
already some linear combinations of attributes in some of the datasets that we looked at. We
took the view that these combinations were included because the dataset provider thought
that these particular combinations were potentially useful. Although capable of running on
attributes with linear dependencies, some of the statistical procedures prefer attributes that
are linearly independent, so when it came to running LDA (Discrim), QDA (Quadisc) and
logistic discrimination (Logdisc) we excluded attributes that were linear combinations of
others. This was the case for the Belgian Power data which is described in section 9.5.5.
Although, in principle, the performance of linear discriminant procedures is not affected
by the presence of linear combinations of attributes, in practice the resulting singularities
are best avoided for numerical reasons.
As the performance of statistical procedures is directly related to the statistical properties
of the attributes, it is generally advisable to transform the attributes so that their marginal
distributions are as near normal as possible. Each attribute is considered in turn, and
some transformation, usually from the power-law family, is made on the attribute. Most
frequently, this is done by taking the square-root, logarithm or reciprocal transform. These
transforms may help the statistical procedures: in theory, at least, they have no effect on
non-parametric procedures, such as the decision trees, or Naive Bayes.
7.4.3 Large number of categories
We describe now the problems that arise for decision trees and statistical algorithms alike when an attribute has a large number of categories. Firstly, in building a decision tree, a potential split of a categorical attribute is based on some partitioning of the categories, one partition going down one side of the split and the remainder down the other. The number of potential splits is $2^{L-1} - 1$, where $L$ is the number of different categories (levels) of the attribute. Clearly, if $L$ is much larger than ten, there is an enormous computational load, and the tree takes a very long time to train. However, there is a computational shortcut that applies
to two-class problems (see Clark & Pregibon, 1992 for example). The shortcut method is not implemented in all StatLog decision-tree methods. With the statistical algorithms, a categorical attribute with $L$ categories (levels) needs $L - 1$ binary variables for a complete specification of the attribute.
Now it is a fact that decision trees behave differently for categorical and numerical data. Two datasets may be logically equivalent, yet give rise to different decision trees. As a trivial example, with two numerical attributes $x$ and $y$, statistical algorithms would probably see exactly the same predictive value in the pair of attributes ($x + y$, $x - y$) as in the original pair ($x$, $y$), yet the decision trees would be different, as the decision boundaries would now be at an angle of 45 degrees. When categorical attributes are replaced by binary variables the decision trees will be very different, as most decision tree procedures look at all possible subsets of attribute values when considering potential splits. There is the additional, although perhaps not so important, point that the interpretation of the tree is rendered more difficult.
It is therefore of interest to note where decision tree procedures get almost the same
accuracies on an original categorical dataset and the processed binary data. NewID, as
run by ISoft for example, obtained an accuracy of 90.05% on the processed DNA data
and 90.80% on the original DNA data (with categorical attributes). These accuracies are
probably within what could be called experimental error, so it seems that NewID does about
as well on either form of the DNA dataset.
In such circumstances, we have taken the view that for comparative purposes it is
better that all algorithms are run on exactly the same preprocessed form. This way we
avoid differences in preprocessing when comparing performance. When faced with a new
application, it will pay to consider very carefully what form of preprocessing should be
done. This is just as true for statistical algorithms as for neural nets or machine learning.
7.4.4
First, some general remarks on potential bias in credit datasets. We do not know the way in
which the credit datasets were collected, but it is very probable that they were biased in the
following way. Most credit companies are very unwilling to give credit to all applicants.
As a result, data will be gathered for only those customers who were given credit. If the
credit approval process is any good at all, the proportion of bad risks among all applicants
will be significantly higher than in the given dataset. It is also very likely that the profiles
of creditors and non-creditors are very different, so rules deduced from the creditors will
have much less relevance to the target population (of all applicants).
When the numbers of good and bad risk examples are widely different, and one would
expect that the bad risk examples would be relatively infrequent in a well managed lending
concern, it becomes rather awkward to include all the data in the training of a classification
procedure. On the one hand, if we are to preserve the true class proportions in the training
sample, the total number of examples may have to be extremely large in order to guarantee
sufficient bad risk examples for a reliable rule. On the other hand, if we follow the common
practice in such cases and take as many bad risk examples as possible, together with a
matching number of good risk examples, we are constructing a classification rule with its
boundaries in the wrong places. The common practice is to make an adjustment to the
boundaries to take account of the true class proportions. In the case of two classes, such
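A minimal sketch of such a boundary adjustment, assuming a two-class rule that outputs posterior probabilities estimated under the (artificially balanced) training proportions: the posteriors are reweighted by the ratio of true to training priors. This is only an illustration of the general idea, not the specific correction used by any StatLog algorithm.

def adjust_posterior(p_bad_train, train_prior_bad, true_prior_bad):
    """Convert a posterior P(bad | x) estimated under the training class
    proportions into the posterior under the true class proportions."""
    # Reweight the two class posteriors by (true prior / training prior).
    w_bad = true_prior_bad / train_prior_bad
    w_good = (1.0 - true_prior_bad) / (1.0 - train_prior_bad)
    num = w_bad * p_bad_train
    den = num + w_good * (1.0 - p_bad_train)
    return num / den

# A rule trained on a 50-50 sample, applied where only 5% of applicants are bad risks:
p = adjust_posterior(p_bad_train=0.60, train_prior_bad=0.50, true_prior_bad=0.05)
print(round(p, 3))   # much smaller than 0.60 once the true priors are taken into account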
7.4.5
Hierarchical attributes
It often happens that information is relevant only to some of the examples. For example,
certain questions in a population census may apply only to the householder, or certain
medical conditions apply to females. There is then a hierarchy of attributes: primary
variables refer to all members (Sex is a primary attribute); secondary attributes are only
relevant when the appropriate primary attribute is applicable (Pregnant is secondary to
Sex = Female); tertiary variables are relevant when a secondary variable applies (Duration
of pregnancy is tertiary to Pregnant = True); and so on. Note that testing all members of
a population for characteristics of pregnancy is not only pointless but wasteful. Decision
tree methods are readily adapted to deal with such hierarchical datasets, and the algorithm
AC2 has been so designed.
The Machine Fault dataset (see Section 9.5.7), which was created by ISoft, is an example
of a hierarchical dataset, with some attributes being present for one subclass of examples
and not for others. Obviously AC2 can deal with this dataset in its original form, but, from
the viewpoint of the other algorithms, the dataset is unreadable, as it has a variable number
of attributes. Therefore, an alternative version needs to be prepared. Of course, the flat
form has lost some of the information that was available in the hierarchical structure of the
data. The fact that AC2 does best on this dataset when it uses this hierarchical information
suggests that the hierarchical structure is related to the decision class.
Coding of hierarchical attributes
Hierarchical attributes can be coded into flat format without difficulty, in that a one-to-one
correspondence can be set up between the hierarchically structured data and the flat format.
We illustrate the procedure for an artificial example. Consider the primary attribute Sex.
When Sex takes the value male, the value of the attribute Baldness is recorded as one of
(Yes, No), but when Sex takes the value female the attribute Baldness is simply 'Not
applicable'. One way of coding this information in flat format is to give two attributes,
the first denoting Sex and the second Baldness. The three possible pairs of values are (1, 1), (1, 0)
and (0, 0). In this formulation, the primary variable is explicitly available through the
value of the first attribute, but there is the difficulty, here not too serious, that when the second attribute is equal to 0,
it is not clear whether this means 'not bald' or 'not applicable'. Strictly, there are three
possible values for Baldness: 'bald', 'not bald' and 'not applicable', the first two possibilities
applying only to males. This gives a second formulation, in which the two attributes are
lumped together into a single attribute, whose possible values represent the possible states
of the system. In the example, the possible states are 'bald male', 'not bald male' and
'female'. Of course, none of the above codings enables ordinary classifiers to make use
of the hierarchical structure: they are designed merely to represent the information in flat
form with the same number of attributes per example. Breiman et al. (1984) indicate how
hierarchical attributes may be programmed into a tree-building procedure. A logical flag
indicates if a test on an attribute is permissible, and for a secondary attribute this flag is set
to true only when the corresponding primary attribute has already been tested.
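The two flat codings of the illustration can be written out explicitly as follows (a sketch; the attribute names and values are those of the Sex/Baldness example only, not of any StatLog dataset):

def flat_two_attribute(sex, baldness):
    """First formulation: two flat attributes (is_male, is_bald).
    'Not applicable' for females is collapsed onto the value 0."""
    is_male = 1 if sex == "male" else 0
    is_bald = 1 if (sex == "male" and baldness == "yes") else 0
    return (is_male, is_bald)

def flat_single_state(sex, baldness):
    """Second formulation: one attribute listing the possible states of the system."""
    if sex == "female":
        return "female"
    return "bald male" if baldness == "yes" else "not bald male"

print(flat_two_attribute("male", "yes"))    # (1, 1)
print(flat_two_attribute("male", "no"))     # (1, 0)
print(flat_two_attribute("female", None))   # (0, 0)
print(flat_single_state("female", None))    # female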
8
Review of Previous Empirical Comparisons
R. J. Henery
University of Strathclyde
8.1 INTRODUCTION
It is very difficult to make sense of the multitude of empirical comparisons that have been
made. So often, the results are apparently in direct contradiction, with one author claiming
that decision trees are superior to neural nets, and another making the opposite claim.
Even allowing for differences in the types of data, it is almost impossible to reconcile the
various claims that are made for this or that algorithm as being faster, or more accurate,
or easier, than some other algorithm. There are no agreed objective criteria by which to
judge algorithms, and in any case subjective criteria, such as how easy an algorithm is to
program or run, are also very important when a potential user makes his choice from the
many methods available.
Nor is it much help to say to the potential user that a particular neural network, say,
is better for a particular dataset. Nor are the labels neural network and Machine Learning
particularly helpful either, as there are different types of algorithms within these categories.
What is required is some way of categorising the datasets into types, with a statement that
for such-and-such a type of dataset, such-and-such a type of algorithm is likely to do well.
The situation is made more difficult because rapid advances are being made in all
three areas: Machine Learning, Neural Networks and Statistics. So many comparisons are
made between, say, a state-of-the-art neural network and an outmoded Machine Learning
procedure like ID3.
8.2 BASIC TOOLBOX OF ALGORITHMS
Before discussing the various studies, let us make tentative proposals for candidates in
future comparative trials, i.e. let us say what, in our opinion, should form the basis of a toolbox
of good classification procedures. In doing so, we are implicitly making a criticism of any
comparative studies that do not include these basic algorithms, or something like them.
Most are available as public domain software. Any that are not can be made available
Address for correspondence: Department of Statistics and Modelling Science, University of Strathclyde,
Glasgow G1 1XH, U.K.
from the database of algorithms administered from Porto (see Appendix B). So there is no
excuse for not including them in future studies!
1. We should probably always include the linear discriminant rule, as it is sometimes
best, but for the other good reason that it is a standard algorithm, and the most widely
available of all procedures.
2. On the basis of our results, the k-nearest neighbour method was often the outright
winner (although if there are scaling problems it was sometimes outright loser too!)
so it would seem sensible to include k-nearest neighbour in any comparative studies.
Although the generally good performance of k-nearest neighbour is well known, it is
surprising how few past studies have involved this procedure, especially as it is so easy
to program.
3. In many cases where k-nearest neighbour did badly, the decision-tree methods did
relatively well, for example in the (non-cost-matrix) credit datasets. So some kind of
decision tree should be included.
4. Yet again, some of the newer statistical procedures got very good results when all other
methods were struggling. So we would also recommend the inclusion of, say, SMART
as a modern statistical procedure.
5. Representing neural networks, we would probably choose LVQ and/or radial basis
functions, as these seem to have a distinct edge over the version of backpropagation
that we used. However, as the performance of LVQ seems to mirror that of k-NN rather
closely, we would recommend inclusion of RBF rather than LVQ if k-NN is already
included.
Any comparative study that does not include the majority of these algorithms is clearly
not aiming to be complete. Also, any comparative study that looks at only two procedures
cannot give reliable indicators of performance, as our results show.
We have attempted to minimise the above problems in our own study, for example, by adopting a uniform policy for missing values and a uniform manner of dealing with categorical
variables in some, but not all, of the datasets.
8.4 PREVIOUS EMPIRICAL COMPARISONS
While it is easy to criticise past studies on the above grounds, nonetheless many useful
comparative studies have been carried out. What they may lack in generality, they may
gain in specifics, the conclusion being that, for at least one dataset, algorithm A is superior
(faster, more accurate, ...) to algorithm B. Other studies may also investigate other
aspects more fully than we did here, for example, by studying learning curves, i.e. the
amount of data that must be presented to an algorithm before it learns something useful.
In studying particular characteristics of algorithms, the role of simulations is crucial, as
it enables controlled departures from assumptions, giving a measure of robustness, etc.
(Although we have used some simulated data in our study, namely the Belgian datasets,
this was done because we believed that the simulations were very close to the real-world
problem under study, and it was hoped that our trials would help in understanding this
particular problem.)
Here we will not discuss the very many studies that concentrate on just one procedure
or set of cognate procedures: rather we will look at cross-disciplinary studies comparing
algorithms with widely differing capabilities. Among the former however, we may mention
comparisons of symbolic (ML) procedures in Clark & Boswell (1991), Sammut (1988),
Quinlan et al. (1986) and Aha (1992); statistical procedures in Cherkaoui & Cleroux (1991),
Titterington et al. (1981) and Remme et al. (1980), and neural networks in Huang et al.
(1991), Fahlman (1991a), Xu et al. (1991) and Ersoy & Hong (1991). Several studies use
simulated data to explore various aspects of performance under controlled conditions, for
example, Cherkaoui & Cleroux (1991) and Remme et al. (1980).
8.5 INDIVIDUAL RESULTS
Particular methods may do well in some specic domains and for some performance
measures, but not in all applications. For example, k-nearest neighbour performed very
well in recognising handwritten characters (Aha, 1992) and (Kressel, 1991) but not as well
on the sonar-target task (Gorman & Sejnowski, 1988).
ID3, which has been repeatedly shown to be less effective than its successors (NewID and
C4.5 in this book).
Kirkwood et al. (1989) found that a symbolic algorithm, ID3, performed better than
discriminant analysis for classifying the gait cycle of articial limbs. Tsaptsinos et al.
(1990) also found that ID3 was preferable on an engineering control problem to
two neural network algorithms. However, on different tasks other researchers found that a
higher order neural network (HONN) performed better than ID3 (Spivoska & Reid, 1990)
and back-propagation did better than CART (Atlas et al., 1991). Gorman & Sejnowski
(1988) reported that back-propagation outperformed nearest neighbour for classifying sonar
targets, whereas some Bayes algorithms were shown to be better on other tasks (Shadmehr
& D'Argenio, 1990).
More extensive comparisons have also been carried out between neural network and
symbolic methods. However, the results of these studies were inconclusive. For example,
whereas Weiss & Kulikowski (1991) and Weiss & Kapouleas (1989) reported that backpropagation performed worse than symbolic methods (i.e. CART and PVM), Fisher &
McKusick (1989) and Shavlik et al. (1989) indicated that back-propagation did as well or
better than ID3. Since these are the most extensive comparisons to date, we describe their
findings briefly and detail their limitations in the following two paragraphs.
First, Fisher & McKusick (1989) compared the accuracy and learning speed (i.e. the
number of example presentations required to achieve asymptotic accuracy) of ID3 and back-propagation. This study is restricted in the selection of algorithms, evaluation measures, and
data sets. Whereas ID3 cannot tolerate noise, several descendants of ID3 can tolerate noise
more effectively (for example, Quinlan, 1987b), which would improve their performance
on many noisy data sets. Furthermore, their measure of speed, which simply counted the
number of example presentations until asymptotic accuracy was attained, unfairly favours
ID3. Whereas the training examples need be given to ID3 only once, they were repeatedly
presented to back-propagation to attain asymptotic accuracies. However, their measure
ignored that back-propagation's cost per example presentation is much lower than ID3's.
This measure of speed was later addressed in Fisher et al. (1989), where they defined speed
as the product of total example presentations and the cost per presentation. Finally, the
only data set with industrial ramifications used in Fisher & McKusick (1989) is the Garvan
Institute's thyroid disease data set. We advocate using more such data sets.
Second, Mooney et al. (1989) and Shavlik et al. (1991) compared similar algorithms
on a larger collection of data sets. There were only three algorithms involved (i.e. ID3,
perceptron and back-propagation). Although it is useful to compare the relative performance of a few algorithms, the symbolic learning and neural network fields are rapidly
developing; there are many newer algorithms that can also solve classication tasks (for
example, CN2 (Clark & Boswell, 1991), C4.5 (Quinlan, 1987b), and radial basis networks
(Poggio & Girosi, 1990)). Many of these can outperform the algorithms selected here. Thus,
they should also be included in a broader evaluation. In Fisher & McKusick (1989),
Mooney et al. (1989) and Shavlik et al. (1991), data sets were separated into a collection
of training and test sets. After each system processed a training set its performance, in
terms of error rate and training time, was measured on the corresponding test set. The final
error rate was the geometric mean of separate tests. Mooney et al. (1989) and Shavlik
et al. (1991) measured speed differently from Fisher et al. (1989); they used the length of
training. In both measures, Mooney et al. (1989) and Shavlik et al. (1991) and Fisher et al.
(1990) found that back-propagation was significantly slower than ID3. Other significant
characteristics are: 1) they varied the number of training examples and studied the effect
this has on performance; and 2) they degraded the data in several ways and
investigated the sensitivity of the algorithms to the quality of the data.
8.7 STUDIES INVOLVING ML, k-NN AND STATISTICS
Thrun, Mitchell, and Cheng (1991) conducted a co-ordinated comparison study of many
algorithms on the MONK's problem. This problem features 432 simulated robots classified
into two classes using six attributes. Although some algorithms outperformed others, there
was no apparent analysis of the results. This study is of limited practical interest as it
involved simulated data, and, even less realistically, was capable of error-free classification.
Other small-scale comparisons include Huang & Lippmann (1987), Bonelli & Parodi
(1991) and Sethi & Otten (1990), who all concluded that the various neural networks
performed similarly to, or slightly better than, symbolic and statistical algorithms.
The study of Weiss & Kapouleas (1989) involved a few (linear) discriminants and ignored much of the
new development in modern statistical classification methods. Ripley (1993) compared a
diverse set of statistical methods, neural networks, and a decision tree classifier on the Tsetse
fly data. This is a restricted comparison because it has only one data set and includes only
one symbolic algorithm. However, some findings are nevertheless interesting. In accuracy,
the results favoured nearest neighbour, the decision tree algorithm, back-propagation and
projection pursuit. The decision tree algorithm rapidly produced the most interpretable results.
More importantly, Ripley (1993) also described the degree of frustration in getting some
algorithms to produce the eventual results (whereas others, for example, Fisher & McKusick
(1989) and Shavlik et al. (1991) did not). The neural networks were bad in this respect: they
were very sensitive to various system settings (for example, hidden units and the stopping
criterion) and they generally converged to the final accuracies slowly.
Of course, the inclusion of statistical algorithms does not, of itself, make the comparisons valid. For example, statisticians would be wary of applying a Bayes algorithm
to the four problems involved in Weiss & Kapouleas (1989) because of the lack of basic
information regarding the prior and posterior probabilities in the data. This same criticism
could be applied to many, if not most, of the datasets in common use. The class proportions are clearly unrealistic, and as a result it is difficult to learn the appropriate rule.
Machine Learning algorithms in particular are generally not adaptable to changes in class
proportions, although it would be straightforward to implement this.
8.8 SOME EMPIRICAL STUDIES RELATING TO CREDIT RISK
As this is an important application of Machine Learning methods, we take some time to
mention some previous empirical studies concerning credit datasets.
8.8.1
9
Dataset Descriptions and Results
9.1 INTRODUCTION
We group the dataset results according to domain type, although this distinction is perhaps
arbitrary at times. There are three credit datasets, of which two follow in the next section;
the third dataset (German credit) involved a cost matrix, and so is included in Section 9.4
with other cost matrix datasets. Several of the datasets involve image data of one form
or another. In some cases we are attempting to classify each pixel, and thus segment the
image, and in other cases, we need to classify the whole image as an object. Similarly the
data may be of raw pixel form, or else processed data. These datasets are given in Section
9.3. The remainder of the datasets are harder to group and are contained in Section 9.5.
See the appendices for general availability of datasets, algorithms and related software.
The tables contain information on time, memory and error rates for the training and test
sets. The time has been standardised for a SUN IPC workstation (quoted at 11.1 SPECs),
and for the cross-validation studies the quoted times are the average for each cycle. The
unit of memory is the maximum number of pages used during run time. This quantity is
obtained from the set time UNIX command and includes the program requirements as well
as data and rules stored during execution. Ideally, we would like to decompose this quantity
into memory required by the program itself, and the amount during the training, and testing
phase, but this was not possible. A page is currently 4096 bytes, but the quoted figures
are considered to be very crude. Indeed, both time and memory measurements should be
treated with great caution, and only taken as a rough indication of the truth.
In all tables we quote the error rate for the Default rule, in which each observation
is allocated to the most common class. In addition there is a rank column which orders
the algorithms on the basis of the error rate for the test data. Note, however, that this is not
the only measure on which they could be ranked, and many practitioners will place great
importance on time, memory, or interpretability of the algorithm's classifying rule. We
use the notation * for missing (or not applicable) information, and FD to indicate that
Address for correspondence: Charles Taylor, Department of Statistics, University of Leeds, Leeds LS2 9JT,
U.K.
an algorithm failed on that dataset. We tried to determine reasons for failure, but with little
success. In most cases it was a Segmentation Violation probably indicating a lack of
memory.
In Section 9.6, we present both the statistical and information-based measures for all
of the datasets, and give an interpretation for a few of the datasets.
9.2 CREDIT DATASETS
9.2.1 Credit management
This dataset was donated to the project by a major British engineering company, and comes
from the general area of credit management, that is to say, assessing methods for pursuing
debt recovery. Credit Scoring (CS) is one way of giving an objective score indicative of
credit risk: it aims to give a numerical score, usually containing components from various
factors indicative of risk, by which an objective measure of credit risk can be obtained.
The aim of a credit scoring system is to assess the risk associated with each application for
credit. Being able to assess the risk enables the bank to improve their pricing, marketing
and debt recovery procedures. Inability to assess the risk can result in lost business. It
is also important to assess the determinants of the risk: Lawrence & Smith (1992) state
that payment history is the overwhelming factor in predicting the likelihood of default in
mobile home credit cases. Risk assessment may influence the severity with which bad debts
are pursued. Although it might be thought that the proper end product in this application
should be a risk factor or probability assessment rather than a yes-no decision, the dataset
was supplied with pre-allocated classes. The aim in this dataset was therefore to classify
customers (by simple train-and-test) into one of the two given classes. The classes can be
interpreted as the method by which debts will be retrieved, but, for the sake of brevity, we
refer to classes as good and bad risk.
Table 9.1: Previously obtained results for the original Credit management data, with equal
class proportions (* supplied by the Turing Institute, ** supplied by the dataset providers).
algorithm       error rate
NewID*          0.05
CN2*            0.06
Neural Net**    0.06
The original dataset had 20 000 examples of each class. To make this more representative of the population as a whole (where approximately 5% of credit applicants were
assessed by a human as bad risk), the dataset used in the project had 20 000 examples
with 1000 of these being class 1 (bad credit risk) and 19 000 class 2 (good credit risk). As
is common when the (true) proportion of bad credits is very small, the default rule (to grant
credit to all applicants) achieves a small error rate (which is clearly 5% in this case). In such
circumstances the credit-granting company may well adopt the default strategy for the sake
of good customer relations; see Lawrence & Smith (1992). However, most decision tree
algorithms do worse than the default if they are allowed to train on the given data, which is
strongly biased towards bad credits (typically decision tree algorithms have an error rate of
around 6%). This problem disappears if the training set has the proper class proportions. For example, a version of CART (the Splus module tree()) obtained an error rate
Table 9.2: Results for the Credit management dataset (2 classes, 7 attributes, (train, test) = (15 000, 5 000) observations).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim         68        32.2         3.8    0.031    0.033      13
Quadisc         71        67.2        12.5    0.051    0.050      21
Logdisc        889       165.6        14.2    0.031    0.030       8
SMART          412     27930.0         5.4    0.021    0.020       1
ALLOC80        220     22069.7           *    0.033    0.031      10
k-NN           108    124187.0       968.0    0.028    0.088      22
CASTLE          48       370.1        81.4    0.051    0.047      19
CART            FD          FD          FD       FD       FD
IndCART       1656       423.1       415.7    0.010    0.025       6
NewID          104      3035.0         2.0    0.000    0.033      13
AC2           7250      5418.0      3607.0    0.000    0.030       8
Baytree       1368        53.1         3.3    0.002    0.028       7
NaiveBay       956        24.3         2.8    0.041    0.043      16
CN2           2100      2638.0         9.5    0.000    0.032      12
C4.5           620       171.0       158.0    0.014    0.022       3
ITrule         377      4470.0         1.9    0.041    0.046      18
Cal5           167       553.0         7.2    0.018    0.023       4
Kohonen        715           *           *    0.037    0.043      16
DIPOL92        218      2340.0        57.8    0.020    0.020       1
Backprop       148      5950.0         3.0    0.020    0.023       4
RBF            253       435.0        26.0    0.033    0.031      10
LVQ            476      2127.0        52.9    0.024    0.040      15
Default          *           *           *    0.051    0.047      19
of 5.8% on the supplied data but only 2.35% on the dataset with proper class proportions,
whereas linear discriminants obtained an error rate of 5.4% on the supplied data and 2.35%
on the modified proportions. (The supplier of the credit management dataset quotes error
rates for neural nets and decision trees of around 5-6% also when trained on the 50-50
dataset). Note that the effective bias is in favour of the non-statistical algorithms here, as
statistical algorithms can cope, to a greater or lesser extent, with prior class proportions
that differ from the training proportions.
In this dataset the classes were chosen by an expert on the basis of the given attributes
(see below) and it is hoped to replace the expert by an algorithm rule in the future. All
attribute values are numeric. The dataset providers supplied the performance figures for
algorithms which have been applied to data drawn from the same source. Note that the
figures given in Table 9.1 were achieved using the original dataset with equal numbers of
examples of both classes.
The best results (in terms of error rate) were achieved by SMART, DIPOL92 and the
tree algorithms C4.5 and Cal5. SMART is very time consuming to run: however, with
credit type datasets small improvements in accuracy can save vast amounts of money so
this has to be considered if sacrificing accuracy for time. k-NN did badly due to irrelevant
attributes; with a variable selection procedure, it obtained an error rate of 3.1%. CASTLE,
Kohonen, ITrule and Quadisc perform poorly (the result for Quadisc equalling the default
rule). CASTLE uses only attribute 7 to generate the rule, concluding that this is the only
relevant attribute for the classication. Kohonen works best for datasets with equal class
distributions which is not the case for the dataset as preprocessed here. At the cost of
significantly increasing the CPU time, the performance might be improved by using a
larger Kohonen net.
The best result for the Decision Tree algorithms was obtained by C4.5 which used the
smallest tree with 62 nodes. Cal5 used 125 nodes and achieved a similar error rate; NewID
and AC2 used 448 and 415 nodes, respectively, which suggests that they over-trained on
this dataset.
9.2.2 Australian credit

Table 9.3: Results for the Australian credit dataset (2 classes, 14 attributes, 690 observations, 10-fold cross-validation).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim        366        31.8         6.7    0.139    0.141       3
Quadisc        353        30.5         7.2    0.185    0.207      21
Logdisc        329        21.0        18.0    0.125    0.141       3
SMART          762       246.0         0.2    0.090    0.158      13
ALLOC80        102       876.9           *    0.194    0.201      19
k-NN           758         3.0         7.0    0.000    0.181      15
CASTLE          62        46.8         5.3    0.144    0.148       8
CART           149        68.4         1.6    0.145    0.145       6
IndCART        668        34.2        32.7    0.081    0.152      10
NewID           28        15.2         0.3    0.000    0.181      15
AC2            404       400.0        14.0    0.000    0.181      15
Baytree        524         7.2         0.4    0.000    0.171      14
NaiveBay       420         3.7         0.4    0.136    0.151       9
CN2            215        42.0         3.0    0.001    0.204      20
C4.5            62         6.0         1.0    0.099    0.155      12
ITrule         124       173.6         0.6    0.162    0.137       2
Cal5           128        24.0         2.2    0.132    0.131       1
Kohonen         FD          FD          FD       FD       FD
DIPOL92         52        55.6         2.0    0.139    0.141       3
Backprop       147      1369.8         0.0    0.087    0.154      11
RBF            231        12.2         2.4    0.107    0.145       6
LVQ             81       260.8         7.2    0.065    0.197      18
Default          *           *           *    0.440    0.440      22
The aim is to devise a rule for assessing applications for credit cards. The dataset has
been studied before (Quinlan, 1987a, 1993). Interpretation of the results is made difficult
because the attributes and classes have been coded to preserve confidentiality; however,
examples of likely attributes are given for another credit data set in Section 9.4.3. For
our purposes, we replaced the missing values by the overall medians or means (5% of the
examples had some missing information).
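That kind of imputation can be sketched as follows (illustrative Python on hypothetical columns; the treatment of categorical values by the mode is an assumption added for completeness, since the text above mentions only medians or means):

import numpy as np

def impute_median(column):
    """Replace NaNs in a numeric attribute by the median of the observed values."""
    col = np.array(column, dtype=float)
    med = np.nanmedian(col)
    col[np.isnan(col)] = med
    return col

def impute_mode(column, missing="?"):
    """Replace missing categorical values by the most frequent observed category."""
    observed = [v for v in column if v != missing]
    values, counts = np.unique(observed, return_counts=True)
    mode = values[np.argmax(counts)]
    return [mode if v == missing else v for v in column]

print(impute_median([1.0, np.nan, 3.0, 5.0]))   # [1. 3. 3. 5.]
print(impute_mode(["a", "?", "b", "a"]))        # ['a', 'a', 'b', 'a']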
Due to the confidentiality of the classes, it was not possible to assess the relative costs
of errors nor to assess the prior odds of good to bad customers. We decided therefore to
use the default cost matrix. The use of the default cost matrix is not realistic. In practice it
is generally found that it is very difficult to beat the simple rule: Give credit if (and only
if) the applicant has a bank account. We do not know, with this dataset, what success this
default rule would have. The results were obtained by 10-fold cross validation.
The best result here was obtained by Cal5, which used only an average of less than 6
nodes in its decision tree. By contrast, AC2 and NewID used around 70 nodes and achieved
higher error rates, which suggests that pruning is necessary.
9.3 IMAGE DATASETS
9.3.1 Handwritten digits
This dataset consists of 18 000 examples of the digits 0 to 9 gathered from postcodes on
letters in Germany. The handwritten examples were digitised onto images with 16 × 16
pixels and 256 grey levels. They were read by one of the automatic address readers built
by a German company. These were initially scaled for height and width but not thinned
or rotated in a standard manner. An example of each digit is given in Figure 9.1.
The dataset was divided into a training set with 900 examples per digit and a test set
with 900 examples per digit. Due to lack of memory, very few algorithms could cope with
the full dataset. In order to get comparable results we used a version with 16 attributes
prepared by averaging over 4 × 4 neighbourhoods in the original images.
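The 4 × 4 reduction amounts to block-averaging each 16 × 16 image; a generic sketch (not the preprocessing code actually used for the project) is:

import numpy as np

def block_average(image, block=4):
    """Average non-overlapping block x block neighbourhoods of a square image.
    A 16 x 16 digit image becomes a 4 x 4 grid, i.e. 16 attributes."""
    h, w = image.shape
    return image.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

digit = np.random.default_rng(0).integers(0, 256, size=(16, 16))
print(block_average(digit).shape)    # (4, 4) -> 16 attributes per example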
For the k-NN classifier this averaging resulted in an increase of the error rate from 2.0%
to 4.7%, whereas for Discrim the error rate increased from 7.4% to 11.4%. Backprop could
also cope with all 256 attributes but when presented with all 9000 examples in the training
set took an excessively long time to train (over two CPU days).
The fact that k-NN and LVQ do quite well is probably explained by the fact that they
make the fewest restrictive assumptions about the data. Discriminant analysis, on the other
hand, assumes that the data follows a multi-variate normal distribution with the attributes
obeying a common covariance matrix and can model only linear aspects of the data. The
fact that Quadisc, using a reduced version of the dataset, does better than Discrim, using
either the full version or reduced version, shows the advantage of being able to model
non-linearity. CASTLE approximates the data by a polytree and this assumption is too
restrictive in this case. Naive Bayes assumes the attributes are conditionally independent.
That Naive Bayes does so badly is explained by the fact that the attributes are clearly not
conditionally independent, since neighbouring pixels are likely to have similar grey levels.
It is surprising that Cascade does better than Backprop, and this may be attributed to the
Table 9.4: Results for the digits dataset (10 classes, 16 attributes, (train, test) = (9000, 9000) observations).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim        252        65.3        30.2    0.111    0.114      12
Quadisc        324       194.4       152.0    0.052    0.054       2
Logdisc       1369      5110.2       138.2    0.079    0.086      10
SMART          337     19490.6        33.0    0.096    0.104      11
ALLOC80        393      1624.0      7041.0    0.066    0.068       5
k-NN           497      2230.7      2039.2    0.016    0.047       1
CASTLE         116       252.6      4096.8    0.180    0.170      20
CART           240       251.6        40.8    0.180    0.160      19
IndCART        884      3614.5        50.6    0.011    0.154      17
NewID          532       500.7       112.5    0.080    0.150      16
AC2            770     10596.0     22415.0        *    0.155      18
Baytree        186      1117.0        59.8    0.015    0.140      14
NaiveBay       129        42.7        61.8    0.220    0.233      23
CN2           1926      3325.9       119.9    0.000    0.134      13
C4.5           248       778.1        60.6    0.041    0.149      15
ITrule         504      1800.1        9000        *    0.222      22
Cal5          1159       571.0        55.2    0.118    0.220      21
Kohonen        646     67176.0      2075.1    0.051    0.075       7
DIPOL92        110       191.2        43.6    0.065    0.072       6
Backprop       884     28910.0       110.0    0.072    0.080       8
RBF            268      1400.0       250.0    0.080    0.083       9
LVQ            249      1342.6       123.0    0.040    0.061       3
Cascade       2442     19171.0         1.0    0.064    0.065       4
Default          *           *           *    0.900    0.900      24
quadratic terms. Both methods achieved about 2% error rates on the test set (2.24% for
the linear/quadratic classifier and 1.91% errors for the MLP).
9.3.2 Karhunen-Loeve digits (KL)
Table 9.5: Results for the KL digits dataset (10 classes, 40 attributes, (train, test) = (9000, 9000) observations).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim        306        87.1        53.9    0.070    0.075      10
Quadisc       1467      1990.2      1647.8    0.016    0.025       3
Logdisc       1874     31918.3       194.4    0.032    0.051       7
SMART          517    174965.8        57.7    0.043    0.057       9
ALLOC80        500     23239.9     23279.3    0.000    0.024       2
k-NN           500         0.0      6706.4    0.000    0.020       1
CASTLE         779      4535.4     56052.7    0.126    0.135      12
CART            FD          FD          FD       FD       FD
IndCART        341      3508.0        46.9    0.003    0.170      16
NewID         1462       779.0       109.0    0.000    0.162      13
AC2           1444     15155.0       937.0    0.000    0.168      15
Baytree        289      1100.4        53.0    0.006    0.163      14
NaiveBay      1453        64.9        76.0    0.205    0.223      20
CN2            732      2902.1        99.7    0.036    0.180      17
C4.5           310      1437.0        35.5    0.050    0.180      17
ITrule        1821           *      8175.0        *    0.216      19
Cal5          1739      3053.4        64.3    0.128    0.270      21
Kohonen         FD          FD          FD       FD       FD
DIPOL92        221       462.8        80.0    0.030    0.039       5
Backprop      1288    129600.0         4.0    0.041    0.049       6
RBF            268      1700.0       580.0    0.048    0.055       8
LVQ            368      1692.1       158.1    0.011    0.026       4
Cascade       2540     10728.0         1.0    0.063    0.075      10
Default          *           *           *    0.900    0.900      22
An alternative data reduction technique (to the 4 × 4 averaging above) was carried out
using the first 40 principal components. It is interesting that, with the exception of Cascade
correlation, the order of performance of the algorithms is virtually unchanged (see Table
9.5) and that the error rates are now very similar to those obtained (where available) using
the original 16 × 16 pixels.
The results for the digits dataset and the KL digits dataset are very similar so are treated
together. Most algorithms perform a few percent better on the KL digits dataset. The KL
digits dataset is the closest to being normal. This could be predicted beforehand, as it
is a linear transformation of the attributes that, by the Central Limit Theorem, would be
closer to normal than the original. Because there are very many attributes in each linear
combination, the KL digits dataset is very close to normal (skewness = 0.1802, kurtosis =
2.9200) as against the exact normal values of (skewness = 0, kurtosis = 3.0).
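The Karhunen-Loeve reduction referred to here is a projection onto the leading principal components of the training images; a compact sketch (synthetic data standing in for the 256 pixel attributes, not the project's own code) is:

import numpy as np

def kl_transform(X, n_components=40):
    """Project rows of X onto the first n_components principal components
    (the Karhunen-Loeve transform), estimated from X itself."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal directions via SVD of the centred data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, mean, components

X = np.random.default_rng(1).normal(size=(1000, 256))   # stand-in for 16 x 16 images
Z, mean, comps = kl_transform(X, 40)
print(Z.shape)    # (1000, 40) -> the 40 KL attributes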
In both Digits datasets k-NN comes top, and RBF and ALLOC80 also do fairly
well (in fact ALLOC80 failed and an equivalent kernel method, with smoothing parameter
asymptotically chosen, was used). These three algorithms are all closely related. Kohonen
also does well in the Digits dataset (but for some reason failed on KL digits); Kohonen has
some similarities with k-NN type algorithms. The success of such algorithms suggests that
the attributes are equally scaled and equally important. Quadisc also does well, coming
second in both datasets. The KL version of digits appears to be well suited to Quadisc:
there is a substantial difference in variances (SD ratio = 1.9657), while at the same time
the distributions are not too far from multivariate normality with kurtosis of order 3.
Backprop and LVQ do quite well on the digits dataset, bearing out the oft-repeated
claim in the neural net literature that neural networks are very well suited to
pattern recognition problems (e.g. Hecht-Nelson, 1989).
The Decision Tree algorithms do not do very well on these digits datasets. The tree
sizes are typically in the region of 700-1000 nodes.
9.3.3 Vehicle silhouettes
Fig. 9.2: Vehicle silhouettes prior to high level feature extraction. These are clockwise from top left:
Double decker bus, Opel Manta 400, Saab 9000 and Chevrolet van.
Table 9.6: Results for the vehicle dataset (4 classes, 18 attributes, 846 observations, 9-fold cross-validation).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim        231        16.3         3.0    0.202    0.216       6
Quadisc        593       250.9        28.6    0.085    0.150       1
Logdisc        685       757.9         8.3    0.167    0.192       4
SMART          105      2502.5         0.7    0.062    0.217       7
ALLOC80        227        30.0        10.0    0.000    0.173       3
k-NN           104       163.8        22.7    0.000    0.275      11
CASTLE          80        13.1         1.8    0.545    0.505      22
CART           158        24.4         0.8    0.284    0.235       8
IndCART        296       113.3         0.4    0.047    0.298      16
NewID            *        18.0         1.0    0.030    0.298      16
AC2            776      3135.0       121.0        *    0.296      15
Baytree         71        27.1         0.5    0.079    0.271      10
NaiveBay        56         5.4         0.6    0.519    0.558      23
CN2              *       100.0         1.0    0.018    0.314      19
C4.5             *       174.0         2.0    0.065    0.266       9
ITrule         307       985.3           *        *    0.324      20
Cal5           171        23.3         0.5    0.068    0.279      12
Kohonen       1441      5962.0        50.4    0.115    0.340      21
DIPOL92         64       150.6         8.2    0.079    0.151       2
Backprop       186     14411.2         3.7    0.168    0.207       5
RBF            716      1735.9        11.8    0.098    0.307      18
LVQ             77       229.1         2.8    0.171    0.287      14
Cascade        238       289.0         1.0    0.263    0.280      13
Default          *           *           *    0.750    0.750      24
One would expect this dataset to be non-linear since the attributes depend on the angle
at which the vehicle is viewed. Therefore they are likely to have a sinusoidal dependence,
although this dependence was masked by issuing the dataset in permuted order. Quadisc
does very well, and this is due to the highly non-linear behaviour of this data. One would
have expected the Backprop algorithm to perform well on this dataset since, it is claimed,
Backprop can successfully model the non-linear aspects of a dataset. However, Backprop
is not straightforward to run. Unlike discriminant analysis, which requires no choice of
free parameters, Backprop requires essentially two free parameters - the number of hidden
nodes and the training time. Neither of these is straightforward to decide. This figure
for Backprop was obtained using 5 hidden nodes and a training time of four hours
in each of the nine cycles of cross-validation. However, one can say
that the sheer effort and time taken to optimise the performance for Backprop is a major
disadvantage compared to Quadisc which can achieve a much better result with a lot less
effort. DIPOL92 does nearly as well as Quadisc. As compared with Backprop it performs
better and is quicker to run. It determines the number of nodes (hyperplanes, neurons)
and the initial weights by a reasonable procedure at the beginning and doesn't use an
additional layer of hidden units but instead a symbolic level. The poor performance of
CASTLE is explained by the fact that the attributes are highly correlated. In consequence
the relationship between class and attributes is not built strongly into the polytree. The same
explanation accounts for the poor performance of Naive Bayes. k-NN, which performed
so well on the raw digits dataset, does not do so well here. This is probably because in
the case of the digits the attributes were all commensurate and carried equal weight. In the
vehicle dataset the attributes all have different meanings and it is not clear how to build an
appropriate distance measure.
The attributes for the vehicle dataset, unlike the other image analysis, were generated
using image analysis tools and were not simply based on brightness levels. This suggests
that the attributes are less likely to be equally scaled and equally important. This is
confirmed by the lower performances of k-NN, LVQ and Radial Basis functions, which
treat all attributes equally and have a built-in mechanism for normalising, which is often
not optimal. ALLOC80 did not perform well here, and so an alternative kernel method was
used which allowed for correlations between the attributes, and this appeared to be more
robust than the other three algorithms although it still fails to learn the difference between
the cars. The original Siebert (1987) paper showed machine learning performing better
than k-NN, but there is not much support for this in our results. The tree sizes for AC2 and
Cal5 were 116 and 156 nodes, respectively.
The high value of fract2 = 0.8189 (see Table 9.30) might indicate that linear discrimination could be based on just two discriminants. This may relate to the fact that the two cars
are not easily distinguishable, so might be treated as one (reducing dimensionality of the
mean vectors to 3D). However, although the fraction of discriminating power for the third
discriminant is low (1 - 0.8189), it is still statistically significant, so cannot be discarded
without a small loss of discrimination.
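The fract measures quoted here and elsewhere in this chapter can be written in terms of the eigenvalues of the canonical discriminant analysis; this is the usual definition and is consistent with the way fract2 and fract3 are used in the text:

\[
\mathrm{fract}_k \;=\; \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_d},
\qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0 ,
\]

where $\lambda_i$ is the discriminating power (eigenvalue) associated with the $i$th canonical discriminant and $d$ is the total number of discriminants.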
9.3.4 Letter recognition
The dataset was constructed by David J. Slate, Odesta Corporation, Evanston, IL 60201.
The objective here is to classify each of a large number of black and white rectangular
pixel displays as one of the 26 capital letters of the English alphabet. (One-shot train and
test was used for the classication.) The character images produced were based on 20
different fonts and each letter within these fonts was randomly distorted to produce a file
of 20 000 unique images. For each image, 16 numerical attributes were calculated using
edge counts and measures of statistical moments which were scaled and discretised into a
range of integer values from 0 to 15.
Perfect classication performance is unlikely to be possible with this dataset. One of
the fonts used, Gothic Roman, appears very different from the others.
Table 9.7: Results for the letters dataset (26 classes, 16 attributes, (train, test) = (15 000, 5000) observations).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim         78       325.6        84.0    0.297    0.302      18
Quadisc         80      3736.2      1222.7    0.101    0.113       4
Logdisc        316      5061.6        38.7    0.234    0.234      12
SMART          881    400919.0       184.0    0.287    0.295      17
ALLOC80        758     39574.7           *    0.065    0.064       1
k-NN           200        14.8      2135.4    0.000    0.068       2
CASTLE        1577      9455.3      2933.4    0.237    0.245      13
CART            FD          FD          FD       FD       FD
IndCART       3600      1098.2      1020.2    0.010    0.130       8
NewID          376      1056.0         2.0    0.000    0.128       7
AC2           2033      2529.0        92.0    0.000    0.245      13
Baytree       2516       275.5         7.1    0.015    0.124       6
NaiveBay      1464        74.6        17.9    0.516    0.529      20
CN2              *     40458.3        52.2    0.021    0.115       5
C4.5          1042       309.0       292.0    0.042    0.132       9
ITrule         593     22325.4        69.1    0.585    0.594      21
Cal5          1554      1033.4         8.2    0.158    0.253      16
Kohonen       1204           *           *    0.218    0.252      15
DIPOL92        189      1303.4        79.5    0.167    0.176      10
Backprop       154    277445.0        22.0    0.323    0.327      19
RBF            418           *           *    0.220    0.233      11
LVQ            377      1487.4        47.8    0.057    0.079       3
Default          *           *           *    0.955    0.960      22
Quadisc is the best of the classical statistical algorithms on this dataset. This is perhaps
not surprising since the measures data gives some support to the assumptions underlying
the method. Discrim does not perform well although the logistic version is a significant
improvement. SMART is used here with a 22 term model and its poor performance
is surprising. A number of the attributes are nonlinear combinations of some others
and SMART might have been expected to model this well. ALLOC80 achieves the best
performance of all with k-NN close behind. In this dataset all the attributes are prescaled
and all appear to be important so good performance from k-NN is to be expected. CASTLE
constructs a polytree with only one attribute contributing to the classication which is too
restrictive with this dataset. Naive Bayes assumes conditional independence and this is
certainly not satisfied for a number of the attributes. NewID and AC2 were only trained
on 3000 examples drawn from the full training set and that in part explains their rather
uninspiring performance. NewID builds a huge tree containing over 1760 nodes while the
AC2 tree is about half the size. This difference probably explains some of the difference in
their respective results. Cal5 and C4.5 also build complex trees while CN2 generates 448
rules in order to classify the training set. ITrule is the poorest algorithm on this dataset.
Generally we would not expect ITrule to perform well on datasets where many of the
attributes contributed to the classification as it is severely constrained in the complexity of
the rules it can construct. Of the neural network algorithms, Kohonen and LVQ would be
expected to perform well for the same reasons as k-NN. Seen in that light, the Kohonen
result is a little disappointing.
In a previous study Frey & Slate (1991) investigated the use of an adaptive classifier
system and achieved a best error rate of just under 20%.
9.3.5 Chromosomes (Chrom)
Table 9.8: Results for the chromosome dataset (24 classes, 16 attributes, (train, test) = (20 000, 20 000) observations).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim       1586       830.0       357.0    0.073    0.107       3
Quadisc       1809      1986.3      1607.0    0.046    0.084       1
Logdisc       1925     20392.8       291.4    0.079    0.131       8
SMART         1164    307515.4        92.9    0.082    0.128       7
ALLOC80       1325    184435.0           *    0.192    0.253      18
k-NN          1097        20.1     14140.6    0.000    0.123       5
CASTLE         279       230.2        96.2    0.129    0.178      15
CART            FD          FD          FD       FD       FD
IndCART       3768      2860.3      2763.8    0.007    0.173      11
NewID         1283       552.0        17.0    0.000    0.176      14
AC2           1444      1998.0       138.0    0.000    0.234      16
Baytree       2840      1369.5        29.7    0.034    0.164      10
NaiveBay      1812       107.8        61.0    0.260    0.324      19
CN2           1415      9192.6       131.9    0.010    0.150       9
C4.5           589      1055.3           *    0.038    0.175      13
ITrule         637     34348.0        30.0    0.681    0.697      20
Cal5          1071       564.5        31.5    0.142    0.244      17
Kohonen       1605           *           *    0.109    0.174      12
DIPOL92        213       961.8       258.2    0.049    0.091       2
Backprop        FD          FD          FD       FD       FD
RBF            471           *           *    0.087    0.129       6
LVQ            373      1065.5           *    0.067    0.121       4
Default          *           *           *    0.956    0.956      21
This data was obtained via the MRC Human Genetics Unit, Edinburgh from the routine
amniotic 2668 cell data set (courtesy C. Lundsteen, Righospitalet, Copenhagen). In our
trials we used only 16 features (and 40 000 examples) which are a subset of a larger
database which has 30 features and nearly 80 000 examples. The subset was selected to
reduce the scale of the problem, and selecting the features defined as level 1 (measured
directly from the chromosome image) and level 2 (measures requiring the axis, e.g. length,
to be specified). We omitted observations with an unknown class as well as features
with level 3 (requiring both axis and profile and knowledge of the chromosome polarity)
and level 4 (requiring the axis and both the polarity and the centromere location).
Classification was done using one-shot train-and-test.
The result for ALLOC80 is very poor, and the reason for this is not clear. An alternative
kernel classifier (using a Cauchy kernel, to avoid numerical difficulties) gave an error rate
of 10.67% which is much better. Although quadratic discriminants do best here, there
is reason to believe that its error rate is perhaps not optimal as there is clear evidence of
non-normality in the distribution of the attributes.
The best of the Decision Tree results is obtained by CN2 which has 301 rules. C4.5 and AC2
have 856 and 626 terminal nodes, respectively, and yet obtain very different error rates.
By contrast NewID has 2967 terminal nodes, but does about as well as C4.5.
Further details of this dataset can be found in Piper & Granum (1989) who have done
extensive experiments on selection and measurement of variables. For the dataset which
resembled the one above most closely, they achieved an error rate of 9.2%.
9.3.6 Satellite image
The original Landsat data for this database was generated from data purchased from NASA
by the Australian Centre for Remote Sensing, and used for research at the University of
New South Wales. The sample database was generated taking a small section (82 rows and
100 columns) from the original data. The classication for each pixel was performed on
the basis of an actual site visit by Ms. Karen Hall, when working for Professor John A.
Richards, at the Centre for Remote Sensing. The database is a (tiny) sub-area of a scene,
consisting of 82 × 100 pixels, each pixel covering an area on the ground of approximately
80 × 80 metres. The information given for each pixel consists of the class value and the
intensities in four spectral bands (from the green, red, and infra-red regions of the spectrum).
The original data are presented graphically in Figure 9.3. The first four plots (top row
and bottom left) show the intensities in four spectral bands: Spectral bands 1 and 2 are
in the green and red regions of the visible spectrum, while spectral bands 3 and 4 are in
the infra-red (darkest shadings represent greatest intensity). The middle bottom diagram
shows the land use, with shadings representing the seven original classes in the order: red
soil, cotton crop, vegetation stubble, mixture (all types present), grey soil, damp grey soil
and very damp grey soil, with red as lightest and very damp grey as darkest shading. Also
shown (bottom right) are the classes as predicted by linear discriminants. Note that the
most accurate predictions are for cotton crop (rectangular region bottom left of picture),
and that the predicted boundary damp-very damp grey soil (L-shape top left of picture) is
not well positioned.
So that information from the neighbourhood of a pixel might contribute to the classification of that pixel, the spectra of the eight neighbours of a pixel were included as attributes
together with the four spectra of that pixel. Each line of data corresponds to a 3 × 3 square
neighbourhood of pixels completely contained within the 82 × 100 sub-area. Thus each
line contains the four spectral bands of each of the 9 pixels in the 3 × 3 neighbourhood and
the class of the central pixel which was one of: red soil, cotton crop, grey soil, damp grey
soil, soil with vegetation stubble, very damp grey soil. The mixed-pixels, of which there
were 8.6%, were removed for our purposes, so that there are only six classes in this dataset.
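Constructing the 36 attributes (the four spectral values for each of the 9 pixels in a 3 × 3 window) can be sketched as below; the array shapes are assumptions for illustration and this is not the original data-preparation program:

import numpy as np

def neighbourhood_features(bands):
    """bands: array of shape (n_bands, rows, cols) of spectral intensities.
    Returns one row of n_bands * 9 attributes per interior pixel,
    i.e. the 3 x 3 neighbourhood spectra around that pixel."""
    n_bands, rows, cols = bands.shape
    examples = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = bands[:, r - 1:r + 2, c - 1:c + 2]   # (n_bands, 3, 3)
            examples.append(window.reshape(-1))
    return np.array(examples)

bands = np.random.default_rng(2).integers(0, 256, size=(4, 82, 100))
X = neighbourhood_features(bands)
print(X.shape)   # (80 * 98, 36)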
The examples were randomised and certain lines were deleted so that simple reconstruction of the original image was not possible. The data were divided into a train set and
Fig. 9.3: Satellite image dataset. Spectral band intensities as seen from a satellite for a small (8.2 × 6.6 km) region of Australia. Also given are the actual land use as determined by on-site visit and the estimated classes as given by linear discriminants. (Figure panels: spectral bands 1 to 4, land use, predicted classes.)
a test set with 4435 examples in the train set and 2000 in the test set and the error rates are
given in Table 9.9.
In the satellite image dataset k-NN performs best. Not surprisingly, radial basis functions, LVQ and ALLOC80 also do fairly well as these three algorithms are closely related.
[In fact, ALLOC80 failed on this dataset, so an equivalent method, using an asymptotically
chosen bandwidth, was used.] Their success suggests that all the attributes are equally
scaled and equally important. There appears to be little to choose between any of the other
algorithms, except that Naive Bayes does badly (and its close relative CASTLE also does
relatively badly).
The Decision Tree algorithms perform at about the same level, with CART giving the
best result using 66 nodes. Cal5 and AC2 used trees with 156 and 116 nodes, respectively,
which suggests more pruning is desired for these algorithms.
This dataset has the highest correlation between attributes (corr.abs = 0.5977). This may
partly explain the failure of Naive Bayes (assumes attributes are conditionally independent),
and CASTLE (confused if several attributes contain equal amounts of information). Note
that only three canonical discriminants are sufficient to separate all six class means (fract3
Table 9.9: Results for the satellite image dataset (6 classes, 36 attributes, (train, test) = (4435, 2000) observations).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim        254        67.8        11.9    0.149    0.171      19
Quadisc        364       157.0        52.9    0.106    0.155      14
Logdisc       1205      4414.1        41.2    0.119    0.163      17
SMART          244     27376.2        10.8    0.123    0.159      16
ALLOC80        244     63840.2     28756.5    0.036    0.132       5
k-NN           180      2104.9       944.1    0.089    0.094       1
CASTLE           *        75.0        80.0    0.186    0.194      21
CART           253       329.9        14.2    0.079    0.138       6
IndCART        819      2109.2         9.2    0.023    0.138       6
NewID         1800       226.0        53.0    0.067    0.150      10
AC2              *      8244.0     17403.0        *    0.157      15
Baytree        161       247.8        10.2    0.020    0.147       9
NaiveBay       133        75.1        16.5    0.308    0.287      22
CN2            682      1664.0        35.8    0.010    0.150      10
C4.5          1150       434.0         1.0    0.040    0.150      10
ITrule          FD          FD          FD       FD       FD
Cal5           412       764.0         7.2    0.125    0.151      13
Kohonen          *     12627.0       129.0    0.101    0.179      20
DIPOL92        293       764.3       110.7    0.051    0.111       3
Backprop       469     72494.5        52.6    0.112    0.139       8
RBF            195       564.2        74.1    0.111    0.121       4
LVQ            227      1273.2        44.2    0.048    0.105       2
Cascade       1210      7180.0         1.0    0.112    0.163      17
Default          *           *           *    0.758    0.769      23
= 0.9691). This may be interpreted as evidence of seriation, with the three classes grey
soil, damp grey soil and very damp grey soil forming a continuum. Equally, this result
can be interpreted as indicating that the original four attributes may be successfully reduced
to three with no loss of information. Here information should be interpreted as mean
square distance between classes, or equivalently, as the entropy of a normal distribution.
The examples were created using a 3 × 3 neighbourhood so it is no surprise that there
is a very large correlation amongst the 36 variables. The results from CASTLE suggest
that only three of the variables for the centre pixel are necessary to classify the observation.
However, other algorithms found a significant improvement when information from the
neighbouring pixels was used.
9.3.7 Image segmentation

Table 9.10: Results for the image segmentation dataset (7 classes, 11 attributes, 2310 observations, 10-fold cross-validation).

Algorithm    Max.      Time (sec.)            Error Rate        Rank
             Storage   Train        Test      Train    Test
Discrim        365        73.6         6.6    0.112    0.116      19
Quadisc        395        49.7        15.5    0.155    0.157      20
Logdisc        535       301.8         8.4    0.098    0.109      17
SMART          144     13883.9         0.5    0.039    0.052      11
ALLOC80        124     15274.3           *    0.033    0.030       1
k-NN           171         5.0        28.0    0.000    0.077      16
CASTLE         142       465.4        38.3    0.108    0.112      18
CART           175        79.0         2.3    0.005    0.040       6
IndCART        744      1410.5      1325.1    0.012    0.045       9
NewID            *       386.0         2.0    0.000    0.034       4
AC2           7830     18173.0       479.0    0.000    0.031       2
Baytree        676       677.3        26.9    0.000    0.033       3
NaiveBay       564       516.4        29.0    0.260    0.265      21
CN2            174       114.2         2.7    0.003    0.043       8
C4.5            57       142.0         1.3    0.013    0.040       6
ITrule         139       545.7        19.9    0.445    0.455      22
Cal5           373       247.1        13.7    0.042    0.062      13
Kohonen        233     11333.2         8.5    0.046    0.067      14
DIPOL92         91       503.0        25.0    0.021    0.039       5
Backprop       148     88467.2         0.4    0.028    0.054      12
RBF            381        65.0        11.0    0.047    0.069      15
LVQ            123       368.2         6.4    0.019    0.046      10
Default          *           *           *    0.760    0.760      23
Average error rates were obtained via 10-fold cross-validation, and are given in Table 9.10.
AC2 did very well here and used an average of 52 nodes in its decision trees. It is
interesting here that ALLOC80 does so much better than k-NN. The reason for this is that
ALLOC80 has a variable selection option which was initially run on the data, and only
5 of the original attributes were finally used. When 14 variables were used the error rate
increased to 21%. Indeed a similar attribute selection procedure increased the performance
of k-NN to a very similar error rate. This discrepancy raises the whole issue of preprocessing the data before algorithms are run, and the substantial difference this can make.
It is clear that there will still be a place for intelligent analysis alongside any black-box
techniques for quite some time!
9.3.8 Cut
This dataset was supplied by a StatLog partner for whom it is commercially confidential.
The dataset was constructed during an investigation into the problem of segmenting individual characters from joined written text. Figure 9.4 shows an example of the word Eins
(German for One). Each example consists of a number of measurements made on the text
relative to a potential cut point along with a decision on whether to cut the text at that
Fig. 9.4: The German word Eins with an indication of where it should be cut to separate the
individual letters.
point or not. As supplied, the dataset contained examples with 50 real valued attributes.
In an attempt to assess the performance of algorithms relative to the dimensionality of the
problem, a second dataset was constructed from the original using the best 20 attributes
selected by stepwise regression on the whole dataset. This was the only processing carried
out on this dataset. The original and reduced datasets were tested. In both cases training
sets of 11220 examples and test sets of 7480 were used in a single train-and-test procedure
to assess accuracy.
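A rough sketch of forward stepwise selection by least squares (a generic greedy variant; the exact stepwise regression procedure used to pick the 20 attributes is not described here) is:

import numpy as np

def forward_stepwise(X, y, n_select=20):
    """Greedy forward selection: repeatedly add the attribute that most reduces
    the residual sum of squares of a least-squares fit on the selected set."""
    n, p = X.shape
    selected = []
    for _ in range(min(n_select, p)):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))                     # stand-in for the 50 Cut attributes
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # hypothetical cut / no-cut labels
print(forward_stepwise(X, y, n_select=5))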
Although individual results differ between the datasets, the ranking of methods is
broadly the same and so we shall consider all the results together. The default rule in both
cases would give an error rate of around 6% but since Kohonen, the only unsupervised
method in the project, achieves an error rate of 5% for both datasets it seems reasonable to
choose this value as our performance threshold.
This is a dataset on which k-nearest neighbour might be expected to do well; all
attributes are continuous with little correlation, and this proves to be the case. Indeed,
with a variable selection option k-NN obtained an error rate of only 2.5%. Conversely,
the fact that k-NN does well indicates that many variables contribute to the classication.
ALLOC80 approaches k-NN performance by undersmoothing leading to overfitting on the
training set. While this may prove to be an effective strategy with large and representative
training sets, it is not recommended in general. Quadisc, CASTLE and Naive Bayes
perform poorly on both datasets because, in each case, assumptions underlying the method
do not match the data.
Quadisc assumes multivariate normality and unequal covariance matrices and neither
of these assumptions is supported by the data measures. CASTLE achieves default performance using only one variable, in line with the assumption implicit in the method that only
a small number of variables will determine the class. Naive Bayes assumes conditional
independence amongst the attributes and this is unlikely to hold for a dataset of this type.
Machine learning algorithms generally perform well, although with wide variation in tree sizes. Baytree and IndCART achieve low error rates at the expense of building trees containing more than 3000 nodes. C4.5 performs almost as well, though building a tree containing 159 terminal nodes. Cal5 produces a very parsimonious tree, containing only 26 nodes for the Cut20 dataset, which is very easy to understand.
Table 9.11: Comparative results for the Cut20 dataset (2 classes, 20 attributes, (train, test) = (11 220, 7480) observations).

Algorithm   Max. storage   Train time (s)   Test time (s)   Train error   Test error   Rank
Discrim           71            115.5            22.7          0.052        0.050       15
Quadisc           75            394.8           214.2          0.090        0.088       22
Logdisc         1547            587.0           101.2          0.046        0.046       13
SMART            743          21100.5            21.8          0.047        0.047       14
ALLOC80          302          32552.2               *          0.033        0.037        4
k-NN             190          54810.7          6052.0          0.031        0.036        2
CASTLE           175           1006.0           368.5          0.060        0.061       17
CART              FD               FD              FD             FD           FD       FD
IndCART         1884                *               *          0.002        0.040        6
NewID           1166           1445.0             3.0          0.000        0.039        5
AC2              915            917.0            48.0          0.000        0.063       19
Baytree         1676            145.3            25.9          0.002        0.034        1
NaiveBay        1352             83.6            27.6          0.074        0.077       20
CN2             9740           5390.0           470.0          0.000        0.042        8
C4.5            2436            293.0            28.0          0.010        0.036        2
ITrule           630          11011.0            50.9          0.083        0.082       21
Cal5             188            455.5            23.4          0.043        0.045       11
Kohonen         1046                *               *          0.046        0.050       15
DIPOL92          379            506.0            36.1          0.043        0.045       11
Backprop         144          88532.0             7.0          0.037        0.043        9
RBF              901           6041.0           400.0          0.042        0.044       10
LVQ              291           1379.0            86.9          0.029        0.041        7
Default            *                *               *          0.059        0.061       17
AC2 and NewID build trees with 38 and 339 nodes, respectively. ITrule, like CASTLE, cannot deal with continuous attributes directly and also discretises such variables before processing. The major reason for its poor performance, though, is that tests were restricted to conjunctions of up to two attributes. CN2, which tested conjunctions of up to 5 attributes, achieved a much better error rate. AC2 could not handle the full dataset and the results reported are for a 10% subsample.
It is interesting that almost all algorithms achieve a better result on Cut50 than Cut20. This suggests that the attributes excluded from the reduced dataset contain significant discriminatory power. Cal5 achieves its better performance by building a tree five times larger than that for Cut20. NewID and AC2 both build significantly smaller trees (196 and 28 nodes) and classify more accurately with them. C4.5 uses a tree with 142 nodes with a slight improvement in accuracy. Similarly, CN2 discovers a smaller set of rules for Cut50 which deliver improved performance. This general improvement in performance underlines the observation that what is "best" or "optimal" in linear regression terms may not be best for other algorithms.
Table 9.12: Results for the Cut50 dataset (2 classes, 50 attributes, (train, test) = (11 220,
7480) observations).
Algorithm
Discrim
Quadisc
Logdisc
SMART
ALLOC80
k-NN
CASTLE
CART
IndCART
NewID
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Default
Max.
Storage
73
77
1579
779
574
356
765
FD
3172
1166
1812
2964
2680
*
*
642
508
*
884
146
649
476
*
Time (sec.)
Train
Test
449.2
52.5
2230.7 1244.2
1990.4
227.0
63182.0
50.4
32552.2
*
62553.6 6924.0
7777.6 1094.8
FD
FD
2301.4 2265.4
1565.0
2.0
1850.0
47.0
324.0
65.4
219.4
69.9
28600.0
501.0
711.0
31.0
61287.5
*
1131.9
58.7
*
*
1242.5
96.9
18448.0
12.0
6393.0 1024.0
2991.2
205.0
*
*
Error Rate
Train
Test
0.052 0.050
0.092 0.097
0.038 0.037
0.035 0.039
0.030 0.034
0.025 0.027
0.060 0.061
FD
FD
0.004 0.037
0.000 0.038
0.000 0.054
0.001 0.035
0.106 0.112
0.000 0.030
0.008 0.035
*
0.084
0.030 0.037
0.046 0.050
0.031 0.036
0.041 0.041
0.036 0.038
0.024 0.040
0.059 0.061
Rank
15
21
7
12
3
1
18
7
10
18
4
22
2
4
20
7
15
6
14
10
13
17
The data set is a series of 1000 patients with severe head injury collected prospectively by
neurosurgeons between 1968 and 1976. This head injury study was initiated in the Institute
Table 9.13: Misclassification costs for the head injury dataset. The rows give the true class and the columns the predicted class (d/v = dead/vegetative, sev = severe disability, m/g = moderate or good recovery).

         d/v   sev   m/g
d/v        0    10   750
sev       10     0   100
m/g       75    90     0
The dataset had a very large number of missing values for patients (about 40%) and these were replaced with the median value for the appropriate class. This makes our version of the data considerably easier for classification than the original data, and has the merit that all procedures can be applied to the same dataset, but has the disadvantage that the resulting rules are unrealistic, in that this replacement strategy is not possible for real data of unknown class. Nine-fold cross-validation was used to estimate the average misclassification cost. The predictive variables are age and various indicators of brain damage, as reflected in brain dysfunction. These are listed below. Indicators of brain dysfunction can vary considerably during the few days after injury. Measurements were therefore taken frequently, and for each indicant the best and worst states during each of a number of successive time periods were recorded. The data supplied were based on the best state during the first 24 hours after the onset of coma. The EMV score in the table is known in the medical literature as the Glasgow Coma Scale.
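To make the evaluation concrete, the sketch below estimates an average misclassification cost by cross-validation: each test example contributes the cost-matrix entry indexed by its true and predicted class, and the fold averages are then averaged. The classifier, the synthetic data and the fold settings are illustrative assumptions; only the cost matrix is the head injury one shown above.

```python
# Average misclassification cost estimated by k-fold cross-validation.
# Data and classifier are placeholders, not the StatLog implementations.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def average_cost(model, X, y, cost, n_splits=9, seed=0):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_costs = []
    for train, test in cv.split(X, y):
        pred = model.fit(X[train], y[train]).predict(X[test])
        fold_costs.append(cost[y[test], pred].mean())   # cost[true, predicted]
    return float(np.mean(fold_costs))

cost = np.array([[0, 10, 750],   # rows: true class d/v, sev, m/g
                 [10, 0, 100],
                 [75, 90, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 6))                 # stand-in for the 6 attributes
y = rng.integers(0, 3, size=900)              # stand-in class labels
print(average_cost(DecisionTreeClassifier(max_depth=3), X, y, cost))
```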
Table 9.14: Results for the head injury dataset (3 classes, 6 attributes, 900 observations, 9-fold cross-validation). Algorithms in italics have not incorporated costs.

Algorithm   Max. storage   Train time (s)   Test time (s)   Train cost   Test cost   Rank
Discrim          200            12.6             3.1           19.76       19.89       3
Quadisc          642            36.6            32.0           17.83       20.06       4
Logdisc         1981           736.4             7.3           16.60       17.96       1
SMART             81           572.2             3.5           13.59       21.81       8
ALLOC80          191             1.4            38.3           18.9        31.90      13
k-NN             144             9.0            11.2            9.20       35.30      15
CASTLE            82             2.6             2.0           18.87       20.87       6
CART             154            17.6             0.8           19.84       20.38       5
IndCART           88             5.5             0.4           25.76       25.52      11
NewID             38             9.0             3.0           18.91       53.64      20
AC2              400           624.0            28.0           17.88       56.87      21
Baytree           73             2.5             0.3           10.94       22.69       9
NaiveBay          52             2.9             0.3           23.68       23.95      10
CN2              149            24.3             3.0           14.36       53.55      19
C4.5             339             5.0             0.2           59.82       82.60      24
ITrule            97             6.5               *               *       37.61      16
Cal5              51             3.0             0.2           32.54       33.26      14
Kohonen           90          1772.0             3.0           35.6        70.70      23
DIPOL92           41            10.0             1.0           25.31       26.52      12
Backprop         518           312.5            31.9           18.23       21.53       7
RBF              150            17.4             5.1           53.37       63.10      22
LVQ               82           190.7             1.2           29.30       46.58      18
Cascade          271           181.0             1.0           15.25       19.46       2
Default            *               *               *           44.10       44.10      17
SMART and DIPOL92 are the only algorithms that as standard can utilise costs directly in the training phase (we used in our results a modified version of Backprop that could utilise costs, but this is very experimental). However, although these two algorithms do reasonably well, they are not the best. Logistic regression does very well, and so do Discrim and Quadisc.
CART, IndCART, Bayes Tree and Cal5 are the only decision trees that used a cost matrix here, and hence the others have performed worse than the Default rule. CART and Cal5 both had trees of around 5-7 nodes, whereas AC2 and NewID both had around 240 nodes. However, using error rate as a criterion we cannot judge whether these algorithms were under-pruning, since no cost matrix was used in the classifier. But, for interpretability, the smaller trees are preferred.
Titterington et al. (1981) compared several discrimination procedures on this data. Our dataset differs by replacing all missing values with the class median and so the results are not directly comparable.
9.4.2 Heart disease (Heart)
Table 9.15: Results for the heart disease dataset (2 classes, 13 attributes, 270 observations, 9-fold cross-validation). Algorithms in italics have not incorporated costs.

Algorithm   Max. storage   Train time (s)   Test time (s)   Train cost   Test cost   Rank
Discrim          223             7.7             1.8           0.315       0.393       2
Quadisc          322            18.2             9.2           0.274       0.422       5
Logdisc          494            79.9             4.2           0.271       0.396       3
SMART             88           350.0             0.1           0.264       0.478      10
ALLOC80           95            31.2             5.2           0.394       0.407       4
k-NN              88             0.0             1.0           0.000       0.478      10
CASTLE            93            20.0             3.4           0.374       0.441       6
CART             142             4.1             0.8           0.463       0.452       8
IndCART           65             8.4             0.1           0.261       0.630      18
NewID             21             9.0             3.0           0.000       0.844      24
AC2              209           243.0             7.0           0.000       0.744      20
Baytree           63             2.7             0.3           0.111       0.526      14
NaiveBay          50             1.5             1.0           0.351       0.374       1
CN2              125            19.2             4.7           0.206       0.767      21
C4.5              93            29.4             0.8           0.439       0.781      22
ITrule           102             5.1               *               *       0.515      13
Cal5              51             2.3             0.8           0.330       0.444       7
Kohonen           36           227.1             1.9           0.429       0.693      19
DIPOL92           53            18.0             0.3           0.429       0.507      12
Backprop         299           128.2            12.9           0.381       0.574      16
RBF              154            20.4             3.7           0.303       0.781      22
LVQ               54            76.6             1.0           0.140       0.600      17
Cascade          122            78.3             1.0           0.207       0.467       9
Default            *               *               *           0.560       0.560      15
This database comes from the Cleveland Clinic Foundation and was supplied by Robert
Detrano, M.D., Ph.D. of the V.A. Medical Center, Long Beach, CA. It is part of the
collection of databases at the University of California, Irvine collated by David Aha.
The purpose of the dataset is to predict the presence or absence of heart disease given the
results of various medical tests carried out on a patient. This database contains 13 attributes,
which have been extracted from a larger set of 75. The database originally contained 303
examples but 6 of these contained missing class values and so were discarded leaving 297.
27 of these were retained in case of dispute, leaving a final total of 270. There are two classes: presence and absence (of heart disease). This is a reduction of the number of classes
in the original dataset, in which there were four different degrees of heart disease. Table 9.16 gives the different costs of the possible misclassifications. Nine-fold cross-validation was used to estimate the average misclassification cost. Naive Bayes performed best on the heart dataset. This may reflect the careful selection of attributes by the doctors. Of the decision trees, CART and Cal5 performed the best. Cal5 tuned the pruning parameter and used an average of 8 nodes in its trees, whereas AC2 used 45 nodes. However, AC2 did not take the cost matrix into account, so the preferred pruning is still an open question.
This data has been studied in the literature before, but without taking any cost matrix into account, and so the results are not comparable with those obtained here.
Table 9.16: Misclassification costs for the heart disease dataset. The columns represent the predicted class, and the rows the true class.

          absent   present
absent       0        1
present      5        0
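A cost matrix like this can also be folded into the decision rule itself rather than only into the evaluation: given a model's posterior class probabilities, one predicts the class with the smallest expected cost. The sketch below is a hedged illustration with a made-up posterior; it is not how any particular StatLog algorithm incorporated costs.

```python
# Minimum-expected-cost prediction from posterior class probabilities,
# using the heart disease cost matrix shown above.  The posterior value
# is an assumption for illustration.
import numpy as np

cost = np.array([[0, 1],    # rows: true class (absent, present)
                 [5, 0]])   # columns: predicted class (absent, present)

def min_expected_cost(posterior, cost):
    # expected cost of predicting class j is sum_i p(i) * cost[i, j]
    return int(np.argmin(posterior @ cost))

p_present = 0.25                           # assumed posterior P(present | x)
posterior = np.array([1 - p_present, p_present])
print(min_expected_cost(posterior, cost))  # 1 ("present"): 0.75*1 < 0.25*5
```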
Results are given in Table 9.18. The providers of this dataset suggest the cost matrix of Table 9.17. It is interesting that only 10 algorithms do better than the Default. The results clearly demonstrate that some Decision Tree algorithms are at a disadvantage when costs are taken into account. That it is possible to include costs in decision trees is demonstrated by the good results of Cal5 and CART (Breiman et al., 1984). Cal5 achieved a good result with an average of only 2 nodes, which would lead to very transparent rules. Of those algorithms that did not include costs, C4.5 used a tree with 49 nodes (with an error rate of 27.3%), whereas AC2 and NewID used an average of over 300 nodes (with error rates of 29.4% and 32.8% respectively).
Table 9.18: Results for the German credit dataset (2 classes, 24 attributes, 1000 observations, 10-fold cross-validation). Algorithms in italics have not incorporated costs.
Algorithm
Discrim
Quadisc
Logdisc
SMART
ALLOC80
k-NN
CASTLE
CART
IndCART
NewID
AC
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Default
Maximum
Storage
556
534
391
935
103
286
93
95
668
118
771
79
460
320
82
69
167
152
53
148
215
97
*
Time (sec.)
Train
Test
50.1
7.3
53.6
8.2
56.0
6.7
6522.9
*
9123.3
*
2.4
9.0
109.9
9.5
114.0
1.1
337.5 248.0
12.8
15.2
9668.0 232.0
7.4
0.4
26.0
5.3
116.8
3.1
13.7
1.0
32.5
3.0
19.5
1.9
5897.2
5.3
77.8
5.0
2258.5
0.0
24.5
3.4
322.7
4.7
*
*
Average Costs
Train
Test
0.509 0.535
0.431 0.619
0.499 0.538
0.389 0.601
0.597 0.584
0.000 0.694
0.582 0.583
0.581 0.613
0.069 0.761
0.000 0.925
0.000 0.878
0.126 0.778
0.600 0.703
0.000 0.856
0.640 0.985
*
0.879
0.600 0.603
0.689 1.160
0.574 0.599
0.446 0.772
0.848 0.971
0.229 0.963
0.700 0.700
Rank
1
9
2
6
4
10
3
8
14
19
17
15
12
16
22
18
7
23
5
13
21
20
11
appears to be noise-free in the sense that arbitrarily small error rates are possible given sufficient data.
Fig. 9.5: Shuttle data: attributes 1 and 9 for the two classes Rad Flow and High only. The symbols + and - denote the states Rad Flow and High respectively. The 40 856 examples are classified correctly by the decision tree in the right diagram.
The data was divided into a train set and a test set with 43500 examples in the train
set and 14500 in the test set. A single train-and-test was used to calculate the accuracy.
With samples of this size, it should be possible to obtain an accuracy of 99 - 99.9%.
Approximately 80% of the data belong to class 1. At the other extreme, there are only 6
examples of class 6 in the learning set.
The shuttle dataset also departs widely from typical distribution assumptions. The
attributes are numerical and appear to exhibit multimodality (we do not have a good
statistical test to measure this). Some feeling for this dataset can be gained by looking at
Figure 9.5. It shows that a rectangular box (with sides parallel to the axes) may be drawn to
enclose all examples in the class High, although the lower boundary of this box (X9 less
than 3) is so close to examples of class Rad Flow that this particular boundary cannot
be clearly marked to the scale of Figure 9.5. In the whole dataset, the data seem to consist
of isolated islands or clusters of points, each of which is pure (belongs to only one class),
with one class comprising several such islands. However, neighbouring islands may be
very close and yet come from different populations. The boundaries of the islands seem to
be parallel with the coordinate axes. If this picture is correct (and the present data do not contradict it, as it is possible to classify the combined dataset with 100% accuracy using a decision tree), then it is of interest to ask which of our algorithms are guaranteed to arrive at the correct classification given an arbitrarily large learning dataset. In the following, we ignore practical matters such as training times, storage requirements etc., and concentrate on the limiting behaviour for an infinitely large training set.
Table 9.19: Results for the shuttle dataset, with error rates in % (7 classes, 9 attributes, (train, test) = (43 500, 14 500) observations).
Algorithm
Discrim
Quadisc
Logdisc
SMART
ALLOC80
k-NN
CASTLE
CART
IndCART
NewID
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Default
Max.
Storage
1957
1583
1481
636
636
636
77
176
329
1535
200
368
225
1432
3400
665
372
FD
674
144
249
650
*
Time (sec.)
Train
Test
507.8
102.3
708.6
176.6
6945.5
106.2
110009.8
93.2
55215.0 18333.0
32531.3 10482.0
461.3
149.7
79.0
2.3
1151.9
16.2
6180.0
*
2553.0
2271.0
240.0
16.8
1029.5
22.4
11160.0
*
13742.4
11.1
91969.7
*
313.4
10.3
FD
FD
2068.0
176.2
5174.0
21.0
*
*
2813.3
83.8
*
*
% Error
Train
Test
4.98
4.83
6.35
6.72
3.94
3.83
0.61
0.59
0.95
0.83
0.39
0.44
3.70
3.80
0.04
0.08
0.04
0.09
0.00
0.01
0.00
0.32
0.00
0.02
4.60
4.50
0.00
0.03
0.04
0.10
*
0.41
0.03
0.03
FD
FD
0.44
0.48
4.50
0.43
1.60
1.40
0.40
0.44
21.59 20.84
Rank
20
21
18
14
15
11
17
5
6
1
8
2
19
3
7
9
3
13
10
16
11
22
Procedures which might therefore be expected to find a perfect rule for this dataset would seem to be: k-NN, Backprop and ALLOC80. ALLOC80 failed here, and the result obtained by another kernel method (using a sphered transformation of the data) was far from perfect. RBF should also be capable of perfect accuracy, but some changes would be required in the particular implementation used in the project (to avoid singularities). Using a variable selection method (selecting 6 of the attributes) k-NN achieved an error rate of 0.055%.
Decision trees will also find the perfect rule provided that the pruning parameter is properly set, but may not do so under all circumstances, as it is occasionally necessary to override the splitting criterion (Breiman et al., 1984). Although a machine learning procedure may find a decision tree which classifies perfectly, it may not find the simplest representation. The tree of Figure 9.5, which was produced by the Splus procedure tree(), gets 100% accuracy with five terminal nodes, whereas it is easy to construct an equivalent tree with only three terminal nodes (note that the same structure occurs in both halves of the tree in Figure 9.5). It is possible to classify the full 58 000 examples with only 19 errors using a linear decision tree with nine terminal nodes. Since there are seven classes, this is a remarkably simple tree. This suggests that the data have been generated by a process that is governed by a linear decision tree, that is, a decision tree in which tests are applied sequentially, the result of each test being to allocate one section of the data to one class and to apply subsequent tests to the remaining section. As there are very few examples of class 6 in the whole 58 000 dataset, it would require enormous amounts of data to construct reliable classifiers for class 6. The actual trees produced by the algorithms are rather small, as expected: AC2 has 13 nodes, and both Cal5 and CART have 21 nodes.
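A minimal sketch (not the Splus tree() call quoted above) of how an axis-parallel decision tree recovers a rule of this kind, and of how its size can be inspected, is shown below. The data are synthetic stand-ins whose class is defined by box-shaped regions on two of nine attributes, loosely imitating the structure seen in Figure 9.5.

```python
# Fitting an axis-parallel tree to box-separable synthetic data and
# counting its terminal nodes.  Thresholds and shapes are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(5000, 9))
y = ((X[:, 0] < 52.5) & (X[:, 8] < 30)).astype(int)   # axis-parallel "island"

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print("training error:", 1 - tree.score(X, y))        # 0.0 for separable boxes
print("terminal nodes:", tree.get_n_leaves())         # only a handful of leaves
```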
9.5.2 Diabetes (Diab)
This dataset was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns
Hopkins University, Laurel, MD 20707 and was constructed by constrained selection from
a larger database held by the National Institute of Diabetes and Digestive and Kidney
Diseases. It is publicly available from the machine learning database at UCI (see Appendix
A). All patients represented in this dataset are females at least 21 years old of Pima Indian
heritage living near Phoenix, Arizona, USA.
The problem posed here is to predict whether a patient would test positive for diabetes according to World Health Organization criteria (i.e. if the patient's 2-hour post-load plasma glucose is at least 200 mg/dl) given a number of physiological measurements and medical test results. The attribute details are given below:
number of times pregnant
plasma glucose concentration in an oral glucose tolerance test
diastolic blood pressure ( mm/Hg )
triceps skin fold thickness ( mm )
2-hour serum insulin ( mu U/ml )
body mass index ( kg/m² )
diabetes pedigree function
age ( years )
This is a two-class problem, with class value 1 being interpreted as "tested positive for diabetes". There are 500 examples of class 1 and 268 of class 2. Twelve-fold cross-validation was used to estimate prediction accuracy.
The dataset is rather difficult to classify. The so-called "class" value is really a binarised form of another attribute, which is itself highly indicative of certain types of diabetes but does not have a one-to-one correspondence with the medical condition of being diabetic.
No algorithm performs exceptionally well, although ALLOC80 and k-NN seem to be the poorest. Automatic smoothing parameter selection in ALLOC80 can make poor choices for datasets with discrete valued attributes, and k-NN can have problems scaling such datasets. Overall, though, it seems reasonable to conclude that the attributes do not predict the class well. Cal5 uses only 8 nodes in its decision tree, whereas NewID, which performs less well, has 119 nodes. AC2 and C4.5 have 116 and 32 nodes, respectively, and CN2 generates 52 rules, although there is not very much difference in the error rates here.
This dataset has been studied by Smith et al. (1988) using the ADAP algorithm. Using 576 examples as a training set, ADAP achieved an error rate of .24 on the remaining 192 instances.
Table 9.20: Results for the diabetes dataset (2 classes, 8 attributes, 768 observations, 12-fold cross-validation).

Algorithm   Max. storage   Train time (s)   Test time (s)   Train error   Test error   Rank
Discrim          338            27.4             6.5           0.220        0.225        3
Quadisc          327            24.4             6.6           0.237        0.262       11
Logdisc          311            30.8             6.6           0.219        0.223        1
SMART            780          3762.0               *           0.177        0.232        4
ALLOC80          152          1374.1               *           0.288        0.301       21
k-NN             226             1.0             2.0           0.000        0.324       22
CASTLE            82            35.3             4.7           0.260        0.258       10
CART             144            29.6             0.8           0.227        0.255        9
IndCART          596           215.6           209.4           0.079        0.271       14
NewID             87             9.6            10.2           0.000        0.289       19
AC2              373          4377.0           241.0           0.000        0.276       18
Baytree           68            10.4             0.3           0.008        0.271       14
NaiveBay         431            25.0             7.2           0.239        0.262       11
CN2              190            38.4             2.8           0.010        0.289       19
C4.5              61            11.5             0.9           0.131        0.270       13
ITrule            60            31.2             1.5           0.223        0.245        6
Cal5             137           236.7             0.1           0.232        0.250        8
Kohonen           62          1966.4             2.5           0.134        0.273       17
DIPOL92           52            35.8             0.8           0.220        0.224        2
Backprop         147          7171.0             0.1           0.198        0.248        7
RBF              179             4.8             0.1           0.218        0.243        5
LVQ               69           139.5             1.2           0.101        0.272       16
Default            *               *               *           0.350        0.350       23

9.5.3 DNA
This classification problem is drawn from the field of molecular biology. Splice junctions are points on a DNA sequence at which superfluous DNA is removed during protein creation. The problem posed here is to recognise, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA that are spliced out). The dataset used in the project is a processed version of the Irvine Primate splice-junction database. Each of the 3186 examples in the database consists of a window of 60 nucleotides, each represented by one of four symbolic values (a, c, g, t), and the classification of the middle point in the window as one of: intron-extron boundary, extron-intron boundary, or neither of these. Processing involved the removal of a small number of ambiguous examples (4), conversion of the original 60 symbolic attributes to 180 or 240 binary attributes, and the conversion of symbolic class labels to numeric labels (see Section 7.4.3). The training set of 2000 was chosen randomly from the dataset and the remaining 1186 examples were used as the test set.
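The two binary expansions mentioned above can be illustrated as follows; the letter-to-code assignment is an assumption made for the sketch, not necessarily the one used in the project.

```python
# Expanding a 60-symbol window either to 180 columns (three indicators per
# position, one symbol coded as all zeros) or to 240 columns (one-of-four).
import numpy as np

def encode_window(window, n_bits):
    # window: string of 60 characters from {a, c, g, t}
    codes_4 = {"a": [1, 0, 0, 0], "c": [0, 1, 0, 0],
               "g": [0, 0, 1, 0], "t": [0, 0, 0, 1]}
    if n_bits == 4:
        table = codes_4
    else:  # 3 bits per position: drop the last indicator, so "t" -> (0, 0, 0)
        table = {k: v[:3] for k, v in codes_4.items()}
    return np.array([bit for ch in window for bit in table[ch]])

window = "acgt" * 15                    # toy 60-symbol window
print(encode_window(window, 3).shape)   # (180,)
print(encode_window(window, 4).shape)   # (240,)
```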
This is basically a partitioning problem and so we might expect, in advance, that Decision Tree algorithms should do well. The classes in this problem have a hierarchical
Table 9.21: Results for the DNA dataset (3 classes, 60/180/240 attributes, (train, test) = (2000, 1186) observations).

Algorithm   Max. storage   Train time (s)   Test time (s)   Train error   Test error   Rank
Discrim          215            928.5           31.1           0.034        0.059        4
Quadisc          262           1581.1          808.6           0.000        0.059        4
Logdisc         1661           5057.4           76.2           0.008        0.061        6
SMART            247          79676.0           16.0           0.034        0.115       17
ALLOC80          188          14393.5              *           0.063        0.057        3
k-NN             247           2427.5          882.0           0.000        0.146       20
CASTLE            86            396.7          225.0           0.061        0.072        8
CART             283            615.0            8.6           0.075        0.085       11
IndCART          729            523.0          515.8           0.040        0.073        9
NewID            729            698.4            1.0           0.000        0.100       15
AC2             9385          12378.0           87.0           0.000        0.100       15
Baytree          727             81.7           10.5           0.001        0.095       13
NaiveBay         727             51.8           14.8           0.052        0.068        7
CN2            10732            869.0           74.0           0.002        0.095       13
C4.5            1280              9.0            2.0           0.040        0.076       10
ITrule           282           2211.6            5.9           0.131        0.135       19
Cal5             755           1616.0            7.5           0.104        0.131       18
Kohonen         2592                *              *           0.104        0.339       21
DIPOL92          518            213.4           10.1           0.007        0.048        2
Backprop         161           4094.0            9.0           0.014        0.088       12
RBF             1129                *              *           0.015        0.041        1
LVQ               FD               FD             FD              FD           FD       FD
Default            *                *              *           0.475        0.492       22
structure; the primary decision is whether the centre point in the window is a splice junction or not. If it is a splice junction then the secondary classification is as to its type: intron-extron or extron-intron.
Unfortunately, comparisons between algorithms are more difficult than usual with this dataset, as a number of methods were tested with a restricted number of attributes; some were tested with attribute values converted to 180 binary values, and some to 240 binary values. CASTLE and CART only used the middle 90 binary variables. NewID, CN2 and C4.5 used the original 60 categorical variables, and k-NN, Kohonen, LVQ, Backprop and RBF used the one-of-four coding. The classical statistical algorithms perform reasonably well, achieving roughly a 6% error rate. k-NN is probably hampered by the large number of binary attributes, but Naive Bayes does rather well, helped by the fact that the attributes are independent.
Surprisingly, machine learning algorithms do not outperform classical statistical algorithms on this problem. CASTLE and CART were at a disadvantage using a smaller window, although performing reasonably. IndCART used 180 attributes and improved on the CART error rate by around 1%. ITrule and Cal5 are the poorest performers in this
group. ITrule, using only univariate and bivariate tests, is too restricted, and Cal5 is probably confused by the large number of attributes.
Of the neural network algorithms, Kohonen performs very poorly, not helped by unequal class proportions in the dataset. DIPOL92 constructs an effective set of piecewise linear decision boundaries but, overall, RBF is the most accurate algorithm, using 720 centres. It is rather worrying here that LVQ claimed an error rate of 0, and that this result was unchanged when the test data had the classes permuted. No reason could be found for this phenomenon; presumably it was caused by the excessive number of attributes, but that the algorithm should "lie" with no explanation or warning is still a mystery. This problem did not occur with any other dataset.
In order to assess the importance of the window size in this problem, we can examine
in a little more detail the performance of one of the machine learning algorithms. CN2
classified the training set using 113 rules involving tests on from 2 to 6 attributes and
misclassifying 4 examples. Table 9.22 shows how frequently attributes in different ranges
appeared in those 113 rules. From the table it appears that a window of size 20 contains the
Table 9.22: Frequency of occurrence of attributes in rules generated by CN2 for the DNA training set.

Attributes   class 1   class 2   class 3   total
1-10             17        17         6      40
11-20            10        28         8      46
21-30            12        78        57     147
31-40            59        21        55     135
41-50             7        13         4      24
51-60             2        11         3      16
most important variables. Attributes just after the middle of the window are most important
in determining class 1 and those just before the middle are most important in determining
class 2. For class 3, variables close to the middle on either side are equally important.
Overall though, variables throughout the 60 attribute window do seem to contribute. The
question of how many attributes to use in the window is vitally important for procedures
that include many parameters - Quadisc gets much better results (error rate of 3.6% on the
test set) if it is restricted to the middle 20 categorical attributes.
It is therefore of interest to note that decision tree procedures get almost the same accuracies on the original categorical data and the processed binary data. NewID obtained an error rate of 9.95% on the preprocessed data (180 variables) and 9.20% on the original
data (with categorical attributes). These accuracies are probably within what could be
called experimental error, so it seems that NewID does about as well on either form of the
dataset. There is a little more to the story however, as the University of Wisconsin ran
several algorithms on this dataset. In Table 9.23 we quote their results alongside ours for
nearest neighbour. In this problem, ID3 and NewID are probably equivalent, and the slight
discrepancies in error rates achieved by ID3 at Wisconsin (10.5%) compared to NewID
(9.95%) in this study are attributable to the different random samples used. This cannot be
the explanation for the differences between the two nearest neighbour results: there appears
to be an irreconcilable difference, perhaps due to preprocessing, perhaps due to distance
being measured in a conditional (class dependent) manner.
Certainly, the Kohonen algorithm used here encountered a problem when defining distances in the attribute space. When using the coding of 180 attributes (three binary indicators per nucleotide, so that one of the four symbols is coded as all zeros), the squared Euclidean distances between pairs of symbol codes were not all the same: 2.0 for some pairs, but only 1.0 for pairs involving the all-zero code. Therefore Kohonen needs the coding of 240 attributes. This coding was also adopted by other algorithms using distance measures (k-NN, LVQ).
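A short numerical illustration of this distance problem, under the assumed codings used in the earlier sketch, is given below: with three indicators per position the squared distances between symbol codes are not all equal, while the one-of-four coding makes every pair of distinct symbols equidistant.

```python
# Squared Euclidean distances between nucleotide codes under the two codings.
# The letter-to-code assignment is an illustrative assumption.
import itertools
import numpy as np

codes_4 = {"a": (1, 0, 0, 0), "c": (0, 1, 0, 0),
           "g": (0, 0, 1, 0), "t": (0, 0, 0, 1)}
codes_3 = {k: v[:3] for k, v in codes_4.items()}   # "t" becomes (0, 0, 0)

for name, table in (("3 indicators", codes_3), ("one-of-four", codes_4)):
    dists = {a + b: int(np.sum((np.array(table[a]) - np.array(table[b])) ** 2))
             for a, b in itertools.combinations("acgt", 2)}
    print(name, dists)   # 3 indicators: 2 for a/c/g pairs, 1 for pairs with t
```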
Table 9.23: DNA dataset error rates for each of the three classes: splice junction is intron-extron (IE), extron-intron (EI) or Neither. All trials except the last were carried out by the University of Wisconsin, sometimes with local implementations of published algorithms, using ten-fold cross-validation on 1000 examples randomly selected from the complete set of 3190. The last trial was conducted with a training set of 2000 examples and a test set of 1186 examples.

Algorithm                   IE      EI   Neither   Overall
KBANN                     8.47    7.56     4.62      6.28
Backprop                 10.75    5.74     5.29      6.69
PEBLS                     7.55    8.18     6.86      7.36
PERCEPTRON               17.41   16.32     3.99     10.31
ID3                      13.99   10.58     8.84     10.50
Cobweb                    9.46   15.04    11.80     12.08
N Neighbour (Wisconsin)   9.09   11.65    31.11     20.94
N Neighbour (Leeds)      36.79   25.74     0.50     14.60
Very little is known about this dataset as the nature of the problem domain is secret. It
is of commercial interest to Daimler-Benz AG, Germany. The dataset shows indications
of some sort of preprocessing, probably by some decision-tree type process, before it
was received. To give only one instance, consider only the four most common classes, and consider only one attribute (X52). By simply tabulating the values of attribute X52 it becomes obvious that the classifications are being made according to symmetrically placed boundaries on X52, specifically the two boundaries at -0.055 and +0.055, and also the boundaries at -0.085 and +0.085. These boundaries divide the range of X52 into five regions, and if we look at the classes contained in these regions we get the frequency table in Table 9.24. The symmetric nature of the boundaries suggests strongly that the classes have been defined by their attributes, and that the class definitions are only concerned with inequalities on the attributes. Needless to say, such a system is perfectly suited to decision trees, and we may remark, in passing, that the above table was discovered
by a decision tree when applied to the reduced technical dataset with all 56 attributes but
with only the four most common classes (in other words, the decision tree could classify
the reduced dataset with 1 error in 2193 examples using only one attribute).
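The tabulation just described can be reproduced in a few lines; the data frame and column names below are assumptions, since the real attribute values are confidential.

```python
# Bucketing examples by the symmetric boundaries on attribute X52 and
# cross-tabulating against the class.  Data are synthetic placeholders.
import numpy as np
import pandas as pd

boundaries = [-np.inf, -0.085, -0.055, 0.055, 0.085, np.inf]   # five regions
rng = np.random.default_rng(0)
df = pd.DataFrame({"X52": rng.normal(scale=0.1, size=2000),
                   "cls": rng.integers(0, 4, size=2000)})

df["region"] = pd.cut(df["X52"], bins=boundaries,
                      labels=["< -0.085", "-0.085..-0.055", "-0.055..0.055",
                              "0.055..0.085", "> 0.085"])
print(pd.crosstab(df["region"], df["cls"]))
```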
Table 9.25: Results for the technical dataset (91 classes, 56 attributes, (train, test) = (4500,
2580) observations).
Algorithm
Discrim
Quadisc
Logdisc
SMART
ALLOC80
k-NN
CASTLE
CART
IndCART
NewID
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Default
Max.
Storage
365
334
354
524
FD
213
FD
FD
3328
592
7400
1096
656
*
2876
FD
842
640
941
FD
510
559
*
Time (sec.)
Train
Test
421.3
200.8
19567.8 11011.6
18961.2
195.9
21563.7
56.8
FD
FD
5129.9
2457.0
FD
FD
FD
FD
1418.6
1423.3
527.1
12.5
5028.0
273.0
175.5
9.8
169.2
81.6
3980.0
465.0
384.0
96.0
FD
FD
2422.1
7.1
*
*
7226.0
1235.0
FD
FD
1264.0
323.0
2443.2
87.3
*
*
Error Rate
Train
Test
0.368 0.391
0.405 0.495
0.350 0.401
0.356 0.366
FD
FD
0.007 0.204
FD
FD
FD
FD
0.007 0.095
0.000 0.090
0.006 0.102
0.019 0.174
0.323 0.354
0.048 0.123
0.050 0.120
FD
FD
0.110 0.183
0.326 0.357
0.080 0.192
FD
FD
0.304 0.324
0.196 0.261
0.770 0.777
Rank
15
17
16
14
9
2
1
3
6
12
5
4
7
13
8
11
10
18
The dataset consists of 7080 examples with 56 attributes and 91 classes. The attributes are all believed to be real; however, the majority of attribute values are zero. This may be the numerical value 0 or, more likely, "not relevant", "not measured" or "not applicable". One-shot train-and-test was used to calculate the accuracy.
The results for this dataset seem quite poor, although all are significantly better than the default error rate of 0.777. Several algorithms failed to run on the dataset as they could not cope with the large number of classes. The decision tree algorithms IndCART, NewID and AC2 gave the best results in terms of error rates. This reflects the nature of the preprocessing, which made the dataset more suited to decision tree algorithms. However, the output produced by the tree algorithms is (not surprisingly) difficult to interpret: NewID has a tree with 590 terminal nodes, C4.5 has 258 nodes, Cal5 has 507 nodes and AC2 has 589 nodes. Statistical algorithms gave much poorer results, with Quadisc giving the highest error rate of all. They appear to over-train slightly as a result of too many parameters.
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Cascade
Default
Max.
Storage
588
592
465
98
125
86
279
170
293
846
222
289
276
345
77
293
62
216
49
146
*
115
391
*
Time (sec.)
Train
Test
73.8
27.8
85.2
40.5
130.4
27.1
7804.1
15.6
3676.2
*
1.0 137.0
230.2
96.2
135.1
8.5
86.5
85.4
142.0
1.0
1442.0
79.0
24.7
6.7
17.4
7.6
272.2
16.9
66.0
11.6
1906.2
41.1
13.9
7.2
7380.6
54.9
43.0
11.9
478.0
2.0
121.4
29.3
977.7
32.0
806.0
1.0
*
*
Error Rate
Train
Test
0.022 0.025
0.036 0.052
0.002 0.007
0.003 0.006
0.026 0.044
0.000 0.059
0.029 0.047
0.009 0.034
0.007 0.034
0.017 0.027
0.000 0.034
0.000 0.030
0.046 0.062
0.000 0.032
0.010 0.040
0.043 0.065
0.025 0.029
0.026 0.056
0.015 0.018
0.011 0.017
0.021 0.034
0.002 0.054
0.005 0.019
0.363 0.362
Rank
6
18
2
1
16
21
17
11
11
7
11
9
22
10
15
23
8
20
4
3
11
19
5
24
The object of this dataset is to find a fast and reliable indicator of instability in large scale power systems. The dataset is confidential to StatLog and belongs to T. van Cutsem and L. Wehenkel, University of Liège, Institut Montefiore, Sart-Tilman, B-4000 Liège, Belgium.
The emergency control of voltage stability is still in its infancy, but one important aspect of this control is the early detection of critical states in order to reliably trigger automatic corrective actions. This dataset has been constructed by simulating up to five minutes of the system behaviour. Basically, a case is labelled stable if all voltages controlled by On-Load Tap Changers are successfully brought back to their set-point values. Otherwise, the system becomes unstable.
There are 2500 examples of stable and unstable states, each with 28 attributes which involve measurements of voltage magnitudes, active and reactive power flows and injections. Statistical algorithms cannot be run on datasets which have linearly dependent attributes, and there are 7 such attributes (X18, X19, X20, X21, X23, X27, X28) in the Belgian Power dataset. These have to be removed when running the classical statistical algorithms. No other form of pre-processing was done to this dataset. Train and test sets have 1250 examples each.
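A quick way to locate such linearly dependent attributes is to check whether dropping a column changes the rank of the data matrix. The sketch below is a generic check on synthetic data, not the procedure used in the project.

```python
# Flagging columns that lie in the linear span of the remaining columns.
import numpy as np

def dependent_columns(X, tol=1e-8):
    redundant = []
    for j in range(X.shape[1]):
        others = np.delete(X, [j] + redundant, axis=1)
        full = np.delete(X, redundant, axis=1)
        # if removing column j does not lower the rank, j is redundant
        if np.linalg.matrix_rank(others, tol) == np.linalg.matrix_rank(full, tol):
            redundant.append(j)
    return redundant

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # add a linearly dependent column
print(dependent_columns(X))                   # one column per dependent group, e.g. [0]
```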
Fig. 9.6: Kohonen map of the Belgian Power data, showing potential clustering. Both classes 1 and
2 appear to have two distinct clusters.
Table 9.27: Results for the Belgian Power II dataset (2 classes, 57 attributes, (train, test) =
( 2000, 1000) observations).
Algorithm
Discrim
Quadisc
Logdisc
SMART
ALLOC80
k-NN
CASTLE
CART
IndCART
NewID
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Default
Max.
Storage
75
75
1087
882
185
129
80
232
1036
624
3707
968
852
4708
1404
291
103
585
154
148
*
194
*
Time (sec.)
Train
Test
107.5
9.3
516.8 211.8
336.0
43.6
11421.3
3.1
6238.4
*
408.5 103.4
9.5
4.3
467.9
11.8
349.5 335.2
131.0
0.5
3864.0
92.0
83.7
11.8
54.9
12.5
967.0
28.0
184.0
18.0
9024.1
17.9
62.1
9.8
*
*
95.4
13.1
4315.0
1.0
*
*
1704.0
50.8
*
*
Error Rate
Train
Test
0.048 0.041
0.015 0.035
0.031 0.028
0.010 0.013
0.057 0.045
0.000 0.052
0.062 0.064
0.022 0.022
0.004 0.014
0.000 0.017
0.000 0.019
0.000 0.014
0.087 0.089
0.000 0.025
0.008 0.018
0.080 0.081
0.037 0.026
0.061 0.084
0.030 0.026
0.021 0.022
0.037 0.035
0.018 0.065
0.076 0.070
Rank
15
13
12
1
16
17
18
7
2
4
6
2
23
9
5
21
10
22
10
7
13
19
20
of Liège and Electricité de France. The training set consists of 2000 examples with 57 attributes. The test set contains 1000 examples and there are two classes. No pre-processing was done and one-shot train-and-test was used to calculate the accuracy.
As for the previous Belgian Power dataset, SMART comes out top in terms of test error rate (although it takes far longer to run than the other algorithms considered here). Logdisc has not done so well on this larger dataset. k-NN was again confused by irrelevant attributes, and a variable selection option reduced the error rate to 2.2%. The machine learning algorithms IndCART, NewID, AC2, Baytree and C4.5 give consistently good results. The tree sizes here were more similar, with AC2 using 36 nodes, C4.5 using 25 nodes and NewID using 37 nodes. Naive Bayes is worst and, along with Kohonen and ITrule, gives poorer results than the default rule for the test set error rate (0.074).
There is a detailed description of this dataset and related results in Wehenkel et al. (1993).
9.5.7 Machine faults
Due to the confidential nature of the problem, very little is known about this dataset. It was donated to the project by the software company ISoft, Chemin de Moulon, F-91190
Table 9.28: Results for the Machine Faults dataset (3 classes, 45 attributes, 570 observations, 10-fold cross-validation).
Max.
Storage
457
299
406
105
129
87
176
164
672
*
826
596
484
1600
700
75
197
188
52
147
332
72
*
Algorithm
Discrim
Quadisc
Logdisc
SMART
ALLOC80
k-NN
CASTLE
CART
IndCART
NewID
Baytree
NaiveBay
CN2
C4.5
ITrule
Cal5
Kohonen
DIPOL92
Backprop
RBF
LVQ
Default
Time (sec.)
Train Test
51.1
6.8
46.0
8.4
67.6
6.2
13521.0
*
802.4
*
260.7
5.2
350.3 17.3
90.6
0.9
36.7 37.2
*
*
265.0
9.0
8.6
1.8
3.3
0.4
69.2
7.8
6.3
1.7
42.1
1.8
472.8
1.2
*
*
54.0 10.0
3724.6
0.0
58.6 12.0
90.6
2.3
*
*
Error Rate
Train
Test
0.140 0.204
0.107 0.293
0.122 0.221
0.101 0.339
0.341 0.339
0.376 0.375
0.254 0.318
0.244 0.318
0.156 0.335
0.000 0.304
0.000 0.174
0.003 0.283
0.232 0.274
0.000 0.354
0.125 0.305
0.331 0.330
0.231 0.297
0.193 0.472
0.120 0.191
0.028 0.228
0.102 0.320
0.019 0.444
0.610 0.610
Rank
3
8
4
17
17
20
12
12
16
10
1
7
6
19
11
15
9
22
2
5
14
21
23
Gif sur Yvette, France. The only information known about the dataset is that it involves the financial aspect of mechanical maintenance and repair. The aim is to evaluate the cost of repairing damaged entities. The original dataset had multiple attribute values and a few errors. This was processed to split the 15 attributes into 45. The original train and test sets supplied by ISoft were concatenated and the examples permuted randomly to form a dataset with 570 examples. The pre-processing of hierarchical data is discussed further in Section 7.4.5. There are 45 numerical attributes and 3 classes, and classification was done using 10-fold cross-validation.
This is the only hierarchical dataset studied here. Compared with the other algorithms, AC2 gives the best error rate. The AC2 trials were done on the original dataset, whereas the other algorithms on the project used a transformed dataset because they cannot handle datasets expressed in the knowledge representation language of AC2. In other words, this dataset was preprocessed in order that other algorithms could handle it. This preprocessing was done without loss of information on the attributes, but the hierarchy between attributes was destroyed. The dataset of this application has been designed to run with AC2, thus all the knowledge entered has been used by the program. This explains (in part) the performance of AC2 and underlines the importance of structuring the knowledge
for an application. Although this result is of interest, it was not strictly a fair comparison, since AC2 used domain-specific knowledge which the other algorithms did not (and, for the most part, could not). In addition, it should be pointed out that the cross-validation procedure used with AC2 involved a different splitting method that preserved the class proportions, so this will also bias the result somewhat. The size of the tree produced by AC2 is 340 nodes, whereas Cal5 and NewID used trees with 33 nodes and 111 nodes, respectively.
Kohonen gives the poorest result, which is surprising as this neural net algorithm should do better on datasets with nearly equal class numbers. It is interesting to compare this with the results for k-NN. Kohonen should work well on all datasets on which any algorithm similar to the nearest-neighbour algorithm (or a classical cluster analysis) works well. The fact that k-NN performs badly on this dataset suggests that Kohonen will too.
9.5.8 Tsetse flies
Fig. 9.7: Tsetse map: the symbols + and - denote the presence and absence of tsetse flies respectively.
Tsetse flies are one of the most prevalent insect hosts spreading disease (namely trypanosomiasis) from cattle to humans in Africa. In order to limit the spread of disease it is of interest to predict the distribution of flies and the types of environment to which they are best suited.
The tsetse dataset contains interpolated data contributed by the CSIRO Division of Forestry, Australia (Booth et al., 1990) and was donated by Trevor H. Booth, PO Box 4008,
The machine learning algorithms produce the best (CN2) and worst (ITrule) results for this dataset. The decision tree algorithms C4.5, CART, NewID and AC2 all give rise to fairly accurate classification rules. The modern statistical algorithms SMART, ALLOC80 and k-NN do significantly better than the classical statistical algorithms (Discrim, Quadisc and Logdisc). With a variable selection procedure k-NN obtains an error rate of 3.8%, again indicating some unhelpful attributes.
Similar work has been done on this dataset by Booth et al. (1990) and Ripley (1993). The dataset used by Ripley was slightly different in that the attributes were normalised to be in the range [0,1] over the whole dataset. Also, the train and test sets used in the classification were both samples of size 500 taken from the full dataset, which explains the less accurate results achieved. For example, linear discriminants had an error rate of 13.8%, an algorithm similar to SMART had 10.2%, 1-nearest neighbour had 8.4% and Backprop had 8.4%. The best result for LVQ was 9%, and for tree algorithms an error rate of 10% was reduced to 9.6% on pruning.
However, the conclusions of both studies agree. The nearest neighbour and LVQ algorithms work well (although they provide no explanation of the structure in the dataset).
Table 9.29: Results for the tsetse dataset (2 classes, 14 attributes, (train, test) = (3500, 1499) observations).

Algorithm   Max. storage   Train time (s)   Test time (s)   Train error   Test error   Rank
Discrim           69            25.8             3.6           0.120        0.122       20
Quadisc           73            58.5            19.7           0.092        0.098       17
Logdisc          599           139.7            21.9           0.116        0.117       18
SMART            179          7638.0             4.0           0.042        0.047        6
ALLOC80          138          1944.7               *           0.053        0.057       12
k-NN              99          3898.8           276.0           0.053        0.057       12
CASTLE           233           458.0           172.3           0.141        0.137       21
CART             182            63.5             3.8           0.006        0.041        5
IndCART         1071               *               *           0.009        0.039        3
NewID            207            49.0             1.0           0.000        0.040        4
AC2             2365          2236.0           173.0           0.000        0.047        6
Baytree          979            21.9             2.6           0.001        0.037        2
NaiveBay         811            13.5             2.7           0.128        0.120       19
CN2             6104           468.0            21.0           0.000        0.036        1
C4.5             840            32.0             4.0           0.015        0.049        8
ITrule           199           761.4             3.4           0.233        0.228       22
Cal5             123            49.6             2.4           0.041        0.055       11
Kohonen            *               *               *           0.055        0.075       16
DIPOL92          131           406.1            53.3           0.043        0.053       10
Backprop         144          1196.0             2.0           0.059        0.065       14
RBF             1239               *               *           0.043        0.052        9
LVQ              141           536.5            14.0           0.039        0.065       14
Default            *               *               *           0.492        0.488       23
That the tree-based methods provide a very good and interpretable fit can be seen from the results of AC2, CART, Cal5 and NewID. Similar error rates were obtained for AC2 (which used 128 nodes), C4.5 (which used 92 nodes) and NewID (which used 130 nodes). However, Cal5 used only 72 nodes and achieved a slightly higher error rate, which possibly suggests over-pruning. CASTLE has a high error rate compared with the other algorithms; it appears to use only one attribute to construct the classification rule. The MLP result (Backprop) is directly comparable with the result achieved by Ripley (attribute values were normalised) and gave a slightly better result (error rate 1.9% lower). However, the overall conclusion is the same, in that MLPs did about the same as LVQ and nearest-neighbour, both of which are much simpler to use.
comprehensive. Rather, some instructive points are chosen for illustrating the important
ideas contained in the measures.
9.6.1 KL-digits dataset
The dataset that looks closest to being normal is the Karhunen-Loeve version of digits.
This could be predicted beforehand, as it is a linear transformation of the attributes that,
by the Central Limit Theorem, would be closer to normal than the original. Because there
are very many attributes in each linear combination, the KL-digits dataset is very close to
normal with skewness = 0.1802, and kurtosis = 2.92, as against the exact normal values of
skewness = 0 and kurtosis = 3.0.
Rather interestingly, the multivariate kurtosis statistic for KL digits shows
a very marked departure from multivariate normality (3.743), despite the fact that the
univariate statistics are close to normal (e.g. kurtosis = 2.920). This is not too surprising:
it is possible to take a linear transform from Karhunen-Loeve space back to the original
highly non-normal dataset. This shows the practical desirability of using a multivariate
version of kurtosis.
The KL version of digits appears to be well suited to quadratic discriminants: there is a substantial difference in variances (SD ratio = 1.9657), while at the same time the distributions are not too far from multivariate normality, with kurtosis of order 3. Also, and more importantly, there are sufficient examples that the many parameters of the quadratic discriminants can be estimated fairly accurately.
Also, the KL version appears to have a greater difference in variances (SD ratio = 1.9657) than the raw digit data (SD ratio = 1.5673). This is an artefact: the digits data used here are obtained by summing over sets of 4 x 4 pixels. The original digits data, with 256 attributes, had several attributes with zero variances in some classes, giving rise to an infinite value for SD ratio.
The total of the individual mutual informations for the KL dataset is 40 x 0.2029 = 8.116, and this figure can be compared with the corresponding total for the 4x4 digit dataset, namely 16 x 0.5049 = 8.078. These datasets are ultimately derived from the same dataset, so it is no surprise that these totals are rather close. However, most algorithms found the KL attributes more informative about class (and so obtained reduced error rates).
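The measure being totalled here is the mutual information between the class and each attribute. A hedged sketch of how such a per-attribute figure can be computed (with ad hoc binning and synthetic data, not the StatLog definitions) is:

```python
# Mean mutual information between class and discretised attributes, in bits.
import numpy as np
from sklearn.metrics import mutual_info_score

def mean_mutual_information(X, y, n_bins=10):
    scores = []
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=n_bins)
        binned = np.digitize(X[:, j], edges)               # crude discretisation
        scores.append(mutual_info_score(y, binned) / np.log(2))  # nats -> bits
    return float(np.mean(scores))

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=2000)                          # 10 "digit" classes
X = y[:, None] + rng.normal(scale=4.0, size=(2000, 16))     # weakly informative
print(mean_mutual_information(X, y))
```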
9.6.2 Vehicle silhouettes
In the vehicle dataset, the high value of fract2 = 0.9139 might indicate that discrimination
could be based on just two discriminants. This may relate to the fact that the two cars
are not easily distinguishable, so might be treated as one (reducing dimensionality of the
mean vectors to 3D). However, although the fraction of discriminating power for the third
discriminant is low (1 - 0.9139), it is still statistically significant, so cannot be discarded without a small loss of discrimination.
This dataset also illustrates that using mean statistics may mask significant differences in behaviour between classes. For example, in the vehicle dataset, for some of the populations (vehicle types 1 and 2), Mardia's kurtosis statistic is not significant. However, for both vehicle types 1 and 2, the univariate statistics are very significantly low, indicating marked departure from normality. Mardia's statistic does not pick this up, partly because the
Measures for the Vehicle and Cut20 (CUT) datasets:

Measure         Vehicle      CUT
N                   846   18 700
p                    18       20
k                     4        2
Bin.att               0        0
Cost                  0        0
SD               1.5392   1.0320
corr.abs         0.4828   0.2178
cancor1          0.8420   0.5500
cancor2          0.8189        -
fract1           0.4696   1.0000
fract2           0.9139        -
skewness         0.8282   0.9012
kurtosis         5.1800   3.5214
H(C)             1.9979   0.3256
mean H(X)        4.2472   4.6908
mean M(C,X)      0.3538   0.0292
number of attributes is fairly large in relation to the number of examples per class, and partly because Mardia's statistic is less efficient than the univariate statistics.
9.6.3 Head injury
Among the datasets with more than two classes, the clearest evidence of collinearity is in the head injury dataset. Here the second canonical correlation is not statistically different from zero, with a critical level of 0.074.
It appears that a single linear discriminant is sufficient to discriminate between the classes (more precisely: a second linear discriminant does not improve discrimination). Therefore the head injury dataset is very close to linearity. This may also be observed from the value of fract1 = 0.979, implying that the three class means lie close to a straight line. In turn, this suggests that the class values reflect some underlying continuum of severity, so this is not a true discrimination problem. Note the similarity with Fisher's original use of discrimination as a means of ordering populations.
Perhaps this dataset would best be dealt with by a pure regression technique, either linear or logistic. If so, Manova gives the best set of scores for the three categories of injury as (0.681, -0.105, -0.725), indicating that the middle group is slightly nearer to category 3 than 1, but not significantly nearer.
It appears that there is not much difference between the covariance matrices for the three populations in the head dataset (SD ratio = 1.1231), so quadratic discrimination is not expected to do much better than linear discrimination (and will probably do worse, as it uses many more parameters).
9.6.4 Heart disease
The leading correlation coefficient cancor1 = 0.7384 in the heart dataset is not very high (bear in mind that it is correlation that gives a measure of predictability). Therefore the
discriminating power of the linear discriminant is only moderate. This ties up with the
moderate success of linear discriminants for this dataset (cost for the training data of 0.32).
there are six classes in the shuttle dataset, some class probabilities are very low indeed: so low, in fact, that the complexity of the classification problem is on a par with a two-class problem.
9.6.7 Technical
Although all attributes are nominally continuous, there are very many zeroes; so many that we can regard some of the attributes as nearly constant (and equal to zero). This is shown by the average attribute entropy of 0.379, which is substantially less than one bit. The average mutual information is 0.185, and this is about half of the information carried by each attribute, so that, although the attributes contain little information content, this information contains relatively little noise.
10
Analysis of Results
P. B. Brazdil (1) and R. J. Henery (2)
(1) University of Porto and (2) University of Strathclyde
10.1 INTRODUCTION
We analyse the results of the trials in this chapter using several methods:
- The section on Results by Subject Areas shows that Neural Network and Statistical methods do better in some areas and Machine Learning procedures in others. The idea is to give some indication of the subject areas where certain methods do best.
- Multidimensional Scaling is a method that can be used to point out similarities in both algorithms and datasets, using the performance (error-rates) of every algorithm/dataset combination as a basis. The aim here is to understand the relationship between the various methods.
- We also describe a simple-minded attempt at exploring the relationship between pruning and accuracy of decision trees.
- A principal aim of StatLog was to relate performance of algorithms (usually interpreted as accuracy or error-rate) to characteristics or measures of datasets. Here the aim is to give objective measures describing a dataset and to predict how well any given algorithm will perform on that dataset. We discuss several ways in which this might be done. This includes an empirical study of performance related to statistical and information-theoretic measures of the datasets. In particular, one of the learning algorithms under study (C4.5) is used in an ingenious attempt to predict performance of all algorithms (including C4.5!) from the measures on a given dataset.
- The performance of an algorithm may be predicted by the performance of similar algorithms. If results are already available for a few yardstick methods, the hope is that the performance of other methods can be predicted from the yardstick results.
In presenting these analyses, we aim to give many different views of the results so that a
reasonably complete (although perhaps not always coherent) picture can be presented of a
very complex problem, namely, the problem of explaining why some algorithms do better
Address for correspondence: Laboratory of AI and Computer Science (LIACC), University of Porto, R.
Campo Alegre 823, 4100 Porto, Portugal
on some datasets and not so well on others. These differing analyses may give conflicting and perhaps irreconcilable conclusions. However, we are not yet at the stage where we can say that this or that analysis is the final and only word on the subject, so we present all the facts in the hope that the reader will be able to judge what is most relevant to the particular application at hand.
10.2 RESULTS BY SUBJECT AREAS
To begin with, the results of the trials will be discussed in subject areas. This is partly
because this makes for easier description and interpretation, but, more importantly, because
the performance of the various algorithms is much influenced by the particular application.
Several datasets are closely related, and it is easier to spot differences when comparisons
are made within the same dataset type. So we will discuss the results under four headings:
- Datasets Involving Costs
- Credit Risk Datasets
- Image Related Datasets
- Others
Of course, these headings are not necessarily disjoint: one of our datasets (German credit)
was a credit dataset involving costs. The feature dominating performance of algorithms is
costs, so the German credit dataset is listed under the Cost datasets.
We do not attempt to give any absolute assessment of accuracies, or average costs. But
we have listed the algorithms in each heading by their average ranking within this heading.
Algorithms at the top of the table do well, on average, and algorithms at the bottom do
badly.
To illustrate how the ranking was calculated, consider the two (no-cost) credit datasets.
Because, for example, Cal5 is ranked 1st in the Australian.credit and 4th in the credit
management dataset, Cal5 has a total rank of 5, which is the smallest total of all, and Cal5
is therefore top of the listing in the Credit datasets. Similarly, DIPOL92 has a total rank of
7, and so is 2nd in the list.
Of course, other considerations, such as memory storage, time to learn etc., must not
be forgotten. In this chapter, we take only error-rate or average cost into account.
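A small sketch of this ranking calculation, on made-up error rates (the algorithm names and numbers below are placeholders, not values from Table 10.1), is:

```python
# Rank algorithms within each dataset by error rate, then order them by
# the total rank across datasets (smallest total = best overall).
import pandas as pd

errors = pd.DataFrame({"dataset A": [0.13, 0.14, 0.15, 0.20],
                       "dataset B": [0.03, 0.02, 0.05, 0.04]},
                      index=["Alg1", "Alg2", "Alg3", "Alg4"])

ranks = errors.rank(axis=0, method="min")   # rank within each dataset column
total = ranks.sum(axis=1).sort_values()     # smallest total rank listed first
print(total)
```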
10.2.1 Credit datasets
We have results for two credit datasets. In two of these, the problem is to predict the creditworthiness of applicants for credit, but they are all either coded or confidential to a greater or lesser extent. So, for example, we do not know the exact definition of "uncreditworthy" or "bad risk". Possible definitions are (i) more than one month late with the first payment; (ii) more than two months late with the first payment; or even (iii) the (human) credit manager has already refused credit to this person.
- Credit Management. Credit management data from the UK (confidential).
- German. Credit risk data from Germany.
- Australian. Credit risk data from (Quinlan, 1993).
It may be that these classifications are defined by a human: if so, then the aim of the decision rule is to devise a procedure that mimics the human decision process as closely as possible. Machine Learning procedures are very good at this, and this probably reflects a natural
Using assumption (10.2), one can get the expected misclassification cost K from equation (10.1), giving equation (10.3).
is the same for all algorithms, so one can use the
[Table 10.1: error rates for the credit datasets (columns Cr.Aus and Cr.Man). Algorithms are listed in order of their average ranking over the two datasets, with Cal5 first and DIPOL92 second; error rates range from 0.131 to 0.440 on Cr.Aus and from 0.020 to 0.050 on Cr.Man.]
The table of error rates for the credit datasets is given in Table 10.1. In reading this table, the reader should beware that:
Not much can be inferred from only two cases about the suitability of this or that algorithm for credit datasets generally;
In real credit applications, differential misclassification costs tend to loom large, if not explicitly then by implication.
It is noteworthy that three of the top six algorithms are decision trees (Cal5, C4.5 and
IndCART), while the algorithm in second place (DIPOL92) is akin to a neural network.
We may conclude that decision trees do reasonably well on credit datasets. This conclusion
would probably be strengthened if we had persuaded CART to run on the credit management dataset, as it is likely that the error rate for CART would be fairly similar to IndCART's value, and then CART would come above IndCART in this table. However, where values were missing, as is the case with CART, the result was assumed to be the default value - an admittedly very conservative procedure - so CART appears low down in Table 10.1.
By itself, the conclusion that decision trees do well on credit datasets, while giving some practical guidance on a specific application area, does not explain why decision trees should be successful here. A likely explanation is that both datasets are partitioning datasets. This is known to be true for the credit management dataset, where a human classified the data on the basis of the attributes. We suspect that it holds for the other credit dataset also, in view of the following facts: (i) they are both credit datasets; (ii) they are near each other in the multidimensional scaling representation of all datasets; and (iii) they are similar in terms of number of attributes, number of classes, presence of categorical attributes etc. Part of the reason for their success in this subject area is undoubtedly that decision tree methods can cope more naturally with a large number of binary or categorical attributes (provided the number of categories is small). They also incorporate interaction terms as a matter of course. And, perhaps more significantly, they mirror the human decision process.
10.2.2 Image datasets
Image classification problems occur in a wide variety of contexts. In some applications the entire image (or an object in the image) must be classified, whereas in other cases the classification proceeds on a pixel-by-pixel basis (possibly with extra spatial information). One of the first problems to be tackled was that of LANDSAT data, where Switzer (1980, 1983) considered classification of each pixel in a spatial context. A similar dataset was used in our trials, whereby the attributes (but not the class) of neighbouring pixels were used to aid the classification (Section 9.3.6). A further image segmentation problem, that of classifying each pixel, is considered in Section 9.3.7. An alternative problem is to classify the entire image into one of several classes. An example of this is object recognition, for example classifying a hand-written character (Section 9.3.1) or a remotely sensed vehicle (Section 9.3.3). Another example in our trials is the classification of chromosomes (Section 9.3.5), based on a number of features extracted from an image.
There are different levels of image data. At the simplest level we can use the grey values at each pixel as the set of variables to classify each pixel, or the whole image. Our trials suggest that the latter is not likely to work unless the image is rather small; for example, classifying a hand-written number directly from its pixel grey levels defeated most of our algorithms. The pixel data can be further processed to yield a sharper image, or other information which is still pixel-based; for example, a gradient filter can be used to extract edges. A more promising approach to classifying images is to extract and select appropriate features, and the vehicle silhouette (Section 9.3.3) and chromosome (Section 9.3.5) datasets are of this type. The issue of extracting the right features is a harder problem. The temptation is to measure everything which may be useful, but additional information which is not relevant may spoil the performance of a classifier. For example, the nearest neighbour method typically treats all variables with equal weight, and if some are of no value then very poor results can occur. Other algorithms are more robust to this pitfall.
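The sensitivity of nearest-neighbour methods to irrelevant or badly scaled attributes is easy to demonstrate on synthetic data. The sketch below is our own illustration (artificial data and a bare-bones 1-NN classifier, not one of the StatLog trials); it adds pure-noise attributes to a well-separated two-class problem and measures how the accuracy falls:

    import numpy as np

    def one_nn_accuracy(X_train, y_train, X_test, y_test):
        # classify each test point by its single nearest training point (Euclidean)
        d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
        return (y_train[d.argmin(axis=1)] == y_test).mean()

    rng = np.random.default_rng(0)
    n = 200
    y = rng.integers(0, 2, size=2 * n)
    signal = y[:, None] + 0.5 * rng.standard_normal((2 * n, 2))  # two informative attributes

    for n_noise in (0, 10, 50):
        noise = rng.standard_normal((2 * n, n_noise))            # irrelevant attributes
        X = np.hstack([signal, noise])
        print(n_noise, "noise attributes:",
              round(one_nn_accuracy(X[:n], y[:n], X[n:], y[n:]), 2))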
For presentation purposes we will categorise each of the nine image datasets as being
one of Segmentation or Object Recognition, and we give the results of the two types
separately.
Results and conclusions: Object Recognition
Table 10.2: Error rates for Object Recognition Datasets. Algorithms are listed in order of their average ranking over the five datasets. Algorithms near the top tend to do well at object recognition.
[Table 10.2 body: error rates of each algorithm on the KL, Digits (Dig44), Vehicle, Chromosome and Letter datasets, with Quadisc, k-NN and DIPOL92 at the top of the ranking and the Default rule at the bottom.]
Table 10.2 gives the error-rates for the five object recognition datasets. It is believed that this group contains pure discrimination datasets (digit, vehicle and letter recognition). On these datasets, standard statistical procedures and neural networks do well overall.
It would be wrong to draw general conclusions from only five datasets, but we can make the following points. The proponents of backpropagation claim that it has a special ability to model non-linear behaviour. Some of these datasets have significant non-linearity, and it is true that backpropagation does well. However, in the case of the digits it performs only marginally better than quadratic discriminants, which can also model non-linear behaviour, and in the case of the vehicles it performs significantly worse. When one considers the large amount of extra effort required to optimise and train backpropagation, one must ask whether it really offers an advantage over more traditional algorithms. Ripley (1993) also raises some important points on the use and claims of neural network methods.
CASTLE performs poorly, but this is probably because it is not primarily designed for discrimination. Its main advantage is that it gives an easily comprehensible picture of the structure of the data. It indicates which variables influence one another most strongly, and can identify which subset of attributes is most strongly connected to the decision class. However, it ignores weak connections, and this is the reason for its poor performance, in that weak connections may still have an influence on the final decision class.
SMART and linear discriminants perform similarly on these datasets. Both of these work with linear combinations of the attributes, although SMART is more general in that it takes non-linear functions of these combinations. However, quadratic discriminants perform rather better, which suggests that a better way to model the non-linearity would be to input selected quadratic combinations of attributes to linear discriminants.
The nearest neighbour algorithm does well if all the variables are useful in classification and if there are no problems in choosing the right scaling. Raw pixel data such as the satellite data and the hand-written digits satisfy these criteria. If some of the variables are misleading or unhelpful then a variable selection procedure should precede classification. The algorithm used here was not efficient in cpu time, since no condensing was used. Results from Ripley (1993) indicate that condensing does not greatly affect the classification performance.
Paired Comparison on Digits Data: KL and the 4x4 digits data represent different preprocessed versions of one and the same original dataset. Not unexpectedly, there is a high correlation between the error-rates (0.944, with two missing values: CART and Kohonen on KL).
Of much more interest is the fact that the statistical and neural net procedures perform much better on the KL version than on the 4x4 version. On the other hand, Machine Learning methods perform rather poorly on the 4x4 version and do even worse on the KL version. It is rather difficult to account for this phenomenon. ML methods, by their nature, do not seem to cope with situations where the information is spread over a large number of variables. By construction, the Karhunen-Loeve dataset deliberately creates variables that are linear combinations of the original pixel grey levels, with the first variable containing most information, the second variable containing the maximum information orthogonal to the first, and so on. From one point of view, therefore, the first 16 KL attributes contain more information than the complete set of 16 attributes in the 4x4 digit dataset (as the latter is a particular set of linear combinations of the original data), and the improvement in error rates of the statistical procedures is consistent with this interpretation.
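The Karhunen-Loeve construction referred to here is, in essence, a principal component transform of the pixel variables. A minimal sketch of such a transform is given below; the input array of raw pixel vectors and the choice of 16 components are assumptions for illustration, not the exact preprocessing used to build the KL dataset:

    import numpy as np

    def karhunen_loeve(pixels, n_components=16):
        """pixels: (n_examples, n_pixels) array of grey levels.
        Returns projections onto the eigenvectors of the covariance matrix,
        ordered by decreasing variance (the KL attributes)."""
        centred = pixels - pixels.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
        order = np.argsort(eigvals)[::-1][:n_components]
        return centred @ eigvecs[:, order]

The first attribute carries the most variance, the second the most variance orthogonal to the first, and so on, which is exactly the property exploited in the comparison above.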
Results and conclusions: Segmentation
Table 10.3 gives the error rates for the four segmentation problems. Machine Learning procedures do fairly well on segmentation datasets, and traditional statistical methods do very badly. The probable explanation is that these datasets originate as partitioning problems.
Paired Comparison of Cut20 and Cut50: The dataset Cut20 consists of the first 20 attributes of the Cut50 dataset, ordered by importance in a stepwise regression procedure. One would therefore expect, and generally one observes, that performance deteriorates when the number of attributes is decreased (so that the information content is decreased). One exception to this rule is quadratic discrimination, which does badly on the Cut20 dataset and even worse on the Cut50 data. This is the converse of the paired comparison in the digits
Table 10.3: Error rates for Segmentation Datasets. Algorithms are listed in order of their
average ranking over the four datasets. Algorithms near the top tend to do well in image
segmentation problems.
[Table 10.3 body: error rates of each algorithm on the Satim, Segm, Cut20 and Cut50 datasets, with ALLOC80, Baytree and k-NN at the top of the ranking.]
dataset: it appears that algorithms that are already doing badly on the most informative set of attributes do even worse when the less informative attributes are added.
Similarly, Machine Learning methods do better on the Cut50 dataset, but there is a surprise: they use smaller decision trees to achieve greater accuracy. This must mean that some of the less significant attributes contribute to the discrimination by means of interactions (or non-linearities). Here the phrase "less significant" is used in a technical sense, referring to the least informative attributes in a linear discriminant analysis. Clearly attributes that have little information for linear discriminants may have considerable value for other procedures that are capable of incorporating interactions and non-linearities directly.
k-NN is best for images
Perhaps the most striking result in the images datasets is the performance of k-nearest
neighbour, with four outright top places and two runners-up. It would seem that, in terms
of error-rate, best results in image data are obtained by k-nearest neighbour.
We turn now to the datasets involving costs: the head injury, heart disease and German credit datasets. The average costs of the various algorithms are given in Table 10.4. There are some surprises in this table, particularly relating to the default procedure and the performance of most of the Machine Learning and some of the Neural Network procedures. Overall, it would seem that the ML procedures do worse than the default (of granting credit to everyone, or declaring everyone to be seriously ill).
[Table 10.5: error rates for the remaining eight datasets (Belg, NewBel, Tset, Diab, DNA, Faults, Shutt and Tech). Algorithms are listed in order of their average ranking over these datasets, with C4.5 and Cal5 near the top and the Default rule at the bottom.]
Of the remaining datasets, at least two (shuttle and technical) are pure partitioning problems, with boundaries characteristically parallel to the attribute axes, a fact that can be judged from plots of the attributes. Two are simulated datasets (Belgian and Belgian Power II), and can be described as somewhere between prediction and partitioning. The aim of the tsetse dataset can be precisely stated as partitioning a map into two regions, so as to reproduce a given partitioning as closely as possible. The tsetse dataset is also artificial insofar as some of the attributes have been manufactured (by interpolation from a small amount of information). The Diabetes dataset is a prediction problem.
The nature of the other datasets (DNA, Machine Faults), i.e. whether we are dealing with partitioning, prediction or discrimination, is not known precisely.
Results and conclusions
Table 10.5 gives the error-rates for these eight datasets. It is perhaps inappropriate to draw
general conclusions from such a mixed bag of datasets. However, it would appear, from the
performance of the algorithms, that the datasets are best dealt with by Machine Learning
or Neural Network procedures. How much relevance this has to practical problems is debatable, however, as two are simulated and two are pure partitioning datasets.
10.3 TOP FIVE ALGORITHMS
In Table 10.6 we present the algorithms that came out top for each of the 22 datasets. Only the top five algorithms are quoted. The table is given for reference only, so that readers can see which algorithms do well on a particular dataset. The algorithms that make the top five most frequently are DIPOL92 (12 times), ALLOC80 (11), Discrim (9), and Logdiscr and Quadisc (8 each), but not too much should be made of these figures as they depend very much on the mix of problems used.
Table 10.6: Top five algorithms for all datasets.

Dataset   First      Second     Third      Fourth     Fifth
KL        k-NN       ALLOC80    Quadisc    LVQ        DIPOL92
Dig44     k-NN       Quadisc    LVQ        Cascade    ALLOC80
Satim     k-NN       LVQ        DIPOL92    RBF        ALLOC80
Vehic     Quadisc    DIPOL92    ALLOC80    Logdiscr   Bprop
Head      Logdiscr   Cascade    Discrim    Quadisc    CART
Heart     Naivebay   Discrim    Logdiscr   ALLOC80    Quadisc
Belg      SMART      Logdiscr   Bprop      DIPOL92    Discrim
Segm      ALLOC80    AC2        Baytree    NewID      DIPOL92
Diab      Logdiscr   DIPOL92    Discrim    SMART      RBF
Cr.Ger    Discrim    Logdiscr   CASTLE     ALLOC80    DIPOL92
Chrom     Quadisc    DIPOL92    Discrim    LVQ        k-NN
Cr.Aus    CAL5       ITrule     Discrim    Logdiscr   DIPOL92
Shutt     NewID      Baytree    CN2        CAL5       CART
DNA       RBF        DIPOL92    ALLOC80    Discrim    Quadisc
Tech      NewID      IndCART    AC2        C4.5       CN2
NewBel    SMART      IndCART    Baytree    NewID      C4.5
ISoft     AC2        DIPOL92    Discrim    Logdiscr   Bprop
Tset      CN2        Baytree    IndCART    NewID      CART
cut20     Baytree    k-NN       C4.5       ALLOC80    NewID
cut50     k-NN       CN2        ALLOC80    Baytree    C4.5
Cr.Man    SMART      DIPOL92    C4.5       CAL5       Bprop
letter    ALLOC80    k-NN       LVQ        Quadisc    CN2
Table 10.7 gives the same information as Table 10.6, but here it is the type of algorithm (Statistical, Machine Learning or Neural Net) that is quoted.
In the Heart disease dataset, the top five algorithms are all Statistical, whereas the top five are all Machine Learning for the Shuttle and Technical datasets. Between these two extremes there is a variety. Table 10.8 orders the datasets by the number of Machine Learning, Statistical or Neural Network algorithms that appear in the top five.
From inspection of the frequencies in Table 10.8, it appears that Neural Networks and Statistical procedures do well on the same kind of datasets. In other words, Neural Nets tend to do well when statistical procedures do well, and vice versa. As an objective measure of this tendency, a correspondence analysis can be used. Correspondence analysis attempts
Table 10.7: Top five algorithms for all datasets, by type: Machine Learning (ML); Statistics (Stat); and Neural Net (NN).

Dataset   First   Second   Third   Fourth   Fifth
KL        Stat    Stat     Stat    NN       NN
Dig44     Stat    Stat     NN      NN       Stat
Satim     Stat    NN       NN      NN       Stat
Vehic     Stat    NN       Stat    Stat     NN
Head      Stat    NN       Stat    Stat     ML
Heart     Stat    Stat     Stat    Stat     Stat
Belg      Stat    Stat     NN      NN       Stat
Segm      Stat    ML       ML      ML       NN
Diab      Stat    NN       Stat    Stat     NN
Cr.Ger    Stat    Stat     Stat    Stat     NN
Chrom     Stat    NN       Stat    NN       Stat
Cr.Aus    ML      ML       Stat    Stat     NN
Shutt     ML      ML       ML      ML       ML
DNA       NN      NN       Stat    Stat     Stat
Tech      ML      ML       ML      ML       ML
NewBel    Stat    ML       ML      ML       ML
ISoft     ML      NN       Stat    Stat     NN
Tset      ML      ML       ML      ML       ML
cut20     ML      Stat     ML      Stat     ML
cut50     Stat    ML       Stat    ML       ML
Cr.Man    Stat    NN       ML      ML       NN
letter    Stat    Stat     NN      Stat     ML
to give scores to the rows (here datasets) and columns (here procedure types) of an array with positive entries in such a way that the scores are mutually consistent and maximally correlated. For a description of correspondence analysis, see Hill (1982) and Mardia et al. (1979). It turns out that the optimal scores for columns 2 and 3 (neural net and statistical procedures) are virtually identical, but these are quite different from the score of column 1 (the ML procedures). It would appear therefore that neural nets are more similar to statistical procedures than to ML. In passing we may note that the optimal scores given to the datasets may be used to order them, and this ordering can be understood as a measure of how suited each dataset is to ML procedures. If the same score is allocated to neural net and statistical procedures, the corresponding ordering of the datasets is exactly that given in the table, with datasets at the bottom being more of type ML.
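Readers wishing to reproduce this kind of analysis can do so with a standard singular value decomposition. The sketch below is a generic correspondence-analysis computation applied to a counts table such as Table 10.8; it is not the particular program used in the project:

    import numpy as np

    def correspondence_scores(counts):
        """counts: non-negative frequencies (rows = datasets, columns = procedure types).
        Returns the first-axis row and column scores (principal coordinates)."""
        P = counts / counts.sum()
        r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
        U, d, Vt = np.linalg.svd(S, full_matrices=False)
        return U[:, 0] * d[0] / np.sqrt(r), Vt[0] * d[0] / np.sqrt(c)

Applied to the ML / NN / Stat counts of Table 10.8, near-identical column scores for the NN and Stat columns would reproduce the conclusion drawn here.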
10.3.1 Dominators
It is interesting to note that some algorithms always do better than the default (among
the datasets we have looked at). There are nine such: Discrim, Logdisc, SMART, k-NN,
ALLOC80, CART, Cal5, DIPOL92 and Cascade. These algorithms dominate the default
strategy. Also, in the seven datasets on which Cascade was run, ITrule is dominated by
Cascade. The only other case of an algorithm being dominated by others is Kohonen: it
Table 10.8: Datasets ordered by algorithm type. Datasets at the top are most suited to Statistical and Neural Net procedures; datasets at the bottom are most suited to Machine Learning.

Dataset   ML   NN   Stat
Heart      0    0    5
Cr.Ger     0    1    4
KL         0    2    3
Dig44      0    2    3
Vehic      0    2    3
Belg       0    2    3
Diab       0    2    3
Chrom      0    2    3
DNA        0    2    3
Satim      0    3    2
Head       1    1    3
letter     1    1    3
ISoft      1    2    2
Cr.Aus     2    1    2
Cr.Man     2    2    1
cut20      3    0    2
cut50      3    0    2
Segm       3    1    1
NewBel     4    0    1
Shutt      5    0    0
Tech       5    0    0
Tset       5    0    0
is dominated by DIPOL92, Cascade and LVQ. These comparisons do not include datasets where a result is missing (NA), so we should really say: where results are available, Kohonen is always worse than DIPOL92 and LVQ. Since we only have results for 7 Cascade trials, the comparison Cascade-Kohonen is rather meaningless.
10.4 MULTIDIMENSIONAL SCALING
It would be possible to combine the results of all the trials to rank the algorithms by overall success rate or average success rate, but not without some rather arbitrary assumptions to equate error rates with costs. We do not attempt to give such an ordering, as we believe that this is not profitable. We prefer a more objective approach based on multidimensional scaling (an equivalent procedure would be correspondence analysis). In so doing, the aim is to demonstrate the close relationships between the algorithms and, at the same time, the close similarities between many of the datasets. Multidimensional scaling has no background theory: it is an exploratory tool for suggesting relationships in data rather than testing pre-chosen hypotheses. There is no agreed criterion which tells us if the scaling is successful, although there are generally accepted guidelines.
Fig. 10.1: Multidimensional scaling representation of algorithms in the 22-dimensional space (each
dimension is an error rate or average cost measured on a given dataset). Points near to each other in
this 2-D plot are not necessarily close in 22-D.
Whether the 2-dimensional plot is a good picture of the 22-dimensional space can be judged from a comparison of the set of distances in 2-D with the set of distances in 22-D. One simple way to measure the goodness of the representation is to compare the total squared distances. Let T_2 be the total of the squared distances taken over all pairs of points in the 2-dimensional plot, and let T_22 be the total of the squared distances over all pairs of points in 22 dimensions. The stress is defined to be (T_22 - T_2) / T_22. For Figure 10.1 the stress figure is 0.266. Considering that the number of initial dimensions is very high, this is a reasonably small stress, although we should say that, conventionally, the stress is
Sec. 10.4]
Multidimensional scaling
189
said to be small when less than 0.05. With a 3-dimensional representation, the stress factor would be 0.089, indicating that it would be more sensible to think of the algorithms as differing in at least three dimensions. A three-dimensional representation would raise the prospect of representing all results in terms of three scaling coordinates, which might be interpretable as error-rates of three (perhaps notional) algorithms.
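A sketch of the scaling computation, together with the stress measure as defined above, is given below. It uses classical (metric) scaling via an eigendecomposition and assumes the algorithms' standardised error rates are held in an array X; it illustrates the idea rather than reproducing the exact procedure used here:

    import numpy as np
    from scipy.spatial.distance import pdist

    def classical_mds(X, dim=2):
        """X: (n_points, n_measurements). Returns an (n_points, dim) configuration."""
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        J = np.eye(len(X)) - 1.0 / len(X)                  # centring matrix
        w, V = np.linalg.eigh(-0.5 * J @ D2 @ J)
        idx = np.argsort(w)[::-1][:dim]
        return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

    def stress(X, Y):
        """Proportion of the total squared inter-point distance lost in the projection Y."""
        t_full, t_low = (pdist(X) ** 2).sum(), (pdist(Y) ** 2).sum()
        return (t_full - t_low) / t_full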
Because the stress figure is low relative to the number of dimensions, points near each other in Figure 10.1 probably represent algorithms that are similar in performance. For
example, the Machine Learning methods CN2, NewID and IndCART are very close to
each other, and in general, all the machine learning procedures are close in Figure 10.1.
Before jumping to the conclusion that they are indeed similar, it is as well to check the
tables of results (although the stress is low, it is not zero so the distances in Figure 10.1
are approximate only). Looking at the individual tables, the reader should see that, for
example, CN2, NewID and IndCART tend to come at about the same place in every table
apart from a few exceptions. So strong is this similarity, that one is tempted to say that
marked deviations from this general pattern should be regarded with suspicion and should
be double checked.
10.4.2 Hierarchical clustering of algorithms
Fig. 10.2: Hierarchical clustering of algorithms using standardised error rates and costs.
There is another way to look at relationships between the algorithms based on the set
of paired distances, namely by a hierarchical clustering of the algorithms. The resulting
Figure 10.2 does indeed capture known similarities (linear and logistic discriminants are
very close), and is very suggestive of other relationships.
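A corresponding sketch of the clustering step follows; the choice of average linkage and the input array are assumptions for illustration, since the text does not specify them:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import pdist

    def cluster_algorithms(X, labels):
        """X: (n_algorithms, n_datasets) standardised error rates and costs."""
        Z = linkage(pdist(X), method="average")
        dendrogram(Z, labels=labels, orientation="right")
        plt.tight_layout()
        plt.show()
        return Z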
It is to be expected that some of the similarities picked up by the clustering procedure
will be accidental. In any case, algorithms should not be declared as similar on the basis
of empirical evidence alone, and true understanding of the relationships will follow only
when theoretical grounds are found for similarities in behaviour.
Finally, we should say something about some dissimilarities. There are some surprising
errors in the clusterings of Figure 10.2. For example, CART and IndCART are attached
to slightly different clusterings. This is a major surprise, and we do have ideas on why
this is indeed true, but, nonetheless, CART and IndCART were grouped together in Tables
10.1-10.5 to facilitate comparisons between the two.
10.4.3 Scaling of datasets
The same set of re-scaled error rates may be used to give a 2-dimensional plot of datasets.
From a formal point of view, the multidimensional scaling procedure is applied to the
transpose of the matrix of re-scaled error rates. The default algorithm was excluded from
this exercise as distances from this to the other algorithms were going to dominate the
picture.
Fig. 10.3: Multidimensional scaling representation of the datasets in 23-dimensional space (each dimension is an error rate or cost achieved by a particular algorithm). The symbols ML, NN and Stat below each dataset indicate which type of algorithm achieved the lowest error-rate or cost on that dataset. Datasets near to each other in this 2-D plot are not necessarily close in 23-D.
Figure 10.3 is a multidimensional scaling representation of the error rates and costs
given in Tables 10.1-10.5. Each dataset in Tables 10.1-10.5 is described by a point in
23-dimensional space, the coordinates of which are the (scaled) error rates or costs of
the various algorithms. To help visualise the relationships between the points (datasets),
they have been projected down to 2-dimensions in such a way as to preserve their mutual
distances as much as possible. This projection is fairly successful, as the stress factor is only 0.149 (a value of 0.01 is regarded as excellent, a value of 0.05 as good). Again, a 3-dimensional representation might be more acceptable, with a stress factor of 0.063. Such a 3-D representation could be interpreted as saying that datasets differ in three essentially orthogonal ways, and is suggestive of a description of datasets using just three measures. This idea is explored further in the next subsection.
Several interesting similarities are obvious from Figure 10.3. The Costs datasets are
close to each other, as are the two types of image datasets. In addition, the credit datasets are
all at the top of the diagram (except for the German credit data which involves costs). The
two pathologically partitioned datasets Shuttle and Technical are together at the extreme
top right of the diagram.
In view of these similarities, it is tempting to classify datasets of unknown origin by
their proximities to other datasets of known provenance. For example, the Diabetes dataset
is somewhere between a partitioning type dataset (cf. credit data) and a prediction type
dataset (cf. head injury).
Interpretation of Scaling Coordinates
The plotting coordinates for the 2-dimensional description of datasets in Figure 10.3 are derived by an orthogonal transformation of the original error rates/costs. These coordinates clearly represent distinctive features of the datasets, as similar datasets are grouped together in the diagram. This suggests either that the scaling coordinates might be used as characteristics of the datasets or, equivalently, that they might be related to characteristics of the datasets. We therefore look at these coordinates and try to relate them to the dataset measures that we defined in Chapter 7. For example, it turns out that the first scaling coordinate is positively correlated with the number of examples in the dataset. In Figure 10.3 this means that there is a tendency for the larger datasets to lie to the right of the diagram. The second scaling coordinate is correlated with the curious ratio kurtosis/q, where q is the number of classes, which implies that a dataset with small kurtosis and a large number of classes will tend to lie in the bottom half of Figure 10.3. However, the correlations are quite weak, and in any case only relate to a subspace of two dimensions with a stress of 0.149, so we cannot say that these measures capture the essential differences between datasets.
Fig. 10.4: Hierarchical clustering of datasets based on standardised error rates and costs.
[Table: error rates of Logdisc (compared with DIPOL92) on the datasets with two classes.]
It seems that generally Logdisc is better than DIPOL92 for two-class problems. Knowing this, we can look back at the main tables and come to the following conclusions about the relative performance of Logdisc and DIPOL92.
Rules comparing Logdisc to DIPOL92
We can summarise our conclusions vis-a-vis logistic discrimination and DIPOL92 by the following rules, which amount to saying that DIPOL92 is usually better than Logdisc except for the cases stated.
Fig. 10.5: Hypothetical dependence of error rates on number of end nodes (and so on pruning) for
three algorithms on two datasets.
A balance must therefore be struck between conflicting criteria. One way of achieving a balance is the use of cost-complexity as a criterion, as is done by Breiman et al. (1984). This balances the complexity of the tree against the error rate, and is used in their CART procedure as a criterion for pruning the decision tree. All the decision trees in this project incorporate some kind of pruning, and the extent of pruning is controlled by a parameter. Generally, a tree that is overpruned has too high an error rate because it does not represent the full structure of the dataset, and the tree is biased. On the other hand, a tree that is not pruned has too much random variation in the allocation of examples. In between these two extremes, there is usually an optimal amount of pruning. If an investigator is prepared to spend some time trying different values of this pruning parameter, and the error-rate is tested against an independent test set, the optimal amount of pruning can be found by plotting the error rate against the pruning parameter. Equivalently, the error-rate may be plotted against the number of end nodes. Usually, the error rate drops quite quickly to its minimum value as the number of nodes increases, and then rises slowly as the number of nodes grows beyond the optimal value.
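For reference, the cost-complexity criterion of Breiman et al. (1984) mentioned above balances exactly these two quantities; in the usual notation it is

    R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|,

where R(T) is the error rate (or cost) of the tree T, |\tilde{T}| is its number of end nodes, and \alpha >= 0 is the pruning parameter; larger values of \alpha lead to more heavily pruned trees.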
The number of end nodes is an important measure of the complexity of a decision tree. If the decision tree achieves something near the optimal error-rate, the number of end nodes is also a measure of the complexity of the dataset. Although it is not to be expected that all decision trees will achieve their optimal error-rates with the same number of end-nodes, it seems reasonable that most decision trees will achieve their optimal error-rates when the number of end-nodes matches the complexity of the dataset. Considerations like these lead us to expect that the error-rates of different algorithms should be related to the numbers of end nodes in the trees they produce.
Table 10.11: Error rates and numbers of end nodes for four algorithms (AC2, Cal5, C4.5 and NewID), expressed relative to the values for C4.5, which therefore has the reference values 1.000 and 1.000.
Figure 10.6 plots these ratios, with standardised results from 15 other datasets for which we had the relevant information, using the name of the algorithm as the plotting label. Of course, each dataset will give rise to at least one point with unit error ratio and unit node ratio, but we are here concerned with the results that are not near this optimal point.
Note that Cal5 appears most frequently in the left of Figure 10.6 (where it has fewer nodes than the best algorithm) and both NewID and AC2 appear most frequently in the right of the diagram (where they have too many nodes). It would also appear that C4.5 is most likely to use the best number of nodes - and this is very indirect evidence that the amount of pruning used by C4.5 is correct on average, although this conclusion is based on a small number of datasets.
One would expect that a well-trained procedure should attain the optimal number of nodes on average, but it is clear that Cal5 is biased towards small numbers of nodes (this may be done deliberately to obtain trees with a simple structure), whereas NewID and AC2 are biased towards more complex trees. In the absence of information on the relative weights to be attached to complexity (number of nodes) or cost (error rate), we cannot say whether Cal5 has struck the right balance, but it does seem clear that NewID and AC2 often use very complex trees.
Fig. 10.6: Error rate and number of nodes for 16 datasets. Results for each dataset are scaled
separately so that the algorithm with lowest error rate on that dataset has unit error rate and unit
number of nodes.
198
Analysis of results
[Ch. 10
In order to achieve this aim, we need to determine which dataset features are relevant. After that, various instances of learning tasks can be examined with the aim of formulating a theory concerning the applicability of different machine learning and statistical algorithms.
The knowledge concerning which algorithm is applicable can be summarised in the form of rules stating that if the given dataset has certain characteristics then a particular learning algorithm may be applicable. Each rule can, in addition, be qualified by a measure indicating how reliable the rule is. Rules like this can be constructed manually, or with the help of machine learning methods on the basis of past cases. In this section we are concerned with the latter method. The process of constructing the rules represents a kind of meta-level learning.
As the number of tests was generally limited, few people have attempted to automate the
formulation of a theory concerning the applicability of different algorithms. One exception
was the work of Aha (1992) who represented this knowledge using the following rule
schemas:
IF the dataset characteristics satisfy certain conditions (expressed as bounds on measures such as the number of training instances and the number of attributes)
THEN IB1 >> C4
where IB1 >> C4 means that algorithm IB1 is predicted to have significantly higher accuracy than algorithm C4. Our approach differs from Aha's in several respects. The main difference is that we are not concerned with just a comparison between two algorithms, but rather with a group of them.
Our aim is to obtain rules which would indicate when a particular algorithm works
better than the rest. A number of interesting relationships have emerged. However, in order
to have reliable results, we would need quite an extensive set of test results, certainly much
more than the 22 datasets considered in this book.
As part of the overall aim of matching features of datasets with our past knowledge of algorithms, we need to determine which dataset features are relevant. This is not known a priori, so, for exploratory purposes, we used the reduced set of measures given in Table 10.12. This includes certain simple measures, such as the number of examples, attributes and classes, and more complex statistical and information-based measures. Some measures represent derived quantities and include, for example, measures that are ratios of other measures. These and other measures are described in Sections 7.3.1 to 7.3.3.
Table 10.12: Measures used to characterise the datasets.

Measure        Definition
Simple
  N            Number of examples
  p            Number of attributes
  q            Number of classes
  Bin.att      Number of binary attributes
  Cost         Cost matrix indicator
Statistical
  SD           Standard deviation ratio (geometric mean)
  corr.abs     Mean absolute correlation of attributes
  cancor1      First canonical correlation (7.3.2)
  fract1       Fraction of separability due to cancor1
  skewness     Skewness - mean of |E(X - mu)^3| / sigma^3
  kurtosis     Kurtosis - mean of E(X - mu)^4 / sigma^4
Information theory
  H(C)         Entropy (complexity) of class
  H(X)         Mean entropy (complexity) of attributes
  M(C,X)       Mean mutual information of class and attributes
  EN.attr      Equivalent number of attributes, H(C) / M(C,X)
  NS.ratio     Noise-signal ratio, (H(X) - M(C,X)) / M(C,X)
For each dataset, the algorithms whose error rates came close to the best error rate achieved were considered applicable to that dataset; the other algorithms were considered inapplicable. This categorisation of the test results can be seen as a preparatory step for the meta-level learning task. The categorisation will also permit us to make predictions regarding which algorithms are applicable on a new dataset.
Of course, the question of whether an error rate is high or low is rather relative. An error rate of 15% may be excellent in some domains, while 5% may be bad in others. This problem is resolved using a method similar to subset selection in statistics. First, the best algorithm is identified according to the error rates. Then an acceptable margin of tolerance is calculated. All algorithms whose error rates fall within this margin are considered applicable, while the others are labelled as inapplicable. The level of tolerance can reasonably be defined in terms of the standard deviation of the error rate, but since each algorithm achieves a different error rate, the appropriate standard deviation will vary across algorithms.
To keep things simple, we will quote the standard deviation for the error rate of the best algorithm, i.e. the one which achieves the lowest error rate. Denote the lowest error rate by E_best. Then the standard deviation is defined by

    STD = sqrt( E_best (1 - E_best) / n_test ),

where n_test is the number of examples in the test set. All algorithms whose error rates fall within the interval [E_best, E_best + k x STD] are considered applicable. Of course we still need to choose a value for k, which determines the size of the interval. This affects the confidence that the truly best algorithm appears in the group considered: the larger the value of k, the higher the confidence that the best algorithm will be in this interval.
For example, let us consider the tests on the Segmentation dataset, consisting of 2310 examples. The best algorithm appears to be ALLOC80, with an error rate of 3%. Then

    STD = sqrt( 0.03 x 0.97 / 2310 ),

which is about 0.35%. In this example we can say with high confidence that the best algorithms are in the group with error rates between 3% and 3% + k x 0.35%. For a small value of k the interval is relatively small and includes only two other algorithms (one of them BayesTree) apart from ALLOC80. All the algorithms that lie in this interval can be considered applicable to this dataset, and the others inapplicable. If we enlarge the margin, by considering larger values of k, we get a more relaxed notion of applicability (see Table 10.13).
Table 10.13: Classified test results on the Image Segmentation dataset for k = 16.

Algorithm    Error    Class
ALLOC80      .030     Appl      (margin for k = 0: 0.030)
AC2          .031     Appl
BayesTree    .033     Appl
NewID        .034     Appl
C4.5         .040     Appl
CART         .040     Appl
DIPOL92      .040     Appl
CN2          .043     Appl
IndCART      .045     Appl
LVQ          .046     Appl
SMART        .052     Appl
Backprop     .054     Appl
Cal5         .062     Appl
Kohonen      .067     Appl
RBF          .069     Appl
k-NN         .077     Appl
Logdisc      .109     Non-Appl
CASTLE       .112     Non-Appl
Discrim      .116     Non-Appl
Quadisc      .157     Non-Appl
Bayes        .265     Non-Appl
ITrule       .455     Non-Appl
Default      .900     Non-Appl
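The margin calculation is easy to mechanise. The sketch below is our own illustration (the function and variable names are not from the project software); it reproduces the Segmentation figures quoted above for a chosen value of k:

    import math

    def label_applicability(error_rates, n_test, k):
        """error_rates: {algorithm: error rate on one dataset}; n_test: test set size;
        k: width of the margin in standard deviations of the best error rate."""
        best = min(error_rates.values())
        std = math.sqrt(best * (1.0 - best) / n_test)
        margin = best + k * std
        return margin, {a: ("Appl" if e <= margin else "Non-Appl")
                        for a, e in error_rates.items()}

    # Segmentation example: best error 0.030 on 2310 examples gives std of about 0.0035,
    # so with k = 16 the margin is roughly 0.086, reproducing the split of Table 10.13.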
The decision as to where to draw the line (by choosing a value for k) is, of course, rather subjective. In this work we had to consider an additional constraint related to the purpose we had in mind. As our objective is to generate rules concerning the applicability of algorithms, we have opted for the more relaxed scheme of applicability (k = 8 or 16), so as to have enough examples in each class (Appl, Non-Appl).
Some of the test results analysed are not characterised using error rates, but rather costs. Consequently the notion of error margin discussed earlier has to be adapted to costs. The standard error of the mean cost can be calculated from the confusion matrices (obtained by testing) and the cost matrix. The values obtained for the leading algorithm in the three relevant datasets were:
Dataset          Algorithm    Mean cost
German credit    Discrim      0.525
Heart disease    Discrim      0.415
Head injury      Logdisc      18.644
In the experiments reported later the error margin was simply set to the values 0.0327,
0.0688 and 1.3523 respectively, irrespective of the algorithm used.
Joining data relative to one algorithm
The problem of learning was divided into several phases. In each phase all the test results relative to just one particular algorithm (for example, CART) were joined, while all the other results (relative to other algorithms) were temporarily ignored. The purpose of this strategy was to simplify the class structure. For each algorithm we would have just two classes (Appl and Non-Appl). This strategy worked better than the obvious solution that included all available data for training. For example, when considering the CART algorithm and a particular choice of margin, we get the scheme illustrated in Figure 10.7.
CART-Appl:     Satim, Vehic, Head, Heart, Belg, Segm, Diab, Cr.Ger, Cr.Aust, DNA, BelgII, Faults, Tsetse
CART-Non-Appl: KL, Dig44, Chrom, Shut, Tech, Cut, Cr.Man, Letter
Fig. 10.7: Classified test results relative to one particular algorithm (CART).
The classified test results are then modified as follows. The dataset name is simply substituted by a vector containing the corresponding dataset characteristics. Values which are not available or missing are simply represented by "?". This extended dataset is then used in the meta-level learning.
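In code, this substitution amounts to joining the Appl / Non-Appl labels with the table of dataset measures. A minimal sketch follows; the measure values shown are made up purely for illustration:

    # Meta-level training examples for one algorithm (here CART): each example is
    # the dataset's measures (Table 10.12) plus the Appl / Non-Appl label, with
    # missing measures recorded as "?" as described in the text.
    measures = {
        "Satim": {"N": 6435, "p": 36, "q": 6, "Skew": 0.7},   # illustrative values only
        "KL":    {"N": 18000, "p": 40, "q": 10, "Skew": "?"},
    }
    cart_class = {"Satim": "CART-Appl", "KL": "CART-Non-Appl"}
    meta_examples = [dict(measures[d], Class=cart_class[d]) for d in cart_class]
    # meta_examples can now be passed to a decision-tree learner such as C4.5.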
Choice of algorithm for learning
A question arises as to which algorithm we should use in the process of meta-level learning. We have decided to use C4.5 for the following reasons. First, as our results have
demonstrated, this algorithm achieves quite good results overall. Secondly, the decision tree generated by C4.5 can be inspected and analysed, which is not the case with some statistical and neural learning algorithms.
So, for example, when C4.5 was supplied with the partial test results relative to the CART algorithm, it generated the decision tree in Figure 10.8, which tests the number of examples (N) and the skew statistic. The figures that appear on
the right hand side of each leaf are either of the form (N) or (N/E), where N represents
the total number of examples satisfying the conditions of the associated branch, and E
the number of examples of other classes that have been erroneously covered. If the data
contains unknown values, the numbers N and E may be fractional.
It has been argued that rules are more legible than trees. The decision tree shown earlier
can be transformed into a rule form using a very simple process, where each branch of a
tree is simply transcribed as a rule. The applicability of CART can thus be characterised
using the rules in Figure 10.9.
    IF  N <= 6435  AND  Skew > 0.57     THEN  CART-Appl
    IF  N >  6435                       THEN  CART-Non-Appl
    IF  N <= 6435  AND  Skew <= 0.57    THEN  CART-Non-Appl

Fig. 10.9: Rules characterising the applicability of CART.
Quinlan (1993) has argued that rules obtained from decision trees can be improved upon in various ways. For example, it is possible to eliminate conditions that are irrelevant, or even to drop entire rules that are irrelevant or incorrect. In addition it is possible to reorder the rules according to certain criteria and to introduce a default rule to cover the cases that have not been covered. The program C4.5 includes a command that permits the user to transform a decision tree into such a rule set. The rules produced by the system are characterised using (pessimistic) error rate estimates.
As is shown in the next section, error rate (or its estimate) is not an ideal measure,
however. This is particularly evident when dealing with continuous classes. This problem
has motivated us to undertake a separate evaluation of all candidate rules and characterise
them using a new measure. The aim is to identify those rules that appear to be most
informative.
10.6.3 Characterizing predictive power
The rules concerning applicability of a particular algorithm were generated on the basis of
only about 22 examples (each case represents the results of particular test on a particular
dataset). Of these, only a part represented positive examples, corresponding to the
datasets on which the particular algorithm performed well. This is rather a modest number.
Also, the set of dataset descriptors used may not be optimal. We could thus expect that the
rules generated capture a mixture of relevant and fortuitous regularities.
In order to strengthen our confidence in the results we have decided to evaluate the rules generated. Our aim was to determine whether the rules could actually be used to make useful predictions concerning the algorithm's applicability. We have adopted a leave-one-out procedure and applied it to datasets such as the one shown in Table 10.13.
Following this procedure, we used all but one of the items in training, while the remaining item was used for testing. Of course, the set of rules generated in each pass could be slightly different, but the form of the rules was not our primary interest here. We were interested to verify how successful the rules were in predicting the applicability (or non-applicability) of the algorithm.
Let us analyse an example. Consider, for example, the problem of predicting the
applicability of CART. This can be characterised using confusion matrices, such as the
ones shown in Figure 10.10, showing results relative to the error margin k=16. Note that
an extra (simulated) dataset has been used in the following calculations and tables, which
is why the sum is now 22.
             Appl    Non-Appl
Appl          11         2
Non-Appl       1         8
Fig. 10.10: Evaluation of the meta-rules concerning applicability of CART. The rows represent the
true class, and the columns the predicted class.
The confusion matrix shows that the rules generated were capable of correctly predicting the applicability of CART on an unseen dataset in 11 cases, with an incorrect prediction in only 1 case. Similarly, if we consider non-applicability, we see that a correct prediction is made in 8 cases and an incorrect one in 2. This gives a rather good overall success rate of 86%.
We notice that the success rate is not an ideal measure, however. As the margin of applicability is extended (by making k larger), more cases will get classified as applicable. If we consider an extreme case, in which the margin covers all algorithms, we will get an apparent success rate of 100%. Of course we are not interested in such a useless procedure!
This apparent paradox can be resolved by adopting the measure called information score (IS) (Kononenko & Bratko, 1991) in the evaluation. This measure takes into account the prior probabilities. The information score associated with a definite positive classification into class C is defined as -log2 P(C), where P(C) represents the prior probability of class C. The information scores can be used to weight all classifier answers. In our case we have two classes, Appl and Non-Appl. The weights can be represented conveniently in the form of an information score matrix, as shown in Figure 10.11.
                 Predicted Appl        Predicted Non-Appl
True Appl        -log2 P(Appl)           log2 P(Non-Appl)
True Non-Appl     log2 P(Appl)          -log2 P(Non-Appl)
Fig. 10.11: Information Score Matrix. The rows represent the true class, and the columns the
predicted class.
The information scores can be used to calculate the total information provided by a rule
on the given dataset. This can be done simply by multiplying each element of the confusion
matrix by the corresponding element of the information score matrix.
The quantities P(Appl) and P(Non-Appl) can be estimated from the appropriate frequencies. If we consider the frequencies of Appl and Non-Appl over all algorithms (irrespective of the algorithm in question), we get a kind of absolute reference point. This enables us to make comparisons right across different algorithms.
For example, for the value of -log2 P(Appl) we consider a dataset consisting of 506 cases (23 algorithms x 22 datasets). As it happens, 307 cases fall into the class Appl, so the information associated with Appl is -log2(307/506) = 0.721 bits. Similarly, the value of -log2 P(Non-Appl) is -log2(199/506) = 1.346 bits.
We notice that, due to the distribution of this data (given by the relatively large margin of applicability, k = 16), examples of applicable cases are relatively common. Consequently, the information concerning applicability has a somewhat smaller weight (.721) than the information concerning non-applicability (1.346).
If we multiply the elements of the confusion matrix for CART by the corresponding elements of the information score matrix, we get the matrix shown in Figure 10.12.
             Appl     Non-Appl
Appl         7.93        2.69
Non-Appl     0.72       10.77
Fig. 10.12: Adjusted confusion matrix for CART. The rows represent the true class, and the columns
the predicted class.
This matrix is in a way similar to the confusion matrix shown earlier with the exception
that the error counts have been weighted by the appropriate information scores. To obtain
an estimate of the average information relative to one case, we need to divide all elements
by the number of cases considered (i.e. 22). This way we get the scaled matrix in Figure
10.13.
             Appl     Non-Appl
Appl         0.360       0.122
Non-Appl     0.033       0.489
Fig. 10.13: Rescaled adjusted confusion matrix for CART.
The information provided by the classification of Appl is 0.360 - 0.033 = 0.327 bits. The information provided by the classification of Non-Appl is similarly 0.489 - 0.122 = 0.367 bits.
The information obtained in this manner can be compared to the information provided by a default rule. This can be calculated quite simply. First we need to decide whether the algorithm should be applicable or non-applicable by default: we just look for the classification which provides us with the highest information. If we consider the previous example, the class Appl is the correct default for CART. This is because the information associated with this default, (13 - 9) x 0.721 / 22 = 0.131 bits, is greater than the information associated with the converse rule (i.e. that
CART is Non-Appl).
How can we decide whether the rules involved in the classification are actually useful? This is quite straightforward. A rule can be considered useful if it provides us with more information than the default. If we come back to our example, we see that the classification for Appl provides us with .327 bits, while the default classification provides only .131 bits. This indicates that the rules used in the classification are more informative than the default. In consequence, the actual rule should be kept and the default rule discarded.
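The whole evaluation can be reproduced in a few lines. The sketch below is our own; it recomputes the CART figures quoted above from the confusion matrix and the overall Appl frequency:

    import math

    def information_scores(confusion, n_appl, n_total):
        """confusion: {(true, predicted): count}; n_appl / n_total estimate P(Appl)."""
        w = {"Appl": -math.log2(n_appl / n_total),          # 0.721 bits here
             "Non-Appl": -math.log2(1 - n_appl / n_total)}  # 1.346 bits here
        n_cases = sum(confusion.values())
        info = {}
        for cls in ("Appl", "Non-Appl"):
            correct = confusion.get((cls, cls), 0)
            wrong = sum(c for (t, p), c in confusion.items() if p == cls and t != cls)
            info[cls] = (correct - wrong) * w[cls] / n_cases
        return info

    cart = {("Appl", "Appl"): 11, ("Appl", "Non-Appl"): 2,
            ("Non-Appl", "Appl"): 1, ("Non-Appl", "Non-Appl"): 8}
    print(information_scores(cart, n_appl=307, n_total=506))
    # -> about {'Appl': 0.33, 'Non-Appl': 0.37}, matching the .327 and .367 bits above.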
10.6.4 Rules generated in metalevel learning
Figure 10.14 contains some rules generated using the method described. As we have not used a uniform notion of applicability throughout, each rule is qualified by additional information. The symbol Appl represents the concept of applicability derived on the basis of the best error rate. In the case of Appl with k = 16, the interval of applicability is [best error rate, best error rate + 16 STDs] and the interval of non-applicability is [best error rate + 16 STDs, 1].
Each rule also shows the information score. This parameter gives an estimate of the usefulness of each rule. The rules presented could be supplemented by another set generated on the basis of the worst error rate (i.e. the error rate associated with the choice of the most common class, or worse). In that case the interval of applicability is [best error rate, default error rate - 8 STDs] and the interval of non-applicability is [default error rate - 8 STDs, 1].
The set of rules generated includes a number of default rules, which can be easily recognised (they have no conditions attached). Only those rules that could be considered minimally useful (with a normalised information score above a minimum threshold) have been included here. All rules for CART are also shown, as these were discussed earlier. In the implemented system we use a few more rules which are a bit less informative (with information scores down to a lower threshold).
Discussion
The problem of learning rules for all algorithms simultaneously is formidable. We want to obtain a sufficient number of rules to qualify each algorithm. To limit the complexity of the problem we have considered one algorithm at a time, which facilitated the construction of rules. Considering that the problem is difficult, what confidence can we have that the rules generated are minimally sensible?
One possibility is to try to evaluate the rules, by checking whether they are capable of giving useful predictions. This is what we have done in one of the earlier sections. Note that measuring simply the success rate has the disadvantage that it does not distinguish between predictions that are easy to make and those that are more difficult. This is why we have evaluated the rules by examining how informative they are in general.
For example, if we examine the rules for the applicability of CART we observe that the rules provide us with useful information when invoked. These measures indicate that the rules generated can indeed provide us with useful information.
Instead of evaluating rules in the way shown, we could present them to an expert to see whether he would find them minimally sensible. At a quick glance the condition N <= 6435
[Fig. 10.14: rules generated in meta-level learning, each qualified by its normalised information score. The surviving rule heads cover the statistical algorithms (Discrim, Quadisc, Logdisc, ALLOC80, k-NN, Bayes, BayTree and CASTLE, in both Appl and Non-Appl form), with information scores ranging from .186 to .917; the rule conditions are expressed in terms of the dataset measures of Table 10.12 (for example bounds on N, k, Cost and Bin.att).]
is a bit puzzling. Why should CART perform reasonably well, if the number of examples
is less than this number?
Obviously, as the rules were generated on the basis of a relatively small number of
examples, the rules could contain some fortuitous features. Of course, unless we have more
data available it is difficult to point out which features are or are not relevant. However, it
is necessary to note that the condition N < 6435 is not an absolute one. Rules should
not be interpreted simply as "The algorithm performs well if such and such a condition is
satisfied." The correct interpretation is something like "The algorithm is likely to compete
well under the conditions stated, provided no other more informative rule applies." This
view also helps to understand better the rule for the Discrim algorithm generated by the system.
    Discrim-Appl  <-  N < 1000
The condition N < 1000 does not express all the conditions of applicability of the algorithm
Discrim, and could appear rather strange. However, the condition does make sense. Some
algorithms have a faster learning rate than others. These algorithms compete well with
others, provided the number of examples is small. The fast learning algorithms may
however be overtaken by others later. Experiments with learning curves on the Satellite
Image dataset show that the Discrim algorithm is among the first six algorithms in terms
of error rate as long as the number of examples is relatively small (100, 200, etc.). This
algorithm seems to pick up quickly what is relevant and so, we could say, it competes
well under these conditions. When the number of examples is larger, however, Discrim is
overtaken by other algorithms. With the full training set of 6400 examples Discrim is in
19th place in the ranking. This is consistent with the rule generated by our system. The
condition generated by the system is not so puzzling as it seems at first glance!
There is of course a well recognised problem that should be tackled. Many conditions
contain numeric tests which are either true or false. It does not make sense to consider the
Discrim algorithm applicable if the number of examples is less than 1000, and inapplicable
if this number is just a bit more. A more flexible approach is needed (for example using
flexible matching).
The information scores of the rules that recommend
an algorithm are taken with a positive sign, the others with a negative one. For example, if
we get a recommendation to apply an algorithm with an indication that this is apparently worth 0.5
bits, and if we also get an opposite recommendation (i.e. not to apply this algorithm)
with an indication that this is worth 0.2 bits, we will go ahead with the recommendation,
but decrease the information score accordingly (i.e. to 0.3 bits).
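A minimal sketch of this combination step (the function and argument names are illustrative, not taken from the Application Assistant itself):

    def net_information_score(fired_rules):
        # fired_rules is a list of (recommends_apply, information_score) pairs for
        # one algorithm; scores of "apply" rules count positively, scores of
        # "do not apply" rules negatively
        return sum(score if recommends_apply else -score
                   for recommends_apply, score in fired_rules)

    # the example from the text: +0.5 bits for applying, -0.2 bits against,
    # giving a net recommendation of about 0.3 bits
    print(net_information_score([(True, 0.5), (False, 0.2)]))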
The output of this phase is a list of algorithms accompanied by their associated overall
information scores. A positive score can be interpreted as an argument to apply the
algorithm. A negative score can be interpreted as an argument against the application of the
algorithm. Moreover, the higher the score, the more informative is the recommendation in
general. The information score can be then considered as a strength of the recommendation.
The recommendations given are of course not perfect. They do not guarantee that
the first algorithm in the recommendation ordering will have the best performance in
reality. However, our results demonstrate that the algorithms accompanied by a strong
recommendation do perform quite well in general. The opposite is also true. The algorithms
that have not been recommended have a poorer performance in general. In other words, we
observe that there is a reasonable degree of correlation between the recommendation and
the actual test results. This is illustrated in Figure 10.15 which shows the recommendations
generated for one particular dataset (Letters).
[Figure, flattened in extraction: each algorithm's success rate plotted against the information score (in bits) of its recommendation.]
Fig. 10.15: Recommendations of the Application Assistant for the Letters dataset.
The recommendations were generated on the basis of a rule set similar to the one
shown in Figure 10.14 (the rule set included just a few more rules with lower information
scores).
The top part shows the algorithms with high success rates. The algorithms on the right
are accompanied by a strong recommendation concerning applicability. We notice that
several algorithms with high success rates appear there. The algorithm that is accompanied
by the strongest recommendation for this dataset is ALLOC80 (Information Score = 0.663
bits). This algorithm also has the highest success rate, of 93.6%. The second place in the
ordering of algorithms recommended is shared by k-NN and DIPOL92. We note that
k-NN is a very good choice, while DIPOL92 is not too bad either.
The correlation between the information score and success rate could, of course, be
better. The algorithm CASTLE is given somewhat too much weight, while BayTree, which
is near the top, is somewhat undervalued. The correlation could be improved, in the first
place, by obtaining more test results. The results could also be improved by incorporating
a better method for combining rules and the corresponding information scores. It would be
beneficial to consider also other potentially useful sets of rules, including the ones generated
on the basis of other values of k, or even different categorisation schemes. For example, all
algorithms with a performance near the default rule could be considered non-applicable,
while all others could be classified as applicable.
Despite the fact that there is room for possible improvements, the Application Assistant
seems to produce promising results. The user can get a recommendation as to which
algorithm could be used with a new dataset. Although the recommendation is not guaranteed
always to give the best possible advice, it narrows down the user's choice.
10.6.6 Criticism of metalevel learning approach
Before accepting any rules, generated by C4.5 or otherwise, it is wise to check them
against known theoretical and empirical facts. The rules generated in metalevel learning
could contain spurious rules with no foundation in theory. If the rule-based approach has
shortcomings, how should we proceed? Would it be better to use another classification
scheme in place of the metalevel learning approach using C4.5? As there are insufficient
data to construct the rules, the answer is probably to use an interactive method, capable of
incorporating prior expert knowledge (background knowledge). As one simple example,
if it is known that an algorithm can handle cost matrices, this could simply be provided to
the system. As another example, the knowledge that the behaviour of NewID and AC2 is
likely to be similar could also be useful to the system. The rules for AC2 could then be
constructed from the rules for NewID, by adding suitable conditions concerning, for example,
the hierarchical structure of the attributes. Also, some algorithms have inbuilt checks on
applicability, such as linear or quadratic discriminants, and these should be incorporated
into the learnt rules.
10.7 PREDICTION OF PERFORMANCE
[Table 10.14, flattened in extraction: for each algorithm it listed the coefficients of a regression of that algorithm's error rates on the error rates of reference algorithms (apparently Discrim, k-NN and a decision tree), together with the R-squared value of the fit and the number of trials on which the regression was based.]
The Discrim coefficient of 0.79 in the Logdisc example shows that Logdisc is generally
about 21% more accurate than Discrim, and that the performance of the other two reference
methods does not seem to help in the prediction. With an R-squared value of 0.921, we can
be quite confident that Logdisc does better than Discrim. This result should be qualified
with information on the number of attributes, normality of variables, etc., and these are
quantities that can be measured. In the context of deciding if further trials on additional
algorithms are necessary, take the example of the shuttle dataset, and consider what action
to recommend after discovering Discrim = 4.83%, IndCART = 0.09% and k-NN = 0.44%.
It does not look as if the error rates of either Logdisc or SMART will get anywhere near
IndCART's value, and the best prospect of improvement lies in the decision tree methods.
Consider DIPOL92 now. There appears to be no really good predictor, as the R-squared
value is relatively small (0.533). This means that DIPOL92 is doing something outside
the scope of the three reference algorithms. The best single predictor is DIPOL92 =
0.29 × Discrim, apparently indicating that DIPOL92 is usually much better than Discrim
(although not so much better that it would challenge IndCART's good value for the shuttle
dataset). This formula just cannot be true in general, however: all we can say is that,
for datasets around the size of those in StatLog, DIPOL92 has error rates about one third that of
Discrim, but considerable fluctuation around this value is possible. If we have available the
three reference results, the formula would suggest that DIPOL92 should be tried unless
either k-NN or CART gets an error rate much lower than a third of Discrim's. Knowing
the structure of DIPOL92 we can predict a good deal more however. When there are just
two classes (and 9 of our datasets were 2-class problems), and if DIPOL92 does not use
clustering, DIPOL92 is very similar indeed to logistic regression (they optimise on slightly
different criteria). So the best predictor for DIPOL92 in 2-class problems with no obvious
clustering will be Logdisc. At the other extreme, if many clusters are used in the initial
stages of DIPOL92, then the performance is bound to approach that of, say, radial basis
functions or LVQ.
Also, while on the subject of giving explanations for differences in behaviour, consider
the performance of ALLOC80 compared to k-NN. From Table 10.14 it is clear that ALLOC80 usually outperforms k-NN. The reason is probably the mechanism within
ALLOC80 whereby irrelevant attributes are dropped, or perhaps the fact that a surrogate was
substituted for ALLOC80 when it performed badly. If such strategies were instituted for
k-NN, it is probable that their performances would be even closer.
Finally, we should warn that such rules should be treated with great caution, as we have
already suggested in connection with the rules derived by C4.5. It is especially dangerous
to draw conclusions from incomplete data, as with CART for example, for the reason
that a "Not Available" result is very likely associated with factors leading to high error
rates, such as inability to cope with large numbers of categories, or with large amounts of data.
Empirical rules such as those we have put forward should be refined by the inclusion of
other factors in the regression, these other factors being directly related to known properties
of the algorithm. For example, to predict Quadisc, a term involving the measure SD ratio
would be required (if that is not too circular an argument).
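To illustrate how such a regression might be applied in practice, here is a hedged sketch; the function is hypothetical, the 0.79 Discrim coefficient is the one quoted above for Logdisc, the zero coefficients are placeholders, and the unit-interval rescaling used in the trials is ignored.

    def predict_error_rate(reference_errors, coefficients, intercept=0.0):
        # estimate a new algorithm's error rate on a dataset as a linear combination
        # of the error rates of reference algorithms already run on that dataset
        return intercept + sum(coefficients[name] * err
                               for name, err in reference_errors.items())

    # hypothetical use for the shuttle example above, with only a Discrim term
    logdisc_guess = predict_error_rate(
        {"Discrim": 0.0483, "k-NN": 0.0044, "IndCART": 0.0009},
        {"Discrim": 0.79, "k-NN": 0.0, "IndCART": 0.0})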
10.7.1 ML on ML vs. regression
Two methods have been given above for predicting the performance of algorithms, based
respectively on rule-based advice using dataset measures (ML on ML) and comparison with
reference algorithms (regression). It is difficult to compare directly the success rates of the
respective predictions, as the former is stated in terms of proportion of correct predictions
and the latter in terms of squared correlation. We now give a simple method of comparing
the predictability of performance from the two techniques. The R-squared value from the
regression and the error rate of the C4.5 generated rules can be compared by the following
formula, which is based on the assumption of equal numbers of Non-Appl and Appl cases:
[formula lost in extraction]
11
Conclusions
D. Michie (1), D. J. Spiegelhalter (2) and C. C. Taylor (3)
(1) University of Strathclyde, (2) MRC Biostatistics Unit, Cambridge and (3) University
of Leeds
11.1 INTRODUCTION
In this chapter we try to draw together the evidence of the comparative trials and subsequent
analyses, comment on the experiences of the users of the algorithms, and suggest topics and
areas which need further work. We begin with some comments on each of the methods. It
should be noted here that our comments are often directed towards a specific implementation
of a method rather than the method per se. In some instances the slowness or otherwise
poor performance of an algorithm is due at least in part to the lack of sophistication of the
program. In addition to the potential weakness of the programmer, there is the potential
inexperience of the user. To give an example, the trials of AC2 reported in previous
chapters were based on a version programmed in LISP. A version is now available in the
C language which cuts the CPU time by a factor of 10. In terms of error rates, observed
differences in goodness of result can arise from
1.
2.
3.
4.
The stronger a program is in respect of 2, the better it is buffered against shortcomings in
3. Alternatively, if there are no options to select or parameters to tune, then item 3 is not
important.
We give a general view of the ease-of-use and the suitable applications of the algorithms.
Some of the properties are subject to different interpretations. For example, in general a
decision tree is considered to be less easy to understand than decision rules. However, both
are much easier to understand than a regression formula which contains only coefficients,
and some algorithms do not give any easily summarised rule at all (for example, k-NN).
Address for correspondence: Department of Statistics, University of Leeds, Leeds LS2 9JT, U.K.
The remaining sections discuss more general issues that have been raised in the trials,
such as time and memory requirements, the use of cost matrices and general warnings on
the interpretation of our results.
11.1.1 User's guide to programs
Here we tabulate some measures to summarise each algorithm. Some are subjective
quantities based on the users' perception of the programs used in StatLog, and may not hold
for other implementations of the method. For example, many of the classical statistical
algorithms can handle missing values, whereas those used in this project could not. This
would necessitate a front-end to replace missing values before running the algorithm.
Similarly, all of these programs should be able to incorporate costs into their classification
procedure, yet some of them have not. In Table 11.1 we give information on various basic
capabilities of each algorithm.
11.2 STATISTICAL ALGORITHMS
11.2.1 Discriminants
It can fairly be said that the performance of linear and quadratic discriminants was exactly
as might be predicted on the basis of theory. When there was sufficient data and the class
covariance matrices were quite dissimilar, the quadratic discriminant did better, although at the
expense of some computational cost. Several practical problems remain however:
1. the problem of deleting attributes if they do not contribute usefully to the discrimination
between classes (see McLachlan, 1992);
2. the desirability of transforming the data; and the possibility of including some quadratic
terms in the linear discriminant as a compromise between pure linear and quadratic
discrimination (a brief sketch of this compromise follows below). Much work needs to be done in this area.
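One hedged illustration of the compromise mentioned in item 2 (not the StatLog implementation) is simply to augment the attributes with their squares and fit an ordinary two-class Fisher discriminant to the expanded set:

    import numpy as np

    def add_quadratic_terms(X):
        # append squared attributes so a linear discriminant on the expanded set
        # can capture some curvature without a full quadratic discriminant
        return np.hstack([X, X ** 2])

    def fisher_direction(X, y):
        # two-class Fisher discriminant direction w = Sw^{-1} (m1 - m0),
        # where Sw is the pooled within-class scatter and m0, m1 the class means
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
        return np.linalg.solve(Sw, m1 - m0)

Cross-product terms could be added in the same way; the point is only that the classifier remains linear in the expanded attribute set.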
We found that there was little practical difference in the performance of ordinary and logistic
discrimination. This has been observed before - Fienberg (1980) quotes an example where
the superiority of logistic regression over discriminant analysis is slight - and is related
to the well-known fact that different link functions in generalised linear models often fit
empirical data equally well, especially in the region near classification boundaries where
the curvature of the probability surface may be negligible. McLachlan (1992) quotes
several empirical studies in which the allocation performance of logistic regression was
very similar to that of linear discriminants.
In view of the much greater computational burden required, the advice must be to use
linear or quadratic discriminants for large datasets. The situation may well be different for
small datasets.
11.2.2 ALLOC80
This algorithm was never intended for the size of datasets considered in this book, and it
often failed on the larger datasets with no adequate diagnostics. It can accept attribute
data with both numeric and logical values and in this respect appears superior to the
other statistical algorithms. The cross-validation methods for parameter selection are
too cumbersome for these larger datasets, although in principle they should work. An
outstanding problem here is to choose good smoothing parameters - this program uses a
multiplicative kernel, which may be rather inflexible if some of the attributes are highly
correlated. Fukunaga (1990) suggests a pre-whitening of the data, which is equivalent to
using a multivariate normal kernel with parameters estimated from the sample covariance
matrix. This method has shown promise in some of the datasets here, although it is not
very robust, and of course it still needs smoothing parameter choices.

[Table 11.1, flattened in extraction, recorded for each algorithm: whether missing values are accepted (MV: Y/N); whether costs are incorporated (Cost: apparently LT = in learning and testing, T = in testing only, N = not at all); interpretability and comprehensibility ratings on a 1-5 scale (Interp., Compreh.); ease of parameter handling on a 1-5 scale (Params); user-friendliness (User-fr.: Y/N); and the data types accepted (Data: N, NC or NCH - apparently numeric only, numeric and categorical, or numeric, categorical and hierarchical).]
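Returning to the pre-whitening suggestion above, a minimal sketch (assuming numeric attributes and a non-singular sample covariance matrix) is:

    import numpy as np

    def prewhiten(X):
        # transform the data so that its sample covariance becomes the identity;
        # a product (multiplicative) kernel on the whitened data then behaves like
        # a multivariate normal kernel with covariance estimated from the sample
        Xc = X - X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # inverse square root
        return Xc @ W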
ALLOC80 has a slightly lower error-rate than k-NN, and uses marginally less storage,
but takes about twice as long in training and testing (and k-NN is already a very slow
algorithm). However, since k-NN was always set to k = 1, this may not generally be true.
Indeed, the nearest neighbour method with k = 1 is a special case of the kernel method, so ALLOC80 should be expected to do better.
In addition, for those not fluent with the statistical background there is generally little
indication of why it has classified some examples in one class or the other.
11.2.5 Naive Bayes
Theory indicates, and our experience confirms, that Naive Bayes does best if the
attributes are conditionally independent given the class. This seems to hold true for many
medical datasets. One reason for this might be that doctors gather as many different
(independent) bits of relevant information as possible, but they do not include two
attributes where one would do. For example, it could be that only one measure of high
blood pressure (say diastolic) would be quoted although two (diastolic and systolic) would
be available.
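As a reminder of what the conditional independence assumption buys, here is a bare-bones sketch for categorical attributes (all names are illustrative, and the probability smoothing a real implementation needs is omitted):

    from collections import Counter, defaultdict

    def naive_bayes_train(examples, labels):
        # estimate P(class) and P(attribute value | class) by simple counting
        class_counts = Counter(labels)
        value_counts = defaultdict(Counter)      # (attribute index, class) -> counts
        for x, c in zip(examples, labels):
            for i, v in enumerate(x):
                value_counts[(i, c)][v] += 1
        return class_counts, value_counts

    def naive_bayes_classify(x, class_counts, value_counts):
        # score each class by P(class) * product over attributes of P(x_i | class)
        n = sum(class_counts.values())
        def score(c):
            s = class_counts[c] / n
            for i, v in enumerate(x):
                s *= value_counts[(i, c)][v] / class_counts[c]
            return s
        return max(class_counts, key=score)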
11.2.6 CASTLE
In essence CASTLE is a full Bayesian modelling algorithm, i.e. it builds a comprehensive
probabilistic model of the events (in this case attributes) of the empirical data. It can be used
to infer the probability of attributes as well as classes given the values of other attributes.
The main reason for using CASTLE is that the polytree models the whole structure of the
data, and no special role is given to the variable being predicted, viz. the class of the object.
However instructive this may be, it is not the principal task in the above trials (which is to
produce a classification procedure). So maybe there should be an option in CASTLE to
produce a polytree which classifies rather than fits all the variables. To emphasise the point,
it is easy to deflect the polytree algorithm by making it fit irrelevant bits of the tree (that are
strongly related to each other but are irrelevant to classification). CASTLE can normally
be used in both interactive and batch modes. It accepts any data described in probabilities
and events, including descriptions of attributes-and-class pairs of data such as that used
here. However, all attributes and classes must be discretised to categorical or logical data.
The results of CASTLE are in the form of a (Bayesian) polytree that provides a graphical
explanation of the probabilistic relationships between attributes and classes. Thus it is
better, in terms of comprehensibility, than some of the other statistical algorithms in
its explanation of these relationships.
The performance of CASTLE should be related to how tree-like the dataset is. A
major criticism of CASTLE is that there is no internal measure that tells us how closely the
empirical data are fitted by the chosen polytree. We recommend that any future implementation of CASTLE incorporates such a measure. It should be straightforward to
build a goodness-of-fit measure into CASTLE based on a standard test.
As a classifier, CASTLE did best on the credit datasets where, generally, only a few
attributes are important, but its most useful feature is the ability to produce simple models
of the data. Unfortunately, simple models fitted only a few of our datasets.
11.3 DECISION TREES
There is a confusing diversity of Decision Tree algorithms, but they all seem to perform
at about the same level. Five of the decision trees (AC2, NewID, Cal5, C4.5, IndCART)
considered in this book are similar in structure to the original ID3 algorithm, with partitions
being made by splitting on an attribute, and with an entropy measure for the splits. There
are no indications that this or that splitting criterion is best, but the case for using some
kind of pruning is overwhelming, although, again, our results are too limited to say exactly
how much pruning to use. It was hoped to relate the performance of a decision tree to
some measures of complexity and pruning, specifically the average depth of the tree and the
number of terminal nodes (leaves). In a sense CART's cost-complexity pruning automates
this. Cal5 generally has much fewer nodes, so gives a simpler tree. AC2 generally has
many more nodes, and occasionally scores a success because of that. The fact that all
the decision trees perform at the same accuracy with such different pruning procedures
suggests that much work needs to be done on the question of how many nodes to use.
On the basis of our trials on the Tsetse fly data and the segmentation data, we speculate
that Decision Tree methods will work well compared to classical statistical methods when
the data are multimodal. Their success in the shuttle and technical datasets is due to the
special structure of these datasets. In the case of the technical dataset, observations were
partly pre-classified by the use of a decision tree, and in the shuttle dataset we believe that
this may also be so, although we have been unable to obtain confirmation from the data
provider.
Among the decision trees, the IndCART, CART and Cal5 methods emerge as superior to
the others because they incorporate costs into decisions. Both CART and IndCART can deal
with categorical variables, and CART has an important additional feature in that it has a
systematic method for dealing with missing values. However, for the larger datasets the
commercial package CART often failed where the IndCART implementation did not. In
common with all other decision trees, CART, IndCART and Cal5 have the advantage of
being distribution free.
11.3.1 AC2 and NewID
NewID and AC2 are direct descendants of ID3 and, empirically, their performance as
classifiers is very close. The main reason for choosing AC2 would be to use other aspects
of the AC2 package, for example the interactive graphical interface and the possibility of
incorporating prior knowledge about the dataset, in particular certain forms of hierarchical
structure; see Chapter 12. We looked at one dataset that was hierarchical in nature, in
which AC2 showed considerable advantage over other methods - see Section 9.5.7.
NewID is based on Ross Quinlan's original ID3 program, which generates decision
trees from examples. It is similar to CN2 in its interface and command system. Like
CN2, NewID can be used in both interactive and batch mode. The interactive mode is
its native mode; to run in batch mode users need to write a Unix shell script, as for
CN2. NewID accepts attribute-value data sets with both logical and numeric data. NewID
has a post-pruning facility that is used to deal with noise. It can also deal with unknown
values. NewID outputs a confusion matrix. But this confusion matrix must be used with
care because the matrix has an extra row and column for unclassified examples (some
examples are not classified by the decision tree). It does not accept or incorporate a cost
matrix.
AC2 is an extension of the ID3 style of decision tree classifiers to learn structures from a
predefined hierarchy of attributes. Similarly to ID3 it uses an attribute-value based format
for examples with both logical and numeric data. Because of its hierarchical representation
it can also encode some relations between attribute values. It can be run in interactive mode
and data can be edited visually under its user interface. AC2 uses an internal format that is
different from the usual format, mainly due to the need to express hierarchical attributes
when there are such. But for non-hierarchical data, there is very limited requirement for data
conversion. AC2 can deal with unknown values in examples, and with multi-valued attributes.
It is also able to deal with knowledge concerning the studied domain, but with the exception
of the Machine Faults dataset, this aspect was deliberately not studied in this book. The
user interacts with AC2 via a graphical interface. This interface consists of graphical
editors, which enable the user to define the knowledge of the domain, to interactively build
the example base and to go through the hierarchy of classes and the decision tree.
AC2 produces decision trees which can be very large compared to those of the other decision
tree algorithms. The trials reported here suggest that AC2 is relatively slow. This older
version used Common LISP and has now been superseded by a C version, resulting in a
much faster program.
11.3.2 C4.5
C4.5 is the direct descendant of ID3. It is run in batch mode for training, with attribute-value
data input. For testing, both interactive and batch modes are available. Both logical and
numeric values can be used in the attributes; it needs a declaration for the types and range
of attributes, and such information needs to be placed in a separate le. C4.5 is very easy
to set up and run. In fact it is only a set of UNIX commands, which should be familiar
to all UNIX users. There are very few parameters. Apart from the pruning criterion no
major parameter adjustment is needed for most applications - in the trials reported here,
the windowing facility was disabled. C4.5 produces a confusion matrix from classication
results. However, it does not incorporate a cost matrix. C4.5 allows the users to adjust the
degree of the tracing information displayed while the algorithm is running. This facility
can satisfy both the users who do not need to know the internal operations of the algorithm
and the users who need to monitor the intermediate steps of tree construction.
Note that C4.5 has a rule-generating module, which often improves the error rate and
almost invariably the user-transparency, but this was not used in the comparative trials
reported in Chapter 9.
11.3.3 CART and IndCART
CART and IndCART are decision tree algorithms based on the work of Breiman et al.
(1984). The StatLog version of CART is the commercial derivative of the original algorithm
developed at Caltech. Both are classication and regression algorithms but they treat
regression and unknown values in the data somewhat differently. In both systems there are
very few parameters to adjust for new tasks.
The noise handling mechanisms of the two algorithms are very similar. Both can also
deal with unknown values, though in different ways. Both algorithms output a decision
tree and a confusion matrix. But only CART incorporates costs (and it does so
in both the training and test phases). Note that CART failed to run in many of the trials involving
very large datasets.
11.3.4 Cal5
Cal5 is a numeric value decision tree classifier using statistical methods. Thus discrete
values have to be changed into numeric ones. Cal5 is very easy to set up and run. It has a
number of menus to guide the user to complete operations. However, there are a number
of parameters, and for novice users the meanings of these parameters are not very easy to
understand. The results from different parameter settings can be very different, but tuning
of parameters is implemented in a semi-automatic manner.
The decision trees produced by Cal5 are usually quite small and are reasonably easy
to understand compared to algorithms such as C4.5 when used with default settings of
pruning parameters. Occasionally, from the point of view of minimising error rates, the
tree is over-pruned, though of course the rules are then more transparent. Cal5 produces a
confusion matrix and incorporates a cost matrix.
11.3.5 Bayes Tree
Our trials confirm the results reported in Buntine (1992): Bayes trees are generally slower
in learning and testing, but perform at around the same accuracy as, say, C4.5 or NewID.
However, it is not so similar to these two algorithms as one might expect, sometimes being
substantially better (in the cost datasets), sometimes marginally better (in the segmented
image datasets) and sometimes noticeably worse. Bayes tree also did surprisingly badly, for
a decision tree, on the technical dataset. This is probably due to the relatively small sample
sizes for a large number of the classes. Samples with very small a priori probabilities
are allocated to the most frequent classes, as the dataset is not large enough for the a
priori probabilities to be adapted by the empirical probabilities. Apart from the technical
dataset, Bayes trees probably do well as a result of the explicit mechanism for pruning
via smoothing class probabilities, and their success gives empirical justification for the
at-first-sight artificial model of tree probabilities.
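One common way of smoothing the class probabilities at a leaf (a sketch of the general idea, not Buntine's exact formulation) is a Laplace-style correction towards the uniform distribution:

    def smoothed_leaf_probabilities(class_counts, n_classes, m=1.0):
        # classes assumed to be labelled 0..K-1;
        # p(c) = (n_c + m) / (n + m * K): small leaves are pulled strongly towards
        # the uniform distribution over the K classes, which acts like soft pruning
        n = sum(class_counts.values())
        return {c: (class_counts.get(c, 0) + m) / (n + m * n_classes)
                for c in range(n_classes)}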
11.4 RULE-BASED METHODS
11.4.1 CN2
The rule-based algorithm CN2 also belongs to the general class of recursive partitioning
algorithms. Of the two possible variants, ordered and unordered rules, it appears that
unordered rules give the best results, and then the performance is practically indistinguishable
from the decision trees, while at the same time offering gains in mental fit over decision
trees. However, CN2 performed badly on the datasets involving costs, although this should
not be difficult to fix. As a decision tree may be expressed in the form of rules (and vice
versa), there appears to be no practical reason for choosing rule-based methods except when
the complexity of the data-domain demands some simplifying change of representation.
This is not an aspect with which this book has been concerned.
CN2 can be used in both interactive and batch mode. The interactive mode is its native
mode; and to run in batch mode users need to write a Unix shell script that gives the
algorithm a sequence of instructions to run. The slight deviation from the other algorithms
is that it needs a set of declarations that defines the types and range of attribute-values for
each attribute. In general there is very little effort needed for data conversion.
CN2 is very easy to set up and run. In interactive mode, the operations are completely
menu driven. After some familiarity it would be very easy to write a Unix shell script to
run the algorithm in batch mode. There are a few parameters that the users will have to
choose. However, there is only one parameter (rule types) which may have a significant
effect on the training results for most applications.
11.4.2 ITrule
Strictly speaking, ITrule is not a classification-type algorithm, and was not designed for
large datasets, or for problems with many classes. It is an exploratory tool, and is best
regarded as a way of extracting isolated interesting facts (or rules) from a dataset. The
facts (rules) are not meant to cover all examples. We may say that ITrule does not look for
Sec. 11.5]
Neural Networks
221
the best set of rules for classification (or for any other purpose). Rather it looks for a set
of best rules, each rule being very simple in form (usually restricted to conjunctions of
two conditions), with the rules being selected as having high information content (in the
sense of having a high J-measure). Within these limitations, and also with the limitation of
discretised variates, the search for the rules is exhaustive and therefore time-consuming.
Therefore the number of rules found is usually limited to some small number, which can
be as high as 5000 or more however. For use in classication problems, if the preset rules
have been exhausted, a default rule must be applied, and it is probable that most errors are
committed at this stage. In some datasets, ITrule may generate contradictory rules (i.e.
rules with identical condition parts but different conclusions), and this may also contribute
to a high error-rate. This last fact is connected with the asymmetric nature of the J-measure
compared to the usual entropy measure. The algorithm does not incorporate a cost matrix
facility, but it would appear a relatively simple task to incorporate costs as all rules are
associated with a probability measure. (In multi-class problems approximate costs would
need to be used, because each probability measure refers to the odds of observing a class
or not).
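For reference, the J-measure of a rule "if condition then class" is usually defined as below; this general form is assumed here rather than quoted from the trials.

    import math

    def j_measure(p_cond, p_class, p_class_given_cond):
        # J = p(cond) * [ p(c|cond) log2(p(c|cond)/p(c))
        #                 + (1 - p(c|cond)) log2((1 - p(c|cond))/(1 - p(c))) ]
        # the bracketed term is a relative entropy, which is what makes the
        # measure asymmetric in the sense discussed above
        def term(p, q):
            return 0.0 if p == 0.0 else p * math.log2(p / q)
        return p_cond * (term(p_class_given_cond, p_class)
                         + term(1.0 - p_class_given_cond, 1.0 - p_class))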
new modules, based on existing templates and hooks. One of the fundamental modules
provides routines for manipulating matrices, submatrices, and linked lists of submatrices.
It includes a set of macros written for the UNIX utility m4 which allows complicated
array-handling routines to be written using relatively simple m4 source code, which in turn
is translated into C source by m4. All memory management is handled dynamically.
There are several neural network modules, written as applications to the minimisation
module. These include a special purpose 3-layer MLP, a fully-connected recurrent MLP,
a fully-connected recurrent MLP with an unusual training algorithm (Silva & Almeida,
1990), and general MLP with architecture specied at runtime. There is also an RBF
network which shares the I/O routines but does not use the minimiser.
There is a general feeling, especially among statisticians, that the multilayer perceptron
is just a highly-parameterised form of non-linear regression. This is not our experience.
In practice, the Backprop procedure lies somewhere between a regression technique and a
decision tree, sometimes being closer to one and sometimes closer to the other. As a result,
we cannot make general statements about the nature of the decision surfaces, but it would
seem that they are not in any sense local (otherwise there would be a greater similarity
with k-NN). Generally, the absence of diagnostic information and the inability to interpret
the output is a great disadvantage.
11.5.2 Kohonen and LVQ
Kohonen's net is an implementation of the self-organising feature mapping algorithm
based on the work of Kohonen (1989). Kohonen nets have an inherent parallel feature in
the evaluation of links between neurons. So this program is implemented, by Luebeck
University in Germany, on a transputer with an IBM PC as the front-end for user interaction.
This special hardware requirement thus differs from the norm and makes comparison of
memory and CPU time rather difcult.
Kohonen nets are more general than a number of other neural net algorithms such as
backpropagation. In a sense, it is a modelling tool that can be used to model the behaviour
of a system with its input and output variables (attributes) all modelled as linked neurons.
In this respect, it is very similar to the statistical algorithm CASTLE: both can be used in
wider areas of applications including classication and regression. In this book, however,
we are primarily concerned with classication. The network can accept attribute-value
data with numeric values only. This makes it necessary to convert logical or categorical
attributes into numeric data.
In use there are very few indications as to how many nodes the system should have
and how many times the examples should be repeatedly fed to the system for training. All
such parameters can only be decided on a trial-and-error basis. Kohonen does not accept
unknown values so data sets must have their missing attribute-values replaced by estimated
values through some statistical methods. Similar to all neural networks, the output of the
Kohonen net normally gives very little insight to users as to why the conclusions have been
derived from the given input data. The weights on the links of the nodes in the net are not
generally easy to explain from a viewpoint of human understanding.
LVQ is also based on a Kohonen net and the essential difference between these two
programs is that LVQ uses supervised training, so it should be no surprise that in all the
trials (with the exception of the DNA dataset) the results of LVQ are better than those of
Kohonen. So, the use of Kohonen should be limited to clustering or unsupervised learning,
and LVQ should always be preferred for standard classication tasks. Unfortunately, LVQ
has at least one bug that may give seriously misleading results, so the output should be
checked carefully (beware reported error rates of zero!).
11.5.3 Radial basis function neural network
The radial basis function neural network (RBF for short) is similar to other neural net
algorithms, but it uses a different error estimation and gradient descent function, i.e. the
radial basis function. As with other neural net algorithms, the results produced by RBF
are very difficult to understand.
RBF uses a cross-validation technique to handle the noise. As the algorithm trains it
continually tests on a small set called the cross-validation set. When the error on this set
starts to increase it stops training. Thus it can automatically decide when to stop training,
which is a major advantage of this algorithm compared to other neural net algorithms.
However it cannot cope with unknown values.
The algorithm is fairly well implemented, so it is relatively easy to use compared to
many neural network algorithms. Because it has only one parameter to adjust for each new
application (the number of centres of the radial basis functions), it is fairly easy to use.
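A minimal sketch of that stopping rule; train_one_epoch and validation_error stand in for whatever training loop and error estimate are in use, and the patience parameter generalises "stop as soon as the error starts to increase".

    def train_with_early_stopping(train_one_epoch, validation_error,
                                  max_epochs=1000, patience=5):
        # keep training while the error on the held-out cross-validation set improves;
        # stop once it has failed to improve for `patience` consecutive epochs
        best_error, epochs_since_best = float("inf"), 0
        for _ in range(max_epochs):
            train_one_epoch()
            error = validation_error()
            if error < best_error:
                best_error, epochs_since_best = error, 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:
                    break
        return best_error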
11.5.4 DIPOL92
This algorithm has been included as a neural network, and is perhaps closest to MADALINE, but in fact it is rather a hybrid, and could also have been classified as a non-parametric statistical algorithm. It uses methods related to logistic regression in the first
stage, except that it sets up a discriminating hyperplane between all pairs of classes, and
then minimises an error function by gradient descent. In addition, an optional clustering
procedure allows a class to be treated as several subclasses.
This is a new algorithm and the results are very encouraging. Although it never quite
comes first in any one trial, it is very often second best, and its overall performance is
excellent. It would be useful to quantify how much the success of DIPOL92 is due to the
multi-way hyperplane treatment, and how much is due to the initial clustering, and it would
also be useful to automate the selection of clusters (at present the number of subclasses is
a user-dened parameter).
It is easy to use, and is intermediate between linear discriminants and multilayer
perceptron in ease of interpretation. It strengthens the case for other hybrid algorithms to
be explored.
11.6 MEMORY AND TIME
So far we have said very little about either memory requirements or CPU time to train
and test on each dataset. One reason for this is that these can vary considerably from one
implementation to another. We can, however, make a few comments.
11.6.1 Memory
In most of these large datasets, memory was not a problem. The exception to this was
the full version of the hand-written digit dataset (see Section 9.3.1). This dataset had 256
variables and 10,000 examples, and most algorithms (running on an 8 MB machine) could
not handle it. However, such problems are likely to be rare in most applications. A problem
with the interpretation of these figures is that they were obtained from the UNIX command
set time = (0 "%U %S %M")
and, for a simple FORTRAN program for example, the output is directly related to the
dimension declarations. So an edited version could be cut to fit the given dataset and
produce a smaller memory requirement. A more sensible way to quantify memory
would be in terms of the size of the data. For example, the SAS manual (1985) states
the memory required for nearest neighbour as a formula in the numbers of observations
and attributes which, for most situations, is of the order of the size of the training data. If similar results could be stated for all our algorithms
this would make comparisons much more transparent, and also enable predictions for new
datasets.
As far as our results are concerned, it is clear that the main difference in memory
requirements will depend on whether the algorithm has to store all the data or can process
it in pieces. The theory should determine this as well as the numbers, but it is clear that
linear and quadratic discriminant classifiers are the most efficient here.
11.6.2 Time
Again, the results here are rather confusing. The times do not always measure the same
thing; for example, if there are parameters to select there are two options. The user may decide
to just plug in the parameter(s) and suffer a slight loss in accuracy of the classifier, or the user
may decide to choose the parameters by cross-validation and reduce the error rate at the
expense of a vastly inflated training time. It is clear then, that more explanation is required
and a more thorough investigation to determine selection of parameters and the trade-off
between time and error rate in individual circumstances. There are other anomalies: for
example, SMART often quotes the smallest time to test, yet the amount of computation
required is a superset of that required for Discrim, which usually takes longer. So it appears
that the interpretation of results will again be influenced by the implementation. It is of
interest that SMART has the largest ratio of training to testing time in nearly all of the
datasets. As with memory requirements, a statement that time is proportional to some
function of the data size would be preferred. For example, the SAS manual quotes the time
for the nearest neighbour classifier to test as proportional to n, where n is the number of
observations in the training data. The above warnings should make us cautious in drawing
conclusions, in that some algorithms may not require parameter selection. However, if we
sum the training and testing times, we can say generally that
DIPOL92 and SMART, they do not use costs as part of the learning process. Of those
algorithms which do not incorporate costs, many output a measure which can be interpreted
as a probability, and costs could therefore be incorporated. This book has only considered
three datasets which include costs, partly for the very reason that some of the decision tree
programs cannot cope with them. There is a clear need to have the option of incorporating
a cost matrix into all classication algorithms, and in principle this should be a simple
matter.
11.7.2 Interpretation of error rates
The previous chapter has already analysed the results from the trials and some sort of
a pattern is emerging. It is hoped that one day we can find a set of measures which
can be obtained from the data, and then using these measures alone we can predict with
some degree of confidence which algorithms or methods will perform the best. There
is some theory here, for example, the similarity of within-class covariance matrices will
determine the relative performance of linear and quadratic discriminant functions and
also the performance of these relative to decision tree methods (qualitative conditional
dependencies will favour trees). However, from an empirical perspective there is still some
way to go, both from the point of view of determining which measures are important, and
how best to make the prediction. The attempts of the previous chapter show how this
may be done, although more datasets are required before condence can be attached to the
conclusions.
The request for more datasets raises another issue: What kind of datasets? It is clear
that we could obtain very biased results if we limit our view to certain types, and the
question of what is representative is certainly unanswered. Section 2.1.3 outlines a number
of different dataset types, and it is likely that this consideration will play the most important
rôle in determining the choice of algorithm.
The comparison of algorithms here is almost entirely of a black-box nature. So the
recommendations as they stand are really only applicable to the naïve user. In the hands
An outlier here is the Tsetse fly data, which could also easily have been placed in the category
of image datasets (segmentation), since the data are of a spatial nature, although it is not a
standard image!
The analysis of Section 10.6 is a promising one, though there is not enough data to make
strong conclusions or to take the rules too seriously. However, it might be better to predict
performance on a continuous scale rather than the current approach, which discretises the
algorithms into Applicable and Non-Applicable. Indeed, the choice of 8 or 16 standard deviations
(see Section 10.6.3) is very much larger than the more commonly used 2 or 3 standard errors
in hypothesis testing.
The attempts to predict performance using the performance of benchmark algorithms
(see Section 10.7) are highly dependent on the choice of datasets used. Also, it needs to be
remembered that the coefficients reported in Table 10.14 are not absolute. They are again
based on a transformation of all the results to the unit interval. So, for example, the result
that the error rate for ALLOC80 could be predicted by taking 0.8 times the error rate for k-NN
takes into account the error rates for all of the other algorithms. If we only consider this
pair (k-NN and ALLOC80) then we get a coefficient of 0.9, but this is still influenced by
one or two observations. An alternative is to consider the average percentage improvement
of ALLOC80, which is 6.4%, but none of these possibilities takes account of the different
sample sizes.
into the structure of the classication process, then neural nets, k-nearest neighbour and
ALLOC80 are unlikely to be much use. No matter what procedure is actually used, it is
often best to prune radically, by keeping only two or three significant terms in a regression,
or by using trees of depth two, or using only a small number of rules, in the hope that the
important structure is retained. Less important structures can be added on later as greater
accuracy is required. It should also be borne in mind that in exploratory work it is common
to include anything at all that might conceivably be relevant, and that often the first task is
to weed out the irrelevant information before the task of exploring structure can begin.
11.7.7 Special features
If a particular application has some special features such as missing values, hierarchical
structure in the attributes, ordered classes, presence of known subgroups within classes
(hierarchy of classes), etc. etc., this extra structure can be used in the classication process
to improve performance and to improve understanding. Also, it is crucial to understand if
the class values are in any sense random variables, or outcomes of a chance experiment, as
this alters radically the approach that should be adopted.
The Procrustean approach of forcing all datasets into a common format, as we have
done in the trials of this book for comparative purposes, is not recommended in general.
The general rule is to use all the available external information, and not to throw it away.
11.7.8 From classification to knowledge organisation and synthesis
In Chapter 5 it was stressed that Machine Learning classifiers should possess a mental fit to
the data, so that the learned concepts are meaningful to and evaluable by humans. On this
criterion, the neural net algorithms are relatively opaque, whereas most of the statistical
methods which do not have mental fit can at least determine which of the attributes are
important. However, the specific black-box use of methods would (hopefully!) never
take place, and it is worth looking forward more speculatively to AI uses of classification
methods.
For example, KARDIO's comprehensive treatise on ECG interpretation (Bratko et al.,
1989) does not contain a single rule of human authorship. Seen in this light, it becomes
clear that classification and discrimination are not narrow fields within statistics or machine
learning, but that the art of classification can generate substantial contributions to organise
(and improve) human knowledge, even, as in KARDIO, to manufacture new knowledge.
Another context in which knowledge derived from humans and data is synthesised is
in the area of Bayesian expert systems (Spiegelhalter et al., 1993), in which subjective
judgments of model structure and conditional probabilities are formally combined with
likelihoods derived from data by Bayes' theorem: this provides a way for a system to
smoothly adapt a model from being initially expert-based towards one derived from data.
However, this representation of knowledge by causal nets is necessarily rather restricted
because it demands an exhaustive specification of the full joint distribution. On the other hand,
such systems form a complete model of a process and are intended for more than simply classification. Indeed, they provide a unified structure for many complex stochastic
problems, with connections to image processing, dynamic modelling and so on.
12
Knowledge Representation
Claude Sammut
University of New South Wales
12.1 INTRODUCTION
In 1956, Bruner, Goodnow and Austin published their book A Study of Thinking, which
became a landmark in psychology and would later have a major impact on machine learning. The experiments reported by Bruner, Goodnow and Austin were directed towards
understanding a human's ability to categorise and how categories are learned.
We begin with what seems a paradox. The world of experience of any normal
man is composed of a tremendous array of discriminably different objects, events,
people, impressions...But were we to utilise fully our capacity for registering the
differences in things and to respond to each event encountered as unique, we would
soon be overwhelmed by the complexity of our environment... The resolution of
this seeming paradox ... is achieved by man's capacity to categorise. To categorise
is to render discriminably different things equivalent, to group objects and events
and people around us into classes... The process of categorizing involves ... an act
of invention... If we have learned the class "house" as a concept, new exemplars
can be readily recognised. The category becomes a tool for further use. The
learning and utilisation of categories represents one of the most elementary and
general forms of cognition by which man adjusts to his environment.
The first question that they had to deal with was that of representation: what is a concept? They assumed that objects and events could be described by a set of attributes and
were concerned with how inferences could be drawn from attributes to class membership.
Categories were considered to be of three types: conjunctive, disjunctive and relational.
...when one learns to categorise a subset of events in a certain way, one is doing
more than simply learning to recognise instances encountered. One is also learning
a rule that may be applied to new instances. The concept or category is basically,
this rule of grouping and it is such rules that one constructs in forming and
attaining concepts.
Address for correspondence: School of Computer Science and Engineering, Artificial Intelligence Laboratory,
University of New South Wales, PO Box 1, Kensington, NSW 2033, Australia
The notion of a rule as an abstract representation of a concept in the human mind came to
be questioned by psychologists and there is still no good theory to explain how we store
concepts. However, the same questions about the nature of representation arise in machine
learning, for the choice of representation heavily determines the nature of a learning
algorithm. Thus, one critical point of comparison among machine learning algorithms is
the method of knowledge representation employed.
In this chapter we will discuss various methods of representation and compare them
according to their power to express complex concepts and the effects of representation on
the time and space costs of learning.
12.2 LEARNING, MEASUREMENT AND REPRESENTATION
A learning program is one that is capable of improving its performance through experience.
Given a program, P, and some input, x, a normal program would yield the same result
y = P(x) after every application. However, a learning program can alter its initial state q
so that its performance is modified with each application. Thus, we can say
y = P(q, x).
That is, y is the result of applying program P to input x, given the initial state q. The
goal of learning is to construct a new initial state, q', so that the program alters its behaviour to
give a more accurate or quicker result. Thus, one way of thinking about what a learning
program does is that it builds an increasingly accurate approximation to a mapping from
input to output.
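A toy illustration of this view of a learning program, with a single adjustable state value and a gradient-style update (the names and the update rule are illustrative only):

    class LearningProgram:
        # a "program" whose internal state q is adjusted after each application,
        # so that y = P(q, x) becomes a better approximation of the target mapping
        def __init__(self):
            self.q = 0.0

        def apply(self, x, target=None, rate=0.1):
            y = self.q * x                        # current approximation
            if target is not None:                # learning step: adjust the state
                self.q += rate * (target - y) * x
            return y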
The most common learning task is that of acquiring a function which maps objects, that
share common properties, to the same class value. This is the categorisation problem to
which Bruner, Goodnow and Austin referred and much of our discussion will be concerned
with categorisation.
Learning experience may be in the form of examples from a trainer or the results of
trial and error. In either case, the program must be able to represent its observations
of the world, and it must also be able to represent hypotheses about the patterns it may
find in those observations. Thus, we will often refer to the observation language and the
hypothesis language. The observation language describes the inputs and outputs of the
program and the hypothesis language describes the internal state of the learning program,
which corresponds to its theory of the concepts or patterns that exist in the data.
The input to a learning program consists of descriptions of objects from the universe
and, in the case of supervised learning, an output value associated with the example. The
universe can be an abstract one, such as the set of all natural numbers, or the universe
may be a subset of the real-world. No matter which method of representation we choose,
descriptions of objects in the real world must ultimately rely on measurements of some
properties of those objects. These may be physical properties such as size, weight, colour,
etc., or they may be defined for objects, for example the length of time a person has been
employed for the purpose of approving a loan. The accuracy and reliability of a learned
concept depends heavily on the accuracy and reliability of the measurements.
A program is limited in the concepts that it can learn by the representational capabilities
of both observation and hypothesis languages. For example, if an attribute/value list is used
to represent examples for an induction program, the measurement of certain attributes and
not others clearly places bounds on the kinds of patterns that the learner can find. The
learner is said to be biased by its observation language. The hypothesis language also places
constraints on what may and may not be learned. For example, in the language of attributes
and values, relationships between objects are difficult to represent, whereas a more
expressive language, such as first order logic, can easily be used to describe relationships.
Unfortunately, representational power comes at a price. Learning can be viewed as a
search through the space of all sentences in a language for a sentence that best describes
the data. The richer the language, the larger the search space. When the search space is
small, it is possible to use brute force search methods. If the search space is very large,
additional knowledge is required to reduce the search.
We will divide our attention among three different classes of machine learning algorithms that use distinctly different approaches to the problem of representation:
Instance-based learning algorithms learn concepts by storing prototypic instances of the
concept and do not construct abstract representations at all.
Function approximation algorithms include connectionist and statistical methods. These
algorithms are most closely related to traditional mathematical notions of approximation and interpolation and represent concepts as mathematical formulae.
Symbolic learning algorithms learn concepts by constructing a symbolic description which describes a class of objects. We will consider algorithms that work with representations
equivalent to propositional logic and first-order logic.
12.3 PROTOTYPES
The simplest form of learning is memorisation. When an object is observed or the solution
to a problem is found, it is stored in memory for future use. Memory can be thought of as
a look-up table. When a new problem is encountered, memory is searched to find if the
same problem has been solved before. If an exact match for the search is required, learning
is slow and consumes very large amounts of memory. However, approximate matching
allows a degree of generalisation that both speeds learning and saves memory.
For example, if we are shown an object and we want to know if it is a chair, then we
compare the description of this new object with descriptions of typical chairs that we
have encountered before. If the description of the new object is close to the description
of one of the stored instances then we may call it a chair. Obviously, we must define what
we mean by typical and close.
To better understand the issues involved in learning prototypes, we will briefly describe three experiments in Instance-based learning (IBL) by Aha, Kibler & Albert (1991).
IBL learns to classify objects by being shown examples of objects, described by an attribute/value list, along with the class to which each example belongs.
12.3.1 Experiment 1
In the first experiment (IB1), to learn a concept simply required the program to store every
example. When an unclassied object was presented for classication by the program, it
used a simple Euclidean distance measure to determine the nearest neighbour of the object
and the class given to it was the class of the neighbour.
This simple scheme works well, and is tolerant to some noise in the data. Its major
disadvantage is that it requires a large amount of storage capacity.
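The scheme can be made concrete with a small sketch (the Python code below is purely illustrative and not part of the original text; the toy data and function names are invented): every training example is stored, and a query is given the class of its Euclidean nearest neighbour, as in IB1.

    import math

    def ib1_classify(memory, query):
        """Return the class of the stored example nearest to the query.

        memory is a list of (attribute_vector, class_label) pairs; IB1
        simply keeps every training example it has seen."""
        def distance(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        _, label = min(memory, key=lambda example: distance(example[0], query))
        return label

    # Toy usage: two numeric attributes, two classes.
    training = [((1.0, 1.0), "chair"), ((1.2, 0.9), "chair"), ((5.0, 4.5), "table")]
    print(ib1_classify(training, (1.1, 1.0)))    # prints "chair"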
12.3.2 Experiment 2
The second experiment (IB2) attempted to improve the space performance of IB1. In this
case, when new instances of classes were presented to the program, the program attempted
to classify them. Instances that were correctly classied were ignored and only incorrectly
classied instances were stored to become part of the concept.
While this scheme reduced storage dramatically, it was less noise-tolerant than the first.
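Continuing the illustrative sketch above (again hypothetical code, not from the original), IB2's space saving amounts to storing an incoming example only when the instances kept so far misclassify it.

    def ib2_train(stream, classify):
        """Keep only the examples that the current memory classifies wrongly.

        stream yields (attribute_vector, class_label) pairs; classify is any
        nearest-neighbour style classifier, such as ib1_classify above."""
        memory = []
        for attributes, label in stream:
            if not memory or classify(memory, attributes) != label:
                memory.append((attributes, label))   # misclassified: store it
            # correctly classified examples are discarded, as in IB2
        return memory

    # Toy demonstration with a trivial exact-match classifier.
    def exact_match(memory, attributes):
        return next((c for a, c in memory if a == attributes), None)

    kept = ib2_train([((1, 1), "chair"), ((1, 1), "chair"), ((5, 5), "table")],
                     exact_match)
    print(len(kept))    # 2: the repeated chair example was not stored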
12.3.3 Experiment 3
The third experiment (IB3) used a more sophisticated method for evaluating instances to
decide if they should be kept or not. IB3 is similar to IB2 with the following additions. IB3
maintains a record of the number of correct and incorrect classification attempts for each
saved instance. This record summarises an instance's classification performance. IB3 uses
a significance test to determine which instances are good classifiers and which ones are
believed to be noisy. The latter are discarded from the concept description. This method
strengthens noise tolerance, while keeping storage requirements down.
12.3.4 Discussion
Fig. 12.1: The extension of an IBL concept is shown in solid lines. The dashed lines represent the
target concept. A sample of positive and negative examples is shown. Adapted from Aha, Kibler and
Albert (1991).
IB1 is strongly related to the k-nearest neighbour methods described in Section 4.3; here
k is 1. The main contribution of Aha, Kibler and Albert (1991) is the attempt to achieve
satisfactory accuracy while using less storage. The algorithms presented in Chapter 4
assumed that all training data are available, whereas IB2 and IB3 examine methods for
forgetting instances that do not improve classification accuracy.
Figure 12.1 shows the boundaries of an imaginary concept in a two-dimensional space.
The dashed lines represent the boundaries of the target concept. The learning procedure
attempts to approximate these boundaries by nearest neighbour matches. Note that the
boundaries defined by the matching procedure are quite irregular. This can have its
advantages when the target concept does not have a regular shape.
Learning by remembering typical examples of a concept has several other advantages.
If an efficient indexing mechanism can be devised to find near matches, this representation
can be very fast as a classifier since it reduces to a table look-up. It does not require any
sophisticated reasoning system and is very flexible. As we shall see later, representations
that rely on abstractions of concepts can run into trouble with what appear to be simple
concepts. For example, an abstract representation of a chair may consist of a description
of the number of legs, the height, etc. However, exceptions abound since anything that can be
sat on can be thought of as a chair. Thus, abstractions must often be augmented by lists of
exceptions. Instance-based representation does not suffer from this problem since it
consists only of exceptions and is designed to handle them efficiently.
One of the major disadvantages of this style of representation is that it is necessary to
define a similarity metric for objects in the universe. This can often be difficult to do when
the objects are quite complex.
Another disadvantage is that the representation is not human readable. In the previous section we made the distinction between an observation language and a hypothesis
language. When learning using prototypes, the language of observation may be an attribute/value representation. The hypothesis language is simply a set of attribute/value or
feature vectors, representing the prototypes. While examples are often a useful means
of communicating ideas, a very large set of examples can easily swamp the reader with
unnecessary detail and fails to emphasise important features of a class. Thus a collection of
typical instances may not convey much insight into the concept that has been learned.
12.4 FUNCTION APPROXIMATION
A linear threshold unit (LTU) outputs one of two values according to whether the weighted
sum of its inputs exceeds a threshold. A single layer of LTUs applied to an input vector x
computes an output vector

u = f(Wx)
where each element of u indicates membership of a class and each row in W is the set of
weights for one LTU. This architecture is called a pattern associator.
LTUs can only produce linear discriminant functions and consequently, they are limited
in the kinds of classes that can be learned. However, it was found that by cascading pattern
associators, it is possible to approximate decision surfaces that are of a higher order than
simple hyperplanes. In a cascaded system, the outputs of one pattern associator are fed into
the inputs of another, thus:

u = f(W2 f(W1 x))
To facilitate learning, a further modification must be made. Rather than using a simple
threshold, as in the perceptron, multi-layer networks usually use a non-linear threshold
such as a sigmoid function. Like perceptron learning, back-propagation attempts to reduce
the errors between the output of the network and the desired result. Despite the non-linear
threshold, multi-layer networks can still be thought of as describing a complex collection
of hyperplanes that approximate the required decision surface.
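The cascade just described can be sketched as follows (an illustrative Python fragment, not code from the original; the weight matrices are random placeholders rather than trained values).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cascaded_associator(x, W1, W2):
        """Two pattern associators in cascade: u = f(W2 f(W1 x)).

        Each row of a weight matrix holds the weights of one unit; the
        sigmoid replaces the hard threshold of the simple perceptron."""
        hidden = sigmoid(W1 @ x)
        return sigmoid(W2 @ hidden)

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)           # a four-attribute input vector
    W1 = rng.normal(size=(6, 4))     # first associator: 6 units
    W2 = rng.normal(size=(3, 6))     # second associator: 3 class outputs
    print(cascaded_associator(x, W1, W2))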
Fig. 12.3: A Pole Balancer.
12.4.1 Discussion
Function approximation methods can often produce quite accurate classifiers because they
are capable of constructing complex decision surfaces. The observation language for
algorithms of this class is usually a vector of numbers. Often preprocessing will convert
raw data into a suitable form. For example, Pomerleau (1989) accepts raw data from a
camera mounted on a moving vehicle and selects portions of the image to process for input
to a neural net that learns how to steer the vehicle. The knowledge acquired by such a
system is stored as weights in a matrix. Therefore, the hypothesis language is usually an
array of real numbers. Thus, the results of learning are not easily available for inspection by
a human reader. Moreover, the design of a network usually requires informed guesswork
on the part of the user in order to obtain satisfactory results. Although some effort has been
devoted to extracting meaning from networks, they still communicate little about the data.
Connectionist learning algorithms are still computationally expensive. A critical factor
in their speed is the encoding of the inputs to the network. This is also critical to genetic
algorithms and we will illustrate that problem in the next section.
12.5 GENETIC ALGORITHMS
Genetic algorithms (Holland, 1975) perform a search for the solution to a problem by
generating candidate solutions from the space of all solutions and testing the performance
of the candidates. The search method is based on ideas from genetics and the size of
the search space is determined by the representation of the domain. An understanding of
genetic algorithms will be aided by an example.
A very common problem in adaptive control is learning to balance a pole that is hinged
on a cart that can move in one dimension along a track of fixed length, as shown in Figure
12.3. The controller must use bang-bang control, that is, a force of fixed magnitude can be
applied to push the cart to the left or right.
Before we can begin to learn how to control this system, it is necessary to represent
it somehow. We will use the BOXES method that was devised by Michie & Chambers
Fig. 12.4: Discretisation of pole balancer state space.
(1968). The measurements taken of the physical system are the angle of the pole, θ, and its
angular velocity, and the position of the cart, x, and its velocity. Rather than treat the four
variables as continuous values, Michie and Chambers chose to discretise each dimension
of the state space. One possible discretisation is shown in Figure 12.4. This discretisation
results in 162 boxes that partition the state space.
Each box has associated with it an action setting which tells the controller that when the
system is in that part of the state space, the controller should apply that action, which is a
push to the left or a push to the right. Since there is a simple binary choice and there are
162 boxes, there are 2¹⁶² possible control strategies for the pole balancer.
The simplest kind of learning in this case is to search exhaustively for the right
combination. However, this is clearly impractical given the size of the search space.
Instead, we can invoke a genetic search strategy that will reduce the amount of search
considerably.
In genetic learning, we assume that there is a population of individuals, each one of
which represents a candidate problem solver for a given task. Like evolution, genetic
algorithms test each individual from the population and only the fittest survive to reproduce
for the next generation. The algorithm creates new generations until at least one individual
is found that can solve the problem adequately.
Each problem solver is a chromosome. A position, or set of positions in a chromosome
is called a gene. The possible values (from a fixed set of symbols) of a gene are known
as alleles. In most genetic algorithm implementations the set of symbols is {0, 1} and
chromosome lengths are fixed. Most implementations also use fixed population sizes.
The most critical problem in applying a genetic algorithm is in finding a suitable
encoding of the examples in the problem domain to a chromosome. A good choice of
representation will make the search easy by limiting the search space, a poor choice will
result in a large search space. For our pole balancing example, we will use a very simple
encoding. A chromosome is a string of 162 boxes. Each box, or gene, can take values: 0
(meaning push left) or 1 (meaning push right). Choosing the size of the population can be
tricky since a small population size provides an insufficient sample size over the space of
solutions for a problem and a large population requires a lot of evaluation and will be slow.
In this example, 50 is a suitable population size.
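The encoding can be sketched as follows (illustrative Python, not part of the original text): a chromosome is a string of 162 genes, each 0 (push left) or 1 (push right), and the initial population of 50 such strings is generated at random.

    import random

    N_BOXES = 162          # one gene per box of the discretised state space
    POPULATION_SIZE = 50   # the population size suggested in the text

    def random_chromosome():
        """A candidate control strategy: one push-left/push-right gene per box."""
        return [random.randint(0, 1) for _ in range(N_BOXES)]

    def action(chromosome, box_index):
        """Look up the action for the box the system is currently in."""
        return "left" if chromosome[box_index] == 0 else "right"

    population = [random_chromosome() for _ in range(POPULATION_SIZE)]
    print(action(population[0], 17))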
Offspring produced by crossover cannot contain information that is not already in the
population, so an additional operator, mutation, is required. Mutation generates an offspring
by randomly changing the values of genes at one or more gene positions of a selected
chromosome.
The number of offspring produced for each new generation depends on how members are
introduced so as to maintain a fixed population size. In a pure replacement strategy, the
whole population is replaced by a new one. In an elitist strategy, a proportion of the
population survives to the next generation.
In pole balancing, all offspring are created by crossover (except when more than 60% will
survive for more than three generations, when the rate is reduced to only 0.75 being produced
by crossover). Mutation is a background operator which helps to sustain exploration. Each
offspring produced by crossover has a probability of 0.01 of being mutated before it enters
the population. If more than 60% will survive, the mutation rate is increased to 0.25.
The number of offspring an individual can produce by crossover is proportional to its
fitness:

number of offspring = (fitness of individual / total fitness of population) × number of children
where the number of children is the total number of individuals to be replaced. Mates are
chosen at random among the survivors.

Fig. 12.5: Objects described by two attributes, size (v_small, small, medium, large, v_large) and
colour (red, orange, yellow, green, blue, violet), grouped into three classes.
The pole balancing experiments described above were conducted by Odetayo (1988).
This may not be the only way of encoding the problem for a genetic algorithm, and other
solutions may be possible. However, devising a clever encoding requires effort on the part
of the user.
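Putting these pieces together, one generation of genetic learning might be sketched as follows (hypothetical Python; the fitness function is a stand-in, since evaluating a real chromosome requires running pole-balancing trials).

    import random

    def next_generation(population, fitness, n_children, p_mutation=0.01):
        """Produce offspring in proportion to fitness, by crossover then mutation."""
        scores = [fitness(c) for c in population]
        children = []
        for _ in range(n_children):
            # Parents are chosen with probability proportional to their fitness.
            mum, dad = random.choices(population, weights=scores, k=2)
            cut = random.randrange(1, len(mum))       # one-point crossover
            child = mum[:cut] + dad[cut:]
            if random.random() < p_mutation:          # background mutation
                position = random.randrange(len(child))
                child[position] = 1 - child[position]
            children.append(child)
        return children

    # Stand-in fitness (count of 1s); a real fitness would be the survival
    # time obtained by running the pole-balancing simulation.
    population = [[random.randint(0, 1) for _ in range(162)] for _ in range(50)]
    population = next_generation(population, fitness=sum, n_children=50)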
12.6 PROPOSITIONAL LEARNING SYSTEMS
Rather than searching for discriminant functions, symbolic learning systems find expressions equivalent to sentences in some form of logic. For example, we may distinguish
objects according to two attributes: size and colour. We may say that an object belongs to
class 3 if its colour is red and its size is very small to medium. Following the notation of
Michalski (1983), the classes in Figure 12.5 may be written as:
Class3 ← [colour = red] [size = v_small .. medium]

with similar expressions describing Class 1 and Class 2.
Note that this kind of description partitions the universe into blocks, unlike the function
approximation methods that find smooth surfaces to discriminate classes.
Interestingly, one of the popular early machine learning algorithms, Aq (Michalski,
1973), had its origins in switching theory. One of the concerns of switching theory is to
find ways of minimising logic circuits, that is, simplifying the truth table description of the
function of a circuit to a simple expression in Boolean logic. Many of the algorithms in
switching theory take tables like Figure 12.5 and search for the best way of covering all of
the entries in the table.
Aq uses a covering algorithm to build its concept description:
1. Choose a positive example of the class to act as a seed.
2. Generate the set of maximally general conjunctive expressions (the star) that cover the seed but none of the negative examples.
3. Select the best expression from the star according to a preference criterion.
4. Remove the positive examples covered by the chosen expression and, if any remain, repeat from step 1; the concept description is the disjunction of the chosen expressions.
The best expression is usually some compromise between the desire to cover as
many positive examples as possible and the desire to have as compact and readable a
representation as possible. In designing Aq, Michalski was particularly concerned with the
expressiveness of the concept description language.
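A covering loop of this kind can be sketched as follows (illustrative Python only; it grows one conjunction per seed by greedily adding attribute tests until no negative example is covered, which is a simplification of Aq's star generation).

    def covers(rule, example):
        """A rule is a dict of attribute -> required value."""
        return all(example.get(a) == v for a, v in rule.items())

    def learn_one_rule(seed, negatives):
        """Add attribute tests taken from the seed until no negative is covered."""
        rule = {}
        for attribute, value in seed.items():
            if not any(covers(rule, n) for n in negatives):
                break
            rule[attribute] = value
        return rule

    def covering_algorithm(positives, negatives):
        """Repeat until every positive example is covered by some rule."""
        rules, uncovered = [], list(positives)
        while uncovered:
            rule = learn_one_rule(uncovered[0], negatives)
            rules.append(rule)
            uncovered = [p for p in uncovered if not covers(rule, p)]
        return rules

    pos = [{"size": "large", "colour": "red"}, {"size": "large", "colour": "orange"}]
    neg = [{"size": "small", "colour": "red"}]
    print(covering_algorithm(pos, neg))    # [{'size': 'large'}]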
A drawback of the Aq learning algorithm is that it does not use statistical information,
present in the training sample, to guide induction. However, decision tree learning algorithms (Quinlan, 1993) do. The basic method of building a decision tree is summarised
in Figure 12.6. A simple attribute/value representation is used and so, like Aq, decision
trees are incapable of representing relational information. They are, however, very quick
and easy to build.
The algorithm operates over a set of training instances, T:
If all instances in T are in class C, create a leaf node labelled C and stop. Otherwise select a feature F and create a decision node.
Partition the training instances in T into subsets according to their values of F.
Apply the algorithm recursively to each of the subsets of T.
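The recursion can be sketched as follows (an illustrative fragment, not the original Figure 12.6; for brevity the feature to split on is chosen arbitrarily rather than by an information measure such as Quinlan's gain).

    from collections import Counter

    def build_tree(instances, features):
        """instances: list of (attribute_dict, class_label) pairs."""
        classes = {label for _, label in instances}
        if len(classes) == 1:                 # all instances in one class
            return classes.pop()              # create a leaf node and stop
        if not features:                      # no feature left: majority leaf
            return Counter(label for _, label in instances).most_common(1)[0][0]
        feature = features[0]                 # select a feature (naively here)
        branches = {}
        for value in {attrs[feature] for attrs, _ in instances}:
            subset = [(a, c) for a, c in instances if a[feature] == value]
            branches[value] = build_tree(subset, features[1:])   # recurse
        return (feature, branches)

    data = [({"size": "large", "colour": "red"}, "Class1"),
            ({"size": "small", "colour": "red"}, "Class3"),
            ({"size": "small", "colour": "blue"}, "Class2")]
    print(build_tree(data, ["size", "colour"]))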
Fig. 12.7: The dashed line shows the real division of objects in the universe. The solid lines show a
decision tree approximation.
Decision tree learning algorithms can be seen as methods for partitioning the universe
into successively smaller rectangles with the goal that each rectangle only contains objects
of one class. This is illustrated in Figure 12.7.
12.6.1 Discussion
Michalski has always argued in favour of rule-based representations over tree structured
representations, on the grounds of readability. When the domain is complex, decision trees
can become very bushy and difficult to understand, whereas rules tend to be modular
and can be read in isolation of the rest of the knowledge-base constructed by induction.
On the other hand, decision tree induction programs are usually very fast. A compromise
is to use decision tree induction to build an initial tree and then derive rules from the tree
thus transforming an efficient but opaque representation into a transparent one (Quinlan,
1987b).
It is instructive to compare the shapes that are produced by various learning systems
when they partition the universe. Figure 12.7 demonstrates one weakness of decision trees
and other symbolic classification methods. Since they approximate partitions with rectangles (if
the universe is 2-dimensional) there is an inherent inaccuracy when dealing with domains
with continuous attributes. Function approximation methods and IBL may be able to attain
higher accuracy, but at the expense of transparency of the resulting theory. It is more
difficult to make general comments about genetic algorithms, since the shapes of the regions
they produce depend on the encoding method chosen for the problem.
In the next section we will look at learning algorithms that deal with relational information. In this case, the emphasis on language is essential since geometric interpretations
no longer provide us with any real insight into the operation of these algorithms.
12.7 RELATIONS AND BACKGROUND KNOWLEDGE
Induction systems, as we have seen so far, might be described as 'what you see is what you
get'. That is, the output class descriptions use the same vocabulary as the input examples.
However, we will see in this section that it is often useful to incorporate background
knowledge into learning.
We use a simple example from Banerji (1980) to illustrate the use of background knowledge.
There is a language for describing instances of a concept and another for describing
concepts. Suppose we wish to represent the binary number, 10, by a left-recursive binary
tree of digits 0 and 1:
[head: 1; tail: 0]
head and tail are the names of attributes. Their values follow the colon. The concepts
of binary digit and binary number are defined as:

digit(x) ← (x = 0) ∨ (x = 1)
number(x) ← (digit(head(x)) ∨ number(head(x))) ∧ digit(tail(x))
Thus, an object belongs to a particular class or concept if it satisfies the logical expression
in the body of the description. Predicates in the expression may test the membership of an
object in a previously learned concept.
Banerji always emphasised the importance of a description language that could grow.
That is, its descriptive power should increase as new concepts are learned. This can clearly
be seen in the example above. Having learned to describe binary digits, the concept of
digit becomes available for use in the description of more complex concepts such as binary
number.
Extensibility is a natural and easily implemented feature of Horn clause logic. In
addition, a description in Horn clause logic is a logic program and can be executed. For
example, to recognise an object, a Horn clause can be interpreted in a forward chaining
manner. Suppose we have a set of clauses:
z ← a ∧ b      (12.3)
c ← z ∧ w      (12.4)

and an instance:

a ∧ b ∧ w      (12.5)

Clause (12.3) recognises the first two terms in expression (12.5), reducing it to z ∧ w.
Clause (12.4) reduces this to c. That is, clauses (12.3) and (12.4) recognise expression
(12.5) as the description of an instance of concept c.
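Using the same schematic symbols as the clauses above (the code itself is an illustrative sketch, not part of the original text), forward chaining can be expressed by repeatedly adding the head of any clause whose body is already recognised.

    def forward_chain(facts, clauses):
        """clauses: list of (head, body) pairs, where body is a set of symbols.

        Whenever every term in a body is present in the description, the body
        is recognised and the head is added; this repeats until nothing changes."""
        facts = set(facts)
        changed = True
        while changed:
            changed = False
            for head, body in clauses:
                if body <= facts and head not in facts:
                    facts.add(head)
                    changed = True
        return facts

    clauses = [("z", {"a", "b"}),     # z <- a and b, as in (12.3)
               ("c", {"z", "w"})]     # c <- z and w, as in (12.4)
    print(forward_chain({"a", "b", "w"}, clauses))   # the concept c is recognised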
When clauses are executed in a backward chaining manner, they can either verify that
the input object belongs to a concept or produce instances of concepts. In other words,
the same clauses can be run forwards to recognise instances of a concept or backwards to
generate them.
larger(hammer, feather).
denser(hammer, feather).
heavier(A, B) :- denser(A, B), larger(A, B).
:- heavier(hammer, feather).
:- larger(hammer, feather).

Fig. 12.9: A resolution proof tree from Muggleton & Feng (1990).
According to subsumption, the least general generalisation of two example clauses is
obtained by matching their literals and dropping those that cannot be matched. However,
given the background knowledge, we can see that this is an over-generalisation; a better
generalisation retains, in a more general form, the information carried by the unmatched
literals. The moral is that a generalisation should only be made when the relevant
background knowledge suggests it: observing one example, a background clause can be
used as a rewrite rule to produce a generalisation which also subsumes the other example.
Buntine drew on earlier work by Sammut (Sammut & Banerji, 1986) in constructing his
generalised subsumption. Muggleton & Buntine (1988) took this approach a step further
and realised that, through the application of a few simple rules, they could invert resolution
as Plotkin and Reynolds had wished. Here are two of the rewrite rules in propositional
form:
Given a set of clauses, the body of one of which is completely contained in the bodies
of the others, such as:

X ← A ∧ B ∧ C ∧ D ∧ E
Y ← A ∧ B ∧ C

absorption replaces the common literals in the larger clause by the head of the smaller one,
giving:

X ← Y ∧ D ∧ E
Y ← A ∧ B ∧ C

Intra-construction takes a group of rules all having the same head, such as:

X ← B ∧ C ∧ D ∧ E
X ← A ∧ B ∧ D ∧ F

and replaces them with rules that share a newly invented predicate:

X ← B ∧ D ∧ Z
Z ← C ∧ E
Z ← A ∧ F
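Treating a propositional clause as a pair of a head and a set of body literals, the two operators can be sketched as follows (an illustrative reconstruction in Python, not code from the original).

    def absorption(general, specific):
        """If the body of `general` is contained in the body of `specific`,
        replace that common part of `specific` by the head of `general`."""
        g_head, g_body = general
        s_head, s_body = specific
        assert g_body <= s_body
        return (s_head, (s_body - g_body) | {g_head})

    def intra_construction(rules, new_symbol="Z"):
        """Rules sharing one head: factor out their common body literals and
        invent a new predicate for the parts in which they differ."""
        head = rules[0][0]
        common = set.intersection(*(body for _, body in rules))
        new_rules = [(new_symbol, body - common) for _, body in rules]
        return (head, common | {new_symbol}), new_rules

    print(absorption(("Y", {"A", "B", "C"}), ("X", {"A", "B", "C", "D", "E"})))
    print(intra_construction([("X", {"B", "C", "D", "E"}),
                              ("X", {"A", "B", "D", "F"})]))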
These two operations can be interpreted in terms of the proof tree shown in Figure 12.9.
Resolution accepts two clauses and applies unification to find the maximal common unifier.
In the diagram, two clauses at the top of a V are resolved to produce the resolvent at
the apex of the V. Absorption accepts the resolvent and one of the other two clauses to
produce the third. Thus, it inverts the resolution step.
Intra-construction automatically creates a new term in its attempt to simplify descriptions. This is an essential feature of inverse resolution since there may be terms in a theory
that are not explicitly shown in an example and may have to be invented by the learning
program.
12.7.1 Discussion
These methods and others (Muggleton & Feng, 1990; Quinlan, 1990) have made relational
learning quite efficient. Because the language of Horn-clause logic is more expressive than
the other concept description languages we have seen, it is now possible to learn far more
complex concepts than was previously possible. A particularly important application of
this style of learning is knowledge discovery. There are now vast databases accumulating
information on the genetic structure of human beings, aircraft accidents, company inventories, pharmaceuticals and countless more. Powerful induction programs that use expressive
languages may be a vital aid in discovering useful patterns in all these data.
For example, the realities of drug design require descriptive powers that encompass
stereo-spatial and other long-range relations between different parts of a molecule, and can
generate, in effect, new theories. The pharmaceutical industry spends over $250 million
for each new drug released onto the market. The greater part of this expenditure reflects
today's unavoidably scatter-gun synthesis of compounds which might possess biological
activity. Even a limited capability to construct predictive theories from data promises high
returns.
The relational program Golem was applied to the drug design problem of modelling
structure-activity relations (King et al., 1992). The training data for the program were 44
trimethoprim analogues and their observed inhibition of E. coli dihydrofolate reductase.
A further 11 compounds were used as unseen test data. Golem obtained rules that were
statistically more accurate on the training data, and also better on the test data, than a Hansch linear regression model. Importantly, relational learning yields understandable rules.
12.8 CONCLUSION
Thus, when approaching a machine learning problem, the choice of knowledge representation formalism is just as important as the choice of learning algorithm.
13
Learning to Control Dynamic Systems
Tanja Urbančič (1) and Ivan Bratko (1,2)
(1) Jožef Stefan Institute and (2) University of Ljubljana
13.1 INTRODUCTION
The emphasis in controller design has shifted from the precision requirements towards the
o
following objectives(Leitch & Francis, 1986; Enterline, 1988; Verbruggen and Astr m,
om, 1991; Sammut & Michie, 1991; AIRTC92, 1992):
1989; Astr
control without complete prior knowledge (to extend the range of automatic control
applications),
reliability, robustness and adaptivity (to provide successful performance in the real-world environment),
transparency of solutions (to enable understanding and verification),
generality (to facilitate the transfer of solutions to similar problems),
realisation of specified characteristics of system response (to please customers).
These problems are tackled in different ways, for example by using expert systems (Dvorak, 1987), neural networks (Miller et al., 1990; Hunt et al., 1992), fuzzy control (Lee,
1990) and genetic algorithms (Renders & Nordvik, 1992). However, in the absence of a
complete review and comparative evaluations, the decision about how to solve a problem
at hand remains a difficult task and is often taken ad hoc. Leitch (1992) has introduced a
step towards a systematisation that could provide some guidelines. However, most of the
approaches provide only partial fulfilment of the objectives stated above. Taking into account also the increasing complexity of modern systems together with real-time requirements,
one must agree with Schoppers (1991), that designing control means looking for a suitable
compromise. It should be tailored to the particular problem specications, since some
objectives are normally achieved at the cost of some others.
Another important research theme is concerned with the replication of human operators'
subconscious skill. Experienced operators manage to control systems that are extremely
difficult to model and control by classical methods. Therefore, a natural choice
would be to mimic such skilful operators. One way of doing this is by modelling the
Address for correspondence: Jožef Stefan Institute, Univerza v Ljubljani, 61111 Ljubljana, Slovenia
operator's strategy in the form of rules. The main problem is how to establish the appropriate
set of rules: While gaining skill, people often lose their awareness of what they are actually
doing. Their knowledge is implicit, meaning that it can be demonstrated and observed, but
hardly ever described explicitly in a way needed for the direct transfer into an automatic
controller. Although the problem is general, it is particularly tough in the case of control
of fast dynamic systems where subconscious actions are more or less the prevailing form
of performance.
Fig. 13.1: Three modes of learning to control a dynamic system: (a) Learning from scratch,
(b) Exploiting partial knowledge, (c) Extracting human operator's skill.
The aim of this chapter is to show how the methods of machine learning can help
in the construction of controllers and in bridging the gap between the subcognitive skill
and its machine implementation. First successful attempts in learning control treated the
controlled system as a black box (for example Michie & Chambers, 1968), and a program
learnt to control it by trials. Due to the black box assumption, initial control decisions
are practically random, resulting in very poor performance in the first experiments. On
the basis of experimental evidence, control decisions are evaluated and possibly changed.
Learning takes place until a certain success criterion is met. Later on, this basic idea
was implemented in different ways, ranging from neural networks (for example Barto
et al., 1983; Anderson, 1987) to genetic algorithms (for example Odetayo & McGregor,
1989). Recently, the research has concentrated on removing the deficiencies inherent to these
methods, like the obscurity and unreliability of the learned control rules (Bain, 1990;
Sammut & Michie, 1991; Sammut & Cribb, 1990) and time-consuming experimentation
(Sammut, 1994), while still presuming no prior knowledge. Until recently, this kind of
learning control has remained predominant. However, some of the mentioned deficiencies
are closely related to the black box assumption, which is hardly ever necessary in such
a strict form. Therefore, the latest attempts take advantage of the existing knowledge,
being explicit and formulated at the symbolic level (for example Urbančič & Bratko, 1992;
Bratko, 1993; Varšek et al., 1993), or implicit and observable just as operator's skill (Michie
et al., 1990; Sammut et al., 1992; Camacho & Michie, 1992; Michie & Camacho, 1994).
The structure of the chapter follows this introductory discussion. We consider three
modes of learning to control a system. The three modes, illustrated in Figure 13.1, are:
(a) The learning system learns to control a dynamic system by trial and error, without any
prior knowledge about the system to be controlled (learning from scratch).
(b) As in (a), but the learning system exploits some partial explicit knowledge about the
dynamic system.
(c) The learning system observes a human operator and learns to replicate the operator's
skill.
Experiments in learning to control are popularly carried out using the task of controlling
the pole-and-cart system. In Section 13.2 we therefore describe this experimental domain.
Sections 13.3 and 13.4 describe two approaches to learning from scratch: BOXES and
genetic learning. In Section 13.5 the learning system exploits partial explicit knowledge.
In Section 13.6 the learning system exploits the operator's skill.
13.2 EXPERIMENTAL DOMAIN
The main ideas presented in this chapter will be illustrated by using the pole balancing
problem (Anderson & Miller, 1990) as a case study. So let us start with a description of this
control task which has often been chosen to demonstrate both classical and nonconventional
control techniques. Besides being an attractive benchmark, it also bears similarities with
tasks of significant practical importance such as two-legged walking, and satellite attitude
control (Sammut & Michie, 1991). The system consists of a rigid pole and a cart. The cart
can move left and right on a bounded track. The pole is hinged to the top of the cart so that
it can swing in the vertical plane. In the AI literature, the task is usually just to prevent the
pole from falling and to keep the cart position within the specified limits, while the control
regime is that of bang-bang. The control force has a fixed magnitude and all the controller
can do is to change the force direction at regular time intervals.
Classical methods (for example Kwakernaak & Sivan, 1972) can be applied to controlling the system under several assumptions, including complete knowledge about the
system, that is a differential equations model up to numerical values of its parameters.
Alternative approaches tend to weaken these assumptions by constructing control rules in
two essentially different ways: by learning from experience, and by qualitative reasoning.
The first one will be presented in more detail later in this chapter. The second one will
be described here only up to the level needed for comparison and understanding, giving a
general idea about two solutions of this kind:
The first is Makarovič's rule, a nested rule that tests the state variables in turn for critical
values and responds with a LEFT or RIGHT push accordingly.
The second is Bratko's rule, given by Equations (13.1)-(13.4), where the ref subscripts
denote reference values to be reached, the goal subscripts denote goal values required for
successful control, and the relation between them is a monotonically increasing function.
When the system is to be controlled under the bang-bang regime, the control action
is determined by the sign of the force F.
Assuming, without loss of generality, that the reference and goal values are zero, Equations
(13.1)-(13.4) can be simplified and normalised, resulting in a single expression (13.5) for
the sign of the control force.
Both Makarovič's and Bratko's rules successfully control the inverted pendulum, provided
appropriate values of the numerical parameters are chosen. Moreover, there exists a
set of parameter values that makes Bratko's rule equivalent to the bang-bang variant of a
classical control rule using the sign of the pole-placement controller output (Džeroski, 1989).
13.3 LEARNING TO CONTROL FROM SCRATCH: BOXES
In learning approaches, trials are performed in order to gain experimental evidence about
different control decisions. A trial starts with the system positioned in an initial state chosen
from a specified region, and lasts until failure occurs or successful control is performed
for a prescribed maximal period of time. Failure occurs when the cart position or pole
inclination exceeds the given boundaries. The duration of a trial is called survival time.
Learning is carried out by performing trials repeatedly until a certain success criterion is
met. Typically, this criterion requires successful control within a trial to exceed a prescribed
period of time. Initial control decisions are usually random. On the basis of experimental
evidence, they are evaluated and possibly changed, thus improving control quality. This
basic idea has been implemented in many different ways, for example in BOXES (Michie
& Chambers, 1968), Adaptive Critic reinforcement method (Barto et al., 1983), CART
(Connell & Utgoff, 1987), multilayer connectionist approach (Anderson, 1987) and many
others. Geva and Sitte (1993a) provide an exhaustive review. Here, two methods will be
described in more detail: BOXES (Michie & Chambers, 1968) and genetic learning of
control (Varšek et al., 1993). The choice of methods presented here is subjective. It was
guided by our aim to describe recent efforts in changing or upgrading the original ideas.
We chose BOXES because it introduced a learning scheme that was inspirational to much
further work.
13.3.1 BOXES
The BOXES program (Michie & Chambers, 1968) learns a state-action table, i.e. a set
of rules that specify the action to be applied to the system in a given state. Of course this
would not be possible for the original, infinite state space. Therefore, the state space is
divided into boxes. A box is defined as the Cartesian product of the values of the system
variables, where all the values belong to an interval from a predefined partition. A typical
partition of the four-dimensional state space into boxes distinguishes 3 values of the cart
position, 3 of the cart velocity, 6 of the pole angle and 3 of the pole angular velocity, giving
162 boxes. All the points within a box are mapped to the same control decision. During one
trial, the state-action table is fixed. When a failure is detected a trial ends. Decisions are
evaluated with respect to the accumulated numeric information:
how many times the system entered a particular state, how successful it was after particular
decisions, etc. The following information is accumulated for each box:
left life: the weighted sum of survival times after the left decision was taken in this state during previous trials,
right life: the same for the right decision,
left usage: the weighted sum of the number of left decisions taken in this state during previous trials,
right usage: the same for right decisions,
entry times: the times (i.e. steps) at which the system enters this state during the current trial.
After a trial the program updates these figures: for the states in which the left decision
was taken, the left life and left usage are updated from the current trial's survival times
and decision counts (and similarly for the right decision), with the contributions of earlier
trials progressively down-weighted.
These values are used for a numeric evaluation of the success of both actions; the estimates
are computed after a trial for each qualitative state. A user-defined parameter adjusts the
relative importance of exploitation and exploration. The lowest reasonable value for this
parameter is 1, which corresponds to pure exploitation without any desire to explore the
untested. By increasing it, the system's mentality changes towards that of an experimentalist:
the system becomes willing to experiment with actions that from past experience look inferior.
A suitable compromise value is needed for overall good performance. For the classical
pole-and-cart problem the optimal value was found experimentally: the learning rate is
relatively stable for values between 1.4 and 1.8, and it degrades rapidly when the parameter
decreases below 1.4 or increases above 1.8. The following improvement of the Law &
Sammut rule with respect to the Michie & Chambers rule was reported: on average
over 20 experiments, the original BOXES needed 557 trials to learn to control the system,
whereas the Law & Sammut rule needed 75 trials (with the optimal parameter value). In
testing the stability of the Law & Sammut rule, it was found to be slightly, but not
significantly, sensitive to small changes in the learning problem, such as changing the
number of boxes from 162 to 225, or introducing asymmetry in the force (left push twice
the right push).
Geva and Sitte (1993a) carried out exhaustive experiments concerning the same topic.
With the appropriate parameter setting the BOXES method performed as well as the
Adaptive Critic reinforcement learning (Barto et al., 1983). They obtained an average of 52
trials over 1000 learning experiments (standard deviation 32).
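To make the state-action table concrete, the following sketch (hypothetical Python; the partition thresholds are arbitrary placeholders, not those used by Michie and Chambers) maps a four-variable state to its box and sets up the per-box statistics described above.

    import bisect

    # Arbitrary illustrative partitions: 3 values of cart position, 3 of cart
    # velocity, 6 of pole angle and 3 of angular velocity give 162 boxes.
    PARTITIONS = [
        [-0.8, 0.8],                          # cart position
        [-0.5, 0.5],                          # cart velocity
        [-6.0, -1.0, 0.0, 1.0, 6.0],          # pole angle
        [-50.0, 50.0],                        # pole angular velocity
    ]

    def box_index(state):
        """Map a (position, velocity, angle, angular velocity) state to a box."""
        index = 0
        for value, cuts in zip(state, PARTITIONS):
            index = index * (len(cuts) + 1) + bisect.bisect(cuts, value)
        return index

    # Per-box statistics: left/right life and usage, plus the entry times
    # recorded during the current trial.
    stats = {b: {"left_life": 0.0, "right_life": 0.0,
                 "left_usage": 0.0, "right_usage": 0.0, "entries": []}
             for b in range(162)}

    print(box_index((0.1, -0.2, 2.0, 10.0)))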
13.4 LEARNING TO CONTROL FROM SCRATCH: GENETIC LEARNING
In genetic algorithms, candidate solutions are usually represented as binary coded strings
of fixed length. The initial population is
generated at random. What happens during cycles called generations is as follows. Each
member of the population is evaluated using a fitness function. After that, the population
undergoes reproduction. Parents are chosen stochastically, but strings with a higher value
of the fitness function have a higher probability of contributing an offspring. Genetic operators,
such as crossover and mutation, are applied to the parents to produce offspring. A subset of
the population is replaced by the offspring, and the process continues on this new generation. Through recombination and selection, the evolution converges to highly fit population
members representing near-optimal solutions to the considered problem.
When controllers are to be built without having an accurate mathematical model of the
system to be controlled, two problems arise: first, how to establish the structure of the
controller, and second, how to choose numerical values for the controller parameters. In the
following, we present a three-stage framework proposed by Varšek et al. (© 1993 IEEE).
First, control rules, represented as tables, are obtained without prior knowledge about
the system to be controlled. Next, if-then rules are synthesized by structuring information
encoded in the tables, yielding comprehensible control knowledge. This control knowledge
has adequate structure, but it may be non-operational because of inadequate settings of
its numerical parameters. Control knowledge is finally made operational by fine-tuning
numerical parameters that are part of this knowledge. The same fine-tuning mechanism
can also be applied when the available partial domain knowledge suffices to determine the
structure of a control rule in advance.
In this approach, the control learning process is considered to be an instance of a
combinatorial optimisation problem. In contrast to the previously described learning
approach in BOXES, where the goal is to maximise survival time, here the goal is to
maximise survival time, and, simultaneously, to minimise the discrepancy between the
desired and actual system behaviour. This criterion is embodied in a cost function, called
the raw fitness function, used to evaluate candidate control rules during the learning process.
Raw fitness is computed from the survival time attained by a candidate rule, normalised by
the maximal survival time, and penalised by the discrepancy between the desired and the
actual system behaviour.
Table 13.1: (© 1993 IEEE) Control performance of the GA-induced BOXES-like rule, the
compressed rule, the fine-tuned rule, and the original Makarovič's rule.

Rule          Fitness   Failures [%]   Avg. survival time [steps]
GA+1010       0.9222    0              1 000 000
GA+105        0.5572    44             665 772
Tuned+1010    0.9505    0              1 000 000
Tuned+105     0.9637    0              1 000 000
To summarise, successful and comprehensible control rules were synthesized automatically in three phases. Here, a remark should be made about the number of trials performed.
In this research it was very high, for the following reasons. First, the emphasis was put
on the reliability of the learned rules and this, of course, demands much more experimentation
in order to ensure good performance over a wide range of initial states. In our recent experiments with a narrower range of initial states the number of trials was considerably
reduced without affecting the reliability. Second, the performance of the rules after the first
phase was practically the same as that of the rules after the third phase. Maybe the same
controller structure could be obtained in the second phase from less perfect rules. However,
it is difficult to know when the learned evidence suffices. To conclude, the exhaustiveness
of these experiments was consciously accepted by the authors in order to show that 100%
reliable rules can be learned from scratch.
13.5 EXPLOITING PARTIAL EXPLICIT KNOWLEDGE
13.5.1 BOXES with partial knowledge
To see how adding domain knowledge affects the speed and results of learning, three series of
experiments were done by Urbančič & Bratko (1992). The following variants of learning
control rules with the program BOXES were explored:
Version   Length of learning [av. num. of trials]   Av. reliability [ratio]   Av. survival [steps]
A.        427                                       3/20                      4894
B.        50                                        10/20                     7069
C.        197                                       4/20                      4679
Table 13.4: (© 1993 IEEE) Control performance of Bratko's control rule (a) with parameter
values found by a GA, and (b) with parameter values that make the rule equivalent to the
bang-bang variant of the classical control rule.

(a)
Parameters            Failures [%]   Avg. survival time [steps]   Fitness
0.45   0.60   22.40   0              1,000,000                    0.9980
0.30   0.45   19.00   0              1,000,000                    0.9977
0.25   0.40   13.65   0              1,000,000                    0.9968
... the action was performed some time later in response to the stimulus. But how
do we know what the stimulus was? Unfortunately there is no way of knowing.
The plane's state variables included elevation, elevation speed, azimuth, azimuth speed,
airspeed etc. The possible control actions affected four control variables: rollers, elevator,
thrust and flaps. The problem was decomposed into four induction problems, one for each
of the four control variables. These four learning problems were assumed independent.
The control rules were induced by the C4.5 induction program (Quinlan, 1987a). The
total data set consisted of 90 000 events collected from three pilots and 30 flights by each
pilot. The data was segmented into seven stages of the complete flight plan and separate
rules were induced for each stage. Separate control rules were induced for each of the
three pilots. It was decided that it was best not to mix the data corresponding to different
individuals because different pilots carry out their manoeuvres in different styles.
There was a technical difficulty in using C4.5 in that it requires discrete class values
whereas in the flight problem the control variables are mostly continuous. The continuous
ranges therefore had to be converted to discrete classes by segmentation into intervals. This
segmentation was done manually. A more natural learning tool for this induction task would
therefore be one that allows a continuous class, such as the techniques for learning regression
trees implemented in the programs CART (Breiman et al., 1984) and Retis (Karalič, 1992).
Sammut et al. (1992) state that control rules for a complete flight were successfully
synthesized, resulting in an inductively constructed autopilot. This autopilot flies the Cessna
in a manner very similar to that of the human pilot whose data was used to construct the
rules. In some cases the autopilot flies more smoothly than the pilot.
We have observed a clean-up effect noted in Michie, Bain and Hayes-Michie
(1990). The flight log of any trainer will contain many spurious actions due to
human inconsistency and corrections required as a result of inattention. It appears
that effects of these examples are pruned away by C4.5, leaving a control rule
which flies very smoothly.
It is interesting to note the comments of Sammut et al. (1992) regarding the contents of
the induced rules:
One of the limitations we have encountered with existing learning algorithms is
that they can only use the primitive attributes supplied in the data. This results in
control rules that cannot be understood by a human expert. The rules constructed
by C4.5 are purely reactive. They make decisions on the basis of the values in
a single step of simulation. The induction program has no concept of time and
causality. In connection with this, some strange rules can turn up.
13.6.2 Learning to control container cranes
The world market requires container cranes with as high capacity as possible. One way to
meet this requirement is to build bigger and faster cranes; however, this approach is limited
by construction problems as well as by unpleasant feelings drivers have when moving with
high speeds and accelerations. The other solution is to make the best of the cranes of
reasonable size, meaning in the first place the optimisation of the working cycle and
efficient swing damping.
It is known that experienced crane drivers can perform very quickly as long as everything
goes as expected, while each subsequent correction considerably affects the time needed for
accomplishing the task. Also, it is very difficult to drive for hours and hours with the same
attention, not to mention the years of training needed to gain the required skill. Consequently,
interest in cooperation has been reported by the chief designer of Metalna Machine Builders,
Steel Fabricators and Erectors, Maribor, which is known world-wide for its large-scale
container cranes. They are aware of the insufficiency of classical automatic controllers (for
example Sakawa & Shinido, 1982), which can be easily disturbed in the presence of wind or
other unpredictable factors. This explains their interest in what can be offered by alternative
methods.
Impressive results have been obtained by predictive fuzzy control (see Yasunobu &
Hasegawa, 1986). Their method involves steps such as describing human operator strategies, defining the meaning of linguistic performance indices, defining the models for
predicting operation results, and converting the linguistic human operator strategies into
predictive fuzzy control rules.
In general, these tasks can be very time consuming, so our focus of attention was on
the automated synthesis of control rules directly from the recorded performance of well-trained operators. In this we are following the work of Michie et al. (1990), Sammut
et al. (1992) and Michie & Camacho (1994), who confirmed the findings of Sammut et
al. (1992) using the ACM public-domain simulation of an F-16 combat plane. When
trying to solve the crane control problem in a manner similar to their autopilot construction,
we first had a group of subjects learn to control a simulation of the crane.
They were given just the instrument version; in fact, they didn't know which dynamic
system underlay the simulator. In spite of that, they succeeded in learning the task, although
the differences in the time needed for this, as well as in the quality of control, were remarkable.
To learn to control the crane reasonably well, it took a subject between about 25 and 200
trials. This amounts to about 1 to 10 hours of real time spent with the simulator.
Our aim was to build automatic controllers from human operators' traces. We applied
RETIS, a program for regression tree construction (Karalič, 1992), to the recorded data.
The first problem to solve was how to choose an appropriate set of learning examples out
of this enormous set of recorded data. After some initial experiments we found, as in
Sammut et al. (1992), that it was beneficial to use different trials performed by the same
student, since it was practically impossible to find trials perfect in all aspects even among
the successful cases.
In the preparation of the learning data, performance was sampled every 0.1 seconds. The
actions were related to the states with a delay which was also 0.1 seconds. The performance
of the best autodriver induced in these initial experiments can be seen in Figure 13.3. It
resulted from 798 learning examples for one of the two control forces and 1017 examples
for the other. The control strategy
it uses is rather conservative, minimising the swinging, but at the cost of time. In further
experiments, we will try to build an autodriver which will successfully cope with load
swinging, resulting in faster and more robust performance.
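This data preparation step can be sketched as follows (illustrative Python; the trace format is an assumption): states sampled every 0.1 s are paired with the action the operator took one sample later, and the resulting pairs are the examples from which the regression tree is grown.

    def make_examples(trace, delay_steps=1):
        """trace: list of (state_vector, action) pairs sampled every 0.1 s.

        Each state is paired with the action recorded delay_steps samples
        later, reflecting the assumed 0.1 s delay between state and action."""
        examples = []
        for t in range(len(trace) - delay_steps):
            state, _ = trace[t]
            _, later_action = trace[t + delay_steps]
            examples.append((state, later_action))
        return examples

    # Toy trace: (state, applied control force) at 0.1 s intervals.
    trace = [((0.0, 0.0), 1.0), ((0.1, 0.2), 0.5), ((0.2, 0.1), -0.5)]
    print(make_examples(trace))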
Fig. 13.3: The crane simulator response to the control actions of the autodriver.
These experiments indicate that further work is needed regarding the following questions: what is the actual delay between the system's state and the operator's action; the robustness of the induced rules with respect to initial states; the comprehensibility of the induced control
rules; and the induction of higher-level conceptual descriptions of control strategies.
13.7 CONCLUSIONS
In this chapter we have treated the problem of controlling a dynamic system mainly as a
classification problem. We introduced three modes of learning to control, depending on
the information available to the learner. In addition to the usual examples of the behaviour
of the controlled system, this information included explicit symbolic knowledge about
the controlled system and example actions performed by a skilled human operator.
One point that the described experiments emphasise is the importance of (possibly
incomplete) partial knowledge about the controlled system. Methods described in this
chapter enable natural use of partial symbolic knowledge. Although incomplete, this
knowledge may drastically constrain the search for control rules, thereby eliminating in
advance large numbers of totally unreasonable rules.
Our choice of the approaches to learning to control in this chapter was subjective.
Among a large number of known approaches, we chose for more detailed presentation
those that: first, we had personal experimental experience with, and second, that enable
the use of (possibly partial) symbolic prior knowledge. In all the approaches described,
there was an aspiration to generate comprehensible control rules, sometimes at the cost of
an additional learning stage.
An interesting theme, also described, is behavioural cloning, where a human's behavioural skill is cloned by a learned rule. Behavioural cloning is interesting both from the
practical and the research points of view. Much further work is needed before behavioural
cloning may become routinely applicable in practice.
Behavioural cloning is essentially the regression of the operator's decision function from
examples of his/her decisions. It is relevant in this respect to notice a similarity between
this and traditional top-down derivation of control from a detailed model of the system to
be controlled. This similarity is illustrated by the fact that such a top-down approach for
the pole-and-cart system gives the known linear control rule F = k1·x + k2·ẋ + k3·θ + k4·θ̇,
which looks just like a regression equation.
As stated in the introduction, there are several criteria for, and goals of, learning to
control, and several assumptions regarding the problem. As shown by the experience
with various learning approaches, it is important to clarify very precisely what these
goals and assumptions really are in the present problem. Correct statement of these may
considerably affect the efficiency of the learning process. For example, it is important to
consider whether some (partial) symbolic knowledge exists about the domain, and not to
assume automatically that it is necessary, or best, to learn everything from scratch. In some
approaches reviewed, such incomplete prior knowledge could also result from a previous
stage of learning when another learning technique was employed.
Acknowledgements: This work was supported by the Ministry of Science and Technology
of Slovenia. The authors would like to thank Donald Michie for comments and suggestions.
APPENDICES
A Dataset availability
The public domain datasets are listed below with an anonymous ftp address. If you do
not have access to these, then you can obtain the datasets on diskette from Dr. P. B. Brazdil,
University of Porto, Laboratory of AI and Computer Science, R. Campo Alegre 823,
4100 Porto, Portugal. The main source of datasets is ics.uci.edu (128.195.1.1) - the UCI
Repository of Machine Learning Databases and Domain Theories which is managed by
D. W. Aha. The following datasets (amongst many others) are in pub/machine-learning-databases:
australian credit (credit-screening/crx.data statlog/australian)
diabetes (pima-indian-diabetes)
dna (molecular-biology/splice-junction-gene-sequences)
heart disease (heart-disease/ statlog/heart)
letter recognition
image segmentation (statlog/segment)
shuttle control (statlog/shuttle)
LANDSAT satellite image (statlog/satimage)
vehicle recognition (statlog/vehicle)
The datasets were often processed, and the processed form can be found in the statlog subdirectory where mentioned above. In addition, the processed datasets (as used
in this book) can also be obtained from ftp.strath.ac.uk (130.159.248.24) in directory
Stams/statlog. These datasets are australian, diabetes, dna, german, heart, letter, satimage,
segment and shuttle, and there are associated .doc files as well as a split into train and
test set (as used in the StatLog project) for the larger datasets.
B Software sources and details
Many of the classical statistical algorithms are available in standard statistical packages.
Here we list some public domain versions and sources, and some commercial packages. If
a simple rule has been adopted for parameter selection, then we have also described this.
10 runs were made, and cross-validation within the training set was used to decide which
of the 10 runs was best. The random number seed for each run was equal to the run number
(1..10). Having picked the best net by cross-validation within the training set, these nets
were then used for supplying the performance figures on the whole training set and on the
test set. The figures averaged for cross-validation performance measures were also for the
best nets found during local cross-validation within the individual training sets.
Training proceeds in four stages, with different stages using different subsets of the
training data, larger each time. Training proceeds until no improvement in error is achieved
for a run of updates.
The RRNN simulator provided the radial basis function code. This is freely available at
the time of writing by anonymous ftp from uk.ac.aston.cs (134.151.52.106). This package
also contains MLP code using the conjugate gradient algorithm, as does AutoNet, and
several other algorithms. Reports on benchmark exercises are available for some of these
MLP programs in Rohwer (1991c).
The centres for the radial basis functions were selected randomly from the training
data, except that centres were allocated to each class in proportion to the number of
representatives of that class in the dataset, with at least one centre provided to each class in
any case. Each Gaussian radius was set to the distance to the nearest neighbouring centre.
The linear system was solved by singular value decomposition.
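As an illustration of this construction (a minimal sketch in Python/NumPy under the
assumptions stated in the text, not the RRNN code itself), the fragment below draws centres
from the training data in proportion to class frequency, sets each radius to the distance to
the nearest other centre, and obtains the output weights from a least-squares fit computed
via singular value decomposition, using a 1-of-c target coding.

    import numpy as np

    def build_rbf(X, y, n_centres, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Allocate centres to classes in proportion to class frequency,
        # with at least one centre per class, drawn at random from the data.
        classes, counts = np.unique(y, return_counts=True)
        per_class = np.maximum(1, np.round(n_centres * counts / len(y)).astype(int))
        centres = np.vstack([X[y == c][rng.choice(n, size=k, replace=False)]
                             for c, n, k in zip(classes, counts, per_class)])
        # Each Gaussian radius is the distance to the nearest neighbouring centre.
        d = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        radii = d.min(axis=1)
        # Design matrix of Gaussian activations; output weights by least squares,
        # which numpy.linalg.lstsq computes through an SVD of the design matrix.
        act = np.exp(-np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1) ** 2
                     / (2.0 * radii ** 2))
        targets = (y[:, None] == classes[None, :]).astype(float)  # 1-of-c coding
        weights, *_ = np.linalg.lstsq(act, targets, rcond=None)
        return centres, radii, weights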
For the small datasets the number of centres and their locations were selected by training
with various numbers of centres, using 20 different random number seeds for each number,
and evaluating with a cross validation set withheld from the training data, precisely as was
done for the MLPs. For the large datasets, time constraints were met by compromising
rigour, in that the test set was used for the cross-validation set. Results for these sets
should therefore be viewed with some caution. This was the case for all data sets, until
those for which cross-validation was explicitly required (australian, diabetes, german, isoft,
segment) were repeated with cross-validation to select the number of centres carried out
within the training set only.
The rough guideline followed for deciding on numbers of centres to try is that the
number should be about 100 times the dimension of the input space, unless that would be
more than 10% of the size of the dataset.
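A minimal sketch of that guideline and of the seed-based selection (an illustrative fragment
only, not the code used in the project; build_and_score is a hypothetical helper that trains
an RBF with a given number of centres and random seed and returns its cross-validation
error):

    def guideline_n_centres(input_dim, n_samples):
        # About 100 times the input dimension, capped at 10% of the dataset size.
        return min(100 * input_dim, max(1, n_samples // 10))

    def choose_n_centres(candidates, build_and_score, n_seeds=20):
        # For each candidate count, train with 20 different random seeds and keep
        # the count whose best run has the lowest cross-validation error (the text
        # does not say exactly how the 20 runs were aggregated).
        best_k, best_err = None, float("inf")
        for k in candidates:
            err = min(build_and_score(k, seed) for seed in range(n_seeds))
            if err < best_err:
                best_k, best_err = k, err
        return best_k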
LVQ is available from the Laboratory of Computer Science and Information Science,
Helsinki University of Technology, Rakentajanaukio 2 C, SF-02150 Espoo, Finland. It
can also be obtained by anonymous ftp from cochlea.hut.fi (130.233.168.48).
CART is a licensed product of California Statistical Software Inc., 961 Yorkshire Court,
Lafayette, CA 94549, USA.
C4.5 is available from J. R. Quinlan, Dept. of Computer Science, Madsen Building F09,
University of Sydney, New South Wales, Australia.
The parameters used were the defaults. The heuristic was information gain.
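For reference, the information gain heuristic mentioned here is the usual entropy-based
splitting criterion; a small illustrative version (not taken from the C4.5 sources) is:

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy (in bits) of a collection of class labels.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent_labels, child_label_lists):
        # Entropy of the parent node minus the size-weighted entropy of the
        # children produced by a candidate split (larger values are preferred).
        n = len(parent_labels)
        weighted = sum(len(child) / n * entropy(child) for child in child_label_lists)
        return entropy(parent_labels) - weighted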
However, since the StatLog project was completed, there is a more recent version of C4.5,
so the results contained in this book may not be exactly reproducible.
NewID and CN2 are available from Robin Boswell and Tim Niblett, respectively at The
Turing Institute, George House, 36 North Hanover Street, Glasgow G1 2AD, UK.
For NewID:
For CN2:
although the two Belgian power datasets were run with the above parameters set to
(3,5000) and (3,2000).
Kohonen was written by J. Paul, Dhamstr. 20, W-5948 Schmallenberg, Germany for a PC
with an attached transputer board.
k-NN is still under development. For all datasets, except the satellite image dataset,
.
Distance was scaled in a class dependent manner, using the standard deviation. Further
details can be obtained from C. C. Taylor, Department of Statistics, University of Leeds,
Leeds LS2 9JT, UK.
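One plausible reading of this class-dependent scaling (a sketch only; the exact scheme
used in the project may differ) is that, when measuring the distance from a query point to
a training example of class c, each attribute is divided by its standard deviation within
class c:

    import numpy as np

    def class_scaled_distances(x, X_train, y_train):
        # Distance from query x to every training example, with attributes scaled
        # by the standard deviation of the example's own class.
        dists = np.empty(len(X_train))
        for c in np.unique(y_train):
            idx = np.where(y_train == c)[0]
            sd = X_train[idx].std(axis=0)
            sd[sd == 0] = 1.0                 # guard against constant attributes
            diff = (X_train[idx] - x) / sd
            dists[idx] = np.sqrt((diff ** 2).sum(axis=1))
        return dists

    # 1-NN prediction under this scaling:
    # predicted = y_train[np.argmin(class_scaled_distances(x, X_train, y_train))]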
C Contributors
This volume is based on the StatLog project, which involved many workers at over 13
institutions. In this list we aim to include those who contributed to the Project and the
Institutions at which they were primarily based at that time.
G. Nakhaeizadeh, J. Graf, A. Merkel, H. Keller, Laue, H. Ulrich, G. Kressel, Daimler-Benz AG
R.J. Henery, D. Michie, J.M.O. Mitchell, A.I. Sutherland, R. King, S. Haque, C. Kay,
D. Young, W. Buntine, B. D. Ripley, University of Strathclyde
S.H. Muggleton, C. Feng, T. Niblett, R. Boswell, Turing Institute
H. Perdrix, T. Brunet, T. Marre, J-J Cannat, ISoft
J. Stender, P. Ristau, D. Picu, I. Chorbadjiev, C. Kennedy, G. Ruedke, F. Boehme, S.
Schulze-Kremer, Brainware GmbH
P.B. Brazdil, J. Gama, L. Torgo, University of Porto
R. Molina, N. Pérez de la Blanca, S. Acid, L. M. de Campos, Gonzalez, University of Granada
F. Wysotzki, W. Mueller, Der, Buhlau, Schmelowski, Funke, Villman, H. Herzheim, B.
Schulmeister, Fraunhofer Institute
References
Acid, S., Campos, L. M. d., González, A., Molina, R., and Pérez de la Blanca, N. (1991a). Bayesian learning algorithms in CASTLE. Report no. 91-4-2, University of Granada.
Acid, S., Campos, L. M. d., González, A., Molina, R., and Pérez de la Blanca, N. (1991b). CASTLE: Causal structures from inductive learning. Release 2.0. Report no. 91-4-3, University of Granada.
Agresti, A. (1990). Categorical Data Analysis. Wiley, New York.
Aha, D. (1992). Generalising case studies: a case study. In 9th Int. Conf. on Machine Learning, pages 1–10, San Mateo, Cal. Morgan Kaufmann.
Aha, D. W., Kibler, D., and Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1):37–66.
AIRTC92 (1992). Preprints of the 1992 IFAC/IFIP/IMACS International Symposium on Artificial Intelligence in Real-Time Control. Delft, The Netherlands, 750 pages.
Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method. Biometrika, 63:413–420.
Al-Attar, A. (1991). Structured Decision Tasks Methodology. Attar Software Ltd., Newlands House, Newlands Road, Leigh, Lancs.
Aleksander, I., Thomas, W. V., and Bowden, P. A. (1984). WISARD: A radical step forward in image recognition. Sensor Review, 4:120–124.
Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. In Langley, P., editor, Proceedings of the 4th International Workshop on Machine Learning, pages 103–114. Morgan Kaufmann.
Anderson, C. W. and Miller, W. T. (1990). Challenging control problems. In Miller, W. T., Sutton, R. S., and Werbos, P. J., editors, Neural Networks for Control, pages 475–510. The MIT Press.
Anderson, J. A. (1984). Regression and ordered categorical variables. J. R. Statist. Soc. B, 46:1–30.
Anderson, T. W. (1958). An introduction to multivariate statistical analysis. John Wiley,
New York.
Bratko, I. (1991). Qualitative modelling: Learning and control. In Proceedings of the 6th Czechoslovak Conference on Artificial Intelligence. Prague.
Bratko, I. (1993). Qualitative reasoning about control. In Proceedings of the ETFA '93 Conference. Cairns, Australia.
Bratko, I., Mozetic, I., and Lavrac, N. (1989). KARDIO: A Study in Deep and Qualitative Knowledge for Expert Systems. MIT Press, Cambridge, MA, and London.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). Journal of the American Statistical Association (JASA), 80, No. 391:580–619.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, Ca.
Breiman, L., Meisel, W., and Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics, 19:135–144.
Bretzger, T. M. (1991). Die Anwendung statistischer Verfahren zur Risikofrüherkennung bei Dispositionskrediten. PhD thesis, Universität Hohenheim.
Broomhead, D. S. and Lowe, D. (1988). Multi-variable functional interpolation and adaptive networks. Complex Systems, 2:321–355.
Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1956). A Study of Thinking. Wiley, New York.
Buntine, W. IND package of machine learning algorithms, IND 1.0. Technical Report MS 244-17, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA 94035.
Buntine, W. (1988). Generalized subsumption and its applications to induction and redundancy. Artificial Intelligence, 36:149–176.
Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2:63–73.
Camacho, R. and Michie, D. (1992). An engineering model of subcognition. In Proceedings of the ISSEK Workshop 1992. Bled, Slovenia.
Carpenter, B. E. and Doran, R. W., editors (1986). A. M. Turing's ACE Report and Other Papers. MIT Press, Cambridge, MA.
Carter, C. and Catlett, J. (1987). Assessing credit card applications using machine learning. IEEE Expert: intelligent systems and their applications, 2:71–79.
Cestnik, B. and Bratko, I. (1991). On estimating probabilities in tree pruning. In EWSL '91, Porto, Portugal, 1991, Berlin. Springer-Verlag.
Cestnik, B., Kononenko, I., and Bratko, I. (1987). Assistant 86: A knowledge-elicitation tool for sophisticated users. In Progress in Machine Learning: Proceedings of EWSL-87, pages 31–45, Bled, Yugoslavia. Sigma Press.
Cherkaoui, O. and Cleroux, R. (1991). Comparative study of six discriminant analysis procedures for mixtures of variables. In Proceedings of Interface Conference 1991. Morgan Kaufmann.
Clark, L. A. and Pregibon, D. (1992). Tree-based models. In Chambers, J. and Hastie, T., editors, Statistical Models in S. Wadsworth & Brooks, Pacific Grove, California.
Clark, P. and Boswell, R. (1991). Rule induction with CN2: some recent improvements. In EWSL '91, Porto, Portugal, 1991, pages 151–163, Berlin. Springer-Verlag.
Clark, P. and Niblett, T. (1988). The CN2 induction algorithm. Machine Learning, 3:261–283.
Clarke, W. R., Lachenbruch, P. A., and Broffitt, B. (1979). How nonnormality affects the quadratic discriminant function. Comm. Statist. Theory Meth., IT-16:41–46.
Connell, M. E. and Utgoff, P. E. (1987). Learning to control a dynamic physical system. In Proceedings of the 6th National Conference on Artificial Intelligence, pages 456–459. Morgan Kaufmann.
Cooper, G. F. (1984). NESTOR: A computer-based medical diagnostic that integrates causal and probabilistic knowledge. Report no. 4, HPP-84-48, Stanford University, Stanford, California.
Cooper, G. F. and Herkovsits, E. (1991). A Bayesian method for the induction of probabilistic networks from data. Technical Report KSL-91-02, Stanford University.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326–334.
Cox, D. R. (1966). Some procedures associated with the logistic qualitative response curve. In David, F. N., editor, Research papers on statistics: Festschrift for J. Neyman, pages 55–77. John Wiley, New York.
Crawford, S. L. (1989). Extensions to the CART algorithm. Int. J. Man-Machine Studies, 31:197–217.
Cutsem van, T., Wehenkel, L., Pavella, M., Heilbronn, B., and Goubin, M. (1991). Decision trees for detecting emergency voltage conditions. In Second International Workshop on Bulk Power System Voltage Phenomena - Voltage Stability and Security, pages 229–240, USA. McHenry.
Davies, E. R. (1988). Training sets and a priori probabilities with the nearest neighbour method of pattern recognition. Pattern Recognition Letters, 8:11–13.
Day, N. E. and Kerridge, D. F. (1967). A general maximum likelihood discriminant. Biometrics, 23:313–324.
Devijver, P. A. and Kittler, J. V. (1982). Pattern Recognition. A Statistical Approach. Prentice Hall, Englewood Cliffs.
Djeroski, S., Cestnik, B., and Petrovski, I. (1983). Using the m-estimate in rule induction. J. Computing and Inf. Technology, 1:37–46.
Dubes, R. and Jain, A. K. (1976). Clustering techniques: The user's dilemma. Pattern Recognition, 8:247–260.
Duin, R. P. W. (1976). On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25:1175–1179.
Dvorak, D. L. (1987). Expert systems for monitoring and control. Technical Report AI87-55, Artificial Intelligence Laboratory, The University of Texas at Austin.
Džeroski, S. (1989). Control of inverted pendulum. B.Sc. Thesis, Faculty of Electrical Engineering and Computer Science, University of Ljubljana (in Slovenian).
Efron, B. (1983). Estimating the error rate of a prediction rule: improvements on cross-validation. J. Amer. Stat. Ass., 78:316–331.
Enas, G. G. and Choi, S. C. (1986). Choice of the smoothing parameter and efficiency of k-nearest neighbour classification. Comput. Math. Applic., 12A:235–244.
Enterline, L. L. (1988). Strategic requirements for total facility automation. Control Engineering, 2:9–12.
Ersoy, O. K. and Hong, D. (1991). Parallel, self-organizing, hierarchical neural networks for vision and systems control. In Kaynak, O., editor, Intelligent motion control: proceedings of the IEEE international workshop, New York. IEEE.
Fahlman, S. E. (1988a). An empirical study of learning speed in back-propagation. Technical Report CMU-CS-88-162, Carnegie Mellon University, USA.
Fahlman, S. E. (1988b). Faster learning variation on back-propagation: An empirical study. In Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann.
Fahlman, S. E. (1991a). The cascade-correlation learning algorithm on the MONK's problems. In Thrun, S., Bala, J., Bloedorn, E., and Bratko, I., editors, The MONK's problems - a performance comparison of different learning algorithms, pages 107–112. Carnegie Mellon University, Computer Science Department.
Fahlman, S. E. (1991b). The recurrent cascade-correlation architecture. Technical Report CMU-CS-91-100, Carnegie Mellon University.
Fahlman, S. E. and Lebière, C. (1990). The cascade correlation learning architecture. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 524–532. Morgan Kaufmann.
Fahrmeir, L., Haussler, W., and Tutz, G. (1984). Diskriminanzanalyse. In Fahrmeir, L. and Hamerle, A., editors, Multivariate statistische Verfahren. Verlag de Gruyter, Berlin.
Fienberg, S. (1980). The Analysis of Cross-Classified Categorical Data. MIT Press, Cambridge, Mass.
Fisher, D. H. and McKusick, K. B. (1989a). An empirical comparison of ID3 and backpropagation (vol 1). In IJCAI 89, pages 788–793, San Mateo, CA. Morgan Kaufmann.
Fisher, D. H. and McKusick, K. B. et al. (1989b). Processing issues in comparisons of symbolic and connectionist learning systems. In Spatz, B., editor, Proceedings of the sixth international workshop on machine learning, Cornell University, Ithaca, New York, pages 169–173, San Mateo, CA. Morgan Kaufmann.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188.
Fisher, R. A. (1938). The statistical utilisation of multiple measurements. Ann. Eugen., 8:376–386.
Fix, E. and Hodges, J. L. (1951). Discriminatory analysis, nonparametric estimation: consistency properties. Report no. 4, project no. 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
Frean, M. (1990a). Short Paths and Small Nets: Optimizing Neural Computation. PhD thesis, University of Edinburgh, UK.
Frean, M. (1990b). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2:198–209.
Frey, P. W. and Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6.
Huang, H. H., Zhang, C., Lee, S., and Wang, H. P. (1991). Implementation and comparison of neural network learning paradigms: back propagation, simulated annealing and tabu search. In Dagli, C., Kumara, S., and Shin, Y. C., editors, Intelligent Engineering Systems Through Artificial Neural Networks: Proceedings of the Artificial Neural Networks in Engineering Conference, New York. American Society of Mechanical Engineers.
Huang, W. Y. and Lippmann, R. P. (1987). Comparisons between neural net and conventional classifiers. In Proceedings of the IEEE first international conference on neural networks, pages 485–494, Piscataway, NJ. IEEE.
Hunt, E. B. (1962). Concept Learning: An Information Processing Problem. Wiley.
Hunt, E. B., Martin, J., and Stone, P. I. (1966). Experiments in Induction. Academic Press, New York.
Hunt, K. J., Sbarbaro, D., Zbikovski, R., and Gawthrop, P. J. (1992). Neural networks for control systems: a survey. Automatica, 28(6):1083–1112.
Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295–307.
Jennet, B., Teasdale, G., Braakman, R., Minderhoud, J., Heiden, J., and Kurzi, T. (1979). Prognosis of patients with severe head injury. Neurosurgery, 4:283–288.
Jones, D. S. (1979). Elementary Information Theory. Clarendon Press, Oxford.
Karalič, A. (1992). Employing linear regression in regression tree leaves. In Proceedings of the 10th European Conference on Artificial Intelligence, pages 440–441. Wiley & Sons. Wien, Austria.
Karalič, A. and Gams, M. (1989). Implementation of the Gynesis PC inductive learning system. In Proceedings of the 33rd ETAN Conference, pages XIII.83–90. Novi Sad, (in Slovenian).
Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Appl. Statist., 29:119–127.
Kendall, M. G., Stuart, A., and Ord, J. K. (1983). The advanced Theory of Statistics, Vol 3, Design and Analysis and Time Series. Chapter 44. Griffin, London, fourth edition.
King, R. D., Lewis, R. A., Muggleton, S. H., and Sternberg, M. J. E. (1992). Drug design by machine learning: the use of inductive logic programming to model the structure-activity relationship of trimethoprim analogues binding to dihydrofolate reductase. Proceedings of the National Academy of Sciences, 89.
Kirkwood, C., Andrews, B., and Mowforth, P. (1989). Automatic detection of gait events: a case study using inductive learning techniques. Journal of biomedical engineering, 11(23):511–516.
Knoll, U. (1993). Kostenoptimiertes Prunen in Entscheidungsbäumen. Daimler-Benz, Forschung und Technik, Ulm.
Kohonen, T. (1984). Self-Organization and Associative Memory. Springer Verlag, Berlin.
Kohonen, T. (1989). Self-Organization and Associative Memory. Springer-Verlag, Berlin, 3rd edition.
Kohonen, T., Barna, G., and Chrisley, R. (1988). Statistical pattern recognition with neural networks: Benchmarking studies. In IEEE International Conference on Neural Networks, volume 1, pages 61–68, New York. (San Diego 1988), IEEE.
Makarovič, A. (1988). A qualitative way of solving the pole balancing problem. Technical Report Memorandum Inf-88-44, University of Twente. Also in: Machine Intelligence 12, J. Hayes, D. Michie, E. Tyugu (eds.), Oxford University Press, pp. 241–258.
Mardia, K. V. (1974). Applications of some measures of multivariate skewness and kurtosis in testing normality and robustness studies. Sankhya B, 36:115–128.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, London.
Marks, S. and Dunn, O. J. (1974). Discriminant functions when covariance matrices are unequal. J. Amer. Statist. Assoc., 69:555–559.
McCarthy, J. and Hayes, P. J. (1969). Some philosophical problems from the standpoint of artificial intelligence. In Meltzer, B. and Michie, D., editors, Machine Intelligence 4, pages 463–502. EUP, Edinburgh.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London, 2nd edition.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 9:127–147.
McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York.
Meyer-Brötz, G. and Schürmann, J. (1970). Methoden der automatischen Zeichenerkennung. Akademie-Verlag, Berlin.
Mézard, M. and Nadal, J. P. (1989). Learning in feed-forward layered networks: The tiling algorithm. Journal of Physics A: Mathematics, General, 22:2191–2203.
Kendall, M. G., Stuart, A., and Ord, J. K. (1983). The advanced Theory of Statistics, Vol 1, Distribution Theory. Griffin, London, fourth edition.
Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In Proc. of the Fifth Internat. Symp. on Inform. Processing, pages 125–128, Bled, Slovenia.
Michalski, R. S. (1973). Discovering classification rules using variable valued logic system VL1. In Third International Joint Conference on Artificial Intelligence, pages 162–172.
Michalski, R. S. (1983). A theory and methodology of inductive learning. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M., editors, Machine Learning: An Artificial Intelligence Approach. Tioga, Palo Alto.
Michalski, R. S. and Chilauski, R. L. (1980). Knowledge acquisition by encoding expert rules versus computer induction from examples: a case study involving soybean pathology. Int. J. Man-Machine Studies, 12:63–87.
Michalski, R. S. and Larson, J. B. (1978). Selection of the most representative training examples and incremental generation of VL1 hypotheses: the underlying methodology and the description of programs ESEL and AQ11. Technical Report 877, Dept. of Computer Science, U. of Illinois, Urbana.
Michie, D. (1989). Problems of computer-aided concept formation. In Quinlan, J. R., editor, Applications of Expert Systems, volume 2, pages 310–333. Addison-Wesley, London.
Michie, D. (1990). Personal models of rationality. J. Statist. Planning and Inference, 25:381–399.
Michie, D. (1991). Methodologies from machine learning in data analysis and software. Computer Journal, 34:559–565.
Michie, D. and Al-Attar, A. (1991). Use of sequential Bayes with class probability trees. In Hayes, J., Michie, D., and Tyugu, E., editors, Machine Intelligence 12, pages 187–202. Oxford University Press.
Michie, D. and Bain, M. (1992). Machine acquisition of concepts from sample data. In Kopec, D. and Thompson, R. B., editors, Artificial Intelligence and Intelligent Tutoring Systems, pages 5–23. Ellis Horwood Ltd., Chichester.
Michie, D., Bain, M., and Hayes-Michie, J. (1990). Cognitive models from subcognitive skills. In Grimble, M., McGhee, J., and Mowforth, P., editors, Knowledge-Based Systems in Industrial Control, pages 71–90, Stevenage. Peter Peregrinus.
Michie, D. and Camacho, R. (1994). Building symbolic representations of intuitive real-time skills from performance data. To appear in Machine Intelligence and Inductive Learning, Vol. 1 (eds. Furukawa, K. and Muggleton, S. H., new series of Machine Intelligence, ed. in chief D. Michie), Oxford: Oxford University Press.
Michie, D. and Chambers, R. A. (1968a). Boxes: An experiment in adaptive control. In Dale, E. and Michie, D., editors, Machine Intelligence 2. Oliver and Boyd, Edinburgh.
Michie, D. and Chambers, R. A. (1968b). Boxes: an experiment in adaptive control. In Dale, E. and Michie, D., editors, Machine Intelligence 2, pages 137–152. Edinburgh University Press.
Michie, D. and Sammut, C. (1993). Machine learning from real-time input-output behaviour. In Proceedings of the International Conference Design to Manufacture in Modern Industry, pages 363–369.
Miller, W. T., Sutton, R. S., and Werbos, P. J., editors (1990). Neural Networks for Control. The MIT Press.
Minsky, M. C. and Papert, S. (1969). Perceptrons. MIT Press, Cambridge, MA, USA.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 4:525–534.
Mooney, R., Shavlik, J., Towell, G., and Gove, A. (1989). An experimental comparison of symbolic and connectionist learning algorithms (vol 1). In IJCAI 89: proceedings of the eleventh international joint conference on artificial intelligence, Detroit, MI, pages 775–780, San Mateo, CA. Morgan Kaufmann for International Joint Conferences on Artificial Intelligence.
Muggleton, S. H. (1993). Logic and learning: Turing's legacy. In Muggleton, S. H., Michie, D., and Furukawa, K., editors, Machine Intelligence 13. Oxford University Press, Oxford.
Muggleton, S. H., Bain, M., Hayes-Michie, J. E., and Michie, D. (1989). An experimental comparison of learning formalisms. In Sixth Internat. Workshop on Mach. Learning, pages 113–118, San Mateo, CA. Morgan Kaufmann.
Muggleton, S. H. and Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In Michalski, R. S., Mitchell, T. M., and Carbonell, J. G., editors, Proceedings of the Fifth International Machine Learning Conference, pages 339–352. Morgan Kaufmann, Ann Arbor, Michigan.
Muggleton, S. H. and Feng, C. (1990). Efficient induction of logic programs. In First International Conference on Algorithmic Learning Theory, pages 369–381, Tokyo, Japan. Japanese Society for Artificial Intelligence.
Neapolitan, E. (1990). Probabilistic reasoning in expert systems. John Wiley.
Nowlan, S. and Hinton, G. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4:473–493.
Odetayo, M. O. (1988). Balancing a pole-cart system using genetic algorithms. Master's thesis, Department of Computer Science, University of Strathclyde.
Odetayo, M. O. and McGregor, D. R. (1989). Genetic algorithm for inducing control rules for a dynamic system. In Proceedings of the 3rd International Conference on Genetic Algorithms, pages 177–182. Morgan Kaufmann.
Ozturk, A. and Romeu, J. L. (1992). A new method for assessing multivariate normality with graphical applications. Communications in Statistics - Simulation, 21.
Pearce, D. (1989). The induction of fault diagnosis systems from qualitative models. In Proc. Seventh Nat. Conf. on Art. Intell. (AAAI-88), pages 353–357, St. Paul, Minnesota.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo.
Piper, J. and Granum, E. (1989). On fully automatic feature measurement for banded chromosome classification. Cytometry, 10:242–255.
Plotkin, G. D. (1970). A note on inductive generalization. In Meltzer, B. and Michie, D., editors, Machine Intelligence 5, pages 153–163. Edinburgh University Press.
Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78:1481–1497.
Pomerleau, D. A. (1989). ALVINN: An autonomous land vehicle in a neural network. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems. Morgan Kaufmann Publishers, San Mateo, CA.
Prager, R. W. and Fallside, F. (1989). The modified Kanerva model for automatic speech recognition. Computer Speech and Language, 3:61–82.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vettering, W. T. (1988). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1:81–106.
Quinlan, J. R. (1987a). Generating production rules from decision trees. In International Joint Conference on Artificial Intelligence, pages 304–307, Milan.
Quinlan, J. R. (1987b). Generating production rules from decision trees. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pages 304–307. Morgan Kaufmann, San Mateo, CA.
Quinlan, J. R. (1987c). Simplifying decision trees. Int J Man-Machine Studies, 27:221–234.
Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5:239–266.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Quinlan, J. R., Compton, P. J., Horn, K. A., and Lazarus, L. (1986). Inductive knowledge acquisition: a case study. In Proceedings of the Second Australian Conference on applications of expert systems, pages 83–204, Sydney. New South Wales Institute of Technology.
Reaven, G. M. and Miller, R. G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia, 16:17–24.
Refenes, A. N. and Vithlani, S. (1991). Constructive learning by specialisation. In Proceedings of the International Conference on Artificial Neural Networks, Helsinki, Finland.
Remme, J., Habbema, J. D. F., and Hermans, J. (1980). A simulative comparison of linear, quadratic and kernel discrimination. J. Statist. Comput. Simul., 11:87–106.
Renals, S. and Rohwer, R. (1989). Phoneme classification experiments using radial basis functions. In Proceedings of the International Joint Conference on Neural Networks, volume I, pages 461–468, Washington DC.
Renders, J. M. and Nordvik, J. P. (1992). Genetic algorithms for process control: A survey. In Preprints of the 1992 IFAC/IFIP/IMACS International Symposium on Artificial Intelligence in Real-Time Control, pages 579–584. Delft, The Netherlands.
Reynolds, J. C. (1970). Transformational systems and the algebraic structure of atomic formulas. In Meltzer, B. and Michie, D., editors, Machine Intelligence 5, pages 153–163. Edinburgh University Press.
Ripley, B. (1993). Statistical aspects of neural networks. In Barndorff-Nielsen, O., Cox, D., Jensen, J., and Kendall, W., editors, Chaos and Networks - Statistical and Probabilistic Aspects. Chapman and Hall.
Robinson, J. A. (1965). A machine oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41.
Rohwer, R. (1991a). Description and training of neural network dynamics. In Pasemann, F. and Doebner, H., editors, Neurodynamics, Proceedings of the 9th Summer Workshop, Clausthal, Germany. World Scientific.
Rohwer, R. (1991b). Neural networks for time-varying data. In Murtagh, F., editor, Neural Networks for Statistical and Economic Data, pages 59–70. Statistical Office of the European Communities, Luxembourg.
Rohwer, R. (1991c). Time trials on second-order and variable-learning-rate algorithms. In Lippmann, R., Moody, J., and Touretzky, D., editors, Advances in Neural Information Processing Systems, volume 3, pages 977–983, San Mateo CA. Morgan Kaufmann.
Rohwer, R. (1992). A representation of representation applied to a discussion of variable binding. Technical report, Dept. of Computer Science and Applied Maths., Aston University.
Rohwer, R. and Cressy, D. (1989). Phoneme classification by Boolean networks. In Proceedings of the European Conference on Speech Communication and Technology, pages 557–560, Paris.
Rohwer, R., Grant, B., and Limb, P. R. (1992). Towards a connectionist reasoning system. British Telecom Technology Journal, 10:103–109.
Rohwer, R. and Renals, S. (1988). Training recurrent networks. In Personnaz, L. and Dreyfus, G., editors, Neural networks from models to applications, pages 207–216. I. D. S. E. T., Paris.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, Cambridge MA.
Sakawa, Y. and Shinido, Y. (1982). Optimal control of container crane. Automatica, 18(3):257–266.
Sammut, C. (1988). Experimental results from an evaluation of algorithms that learn to control dynamic systems. In Laird, J., editor, Proceedings of the fifth international conference on machine learning, Ann Arbor, Michigan, pages 437–443, San Mateo, CA. Morgan Kaufmann.
Sammut, C. (1994). Recent progress with boxes. To appear in Machine Intelligence and Inductive Learning, Vol. 1 (eds. Furukawa, K. and Muggleton, S. H., new series of Machine Intelligence, ed. in chief D. Michie), Oxford: Oxford University Press.
Sammut, C. and Cribb, J. (1990). Is learning rate a good performance criterion of learning? In Proceedings of the Seventh International Machine Learning Conference, pages 170–178, Austin, Texas. Morgan Kaufmann.
Sammut, C., Hurst, S., Kedzier, D., and Michie, D. (1992). Learning to fly. In Sleeman, D. and Edwards, P., editors, Proceedings of the Ninth International Workshop on Machine Learning, pages 385–393. Morgan Kaufmann.
Sammut, C. and Michie, D. (1991). Controlling a black box simulation of a space craft. AI Magazine, 12(1):56–63.
Sammut, C. A. and Banerji, R. B. (1986). Learning concepts by asking questions. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M., editors, Machine Learning: An Artificial Intelligence Approach, Vol 2, pages 167–192. Morgan Kaufmann, Los Altos, California.
SAS (1985). Statistical Analysis System. SAS Institute Inc., Cary, NC, version 5 edition.
Scalero, R. and Tepedelenlioglu, N. (1992). A fast new algorithm for training feedforward neural networks. IEEE Transactions on Signal Processing, 40:202–210.
Schalkoff, R. J. (1992). Pattern Recognition: Statistical, Structural and Neural Approaches. Wiley, Singapore.
Schoppers, M. (1991). Real-time knowledge-based control systems. Communications of the ACM, 34(8):27–30.
Schumann, M., Lehrbach, T., and Bahrs, P. (1992). Versuche zur Kreditwürdigkeitsprognose mit künstlichen Neuronalen Netzen. Universität Göttingen.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley, New York.
Sethi, I. K. and Otten, M. (1990). Comparison between entropy net and decision tree classifiers. In IJCNN-90: proceedings of the international joint conference on neural networks, pages 63–68, Ann Arbor, MI. IEEE Neural Networks Council.
Shadmehr, R. and D'Argenio, Z. (1990). A comparison of a neural network based estimator and two statistical estimators in a sparse and noisy environment. In IJCNN-90: proceedings of the international joint conference on neural networks, pages 289–292, Ann Arbor, MI. IEEE Neural Networks Council.
Shapiro, A. D. (1987). Structured Induction in Expert Systems. Addison Wesley, London.
Shapiro, A. D. and Michie, D. (1986). A self-commenting facility for inductively synthesized end-game expertise. In Beal, D. F., editor, Advances in Computer Chess 5. Pergamon, Oxford.
Shapiro, A. D. and Niblett, T. (1982). Automatic induction of classification rules for a chess endgame. In Clarke, M. R. B., editor, Advances in Computer Chess 3. Pergamon, Oxford.
Shastri, L. and Ajjangadde, V. From simple associations to systematic reasoning: A connectionist representation of rules, variables, and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences. To appear.
Shavlik, J., Mooney, R., and Towell, G. (1991). Symbolic and neural learning algorithms: an experimental comparison. Machine Learning, 6:111–143.
Siebert, J. P. (1987). Vehicle recognition using rule based methods. TIRM-87-018, Turing Institute.
Silva, F. M. and Almeida, L. B. (1990). Acceleration techniques for the backpropagation algorithm. In Almeida, L. B. and Wellekens, C. J., editors, Lecture Notes in Computer Science 412, Neural Networks, pages 110–119. Springer-Verlag, Berlin.
Silverman, B. W. (1986). Density estimation for Statistics and Data Analysis. Chapman and Hall, London.
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care, pages 261–265. IEEE Computer Society Press.
Smith, P. L. (1982). Curve fitting and modeling with splines using statistical variable selection techniques. Technical Report NASA 166034, Langley Research Center, Hampton, Va.
Snedecor, W. and Cochran, W. G. (1980). Statistical Methods (7th edition). Iowa State University Press, Iowa, U.S.A.
Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L., and Cowell, R. G. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219–247.
Spikovska, L. and Reid, M. B. (1990). An empirical comparison of ID3 and HONNs for distortion invariant object recognition. In TAI-90: tools for artificial intelligence: proceedings of the 2nd international IEEE conference, Los Alamitos, CA. IEEE Computer Society Press.
Spirtes, P., Scheines, R., Glymour, C., and Meek, C. (1992). TETRAD II, Tools for discovery.
Srinivisan, V. and Kim, Y. H. (1987). Credit granting: A comparative analysis of classification procedures. The Journal of Finance, 42:665–681.
StatSci (1991). S-Plus user's manual. Technical report, StatSci Europe, Oxford, U.K.
Stein von, J. H. and Ziegler, W. (1984). The prognosis and surveillance of risks from commercial credit borrowers. Journal of Banking and Finance, 8:249–268.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc., 36:111–147 (including discussion).
Switzer, P. (1980). Extensions of linear discriminant analysis for statistical classification of remotely sensed satellite imagery. J. Int. Assoc. for Mathematical Geology, 23:367–376.
Switzer, P. (1983). Some spatial statistics for the interpretation of satellite data. Bull. Int. Stat. Inst., 50:962–971.
Thrun, S. B., Mitchell, T., and Cheng, J. (1991). The MONK's comparison of learning algorithms - introduction and survey. In Thrun, S., Bala, J., Bloedorn, E., and Bratko, I., editors, The MONK's problems - a performance comparison of different learning algorithms, pages 1–6. Carnegie Mellon University, Computer Science Department.
Titterington, D. M., Murray, G. D., Murray, L. S., Spiegelhalter, D. J., Skene, A. M., Habbema, J. D. F., and Gelpke, G. J. (1981). Comparison of discrimination techniques applied to a complex data set of head injured patients (with discussion). J. Royal Statist. Soc. A, 144:145–175.
Todeschini, R. (1989). k-nearest neighbour method: the influence of data transformations and metrics. Chemometrics Intell. Labor. Syst., 6:213–220.
Tollenaere, T. (1990). SuperSAB: Fast adaptive back propagation with good scaling properties. Neural Networks, 3:561–574.
Tsaptsinos, D., Mirzai, A., and Jervis, B. (1990). Comparison of machine learning paradigms in a classification task. In Rzevski, G., editor, Applications of artificial intelligence in engineering V: proceedings of the fifth international conference, Berlin. Springer-Verlag.
Turing, A. M. (1986). Lecture to the London Mathematical Society on 20 February 1947. In Carpenter, B. E. and Doran, R. W., editors, A. M. Turing's ACE Report and Other Papers. MIT Press, Cambridge, MA.
Unger, S. and Wysotzki, F. (1981). Lernfähige Klassifizierungssysteme. Akademie-Verlag, Berlin.
Urbančič, T. and Bratko, I. (1992). Knowledge acquisition for dynamic system control. In Souček, B., editor, Dynamic, Genetic, and Chaotic Programming, pages 65–83. Wiley & Sons.
Urbančič, T., Juričić, D., Filipič, B., and Bratko, I. (1992). Automated synthesis of control for non-linear dynamic systems. In Preprints of the 1992 IFAC/IFIP/IMACS International Symposium on Artificial Intelligence in Real-Time Control, pages 605–610. Delft, The Netherlands.
Varšek, A., Urbančič, T., and Filipič, B. (1993). Genetic algorithms in controller design and tuning. IEEE Transactions on Systems, Man and Cybernetics.
Verbruggen, H. B. and Åström, K. J. (1989). Artificial intelligence and feedback control. In Proceedings of the Second IFAC Workshop on Artificial Intelligence in Real-Time Control, pages 115–125. Shenyang, PRC.
Wald, A. (1947). Sequential Analysis. Chapman & Hall, London.
Wasserman, P. D. (1989). Neural Computing, Theory and Practice. Van Nostrand Reinhold.
Watkins, C. J. C. H. (1987). Combining cross-validation and search. In Bratko, I. and Lavrac, N., editors, Progress in Machine Learning, pages 79–87, Wimslow. Sigma Books.
Wehenkel, L., Pavella, M., Euxibie, E., and Heilbronn, B. (1993). Decision tree based transient stability assessment - a case study. In Proceedings of IEEE/PES 1993 Winter Meeting, Columbus, OH, Jan/Feb 5, Paper # 93 WM 2352 PWRS.
Index
Accuracy, 7, 8
ACE, 46
Algorithms
function approximation, 230
instance-based, 230
symbolic learning, 230
ALLOC80, 33, 214, 227, 263
Alternating Conditional Expectation, 46
Analysis of results, 176
AOCDL, 56
AQ, 56, 57, 74, 77
Aq, 237
AQ11, 50, 54
Architectures, 86
Assistant, 65
Attribute coding, 124
Attribute entropy, 174
Attribute noise, 174
Attribute reduction, 120
Attribute types, 214
Attribute vector, 17
Attributes, 1
Dataset
Credit management, 124
cut, 181
heart disease, 152
image segmentation, 145
Karhunen-Loeve Digits, 137, 193
letter recognition, 140
machine faults, 165
satellite image, 143
Shuttle, 173
shuttle control, 154, 193
Technical, 161, 174
tsetse fly distribution, 167
vehicle recognition, 138
Dataset characterisation, 112
Dataset collection, 124
Decision class, 14
Decision problems, 1
Decision trees, 5, 9, 56, 73, 109, 121, 161,
217, 226
Default, 57, 80
Default rule, 13
Density estimates, 12
Density estimation, 30
Diabetes dataset, 157
Digits dataset, 135, 181, 223
DIPOL92, 12, 103, 223, 225, 263
GLIM, 26
GOLEM, 81
Golem, 244
Gradient descent, 90
MLP, 92
second-order, 91
Gradient methods, 93
Head dataset, 149, 173
head injury dataset, 23
Heart dataset, 152, 173
Heuristically adequate, 80
Hidden nodes, 109
Hierarchical clustering, 189, 192
Hierarchical structure, 2
Hierarchy, 120, 123
Human brain, 3
Hypothesis language, 54, 229
ID3, 160, 218, 219
ILP, 65, 81, 82
Image datasets, 176, 179, 182
Image segmentation, 112, 181
Impure, 60
Impure node, 57
Impurity, 57, 58, 60
IND Package, 40
IND package, 73
IndCART, 12, 219, 263
Indicator variables, 26
inductive learning, 254
Inductive Logic Programming, 81, 82
Inductive logic programming, 2, 5
Inductive Logic Programming (ILP), 50
Information measures, 116, 169
Information score, 203
Information theory, 116
Instance-based learning (IBL), 230
Iris data, 9
Irrelevant attributes, 119, 226
ISoft dataset, 123
ITrule, 12, 56, 57, 77, 78, 220, 265
J-measure, 56, 78
Jackknife, 32
Joint entropy, 117
K nearest neighbour, 160
K-Means clustering, 102
K-means clustering, 101
K-Nearest Neighbour, 35
K-Nearest neighbour, 10–12, 16, 126
k-Nearest neighbour, 29
k-NN, 160, 182, 216, 224, 227, 265
Cross validation, 36
K-R-K problem, 80–82
Kalman filter, 96
KARDIO, 52, 227
Kernel
classifier, 33
window width, 33
Kernel density (ALLOC80), 12
Kernel density estimation, 30, 214
Kernel function, 31
Kernels, 32
KL digits dataset, 27, 121, 137, 170
Kohonen, 160, 222, 265
Kohonen networks, 85, 102
Kohonen self-organising net, 12
Kullback-Leibler information, 112
Kurtosis, 22, 115, 170
Layer
hidden, 87
input, 86
output, 86
learning curves, 127
Learning graphical representations, 43
Learning vector quantization (LVQ), 12
Learning Vector Quantizer, 102
Learning vector quantizers, 102
Leave-one-out, 108
Letters dataset, 140, 208
Likelihood ratio, 27
Linear decision tree, 156
Noisy, 57
Noisy data, 61
Nonlinear regression, 89
Nonparametric density estimator, 35
Nonparametric methods, 16, 29
Nonparametric statistics, 5
Normal distribution, 20
NS.ratio, 119, 174
Object recognition datasets, 180
Observation language, 53, 229
Odds, 25
Optimisation, 94
Ordered categories, 25
Over-fitting, 107
Overfitting, 63, 64
Parametric methods, 16
Partitioning as classification, 8
Parzen window, 30
Pattern recognition, 16
Perceptron, 86, 109, 232
Performance measures, 4
Performance prediction, 210
Plug-in estimates, 21
Polak-Ribiere, 92
pole balancing, 248
Polytrees, 43
Polytrees (CASTLE), 12
as classifiers, 43
Pooled covariance matrix, 19
Prediction as classification, 8
Preprocessing, 120, 123
Primary attribute, 123
Prior
uniform, 100
Prior probabilities, 13, 133
Probabilistic inference, 42
Products of attributes, 25
Projection pursuit, 37, 216
Projection pursuit (SMART), 12
classification, 38
Propositional learning systems, 237
Prototypes, 230
Pruning, 61, 63, 67–69, 96, 107, 109, 194
backward, 61, 64
cost complexity, 69
forward, 61
Purity, 61, 62
measure, 61
Purity measure, 59
Quadisc, 22, 121, 170, 173, 193, 263
Quadiscr, 225, 226
Quadratic discriminant, 12, 17, 22, 27, 214
Quadratic discriminants, 193
Quadratic functions of attributes, 22
Radial Basis Function, 85
Radial basis function, 93, 126, 223, 263
Radial Basis Function Network, 93
RAMnets, 103
RBF, 12, 85, 93, 223, 263
Recurrent networks, 88
Recursive partitioning, 9, 12, 16
Reduced nearest neighbour, 35
Reference class, 26
regression tree, 260
Regularisation, 23
Relational learning, 241
RETIS, 260
RG, 56
Risk assessment, 132
Rule-based methods, 10, 220
Rule-learning, 50
Satellite image dataset, 121, 143, 173
Scaling parameter, 32
Scatterplot smoother, 39
SDratio, 113, 170
Specific-to-general, 54
Secondary attribute, 123
Segmentation dataset, 145, 218
Selector, 56
Shuttle, 107
Shuttle dataset, 154, 218
Simulated digits data, 45
Skew abs, 115
Skewness, 28, 115, 170
SMART, 39, 216, 224, 225, 263
Smoothing parameter, 32
Smoothing parameters, 214
SNR, 119
Specific-to-general, 54, 57, 58, 79
Speed, 7
Splitting criteria, 61
Splitting criterion, 62, 67, 70, 76
Splus, 26
Statistical approaches to classification, 2
Statistical measures, 112, 169
StatLog, 1, 4
collection of data, 53
objectives, 4
preprocessing, 124
Stepwise selection, 11
Stochastic gradient, 93
Storage, 223
Structured induction, 83
Subset selection, 199
Sum of squares, 18
Supervised learning, 1, 6, 8, 85
Supervised networks, 86
Supervised vector, 102
Supervisor, 8
Symbolic learning, 52
Symbolic ML, 52
Taxonomic, 58
Taxonomy, 54, 57, 58, 79
Technical dataset, 120, 161, 218
Tertiary attribute, 123
Test environment, 214
Test set, 8, 17, 108
Three-Mile Island, 7
Tiling algorithm, 96