Franco - Neural Networks
Constructive Neural Networks
Studies in Computational Intelligence, Volume 258
Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl
Vol. 240. Dimitri Plemenos and Georgios Miaoulis (Eds.)
Intelligent Computer Graphics 2009, 2009
ISBN 978-3-642-03451-0

Vol. 241. Janos Fodor and Janusz Kacprzyk (Eds.)
Aspects of Soft Computing, Intelligent Robotics and Control, 2009
ISBN 978-3-642-03632-3

Vol. 242. Carlos Artemio Coello Coello, Satchidananda Dehuri, and Susmita Ghosh (Eds.)
Swarm Intelligence for Multi-objective Problems in Data Mining, 2009
ISBN 978-3-642-03624-8

Vol. 243. Imre J. Rudas, Janos Fodor, and Janusz Kacprzyk (Eds.)
Towards Intelligent Engineering and Information Technology, 2009
ISBN 978-3-642-03736-8

Vol. 244. Ngoc Thanh Nguyen, Radosław Piotr Katarzyniak, and Adam Janiak (Eds.)
New Challenges in Computational Collective Intelligence, 2009
ISBN 978-3-642-03957-7

Vol. 245. Oleg Okun and Giorgio Valentini (Eds.)
Applications of Supervised and Unsupervised Ensemble Methods, 2009
ISBN 978-3-642-03998-0

Vol. 246. Thanasis Daradoumis, Santi Caballe, Joan Manuel Marques, and Fatos Xhafa (Eds.)
Intelligent Collaborative e-Learning Systems and Applications, 2009
ISBN 978-3-642-04000-9

Vol. 247. Monica Bianchini, Marco Maggini, Franco Scarselli, and Lakhmi C. Jain (Eds.)
Innovations in Neural Information Paradigms and Applications, 2009
ISBN 978-3-642-04002-3

Vol. 251. Zbigniew W. Ras and William Ribarsky (Eds.)
Advances in Information and Intelligent Systems, 2009
ISBN 978-3-642-04140-2

Vol. 252. Ngoc Thanh Nguyen and Edward Szczerbicki (Eds.)
Intelligent Systems for Knowledge Management, 2009
ISBN 978-3-642-04169-3

Vol. 253. Akitoshi Hanazawa, Tsutom Miki, and Keiichi Horio (Eds.)
Brain-Inspired Information Technology, 2009
ISBN 978-3-642-04024-5

Vol. 254. Kyandoghere Kyamakya, Wolfgang A. Halang, Herwig Unger, Jean Chamberlain Chedjou, Nikolai F. Rulkov, and Zhong Li (Eds.)
Recent Advances in Nonlinear Dynamics and Synchronization, 2009
ISBN 978-3-642-04226-3

Vol. 255. Catarina Silva and Bernardete Ribeiro
Inductive Inference for Large Scale Text Classification, 2009
ISBN 978-3-642-04532-5

Vol. 256. Patricia Melin, Janusz Kacprzyk, and Witold Pedrycz (Eds.)
Bio-inspired Hybrid Intelligent Systems for Image Analysis and Pattern Recognition, 2009
ISBN 978-3-642-04515-8

Vol. 257. Oscar Castillo, Witold Pedrycz, and Janusz Kacprzyk (Eds.)
Evolutionary Design of Intelligent Systems in Modeling, Simulation and Control, 2009
ISBN 978-3-642-04513-4

Vol. 258. Leonardo Franco, David A. Elizondo, and Jose M. Jerez (Eds.)
Constructive Neural Networks, 2009
ISBN 978-3-642-04511-0
Leonardo Franco, David A. Elizondo,
and Jose M. Jerez (Eds.)
Leonardo Franco
Dept. of Computer Science
University of Malaga
Campus de Teatinos S/N
Malaga 29071
Spain
E-mail: lfranco@lcc.uma.es

Jose M. Jerez
Dept. of Computer Science
University of Malaga
Campus de Teatinos S/N
Malaga 29071
Spain
E-mail: jja@lcc.uma.es
David A. Elizondo
Centre for Computational Intelligence
School of Computing
De Montfort University
The Gateway
Leicester LE1 9BH
UK
E-mail: elizondo@dmu.ac.uk
DOI 10.1007/978-3-642-04512-7
© 2009 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilm or in any other
way, and storage in data banks. Duplication of this publication or parts thereof is
permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from
Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this
publication does not imply, even in the absence of a specific statement, that such
names are exempt from the relevant protective laws and regulations and therefore
free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
springer.com
Preface
David A. Elizondo
Leicester, UK
José M. Jerez
Málaga, Spain
Contents
1 Introduction
Conventional neural network (NN) training algorithms (such as Backpropagation)
require the definition of the NN architecture before learning starts. The common
way of developing a neural network that suits a task consists of defining several
different architectures, training and evaluating each of them, and then choosing
the most appropriate one for the problem, based on the error produced between the
target and actual output values. Constructive neural network (CoNN) algorithms,
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 1–23.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
2 M. do Carmo Nicoletti et al.
however, define the architecture of the network along with the learning process.
Ideally, CoNN algorithms should efficiently construct small NNs that have good
generalization performance. As commented in (Muselli, 1998), "the possibility
of adapting the network architecture to the given problem is one of the advantages
of constructive techniques... [This] has also important effects on the convergence
speed of the training process. In most constructive methods, the addition of a new
hidden unit implies the updating of a small portion of weights, generally only
those regarding the neuron to be added."
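The conventional trial-and-error workflow just described can be sketched as a simple selection loop; `train` and `evaluate` are hypothetical placeholders for any concrete training and validation routines, not names from the chapter:

```python
def select_architecture(candidates, train, evaluate):
    """Conventional (non-constructive) workflow: train every candidate
    architecture and keep the one with the lowest validation error."""
    best_model, best_err = None, float("inf")
    for arch in candidates:
        model = train(arch)        # fit this fixed architecture
        err = evaluate(model)      # error between target and actual outputs
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err

# Toy illustration: candidate 'architectures' are hidden-layer sizes and the
# (made-up) evaluation prefers the size closest to 8.
model, err = select_architecture([2, 4, 8, 16],
                                 train=lambda a: a,
                                 evaluate=lambda m: abs(m - 8))
```

A CoNN algorithm avoids this outer loop entirely by growing the architecture during training.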
The automated design of appropriate neural network architectures can be
approached by two different groups of techniques: evolutionary and non-
evolutionary. In the evolutionary approach, a NN is evolved by means of an
evolutionary technique, i.e. a population-based stochastic search strategy such as a
genetic algorithm (GA) (see (Schaffer et al., 1992), (Yao, 1999)). In the
non-evolutionary approach, the NN is built not as a result of an evolutionary
process, but rather as the result of a specific algorithm designed to automatically
construct it, as is the case with a constructive algorithm.
CoNN algorithms, however, are not the only non-evolutionary approach to
the problem of defining a suitable architecture for a neural network. The
strategy implemented by the so called pruning methods can also be used and
consists in training a larger than necessary network (which presumably is an easy
task) and then, pruning it by removing some of its connections and/or nodes
(see (Reed, 1993)).
As discussed by Lahnajärvi and co-workers in (Lahnajärvi et al., 2002),
pruning algorithms can be divided into two main groups. Algorithms in the first
group estimate the sensitivity of an error function to the removal of an element
(neuron or connection); the elements with the least effect can be removed. Algorithms
in the second group, generally referred to as penalty-term or regularization
algorithms, add terms to the objective function that reward the network for
choosing efficient and small solutions. The same authors also identified
in the literature what they call combined algorithms, which take advantage of the
properties of both constructive and pruning algorithms in order to determine the
network size in a flexible way (see (Fiesler, 1994), (Ghosh & Tumer, 1994)).
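As a minimal illustration of the two groups (not an implementation of any specific algorithm cited above), the sketch below removes the weights with the smallest magnitude, a common proxy for "least effect on the error", and shows a penalty-term objective with an assumed weight-decay coefficient `lam`:

```python
def prune_by_magnitude(weights, fraction):
    """Sensitivity-style pruning sketch: zero out the given fraction of
    weights with the smallest absolute value."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    k = int(len(weights) * fraction)
    pruned = list(weights)
    for i in ranked[:k]:
        pruned[i] = 0.0
    return pruned

def penalized_objective(error, weights, lam=0.01):
    """Penalty-term variant: reward small solutions by adding a
    weight-decay term to the objective function."""
    return error + lam * sum(w * w for w in weights)
```

For example, `prune_by_magnitude([0.1, -2.0, 0.05, 1.5], 0.5)` keeps only the two largest-magnitude weights.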
There are many different kinds of neural network; new algorithms and
variations of already known algorithms are constantly being published in the
literature. Similarly to other machine learning techniques, neural network
algorithms can also be characterized as supervised, when the target values are
known and the algorithm uses the information, or as unsupervised, when such
information is not given and/or used by the algorithm. The two main classes of
NN architecture are feedforward, where the connections between neurons do not
form cycles, and feedback (or recurrent), where the connections may form cycles.
NNs may also differ in the type of data they deal with, the two most
popular being categorical and quantitative. Both supervised and unsupervised
learning with categorical targets are referred to as classification. Supervised
learning with quantitative target values is known as regression. Classification
problems can be considered a particular type of regression problems.
CoNN Algorithms for Feedforward Architectures Suitable for Classification Tasks 3
Among the most well known CoNN algorithms for two-class classification
problems are the Tower and the Pyramid (Gallant, 1986b), the Tiling (Mézard &
Nadal, 1989), the Upstart (Frean, 1990), the Perceptron Cascade (Burgess, 1994),
the PTI and the Shift (Amaldi & Guenin, 1997), the Irregular Partitioning
Algorithm (IPA) (Marchand et al., 1990; Marchand & Golea, 1993), the Target
Switch (Campbell & Vicente, 1995), the Constraint Based Decomposition (CBD)
algorithm (Drăghici, 2001) and the BabCoNN (Bertini Jr. & Nicoletti, 2008a).
Smieja (1993), Gallant (1994), Bishop (1996), Mayoraz and Aviolat (1996),
Campbell (1997) and Muselli (1998) discuss several CoNN algorithms in detail.
neurons per layer is reached and the training set still does not have a faithful
representation in this layer. The original Tiling algorithm as described in (Mézard
& Nadal, 1989) uses the same TLU training algorithm for each neuron (master or
ancillary) added to the NN; in the same article the authors state and prove a
theorem that ensures its convergence.
The unnamed algorithm proposed in (Nadal, 1989) and described by its author
as an algorithm similar in spirit to the Tiling algorithm was named by Smieja in
(Smieja, 1993) as Pointing. The Pointing algorithm constructs a feedforward
neural network with the input layer, one output neuron and an ordered sequence of
single hidden neurons, each one connected to the input layer and to the previous
hidden neuron in the sequence. Although its author claims that the algorithm is
very similar to the Tiling one, only more constrained (which is true), the
Pointing algorithm actually corresponds to the Tower algorithm as proposed by
Gallant (1986b).
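A minimal sketch of this kind of layered growth, assuming a plain perceptron rule for the TLU training (Gallant's original formulation uses the pocket algorithm, so this is a simplification): each new unit is trained on the raw inputs augmented with the previous unit's output, and units are added while training accuracy improves.

```python
def step(z):
    return 1 if z > 0 else 0

def train_tlu(data, epochs=200):
    """Plain perceptron training of one threshold logic unit (TLU).
    data: list of (inputs, target) pairs with 0/1 targets."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, t in data:
            y = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
            if y != t:                       # perceptron update rule
                for i in range(n):
                    w[i] += (t - y) * x[i]
                b += t - y
    return w, b

def tower(data, max_units=5):
    """Tower-style growth sketch: each unit sees the inputs plus the
    previous unit's output; stop when accuracy stops improving."""
    units, best_acc, aug = [], -1.0, data
    for _ in range(max_units):
        w, b = train_tlu(aug)
        out = [step(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in aug]
        acc = sum(o == t for o, (_, t) in zip(out, aug)) / len(aug)
        if acc <= best_acc:
            break                            # new unit did not help
        units.append((w, b))
        best_acc = acc
        aug = [(list(x) + [o], t) for (x, t), o in zip(aug, out)]
    return units, best_acc

# On a linearly separable task such as AND a single unit already suffices,
# so the growth loop stops after the first unit.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
units, acc = tower(and_data)
```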
The Partial Target Inversion (PTI) algorithm is a CoNN algorithm proposed in
(Amaldi & Guenin, 1997) that shares strong similarities with the Tiling algorithm.
The PTI grows a multi-layer network where each layer has one master neuron and
a few ancillary neurons. Following the Tiling strategy as well, the PTI adds
ancillary neurons to a layer in order to satisfy the faithfulness criteria; the neurons
in layer c are connected only to neurons in layer c − 1. If the training of the master
neuron results in a weight vector that correctly classifies all training patterns, or if
the master neuron of layer c does not classify a larger number of patterns than the
master neuron of layer c − 1, the algorithm stops. If a training pattern, however,
was incorrectly classified by the master neuron, and the master neuron correctly
classifies a greater number of patterns than the master of the previous layer, the
algorithm starts adding ancillary neurons to the current layer aiming at its
faithfulness. When the current c layer becomes faithful the algorithm adds a new
layer, c + 1, initially only having the master neuron. The process continues until
stopping criteria are met, such as when the number of master (or ancillary)
neurons has reached a pre-defined threshold. The only noticeable difference
between the Tiling and the PTI is the way the training set, used for training the
ancillary neurons in the process of turning a layer faithful, is chosen.
For training the master neuron of layer c, the PTI uses all the outputs from the
previous layer. Considering that the layer c needs to be faithful, the first ancillary
neuron added to layer c will be trained with those patterns (used to train the master
neuron) that made layer c unfaithful.
In addition, the patterns that activate the last added neuron (master, in this case)
have their classes inverted. The authors justify this procedure by saying that
"when trying to break several unfaithful classes simultaneously, it may be
worthwhile to flip the target of the prototypes on layer c corresponding to several
unfaithful classes. In fact, the specific targets are not important; the only
requirement is that outputs corresponding to a different c − 1 layer representation
trigger a different output for at least one unit in the layer c." To clarify: if,
after the addition of the first ancillary neuron, the layer is not faithful yet, another
ancillary neuron needs to be added to layer c. The second ancillary neuron will be
trained with the c − 1 outputs that provoked an unfaithful representation of layer c,
this time, however, taking also into consideration the master and first ancillary
neuron previously added. The patterns that activated the first ancillary neuron
have their class inverted when they are included in the training set for the second
ancillary neuron. The name PTI (Partial Target Inversion) refers to the fact that
when constructing training sets only a partial number of patterns have their class
(target) inverted.
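The faithfulness criterion that drives the addition of ancillary neurons in both Tiling and PTI can be stated compactly: a layer is faithful when no two patterns with different targets share the same representation at that layer. A minimal check (an illustrative helper, not code from the original papers):

```python
def is_faithful(layer_outputs, targets):
    """Return True if no two patterns with different targets share the
    same internal representation (the Tiling/PTI faithfulness test)."""
    seen = {}
    for rep, t in zip(layer_outputs, targets):
        key = tuple(rep)
        if key in seen and seen[key] != t:
            return False      # same representation, different targets
        seen[key] = t
    return True

# Two patterns mapped to distinct representations: faithful.
ok = is_faithful([[1, 0], [0, 1]], [0, 1])
# Two patterns collapsed onto the same representation with different
# targets: unfaithful, so an ancillary neuron would be added.
bad = is_faithful([[1], [1]], [0, 1])
```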
Another algorithm that adopts the same error-correction strategy as the Upstart
is the Perceptron Cascade (PC) algorithm (Burgess, 1994); PC also borrows the
architecture of neural networks created by the Cascade Correlation algorithm
(Fahlman & Lebiere, 1990).
PC begins the construction of the network by training the output neuron. If this
neuron does not classify all training patterns correctly, the algorithm begins to add
hidden neurons to the network. Each new added hidden neuron is connected to all
previous hidden neurons as well as to the input neurons. The new hidden neuron is
then connected to the output neuron; each time a hidden neuron is added, the
output neuron needs to be retrained. The addition of a new hidden neuron
enlarges the space in one dimension. The algorithm has three stopping criteria: 1)
the network converges i.e. correctly classifies all training patterns; 2) a
pre-defined maximum number of hidden neurons has been achieved and 3) the
most common, the addition of a new hidden neuron degrades the networks
performance.
Similarly to the Upstart algorithm, hidden neurons are added to the network in
order to correct wrongly-on and wrongly-off errors. Following the same strategy
employed by Upstart, what distinguishes a neuron created for correcting wrongly-
on or for correcting wrongly-off errors caused by the output neuron is the training
set used for training the neuron. To correct wrongly-on errors the training set used
should have all negative patterns plus the patterns which produce wrongly-on
errors. For correcting wrongly-off errors, the training set should have all positive
patterns plus the negative patterns which produce wrongly-off errors. Unlike
Upstart, however, the PC algorithm only adds and trains one neuron at a time in
order to correct the most frequent wrongly-on and wrongly-off errors produced by
the output neuron.
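The wrongly-on / wrongly-off partition that drives Upstart, the Perceptron Cascade and Shift can be sketched as follows (`predict` is a hypothetical callable standing for the current output neuron):

```python
def split_errors(patterns, targets, predict):
    """Partition the training errors of a 0/1 output neuron into
    'wrongly-on' (fires when the target is off) and 'wrongly-off'
    (silent when the target is on) patterns."""
    wrongly_on, wrongly_off = [], []
    for x, t in zip(patterns, targets):
        y = predict(x)
        if y == 1 and t == 0:
            wrongly_on.append(x)
        elif y == 0 and t == 1:
            wrongly_off.append(x)
    return wrongly_on, wrongly_off

# An output neuron that always fires makes only wrongly-on errors.
w_on, w_off = split_errors([[0], [1]], [0, 1], lambda x: 1)
```

Each corrector neuron is then trained on a set built from one of these two groups, as described above.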
Like the Upstart and Perceptron Cascade algorithms, the Shift algorithm
(Amaldi & Guenin, 1997) also constructs the network beginning with the output
neuron. This algorithm, however, creates only one hidden layer, iteratively adding
neurons to it; each added neuron is connected to the input neurons and to the
output neuron. The error correcting procedure used by the Shift method is similar
to the one used by the Upstart method, in the sense that the algorithm also
identifies wrongly-on and wrongly-off errors. However, it adds and trains a hidden
neuron to correct only the more frequent of these two types of errors. Also, the
training set used for training a wrongly-off (or wrongly-on) corrector differs
slightly from the training sets used by the Upstart algorithm.
to E+ that are correctly stored by the (−) dichotomy are the patterns that will be
correctly separated by the neuron introduced in order to connect the above
mentioned pair.
When the architecture is a cascade type, the introduced neuron is a linear
neuron that implements a summation function on the pair outputs. If the result is
positive (or negative) the current pattern is correctly classified; otherwise a
misclassification is produced, which will be dealt with by the next pair of neurons
to be added. For growing neural networks with a tree structure the introduced
neuron is a threshold neuron that implements a threshold function on the pair
outputs. Considering that the first iteration adds one threshold neuron (the output),
each following iteration will add two more threshold neurons to those already
added in the previous iteration.
To obtain the dichotomies the authors propose the use of any Perceptron-like
TLU training algorithm. Roughly speaking, the idea is to run the TLU training
algorithm and then shift the resulting hyperplane in order to correctly classify all
patterns of a given class.
The Constraint Based Decomposition (CBD) is another algorithm that follows
the sequential model. The algorithm builds an architecture with four layers which
are named input, hyperplane, AND and OR layers respectively. The whole
training set is used for training the first hidden neuron in the hyperplane layer. The
next hidden neuron to be added will be trained with those training patterns that
were misclassified by the first hidden neuron. The algorithm goes on adding
neurons to the first layer until no pattern is left in the training set. For training a
neuron ui, one pattern from each class is randomly chosen and removed from the
training set E. These patterns are put in the training set Eui. After ui has been
trained with Eui, the algorithm starts to add patterns to Eui, one at a time, in a
random manner. Each time a pattern is added to the set, ui is retrained with the
updated Eui. However, if the addition of a new pattern to Eui results in
misclassification, the last pattern added is removed from Eui and marked as used
by the neuron. Before adding a new hidden neuron, the algorithm considers all
patterns in E for the current neuron. A new neuron will be added when all training
patterns left have been tried for the current neuron. The neurons of the AND layer
are connected only to relevant neurons from the hyperplane layer and in the OR
layer the output neurons are connected only to neurons from the AND layer which
are turned on for the given class.
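The pattern-by-pattern growth of a neuron's training set in CBD, with rollback when the unit can no longer fit the enlarged set, can be sketched as below. The real algorithm picks the next pattern at random and seeds Eui with one pattern per class; here the order is fixed and `fits` is a hypothetical callable standing for a trainability test such as a linear-separability check:

```python
def grow_training_subset(patterns, targets, fits):
    """CBD-style subset growth sketch: tentatively add one pattern at a
    time, keeping it only if the current unit can still fit the subset."""
    subset = []
    for x, t in zip(patterns, targets):
        subset.append((x, t))
        if not fits(subset):
            subset.pop()      # rollback: pattern is marked as 'used'
    return subset

# Toy trainability test: pretend the unit can fit at most two patterns.
subset = grow_training_subset([1, 2, 3, 4], [0, 1, 0, 1],
                              fits=lambda s: len(s) <= 2)
```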
The recently introduced DASG algorithm also belongs to the class of sequential
learning algorithms. It works with binary inputs by decomposing the original
Boolean function (or partially defined Boolean function) into two new lower
complexity functions, which in turn are decomposed until all obtained functions
are threshold functions that can be implemented by a single neuron. The final
solution incorporates all functions in a single hidden layer architecture with an
output neuron that computes an OR or AND Boolean function.
The BabCoNN (Barycentric-based CoNN) (Bertini Jr. & Nicoletti, 2008a) is a
new two-class CoNN that borrows some of the ideas of the BCP (Barycentric
Correction Procedure, see (Poulard, 1995), (Poulard & Labreche, 1995)) and can
be considered a representative of the sequential model. Like the Upstart,
Table 1. Main characteristics of the two-class CoNN algorithms discussed in this chapter.

Algorithm | Group | # HL / Growth direction | New neuron connected to | Special feature | Stopping criteria
Tower | One HN per HL | Various / Forward | Previously added HN and INs | Weight update | CON, AD, NHL
Pyramid | One HN per HL | Various / Forward | All previously added HNs and INs | Dimension of weight space increases | CON, AD, NHL
Tiling | Neurons perform different functions | Various / Forward | Previous layer | Faithful layers; divide and conquer | CON, AD, NHL, NHN
PTI | Neurons perform different functions | Various / Forward | Previous layer | Faithful layers; inversion of classes | CON, AD, NHL, NHN
Upstart | Wrongly-on/off correctors | Binary tree / Backward | Parent neuron | Children correct the father's mistakes | CON, AD, NHL
Shift | Wrongly-on/off correctors | One / Backward | INs | Weighted connections are used to correct the output error | CON, AD, NHL
Perceptron cascade | Wrongly-on/off correctors | Cascade-like / Backward | Previously added HNs and INs | Output increases the dimension of its weight space every time a neuron is added | CON, AD, NHL
Cascade correlation | Wrongly-on/off correctors | Cascade-like / Backward | Previously added HNs and INs | Suitable for regression tasks | CON, AD, NHL
Offset | Neurons perform different functions | Two / Forward | Previous layer | Parity machine | CON, AD, NHL
IPA | Sequential | One / Forward | INs | Sequentially classifies the training set | TSC
Target switch | Sequential | Cascade (tree-like) / Backward | Previously added HNs and INs | (+) and (−) dichotomies | TSC
CBD | Sequential | Three / Forward | Previous layer | AND/OR layers | TSC
BabCoNN | Sequential | One / Backward | Input | HN fires −1, 0 or 1 | TSC
DASG | Sequential | One / Forward | Input | AND/OR output function | TSC
Perceptron Cascade (PC) and Shift, the BabCoNN also constructs the network,
beginning with the output neuron. However, it creates only one hidden layer; each
hidden neuron is connected to the input layer as well as to the output neuron, like
with the Shift algorithm.
Although the Upstart, PC and Shift construct the network by adding new hidden
neurons specialized in correcting wrongly-on and wrongly-off errors, the
BabCoNN employs a different strategy. The BCP is used for constructing a
hyperplane in the hyperspace defined by the training set; the classified patterns are
removed from the set and the process is repeated again with the updated training
set. Due to the way the algorithm works a certain degree of redundancy is inserted
in the process, in the sense of a pattern being correctly classified by more than one
hidden neuron. This has been fixed by the BabCoNN classification process, where
hidden neurons have a particular way of firing their output.
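The sequential model shared by IPA, CBD and BabCoNN (train a unit, remove what it covers, repeat) can be sketched as below; `train_unit` is a hypothetical callable that fits one hidden unit on the remaining patterns and returns it as a predictor:

```python
def sequential_covering(patterns, targets, train_unit):
    """Sequential-model sketch: train a unit on the remaining patterns,
    drop the patterns it classifies correctly, and repeat."""
    units = []
    remaining = list(zip(patterns, targets))
    while remaining:
        unit = train_unit(remaining)
        units.append(unit)
        still_wrong = [(x, t) for x, t in remaining if unit(x) != t]
        if len(still_wrong) == len(remaining):
            break             # no progress: stop to avoid looping forever
        remaining = still_wrong
    return units

# Toy unit: always predicts the class of the first remaining pattern,
# so each round removes one whole class.
def make_unit(data):
    t0 = data[0][1]
    return lambda x, t0=t0: t0

units = sequential_covering(["a", "b", "c"], [0, 0, 1], make_unit)
```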
Table 1 summarizes the main characteristics of the fourteen two-class algorithms
previously discussed. The following abbreviations are adopted in the table:
Forward (the NN is grown from the input towards the output layer); Backward
(the NN is grown from output towards input layer); INs: all neurons in the input
layer; HN: a hidden neuron; HL: a hidden layer; #HL: number of hidden layers.
The following abbreviations were adopted for stopping criteria: CON
(convergence); AD (accuracy decay); NHL (number of hidden layers exceeds a
given threshold); NHN (number of hidden neurons per hidden layer exceeds a
given threshold); TSC (all training patterns have been correctly classified).
Kwok & Yeung (1997b) conducted a careful investigation of the objective
functions for training hidden neurons in CoNNs for multilayer feedforward
networks for regression problems, aiming at deriving a class of objective
functions whose value and the corresponding weight updates can be computed in
O(N) time, for a training set with N patterns.
In spite of the many CoNN algorithms surveyed in (Kwok & Yeung, 1997a),
the most popular for regression problems is no doubt the Cascade Correlation
algorithm (CasCor), and perhaps the second most popular is the DNC (Dynamic
Node Creation) algorithm. While the DNC algorithm constructs neural networks
with a single hidden layer, the CasCor creates them with multiple hidden layers,
where each hidden layer has one hidden neuron. The popularity of CasCor is
attested by the many variations this algorithm has inspired and by its use in
combined approaches with other learning methods.
A similar approach to CasCor called Constructive Backpropagation (CBP) was
proposed in (Lehtokangas, 1999). The RCC, a recurrent extension to CasCor, is
described in (Fahlman, 1991) and its limitations are presented and discussed in
(Giles et al., 1995). In (Kremer, 1996) the conclusions of Giles et al. in relation to
RCC are extended. An investigation into problems and improvements in relation
to the basic CasCor can be found in (Prechelt, 1997), where CasCor and five of its
variations are empirically compared using 42 different datasets from the
benchmark PROBEN1 (Prechelt, 1994).
CasCor has also inspired the proposal of the Fixed Cascade Error (FCE),
described in (Lahnajärvi et al., 1999c), (Lahnajärvi et al., 2002), which is an
enhanced version of a previous algorithm proposed by the same authors known as
Cascade Error (CE) (see (Lahnajärvi et al., 1999a), (Lahnajärvi et al., 1999b)).
While the general structure of both algorithms is the same, they differ in the way
the hidden neurons are created.
The Rule-based Cascade-correlation (RBCC) proposed in (Thivierge et al.,
2004) is a collaborative symbolic-NN approach which is partially inspired by the
KBANN (Knowledge-Based Artificial Neural Networks) model proposed in
(Towell et al., 1990), (Towell, 1991), where the NN used is a CasCor network. In
the KBANN an initial set of rules is translated into a neural network which is then
refined using a training set of patterns; the refined neural network can undergo a
further step and be converted into a set of symbolic rules which could, again, be
used as the starting point for constructing a neural network and the whole cycle
would be repeated.
According to the authors, the RBCC is a particular case of the Knowledge-based
Cascade-correlation algorithm (KBCC) (Shultz & Rivest, 2000) (Shultz &
Rivest, 2001). The KBCC extends the CasCor by allowing as hidden neurons
during the growth of a NN not only single neurons, but previously learned
networks as well. In (Thivierge et al., 2003) an algorithm that implements
simultaneous growing and pruning of CasCor networks is described; the pruning
is done by removing irrelevant connections using the Optimal Brain Damage
(OBD) procedure (Le Cun et al., 1990).
In (Islam & Murase, 2001) the authors propose the CNNDA (Cascade Neural
Network Design Algorithm) for inducing two-hidden-layer NNs. The method
automatically determines the number of nodes in each hidden layer and can also
reduce a two-hidden-layer network to a single-layer network. It is based on the use
of a temporary weight freezing technique. The Fast Constructive-Covering
Algorithm (FCCA) for NN construction proposed in (Wang, 2008) is based on
geometrical expansion. It has the advantage of each training example having to be
learnt only once, which allows the algorithm to work faster than traditional
training algorithms.
6 Miscellaneous
A few constructive approaches have also been devised for RBF (Radial Basis
Function) networks, such as the Orthogonal Least Squares (OLS) (Chen et al., 1989)
(Chen et al., 1991) and the Growing Radial Basis Function (GRBF) networks
(Karayiannis & Weiqun, 1997).
Although CoNN algorithms have considerable potential with respect to both
the size of the induced network and its accuracy, their use, particularly in the
area of classification problems, is not as widespread as might be expected,
considering their many advantages. In regression problems, however,
CoNNs have been very popular, particularly the Cascade-Correlation algorithm
and many of its variations. In what follows some of the most recent works using
CoNN are mentioned.
In (Lahnajärvi et al., 2004) four CasCor-based CoNN algorithms have been
used for evaluating the movements of a robotic manipulator. In (Huemer et al.,
2008) the authors describe a method for controlling machines, such as mobile
robots, using a very specific CoNN. The NN is grown based on a reward value
given by a feedback function that analyses the on-line performance of a certain
task. In fact, since conventional NNs are commonly used in control tasks
(Alnajjar & Murase, 2005), this is a potential application area for CoNN
algorithms as well.
In (Giordano et al., 2008), a committee of CasCor neural networks was
implemented as a software filter, for the online filtering of CO2 signals from a
bioreactor gas outflow. The knowledge-based CasCor proposal (KBCC) previously
mentioned has been used in a few knowledge domains, such as simulation of
cognitive development (see e.g. (Mareschal & Shultz, 1999) and (Sirois & Shultz,
1998)), vowel recognition (Rivest & Shultz, 2002) and for gene-splice-junction
determination (Thivierge & Shultz, 2002), a benchmark problem from the UCI
Machine Learning Repository (Asuncion & Newman, 2007). A more in depth
investigation into the use of the knowledge-based neural learning implemented by
the KBCC in developmental robotics can be seen in (Shultz et al., 2007).
A few other non-conventional approaches to CoNN can be found in recent
works, such as the one described in (García-Pedrajas & Ortiz-Boyer, 2007), based
on cooperative co-evolution, for the automatic induction of the structure of an NN
for classification purposes; the method partially tries to avoid the problems of
greedy approaches. In (Yang et al., 2008) the authors combined the ridgelet
function with feedforward neural networks in the ICRNN (Incremental
Constructive Ridgelet Neural Network) model. The ridgelet function was chosen
very compact architectures. The intrinsic dynamics of the algorithm have been
applied for detecting and filtering noisy instances, reducing overfitting and
improving the generalization ability.
7 Conclusions
This chapter has presented an overview of several CoNN algorithms and
highlighted some of their applications and contributions. Although focusing on
feedforward architectures for classification tasks, the chapter has also tried to
present a broad view of the area, discussing several of the most recent contributions.
An interesting aspect of CoNN research is its chronology: most of the CoNN
algorithms for classification tasks were proposed in the nineties, and since then
few new proposals have been published. Another point to consider is the scarcity
of real-world applications involving CoNN algorithms; this is quite surprising,
considering how many are available and the fact that several have competitive
performance in comparison with more traditional approaches. The tendency in
the area is to diversify both the architecture and the constructive process itself
by including collaborative techniques. What has been surveyed in this chapter is
just a part of the research work going on in the area of CoNN algorithms. As
mentioned in the Introduction, there is a very promising line of research, the group
of evolutionary techniques, that has contributed much to the development of
CoNNs and was not the subject of this chapter.
References
Alnajjar, F., Murase, K.: Self-organization of spiking neural network generating
autonomous behavior in a real mobile robot. In: Proceedings of The International
Conference on Computational Intelligence for Modeling, Control and Automation,
vol. 1, pp. 1134–1139 (2005)
Amaldi, E., Guenin, B.: Two constructive methods for designing compact feedforward
networks of threshold units. International Journal of Neural Systems 8(5–6), 629–645
(1997)
Anlauf, J.K., Biehl, M.: The AdaTron: an adaptive perceptron algorithm. Europhysics
Letters 10, 687–692 (1989)
Ash, T.: Dynamic node creation in backpropagation networks. Connection
Science 1(4), 365–375 (1989)
Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California,
School of Information and Computer Science, Irvine (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
Barreto-Sanz, M., Pérez-Uribe, A., Peña-Reyes, C.-A., Tomassini, M.: Fuzzy growing
hierarchical self-organizing networks. In: Kůrková, V., Neruda, R., Koutník, J. (eds.)
ICANN 2008, Part II. LNCS, vol. 5164, pp. 713–722. Springer, Heidelberg (2008)
Bertini Jr., J.R., Nicoletti, M.C.: A constructive neural network algorithm based on the
geometric concept of barycenter of convex hull. In: Rutkowski, L., Tadeusiewicz, R.,
Zadeh, L.A., Zurada, J. (eds.) Computational Intelligence: Methods and Applications,
pp. 1–12. Academic Publishing House Exit, IEEE Computational Intelligence Society,
Poland (2008a)
Bertini Jr., J.R., Nicoletti, M.C.: MBabCoNN A multiclass version of a constructive
neural network algorithm based on linear separability and convex hull. In: Krkov, V.,
Neruda, R., Koutnk, J. (eds.) ICANN 2008,, Part II. LNCS, vol. 5164, pp. 723733.
Springer, Heidelberg (2008)
Bishop, C.M.: Neural networks for pattern recognition. Oxford University Press, USA
(1996)
Burgess, N.: A constructive algorithm that converges for real-valued input patterns. Int.
Journal of Neural Systems 5(1), 5966 (1994)
Campbell, C.: Constructive learning techniques for designing neural network systems. In:
Leondes, C. (ed.) Neural Network Systems Technologies and Applications, vol. 2.
Academic Press, San Diego (1997)
Campbell, C., Vicente, C.P.: The target switch algorithm: a constructive learning procedure
for feed-forward neural networks. Neural Computation 7(6), 12451264 (1995)
Chen, S., Billings, S., Luo, W.: Orthogonal least squares learning methods and their
application to non-linear system identification. International Journal of Control 50,
18731896 (1989)
Chen, S., Cowan, C., Grant, P.M.: Orthogonal least squares learning algorithm for radial
basis function networks. IEEE Transactions on Neural Networks 2, 302309 (1991)
Drghici, S.: The constraint based decomposition (CBD) training architecture. Neural
Networks 14, 527550 (2001)
Elizondo, D., Ortiz-de-Lazcano-Lobato, J.M., Birkenhead, R.: On the generalization of the
m-class RDP neural network. In: Krkov, V., Neruda, R., Koutnk, J. (eds.) ICANN
2008, Part II. LNCS, vol. 5164, pp. 734743. Springer, Heidelberg (2008)
Fahlman, S.E.: Faster-learning variations on backpropagation: an empirical study. In:
Touretzky, D.S., Hinton, G.E., Sejnowski, T.J. (eds.) Proceedings of the 1988
Connectionist Models Summer School, pp. 3851. Morgan Kaufmann, San Mateo
(1988)
Fahlman, S., Lebiere, C.: The cascade correlation architecture, Advances in Neural
Information Processing Systems, vol. 2, pp. 524532. Morgan Kaufman, San Mateo
(1990)
Fahlman, S.: The recurrent cascade-correlation architecture, Advances in Neural
Information Processing Systems, vol. 3, pp. 190196. Morgan Kaufman, San Mateo
(1991)
Farlow, S.J. (ed.): Self-organizing methods in Modeling: GMDH Type Algorithms. In:
Statistics: Textbooks and Monographs, vol. 54. Marcel Dekker, New York (1984)
Ferrari, E., Muselli, M.: A constructive technique based on linear programming for training
switching neural networks. In: Krkov, V., Neruda, R., Koutnk, J. (eds.) ICANN 2008,
Part II. LNCS, vol. 5164, pp. 744753. Springer, Heidelberg (2008)
Fiesler, E.: Comparative bibliography of ontogenic neural networks. In: Proceedings of The
International Conference on Artificial Neural Networks (ICANN), Sorrento, vol. 94, pp.
793796 (1994)
20 M. do Carmo Nicoletti et al.
Frean, M.: The upstart algorithm: a method for constructing and training feedforward
neural networks. Neural Computation 2, 198–209 (1990)
Frean, M.: A thermal perceptron learning rule. Neural Computation 4, 946–957 (1992)
Friedman, J.H., Stuetzle, W.: Projection pursuit regression. Journal of the American
Statistical Association 76(376), 817–823 (1981)
Friedman, J.: Exploratory projection pursuit. Journal of the American Statistical
Association 82, 249–266 (1987)
Gallant, S.I.: Optimal linear discriminants. In: Proceedings of The Eighth International
Conference on Pattern Recognition, pp. 849–852 (1986a)
Gallant, S.I.: Three constructive algorithms for network learning. In: Proceedings of The
Eighth Annual Conference of the Cognitive Science Society, Amherst, MA, pp. 652–660
(1986b)
Gallant, S.I.: Perceptron based learning algorithms. IEEE Transactions on Neural
Networks 1(2), 179–191 (1990)
Gallant, S.I.: Neural Network Learning & Expert Systems. The MIT Press, England (1994)
García-Pedrajas, N., Ortiz-Boyer, D.: A cooperative constructive method for neural
networks for pattern recognition. Pattern Recognition 40, 80–98 (2007)
Ghosh, J., Tumer, K.: Structural adaptation and generalization in supervised feed-forward
networks. Journal of Artificial Neural Networks 1(4), 431–458 (1994)
Giles, C.L., Chen, D., Sun, G.-Z., Chen, H.-H., Lee, Y.-C., Goudreau, M.W.: Constructive
learning of recurrent neural network: limitations of recurrent cascade correlation and a
simple solution. IEEE Transactions on Neural Networks 6(4), 829–836 (1995)
Giordano, R.C., Bertini Jr., J.R., Nicoletti, M.C., Giordano, R.L.C.: Online filtering of CO2
signals from a bioreactor gas outflow using a committee of constructive neural
networks. Bioprocess and Biosystems Engineering 31(2), 101–109 (2008)
Grochowski, M., Duch, W.: Projection pursuit constructive neural networks based on
quality of projected clusters. In: Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN
2008, Part II. LNCS, vol. 5164, pp. 754–762. Springer, Heidelberg (2008)
Hrycej, T.: Modular Learning in Neural Networks. Wiley, New York (1992)
Huemer, A., Elizondo, D., Gongora, M.: A reward-value based constructive method for the
autonomous creation of machine controllers. In: Kůrková, V., Neruda, R., Koutník, J.
(eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 773–782. Springer, Heidelberg
(2008)
Islam, M.M., Murase, K.: A new algorithm to design compact two-hidden-layer artificial
neural networks. Neural Networks 14, 1265–1278 (2001)
Karayiannis, N.B., Weiqun, G.: Growing radial basis neural networks: merging supervised
and unsupervised learning with network growth techniques. IEEE Transactions on
Neural Networks 8(6), 1492–1506 (1997)
Krauth, W., Mézard, M.: Learning algorithms with optimal stability in neural networks.
Journal of Physics A 20, 745–752 (1987)
Kremer, S.: Comments on constructive learning of recurrent neural networks: limitations of
recurrent cascade correlation and a simple solution. IEEE Transactions on Neural
Networks 7(4), 1047–1049 (1996)
Kubat, M.: Designing neural network architectures for pattern recognition. The Knowledge
Engineering Review 15(2), 151–170 (2000)
Kwok, T.-Y., Yeung, D.-Y.: Constructive algorithms for structure learning in feedforward
neural networks for regression problems. IEEE Transactions on Neural Networks 8(3),
630–645 (1997a)
Kwok, T.-Y., Yeung, D.-Y.: Objective functions for training new hidden units in
constructive neural networks. IEEE Transactions on Neural Networks 8(5), 1131–1148
(1997b)
Lahnajärvi, J.J.T., Lehtokangas, M.I., Saarinen, J.P.P.: Fast constructive methods for
regression problems. In: Proceedings of the 18th IASTED International Conference on
Modelling, Identification and Control (MIC 1999), Innsbruck, Austria, pp. 442–445
(1999a)
Lahnajärvi, J.J.T., Lehtokangas, M.I., Saarinen, J.P.P.: Comparison of constructive neural
networks for structure learning. In: Proceedings of the 18th IASTED International
Conference on Modelling, Identification and Control (MIC 1999), Innsbruck, Austria,
pp. 446–449 (1999b)
Lahnajärvi, J.J.T., Lehtokangas, M.I., Saarinen, J.P.P.: Fixed cascade error – a novel
constructive neural network for structure learning. In: Proceedings of the Artificial
Neural Networks in Engineering Conference (ANNIE 1999), St. Louis, USA, pp. 25–30
(1999c)
Lahnajärvi, J.J.T., Lehtokangas, M.I., Saarinen, J.P.P.: Evaluation of constructive neural
networks with cascaded architectures. Neurocomputing 48, 573–607 (2002)
Lahnajärvi, J.J.T., Lehtokangas, M.I., Saarinen, J.P.P.: Estimating movements of a robotic
manipulator by hybrid constructive neural networks. Neurocomputing 56, 345–363
(2004)
Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretzky, D.S. (ed.)
Advances in Neural Information Processing Systems, vol. 2, pp. 598–605. Morgan
Kaufmann, San Mateo (1990)
Lehtokangas, M.: Modelling with constructive backpropagation. Neural Networks 12,
707–716 (1999)
Ma, L., Khorasani, K.: A new strategy for adaptively constructing multilayer feedforward
neural networks. Neurocomputing 51, 361–385 (2003)
Ma, L., Khorasani, K.: New training strategies for constructive neural networks with
application to regression problems. Neural Networks 17, 589–609 (2004)
Marchand, M., Golea, M., Ruján, P.: A convergence theorem for sequential learning in
two-layer perceptrons. Europhysics Letters 11(6), 487–492 (1990)
Marchand, M., Golea, M.: On learning simple neural concepts: from halfspace intersections
to neural decision lists. Network: Computation in Neural Systems 4(1), 67–85 (1993)
Mareschal, D., Schultz, T.R.: Development of children's seriation: a connectionist
approach. Connection Science 11, 149–186 (1999)
Mascioli, F.M.F., Martinelli, G.: A constructive algorithm for binary neural networks: the
oil-spot algorithm. IEEE Transactions on Neural Networks 6, 794–797 (1995)
Mayoraz, E., Aviolat, F.: Constructive training methods for feedforward neural networks
with binary weights. International Journal of Neural Networks 7, 149–166 (1996)
Mézard, M., Nadal, J.: Learning feedforward networks: the tiling algorithm. J. Phys. A:
Math. Gen. 22, 2191–2203 (1989)
Muselli, M.: Sequential constructive techniques. In: Leondes, C. (ed.) Neural Network
Systems Techniques and Applications, vol. 2, pp. 81–144. Academic Press, San Diego
(1998)
Muselli, M.: Switching neural networks: a new connectionist model for classification. In:
Apolloni, B., Marinaro, M., Nicosia, G., Tagliaferri, R. (eds.) WIRN 2005 and NAIS
2005. LNCS, vol. 3931, pp. 23–30. Springer, Heidelberg (2006)
Nabhan, T.M., Zomaya, A.Y.: Toward generating neural network structures for function
approximation. Neural Networks 7(1), 89–99 (1994)
Nadal, J.-P.: Study of a growth algorithm for a feedforward network. International Journal
of Neural Systems 1(1), 55–59 (1989)
Nguifo, E.M., Tsopzé, N., Tindo, G.: M-CLANN: multi-class concept lattice-based
artificial neural network for supervised classification. In: Kůrková, V., Neruda, R.,
Koutník, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 812–821. Springer,
Heidelberg (2008)
Nicoletti, M.C., Bertini Jr., J.R.: An empirical evaluation of constructive neural network
algorithms in classification tasks. International Journal of Innovative Computing and
Applications 1(1), 2–13 (2007)
Parekh, R.G., Yang, J., Honavar, V.: Constructive neural network learning algorithms for
multi-category classification. TR ISU-CS-TR95-15a, Iowa State University, IA (1995)
Parekh, R.G., Yang, J., Honavar, V.: MUpstart – a constructive neural network learning
algorithm for multi-category pattern classification. In: Proceedings of the IEEE/INNS
International Conference on Neural Networks (ICNN 1997), vol. 3, pp. 1924–1929
(1997a)
Parekh, R.G., Yang, J., Honavar, V.: Pruning strategies for the MTiling constructive
learning algorithm. In: Proceedings of the IEEE/INNS International Conference on
Neural Networks (ICNN 1997), vol. 3, pp. 1960–1965 (1997b)
Parekh, R.G., Yang, J., Honavar, V.: Constructive neural-network learning algorithms for
pattern classification. IEEE Transactions on Neural Networks 11(2), 436–451 (2000)
Platt, J.: A resource-allocating network for function interpolation. Neural Computation 3,
213–225 (1991)
Poulard, H.: Barycentric correction procedure: a fast method for learning threshold unit. In:
Proceedings of WCNN 1995, vol. 1, pp. 710–713 (1995)
Poulard, H., Estève, D.: A convergence theorem for barycentric correction procedure.
Technical Report 95180, LAAS-CNRS, Toulouse, France (1995)
Poulard, H., Labreche, S.: A new threshold unit learning algorithm. Technical Report
95504, LAAS-CNRS, Toulouse, France (1995)
Poulard, H., Hernandez, N.: Training a neuron in sequential learning. Neural Processing
Letters 5, 91–95 (1997)
Prechelt, L.: PROBEN1 – a set of neural-network benchmark problems and benchmarking
rules. Tech. Rep. 21/94, Fakultät für Informatik, Univ. Karlsruhe, Germany (1994)
Prechelt, L.: Investigation of the CasCor family of learning algorithms. Neural
Networks 10(5), 885–896 (1997)
Ruján, P., Marchand, M.: Learning by minimizing resources in neural networks. Complex
Systems 3, 229–241 (1989)
Reed, R.: Pruning algorithms – a survey. IEEE Transactions on Neural Networks 4(5),
740–747 (1993)
Rivest, F., Shultz, T.R.: Application of knowledge-based cascade-correlation to vowel
recognition. In: Proceedings of The International Joint Conference on Neural Networks,
pp. 53–58 (2002)
Schaffer, J.D., Whitley, D., Eshelman, L.J.: Combinations of genetic algorithms and neural
networks: a survey of the state of the art. In: Proceedings of the International Workshop
on Genetic Algorithms and Neural Networks, pp. 1–37 (1992)
Shultz, T.R., Rivest, F.: Knowledge-based cascade-correlation. In: Proceedings of The
IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol. 5, pp.
641–646 (2000)
Shultz, T.R., Rivest, F.: Knowledge-based cascade-correlation: using knowledge to speed
learning. Connection Science 13, 1–30 (2001)
Shultz, T.R., Rivest, F., Egri, L., Thivierge, J.-P., Dandurand, F.: Could knowledge-based
neural learning be useful in developmental robotics? The case of KBCC. International
Journal of Humanoid Robotics 4(2), 245–279 (2007)
Sirois, S., Shultz, T.R.: Neural network modeling of developmental effects in
discrimination shifts. Journal of Experimental Child Psychology 71, 235–274 (1998)
Smieja, F.J.: Neural network constructive algorithms: trading generalization for learning
efficiency? Circuits, Systems and Signal Processing 12, 331–374 (1993)
Subirats, J.L., Jerez, J.M., Franco, L.: A new decomposition algorithm for threshold
synthesis and generalization of Boolean functions. IEEE Transactions on Circuits and
Systems I 55, 3188–3196 (2008)
Subirats, J.L., Franco, L., Molina, I.A., Jerez, J.M.: Active learning using a constructive
neural network algorithm. In: Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN 2008,
Part II. LNCS, vol. 5164, pp. 803–811. Springer, Heidelberg (2008)
Tajine, M., Elizondo, D.: Enhancing the perceptron neural network by using functional
composition. Technical Report 96-07, Computer Science Department, Université Louis
Pasteur, Strasbourg, France (1996)
Tajine, M., Elizondo, D., Fiesler, E., Korczak, J.: Adapting the 2-class recursive
deterministic perceptron neural network to m-classes. In: Proceedings of The IEEE
International Conference on Neural Networks (ICNN 1997), Los Alamitos, vol. 3, pp.
1542–1546 (1997)
Thivierge, J.-P., Dandurand, F., Shultz, T.R.: Transferring domain rules in a constructive
network: introducing RBCC. In: Proceedings of The IEEE International Joint
Conference on Neural Networks, vol. 2, pp. 1403–1408 (2004)
Thivierge, J.-P., Shultz, T.R.: Finding relevant knowledge: KBCC applied to DNA splice-
junction determination. In: Proceedings of The IEEE International Joint Conference on
Neural Networks, pp. 1401–1405 (2002)
Thivierge, J.-P., Rivest, F., Shultz, T.R.: A dual-phase technique for pruning constructive
networks. In: Proceedings of The IEEE International Joint Conference on Neural
Networks, vol. 1, pp. 559–564 (2003)
Towell, G.G.: Symbolic knowledge and neural networks: insertion, refinement and
extraction. Doctoral dissertation, University of Wisconsin, Madison, WI, USA (1991)
Towell, G.G., Shavlik, J.W., Noordewier, M.O.: Refinement of approximate domain
theories by knowledge-based neural networks. In: Proceedings of the Eighth National
Conference on Artificial Intelligence, Boston, MA, pp. 861–866 (1990)
Tsopzé, N., Nguifo, E.M., Tindo, G.: Concept-lattice-based artificial neural network. In:
Diatta, J., Eklund, P., Liquière, M. (eds.) Proceedings of the Fifth International
Conference on Concept Lattices and their Applications (CLA 2007), Montpellier,
France, pp. 157–168 (2007)
Wang, D.: Fast constructive-covering algorithm for neural networks and its implement in
classification. Applied Soft Computing 8, 166–173 (2008)
Wendemuth, A.: Learning the unlearnable. Journal of Physics A 28, 5423–5436 (1995)
Yang, J., Parekh, R.G., Honavar, V.: MTiling – a constructive network learning algorithm
for multi-category pattern classification. In: Proceedings of the World Congress on
Neural Networks, pp. 182–187 (1996)
Yang, S., Wang, M., Jiao, L.: Incremental constructive ridgelet neural network.
Neurocomputing 72, 367–377 (2008)
Yao, X.: Evolving neural networks. Proceedings of the IEEE 87(9), 1423–1447 (1999)
Young, S., Downs, T.: CARVE – a constructive algorithm for real-valued examples. IEEE
Transactions on Neural Networks 9(6), 1180–1190 (1998)
Efficient Constructive Techniques for Training
Switching Neural Networks
Abstract. In this paper a general constructive approach for training neural networks
in classification problems is presented. This approach is used to construct a particular
connectionist model, named Switching Neural Network (SNN), based on the
conversion of the original problem into a Boolean lattice domain. The training of an
SNN can be performed through a constructive algorithm, called Switch Programming
(SP), based on the solution of a proper linear programming problem. Since the
execution of SP may require excessive computational time, an approximate version
of it, named Approximate Switch Programming (ASP), has been developed.
Simulation results obtained on the StatLog benchmark show the good quality of the
SNNs trained with SP and ASP.
1 Introduction
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 25–48.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
26 E. Ferrari and M. Muselli
minimizing at the same time the error on the available data and a measure of the
complexity of g. In fact, according to Occam's razor principle and the main results of
statistical learning theory [18], on equal training errors a simpler function g has a
higher probability of achieving good accuracy on examples not belonging to S.
To pursue this double target, any technique for the solution of classification
problems must perform two different actions: choosing a class of decision functions
(model definition) and retrieving the best classifier g within it (training phase). These
two tasks imply a trade-off between a correct description of the data and the
generalization ability of the resulting classifier. In fact, if the chosen class is too
large, the problem of overfitting is likely to occur: the optimal classifier g behaves
well on the examples of the training set, but scores a high number of
misclassifications on the other points of the input domain. On the other hand, the
choice of a small class prevents retrieval of a function with a sufficient level of
accuracy on the training set.
Backpropagation algorithms [17] have been widely used to train multilayer
perceptrons: when these learning techniques are applied, the choice of the class of
functions is performed by defining some topological properties of the net, such as the
number of hidden layers and neurons. In most cases, this must be done without any
prior information about the problem at hand, and several validation trials are needed
to find a satisfactory network architecture.
In order to avoid this problem, two different approaches have been introduced:
pruning methods [16] and constructive techniques [10]. The former consider an ini-
tial trained neural network with a large number of neurons and adopt smart tech-
niques to find and eliminate those connections and units which have a negligible
influence on the accuracy of the classifier. However, training a large neural network
may increase the computational time required to obtain a satisfactory classifier.
On the other hand, constructive methods initially consider a neural network in-
cluding only the input and the output layers. Then, hidden neurons are added itera-
tively until a satisfactory description of the examples in the training set is reached.
In most cases the connections between hidden and output neurons are decided be-
fore training, so that only a small part of the weight matrix has to be updated at each
iteration. It has been shown [10] that constructive methods usually exhibit rapid
convergence to a well-generalizing solution and also allow the treatment of complex
training sets. Nevertheless, since the inclusion of a new hidden unit involves only a
limited number of weights, some correlations between the data in the training set
may be missed.
Here, we will present a new connectionist model, called Switching Neural
Network (SNN) [12], which can be trained in a constructive way while achieving
generalization errors comparable to those of the best machine learning techniques.
An SNN includes a first layer containing a particular kind of A/D converter, called
latticizers, which suitably transform input vectors into binary strings. Then, the
subsequent two layers compute a positive Boolean function that solves in a lattice
domain the original classification problem.
Since it has been shown [11] that positive Boolean functions can approximate
any measurable function within any desired precision, the SNN model is sufficiently
2 Problem Setting
Consider a general binary classification problem, where d-dimensional patterns x
are to be assigned to one of two possible classes, labeled by the values of a Boolean
output variable y ∈ {0, 1}. According to possible situations in real-world problems,
the type of each component xi, i = 1, . . . , d, may be one of the following:
– continuous ordered: xi assumes values inside an uncountable subset Ui of the real
domain R; typically, Ui is an interval [ai, bi] (possibly open at one end or at both
extremes) or the whole of R.
– discrete ordered: xi assumes values inside a countable set Ci on which a total
ordering is defined; typically, Ci is a finite subset of Z.
– nominal: xi assumes values inside a finite set Hi on which no ordering is defined;
for example, Hi can be a set of colors or a collection of geometric shapes.
If xi is a binary component, it can be viewed as a particular case of a discrete ordered
variable or as a nominal variable; to remove a possible source of ambiguity, a binary
component will always be considered henceforth as a nominal variable.
Denote with Im the set {1, 2, . . . , m} of the first m positive integers; when the
domain Ci of a discrete ordered variable is finite, it is isomorphic to Im with m = |Ci|,
where |A| denotes the cardinality of the set A. On the other hand, if Ci is infinite, it
can be shown to be isomorphic to the set Z of the integers (if Ci is neither lower nor
upper bounded), to the set N of the positive integers (if Ci is lower bounded), or to
the set Z \ N (if Ci is upper bounded).
28 E. Ferrari and M. Muselli
For each of the above three types of components xi a proper metric can be defined.
In fact, many different distances have been proposed in the literature to characterize
the subsets of R. Even when the standard topology is assumed, different equivalent
metrics can be adopted; throughout this paper, the absolute metric da(v, v′) = |v − v′|
will be employed to measure the dissimilarity between two values v, v′ ∈ R assumed
by a continuous ordered component xi.
In the same way, when xi is a discrete ordered component, the above cited
isomorphism between its domain Ci and a suitable subset K of Z makes it possible to
adopt the distance dc(v, v′) induced on Ci by the absolute metric da(v, v′) = |v − v′|
on K. Henceforth we will use the term counter metric to denote the distance dc.
Finally, if xi is a nominal component, no ordering relation exists between any pair
of elements of its domain Hi; we can only assert that a value v assumed by xi is equal
or not to another value v′. Consequently, we can adopt in Hi the flat metric df(v, v′)
defined as

    df(v, v′) = 0 if v = v′,  df(v, v′) = 1 if v ≠ v′

for every v, v′ ∈ Hi. Note that, by considering a counting function which assigns a
different positive integer in Im to each element of the set Hi, with m = |Hi|, we can
substitute the domain Hi of xi with the set Im without affecting the given
classification problem. It is sufficient to employ the flat metric df(v, v′) also in Im,
which is no longer seen as an ordered set.
According to this framework, to simplify the exposition we suppose henceforth
that the patterns x of our classification problem belong to a set X = ∏_{i=1}^d Xi,
where each one-dimensional domain Xi can be a subset of the real field R if xi is a
continuous ordered variable, a subset of the integers Z if xi is a discrete ordered
component, or the finite set Im (for some positive integer m), without ordering on it,
if xi is a nominal variable.
A proper metric dX(x, x′) on X can simply be defined by summing up the
contributions di(xi, xi′) given by the different components:

    dX(x, x′) = Σ_{i=1}^d di(xi, xi′),  for any x, x′ ∈ X

where di is the absolute metric da if xi and xi′ are (continuous or discrete) ordered
variables, or the flat metric df if xi and xi′ are nominal variables.
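As a concrete illustration, the combined metric dX can be sketched in a few lines; the function name, the "ordered"/"nominal" type tags and the tuple representation of patterns below are our own illustrative choices, not notation from the chapter.

```python
# Sketch of the combined metric d_X for mixed-type patterns: each
# component contributes the absolute metric d_a if it is (continuous
# or discrete) ordered, or the flat metric d_f if it is nominal.

def d_X(x, x2, types):
    """Sum the per-component distances d_i(x_i, x'_i)."""
    total = 0.0
    for xi, xj, t in zip(x, x2, types):
        if t == "ordered":            # absolute metric d_a(v, v') = |v - v'|
            total += abs(xi - xj)
        elif t == "nominal":          # flat metric d_f: 0 if equal, 1 otherwise
            total += 0 if xi == xj else 1
        else:
            raise ValueError(f"unknown component type: {t}")
    return total

# Example: one continuous, one discrete ordered and one nominal component.
types = ("ordered", "ordered", "nominal")
print(d_X((1.5, 3, "red"), (0.5, 5, "blue"), types))  # 1.0 + 2 + 1 = 4.0
```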
The target of a binary classification problem is to choose, within a predetermined
class of decision functions, the classifier g : X → {0, 1} that minimizes the number
of misclassifications on the whole set X. If this class is equal to the collection M of
all the measurable decision functions, this amounts to selecting the Bayes classifier
gopt(x) [2]. On the other hand, if it is a proper subset of M, the optimal decision
function corresponds to the classifier g that best approximates gopt according to a
proper distance in M.
Unfortunately, in real-world problems we only have access to a training set S,
i.e. a collection of s observations (xk, yk), k = 1, . . . , s, for the problem at hand.
In [10] the decision list is used hierarchically: a pattern x is assigned to the class yh,
where h is the lowest index such that x ∈ Lh. It is possible to consider more general
criteria in the output assignment: for example, a weight wh > 0 can be associated
with each domain Lh, measuring the reliability of assigning the output value yh to
every point in Lh.
It is thus possible to associate with every pattern x a weight vector u, whose h-th
component is defined by

    uh = wh if x ∈ Lh,  uh = 0 otherwise

for h = 1, . . . , t. The weight uh can be used to choose the output for the pattern x.
Without loss of generality, suppose that the decision list is ordered so that yh = 0
for h = 1, . . . , t0, whereas yh = 1 for h = t0 + 1, . . . , t0 + t1, where t0 + t1 = t. The
value of yt+1 is the default decision, i.e. the output assigned to x if x ∉ Lh for every
h = 1, . . . , t.
In order to fix a criterion in the output assignment for an input vector x let us
present the following
This classifier can then be implemented in a two-layer neural network: the first
layer retrieves the weights uh for h = 1, . . . , t, whereas the second one realizes the
output decider. The behavior of the decider is usually chosen a priori, so that the
training phase consists of finding a proper decision list and the relative weight vector
w. For example, the decider can be made equivalent to a comparison between the
sums of the weights of the two classes: its output is 0 if Σ_{h=1}^{t0} uh >
Σ_{h=t0+1}^{t} uh, it is 1 if Σ_{h=1}^{t0} uh < Σ_{h=t0+1}^{t} uh, and it is the
default decision yt+1 otherwise.
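A minimal sketch of this weight-comparison decider follows; the function name, the list representation of u and the explicit `default` argument are our own illustrative choices.

```python
def decide(u, t0, default):
    """Compare the class-0 weights u[0:t0] with the class-1 weights u[t0:].

    Returns 0 or 1 according to which sum dominates; ties fall back
    to the default decision y_{t+1}.
    """
    s0, s1 = sum(u[:t0]), sum(u[t0:])
    if s0 > s1:
        return 0
    if s0 < s1:
        return 1
    return default

# Example: x falls in one class-0 domain (weight 0.4) and in two
# class-1 domains (weights 0.7 and 0.2), so class 1 wins.
print(decide([0.4, 0.0, 0.7, 0.2], t0=2, default=1))  # 0.4 < 0.9 -> 1
```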
The presence of noisy data can also be taken into account by allowing a small
number of errors in the training set. To this aim Def. 3 can be generalized by the
following
For y ∈ {0, 1} do
1. Set T = Sy and h = 1.
2. Find a partial classifier gh for T.
3. Let W = {(xk, yk) ∈ T : gh(xk) = 1}.
4. Set T = T \ W and h = h + 1.
5. If T is nonempty, go to step 2.
6. Prune redundant neurons and set ty = h.
the current output value not covered by the neurons already included in the network.
Notice that removing elements from T allows a considerable reduction of the training
time for each neuron, since fewer examples have to be processed at each iteration.
A pruning phase is performed at the end of the training process in order to
eliminate redundant overlaps among the sets Lh, h = 1, . . . , t.
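The six steps above can be sketched as follows; `find_partial_classifier` stands for any procedure returning a neuron that covers at least one example of T, the names and data representation are our own, and the pruning of step 6 is omitted.

```python
def constructive_training(S, find_partial_classifier):
    """Grow, for each output value y, a list of partial classifiers
    (hidden neurons) until all examples of S_y are covered."""
    networks = {}
    for y in (0, 1):
        T = [(x, t) for (x, t) in S if t == y]            # step 1: T = S_y
        neurons = []
        while T:                                          # step 5: loop while T nonempty
            g_h = find_partial_classifier(T)              # step 2
            W = [(x, t) for (x, t) in T if g_h(x) == 1]   # step 3
            assert W, "a partial classifier must cover at least one example"
            T = [p for p in T if p not in W]              # step 4: T = T \ W
            neurons.append(g_h)
        networks[y] = neurons                             # step 6 (pruning) omitted
    return networks

# Degenerate choice of find_partial_classifier: a neuron covering exactly
# the first remaining example; it trivially guarantees termination.
cover_first = lambda T: (lambda x, x0=T[0][0]: 1 if x == x0 else 0)
S = [((0,), 0), ((1,), 1), ((2,), 0)]
nets = constructive_training(S, cover_first)
print(len(nets[0]), len(nets[1]))  # 2 1
```

Any smarter `find_partial_classifier` that covers several examples at once shrinks T faster, which is exactly the training-time saving mentioned above.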
Without entering into details about the general theoretical properties of con-
structive techniques, which can be found in [10], in the following sections we will
present the architecture of Switching Neural Networks and an appropriate training
algorithm.
f : {0, 1}^n → {0, 1} is called positive (resp. negative) if v ≤ v′ implies f(v) ≤ f(v′)
(resp. f(v) ≥ f(v′)) for every v, v′ ∈ {0, 1}^n. Positive and negative Boolean
functions form the class of monotone Boolean functions.
Since the Boolean lattice {0, 1}^n does not involve the complement operator NOT,
Boolean expressions developed in this lattice (sometimes called lattice expressions)
can only include the logical operations AND and OR. As a consequence, not every
Boolean function can be written as a lattice expression. It can be shown that only
positive Boolean functions can be put in the form of lattice expressions.
A recent theoretical result [11] asserts that positive Boolean functions are
universal approximators, i.e. they can approximate every measurable function
g : X → {0, 1}, where X is the domain of a general binary classification problem, as
defined in Sec. 2. Denote with Q_l^n the subset of {0, 1}^n containing the strings of
n bits having exactly l values 1 inside them. A possible procedure for finding the
positive Boolean function f that approximates a given g within a desired precision is
based on the following three steps:
1. (Discretization) For every ordered input xi, determine a finite partition Bi of the
domain Xi such that a function can be found which approximates g on X within the
desired precision and assumes a constant value on every set B ∈ B, where
B = {∏_{i=1}^d Bi : Bi ∈ Bi, i = 1, . . . , d}.
2. (Latticization) By employing a proper function φ, map the points of the domain
X into the strings of Q_l^n, so that φ(x) = φ(x′) if x and x′ belong to the same set
B ∈ B, whereas φ(x) ≠ φ(x′) if x ∈ B and x′ ∈ B′, B and B′ being two different
sets in B.
3. (Positive Boolean function synthesis) Select a positive Boolean function f.
If g is completely known, these three steps can easily be performed; the higher the
required precision, the finer the partitions Bi for the domains Xi must be. This affects
the length n of the binary strings in Q_l^n, which has to be large enough to allow the
definition of the 1-1 mapping φ.
If a ∈ {0, 1}^n, let P(a) be the subset of In = {1, . . . , n} including the indexes i for
which ai = 1. It can be shown [14] that a positive Boolean function can always be
written as

    f(z) = ∨_{a∈A} ∧_{j∈P(a)} zj        (1)

where A is an antichain of the Boolean lattice {0, 1}^n, i.e. a set of Boolean strings
such that neither a < a′ nor a′ < a holds for any a, a′ ∈ A. It can be proved that a
positive Boolean function is univocally specified by the antichain A, so that the task
of retrieving f can be transformed into searching for a collection A of strings such
that a ≮ a′ for any a, a′ ∈ A.
The symbol ∨ (resp. ∧) in (1) denotes a logical sum (resp. product) among the
terms identified by the subscript. The logical product ∧_{j∈P(a)} zj is an implicant
for the function f; however, when no confusion arises, the term implicant will also be
used to denote the corresponding binary string a ∈ A.
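Evaluating (1) for a given antichain A is straightforward; in the sketch below Boolean strings are represented as tuples of 0/1 bits and indexes are 0-based, both illustrative choices of ours.

```python
def P(a):
    """Indexes j with a_j = 1 (0-based here for convenience)."""
    return [j for j, bit in enumerate(a) if bit == 1]

def f(z, A):
    """Equation (1): OR over a in A of the AND of z_j for j in P(a)."""
    return int(any(all(z[j] == 1 for j in P(a)) for a in A))

def is_antichain(A):
    """Check that no string of A lies componentwise below another one."""
    below = lambda a, b: a != b and all(ai <= bi for ai, bi in zip(a, b))
    return not any(below(a, b) for a in A for b in A)

A = [(1, 1, 0), (0, 0, 1)]      # two implicants: z1 AND z2, and z3
assert is_antichain(A)
print(f((1, 1, 0), A))  # first implicant fires -> 1
print(f((0, 1, 0), A))  # no implicant fires -> 0
```

Since every implicant only tests for 1-bits, the resulting f is positive by construction: raising any bit of z can never switch the output from 1 to 0.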
Preliminary tests have shown that a more robust method for classification prob-
lems consists of defining a positive Boolean function fy (i.e. an antichain Ay to be
[Figure: architecture of a Switching Neural Network. Latticizers map the input
components x1, x2, . . . , xd into binary strings z1, . . . , zm; a layer of switches feeds
AND gates that output the weights u1, . . . , ur0 (class 0) and ur0+1, . . . , ur (class 1);
an output decider, together with a default decision, produces g(x).]
inserted in (1)) for each output value y, and in properly combining the functions
relative to the different output classes. To this aim, each generated implicant can be
characterized by a weight wh > 0, which measures its significance level for the
examples in the training set. Thus, to each Boolean string z can be assigned a weight
vector u, whose h-th component is

    uh = Fh(z) = wh if ∧_{j∈P(ah)} zj = 1,  uh = 0 otherwise
4.1 Discretization
Since the exact behavior of the function g is not known, the approximating function
ĝ and the partition B have to be inferred from the samples (x_k, y_k) ∈ S. It follows
that at the end of the discretization task every set B ∈ B_i must be large enough
to include the component x_{ki} of some point x_k in the training set. Nevertheless, the
resulting partition B must be fine enough to capture the actual complexity of the
function g.
Several discretization methods for binary classification problems have
been proposed in the literature [1, 5, 6, 7]. Usually, for each ordered input x_i a set
of m_i − 1 consecutive values r_{i1} < r_{i2} < ... < r_{i,m_i−1} is generated and the parti-
tion B_i is formed by the m_i sets X_i ∩ R_{ij}, where R_{i1} = (−∞, r_{i1}), R_{i2} = (r_{i1}, r_{i2}),
..., R_{i,m_i−1} = (r_{i,m_i−2}, r_{i,m_i−1}), R_{i,m_i} = (r_{i,m_i−1}, +∞). Excellent results have been ob-
tained with the algorithms ChiMerge and Chi2 [5, 7], which employ the χ² statistic
to decide the position of the points r_{ij}, j = 1, ..., m_i − 1, and with the technique
EntMDL [6], which adopts entropy estimates to achieve the same goal. An alternative
and promising approach is offered by the method used in the LAD system [1]: in
this case an integer programming problem is solved to obtain optimal values for the
cutoffs r_{ij}.
By applying a procedure of this kind, the discretization task defines for each
ordered input x_i a mapping ψ_i : X_i → I_{m_i}, where ψ_i(z) = j if and only if z ∈ R_{ij}. If we
assume that ψ_i is the identity function with m_i = |X_i| when x_i is a nominal variable,
the approximating function ĝ is uniquely determined by a discrete function h : I →
{0, 1}, defined by h(ψ(x)) = ĝ(x), where I = ∏_{i=1}^{d} I_{m_i} and ψ(x) is the mapping
from X to I whose ith component is given by ψ_i(x_i).
By definition, the usual ordering relation is induced by ψ_i on I_{m_i} when x_i is an
ordered variable. On the other hand, since in general ψ_i is not 1-1, different choices
for the metric on I_{m_i} are possible. For example, if the actual distances on X_i must be
taken into account, the metric d_i(j, k) = |r_{ij} − r_{ik}| can be adopted for any j, k ∈ I_{m_i},
having set r_{i,m_i} = 2r_{i,m_i−1} − r_{i,m_i−2}. Alternative definitions employ the mean points of
the intervals R_{ik} or their lower boundaries.
According to statistical nonparametric inference methods, a valid choice can also
be to use the absolute metric d_a on I_{m_i}, without caring about the actual value of
the distances on X_i. This choice assumes that the discretization method has selected
the cutoffs r_{ij} correctly, sampling with greater density the regions of X_i where the
unknown function g changes more rapidly. In this way the metric d on I = ∏_{i=1}^{d} I_{m_i}
is given by

    d(v, v′) = Σ_{i=1}^{d} d_i(v_i, v′_i)

where d_i is the absolute metric d_a (resp. the flat metric d_f) if x_i is an ordered (resp.
nominal) input.
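A small sketch of this mixed metric (the function name is ours): the absolute metric |u − v| for ordered components and the flat 0/1 metric for nominal ones, summed over the d inputs:

```python
# Distance on I = I_m1 x ... x I_md: sum of per-component distances, using
# the absolute metric |v_i - v'_i| for ordered inputs and the flat metric
# (0 if equal, 1 otherwise) for nominal inputs.

def mixed_distance(v, w, ordered):
    """ordered[i] is True when input i is ordered, False when it is nominal."""
    total = 0
    for vi, wi, is_ordered in zip(v, w, ordered):
        total += abs(vi - wi) if is_ordered else int(vi != wi)
    return total
```

Note that a nominal component contributes at most 1 regardless of how far apart the two indexes are, since the integer labels carry no order information.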
4.2 Latticization
It can be easily observed that the function provides a mapping from the domain X
onto the set I = di=1 Imi , such that (x) = (x ) if x and x belong to the same set
B B, whereas (x) = (x ) if x B and x B , being B and B two different sets
in B. Consequently, the 1-1 function from X to Qln , required in the latticization
step, can be simply determined by defining a proper 1-1 function that maps the
elements of I into the binary strings of Qln . In this way, (x) = ( (x)) for every
x X.
A possible way of constructing the function β is to properly define d mappings
β_i : I_{m_i} → Q_{n_i}^{l_i}; then, the binary string β(v) for an integer vector v ∈ I is obtained
by concatenating the strings β_i(v_i) for i = 1, ..., d. With this approach, β(v) always
produces a binary string of length n = Σ_{i=1}^{d} n_i having l = Σ_{i=1}^{d} l_i values 1 inside it.
The mappings β_i can be built in a variety of different ways; however, it is im-
portant that they fulfill the following two basic constraints in order to simplify the
generation of an approximating function ĝ that generalizes well:
1. β_i must be an isometry, i.e. D_i(β_i(v_i), β_i(v′_i)) = d_i(v_i, v′_i), where D_i(·,·) is the
metric adopted on Q_{n_i}^{l_i} and d_i(·,·) is the distance on I_{m_i} (the absolute or the flat
metric depending on the type of the variable x_i);
2. if x_i is an ordered input, β_i must be full order-preserving, i.e. β_i(v_i) ⪯ β_i(v′_i) if
and only if v_i ≤ v′_i, where ⪯ is a (partial or total) ordering on Q_{n_i}^{l_i}.
A valid choice for the definition of ⪯ consists in adopting the lexicographic ordering
on Q_{n_i}^{l_i}, which amounts to asserting that z ≺ z′ if and only if z_k < z′_k for some k =
1, ..., n_i and z_j = z′_j for every j = 1, ..., k − 1. In this way ⪯ is a total ordering
on Q_{n_i}^{l_i}, and it can be easily seen that Q_{n_i}^{l_i} becomes isomorphic to I_m with
m = (n_i choose l_i), the binomial coefficient. As a
consequence the counter metric d_c can be induced on Q_{n_i}^{l_i}; this will be the definition
for the distance D_i when x_i is an ordered input.
36 E. Ferrari and M. Muselli
With these definitions, a mapping β_i that satisfies the two above properties (isometry
and full order preservation) is the inverse only-one code, which maps an integer v_i ∈
I_{m_i} into the binary string z_i ∈ Q_{m_i}^{m_i−1} having length m_i and jth component z_{ij} given by

    z_{ij} = 0 if v_i = j,  1 otherwise,   for every j = 1, ..., m_i

if x_i is a nominal input. Note that x_i ∈ R_{ij} if and only if x_i exceeds the cutoff r_{i,j−1}
(if j > 1) and is lower than the subsequent cutoff r_{ij} (if j < m_i).
Consequently, the mapping for each single input can be implemented by a simple device that re-
ceives in input the value x_i and compares it with a sequence of integers or real
numbers, according to definition (2) or (3), depending on whether x_i is an ordered
or a nominal input. This device will be called a latticizer; it produces m_i binary out-
puts, but only one of them can assume the value 0. The whole mapping φ is realized
by a parallel of d latticizers, each of which is associated with a different input x_i.
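As an illustrative sketch (the function names are ours, and the cutoff values below are assumptions chosen so as to reproduce the v_1, v_2 columns of the rolling-mill example in Table 1, not values given in the chapter), a latticizer composes the interval mapping ψ_i with the inverse only-one code β_i:

```python
from bisect import bisect_left

# psi: map an ordered input value to its interval index in {1, ..., m_i},
# given the m_i - 1 cutoffs r_i1 < ... < r_i,m_i-1.
def psi(x, cutoffs):
    return bisect_left(cutoffs, x) + 1

# beta: inverse only-one code, a string of m_i bits with a single 0
# placed at position v_i.
def beta(v, m):
    return "".join("0" if v == j else "1" for j in range(1, m + 1))

# phi = beta o psi, concatenated over all inputs (the parallel of d latticizers).
def phi(x, all_cutoffs):
    return "".join(beta(psi(xi, c), len(c) + 1) for xi, c in zip(x, all_cutoffs))

# Assumed cutoffs consistent with the discretized columns of Table 1.
cutoffs = [[1.5, 2.55], [3.0, 4.9, 6.4]]
```

For example, phi((2.06, 3.90), cutoffs) yields the 7-bit string "1011011", matching the fifth row of Table 1.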
5 Training Algorithm
Many algorithms are available in the literature to reconstruct a Boolean function start-
ing from a portion of its truth table. However, two drawbacks prevent the use of
such techniques for the current purpose: these methods usually deal with general
Boolean functions rather than positive ones, and they lack generalization ability.
In fact, the aim of most of these algorithms is to find a minimal set of implicants
which satisfies all the known input-output relations in the truth table. However, for
classification purposes, it is important to take into account the behavior of the gen-
erated function on examples not belonging to the training set. For this reason some
techniques [1, 4, 13] have been developed in order to maximize the generaliza-
tion ability of the resulting standard Boolean function. On the other hand, only one
method, named Shadow Clustering [14], is expressly devoted to the reconstruction
of positive Boolean functions.
In this section a novel constructive algorithm for building a single fy (denoted
only by f for simplicity) will be described. The procedure must be repeated for each
value of the output y in order to find an optimal classifier for the problem at hand.
In particular, if the function fy is built, the Boolean output 1 will be assigned to the
examples belonging to the class y, whereas the Boolean output 0 will be assigned to
all the remaining examples.
The architecture of the SNN has to be constructed starting from the converted
training set S′, containing s1 positive examples and s0 negative examples. Let us
suppose, without loss of generality, that the set S′ is ordered so that the first s1
examples are positive. Since the training algorithm sets up, for each output value,
the switches in the second layer of the SNN, the constructive procedure of adding
neurons step by step will be called Switch Programming (SP).
In particular, the objective function must take into account the number of examples in S′ covered
by a and the degree of complexity of a, usually defined as the number of elements
in P(a) or, equivalently, as the sum Σ_{i=1}^{m} a_i. These parameters are balanced in
the objective function through the definition of two weights λ and μ.
In order to define the constraints of the problem, define, for each example z_k, the
number ξ_k of indexes i for which a_i = 1 and z_{ki} = 0. It is easy to show that a covers
z_k if and only if ξ_k = 0. Then, the quantity

    Σ_{k=1}^{s1} θ(ξ_k),

where θ is the Heaviside function, gives the number of positive examples not
covered by a, and the implicant can be generated by solving the following problem:

    min_{ξ,a}  (λ/s1) Σ_{k=1}^{s1} ξ_k + (μ/m) Σ_{i=1}^{m} a_i

    subj. to   Σ_{i=1}^{m} a_i (a_i − z_{ki}) = ξ_k   for k = 1, ..., s1
               Σ_{i=1}^{m} a_i (a_i − z_{ki}) ≥ 1    for k = s1 + 1, ..., s        (4)
               ξ_k ≥ 0                                for k = 1, ..., s1
               a_i ∈ {0, 1}                           for i = 1, ..., m

where the Heaviside function θ has been substituted by its argument in order to avoid
nonlinearity in the cost function. Notice that the terms in the objective function are
normalized in order to be independent of the complexity of the problem at hand.
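For illustration only (the function name, the weight values and the tiny dataset below are ours, not from the chapter), problem (4) is a 0-1 program that can be solved exactly for small m by exhaustive search over the binary vectors a, which makes the trade-off between coverage and complexity easy to see:

```python
from itertools import product

# Exhaustive 0-1 solution of problem (4) for small m: minimize
# (lam/s1) * sum_k xi_k + (mu/m) * sum_i a_i, where xi_k counts the indexes
# with a_i = 1 and z_ki = 0, subject to every negative pattern being left
# uncovered (at least one such mismatch per negative example).

def switch_programming(positives, negatives, lam=1.0, mu=0.1):
    m, s1 = len(positives[0]), len(positives)
    best, best_cost = None, float("inf")
    for a in product((0, 1), repeat=m):
        mismatch = lambda z: sum(ai * (1 - zi) for ai, zi in zip(a, z))
        if any(mismatch(z) < 1 for z in negatives):
            continue  # the implicant would cover a negative pattern: infeasible
        cost = lam / s1 * sum(mismatch(z) for z in positives) + mu / m * sum(a)
        if cost < best_cost:
            best, best_cost = a, cost
    return best
```

On positives {(1,0,1), (1,1,1)} and negatives {(0,1,1), (0,0,1)} the feasible implicants must set a_1 = 1, and the cheapest one is (1, 0, 0), i.e. the single condition z_1 = 1.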
Since the determination of a sufficiently large collection of implicants, from
which the antichain A is selected, requires the repeated solution of problem (4), the
generation of an already found implicant must be avoided at every extraction. This
can be obtained by adding the following constraints:

    Σ_{i=1}^{m} a_{ji} (1 − a_i) ≥ 1   for j = 1, ..., q − 1        (5)

where a is the implicant to be constructed and a_1, ..., a_{q−1} are the q − 1 implicants
already found.
Additional requirements can be added to problem (4) in order to improve the
quality of the implicant a and the convergence speed. For example, in order to better
differentiate implicants and to cover all the patterns in fewer steps, the set S′1 of
positive patterns not yet covered can be considered separately and weighted by a
different factor λ′ ≠ λ.
Moreover, in the presence of noise it would be useful to avoid excessive adher-
ence of a to the training set by accepting a small fraction of errors.
In this case a further term is added to the objective function, measuring the level
of misclassification, and the constraints in (4) have to be modified in order to allow at
most δ·s0 patterns to be misclassified by the implicant a. In particular, slack vari-
ables ε_k, k = s1 + 1, ..., s, are introduced such that ε_k = 1 corresponds to a violated
constraint (i.e. to a negative pattern covered by the implicant). For this reason the
sum Σ_{k=s1+1}^{s} ε_k, which is just the number of misclassified patterns, must not exceed
δ·s0. If the training set is noisy, the optimal implicant can be found by solving the
following LP problem, where it is supposed that the first s′1 positive patterns are not
yet covered:

    min_{ξ,ε,a}  (λ′/s′1) Σ_{k=1}^{s′1} ξ_k + (λ/(s1 − s′1)) Σ_{k=s′1+1}^{s1} ξ_k
                 + (μ/m) Σ_{i=1}^{m} a_i + (η/s0) Σ_{k=s1+1}^{s} ε_k

    subj. to   Σ_{i=1}^{m} a_i (1 − z_{ki}) = ξ_k          for k = 1, ..., s1
               Σ_{i=1}^{m} a_i (1 − z_{ki}) ≥ 1 − ε_k      for k = s1 + 1, ..., s        (6)
               Σ_{k=s1+1}^{s} ε_k ≤ δ·s0,   a_i ∈ {0, 1}   for i = 1, ..., m
               ξ_k ≥ 0 for k = 1, ..., s1,   ε_k ∈ {0, 1}  for k = s1 + 1, ..., s
Notice that further requirements have to be imposed when dealing with real-
world problems. In fact, due to the coding (2) or (3) adopted in the latticization
phase, only some implicants correspond to a condition consistent with the original
inputs. In particular at least one zero must be present in the substring relative to each
input variable.
Fig. 3 A greedy method for converting a continuous solution of problem (4) into a binary
vector.
The algorithm for the conversion of the continuous solution into a binary one is
shown in Fig. 3. The method starts by setting a_i = 0 for each i; then the a_i corre-
sponding to the highest value of (a_c)_i is set to 1. The procedure is repeated, check-
ing at each iteration whether the constraints (4) are satisfied. When no constraint is
violated the procedure is stopped; smart lattice descent techniques [14] may be
adopted to further reduce the number of active bits in the implicant.
These methods are based on the definition of proper criteria for the choice of the
bit to be set to zero. Of course, when the implicant does not satisfy all the constraints,
the bit is set to one again, the algorithm is stopped, and the resulting implicant is
added to the antichain A. The same approach may be employed in the pruning
phase, too.
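A minimal sketch of this greedy conversion, following our reading of Fig. 3 (variable names are ours):

```python
# Greedy conversion (after Fig. 3): start from a = 0, switch bits on in
# decreasing order of the continuous LP solution a_c, and stop as soon as
# every negative pattern violates the implicant, i.e. the constraints of
# problem (4) on the negative examples are all satisfied.

def greedy_binarize(a_c, negatives):
    m = len(a_c)
    a = [0] * m
    for i in sorted(range(m), key=lambda j: -a_c[j]):
        a[i] = 1
        if all(sum(aj * (1 - zj) for aj, zj in zip(a, z)) >= 1 for z in negatives):
            break  # all negative-example constraints satisfied
    return tuple(a)
```

With a_c = [0.9, 0.1, 0.4] and negatives {(0,1,1), (0,0,1)}, the largest component alone already leaves both negatives uncovered, so the result is (1, 0, 0); with a_c = [0.2, 0.9, 0.5] all three bits are needed.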
The approximate version of the SP algorithm, obtained by employing the greedy
procedure for transforming the continuous solution of each LP problem involved in
the method into a binary one, is named Approximate Switch Programming (ASP).
Thus, it follows that every implicant a gives rise to an if-then rule, having in its if
part a conjunction of the conditions obtained from the substrings of a associated with
the d inputs x_i. If all these conditions are verified, the output y = g(x) will be
assigned the value 1.
Due to this property, SP and ASP (with the addition of discretization and latticiza-
tion) become rule generation methods, being capable of retrieving from the training
set some kind of intelligible information about the physical system underlying the
binary classification problem at hand.
Table 1 The original and transformed dataset for the problem of controlling the quality of a
layer produced by a rolling mill. x1 and x2 are the original values for Pressure and Speed, v1
and v2 are the discretized values, whereas z is the binary string obtained through the latticizer.
The quality of the resulting layer is specified by the value of the Boolean variable y.
x1 x2 v1 v2 z y
0.62 0.65 1 1 0110111 1
1.00 1.95 1 1 0110111 1
1.31 2.47 1 1 0110111 1
1.75 1.82 2 1 1010111 1
2.06 3.90 2 2 1011011 1
2.50 4.94 2 3 1011101 1
2.62 2.34 3 1 1100111 1
2.75 1.04 3 1 1100111 1
3.12 3.90 3 2 1101011 1
3.50 4.94 3 3 1011110 1
0.25 5.20 1 3 0111101 0
0.87 6.01 1 3 0111101 0
0.94 4.87 1 2 0111011 0
1.87 4.06 1 2 0111011 0
1.25 8.12 1 4 0111110 0
1.56 6.82 1 4 0111110 0
1.87 8.75 2 4 1011110 0
2.25 8.12 2 4 1011110 0
2.50 7.15 2 4 1011110 0
2.81 9.42 3 4 1101110 0
scoring a very low value of the objective function. Nevertheless, the first three ex-
amples are still to be covered, so the SP algorithm must be iterated.
The constraint (5) thus has to be added:

    (1 − a_1) + (1 − a_7) ≥ 1

and the first three examples, constituting the set S′1, must be considered separately in
the cost function (6).
A second execution of the SP algorithm generates the implicant 0000111, which
covers 6 examples among which are the ones not yet covered. Therefore the an-
tichain A = {1000001, 0000111}, corresponding to the PDNF f (z) = z1 z7 + z5 z6 z7 ,
correctly describes all the positive examples. It is also minimal since the pruning
phase cannot eliminate any implicant.
In a similar way an antichain is generated for the output class labelled by 0, thus
producing the second layer of the SNN.
A possible choice for the weight uh to be associated with the h-th neuron is given
by its covering, i.e. the fraction of examples covered by it. For example, the weights
associated with the neurons for the class 1 may be u1 = 0.7 for 1000001 and u2 = 0.6
for 0000111.
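Continuing the worked example with these two weights (the helper names are ours), the second layer for class 1 and its contribution to the output decider can be sketched as:

```python
# Second layer for class 1 of the worked example: implicants 1000001 and
# 0000111 with weights 0.7 and 0.6. Each AND port fires when every position
# selected by its implicant carries a 1; u_h = w_h when the port fires,
# 0 otherwise, and the class score is the sum of the u_h.

IMPLICANTS = [("1000001", 0.7), ("0000111", 0.6)]

def class_score(z):
    return sum(w for a, w in IMPLICANTS
               if all(zb == "1" for ab, zb in zip(a, z) if ab == "1"))
```

For the strings of Table 1, class_score("1010111") = 1.3 (both ports fire), class_score("0110111") = 0.6, while a negative string such as "0111101" scores 0; the output decider would then compare these sums against the scores of the class-0 antichain.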
8 Simulation Results
To obtain a preliminary evaluation of the performance achieved by SNNs trained
with SP or ASP, the classification problems included in the well-known StatLog
benchmark [9] have been considered. In this way the generalization ability and the
complexity of the resulting SNNs can be compared with those of other machine learning
methods, among which are the backpropagation algorithm (BP) and rule generation
techniques based on decision trees, such as C4.5 [15].
All the experiments have been carried out on a personal computer with an Intel
Core 2 Quad Q6600 (CPU 2.40 GHz, RAM 3 GB) running under the Windows XP
operating system.
The tests contained in the StatLog benchmark present different characteristics
which allow the evaluation of different peculiarities of the proposed methods. In par-
ticular, four problems (Heart, Australian, Diabetes, German) have a binary output;
two of them (Heart and German) are clinical datasets presenting a specific weight
matrix which aims to reduce the number of misclassifications on ill patients. The re-
maining datasets present 3 (Dna), 4 (Vehicle) or 7 (Segment, Satimage, Shuttle) out-
put classes. In some experiments, the results are obtained through a cross-validation
test; however, in the presence of a large amount of data, a single trial is performed
since the time for many executions may be excessive.
The generalization ability of each technique is evaluated through the level of
misclassification on a set of examples not belonging to the training set; on the other
hand, the complexity of an SNN is measured using the number of AND ports in
the second layer (corresponding to the number of intelligible rules) and the average
number of conditions in the if part of a rule. Tab. 2 presents the results obtained on
the datasets, reported in increasing order of complexity. Accuracy and complexity
of resulting SNNs are compared to those of rulesets produced by C4.5. In the same
table is also shown the best generalization error included in the StatLog report [9]
for each problem, together with the rank scored by SNN when its generalization
error is inserted into the list of available results.
The performances of the different techniques for training an SNN depend on the
characteristics of the different problems. In particular the SP algorithm scores a bet-
ter level of accuracy with respect to ASP in the datasets Heart and Australian. In
fact, these problems are characterized by a small amount of data so that the execu-
tion of the optimal minimization algorithm may obtain a good set of rules within a
reasonable execution time.
On the other hand, the misclassification of ASP is lower than that of SP in all
the other problems (except for Shuttle), which are composed of a greater amount of
data.

Table 2 Generalization error of SNN, compared with C4.5, BP and other methods, on the
StatLog benchmark.

The decrease in the performance of SP in the presence of huge datasets is due
to the fact that simplifications in the LP problem are necessary in order to make it
solvable within a reasonable period of time. For example, some problems may be
solved by setting the two noise-related weights to zero, since taking into account the
possible presence of noise gives rise to an excessive number of constraints in (6).
Notice that in one case (Shuttle), SP and ASP achieve the best results among
the methods in the StatLog archive, whereas in four other problems ASP achieves
one of the first five positions. However, ASP is in the first ten positions in all the
problems except for Vehicle.
Moreover, a comparison of the other methods reported in Tab. 2 with the best
version of SNN for each problem illustrates that:
- Only in one case (Vehicle) is the classification accuracy achieved by C4.5 higher
than that of SNN; in two problems (Satimage and Segment) the performances are
similar, whereas in all the other datasets SNN scores significantly better results.
- In two cases (Vehicle and Satimage), BP achieves better results with respect to
SNN; in all the other problems the performances of SNN are significantly better
than those of BP.
These considerations highlight the good quality of the solutions offered by the
SNNs, trained by the SP or ASP algorithm.
Nevertheless, the performances obtained by SP are conditioned by the number
of examples s in the training set and by the number of input variables d. Since the
number of constraints in (4) or (6) depends linearly on s, SP becomes slower and
less efficient when dealing with complex training sets. In particular, the number of
implicants generated by SP is in many cases higher than the number of rules obtained
by C4.5, causing an increase in the training time.
However, a smart combination of the standard optimization techniques with the
greedy algorithm in Sec. 5.3 may allow complex datasets to be handled very ef-
ficiently. In fact, the execution of ASP requires at most three minutes for each of
the first six problems, about twenty minutes for the Dna dataset and
about two hours for Satimage and Shuttle.
Notice that the minimization of (4) or (6) is obtained using the GNU
Linear Programming Kit (GLPK) [8], a free library for the solution of linear pro-
gramming problems. It is thus possible to improve the above results by adopting
more efficient tools to solve the LP problems involved in the generation of implicants.
Concluding Remarks
In this paper a general schema for constructive methods has been presented and em-
ployed to train a Switching Neural Network (SNN), a novel connectionist model for
the solution of classification problems. According to the SNN approach, the input-
output pairs included in the training set are mapped to Boolean strings according to
a proper transformation which preserves ordering and distance. These new binary
examples can be viewed as a portion of the truth table of a positive Boolean function
f , which can be reconstructed using a suitable algorithm for logic synthesis.
To this aim a specific method, named Switch Programming (SP), for reconstruct-
ing positive Boolean functions from examples has been presented. SP is based on
the definition of a proper integer linear programming problem, which can be solved
with standard optimization techniques. However, since the treatment of complex
training sets with SP may require an excessive computational cost, a greedy ver-
sion, named Approximate Switch Programming (ASP), has been proposed to reduce
the execution time needed for training SNNs.
The algorithms SP and ASP have been tested by analyzing the quality of the
SNNs produced when solving the classification problems included in the StatLog
archive. The results obtained show the good accuracy of classifiers trained with SP
and ASP. In particular, ASP turns out to be very convenient from a computational
point of view.
Acknowledgement. This work was partially supported by the Italian MIUR project Labo-
ratory of Interdisciplinary Technologies in Bioinformatics (LITBIO).
References
1. Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation
of Logical Analysis of Data. IEEE Trans. Knowledge and Data Eng. 9, 292–306 (2004)
2. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition.
Springer, Heidelberg (1996)
3. Ferrari, E., Muselli, M.: A constructive technique based on linear programming for training
switching neural networks. In: Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN
2008, Part II. LNCS, vol. 5164, pp. 744–753. Springer, Heidelberg (2008)
4. Hong, S.J.: R-MINI: An iterative approach for generating minimal rules from examples.
IEEE Trans. Knowledge and Data Eng. 9, 709–717 (1997)
5. Kerber, R.: ChiMerge: Discretization of numeric attributes. In: Proc. 9th Intl. Conf. on
Art. Intell., pp. 123–128 (1992)
6. Kohavi, R., Sahami, M.: Error-based and entropy-based discretization of continuous
features. In: Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining, pp. 114–119
(1996)
7. Liu, H., Setiono, R.: Feature selection via discretization. IEEE Trans. on Knowledge and
Data Eng. 9, 642–645 (1997)
8. Makhorin, A.: GNU Linear Programming Kit – Reference Manual (2008),
http://www.gnu.org/software/glpk/
9. Michie, D., Spiegelhalter, D., Taylor, C.: Machine Learning, Neural and Statistical Clas-
sification. Ellis-Horwood, London (1994)
10. Muselli, M.: Sequential constructive techniques. In: Leondes, C. (ed.) Optimization
Techniques, Neural Network Systems Techniques and Applications, pp. 81–144. Academic
Press, San Diego (1998)
11. Muselli, M.: Approximation properties of positive Boolean functions. In: Apolloni, B.,
Marinaro, M., Nicosia, G., Tagliaferri, R. (eds.) WIRN 2005 and NAIS 2005. LNCS,
vol. 3931, pp. 18–22. Springer, Heidelberg (2006)
12. Muselli, M.: Switching neural networks: A new connectionist model for classification.
In: Apolloni, B., Marinaro, M., Nicosia, G., Tagliaferri, R. (eds.) WIRN/NAIS 2005.
LNCS, vol. 3931, pp. 23–30. Springer, Heidelberg (2006)
13. Muselli, M., Liberati, D.: Binary rule generation via Hamming Clustering. IEEE Trans.
Knowledge and Data Eng. 14, 1258–1268 (2002)
14. Muselli, M., Quarati, A.: Reconstructing positive Boolean functions with Shadow Clus-
tering. In: Proc. 17th European Conf. Circuit Theory and Design, pp. 377–380 (2005)
15. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco
(1994)
16. Reed, R.: Pruning algorithms – a survey. IEEE Trans. on Neural Netw. 4, 740–747 (1993)
17. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating
errors. Nature 323, 533–536 (1986)
18. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
Constructive Neural Network Algorithms That
Solve Highly Non-separable Problems
Abstract. Learning from data with complex non-local relations and multimodal
class distributions is still very hard for standard classification algorithms. Even if
an accurate solution is found, the resulting model may be too complex for the given
data and will not generalize well. New types of learning algorithms are needed to ex-
tend the capabilities of machine learning systems to handle such data. Projection pursuit
methods can avoid the curse of dimensionality by discovering interesting structures in
low-dimensional subspaces. This paper introduces constructive neural architectures
based on projection pursuit techniques that are able to discover the simplest models
of data with inherent highly complex logical structure. The key principle is to look
for transformations that discover interesting structures, going beyond error functions
and separability.
1 Introduction
Popular statistical and machine learning methods that rely solely on the assump-
tion of local similarity between instances (equivalent to a smoothness prior) suffer
from the curse of dimensionality [2]. When high-dimensional functions are not suf-
ficiently smooth, learning becomes very hard unless an extremely large number of train-
ing samples is provided. This leads to a dramatic increase in the cost of computations
and creates complex models which are hard to interpret. Many data mining prob-
lems in bioinformatics, text analysis and other areas have inherent complex logic.
Searching for the simplest possible model capable of representing that kind of data
is still a great challenge that has not been fully addressed.
Marek Grochowski and Wodzisaw Duch
Department of Informatics, Nicolaus Copernicus University, Torun, Poland
e-mail: grochu@is.umk.pl, Google: WDuch
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 49–70.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
50 M. Grochowski and W. Duch
One of the simplest examples of such hard problems is the n-bit parity problem.
It has a very simple solution (one neuron with all weights w_i = 1 implementing a
periodic transfer function with a single parameter [3]), but popular kernel methods
and algorithms that depend only on similarity relations, or only on discrimination,
have great difficulty learning this function. Linear methods fail completely,
because this problem is highly non-separable. Gaussian-based kernels in SVMs use
all training vectors as support vectors, because in the case of the parity function all
points have their closest neighbors in the opposite class. The nearest neighbor algorithms
(with the number of neighbors smaller than 2^n) and the RBF networks have the same
problem. For multilayer perceptrons convergence is almost impossible to achieve
and requires many initializations to find an accurate solution. Special feedforward neural
network architectures have been proposed to handle parity problems [16, 30, 31,
28, 21], but they are designed only for this function and cannot be used for other
Boolean functions, even those very similar to parity.
Learning systems are frequently tested on benchmark datasets that are almost
linearly separable and relatively simple to handle, but without strong prior knowl-
edge it is very hard to find a satisfactory solution for really complex problems. One
can estimate how complex a given dataset is using the k-separability index introduced
in [3]. Consider a dataset X = {x_1, ..., x_n} ⊂ R^d, where each vector x_i belongs to
one of two classes; roughly speaking, X is k-separable if a linear projection y = w·x
exists for which the projected points form k consecutive intervals, each containing
vectors of a single class only.
For example, datasets with two classes that can be separated by a single hyperplane
have k = 2 and are thus 2-separable. The XOR problem belongs to the 3-separable cat-
egory, as projections have at least three clusters that contain even, odd and even in-
stances (or odd, even and odd instances). n-bit parity problems are (n + 1)-separable,
because a linear projection of binary strings exists that forms at least n + 1 separated
alternating clusters of vectors for odd and even cases. Please note that this is equiv-
alent to a linear model with n parallel hyperplanes, or a nearest-prototype model
in one dimension (along the line) with n + 1 prototypes. This may be implemented
as a Learning Vector Quantization (LVQ) model [19] with strong regularization. In
both cases n linear parameters define the direction w, and n parameters define thresh-
olds placed on the y line (in the case of prototypes they are placed between thresholds,
except for those on the extreme left and extreme right, placed on the other side of the
threshold at the same distance as the last prototype), so the whole model has 2n
parameters.
It is obvious that the complexity of data classification is proportional to the k-
separability index, although for some datasets additional non-linear transformations
are needed to avoid overlaps of projected clusters. For high values of k learning
becomes very difficult, and most classifiers, based on Multi-Layer Perceptron
(MLP), Radial Basis Function (RBF) network, or Support Vector Machine (SVM)
data models, as well as almost all other systems, are not able to discover sim-
ple data models. Linear projections are the simplest transformations with an easy
interpretation. For many complicated situations proper linear mapping can discover
interesting structures. Local and non-local distributions of data points can be clus-
tered and discriminated by hyperplanes. Often a simple linear mapping exists that
leaves only trivial non-linearities that may be separated using neurons that imple-
ment a window-like transfer function:

    M(x; w, a, b) = 1 if w·x ∈ [a, b],  0 if w·x ∉ [a, b]        (1)
This function is suitable for learning all 3-separable data (including XOR). The
number of such Boolean functions for 3 or more bits is much greater than the number
of linearly separable functions [3]. For data with separability index k > 3 this
will not give an optimal solution, but it will still be simpler than a solution con-
structed using hyperplanes. There are many advantages of using window-type func-
tions in neural networks, especially in difficult, highly non-separable classification
problems [7]. One of the most interesting learning algorithms in the field of learn-
ing Boolean functions is the constructive neural network with Sequential Window
Learning (SWL), an algorithm described by Muselli [26]. This network also uses a
window-like transfer function, and in comparison with other constructive methods
[26] it outperforms similar methods with threshold neurons, leading to models with
lower complexity, higher speed and better generalization [12]. SWL works only for
binary data, and therefore some pre-processing is needed to use it for different kinds
of problems.
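For instance, a single window unit of the form (1) handles XOR; in this sketch the projection direction and interval are chosen by hand (names are ours), not produced by any training procedure:

```python
# Window-like transfer function (Eq. 1): the neuron fires when the
# projection w.x falls inside the interval [a, b].

def window_neuron(x, w, a, b):
    y = sum(wi * xi for wi, xi in zip(w, x))
    return int(a <= y <= b)

# XOR is 3-separable: projecting on w = (1, 1) maps the four points to
# 0, 1, 1, 2, and the middle cluster [0.5, 1.5] is exactly the positive class.
xor_out = {x: window_neuron(x, (1, 1), 0.5, 1.5)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

A threshold neuron would need a hidden layer here, while the window unit separates the middle cluster with a single pair of parameters a, b on the projection line.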
The k-separability idea is a good guiding principle that facilitates the search for
transformations after which non-separable data distributions become easy
to handle. In the next section a constructive network is presented that uses window-
like transfer functions to distinguish clusters created by linear projections. Many
methods that search for optimal and most informative linear transformations have
been developed. Projection pursuit is a branch of statistical methods that search for
interesting data transformations by maximizing some index of interest [18, 10].
First, a c3sep network that has nodes designed to discover 3-separable structures is
presented and tested on learning Boolean functions and some benchmark classifi-
cation problems. Second, a new Quality of Projected Clusters index designed to
discover k-separable structures is introduced and applied to visualization of data in
low-dimensional spaces. Constructive networks described in section 3 use this in-
dex with the projection pursuit methodology for construction of an accurate neural
QPCNN architecture for solving complex problems. The paper ends with a discus-
sion and conclusion.
In the brain neurons are organized in cortical column microcircuits [22] that resonate
with certain frequencies whenever they receive specific signals. Threshold
neurons split the input space in two disjoint regions, with a hyperplane defined by the
w direction. For highly non-separable data searching for linear separation is useless,
while finding interesting clusters for projected data, corresponding to an active
microcircuit, is more likely. Therefore network nodes should implement window-like
transfer functions (Eq. 1) that solve 3-separable problems by a combination of projection
and clustering, separating some (preferably large) number of instances from a
single class in the [a, b] interval. This simple transformation may handle not only local
neighborhoods, as Gaussian functions in SVM kernels or RBF networks do, but
also non-local distributions of data that typically appear in Boolean problems.
For example, in the n-bit parity problem projection on the [1, 1, ..., 1] direction creates
several large clusters with vectors that contain a fixed number of 1 bits.
Optimization methods based on gradient descent used in the error backpropagation
algorithm require continuous and smooth functions, therefore soft window-type
functions should be used in the training phase. Good candidate functions
include a combination of two sigmoidal functions:
\[ M(\mathbf{x}; \mathbf{w}, a, b, \beta) = \sigma(\beta(\mathbf{w}\cdot\mathbf{x} - a)) - \sigma(\beta(\mathbf{w}\cdot\mathbf{x} - b)) . \tag{2} \]
For a < b an equivalent product form, called the bicentral function [5], is:
\[ M(\mathbf{x}; \mathbf{w}, a, b, \beta) = \sigma(\beta(\mathbf{w}\cdot\mathbf{x} - a))\,\sigma(-\beta(\mathbf{w}\cdot\mathbf{x} - b)) . \tag{3} \]
The parameter β controls the slope of the sigmoidal functions and can be adjusted during
training together with the weights w and the biases a and b. The bicentral function (3) has values
in the range [0, 1], while function (2) for b < a may become negative, giving values
in the [-1, +1] range. This property may be useful in constructive networks for
unlearning instances, misclassified by previous hidden nodes. Another interesting
window-type function is:
\[ M(\mathbf{x}; \mathbf{w}, a, b, \beta) = \frac{1}{2}\left[ 1 - \tanh(\beta(\mathbf{w}\cdot\mathbf{x} - a))\,\tanh(\beta(\mathbf{w}\cdot\mathbf{x} - b)) \right] . \tag{4} \]
This function has one interesting feature: for points w·x = a or w·x = b, and for
any value of the slope β, it is equal to 1/2. By setting a large value of β the hard-window type
function (1) is obtained:
\[ M(\mathbf{x}; \mathbf{w}, a, b, \beta) \xrightarrow{\;\beta \to \infty\;} M(\mathbf{x}; \mathbf{w}, a', b') , \tag{5} \]
where for function (4) the boundaries of the [a, b] interval do not change (a' = a and
b' = b), while for the bicentral function (3) the value of β has influence on the interval
boundaries, so for β → ∞ they are different than [a, b]. Another way to achieve
sharp decision boundaries is by introduction of an additional threshold function and
parameter:
Many other types of transfer functions can be used for practical realization of
3-separable models. For a detailed taxonomy of neural transfer functions see [7, 8].
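The soft window functions above can be sketched directly; here Eq. (2) is taken to be the difference of two logistic sigmoids (an assumption consistent with the surrounding text), and the tanh-based window of Eq. (4) is checked for its defining property that it equals 1/2 exactly at the interval boundaries for any slope:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_window_diff(p, a, b, beta):
    """Difference of two sigmoids, the form Eq. (2) presumably takes; ~1 inside [a, b]."""
    return sigmoid(beta * (p - a)) - sigmoid(beta * (p - b))

def soft_window_tanh(p, a, b, beta):
    """The tanh window of Eq. (4); equals 1/2 at p = a and p = b for any slope beta."""
    return 0.5 * (1.0 - math.tanh(beta * (p - a)) * math.tanh(beta * (p - b)))

# At the interval boundaries Eq. (4) is 1/2 regardless of the slope beta...
for beta in (0.5, 2.0, 50.0):
    assert abs(soft_window_tanh(1.0, 1.0, 3.0, beta) - 0.5) < 1e-12
    assert abs(soft_window_tanh(3.0, 1.0, 3.0, beta) - 0.5) < 1e-12

# ...and for large beta it approaches the hard window of Eq. (1).
assert soft_window_tanh(2.0, 1.0, 3.0, 50.0) > 0.999   # inside [a, b]
assert soft_window_tanh(5.0, 1.0, 3.0, 50.0) < 0.001   # outside [a, b]
```

Here p stands for the already computed projection w·x, so the same functions apply to any direction w.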
where λi controls the tradeoff between the covering and the quality of the solution
after the new hidden node Mi(x) is added. For nodes implementing w·x projections
54 M. Grochowski and W. Duch
(or any other functions with outputs restricted to the [ai, bi] interval) the largest pure cluster
will give the lowest contribution to the error, lowering the first term while keeping
the second one equal to zero. If such a cluster is rather small it may be worthwhile to
create a slightly bigger, but not quite pure, one to lower the first term at the expense
of the second. Usually a single λ parameter is taken for all nodes, although each
parameter could be individually optimized to reduce the number of misclassifications.
The current version of the c3sep constructive network assumes binary 0/1 class
labels, and uses the standard mean square error (MSE) measure with two additional
terms:
\[ E(\mathbf{x}; \mathbf{w}, \lambda_1, \lambda_2) = \frac{1}{2}\sum_{\mathbf{x}} \big(y(\mathbf{x}; \mathbf{w}) - c(\mathbf{x})\big)^2 + \lambda_1 \sum_{\mathbf{x}} (1 - c(\mathbf{x}))\, y(\mathbf{x}; \mathbf{w}) - \lambda_2 \sum_{\mathbf{x}} c(\mathbf{x})\, y(\mathbf{x}; \mathbf{w}) . \tag{9} \]
The term scaled by λ1 represents an additional penalty for unclean clusters,
increasing the total error for vectors from class 0 that fall into at least one interval
created by hidden nodes. The term scaled by λ2 represents a reward for large clusters,
decreasing the value of the total error for every vector that belongs to class 1 and was
correctly placed inside the created clusters.
In most cases this should provide a good starting point for optimization with gradient
based methods.
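The modified MSE of Eq. (9), as reconstructed from the penalty/reward description above, can be sketched as follows; y(x) here is taken to be the already computed network output for each training vector:

```python
def c3sep_error(ys, cs, lam1, lam2):
    """Modified MSE of Eq. (9): ys are network outputs y(x), cs are 0/1 class labels c(x)."""
    mse    = 0.5 * sum((y - c) ** 2 for y, c in zip(ys, cs))
    impure = lam1 * sum((1 - c) * y for y, c in zip(ys, cs))  # class-0 vectors caught in a window
    reward = lam2 * sum(c * y for y, c in zip(ys, cs))        # class-1 vectors correctly covered
    return mse + impure - reward

# A class-0 vector falling inside a window (y = 1 while c = 0) is penalised both by
# the MSE term and by the lambda1 term, so the impure solution scores strictly worse:
labels  = [1, 1, 0, 0]
perfect = [1, 1, 0, 0]
leaky   = [1, 1, 1, 0]
assert c3sep_error(perfect, labels, 0.5, 0.2) < c3sep_error(leaky, labels, 0.5, 0.2)
```

The λ2 reward makes the perfect covering score below zero, which is harmless since only relative error values matter during gradient descent.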
Network construction proceeds in a greedy manner. The first node is trained to
separate as large a group of class 1 vectors as possible. After convergence is reached the
slope of the transfer function is set to a large value to obtain a hard-windowed function,
and the weights of this neuron are kept frozen during further learning. Samples from
class 1 correctly handled by the network do not contribute to the error, and can be
removed from the training data to further speed up learning (however, leaving them
in may stabilize learning, giving a chance to form more large clusters). After that, the
next node is added and the training is repeated on the remaining data, until all
vectors are correctly handled. To avoid overfitting, one may use pruning techniques, as
is done in decision trees. The network construction should be stopped when
the number of cases correctly classified by a new node becomes too small, or when
crossvalidation tests show that adding such a node will decrease generalization.
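The greedy construction can be sketched on a toy problem. This version replaces the gradient-based node training described above with two simplifications made for illustration only: the projection direction is fixed in advance, and each node's window is found by exhaustive search over pure intervals:

```python
from itertools import product

def train_c3sep_greedy(X, y, w):
    """Greedy c3sep-style construction along a fixed projection direction w.

    Each node is a hard window [a, b] covering the largest pure group of the
    remaining class-1 vectors; covered vectors are removed and the next node
    is added until every class-1 vector is handled.
    """
    proj = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    nodes, remaining = [], list(zip(X, y))
    while any(t == 1 for _, t in remaining):
        values = sorted({proj(x) for x, t in remaining if t == 1})
        best, best_cover = None, -1
        for a in values:                      # exhaustive search over candidate
            for b in values:                  # intervals spanned by class-1 values
                if a > b:
                    continue
                inside = [(x, t) for x, t in remaining if a <= proj(x) <= b]
                if all(t == 1 for _, t in inside) and len(inside) > best_cover:
                    best, best_cover = (a, b), len(inside)
        if best is None:                      # no pure interval exists along w
            break
        a, b = best
        nodes.append((a, b))                  # freeze this node's window
        remaining = [(x, t) for x, t in remaining
                     if not (t == 1 and a <= proj(x) <= b)]
    return nodes

# 4-bit parity along w = (1, 1, 1, 1): the pure clusters sit at projections 1 and 3.
X = list(product([0, 1], repeat=4))
y = [sum(x) % 2 for x in X]
nodes = train_c3sep_greedy(X, y, (1, 1, 1, 1))
predict = lambda x: int(any(a <= sum(x) <= b for a, b in nodes))
assert all(predict(x) == t for x, t in zip(X, y))
assert len(nodes) == 2
```

Two window nodes suffice for 4-bit parity, where a threshold-unit network would need four hidden neurons.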
Experiments
[Figures: average number of neurons and average CPU time [sec.] versus the number
of features (2 to 10) for the c3sep, Carve, Oil Spot, Irregular Partitioning,
Sequential Window and Target Switch algorithms.]
will overfit the data. Therefore the ability to find simple, but approximate, solutions
is very useful. One should expect such approximate models to be more
robust than perfect models if the training is carried out on a slightly different subset of
examples for a given Boolean function.
Irregular Partitioning produces small networks, but its training time is very high,
while on the other hand the fastest method (Oil Spot) needs many neurons. Sequential
Window Learning gave solutions with a small number of neurons and rather low
computational cost. The c3sep network was able to create the smallest architectures, but
its average computation times are somewhat longer than needed by most other
algorithms. This network provides only near-optimal solutions, as not all patterns were
correctly classified.
(and thus more adaptive parameters are used by standard MLP networks and other
algorithms), the results of most methods on binary data are significantly worse,
particularly in the case of Iris and Glass, where all features in the original data are real
valued.
In all these tests the c3sep network gave very good accuracy with low variance, better
than or statistically equivalent to the best solutions, with a very small number of neurons
created in the hidden layer. The ability to solve complex problems in an approximate
way is evidently helpful also for the relatively simple data used here, showing the
universality of constructive c3sep networks.
Projection pursuit (PP) is a generic name given to all algorithms that search for
the most interesting linear projections of multidimensional data, maximizing (or
minimizing) some objective functions or indices [11, 10]. Many projection pursuit
indices may be defined to characterize different aspects or structures that the data
may contain. Modern statistical dimensionality reduction approaches, such as
principal component analysis (PCA), Fisher's discriminant analysis (FDA) or
independent component analysis (ICA), may be seen as special cases of the projection
pursuit approach. Additional directions may be generated in the space orthogonalized
to the already found directions.
PP indices may be introduced both for unsupervised and for supervised learning.
By working in a low-dimensional space based on linear projections, projection pursuit
methods are able to avoid the curse of dimensionality caused by the fact that
high-dimensional space is mostly empty [15]. In this way noisy and non-informative
variables may be ignored. In contrast to most similarity-based methods that optimize
metric functions to capture local clusters, projection pursuit may also discover
non-local structures. Not only global, but also local extrema of the PP index are of
interest and may help to discover interesting data structures.
A large class of PP constructive networks may be defined, where each hidden
node is trained by optimization of some projection pursuit index. In essence the
hidden layer defines a transformation of data to a low-dimensional space based on a
sequence of projections. This transformation is then followed by linear discrimination
in the output layer. PCA, FDA or ICA networks are equivalent to linear discrimination
on suitably pre-processed components. In the next section a more interesting
index, in the spirit of k-separability, is defined.
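The per-vector term Q(x; w) of Eq. (11), judging by the description that follows, rewards same-class vectors that project close to x and penalizes opposite-class vectors that do so; a plausible reconstruction consistent with that description, in which the weighting constants A+ and A- are assumptions, is:

```latex
Q(\mathbf{x}; \mathbf{w}) = A^{+} \sum_{\mathbf{x}_k \in C_{\mathbf{x}}} G\big(\mathbf{w}\cdot(\mathbf{x}-\mathbf{x}_k)\big)
 \; - \; A^{-} \sum_{\mathbf{x}_k \notin C_{\mathbf{x}}} G\big(\mathbf{w}\cdot(\mathbf{x}-\mathbf{x}_k)\big) \tag{11}
```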
where G(x) is a function with localized support and maximum at x = 0, for example
a Gaussian function. The first term in the Q(x; w) function is large if all vectors
from class Cx are placed close to x after the projection on the direction defined by w,
indicating how compact and how large this cluster of vectors is. The second term
depends on the distance between x and all patterns that do not belong to class Cx,
therefore it represents a penalty for placing vector x too close to vectors from the opposite
classes. The Quality of Projected Clusters (QPC) index is defined as an average of
Q(x; w) over all vectors:
the Q(x; w) for all vectors:
1
QPC(w) = Q(x; w) ,
n xX
(12)
Improvements in speed may be achieved if the sum in Eq. (12) is restricted only to a
few centers ti of projected clusters. This may be done after the projection w stabilizes,
as at the beginning of the training the number of clusters in the projection is not
known without some prior knowledge about the problem. For k-separable datasets
k centers are sufficient and the cost of computing the QPC index drops to O(kn).
Gradient descent methods may be replaced by more sophisticated approaches [14, 20],
although in practice multistart gradient methods have been quite effective in searching
for interesting projections. It is worth noticing that although the global extrema of
the QPC index give the most valuable projections, suboptimal solutions may also provide
useful insight into the structure of data.
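A minimal sketch of the full O(n²) QPC evaluation follows; the Gaussian width sigma and the unit term weights are assumptions, standing in for whatever constants the original index uses:

```python
import math

def qpc_index(X, labels, w, sigma=1.0):
    """QPC(w) of Eq. (12): average over all vectors of a Gaussian-weighted reward
    for same-class neighbours minus a penalty for opposite-class neighbours after
    projection on w.  Costs O(n^2); restricting the inner sum to a few cluster
    centers, as discussed in the text, would reduce this to O(kn)."""
    proj = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    G = lambda d: math.exp(-(d * d) / (2.0 * sigma * sigma))  # localized, max at 0
    n = len(X)
    total = 0.0
    for i in range(n):
        for k in range(n):
            if i == k:
                continue
            g = G(proj[i] - proj[k])
            total += g if labels[i] == labels[k] else -g
    return total / n

# A direction that separates the two classes scores higher than one that mixes them.
X = [(0, 0), (0, 1), (3, 0), (3, 1)]
labels = [0, 0, 1, 1]
assert qpc_index(X, labels, (1, 0)) > qpc_index(X, labels, (0, 1))
```

Maximizing this function over w by multistart gradient ascent recovers projections of the kind shown in Fig. 4.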
[Figure 4 panel annotations: Wine, QPC = 0.6947; Monk1, QPC = 0.2410, with the
corresponding weight vectors w and class labels shown in each panel.]
Fig. 4 Examples of four projections found by maximization of the QPC index using gradient
descent for the Wine data (top-left), the Monk1 problem (top-right), the 10-bit Parity (bottom-
left) and the noisy Concentric Rings (bottom-right).
4 classes, each with 200 samples defined by 4 continuous features. Only the first
and the second features are relevant; vectors belonging to the same class are located
inside one of the 4 concentric rings. The last two noisy features are uniformly
distributed random numbers. For this dataset the best projection that maximizes the
QPC index reduces the influence of the noisy features, with weights for dimensions 3 and
4 close to zero. This shows that the QPC index may be used for feature selection,
but also that linear projections have limited power: a complicated solution requiring
many projections at different angles to delineate the rings is needed. Of course a
much simpler network using localized functions will solve this problem more
accurately. The need for networks with different types of transfer functions [7, 9] has
been stressed some time ago, but still there are no programs capable of finding the
simplest data models in all cases.
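The noisy Concentric Rings data described above is easy to regenerate; the radius band assigned to each ring below is an assumption, since the text specifies only that same-class vectors lie inside one of 4 concentric rings:

```python
import math
import random

def concentric_rings(samples_per_class=200, classes=4, seed=0):
    """Noisy Concentric Rings data as described in the text: 4 classes, 4 features,
    only the first two (the ring position) relevant, the last two uniform noise."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(classes):
        r_lo, r_hi = c + 0.5, c + 1.0            # assumed radius band of ring c
        for _ in range(samples_per_class):
            r = rng.uniform(r_lo, r_hi)
            phi = rng.uniform(0.0, 2.0 * math.pi)
            X.append((r * math.cos(phi), r * math.sin(phi),
                      rng.uniform(-1.0, 1.0),    # noisy feature 3
                      rng.uniform(-1.0, 1.0)))   # noisy feature 4
            y.append(c)
    return X, y

X, y = concentric_rings()
assert len(X) == 800 and len(set(y)) == 4
# The class is fully determined by the radius in the first two features:
assert all(c + 0.5 - 1e-9 <= math.hypot(x[0], x[1]) <= c + 1.0 + 1e-9
           for x, c in zip(X, y))
```

No linear projection separates these classes, which is exactly why the dataset is useful for probing the limits of projection-based indices.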
The value of f(w, w1) should be large if the current direction w is close to the
previous direction w1. For example, some power of the scalar product between these
directions may be used: f(w, w1) = (w1^T w)^2. The parameter λ scales the importance
of enforcing this condition during the optimization process.
Scatterplots of data vectors projected on two directions may be used for visual-
ization. Figure 5 presents such scatterplots for the four datasets used in the previous
section. The second direction w, found by gradient descent optimization of function
(14) with = 0.5, is used for the horizontal axis. The final weights of the second
[Figure 5 panel annotations: Wine, QPC = 0.5254, w1·w2 = 0.0942; Monk1,
QPC = 0.1573, w1·w2 = 0.0002.]
Fig. 5 Scatterplots created by projection on two QPC directions for the Wine and Monk1
data (top-left/right), 10-bit parity and the noisy Concentric Rings data (bottom-left/right).
direction, the value of the projection index QPC(w) and the inner product of w1 and
w are shown in the corresponding figures. For the Wine problem the first projection
was able to separate almost perfectly all three classes. The second projection (Fig. 5)
gives additional insight into the structure of this data, leading to a better separation
of vectors placed near the decision boundary. The two-dimensional projection of Monk1
data shows separate and compact clusters. The 5th feature (which forms the second
rule describing this dataset) has a significant value, and all unimportant features
have weights almost equal to zero, allowing for simple extraction of correct logical
rules. In the case of the 10-bit parity problem each diagonal direction of the hypercube
representing the Boolean function gives a good solution with a large cluster in the
center. Two such orthogonal directions have been found, projecting each data vector
into a large pure cluster, either in the first or in the second dimension. In particular,
small one- or two-vector clusters at the extreme ends of the first projection belong
to the largest clusters in the second direction, ensuring good generalization in this
two-dimensional space using naive Bayes estimation of classification probabilities.
Results for the noisy Concentric Rings dataset show that maximization of the QPC
index has caused vanishing of the noisy and uninformative features, and has been able to
discover the two-dimensional relations hidden in this data. Although linear projections
in two directions cannot separate this data, such dimensionality reduction is sufficient
for any similarity-based method, for example the nearest neighbor method, to
solve this problem perfectly.
A single projection allows for estimation and drawing of class-conditional and
posterior probabilities, but may not be sufficient for optimal classification. Projections
on 2 or 3 dimensions allow for visualization of scatterograms, showing data
structures hidden in the high-dimensional distributions and suggesting how to handle the
problem in the simplest way: adding a linear output layer (Wine), employing localized
functions, decision trees or covering algorithms, using intervals (parity) or naive
Bayes, or using the nearest neighbor rule (Concentric Rings). If this is not sufficient,
more projections should be used as pre-processing for final classification, trying
different approaches in a meta-learning scheme [6].
Coefficients of the projection vectors may be used directly for feature ranking
and selection, because maximization of the QPC index gives negligible weights to
noisy or insignificant features, while important attributes have distinctly larger
values. This method might be used to improve learning for many machine learning
models sensitive to feature weighting, such as all similarity-based methods.
Interesting projections may also be used to initialize weights in various neural network
architectures.
To build networks based on the QPC index the general sequential constructive
method may be used [26]. For two-class problems this method is described as follows:
1. start learning with an empty hidden layer;
2. if there are some misclassified vectors do:
3. add a new node;
4. train the node to obtain a partial classifier;
5. remove all vectors for which the current node outputs +1;
6. enddo.
A partial classifier is a node with output +1 for at least one vector from one of
the classes, and -1 for all vectors from the opposite classes. After a finite number
of iterations this procedure leads to the construction of a neural network that classifies
all training vectors (unless there are conflicting cases, i.e. identical vectors with
different labels, that should be removed). Weights in the output layer do not take part
in the learning phase and their values can be determined from a simple algebraic
equation, assigning the largest weight to the node created first, and progressively
smaller weights to subsequently created nodes, for example:
\[ u_0 = \sum_{j=1}^{h} u_j + d_{h+1} , \qquad u_j = d_j\, 2^{h-j} \quad \text{for } j = 1, \ldots, h \tag{15} \]
where h is the number of hidden neurons, d_i ∈ {-1, +1} denotes the label for which the
i-th hidden node gives output +1, and d_{h+1} = d_h.
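The geometric weighting u_j = d_j 2^{h-j} of Eq. (15) makes each hidden node outweigh the combined contribution of all nodes created after it, so the first node giving output +1 effectively decides the class, exactly as in a decision list; a quick check of this dominance property:

```python
# Magnitudes of the output weights of Eq. (15) for h hidden nodes.
h = 8
u = [2 ** (h - j) for j in range(1, h + 1)]

# Each weight strictly dominates the sum of all later (smaller) weights,
# so the network acts as a decision list ordered by node creation time.
for j in range(h):
    assert u[j] > sum(u[j + 1:])
```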
The sequential constructive method critically depends on the construction of a good
partial classifier. A method to create it is described below. Consider a node M
implementing the following function:
\[ M(\mathbf{x}) = \begin{cases} +1 & \text{if } G(\mathbf{w}\cdot(\mathbf{x} - \mathbf{t})) - \theta \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{16} \]
where the weights w are obtained by maximization of the QPC index, θ is a threshold
parameter which determines the window width, and t is the center of a cluster
of vectors projected on w, estimated by:
to optimize (both the weights and the center are adjusted), but the computational cost
of the calculations here is linear in the number of vectors, O(n), and since only a few
iterations are needed this part of learning is quite fast.
If all vectors for which the trained node gives output +1 have the same label then
this node is a good partial classifier and the sequential constructive method described
above can be used directly for network construction. However, for some datasets
linear projections cannot create pure clusters, as for example in the Concentric Rings
case. Creation of a partial classifier may then be done by searching for additional
directions by optimization of the Q(t; w) function (11) with respect to the weights w and
the center t, restricted to the subset of vectors that fall into the impure cluster. The
resulting direction and center define the next network node according to Eq. 16. If this
node is not pure, that is, it provides +1 output for vectors from more than one class,
then more nodes are required. This leads to the creation of a sequence of neurons
\{M_i\}_{i=1}^{K}, where the last neuron M_K separates some subset of training vectors
without mistakes. Then the following function:
\[ M(\mathbf{x}) = \begin{cases} +1 & \text{if } \frac{1}{K}\sum_{i=1}^{K} M_i(\mathbf{x}) - \frac{1}{2} > 0 \\ -1 & \text{otherwise} \end{cases} \tag{19} \]
is a partial classifier. In a neural network the function of Eq. 19 is realized by a group of
neurons M_i placed in the first hidden layer and connected to a threshold node M in
the second hidden layer with weights equal to 1/K and bias -1/2. This approach has been
implemented and the test results are reported below.
QPCNN Tests
Table 3 presents a comparison of the results of the nearest neighbor (1-NN), naive Bayes
classifier, support vector machine (SVM) with Gaussian kernel, the c3sep network
described in this article, and the constructive network based on the QPC index
(QPCNN). 9 datasets from the UCI repository [1] have been used in 10-fold cross-
validation to test the generalization capabilities of these systems. For the SVM classifier
the parameters σ and C have always been optimized using an inner 10-fold
crossvalidation procedure, and those that produced the lowest error have been used to
learn the model on the whole training data.
Most of these datasets are relatively simple and require networks with only a few
neurons in the hidden layer. Both the c3sep and the QPCNN networks achieve good
accuracy, in most cases comparable with the 1-NN, naive Bayes and SVM algorithms.
General constructive sequence learning in its original formulation applied to QPCNN
may lead to overfitting. This effect has occurred for Glass and Pima-diabetes, where
the average size of the created networks is higher than in the case of the c3sep network,
while the average accuracy is lower. To overcome this problem a proper stopping
criterion for growing the network should be considered, e.g. by tracking test error
changes estimated on a validation subset.
Table 3 Average classification accuracy for the 10-fold crossvalidation test. Results are averaged
over 10 trials. For SVM the average number of support vectors (#SV) and for neural networks
the average number of neurons (#N) are reported.
complex logical problems. This network is able to discover simple models for difficult
Boolean functions and also works well for real benchmark problems.
Many other variants of constructive networks based on the guiding principles
that may be implemented using projection pursuit indices are possible. From
Fig. 5 it is evident that an optimal model should use transformations that discover
important features, followed in the reduced space by a specific approach, depending
on the character of the given data. The class of PP networks is quite broad. One can
implement various transformations in the hidden layer, explicitly creating hidden
representations that are used as new inputs for further network layers, or used for
initialization of standard networks. Brains are capable of deep learning, with many
specific transformations that lead from simple contour detection to final invariant
object recognition. Studying linear and non-linear projection pursuit networks will
be most fruitful in combination with meta-learning techniques, searching for the
simplest data models in low-dimensional spaces after an initial PP transformation.
This approach should bring us a bit closer to the powerful methods required for deep
learning and for discovering hidden knowledge in complex data.
References
1. Asuncion, A., Newman, D.: UCI repository of machine learning databases (2007),
http://www.ics.uci.edu/mlearn/MLRepository.html
2. Bengio, Y., Delalleau, O., Roux, N.L.: The curse of dimensionality for local kernel machines.
Technical Report 1258, Département d'informatique et recherche opérationnelle,
Université de Montréal (2005)
3. Duch, W.: K-separability. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.)
ICANN 2006. LNCS, vol. 4131, pp. 188–197. Springer, Heidelberg (2006)
4. Duch, W.: Towards comprehensive foundations of computational intelligence. In: Duch,
W., Mandziuk, J. (eds.) Challenges for Computational Intelligence, vol. 63, pp. 261–316.
Springer, Heidelberg (2007)
5. Duch, W., Adamczak, R., Grabczewski, K.: A new methodology of extraction, optimization
and application of crisp and fuzzy logical rules. IEEE Transactions on Neural
Networks 12, 277–306 (2001)
6. Duch, W., Grudzinski, K.: Meta-learning via search combined with parameter optimization.
In: Rutkowski, L., Kacprzyk, J. (eds.) Advances in Soft Computing, pp. 13–22.
Physica Verlag, Springer, New York (2002)
7. Duch, W., Jankowski, N.: Survey of neural transfer functions. Neural Computing
Surveys 2, 163–213 (1999)
8. Duch, W., Jankowski, N.: Taxonomy of neural transfer functions. In: International Joint
Conference on Neural Networks, Como, Italy, vol. III, pp. 477–484. IEEE Press, Los
Alamitos (2000)
9. Duch, W., Jankowski, N.: Transfer functions: hidden possibilities for better neural networks.
In: 9th European Symposium on Artificial Neural Networks, Brussels, Belgium,
pp. 81–94. De-facto publications (2001)
10. Friedman, J.H.: Exploratory projection pursuit. Journal of the American Statistical
Association 82, 249–266 (1987)
11. Friedman, J.H., Tukey, J.W.: A projection pursuit algorithm for exploratory data analysis.
IEEE Trans. Comput. 23(9), 881–890 (1974)
12. Grochowski, M., Duch, W.: A comparison of methods for learning of highly non-separable
problems. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M.
(eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 566–577. Springer, Heidelberg (2008)
13. Grochowski, M., Jankowski, N.: Comparison of instance selection algorithms II. Results
and comments. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.)
ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 580–585. Springer, Heidelberg (2004)
14. Haykin, S.: Neural Networks: A Comprehensive Foundation. Maxwell MacMillan Int.,
New York (1994)
15. Huber, P.J.: Projection pursuit. Annals of Statistics 13, 435–475 (1985),
http://www.stat.rutgers.edu/rebecka/Stat687/huber.pdf
16. Iyoda, E.M., Nobuhara, H., Hirota, K.: A solution for the n-bit parity problem using a
single translated multiplicative neuron. Neural Processing Letters 18(3), 233–238 (2003)
17. Jankowski, N., Grochowski, M.: Comparison of instance selection algorithms I. Algorithms
survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.)
ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
18. Jones, C., Sibson, R.: What is projection pursuit? Journal of the Royal Statistical Society
A 150, 1–36 (1987)
19. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
20. Kordos, M., Duch, W.: Variable step search MLP training method. International Journal
of Information Technology and Intelligent Computing 1, 45–56 (2006)
21. Liu, D., Hohil, M., Smith, S.: N-bit parity neural networks: new solutions based on linear
programming. Neurocomputing 48, 477–488 (2002)
22. Maass, W., Markram, H.: Theory of the computational function of microcircuit dynamics.
In: Grillner, S., Graybiel, A.M. (eds.) Microcircuits: The Interface between Neurons
and Global Brain Function, pp. 371–392. MIT Press, Cambridge (2006)
23. Marchand, M., Golea, M.: On learning simple neural concepts: from halfspace intersections
to neural decision lists. Network: Computation in Neural Systems 4, 67–85 (1993)
24. Mascioli, F.M.F., Martinelli, G.: A constructive algorithm for binary neural networks:
The oil-spot algorithm. IEEE Transactions on Neural Networks 6(3), 794–797 (1995)
25. Muselli, M.: On sequential construction of binary neural networks. IEEE Transactions
on Neural Networks 6(3), 678–690 (1995)
26. Muselli, M.: Sequential constructive techniques. In: Leondes, C. (ed.) Optimization
Techniques, Neural Network Systems, Techniques and Applications, vol. 2, pp. 81–144.
Academic Press, San Diego (1998)
27. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes: The
Art of Scientific Computing. Cambridge University Press, Cambridge (2007)
28. Sahami, M.: Learning non-linearly separable Boolean functions with linear threshold unit
trees and Madaline-style networks. In: National Conference on Artificial Intelligence, pp.
335–341 (1993)
29. Shakhnarovich, G., Darrell, T., Indyk, P. (eds.): Nearest-Neighbor Methods in Learning
and Vision. MIT Press, Cambridge (2005)
30. Stork, D.G., Allen, J.: How to solve the n-bit parity problem with two hidden units.
Neural Networks 5, 923–926 (1992)
31. Wilamowski, B., Hunter, D.: Solving parity-n problems with feedforward neural network.
In: Proc. of the Int. Joint Conf. on Neural Networks (IJCNN 2003), vol. I, pp.
2546–2551. IEEE Computer Society Press, Los Alamitos (2003)
32. Young, S., Downs, T.: Improvements and extensions to the constructive algorithm CARVE.
In: Vorbrüggen, J.C., von Seelen, W., Sendhoff, B. (eds.) ICANN 1996. LNCS, vol. 1112,
pp. 513–518. Springer, Heidelberg (1996)
33. Zollner, R., Schmitz, H.J., Wunsch, F., Krey, U.: Fast generating algorithm for a general
three-layer perceptron. Neural Networks 5(5), 771–777 (1992)
On Constructing Threshold Networks for
Pattern Classification
Martin Anthony
1 Introduction
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 71–82.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
72 M. Anthony
if the underlying directed graph is acyclic (that is, it has no directed cycles). This
feed-forward condition means that the units (both the input units and the threshold
units) can be labeled with integers in such a way that if there is a connection from
the unit labeled i to the unit labeled j then i < j. In any feed-forward network, the
units may be grouped into layers, labeled 0, 1, 2, ..., ℓ, in such a way that the input
units form layer 0, these feed into the threshold units, and if there is a connection
from a threshold unit in layer r to a threshold unit in layer s, then we must have
s > r. Note, in particular, that there are no connections between any two units in a
given layer. The top layer consists of the output units. The layers that are not inputs or
outputs are called hidden layers.
We will be primarily interested in linear threshold networks having just one hidden
layer, and it is useful to give an explicit description in this case of the functionality
of the network. Such a network will consist of n inputs and some number, k,
of threshold units in a single hidden layer, together with one output threshold unit.
Each threshold unit computes a threshold function of the n inputs. The (binary-valued)
outputs of these hidden nodes are then used as the inputs to the output node,
which calculates a threshold function of these. Thus, the threshold network computes
a threshold function of the outputs of the k threshold functions computed by
the hidden nodes. If the threshold function computed by the output node is described
by the weight vector α ∈ R^k and threshold θ, and the threshold function computed by
hidden node i is f_i, with weight vector w^{(i)} and threshold θ^{(i)}, then the threshold
network as a whole computes the function f : \{0, 1\}^n \to \{0, 1\} given by
\[ f(\mathbf{y}) = 1 \iff \sum_{i=1}^{k} \alpha_i f_i(\mathbf{y}) \geq \theta ; \]
that is,
\[ f(y_1 y_2 \ldots y_n) = \mathrm{sgn}\left( \sum_{i=1}^{k} \alpha_i \, \mathrm{sgn}\left( \sum_{j=1}^{n} w_j^{(i)} y_j - \theta^{(i)} \right) - \theta \right) , \]
where sgn(x) = 1 if x ≥ 0 and sgn(x) = 0 if x < 0. The state ω of the network is the
(concatenated) vector of all the weights and thresholds,
\[ \omega = \left( w^{(1)}, \ldots, w^{(k)}, \theta^{(1)}, \ldots, \theta^{(k)}, \alpha, \theta \right) \in \mathbb{R}^{nk+2k+1} . \]
A fixed network architecture of this type (that is, fixing n and k) computes a parameterised
set of functions \{ f_\omega : \omega \in \mathbb{R}^{nk+2k+1} \}. In state ω, the network computes the
function f_\omega : \{0, 1\}^n \to \{0, 1\}.
Any Boolean function can be expressed by a disjunctive normal formula (or DNF),
using literals u_1, u_2, \ldots, u_n, \bar{u}_1, \ldots, \bar{u}_n, where the \bar{u}_i are known as negated literals. A
disjunctive normal formula is one of the form
\[ T_1 \vee T_2 \vee \cdots \vee T_k , \]
such as u_1 u_4 \vee u_1 u_2 u_3 \vee u_1 u_3 u_4.
Suppose that (T, F) is a partially-defined Boolean function and that the Boolean
function f is some extension of (T, F), meaning that f (x) = 1 for x T and f (x) =
0 for x F. Let be a DNF formula
for f.
Suppose
= T1 T2 Tk , where
each Ti is a term of the form Ti = jPi u j jNi j , for some disjoint subsets
u
Pi , Ni of {1, 2, . . . , n}. We form a network with k hidden units, one corresponding to
each term of the DNF. Labelling these threshold units 1, 2, . . . , k, we set the weight
vector w^{(i)} from the inputs to hidden threshold unit i to correspond directly to T_i, in
the sense that w_j^{(i)} = 1 if j ∈ P_i, w_j^{(i)} = −1 if j ∈ N_i, and w_j^{(i)} = 0 otherwise. We take
the threshold θ^{(i)} on hidden unit i to be |P_i|. We set the weight on the connection
between each hidden threshold unit and the output unit to be 1, and the threshold
on the output unit to be 1/2. That is, we set α to be the all-1 vector of dimension
k, and set the threshold θ to be 1/2. It is clear that hidden threshold unit i outputs 1
on input x precisely when x satisfies the term Ti , and that the output unit computes
the or of all the outputs of the hidden units. Thus, the output of the network is the
disjunction of the terms Ti , and hence equals f .
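The construction just described translates mechanically into code. In this sketch (names ours), a DNF is given as a list of pairs of index sets (P_i, N_i), written 0-based, and we return the network parameters described in the text:

```python
def dnf_to_network(terms, n):
    """Build the network of the text from a DNF given as (P_i, N_i) pairs.

    Each term T_i is the conjunction of the unnegated literals u_j (j in P_i)
    and the negated literals (j in N_i). Returns (W, thetas, alpha, theta).
    """
    W, thetas = [], []
    for P, N in terms:
        w = [0] * n
        for j in P:
            w[j] = 1        # weight +1 for an unnegated literal
        for j in N:
            w[j] = -1       # weight -1 for a negated literal
        W.append(w)
        thetas.append(len(P))   # hidden threshold theta^(i) = |P_i|
    k = len(terms)
    return W, thetas, [1] * k, 0.5   # output: all-1 weights, threshold 1/2
```

For instance, the equality function on two variables, u_0 u_1 ∨ ū_0 ū_1, yields a network with two hidden units that outputs 1 exactly on inputs 00 and 11.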
On Constructing Threshold Networks for Pattern Classification 75
Note that this does not describe a unique threshold network representing the pdBf
(T, F), for there may be many choices of extension function f and, given f , there
may be many possible choices of DNF for f. In the case in which T ∪ F = {0,1}^n,
so that the function is fully defined, we could, for the sake of definiteness, use the
particular DNF formula described above, the disjunction of all prime implicants.
In general, a simple counting argument establishes that, whatever method is be-
ing used to represent Boolean functions by threshold networks, for most Boolean
functions a high number of units will be required in the resulting network. Explic-
itly, suppose we have an n-input threshold network with one output and one hidden
layer comprising k threshold units. Then, since the number of threshold functions on
{0,1}^n is at most 2^{n²} (see [1, 3], for instance), the network computes no more than (2^{n²})^{k+1}
different Boolean functions, this being an upper bound on the number of possible
mappings from the input set {0,1}^n to the vector of outputs of all the k + 1 threshold
units. This bound, 2^{n²(k+1)}, is, for any fixed k, a tiny proportion of all the 2^{2^n} Boolean
functions and, to be comparable, we need k = Ω(2^n/n²). (This is a very quick and
easy observation. For more detailed bounds on the sizes of threshold networks re-
quired to compute general and specific Boolean functions, see [10], for instance.)
It is easy to give an explicit example of a function for which this standard method
produces an exponentially large threshold network. The parity function f on {0, 1}n
is given by f(x) = 1 if and only if x has an odd number of ones. It is well known that
any DNF formula φ for f must have 2^{n−1} terms. To see this, note first that each term
of φ must have degree n. For, suppose some term T_i contained fewer than n literals,
and that neither u_j nor ū_j were present in T_i. Then there are x, y ∈ {0,1}^n which are
true points of T_i, but which differ only in position j. Then, since T_i is a term in the
DNF representation of the parity function f, we would have f(x) = f(y) = 1. But
this cannot be: one of x, y will have an odd number of entries equal to 1, and one
will have an even number of such entries. It follows that each term must contain n
literals, in which case each term has only one true point, and so we must have 2^{n−1}
distinct terms, one for each true point. It follows that the resulting network has 2^{n−1}
threshold units in the hidden layer.
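The count 2^{n−1} can be confirmed directly for small n, since the argument above puts the terms of any DNF for parity in bijection with the true points (an illustrative sketch, names ours):

```python
from itertools import product

def parity_dnf_true_points(n):
    """Each term of a parity DNF has exactly one true point, so the terms
    correspond one-to-one to the inputs of odd weight."""
    return [x for x in product([0, 1], repeat=n) if sum(x) % 2 == 1]
```

Here len(parity_dnf_true_points(n)) equals 2**(n - 1), the number of hidden units the standard construction needs.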
evaluate f1 (y). If f1 (y) = 1, we assign the value c1 to f (y); if not, we evaluate f2 (y),
and if f2 (y) = 1 we set f (y) = c2 , otherwise we evaluate f3 (y), and so on. If y fails
to satisfy any fi then f (y) is given the default value 0. The evaluation of a decision
list f can therefore be thought of as a sequence of "if ... then ... else" commands,
as follows:
if f1 (y) = 1 then set f (y) = c1
else if f2 (y) = 1 then set f (y) = c2
...
else if fr (y) = 1 then set f (y) = cr
else set f (y) = 0.
We define DL(G), the class of decision lists based on G, to be the set of finite
sequences
f = ( f1 , c1 ), ( f2 , c2 ), . . . , ( fr , cr )
such that f_i ∈ G, c_i ∈ {0,1} for 1 ≤ i ≤ r. The values of f are defined by f(y) = c_j
where j = min{i : f_i(y) = 1}, or 0 if there is no j such that f_j(y) = 1. We call each
f j a test (or, following Krause [6], a query) and the pair ( f j , c j ) is called a term of
the decision list.
We now consider the class of decision lists in which the tests are threshold functions.
We shall call such decision lists threshold decision lists, but they have also been
called neural decision lists [7] and linear decision lists [13]. Formally, a threshold
decision list
f = ( f1 , c1 ), ( f2 , c2 ), . . . , ( fr , cr )
has each f_i : R^n → {0,1} of the form f_i(x) = sgn(⟨w_i, x⟩ − θ_i) for some w_i ∈ R^n and
θ_i ∈ R. The value of f on y ∈ R^n is f(y) = c_j if j = min{i : f_i(y) = 1} exists, or 0
otherwise (that is, if there is no j such that f_j(y) = 1).
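Evaluating a threshold decision list is a first-test-fires scan; a minimal sketch (names ours; each term is a triple (w_i, theta_i, c_i)):

```python
def threshold_decision_list(y, terms):
    """Return c_j for the first test with w_j . y >= theta_j, default 0."""
    for w, theta, c in terms:
        if sum(wi * yi for wi, yi in zip(w, y)) >= theta:
            return c
    return 0
```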
work.) The points that have been chopped off can then be removed from consider-
ation and the procedure iterated until no points remain. In general, we would hope
to be able to separate off more than one point at each stage, but the argument given
above establishes that, at each stage, at least one point can indeed be chopped off,
so since the set of points is finite, the procedure does indeed terminate.
We may regard the chopping procedure as a means of constructing a threshold
decision list consistent with the data set. If, at stage i of the procedure, the hyper-
plane with equation ∑_{i=1}^{n} α_i y_i = β chops off points all having label j, with these
points in the half-space with equation ∑_{i=1}^{n} α_i y_i ≥ β, then we take as the ith term
of the threshold decision list the pair (f_i, j), where f_i ← [α, β]. Therefore, given
any partially-defined Boolean function (T, F), there will always be some threshold
decision list representing the pdBf.
one entry equal to 1. All of these are positive points, and the hyperplane y_1 + y_2 +
⋯ + y_n = 3/2 will separate them from the other points. These points are then deleted
from consideration. We can continue in this way. The procedure iterates n times,
and at stage i in the procedure we chop off all data points having precisely (i − 1)
ones, by using the hyperplane y_1 + y_2 + ⋯ + y_n = i − 1/2, for example. (These
hyperplanes happen to be parallel here, but such parallelism is not possible in general.) So we can
represent the parity function by a threshold decision list with n terms. By contrast,
Jeroslow's method requires 2^{n−1} iterations, since at each stage it can only chop
off one positive point: that is, it produces a disjunction of threshold functions (or a
special type of threshold decision list) with an exponential number of terms.
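The chopping construction for parity can be written out explicitly: level m of the cube (inputs with exactly m ones) is cut off by the hyperplane y_1 + ⋯ + y_n = m + 1/2, and the corresponding decision-list term outputs m mod 2. A sketch (names ours; we list all n + 1 levels explicitly, whereas the text leaves the last level to the default output when its label is 0):

```python
def parity_decision_list(n):
    """Threshold decision list for parity via the chopping procedure:
    the test for level m is 'sum(y) <= m', written in threshold form as
    -sum(y) >= -(m + 1/2), with output m mod 2 (odd weight -> 1)."""
    return [([-1] * n, -(m + 0.5), m % 2) for m in range(n + 1)]

def evaluate(y, dl):
    # first-test-fires evaluation with default output 0
    for w, theta, c in dl:
        if sum(wi * yi for wi, yi in zip(w, y)) >= theta:
            return c
    return 0
```

For a point of weight s, the first test to fire is the one with m = s, so the output is s mod 2, i.e. the parity of y.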
3.5 Algorithmics
We now show how we can make use of the chopping procedure to find a threshold
network representing a given Boolean function by giving an explicit way in which
a threshold decision list can be represented by a threshold network with one hidden
layer.
Theorem 1. Suppose that the threshold decision list

f = (f_1, c_1), (f_2, c_2), …, (f_k, c_k)

has each test f_i ← [w^{(i)}, θ^{(i)}]. Form the threshold network with k hidden units in
which hidden unit i computes f_i, the weight on the connection from hidden unit i to
the output unit is α_i = 2^{k−i}(2c_i − 1), and the output threshold is θ = 1. Then f_ω, the
function computed by the network in state ω, equals f.
Proof: We prove the result by induction on k, the length of the decision list (and
number of hidden threshold units in the network).
The base case is k = 1. Since the default output of any decision list is 0, we may
assume that f takes the form f = (f_1, 1), where f_1 ← [w, θ] for some w ∈ R^n and
θ ∈ R. Then α is the single number 2^{1−1}(2c_1 − 1) = 1, and the output threshold is 1. So

f_ω(y_1 y_2 … y_n) = sgn( sgn( ∑_{j=1}^{n} w_j y_j − θ ) − 1 ) = sgn( ∑_{j=1}^{n} w_j y_j − θ ) = f_1(y_1 y_2 … y_n),

so f_ω = f_1 = f.
Now suppose that the result is true for threshold decision lists of length k, where
k 1. Consider a threshold decision list
f = (f_1, c_1), (f_2, c_2), …, (f_k, c_k), (f_{k+1}, c_{k+1}),

and let g be the threshold decision list of length k given by

g = (f_2, c_2), …, (f_k, c_k), (f_{k+1}, c_{k+1}).

The network constructed from f computes

f_ω(y) = sgn(F(y)),

where

F(y) = ∑_{i=1}^{k+1} 2^{k+1−i}(2c_i − 1) f_i(y) − 1.

Now,

F(y) = 2^k(2c_1 − 1) f_1(y) + ∑_{i=2}^{k+1} 2^{k+1−i}(2c_i − 1) f_i(y) − 1
     = 2^k(2c_1 − 1) f_1(y) + ∑_{i=1}^{k} 2^{k−i}(2c_{i+1} − 1) f_{i+1}(y) − 1
     = 2^k(2c_1 − 1) f_1(y) + G(y),

where G is the corresponding function for the network constructed from g.
Now, suppose f_1(y) = 0. In this case, by the way in which decision lists are
defined to operate, we should have f_ω(y) = g(y). This is indeed the case, since
then F(y) = G(y), and so, by the induction hypothesis, sgn(F(y)) = sgn(G(y)) = g(y).
Suppose now that f_1(y) = 1. In this case we have f(y) = c_1 and so we need to verify
that sgn(F(y)) = c_1. We have

−2^k = −∑_{i=1}^{k} 2^{k−i} − 1 ≤ G(y) ≤ ∑_{i=1}^{k} 2^{k−i} − 1 = 2^k − 2.

So if c_1 = 1 then F(y) = 2^k + G(y) ≥ 0 and sgn(F(y)) = 1, while if c_1 = 0 then
F(y) = −2^k + G(y) ≤ −2 < 0 and sgn(F(y)) = 0. In either case sgn(F(y)) = c_1,
completing the induction.
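Theorem 1 is easy to check computationally: convert a threshold decision list into output weights α_i = 2^{k−i}(2c_i − 1) with output threshold 1, and compare the two functions point by point. A sketch (names ours; terms are triples (w_i, theta_i, c_i)):

```python
def decision_list_to_network(terms):
    """Theorem 1 construction: hidden unit i computes test f_i; output
    weight alpha_i = 2^(k-i) * (2 c_i - 1); output threshold 1."""
    k = len(terms)
    tests = [(w, theta) for w, theta, _ in terms]
    alpha = [2 ** (k - i) * (2 * c - 1)
             for i, (_, _, c) in enumerate(terms, start=1)]
    return tests, alpha, 1
```

An exhaustive comparison over {0,1}^n for small n confirms the two representations agree.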
Applying Theorem 1 now to this threshold decision list would give a threshold net-
work representing f . That network would have exactly the same structure as the one
obtained by using the standard DNF-based method, using the DNF formula φ. (How-
ever, the weights from the hidden layer to the output would be different, with expo-
nentially decreasing, rather than constant, values.) What this demonstrates is that, in
particular, there is always a threshold decision list representation whose length is no
more than that of any given DNF representation of the function. There may, as in the
case of parity, be a significantly shorter threshold decision list. So the decision list
approach (and application of Theorem 1) will, for any function (or partially-defined
function), give a network that is at worst no larger than that obtained by the
standard method.
6 Conclusions
from previous constructions which have also been based on iterative linear sepa-
ration, in that the networks constructed have only one hidden layer. Furthermore, it
can always produce a network that is no larger than that which follows from the stan-
dard translation from a Boolean function's disjunctive normal form into a threshold
network.
References
1. Anthony, M.: Discrete Mathematics of Neural Networks: Selected Topics. Society for
Industrial and Applied Mathematics, Philadelphia (2001)
2. Anthony, M.: On data classification by iterative linear partitioning. Discrete Applied
Mathematics 144(1-2), 2–16 (2004)
3. Cover, T.M.: Geometrical and Statistical Properties of Systems of Linear Inequalities
with Applications in Pattern Recognition. IEEE Trans. on Electronic Computers EC-14,
326–334 (1965)
4. Hammer, P.L., Ibaraki, T., Peled, U.N.: Threshold numbers and threshold completions.
Annals of Discrete Mathematics 11, 125–145 (1981)
5. Jeroslow, R.G.: On defining sets of vertices of the hypercube by linear inequalities. Dis-
crete Mathematics 11, 119–124 (1975)
6. Krause, M.: On the computational power of Boolean decision lists. In: Alt, H., Ferreira,
A. (eds.) STACS 2002. LNCS, vol. 2285, pp. 372–383. Springer, Heidelberg (2002)
7. Marchand, M., Golea, M.: On Learning Simple Neural Concepts: from Halfspace Inter-
sections to Neural Decision Lists. Network: Computation in Neural Systems 4, 67–85
(1993)
8. Marchand, M., Golea, M., Ruján, P.: A convergence theorem for sequential learning in
two-layer perceptrons. Europhys. Lett. 11, 487 (1990)
9. Rivest, R.L.: Learning Decision Lists. Machine Learning 2(3), 229–246 (1987)
10. Siu, K.Y., Roychowdhury, V., Kailath, T.: Discrete Neural Computation: A Theoretical
Foundation. Prentice Hall, Englewood Cliffs (1995)
11. Tajine, M., Elizondo, D.: Growing methods for constructing Recursive Deterministic
Perceptron neural networks and knowledge extraction. Artificial Intelligence 102, 295–
322 (1998)
12. Tajine, M., Elizondo, D.: The recursive deterministic perceptron neural network. Neural
Networks 11, 1571–1588 (1998)
13. Turán, G., Vatan, F.: Linear decision lists and partitioning algorithms for the construction
of neural networks. In: Foundations of Computational Mathematics: selected papers of a
conference, Rio de Janeiro, pp. 414–423. Springer, Heidelberg (1997)
14. Zuev, A., Lipkin, L.I.: Estimating the efficiency of threshold representations of Boolean
functions. Cybernetics 24, 713–723 (1988); translated from Kibernetika (Kiev) 6, 29–37
(1988)
Self-Optimizing Neural Network 3
Adrian Horzyk
1 Introduction
Nowadays, various types of constructive neural networks and other incremental
learning algorithms play an increasingly important role in neural computations.
These algorithms usually provide an incremental method of building neural net-
works with reduced topologies for classification problems. Furthermore, this method
Adrian Horzyk
AGH University of Science and Technology, Department of Automatics
e-mail: horzyk@agh.edu.pl
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 83101.
springerlink.com c Springer-Verlag Berlin Heidelberg 2009
84 A. Horzyk
produces a multilayer network architecture which, together with the weights, is determined
automatically by the constructive algorithm. Another advantage
of these algorithms is that convergence is guaranteed by the method [1], [3], [4], [6],
[8], [13]. A growing amount of current research in neural networks is oriented
towards this important topic. Providing constructive methods for building neural net-
works can potentially create more compact models which can easily be implemented
in hardware and used in various embedded systems.
Many classification methods are in use today. Many of them require setting up
training parameters, initializing network parameters or building an initial
network architecture before the training process can begin. Some of them suffer
from limitations on input value ranges or from the curse-of-dimensionality
problem [1], [8]. Some methods favour certain data values or treat data differently
depending on the number of cases that represent each class [1], [3]. Not
many methods can automatically manage, reduce or simplify an input data space,
and their training processes sometimes fail under the influence
of minor values of weakly differentiating input features. Moreover, the model
size (e.g. the architecture size) and the time necessary for training and evaluating
are also significant. Many neural methods compete for a better generalization us-
ing various training strategies and parameters, neuron functions, various quantities
of layers, neurons and interconnections [1], [3], [8], [13]. This paper confirms that
generalization potential is also hidden in suitable preprocessing of training data and
an appropriate covering of an input data space.
This paper describes the constructive Self-Optimizing Neural Network 3 (SONN-3) [4], which is
free of many of the limitations described above: thanks to the proposed for-
mulas for estimating the discrimination ability of input features, it can build a suitable neural
network architecture and precisely compute weight values. This neural solution can
automatically reduce and simplify an input data space, specifically converting real
input vectors into bipolar binary vectors over {−1, +1}, which can be used to aggregate
equal binary values into a compact neural model of the training data. A human brain
works in a very similar way. It gathers data using sensors that are able to convert
various physical data from the surrounding world into some frequencies of binary
signals. Binary information is then used to perform the relevant computations until
actuators are activated and the information is transformed into physical reactions
[6], [10], [12]. It is very interesting that nature has also chosen binary signals to
carry out complex computations in our brains instead of real signals. The SONN-3
converts all real vectors into binary ones using the specialist algorithm (ADLBCA),
which cleverly transforms values of real input features into bipolar binary values
so that discrimination properties of real input data are not lost. It also automati-
cally rejects all useless values of real input features that have minor significance
for classification. All complex computations for discrimination of classes are car-
ried out on the bipolar binary values. Network outputs take real values from the
continuous range [−1, +1] to express degrees of similarity of an input vector to the
defined classes.
There are a few algorithms that can also reduce or simplify input data space, e.g.
PCA, ICA, rough sets, SSV trees [7], [9], [11], but they do not transform real inputs
into binary inputs, so it is not easy to compare their performance.
The SONN-3 performs many analyses on bipolar binary vectors to optimize a
neural network model. It aggregates the most discriminating and most frequent similar
values of the various training cases, transforms them into an appropriate network
architecture, reinforces them in accordance with their discrimination properties,
rejects all features useless for classification, and builds a classification model.
Thanks to this ability, the method avoids the curse-of-dimensionality
problem.
The SONN-3 analytically computes a network architecture and weight values, so
it is also free of convergence problems. It always finds a solution that differ-
entiates all defined classes provided the training cases are not contradictory (i.e.,
there are no two or more cases from different classes that can be differentiated neither by a single
value nor by any combination of values). If the training data are not contradictory, SONN-3 al-
ways produces a neural solution that correctly classifies all training data. It
also generalizes well. This method does not memorize training data using a huge
number of network parameters but builds a very compact neural model (figs. 8, 9)
that represents only major ranges of values of real input features for all classes and
considers the discrimination of all training cases of different classes.
Chapter 2 describes the construction elements and the development of a network archi-
tecture of the SONN-3. The Global Discrimination Coefficients used for the optimization
processes of SONN-3 are introduced in chapter 3. Chapter 4 describes the construc-
tion of a Lossy Binarizing Subnetwork, which is the input subnetwork of a SONN-3
network. Chapter 5 combines the Lossy Binarizing Subnetwork with an Aggregation
Reinforcement Subnetwork that aggregates and performs lossless compression of
the same values and appropriately reinforces inputs. Chapter 6 describes a Maximum
Selection Subnetwork that selects maximum outputs for each class and produces
an ultimate classification. Comparisons of various soft-computing methods can be
found in chapter 7.
Fig. 1 Three types of SONN-3 neurons: (a) Lossy Binarizing Neuron (LBN) (b) Aggregation
Reinforcement Neuron (ARN), (c) Maximum Selection Neuron (MSN)
neurons are data dependent: weakly correlated data yield a small number of layers,
and strongly correlated data yield more. This subnetwork is responsible for the extraction, counting
and aggregation of equal binary values of various cases and various classes and for
their appropriate reinforcement depending on their discrimination properties com-
puted after the Global Discrimination Coefficients (GDCs) (1) described in the third
chapter of this paper. The aggregation of equal binary values enables the SONN-3
to lossless compress equal binary values of training data and transform these val-
ues into single connections and a special architecture of Aggregation Reinforcement
Neurons (ARN) (fig. 1b). Each ARN represents a subset of training cases. Each one
is determined during the special data division process described in the fifth chapter
of this paper. The subset of training cases can consist of training cases of one or
of many classes. ARNs that represent training cases of more than a single class are
intermediate (hidden) neurons that do not produce their outputs for the next subnet-
work (MSS) but only for some other ARNs.
The third subnetwork (Maximum Selection Subnetwork - MSS) (fig. 7) consists
of a single layer of Maximum Selection Neurons (MSNs) (fig. 1c). Each defined
class is represented by a single MSN. Each MSN is responsible for the selection
of a maximum output value for the class it represents. The maximum output value
is taken over all outputs of all parentless ARNs that represent a subset of training
cases of a single class, the same class as the one for which the MSN is created.
The Global Discrimination Coefficients (GDCs) play the most significant role in all
optimization processes during network construction and the computation of weights.
The GDCs are computed for all LBS bipolar binary outputs (i.e. all ARS bipolar
binary inputs) computed for all training cases separately. They precisely determine
the discrimination ability of all LBS bipolar binary outputs (i.e. all ARS bipolar
binary inputs) and are insensitive to the differences in a number of training cases
that represent various defined classes thanks to the normalization factor Qm (1).
They can determine the representativeness of a given bipolar binary value for a
given class thanks to the quotients Pkm /Qm or Nkm /Qm (1). In other words, the more
representative a given bipolar binary value +1 or −1 is for a given class, the bigger
the quotient P_k^m/Q_m or N_k^m/Q_m, respectively (1). The GDCs also include
a coefficient defining how well a given bipolar binary value differentiates a given class
from the other classes, represented by the sum normalized by M − 1 in
equation (1). In other words, the more frequent a given bipolar binary value +1 or
−1 is in the other classes, the less discriminating this value is for the class of the considered
training case.
The GDCs (1) are computed for each k-th bipolar binary feature value of each
n-th raw training case v^n of the m-th class C_m. For v^n ∈ C_m, n ∈ {1, …, Q},
m ∈ {1, …, M}, k ∈ {1, …, K}:

d_k^{n+} = { (P_k^m / ((M−1) Q_m)) · ∑_{h=1, h≠m}^{M} (1 − P_k^h / Q_h)   if f_k^{LB}(v_r^n, R) > 0 and v^n ∈ C_m
          { 0                                                           if f_k^{LB}(v_r^n, R) < 0 and v^n ∈ C_m      (1)

d_k^{n−} = { (N_k^m / ((M−1) Q_m)) · ∑_{h=1, h≠m}^{M} (1 − N_k^h / Q_h)   if f_k^{LB}(v_r^n, R) < 0 and v^n ∈ C_m
          { 0                                                           if f_k^{LB}(v_r^n, R) > 0 and v^n ∈ C_m

u_k^{n,R} = f_k^{LB}(v_r^n, R)      (2)

∀ m ∈ {1, …, M}:  Q_m = |{ u^{n,R} ∈ U : v^n ∈ C_m, n ∈ {1, …, Q} }|      (3)

where P_k^m (respectively N_k^m) counts the training cases of class C_m whose k-th
bipolar binary value equals +1 (respectively −1).
In order to compute GDCs, all training cases have to be available at the begin-
ning of a construction and training process. It is impossible to compute GDCs if
some parts of training cases are not available or when some parts of training cases
change during a construction or training process. If training data change or are sup-
plemented then GDCs have to be computed once again and a construction process of
SONN-3 has to be repeated from the very beginning. This drawback is not very sig-
nificant, because training data rarely change during a construction or training process,
and even if this occurs, the construction of a SONN-3 is so fast that it can simply be
repeated in order to build an improved solution based on a new architecture
and new values of weights.
GDCs allow SONN-3 to globally estimate the significance of each bipolar bi-
nary value computed for each real input feature value. The GDCs are the basis for
constructing an ARS architecture and computing the values of ARN weights.
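Under the reconstruction of equation (1) given above, the GDCs reduce to simple counting. The following sketch (names and data layout ours, not from the text) takes bipolar binarized cases U[n][k] ∈ {−1, +1} and class labels 0..M−1:

```python
def gdc(U, labels, M):
    """Global Discrimination Coefficients per eq. (1), as reconstructed above:
    d+ = (P_k^m / ((M-1) Q_m)) * sum_{h != m} (1 - P_k^h / Q_h) for +1 values,
    and symmetrically d- using the counts N of -1 values."""
    K = len(U[0])
    Q = [sum(1 for c in labels if c == m) for m in range(M)]
    P = [[sum(1 for u, c in zip(U, labels) if c == m and u[k] > 0)
          for k in range(K)] for m in range(M)]
    N = [[Q[m] - P[m][k] for k in range(K)] for m in range(M)]
    d_plus, d_minus = [], []
    for u, m in zip(U, labels):
        dp, dm = [0.0] * K, [0.0] * K
        for k in range(K):
            if u[k] > 0:
                dp[k] = (P[m][k] / ((M - 1) * Q[m])) * sum(
                    1 - P[h][k] / Q[h] for h in range(M) if h != m)
            else:
                dm[k] = (N[m][k] / ((M - 1) * Q[m])) * sum(
                    1 - N[h][k] / Q[h] for h in range(M) if h != m)
        d_plus.append(dp)
        d_minus.append(dm)
    return d_plus, d_minus
```

A feature value that is universal within its class and absent from all other classes receives the maximal coefficient 1.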
Fig. 2 The comparison of the simple smooth and ADLBCA transformations of real values
into binary ones for the Iris data from the ML Repository.
Fig. 3 Example tables of the discriminated classes for the Iris cases.
4. Next, all fully discriminated training cases for all input features are looked
through in order to remove their indexes from the sorted index tables.
5. Steps 2, 3 and 4 are repeated until all TD cases are discriminated from all other
classes (fig. 3) or there is no more range to consider.
6. If not all training cases are discriminated and no more ranges can be used to carry
out their discrimination, all the atomic ranges are chosen for all input features that
contain indiscriminated training cases and are added to the previously selected
ranges in steps 2, 3 and 4. The atomic ranges always contain a sequence of cases
of one class or they may represent a few classes but the range is narrowed to a
single value (fig. 2).
If step 6 occurs it means that some training cases are contradictory or they can be
differentiated later by a soft-computing algorithm (e.g. the ARS) that can combine
these ranges.
Fig. 4 The LBS constructed for the Iris data with the specified lossy binarization ranges.
(The figure shows the four real inputs v_1 to v_4 (leaf-length, leaf-width, petal-length,
petal-width), each feeding lossy binarizing functions f_{k,R}^{LB} that produce the bipolar
binary outputs u_3, …, u_{12}.)
loses nor rounds off any important representative input values, but aggregates them
and appropriately reinforces them. This algorithm losslessly compresses input data
and makes it possible to join computations for the same values of input features.
It reinforces input features according to the values of the Global Discrimination Coeffi-
cients (GDCs) (1) computed for the given data and described in the third chapter of this
paper. The ARS architecture is always constructed individually for each data set,
and weights are precisely computed to reflect the discrimination and representativeness
properties of the given data. The described ARS automatic configuration can pro-
ceed only with bipolar binary inputs {−1, +1}. The ARS construction algorithm
can automatically and very quickly find appropriate combinations of binary inputs
and automatically simplify or even reduce a bipolar binary input data space. The
ARS is placed in the middle part of the SONN-3 architecture (figs. 7, 8, 9).
Construction of an Aggregation Reinforcement Subnetwork (ARS) begins with
computation of the GDCs (1) for all bipolar binarized inputs u_{k_1}, …, u_{k_t} obtained from
the binary outputs (6) of the LBS described in the previous chapter. The ARS
consists of Aggregation Reinforcement Neurons (ARNs) (fig. 1b). These neurons
need bipolar binary inputs during their construction and adaptation process. They
aggregate inputs of various cases together (fig. 6) if they have the same values (with-
out losing or rounding off any values) and reinforce the values which best discrimi-
nate and best represent training cases between the various classes. All reinforcement
factors and weights (9)-(10) depend on the appropriate values of the Global Dis-
crimination Coefficients (1). The ARNs always produce outputs in the range
[−1, +1] and are connected to other ARNs or to the Maximum Selection Neurons
(MSNs) described in the next chapter. During the ARS construction, ARNs propagate the sum
of the discrimination coefficient values of all previous connections to the ARNs of the
next layer (8). In this way, proper reinforcement is appropriately
promoted and propagated through the network without loss of information about
discrimination. The ARNs compute their outputs as an appropriately weighted sum
of their inputs (7). Each ARN is connected to a compact subset of bipolar binary
ARS inputs (LBS outputs) {u_{k_s}, …, u_{k_t}} and to a single ARN of a previous layer (if
it exists) in order to supplement its discrimination ability (fig. 7). The propagation
of information between ARNs never spoils the discrimination properties of neurons of
previous layers, because the interneuron weights (9) are computed to preserve the influence
of the GDCs of all previous layers on the computations in the next layers (8).
x_i = f_i^{AR}(u_{k_s}, …, u_{k_t}, x_p) = w_0^{x_p} x_p + ∑_{j ∈ {k_s, …, k_t}} w_j^{x_i} u_j      (7)

where

d_0^{AR_p} = ∑_{j ∈ J} d_j      (8)

w_0^{AR_p} = d_0^{AR_p} / ∑_{j ∈ {k_s, …, k_t}} d_j      (9)

w_j^{AR_r} = { u_k^{n,R} d_k^{n+} / ∑_{j ∈ {k_s, …, k_t}} d_j   if u_k^{n,R} ≥ 0
            { u_k^{n,R} d_k^{n−} / ∑_{j ∈ {k_s, …, k_t}} d_j   if u_k^{n,R} < 0      (10)
After the GDCs (1) have been computed for all bipolar binary input features
of all training cases, it is determined which GDCs will be used as obligatory (fig.
5) to achieve the correct discrimination of all training cases. The goal is to find
a minimal subset of GDCs with the largest values that can do this. The larger the GDC
value, the better the discrimination property it represents. Besides the obligatory GDCs,
there are usually many other GDCs with large values, which can be equal
to the values of the corresponding obligatory GDCs. A GDC with a value equal to an
obligatory one is called optional, because it can sometimes be represented in the ARS
architecture without additional construction elements. All other non-null GDC values
established for the same input features as the obligatory ones should be taken into
account when discriminating, in order to achieve unambiguous discrimination and
classification. The determination of the obligatory and optional GDCs proceeds in
the following way:
For each training case find the GDCs which have the largest values that discrim-
inate it from other training cases of all other classes in the following way:
1. For each GDC of this case not yet marked obligatory, compute the number of potentially dis-
criminable training cases of other classes on the indiscriminated-cases list,
and multiply it by the GDC value. This product determines how many training
cases can be discriminated using this GDC, taking into account the discrimination
property of this bipolar binary input. Only those cases can be discriminated that
have a value of that binary input feature opposite to the value of the corresponding
binary input of the ARS (e.g. −1 is opposite to +1, and +1 is opposite to −1).
2. Choose the maximal value among the products computed for all GDC values
not yet marked obligatory.
3. Use the GDC achieving this maximal value to discriminate a subset of training cases from the indis-
criminated-cases list, and remove all discriminated training cases from that
list. If the GDC value for a checked case is null (or its bipolar binary input fea-
ture value is opposite to the bipolar binary input feature value of the given
discriminated case), then that training case can be removed
from this list.
4. If not all training cases have been discriminated against the given training case
then return to step 2.
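The greedy selection in steps 1-4 can be sketched as follows (a simplification, names ours: discriminable[k] is the set of still-indiscriminated cases of other classes that GDC k can discriminate, and case_gdcs[k] is its value):

```python
def select_obligatory(case_gdcs, discriminable):
    """Repeatedly pick the GDC maximizing value * still-coverable count,
    mark it obligatory, and remove the cases it discriminates."""
    remaining = set().union(*discriminable.values()) if discriminable else set()
    obligatory = []
    while remaining:
        # steps 1-2: score each candidate, take the maximum product
        k = max(case_gdcs,
                key=lambda k: case_gdcs[k] * len(discriminable[k] & remaining))
        if not (discriminable[k] & remaining):
            break  # nothing can cover the rest: contradictory cases
        obligatory.append(k)
        remaining -= discriminable[k]   # step 3
    return obligatory
```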
This algorithm is executed once for each training case, and when it finishes, all train-
ing cases are discriminated against all training cases from all other classes. This
guarantees 100% discrimination of all non-contradictory training cases and is an
important part of the ARS optimization algorithms. Figure 5 presents the obligatory
and optional GDCs computed for the Iris data from the ML Repository. The black
rectangles map the obligatory GDCs that have to be used to totally discriminate all
Iris training cases against all cases of all other classes. The dark grey rectangles
map the optional GDCs that can be included in the ARS without additional cost.
The light grey rectangles map the non-null GDC values that have to be taken into
Fig. 5 The GDC characters computed for the lossy binarized Iris cases.
AEC_j = S_j (C_j − 1)      (11)
After all GDCs of all training cases are classified as obligatory, optional, con-
sidered or irrelevant (fig. 5), the obligatory GDCs of various training cases can be
grouped together only if they have equal values for some bipolar binary input. The
obligatory and optional GDCs are grouped in such a way as to minimize a number
of network construction elements (a number of connections and a number of neu-
rons) that will represent them in the network simultaneously, without limiting exact
representation of the obligatory GDCs in this network. The obligatory and optional
GDCs that have equal values for the same input features and for different training
samples can be grouped together and represented using a single connection. Such
grouping and transformation efficiently losslessly compresses the GDC information in the
neural network. So-called Aggregation Effectiveness Coefficients (AECs) (11) are
used to count how many obligatory GDCs can be grouped and aggregated for each
obligatory GDC value, after subtracting the cost of representing the compressed obligatory
GDCs in the network. The AECs are used to recursively divide the bipolar
lossy-binarized training cases U into subsets (e.g. {U_1, U_2, U_3, U_4, U_5} for the
Iris data (fig. 6)) and to create ARNs for these subsets. The maximum value of the
AEC in each division step is transformed into an ARN. Each AEC can group many
different GDC values (fig. 6) if they are equal for all samples of each input feature
and occur for the same samples of various input features simultaneously. The vari-
ous GDC values are transformed into connections for the ARN created for a given
AEC (figs. 6, 7). The AECs also count the number of connections that can be saved
thanks to the found aggregations of GDCs. The AEC (11) is computed as the product
of the number of training cases C_j that have equal GDC values for a given input
k, minus one (the equal GDC values are always represented by a single connection;
this is the cost of representing the aggregated GDCs in the network), and the
number S_j of different GDC values that are equal for the same found subset of
training cases for various inputs. After AECs are computed for all different GDC
values for all inputs, the maximal AEC value is chosen in order to use it to divide
the training cases U into two subsets: the first subset consists of the training
cases whose obligatory and optional GDC values are equal and whose obligatory GDC
values are all determined in the maximal AEC; the second subset consists of the
remaining training cases that do not belong to the first one. The division of training cases is
demonstrated in fig. 6 for the Iris data. In each division step AECs are computed for
one of the divided subsets of training cases until all training cases in subsets belong
to single classes and all obligatory GDCs are represented in the ARS connections.
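The AEC of Equation (11) and the selection of the maximal AEC can be sketched as follows. This is only an illustrative reading of the description above, not the author's implementation; the table layout and all names are assumptions:

```python
from collections import defaultdict

def aggregation_effectiveness(gdc_table):
    """Compute AEC_j = S_j * (C_j - 1) for each subset of training cases.

    gdc_table: dict mapping input index -> dict mapping case id -> GDC value.
    For every GDC value of every input we collect the set of cases sharing it;
    S_j counts how many (input, value) pairs select the same case subset,
    C_j is the size of that subset (one connection represents the whole group,
    hence the "minus one" cost).
    """
    subsets = defaultdict(list)            # frozenset(cases) -> [(input, value), ...]
    for inp, cases in gdc_table.items():
        by_value = defaultdict(set)
        for case, value in cases.items():
            by_value[value].add(case)
        for value, case_set in by_value.items():
            subsets[frozenset(case_set)].append((inp, value))
    return {cases: len(pairs) * (len(cases) - 1) for cases, pairs in subsets.items()}

# Toy example: two inputs whose GDC values coincide for cases {1, 2, 3}
table = {0: {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.9},
         1: {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.7}}
aecs = aggregation_effectiveness(table)
best = max(aecs, key=aecs.get)   # subset chosen to form the next ARN
```

Here the subset {1, 2, 3} is selected (S_j = 2 equal GDC values over C_j = 3 cases gives AEC = 4), mirroring how the maximal AEC drives each division step.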
During the process of division of the transformed training cases (fig. 6), the connections
are established (fig. 7) and the weights are computed (9), (10). The interneuron
weights (9) appropriately reinforce an interneuron input value (fig. 7) according
to the sum (8) of all previously connected GDCs of bipolar binary inputs.
Self-Optimizing Neural Network 3 95
Fig. 6 The division and construction process of the ARS for the Iris data.
Fig. 7 The architecture of the SONN-3 constructed for the Iris data. The architecture includes
all the subnetworks: the LBS, the ARS and the MSS and all computed weights.
Fig. 8 The SONN-3 architecture (11 LBNs, 14 ARNs, 2 MSNs, 61 connections, 5 layers)
automatically constructed for the Iris data.
y_m = f_m^MS(u_{ks}, . . . , u_{kt}, x_{ig}, . . . , x_{ih}) = max{u_{ks}, . . . , u_{kt}, x_{ig}, . . . , x_{ih}} (12)
The MSNs compute the maximum of outputs of all ARNs that represent an undi-
vidable subset of training cases of a single class (12). The output value of the MSN
can be interpreted as the similarity of an input vector to the considered class. The
inputs of MSNs are not weighted (fig. 7). If an MSN has a single input then this
MSN can be reduced. Such a situation can occur for some highly correlated and
easy-to-discriminate classes of training cases (compare figs. 7 and 8). If some training
cases can be discriminated using a single bipolar binary input feature then the MSN
can even be connected directly to this input and the corresponding ARN can be reduced
(compare figs. 7 and 8).
classifier for various classification tasks, even for highly non-separable data [1], [8].
The SONN-3 can precisely adjust its architecture and weights to the given training data,
taking into consideration its complexity and correlations. This adjustment ability
of the network is similar to the plasticity processes [5] that take place in natural
nervous systems [10].
This chapter compares the classification performance of the SONN-3 with that
of other top classification soft-computing methods (tabs. 1-2). The Iris and Wine
data from the ML Repository are used to carry out comparisons. Figures 4-8 illus-
trate the topologies of the SONNs-3 constructed for the above-mentioned training
data. The Iris and Wine data from the ML Repository have been used to construct
various soft-computing models, i.e. SVM, IncNet, k-NN, FSM, SSV Tree, MLP,
RBF, PNN and SONN-3. The 4 dimensional Iris data consists of 150 training cases
and 3 classes. The 13 dimensional Wine data consists of 178 training cases and 3
classes. The GhostMiner 3.0 solvers and the Statistica NN automatic designer with
10-fold cross-validation have been used to find the best solutions for these soft-
computing methods. Moreover various configurations and parameters have been
Fig. 9 The SONN-3 architecture (6 LBNs, 7 ARNs, 3 MSNs, 32 connections, 4 layers) au-
tomatically constructed for the Wine data.
tested. My own implementation of the SONN-3 has been used to verify assumptions
of this method and to construct the solutions (figs. 8-9). Tables 1-2 contain the best
results achieved for all tested soft-computing methods mentioned above.
The presented comparisons (tabs. 1-2) confirm that the SONN-3:
- creates very compact architectures,
- is constructed very quickly,
- always classifies all training cases correctly and unambiguously,
- automatically sets up all training parameters,
- achieves results which are competitive with other soft-computing methods,
- generalizes very well [4].
8 Conclusions
The paper deals with the fully automatic construction of the universal ontogenic
neural network classifier SONN-3, which is able to automatically adapt itself to
binary, integer or real input data without any limitations. It analyses and processes
training data very quickly and finds a compact SONN-3 architecture and weights for
any given training data set. It can also automatically simplify and reduce the input
data space, which is very desirable in many practical situations. It is also devoid
of the curse-of-dimensionality problem.
Moreover, not many soft-computing methods can automatically and effectively
select the most discriminative inputs, so classification results are often influenced by
minor or irrelevant parameters that can spoil them and degrade the generalization
properties of the achieved soft-computing solution. The SONN-3 reduces the original real inputs
to a set of the best discriminating inputs and transforms them to bipolar binary ones
(figs. 4-8) used in further computations.
The comparison results (tabs. 1-2) show that the SONN-3 is not only very quick
and easy to use but it also achieves very good classification and generalization re-
sults in comparison with the other popular soft-computing classification methods
[4],[6]. Not many other algorithms for training neural networks can effectively com-
pute an architecture and all weights for all training cases using a global analysis of
them. Furthermore, the SONN-3 has many interesting features that can be com-
pared with biological neural networks and various neural processes in biological
brains [5].
The second main strength of the SONN-3 is that it uses two kinds of information
that are very useful for discrimination: the existence and the non-existence of some
input values for some classes. This kind of information is rarely used by other soft-
computing models. The majority of soft-computing models and methods are limited
to using only the information about the existence of some input values for some
classes. The SONN methodology expands these abilities and offers better possibili-
ties for a generalization.
The third strength of the SONN-3 is that it is able to construct a compact model
without either rounding off or losing any important values of input data. This makes
all computations very exact and accurate. Moreover, the SONN-3 can automati-
cally and very accurately estimate discrimination and representative properties of
the data. These estimations are used to group and aggregate the most important
values of input features in order to produce a compact classification model of any
given data using Global Discrimination Coefficients (1). Figures 8 and 9 show how
compact and consistent the architectures constructed by the SONN-3 algorithms
described in this paper can be.
The fourth strength of the SONN-3 is that it always builds a solution using the
most important, well-differentiating and well-discriminating features of all training
cases after the global analysis of training data. The SONN-3 also automatically
excludes data artifacts because it focuses on the most discriminative features, which
are not artifacts.
Finally, the SONN-3 is very quick, cost-effective and fully automatic. On the
other hand, a computer implementation of this method is not easy because of the
huge number of optimization algorithms that gradually analyze training data, transform
them, develop a final neural network architecture and compute its weights. An
interactive website with the implemented SONN-3 algorithms will be published at
http://home.agh.edu.pl/horzyk soon.
References
1. Duch, W., Korbicz, J., Rutkowski, L., Tadeusiewicz, R. (eds.): Biocybernetics and
Biomedical Engineering. EXIT, Warszawa (2000)
2. Dudek-Dyduch, E., Horzyk, A.: Analytical Synthesis of Neural Networks for Selected
Classes of Problems. In: Bubnicki, Z., Grzech, A. (eds.) Knowledge Engineering and
Expert Systems, OWPN, Wroclaw, pp. 194-206 (2003)
3. Fiesler, E., Beale, R. (eds.): Handbook of Neural Computation. IOP Publishing Ltd.,
Oxford University Press, Bristol, New York (1997)
4. Horzyk, A.: Introduction to Constructive and Optimization Aspects of SONN-3. In:
Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164,
pp. 763-772. Springer, Heidelberg (2008)
5. Horzyk, A.: A New Extension of Self-Optimizing Neural Networks for Architecture
Optimization. In: Duch, W., Kacprzyk, J., Oja, E., Zadrozny, S. (eds.) ICANN 2005. LNCS,
vol. 3696, pp. 415-420. Springer, Heidelberg (2005)
6. Horzyk, A., Tadeusiewicz, R.: Comparison of Plasticity of Self-Optimizing Neural
Networks and Natural Neural Networks. In: Mira, J., Alvarez, J.R. (eds.) Proc. of ICANN
2005, pp. 156-165. Springer, Heidelberg (2005)
7. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley and
Sons, Chichester (2001)
8. Jankowski, N.: Ontogenic neural networks. EXIT, Warszawa (2003)
9. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)
10. Kalat, J.: Biological Psychology. Thomson Learning Inc., Wadsworth (2004)
11. Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic
Publishers, Dordrecht (1991)
12. Starzyk, J.A.: Motivation in Embodied Intelligence. In: Robotics, Automation and Control.
I-Tech Education and Publishing (2008)
13. Subirats, J.L., Franco, L., Molina Conde, I., Jerez, J.M.: Active Learning Using a Constructive
Neural Network Algorithm. In: Kůrková, V., Neruda, R., Koutník, J. (eds.)
ICANN 2008, Part II. LNCS, vol. 5164, pp. 803-811. Springer, Heidelberg (2008)
M-CLANN: Multiclass Concept
Lattice-Based Artificial Neural Network
1 Introduction
A growing number of real-world applications have been tackled with artificial
neural networks (ANNs). An ANN is an adaptive system that changes its structure
based on external or internal information that flows through the network
during the learning phase. ANNs offer a powerful and distributed computing
architecture with significant learning abilities, and they are able to represent
highly nonlinear and multivariable relationships. ANNs have been successfully
applied to solve a variety of specific tasks (pattern recognition, function
approximation, clustering, feature extraction, optimization, pattern matching
Engelbert Mephu Nguifo and Norbert Tsopze
CRIL CNRS, Artois University, Lens, France
e-mail: tsopze@cril.univ-artois.fr
Norbert Tsopze and Gilbert Tindo
Computer Science Department - University of Yaounde I,
PO Box 812 Yaounde - Cameroon
e-mail: tsopze@cril.fr;gtindo@uycdc.uninet.cm
Engelbert Mephu Nguifo
LIMOS CNRS, Universite Blaise Pascal, Clermont Ferrand, France
e-mail: mephu@isima.fr
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 103-121.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
104 E.M. Nguifo, N. Tsopze, and G. Tindo
2 Related Works
Research works about neural network architecture design can be divided
into two groups, as mentioned above. The first group uses prior knowledge
to propose an MLP topology, while the second group searches for an optimal
topology minimizing the number of hidden neurons and layers.
KBANN (Knowledge Based Artificial Neural Networks) [32] uses a set of rules
represented as a set of Horn clauses. From these rules an item hierarchy is
defined and the architecture of the neural network is derived from this hierarchy.
The hierarchy between items is defined using the following equivalences:
1. Final conclusions → output units;
2. Intermediate conclusions → internal units;
3. Hypotheses → input units;
4. Dependencies between items → connection links.
The different steps of KBANN are as follows:
1. Rewriting. This step consists of rewriting the rules such that disjuncts are
expressed as a set of rules (each rule has only one antecedent).
2. Mapping. The hierarchy between items is defined and directly mapped to
the network.
3. Labeling. Each unit is numbered by its level.
4. Adding new hidden units. In order to make the network able to learn
derived features not specified in the initial rule set, it is advised to add
new units in the hidden layer.
5. Adding input units. Some relevant features which are not referred to by the
initial rule set are added.
6. Adding links. Links are added to connect each unit numbered n − 1 to
each unit numbered n. The connection weights of these links are set to 0.
7. Perturbing. A small random number is added to each weight.
8. Initialization of connection weights and ANN training by error backpropagation.
The connection weights and the biases of the neurons are initialized as follows:
- w for the positive antecedents;
- −w for the negated antecedents;
- the bias on the unit corresponding to a rule's consequent is set to (p − 1/2)w,
where p is the number of positive antecedents of the unit.
w is a positive number whose (empirically defined) default value is 4.
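The weight-setting rule above can be sketched as follows. This is only a minimal illustration of the initialization scheme; the rule encoding and function names are assumptions, not part of KBANN's published code:

```python
def kbann_unit(antecedents, w=4.0):
    """Initialize a KBANN unit for one rewritten (single-consequent) rule.

    antecedents: list of (name, positive) pairs; w: the empirically defined
    default weight (4 in the text).  Positive antecedents get weight w,
    negated ones -w; the bias is (p - 1/2) * w, where p counts the
    positive antecedents.
    """
    weights = {name: (w if positive else -w) for name, positive in antecedents}
    p = sum(1 for _, positive in antecedents if positive)
    bias = (p - 0.5) * w
    return weights, bias

def fires(weights, bias, inputs):
    """The unit is active when its weighted input sum exceeds the bias."""
    return sum(weights[n] * inputs[n] for n in weights) > bias

# Rule B :- D, F, not H from Example 1
weights, bias = kbann_unit([("D", True), ("F", True), ("H", False)])
# weights == {"D": 4.0, "F": 4.0, "H": -4.0}, bias == 6.0
```

With these values the unit behaves like the conjunction it encodes: it fires for D = 1, F = 1, H = 0 and stays silent whenever a positive antecedent is missing or the negated one is present.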
Example 1. Figure 1 presents a simplified example of defining the neural
topology by the KBANN approach. The first column shows the initial rules, which
are rewritten and presented in the second column. The final network is
presented in the third column.
Fig. 1 KBANN example: the initial rules (A:-B; A:-C,E; B:-D,F,not H; C:-F,not G;
E:-not F,G,H) are rewritten (A:-A1; A:-A2; A1:-B; A2:-C,E; B:-D,F,not H;
C:-F,not G; E:-not F,G,H) and mapped to a network over the units A, A1, A2, B, C, E
and the inputs D, F, G, H, with positive, negative and added connections.
from the data. These methods start with a small network and dynamically
grow the network by adding and training neurons as needed until better
classification is achieved. These methods can be divided into two subgroups:
those with many hidden layers [28] and those with only one hidden layer [38].
Fig. 2 A constructed network with an input layer (a-f), two hidden layers and an
output layer (C1, C2, C3); the legend distinguishes ancillary neurons, master
neurons and full connections.
The Distal method [38] belongs to this category. It builds a 3-layer neural
network. Each neuron of the input layer is linked to an attribute. Each neuron
of the output layer is associated with a predefined class. The process essentially
consists of defining the hidden layer: Distal clusters the training data into disjoint
subsets and represents each subset in the hidden layer by one neuron.
Example 3. Figure 3 presents a neural network defined by Distal: (a) is the
initial state of the network and (b) is the generated network.
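The Distal construction above — one hidden neuron per disjoint cluster of training patterns — can be sketched as follows. The greedy distance-threshold grouping used here is only an assumption for illustration; the actual Distal inter-pattern distance criterion [38] is more refined:

```python
def distal_hidden_layer(patterns, labels, radius=1.0):
    """Group training patterns into disjoint same-class clusters; each
    cluster becomes one hidden neuron connected to its members' class.

    This simple threshold grouping only illustrates the idea of one
    hidden neuron per subset; it is not Distal's exact clustering rule.
    """
    clusters = []                       # each: (center, label, members)
    for x, y in zip(patterns, labels):
        for center, lab, members in clusters:
            dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
            if lab == y and dist2 <= radius ** 2:
                members.append(x)
                break
        else:
            clusters.append((x, y, [x]))
    return clusters                     # one hidden neuron per cluster

pts = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
labs = ["+", "+", "-"]
hidden = distal_hidden_layer(pts, labs)   # two clusters -> two hidden neurons
```

Each returned cluster is then wired between every input neuron and the output neuron of its class, yielding the 3-layer topology of Figure 3.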
There are also in the literature many works which help the user to optimize
[37] or prune networks, by pruning some connections [23], by selecting some
variables [5] among the entire set of initial variables, or by detecting and
filtering noisy examples [34]. These works do not propose an efficient method
to build the neural network topology, but they can be classified in the second
group, since by reducing the number of input neurons, the number of neurons
in the hidden layer could also vary.

Fig. 3 The network defined by Distal: (a) the initial state, with input neurons and
output neurons C1, C2, C3; (b) the generated network, whose fully connected hidden
neurons represent clusters of the patterns O1, O2, ..., O10.
3.1 Classification
The classification task consists of labelling unknown patterns with a predefined
class. The classification process builds a model and trains it so that the model
is able to assign unseen patterns to one of the output classes.
Here, each known pattern is presented as a pair (x, y), where x is the vector
containing the values taken by the pattern on the different attributes and y
is its class value, represented by a particular attribute. The available data is
divided into two sets: the training set and the test set. The system operates
in two phases: the training phase consists in designing the model, while the
second phase evaluates the trained model.
For instance, in the data of Table 1, objects (patterns) 1 to 6 can be associated
with the positive class (+), while patterns 7 to 10 can be associated with
the negative class (-).
The model evaluation (or test) consists of calculating its accuracy rate as
the ratio between the number of well-classified patterns and the total number
of patterns. There are many techniques to determine the accuracy rate, among
which:
1. K-fold cross validation. The training data is divided into k disjoint subsets
and the model is trained and tested k times. At each iteration i, the
ith subset is used to test the model built and trained using the other
k − 1 subsets (all subsets except the ith). The accuracy rate is
calculated as the average of the accuracy rates obtained at each
iteration. Empirically it is advised to take k = 10.
2. Leave-one-out is a variant of k-fold cross validation where k is set to the
number of patterns in the training set.
3. Holdout. The training set is randomly separated into two disjoint subsets.
One of these subsets is used to build and train the model while the other
is used for testing.
Classification as well as supervised learning are covered in more detail in [7, 20].
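The k-fold procedure of item 1 can be sketched as follows. The deterministic split and the toy majority-class model are assumptions made only to keep the example self-contained:

```python
def kfold_accuracy(patterns, labels, train_fn, k=10):
    """Estimate accuracy by k-fold cross validation: the i-th fold is
    held out for testing while the model is trained on the other k - 1
    folds; the final rate is the average over the k iterations."""
    n = len(patterns)
    fold_rates = []
    for i in range(k):
        test_idx = set(range(i, n, k))          # simple deterministic split
        train = [(patterns[j], labels[j]) for j in range(n) if j not in test_idx]
        model = train_fn(train)
        hits = sum(model(patterns[j]) == labels[j] for j in test_idx)
        fold_rates.append(hits / max(len(test_idx), 1))
    return sum(fold_rates) / k

def majority_trainer(train):
    """Trivial 'model' predicting the majority class, to exercise the loop."""
    from collections import Counter
    top = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: top

data = [(i,) for i in range(20)]
labels = ["a"] * 15 + ["b"] * 5
acc = kfold_accuracy(data, labels, majority_trainer, k=10)   # -> 0.75
```

With real learners, `train_fn` would fit the classifier on the k − 1 training folds; leave-one-out is simply `k = len(patterns)`.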
Table 1 A formal context: objects 1-10 (rows) over the binary attributes a-f
(columns); a 1 indicates that the object possesses the attribute. In particular,
objects 1, 2, 5, 6 and 10 are exactly the objects sharing both attributes a and e.
(The column alignment of the original table was lost in extraction.)
Example 5. From Table 1, ({1, 2, 5, 6, 10}, {a, e}) is a formal concept, where
{1, 2, 5, 6, 10} is the extent and {a, e} is the intent. In contrast, ({1, 2, 5}, {a, e}) is
not a formal concept, since {1, 2, 5} is not the largest set of objects that
verify all attributes of the set {a, e}.
Definition 3. Let L be the entire set of concepts extracted from the formal
context C, and ≤ a relation defined as (O1, A1) ≤ (O2, A2) ⟺ (O1 ⊆ O2) (or,
equivalently, A2 ⊆ A1). The relation ≤ defines the order relation on L [16].
If (O1, A1) ≤ (O2, A2) is verified (without an intermediate concept) then the
concept (O1, A1) is called the successor of the concept (O2, A2), and (O2, A2)
the predecessor of (O1, A1).
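The maximality condition behind formal concepts can be checked programmatically. The mini-context below is a hypothetical stand-in for Table 1, keeping only enough objects to mirror Example 5:

```python
def is_formal_concept(extent, intent, context):
    """(O, A) is a formal concept iff O is exactly the set of objects
    possessing every attribute of A, and A is exactly the set of
    attributes common to every object of O.
    context: dict mapping object -> set of attributes."""
    if not extent:
        return False                  # sketch: restrict to non-empty extents
    objects_with_intent = {o for o, attrs in context.items() if intent <= attrs}
    common_attrs = set.intersection(*(context[o] for o in extent))
    return objects_with_intent == extent and common_attrs == intent

# Hypothetical context in the spirit of Table 1: 1, 2, 5, 6, 10 share {a, e}
ctx = {1: {"a", "e"}, 2: {"a", "e"}, 5: {"a", "e"},
       6: {"a", "e"}, 10: {"a", "e"}, 3: {"a", "b"}, 7: {"b", "f"}}
assert is_formal_concept({1, 2, 5, 6, 10}, {"a", "e"}, ctx)
assert not is_formal_concept({1, 2, 5}, {"a", "e"}, ctx)   # extent not maximal
```

The second check fails exactly as in Example 5: {1, 2, 5} is a valid set of objects verifying {a, e}, but not the largest one.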
3.3 Constraints
In order to reduce the size of the concept lattice, and consequently the time
complexity, we introduce some constraints regularly used to select concepts during
the learning process.
Fig. 4 The overall process: the training data is turned into a join semi-lattice
(learning, guided by heuristics), the semi-lattice is translated into a neural
network topology (translation and setting), and the network is trained on the
data to yield the neural classifier.
the join semi-lattice into a topology of the neural network, and set the initial
connection weights; (3) train the neural network.
The variables used in the algorithms defined in this section are: C is a formal
context (dataset); L is the semi-lattice built from the training dataset K; c
and c′ are formal concepts, elements of L; n is the number of attributes in each
training pattern; m is the number of output classes in the training dataset; NN is
the comprehensive neural network built to classify the data.
The threshold is zero for all units and the connection weights are initialized as
follows:
- Connection weights between neurons derived directly from the lattice are
initialized to 1. This implies that when a neuron is active, all its
predecessors are active too.
- Connection weights between the input layer and the hidden layer are
initialized as follows: 1 if the attribute represented by the input appears in the
intent Y of the concept associated with the ANN node, and −1 otherwise.
This implies that the hidden unit connected to the input unit will be active
only if the majority of its inputs (the attributes included in its intent) is 1.
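The initialization just described can be sketched as follows (the data layout and names are assumptions for illustration):

```python
def init_weights(lattice_edges, concepts, attributes):
    """Initial M-CLANN weights: 1 on every connection derived from the
    semi-lattice; +1 / -1 between an input attribute and a hidden unit
    according to whether the attribute belongs to the unit's intent.
    All unit thresholds are zero (kept implicit here)."""
    weights = {(child, parent): 1.0 for child, parent in lattice_edges}
    for unit, intent in concepts.items():
        for a in attributes:
            weights[(a, unit)] = 1.0 if a in intent else -1.0
    return weights

w = init_weights(lattice_edges=[("c1", "c0")],     # lattice-derived link
                 concepts={"c1": {"a", "e"}},      # hidden unit c1, intent {a, e}
                 attributes=["a", "b", "e"])
# w[("a", "c1")] == 1.0, w[("b", "c1")] == -1.0, w[("c1", "c0")] == 1.0
```

An input pattern activating most of c1's intent attributes then drives the hidden unit positive, which in turn activates its lattice predecessors, as the text requires.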
5.1 Data
To examine the practical aspects of the approach presented above, we ran
the experiments on data available from the UCI repository [26]. The
characteristics of these data are shown in Table 2, which contains the name
of each dataset, the number of training patterns (#Train), the number of test
patterns (#Test), the number of output classes (#Class), the initial number
of (nominal) attributes in each pattern (#Nom), and the number of binary
attributes obtained after binarization (#Bin). Attributes were binarized by the
Weka [36] binarization procedure Filters.NominalToBinary. The diversity
of these data (from 24 to 3196 training patterns; from 2 to 19 output classes)
helps in revealing the behavior of each model in many situations. There are
no missing values in these datasets.
The two constraints presented above (frequency and height) have been applied
in selecting concepts during experimentation. We first use each of them
separately and then we combine them.
5.2 Results
Experimental results are obtained from the model trained by error backpropagation
[30] and validated by 10-fold cross-validation or holdout [20].
The learning parameters are the following: as activation function we use the
sigmoid f(x) = 1 / (1 + e^(−x)), with 500 iterations in the weight-modification
process and a learning rate of 1.
Table 3 presents the accuracy rate (percentage) obtained with the data in
Table 2. In Table 3, the symbol '-' indicates that no formal concept satisfies
the constraints and the process was stopped. The symbol 'x' indicates that
the classifier CLANN was not applied to those multiclass problems.
In this table, MCL1 is M-CLANN built from a semi-lattice with one level
while MCL30 and MCL20 are M-CLANN built using respectively 30 and 20
Table 3 Accuracy rates of the M-CLANN classifier with some varied input parameters.
6 Discussion
As presented in the previous section, there exist many algorithms which
could be used to define the neural network architecture. Each of those
algorithms presents advantages, but they also have issues:
1. Input data. Many algorithms cannot process data other than numeric.
Apart from Distal, whose authors have defined a distance between
symbolic data, all the others only treat numeric data. In addition to the
training data, using the KBANN method requires a domain theory, which is not
always available. The choice of the method could thus be influenced by
the input data.
2. Interpretability of the ANN. It is well known that the ANN is one of the
most commonly used methods in classification. As it is seen as a black
box, it is not used in domains where result explanations are important.
Among the previous methods, only M-CLANN and KBANN present
interpretable architectures, so approaches other than M-CLANN and KBANN
cannot be advised in such domains. In M-CLANN, each node
is associated with one formal concept and each formal concept is formed by
7 Conclusion
References
1. Andrews, R., Diederich, J., Tickle, A.: Survey and critique of techniques for
extracting rules from trained artificial neural networks. Knowledge-Based
Systems 8(6), 373-389 (1995)
2. Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., Lakhal, L.: Mining minimal
non-redundant association rules using frequent closed itemsets. In: Palamidessi,
C., Moniz Pereira, L., Lloyd, J.W., Dahl, V., Furbach, U., Kerber, M., Lau,
K.-K., Sagiv, Y., Stuckey, P.J. (eds.) CL 2000. LNCS (LNAI), vol. 1861, pp.
972-986. Springer, Heidelberg (2000)
23. Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal Brain Damage. In: Advances in
Neural Information Processing Systems, vol. 2, pp. 598-605. Morgan Kaufmann
Publishers, San Francisco (1990)
24. Mephu Nguifo, E.: Une nouvelle approche basée sur le treillis de Galois
pour l'apprentissage de concepts. Mathématiques, Informatique, Sciences
Humaines 134, 19-38 (1994)
25. Mephu Nguifo, E., Tsopze, N., Tindo, G.: M-CLANN: Multi-class concept
lattice-based artificial neural network for supervised classification. In: Kůrková,
V., Neruda, R., Koutník, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp.
812-821. Springer, Heidelberg (2008)
26. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine
learning databases. Dept. Inform. Comput. Sci., Univ. California, Irvine,
CA (1998), http://www.ics.uci.edu/AI/ML/MLDBRepository.html
27. Parekh, R., Yang, J., Honavar, V.: Constructive Neural Networks Learning
Algorithms for Multi-Category Classification. Department of Computer Science,
Iowa State University, Tech. Report ISU CS TR 95-15 (1995)
28. Parekh, R., Yang, J., Honavar, V.: Constructive Neural-Network Learning Algorithms
for Pattern Classification. IEEE Transactions on Neural Networks 11(2),
436-451 (2000)
29. Piccinini, G.: Some neural networks compute, others don't. Neural Networks 21
(special issue), 311-321 (2008)
30. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by
back-propagating errors. Nature 323, 533-536 (1986)
31. Rudolph, S.: Using FCA for Encoding Closure Operators into Neural Networks.
In: Priss, U., Polovina, S., Hill, R. (eds.) ICCS 2007. LNCS (LNAI), vol. 4604,
pp. 321-332. Springer, Heidelberg (2007)
32. Shavlik, W.J., Towell, G.G.: KBANN: Knowledge based artificial neural networks.
Artificial Intelligence 70, 119-165 (1994)
33. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Computing Iceberg
concept lattices with TITANIC. Journal on Knowledge and Data Engineering
(KDE) 2(42), 189-222 (2002)
34. Subirats, J.L., Franco, L., Molina Conde, I., Jerez, J.M.: Active learning using
a constructive neural network algorithm. In: Kůrková, V., Neruda, R., Koutník,
J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 803-811. Springer,
Heidelberg (2008)
35. Tsopze, N., Mephu Nguifo, E., Tindo, G.: CLANN: Concept-Lattices-based
Artificial Neural Networks. In: Diatta, J., Eklund, P., Liquière, M. (eds.) Proceedings
of the fifth Intl. Conf. on Concept Lattices and Applications (CLA 2007),
Montpellier, France, October 24-26, 2007, pp. 157-168 (2007)
36. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and
techniques. Morgan Kaufmann, San Francisco (2005)
37. Yacoub, M., Bennani, Y.: Architecture Optimisation in Feedforward Connectionist
Models. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.)
ICANN 1997. LNCS, vol. 1327. Springer, Heidelberg (1997)
38. Yang, J., Parekh, R., Honavar, V.: Distal: An Inter-pattern Distance-based
Constructive Learning Algorithm. Intell. Data Anal. 3, 55-73 (1999)
39. Werbos, P.J.: Why neural networks? In: Fiesler, E., Beale, R. (eds.) Handbook
of Neural Computation, pp. A2.1:1-A2.3:6. IOP Pub., Oxford University Press,
Oxford (1997)
40. Wille, R.: Restructuring Lattice Theory: An Approach Based on Hierarchies of
Concepts. In: Rival, I. (ed.) Ordered Sets, pp. 445-470 (1982)
Constructive Morphological Neural Networks:
Some Theoretical Aspects and Experimental
Results in Classification
1 Introduction
Mathematical Morphology (MM) is a theory that uses concepts from set theory, ge-
ometry and topology to analyze geometrical structures in an image [21, 28, 45, 44].
MM has found widespread applications over the entire imaging spectrum [7, 19,
20, 26, 32, 47, 48]. Morphological operators were originally developed for binary
and grayscale image processing. The subsequent generalization to complete lattices
Peter Sussner
Department of Applied Mathematics, IMECC, University of Campinas,
Campinas, SP 13084 970
e-mail: sussner@ime.unicamp.br
Estevao Laureano Esmi
Department of Applied Mathematics, IMECC, University of Campinas,
Campinas, SP 13084 970
e-mail: ra050652@ime.unicamp.br
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 123-144.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
124 P. Sussner and E.L. Esmi
ε̄ is called an anti-erosion if ε̄(⋀Y) = ⋁_{y∈Y} ε̄(y); (5)
δ̄ is called an anti-dilation if δ̄(⋁Y) = ⋀_{y∈Y} δ̄(y). (6)
(+∞) + (−∞) = (−∞) + (+∞) = −∞ (8)
(+∞) +′ (−∞) = (−∞) +′ (+∞) = +∞ (9)
There are two types of matrix products with entries in G. Given a matrix A ∈
G^{m×p} and a matrix B ∈ G^{p×n}, the matrix C = A ∨ B, called the max-product of A
and B, and the matrix D = A ∧ B, called the min-product of A and B, are defined by
the following equations:
c_ij = ⋁_{k=1}^{p} (a_ik + b_kj),   d_ij = ⋀_{k=1}^{p} (a_ik + b_kj). (10)
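On finite entries (where the dual addition coincides with ordinary addition), the two products of Equation 10 can be sketched as:

```python
def max_product(A, B):
    """C = A (max-product) B:  c_ij = max_k (a_ik + b_kj)."""
    return [[max(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def min_product(A, B):
    """D = A (min-product) B:  d_ij = min_k (a_ik + b_kj)."""
    return [[min(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 0], [2, -1]]
B = [[0, 3], [1, 1]]
C = max_product(A, B)   # [[1, 4], [2, 5]]
D = min_product(A, B)   # [[1, 1], [0, 0]]
```

Note how additions replace multiplications and max/min replace the sum of the ordinary matrix product; a full minimax-algebra implementation would additionally handle the ±∞ cases via Equations 8 and 9.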
ε_A(x) = A^t ∧ x, (11)
δ_A(x) = A^t ∨ x. (12)
Note that the operators ε_A and δ_A represent erosions and dilations from the complete
lattice G^n to the complete lattice G^m, respectively. The theory of minimax
algebra includes a theory of conjugation. For more information, we refer the reader
to the treatises of Cuninghame-Green [13, 11]. The bounded lattice ordered group
(G, ∨, ∧, +, +′) is self-conjugate. The conjugate of an element x ∈ G is denoted
using the symbol x* and is defined as follows:
x* = −x, if x ∈ G \ {−∞, +∞}; +∞, if x = −∞; −∞, if x = +∞. (13)
The operator of conjugation gives rise to a negation ν on G^n which maps the
i-th component of x to its conjugate. Formally, we have
(ν(x))_i = (x_i)*, i = 1, . . . , n. (14)
y = f(ε_w(x)), where ε_w(x) = w^t ∧ x = ⋀_{i=1}^{n} (x_i + w_i); (15)
y = f(δ_w(x)), where δ_w(x) = w^t ∨ x = ⋁_{i=1}^{n} (x_i + w_i); (16)
y = f(ε̄_w(x)), where ε̄_w(x) = ε_w(ν(x)) = ⋀_{i=1}^{n} (−x_i + w_i); (17)
y = f(δ̄_w(x)), where δ̄_w(x) = δ_w(ν(x)) = ⋁_{i=1}^{n} (−x_i + w_i). (18)
Note that ε_w represents an erosion from the complete lattice R^n to the complete
lattice R in the special form of Equation 11. Similarly, δ_w represents a dilation
R^n → R in the special form of Equation 12. By Theorem 1, the composition of the
negation ν followed by the erosion ε_w yields an anti-erosion that we denote by
ε̄_w, and the composition of the negation ν followed by the dilation δ_w yields an
anti-dilation that we denote by δ̄_w.
The values of the morphological perceptron's weights must be determined before
it can act as a classifier. More precisely, the weights are determined using a
supervised learning algorithm [54] that constructs n-dimensional boxes around sets
of points which share the same class value. Convergence occurs in a finite number
of steps.
In this paper, we present an improved version of the original training algorithm
for MPs that was proposed to solve two-class classification problems [54]. To this
end, let us introduce some relevant notations.
The vectors x^1, x^2, . . . , x^k ∈ R^n denote the given training patterns. The set of
training patterns belonging to class 0 is denoted using the symbol C_0 and the set of
training patterns belonging to class 1 is denoted using the symbol C_1. We define the
following index sets:
Step 1 While D ≠ ∅:
p̌_i = sup{x_i < p̌_i : x ∈ C_0}. (25)
p̂_i = inf{x_i > p̂_i : x ∈ C_0}. (26)
(ii) Expand the hyperbox P that was obtained in item (a) by applying
Equations 25 and 26.
1.4 Determine the smallest hyperbox B = box(b̌, b̂) which contains all the
misclassified patterns of class 1 in the interior of P. Formally, we have
b̌_i = ⋀_{x^j ∈ P} x_i^j, i = 1, . . . , n, (27)
b̂_i = ⋁_{x^j ∈ P} x_i^j, i = 1, . . . , n, (28)
č_i = (b̌_i + p̌_i) / 2, (29)
ĉ_i = (b̂_i + p̂_i) / 2. (30)
Step 2 (Update the Architecture of the Morphological Perceptron). At the end of this
step, the patterns in the hyperbox C are assigned to class 1.
2.1 Update the function g as follows:
g(x) = g(x) ∨ [ ⋀_{i=1}^{n} (x_i − č_i) ∧ ⋀_{i=1}^{n} (ĉ_i − x_i) ]. (31)
2.2 Compute f(g(x)) for all x ∈ D, update the set D, and return to Step 1.
2.2 Compute f (g(x)) for all x D, update the set D, and return to Step 1.
Note that the function g determines a union of hyperboxes. An arbitrary input pattern
x is assigned to class 1 if and only if it lies in this union of hyperboxes. The term
⋀_{i=1}^{n} (x_i − č_i) ∧ ⋀_{i=1}^{n} (ĉ_i − x_i) of Equation 31 is non-negative if and only if x is
between the upper and lower vertices of C. The term ⋀_{i=1}^{n} (x_i − č_i) corresponds to
y = f( ⋁_{j=1}^{m} (ε_{v^j}(x) ∧ ε̄_{w^j}(x)) ) (32)

ε_v(x) ∧ ε̄_w(x) = ⋀_{i=1}^{n} (x_i + v_i) ∧ ⋀_{i=1}^{n} (−x_i + w_i) (33)

only requires one single morphological neuron or processing element with inputs
of the form (x_1, . . . , x_n, −x_1, . . . , −x_n) and weights in R^{2n}. Therefore, we associate only
one morphological neuron with the computation of Equation 33. In other words, the
number of hidden morphological neurons corresponds to the number of hyperboxes
that are generated during the learning phase.
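A hyperbox test of this kind can be sketched as a hidden-neuron activation, with class 1 assigned when the maximum over the hyperbox neurons is non-negative. The function names and the box encoding (lower vertex −v, upper vertex w, following the sign pattern of Equation 33) are assumptions for illustration:

```python
def hyperbox_neuron(v, w):
    """One hidden morphological neuron per hyperbox:
    min_i(x_i + v_i) ^ min_i(-x_i + w_i) is non-negative exactly when
    -v_i <= x_i <= w_i for every i, i.e. when x lies inside the box."""
    def activation(x):
        lower = min(xi + vi for xi, vi in zip(x, v))
        upper = min(wi - xi for xi, wi in zip(x, w))
        return min(lower, upper)
    return activation

def mp_classifier(boxes):
    """Assign class 1 iff x falls inside the union of the hyperboxes."""
    neurons = [hyperbox_neuron(v, w) for v, w in boxes]
    return lambda x: 1 if max(n(x) for n in neurons) >= 0 else 0

# A single unit box [0, 1] x [0, 1]: v = (0, 0), w = (1, 1)
classify = mp_classifier([((0.0, 0.0), (1.0, 1.0))])
# classify((0.5, 0.5)) -> 1 ; classify((2.0, 0.5)) -> 0
```

Each `hyperbox_neuron` plays the role of one hidden morphological neuron, and the outer maximum mirrors the supremum over the m hyperboxes in the output node.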
Suppose we have an S-class classification problem. If S > 2 then the MP approach
has to be adapted so as to be able to deal with multiple classes. Let C_s ⊆ {x^1, . . . , x^k}
denote the set of training patterns belonging to the sth class, where s = 1, . . . , S. For
each s = 1, . . . , S, we simply set C_1 = C_s and C_0 = ∪_{t≠s} C_t and apply the MP training
algorithm. This procedure generates weights v^j_s and w^j_s for every s = 1, . . . , S. Thus,
we obtain S MPs with threshold activation functions at their respective output nodes
132 P. Sussner and E.L. Esmi
of the form pictured in Figure 1(a). Removing the thresholds at the outputs yields S
MPs that are given by the following equations for $s = 1, \ldots, S$:

$$y_s = \bigvee_{j=1}^{m_s} \big( v^{sj}(x) \wedge w^{sj}(x) \big) \qquad (34)$$
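The one-versus-rest scheme with thresholds removed amounts to a competition among the S networks; a minimal sketch, where `mps[s]` is assumed to hold the hyperbox weight pairs (v, w) produced by the binary MP training for class s:

```python
def multiclass_mp_predict(x, mps):
    """Hypothetical decision rule for the S-class scheme: each (v, w) pair
    scores x via the term min_i(x_i + v_i) ^ min_i(-x_i + w_i) of Equation 34;
    the class whose network responds most strongly (Equation 34 without the
    output threshold) wins the competition."""
    def score(boxes):
        return max(min(min(xi + vi for xi, vi in zip(x, v)),
                       min(-xi + wi for xi, wi in zip(x, w)))
                   for v, w in boxes)
    return max(range(len(mps)), key=lambda s: score(mps[s]))
```

A pattern inside one class's hyperbox family obtains a non-negative score there and a negative score elsewhere, so the argmax reproduces the thresholded decision whenever the hyperboxes do not overlap.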
$$\tau_k(x) = p_k \bigwedge_{i=1}^{n} \bigwedge_{l \in L} (-1)^{l+1} (x_i + w_{ki}^l), \qquad (35)$$

where $L \subseteq \{0, 1\}$ corresponds to the set of terminal fibers on $N_i$ that synapse on the
k-th dendrite of N. After passing the value $\tau_k(x)$ to the cell body, the state of N is
given by $\bigwedge_{k=1}^{K} \tau_k(x)$, where K denotes the total number of dendrites of N. Finally,
an application of the hard limiting function f defined in Equation 22 yields the next
state of N, in other words the output y of the MPD depicted in Figure 1(b).
$$y = f\Big( \bigwedge_{k=1}^{K} \tau_k(x) \Big) = f\Big( \bigwedge_{k=1}^{K} p_k \bigwedge_{i=1}^{n} \bigwedge_{l \in L} (-1)^{l+1} (x_i + w_{ki}^l) \Big). \qquad (36)$$
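Equations 35 and 36 can be sketched directly; the dictionary-based weight indexing below is an illustrative convention, not the chapter's notation.

```python
def dendrite_response(x, p_k, weights_k):
    """Sketch of Equation 35 for one dendrite: weights_k maps a pair (i, l),
    with l in {0, 1}, to the weight w^l_{ki}; excitatory fibers (l = 1)
    contribute +(x_i + w), inhibitory fibers (l = 0) contribute -(x_i + w),
    matching the sign factor (-1)^(l+1); p_k is +1 or -1."""
    return p_k * min((-1) ** (l + 1) * (x[i] + w)
                     for (i, l), w in weights_k.items())

def mpd_output(x, dendrites):
    """Equation 36: the cell body takes the minimum over all K dendrites and
    a hard limiter f produces the binary output."""
    return 1 if min(dendrite_response(x, p, w) for p, w in dendrites) >= 0 else 0
```

A single dendrite with weights `{(0, 1): 0.0, (0, 0): -1.0}` recognizes the interval $[0, 1]$ in the first coordinate.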
Leaving the biological motivation aside, the MPD training algorithm that was
proposed for binary classification problems [38] resembles the one for MPs [54].
As is the case for MPs, the MPD training algorithm is guaranteed to converge in a
finite number of steps and, after convergence, all training patterns will be classified
correctly. Learning is based on the construction of n-dimensional hyperboxes. Given
plete because infima of arbitrary subsets of $I_L$ need not exist in $I_L$. There are, however, two closely related
complete lattices. The first one, called the complete lattice of the generalized in-
tervals, is denoted using the symbol $P_L$ and arises by leaving away the restriction
$a \leq b$. Formally, we have $P_L = \{[a, b] : a, b \in L\}$. If $0_L$ and $1_L$ denote the least el-
ement of L and the greatest element of L, respectively, then the least element of $P_L$
is given by $[1_L, 0_L]$ and the greatest element of $P_L$ is given by $[0_L, 1_L]$. The second
complete lattice of interest is denoted by $V_L$ and is given by adjoining $[1_L, 0_L]$ to
$I_L$. We obtain $V_L = I_L \cup \{[1_L, 0_L]\}$.
The FLNN model employs a fuzzy partial order relation $p : (V_L)^N \times (V_L)^N \to [0, 1]$.
In this context, the fuzzy partial order is of a special form. Specifically, the
fuzzy partial order relation employed in the FLNN is based on a function h or -
as we prefer to call it - a generating function [31]. Similar fuzzy lattice models use
fuzzy partial order relations based on positive valuation functions [5, 24, 23].
In applications of the FLNN to classification tasks such as the ones discussed in
Section 4, it suffices to consider - after an appropriate normalization - the complete
lattice U = [0, 1], i.e., the unit interval. Thus, the input and weight vectors are N-
dimensional hyperboxes in $(V_U)^N$. In this case, we can show that $p(\cdot, w) : (V_U)^N \to [0, 1]$
represents an elementary operation of mathematical morphology, namely both
an anti-erosion and an anti-dilation, if the underlying generating function $h : U \to \mathbb{R}$
is continuous. The proof of this result is beyond the scope of this paper since
it involves further details on fuzzy lattice neuro-computing models, in particular
the construction of a fuzzy partial order from a generating function. Therefore, we
postpone the proof to a future paper where we will show a more general result
that uses additional concepts of lattice theory. In any case, the fact that every node in
the category layer of an FLNN with a continuous generating function performs an
elementary operation of MM has led us to classify FLNNs as belonging to the class
of morphological neural networks.
FLNNs can be trained in supervised or unsupervised fashion [31, 24]. Both ver-
sions generate hyperboxes that determine the output of the FLNN. In this paper, we
focus on the supervised learning algorithm that is used in classification tasks. Here
we have a set of n training patterns $x^1, \ldots, x^n \in (V_L)^N$ together with their class la-
bels $c_1, \ldots, c_n \in \{1, \ldots, M\}$. Therefore, the basic architecture of the FLNN has to be
adapted so as to accommodate the class information. This can be achieved by allowing
for the storage of a class index in each node of the category layer, by augmenting
the input layer by one node that carries the class information ci corresponding to the
pattern xi during the training phase, and by fully interconnecting the two layers.
During the training phase, the FLNN successively constructs nodes in the cate-
gory layer, each of which is associated with an N-dimensional hyperbox $w^i$ - i.e., an
element of $(V_L)^N$ - together with its respective class label $c_i$. Training in the original
FLNN model is performed using an exhaustive search and takes $O(N^3)$ operations
[31]. (A modification of the FLNN called the fuzzy lattice reasoning (FLR) classifier
requires $O(N^2)$ operations for training if one renounces optimizing the outcome of training
[23].) After training, we obtain L N-dimensional hyperboxes $w^i \in (V_L)^N$ that cor-
respond to the L nodes that appear in the category layer of the FLNN (cf. Figure
2(b)). A class label $c_i \in \{1, \ldots, M\}$ is associated with each hyperbox $w^i \in (V_L)^N$.
In the testing phase, an input pattern $x \in (V_L)^N$ is presented to the FLNN and the
values $p(x, w^i)$ are computed for $i = 1, \ldots, L$. A competition takes place among the
L nodes in the category layer and the input pattern x is assigned to the class $c_i$ that
is associated with the hyperbox $w^i$ exhibiting the highest value $p(x, w^i)$. Informally
speaking, the degree of inclusion of x in $w^i$ is higher than the degree of inclusion
of x in $w^j$ for all $j \neq i$. In particular, if x is contained in $w^i$ but not contained in $w^j$
for all $j \neq i$, i.e., $p(x, w^i) = 1$ and $p(x, w^j) < 1$ for all $j \neq i$, then x is classified as
belonging to class $c_i$.
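The testing-phase competition can be sketched generically; `inclusion` below is an assumed stand-in for the fuzzy order $p(x, w)$ built from a generating function, and `toy_inclusion` is a purely illustrative measure, not the one used by the FLNN.

```python
def flnn_classify(x, boxes, labels, inclusion):
    """Sketch of the FLNN testing phase: the category node whose hyperbox
    yields the highest degree of inclusion wins the competition, and x is
    assigned that node's class label."""
    winner = max(range(len(boxes)), key=lambda i: inclusion(x, boxes[i]))
    return labels[winner]

def toy_inclusion(x, box):
    """Illustrative inclusion measure on boxes given as per-dimension
    intervals (a, b): equals 1.0 inside the box and decays with the distance
    to it outside."""
    d = sum(max(a - xi, 0.0) + max(xi - b, 0.0) for xi, (a, b) in zip(x, box))
    return 1.0 / (1.0 + d)
```

A pattern strictly inside exactly one hyperbox obtains inclusion 1 there and less than 1 elsewhere, matching the decision rule described above.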
Occasionally, the training algorithms for the FLNN and its modifications produce
overlapping hyperboxes with disparate class memberships although - according to
Kaburlasos et al. - this event occurs rarely [23]. In the experiments we conducted in
Section 4, this situation did in fact occur as evidenced by the decision surface that
is visualized in Figure 5. For more information, we refer the reader to Section 4.
4 Experimental Results
Table 1 Percentage of misclassified patterns for training (Etr ) and testing (Ete ) in the ex-
periments, CPU time in seconds for learning (Tcpu ) and number of hidden artificial neurons
(hyperboxes for MNNs) (Ha ).
Table 2 Percentage of misclassified patterns for training (Etr ) and testing (Ete ) in the ex-
periments, CPU time in seconds for learning (Tcpu ) and number of hidden artificial neurons
(hyperboxes for MNNs) (Ha ).
All the models and algorithms were implemented using MATLAB which favors
linear operations over morphological operations. We believe that the CPU times for
learning in the constructive MP, MPD, MP/C, and MPD/C models would be even
lower if more efficient implementations of the max-product and min-product were
used [33].
For a fair comparison of the number of artificial neurons or processing elements
in the constructive MNNs, we have implicitly expressed each individual model as a
feedforward model with one hidden layer and competitive output nodes and we have
counted the number of hidden nodes or hyperboxes that were constructed during the
learning phase. The same sequences of training patterns appearing on the respective
internet sites were employed for training the constructive MNNs [36, 6].
Ripley's synthetic dataset [36] consists of data samples from two classes [35, 36].
Each sample has two features. The data are divided into a training set and a test set
consisting of 250 and 1000 samples, respectively, with the same number of sam-
ples belonging to each of the two classes. Thus, we obtain a binary classification
problem in $\mathbb{R}^2$. Figures 3, 4, and 5 provide more insight into the constructive
MNNs by visualizing the decision surfaces that are generated by these models after
training (Figure 5 also includes the decision surface corresponding to an MLP with
ten hidden nodes). Here, we have used the same order in which the training patterns
appear on Ripley's internet site. We would like to clarify that the decision surfaces
vary slightly depending on the order in which the training patterns are presented to
the constructive morphological models.
Recall that the decision surfaces of the constructive morphological models are
determined by N-dimensional hyperboxes, i.e., rectangles for N = 2. This fact is
clearly visible in the decision surfaces of the MP and the MPD with hard-limiting
output units, which are pictured by means of the continuous lines in Figures 3 and 4.
In addition, Figures 3, 4, and 5 reveal that the decision surfaces generated by
the MP/C, MPD/C, and FLNN models deviate from the rectangular appearance of the
ones generated by the basic MP and MPD models. In this context, recall that the
MP/C and MPD/C models construct separate families of hyperboxes for the training
patterns of each class. Each family of hyperboxes is associated with a different class
and corresponds to a certain output node. Upon presentation of an input pattern x
to the MP/C or MPD/C model, a competition among the output nodes occurs that
determines the result of classification.
In the FLNN model a similar competition occurs in the category layer. More
precisely, the FLNN uses information on the degrees of inclusion $p(x, w^i)$ of an
input pattern x in the hyperboxes $w^i$ for classification by associating x with the class
of the hyperbox $w^i$ in which x exhibits the highest degree of inclusion. This property
of the FLNN is evidenced by the diagonal lines in its decision surface (cf. Figure 5).
The training algorithms of all types of constructive MNNs are guaranteed to
produce decision surfaces that perfectly separate the training patterns with differ-
ent class labels. However, the training algorithms of the MP/C, the MPD/C, and
the FLNN may result in overlapping hyperboxes with distinct class memberships
although this event did not happen in our simulations with Ripley's synthetic dataset
when using the MP/C and MPD/C models. Concerning the FLNN, Figure 5 depicts
intersections of rectangles with distinct class labels using a darker shade of gray.
These sets of intersection correspond to regions of indecision since a pattern x that
is contained in both $w^i$ and $w^j$ with $c_i \neq c_j$ satisfies $p(x, w^i) = 1 = p(x, w^j)$.
Table 1 exhibits the results that we obtained concerning the classification per-
formance and the computational effort required by the individual models. The MP
training algorithm described in Section 3.1.1 automatically generated 19 hyperboxes
corresponding to 19 (augmented) hidden neurons capable of evaluating Equation 33.
In a similar manner, training an MPD using the constructive algorithm of Ritter and
Urcid [38] yielded 19 dendrites that correspond to 19 hidden computational units.
In contrast to the basic MP and MPD models, the MP/C and MPD/C generate one
family of hyperboxes for each class of training patterns. Since Ripley's synthetic
problem represents a binary classification problem, the number of hidden compu-
tational units in the MP/C and MPD/C models is approximately twice as high as
in the MP and MPD. The FLNN grew 46 neurons in the category layer during the
training phase. Moreover, we compared the morphological models with an MLP
with ten hidden nodes that was trained using gradient descent with momentum and
adaptive step backpropagation rule (learning rate = $10^{-4}$, increase and decrease
parameters 1.05 and 0.5, respectively, momentum factor = 0.9). In addition, we
used 25-fold cross-validation in conjunction with the MLP and chose the weights
that led to the least validation error.
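The quoted hyperparameters suggest the usual adaptive-step heuristic; the sketch below is a hypothetical reading of that rule (grow the rate after a successful epoch, shrink it otherwise), not the exact MATLAB routine used in the experiments.

```python
def adapt_step(lr, prev_error, new_error, inc=1.05, dec=0.5):
    """Hypothetical adaptive learning-rate update matching the quoted
    parameters: multiply by `inc` when the epoch error did not increase,
    and by `dec` (typically also rejecting the step) when it did."""
    return lr * inc if new_error <= prev_error else lr * dec
```

Starting from the quoted rate of $10^{-4}$, a successful epoch raises it to $1.05 \times 10^{-4}$, while a failed one halves it.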
Table 1 reveals that the MP and MPD models including their variants with com-
petitive nodes converge rapidly to a set of weights that yield perfect separation of
the training data. The FLNN model also produces no training error but the conver-
gence of the training algorithm is slower yet not quite as slow as MLP training. All
the models we tested exhibited satisfactory results in classification. The MPD yields
the highest classification error for testing which is due to the fact that, in contrast
to the MP training algorithm, no expansion of the hyperboxes corresponding to C1
patterns takes place in MPD training (this is why the MPD learns faster than the
MP). This lack of expansion does not cause any problems if competing hyperboxes
are constructed for patterns of both classes as is the case for the MPD/C.
Fig. 5 Decision surfaces of an FLNN represented by the difference in shading (dark regions
refer to areas of uncertainty where hyperboxes with different class labels overlap) and of an
MLP represented by the continuous line.
As the reader may recall by taking a brief glance at Figure 5, the FLNN produces
areas of indecision, i.e., overlapping hyperboxes with distinct class memberships.
If these areas of indecision are assigned to either one of the two classes, the per-
centages of misclassification for testing are 11.40% and 12.00%, respectively. Oth-
erwise, if no decision is taken then we obtain a classification error of 14.2%. In any
case, the MP/C model exhibits the best classification performance.
with respect to the region of indecision was taken. The classification error can be
lowered to 9.6% by associating a pattern x contained in the region of indecision with
the first class label having value 1. The MPD/C and MP/C also exhibit a better clas-
sification performance than the MLP. The high number of hidden neurons grown by
the MP/C is probably due to the provisions that were taken to circumvent problems
of convergence of the training algorithm. We suspect that these problems are caused
by integer-valued attributes of training patterns with different class labels.
5 Conclusions
This paper provides an overview and a comparison of morphological neural net-
works (MNNs) for pattern recognition with an emphasis on constructive MNNs,
which automatically grow hidden neurons during the training phase. We have de-
fined MNNs as models of artificial neural networks that perform an elementary op-
eration of mathematical morphology at every node followed by the application of an
activation function. The elementary morphological operations of erosion, dilation,
anti-erosion, and anti-dilation can be defined in an arbitrary complete lattice.
In many cases, the underlying complete lattice of choice is $\mathbb{R}^n$, which has allowed
researchers to formulate morphological neurons (implicitly) in terms of the additive
maximum and additive minimum operations in the bounded lattice ordered group
$(\mathbb{R}_{\pm\infty}, \vee, \wedge, +, +')$ - often without being aware of this connection to minimax algebra
[13, 51]. In this setting, the elementary morphological operations can be expressed
in terms of maximums (or minimums) of sums, which leads to fast neural computations
and easy hardware implementations [33, 38].
As to the resulting models of morphological neurons, recent research results have
revealed that the maximum operation lying at the core of morphological neurons is
neurobiologically plausible [58]. We have to admit though that there is no neuro-
physiological justification for summing the inputs and the synaptic weights. This
lack of neurobiological plausibility can be overcome by means of the isomorphism
between the algebraic structures $(\mathbb{R}_{\pm\infty}, \vee, \wedge, +, +')$ and $(\mathbb{R}_{\geq 0}, \vee, \wedge, \times, \times')$ that transforms
additive maximum/minimum operations into multiplicative maximum/minimum
operations [52]. In this context, we intend to investigate the connections between
MNNs and min-max or adaptive logic networks that combine linear operations with
minimums and maximums [2].
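The isomorphism in question can be illustrated with the exponential map, which is order-preserving and turns sums into products; this is a minimal sketch of the idea, not the formal construction of [52].

```python
import math

def to_multiplicative(a):
    """Map the additive structure on the reals to a multiplicative one on the
    positive reals via the exponential; exp is strictly increasing, so it
    carries maxima to maxima and minima to minima."""
    return math.exp(a)

# The additive maximum of sums max_i(x_i + w_i) is carried to the
# multiplicative maximum of products max_i(x'_i * w'_i):
xs, ws = [1.0, -0.5], [2.0, 3.0]
lhs = to_multiplicative(max(x + w for x, w in zip(xs, ws)))
rhs = max(to_multiplicative(x) * to_multiplicative(w) for x, w in zip(xs, ws))
assert math.isclose(lhs, rhs)
```

In this way an additive max-of-sums neuron becomes a multiplicative max-of-products neuron, which is the form argued to be more biologically plausible.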
In this paper, we have related another lattice-based neuro-computing model,
namely the fuzzy lattice neural network (FLNN), to MNNs. Specifically, we ex-
plained that the FLNN can be considered to be one of the constructive MNN models.
Morphological perceptrons (MPs) and morphological perceptrons with dendrites
(MPDs) also belong to the class of constructive MNN models. In this paper, we
introduced a modified MP training algorithm that only requires a finite number of
epochs to converge, resulting in a decision surface that perfectly separates the train-
ing data. Furthermore, incorporating competitive neurons into the MP and MPD
models led to the MP/C and MPD/C models that can be trained using extensions
of the MP and MPD training algorithms. Further research has to be conducted to
devise more efficient training algorithms for these new morphological models.
Finally, this article has empirically demonstrated the effectiveness of constructive
morphological models in simulations with two well-known datasets for classification
[36, 6] by analyzing and comparing the error rates and the computational effort for
learning. In general, the constructive morphological models exhibited very satisfac-
tory classification results and - except for the FLNN - extremely fast convergence of
the training algorithms. On one hand, the constructive morphological models often
require more artificial neurons or computational units than conventional models. On
the other hand, morphological neural computations based on max-products or min-
products are much less complicated than the usual semi-linear neural computations.
Acknowledgements. This work was supported by CNPq under grant no. 306040/2006-9
and by FAPESP under grant no. 2006/05868-5.
References
1. Araujo, R.A., Madeiro, R., Sousa, R.P., Pessoa, L.F.C., Ferreira, T.A.E.: An evolutionary morphological approach for financial time series forecasting. In: Proceedings of the IEEE Congress on Evolutionary Computation, Vancouver, Canada, pp. 2467-2474 (2006)
2. Armstrong, W.W., Thomas, M.M.: Adaptive Logic Networks. In: Fiesler, E., Beale, R. (eds.) Handbook of Neural Computation, vol. C1.8, pp. 1-14. IOP Publishing, Oxford University Press, Oxford, United Kingdom (1997)
3. Araujo, R.A., Madeiro, F., Pessoa, L.F.C.: Modular morphological neural network training via adaptive genetic algorithm for designing translation invariant operators. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, France (May 2006)
4. Banon, G., Barrera, J.: Decomposition of Mappings between Complete Lattices by Mathematical Morphology, Part 1, General Lattices. Signal Processing 30(3), 299-327 (1993)
5. Birkhoff, G.: Lattice Theory, 3rd edn. American Mathematical Society, Providence (1993)
6. Asuncion, A., Newman, D.J.: UCI Repository of machine learning databases. University of California, Irvine, CA, School of Information and Computer Sciences (2007), http://www.ics.uci.edu/mlearn/MLRepository.html
7. Braga-Neto, U., Goutsias, J.: Supremal multiscale signal analysis. SIAM Journal of Mathematical Analysis 36(1), 94-120 (2004)
8. Carpenter, G.A., Grossberg, S.: ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3, 129-152 (1990)
9. Carre, B.: An algebra for network routing problems. J. Inst. Math. Appl. 7, 273-294 (1971)
10. Coskun, N., Yildirim, T.: The effects of training algorithms in MLP network on image classification. Neural Networks 2, 1223-1226 (2003)
11. Cuninghame-Green, R.: Minimax Algebra and Applications. In: Hawkes, P. (ed.) Advances in Imaging and Electron Physics, vol. 90, pp. 1-121. Academic Press, New York (1995)
12. Cuninghame-Green, R., Meijer, P.F.J.: An Algebra for Piecewise-Linear Minimax Problems. Discrete Applied Mathematics 2(4), 267-286 (1980)
13. Cuninghame-Green, R.: Minimax Algebra. Lecture Notes in Economics and Mathematical Systems, vol. 166. Springer, New York (1979)
14. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order. Cambridge University Press, Cambridge (2002)
15. Gader, P.D., Khabou, M., Koldobsky, A.: Morphological Regularization Neural Networks. Pattern Recognition, Special Issue on Mathematical Morphology and Its Applications 33(6), 935-945 (2000)
16. Giffler, B.: Mathematical solution of production planning and scheduling problems. IBM ASDD, Tech. Rep. (1960)
17. Grana, M., Gallego, J., Torrealdea, F.J., D'Anjou, A.: On the application of associative morphological memories to hyperspectral image analysis. In: Mira, J., Alvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2687, pp. 567-574. Springer, Heidelberg (2003)
18. Gratzer, G.A.: Lattice Theory: First Concepts and Distributive Lattices. W. H. Freeman, San Francisco (1971)
19. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision, vol. 1. Addison-Wesley, New York (1992)
20. Haralick, R.M., Sternberg, S.R., Zhuang, X.: Image Analysis Using Mathematical Morphology: Part I. IEEE Transactions on Pattern Analysis and Machine Intelligence 9(4), 532-550 (1987)
21. Heijmans, H.: Morphological Image Operators. Academic Press, New York (1994)
22. Hocaoglu, A.K., Gader, P.D.: Domain learning using Choquet integral-based morphological shared weight neural networks. Image and Vision Computing 21(7), 663-673 (2003)
23. Kaburlasos, V.G., Athanasiadis, I.N., Mitkas, P.A.: Fuzzy lattice reasoning (FLR) classifier and its application for ambient ozone estimation. International Journal of Approximate Reasoning 45(1), 152-188 (2007)
24. Kaburlasos, V.G., Petridis, V.: Fuzzy lattice neurocomputing (FLN) models. Neural Networks 13, 1145-1170 (2000)
25. Khabou, M.A., Gader, P.D.: Automatic target detection using entropy optimized shared-weight neural networks. IEEE Transactions on Neural Networks 11(1), 186-193 (2000)
26. Kim, C.: Segmenting a low-depth-of-field image using morphological filters and region merging. IEEE Transactions on Image Processing 14(10), 1503-1511 (2005)
27. Kwok, J.T.: Moderating the Outputs of Support Vector Machine Classifiers. IEEE Transactions on Neural Networks 10(5), 1018-1031 (1999)
28. Matheron, G.: Random Sets and Integral Geometry. Wiley, New York (1975)
29. Monteiro, A.S., Sussner, P.: A Brief Review and Comparison of Feedforward Morphological Neural Networks with Applications to Classification. In: Kurkova, V., Neruda, R., Koutnik, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 783-792. Springer, Heidelberg (2008)
30. Pessoa, L.F.C., Maragos, P.: Neural networks with hybrid morphological/rank/linear nodes: a unifying framework with applications to handwritten character recognition. Pattern Recognition 33, 945-960 (2000)
31. Petridis, V., Kaburlasos, V.G.: Fuzzy lattice neural network (FLNN): a hybrid model for learning. IEEE Transactions on Neural Networks 9(5), 877-890 (1998)
32. Pitas, I., Venetsanopoulos, A.N.: Morphological Shape Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(1), 38-45 (1990)
33. Porter, R., Harvey, N., Perkins, S., Theiler, J., Brumby, S., Bloch, J., Gokhale, M., Szymanski, J.: Optimizing digital hardware perceptrons for multispectral image classification. Journal of Mathematical Imaging and Vision 19(2), 133-150 (2003)
34. Raducanu, B., Grana, M., Albizuri, X.F.: Morphological Scale Spaces and Associative Morphological Memories: Results on Robustness and Practical Applications. Journal of Mathematical Imaging and Vision 19(2), 113-131 (2003)
35. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)
36. Ripley, B.D.: Datasets for Pattern Recognition and Neural Networks, Cambridge, United Kingdom (1996), http://www.stats.ox.ac.uk/pub/PRNN/
37. Ritter, G.X., Iancu, L., Urcid, G.: Morphological perceptrons with dendritic structure. In: The 12th IEEE International Conference on Fuzzy Systems, May 2003, vol. 2, pp. 1296-1301 (2003)
38. Ritter, G.X., Urcid, G.: Lattice Algebra Approach to Single-Neuron Computation. IEEE Transactions on Neural Networks 14(2), 282-295 (2003)
39. Ritter, G.X., Wilson, J.N.: Handbook of Computer Vision Algorithms in Image Algebra, 2nd edn. CRC Press, Boca Raton (2001)
40. Ritter, G.X., Sussner, P.: Morphological Perceptrons. In: Intelligent Systems and Semiotics, Gaithersburg, Maryland (1997)
41. Ritter, G.X., Wilson, J.N., Davidson, J.L.: Image Algebra: An Overview. Computer Vision, Graphics, and Image Processing 49(3), 297-331 (1990)
42. Ronse, C.: Why Mathematical Morphology Needs Complete Lattices. Signal Processing 21(2), 129-154 (1990)
43. Segev, I.: Dendritic processing. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge (1998)
44. Serra, J.: Image Analysis and Mathematical Morphology, vol. 2: Theoretical Advances. Academic Press, New York (1988)
45. Serra, J.: Image Analysis and Mathematical Morphology. Academic Press, London (1982)
46. Khabou, M.A., Gader, P.D., Keller, J.M.: LADAR target detection using morphological shared-weight neural networks. Machine Vision and Applications 11(6), 300-305 (2000)
47. Sobania, A., Evans, J.P.O.: Morphological corner detector using paired triangular structuring elements. Pattern Recognition 38(7), 1087-1098 (2005)
48. Soille, P.: Morphological Image Analysis. Springer, Berlin (1999)
49. Sousa, R.P., Pessoa, L.F.C., Carvalho, J.M.: Designing translation invariant operations via neural network training. In: Proceedings of the IEEE International Conference on Image Processing, Vancouver, Canada, pp. 908-911 (2000)
50. Sussner, P., Valle, M.E.: Morphological and Certain Fuzzy Morphological Associative Memories with Applications in Classification and Prediction. In: Kaburlasos, V.G., Ritter, G.X. (eds.) Computational Intelligence Based on Lattice Theory. Studies in Computational Intelligence, vol. 67, pp. 149-173. Springer, Heidelberg (2007)
51. Sussner, P., Valle, M.E.: Gray Scale Morphological Associative Memories. IEEE Transactions on Neural Networks 17(3), 559-570 (2006)
52. Sussner, P., Valle, M.E.: Implicative Fuzzy Associative Memories. IEEE Transactions on Fuzzy Systems 14(6), 793-807 (2006)
53. Sussner, P., Grana, M.: Guest Editorial: Special Issue on Morphological Neural Networks. Journal of Mathematical Imaging and Vision 19(2), 79-80 (2003)
54. Sussner, P.: Morphological Perceptron Learning. In: Proceedings of the IEEE ISIC/CIRA/ISAS Joint Conference, Gaithersburg, MD, September 1998, pp. 477-482 (1998)
55. Valle, M.E., Sussner, P.: A General Framework for Fuzzy Morphological Associative Memories. Fuzzy Sets and Systems 159(7), 747-768 (2008)
56. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(3), 338-353 (1965)
57. Zimmermann, U.: Linear and Combinatorial Optimization in Ordered Algebraic Structures. North-Holland, Amsterdam (1981)
58. Yu, A.J., Giese, M.A., Poggio, T.: Biophysiologically Plausible Implementations of the Maximum Operation. Neural Computation 14(12), 2857-2881 (2002)
A Feedforward Constructive Neural Network
Algorithm for Multiclass Tasks Based on Linear
Separability
1 Introduction
There are many different methods that allow the automatic learning of concepts,
as can be seen in [1] and [2]. One particular class of relevant machine learning
methods is based on the concept of linear separability (LS).
The concept of linear separability permeates many areas of knowledge and
based on the definition given in [3] it can be stated as: Let E be a finite set of N
distinct patterns {E1, E2, ..., EN}, each pattern Ei (1 ≤ i ≤ N) described as Ei =
(x1, ..., xk), where k is the number of attributes that defines a pattern. Let the pat-
terns of E be classified in such a way that each pattern in E belongs to only one of
the M classes Cj (1 ≤ j ≤ M). This classification divides the set of patterns E into
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 145-169.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
146 J.R. Bertini Jr. and M. do Carmo Nicoletti
the subsets EC1, EC2, ..., ECM, such that each pattern in ECi belongs to class Ci,
for i = 1, ..., M. If a linear machine can classify the patterns in E into the proper
class, the classification of E is a linear classification and the subsets EC1, EC2, ...,
ECM are linearly separable. Stated another way, a classification of E is linear and
the subsets EC1, EC2, ..., ECM are linearly separable if and only if linear dis-
criminant functions g1, g2, ..., gM exist such that gi(E) > gj(E) for every pattern
E in ECi and every j ≠ i.
Since the decision regions of a linear machine are convex, if the subsets EC1,
EC2, ..., ECM are linearly separable, then each pair of subsets ECi, ECj, i, j = 1, ...,
M, i ≠ j, is also linearly separable. That is, if EC1, EC2, ..., ECM are linearly sepa-
rable, then EC1, EC2, ..., ECM are also pairwise linearly separable.
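Two-class linear separability can be tested empirically; the sketch below uses the classical perceptron rule as an assumption for illustration (the chapter's own TLU training relies on PRM and BCP instead), and the epoch cap makes it a heuristic rather than a proof.

```python
def linearly_separable(X1, X2, epochs=1000, lr=0.1):
    """Rough separability test via perceptron training: the perceptron
    converges in finitely many updates iff the two classes are linearly
    separable, so reaching an error-free epoch certifies separability."""
    n = len(X1[0])
    w = [0.0] * (n + 1)                      # last entry is the bias
    data = [(x, 1) for x in X1] + [(x, -1) for x in X2]
    for _ in range(epochs):
        errors = 0
        for x, t in data:
            s = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if s * t <= 0:                   # misclassified (or on boundary)
                w = [wi + lr * t * xi for wi, xi in zip(w, x)] + [w[-1] + lr * t]
                errors += 1
        if errors == 0:
            return True
    return False
```

Applied to every pair ECi, ECj, this checks the pairwise separability stated above; the XOR configuration is the standard counterexample.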
According to Elizondo [4], linear-separability-based learning methods can
be divided into four groups. Depending on their main focus they may be based
on linear programming, computational geometry, neural networks or quadratic
programming.
This chapter describes a new neural network algorithm named MBabCoNN
(Multiclass Barycentric-based Constructive Neural Network) suitable for multi-
class classification problems. The algorithm incrementally constructs a neural
network by adding hidden nodes that linearly separate sub-regions of the feature
space. It can be considered a multiclass version of the two-class CoNN named
BabCoNN (Barycentric-based Constructive Neural Network) proposed in [5].
The chapter is an extended version of an earlier paper [6] and is organized as fol-
lows. Section 2 stresses the importance of CoNN algorithms and discusses the role
played by the algorithm used for training individual Threshold Logic Units (TLU),
particularly focusing on the BCP algorithm [7]. Section 3 highlights the main char-
acteristics of the five well-known CoNN multiclass algorithms used in the empirical
experiments described in Section 5. Section 4 initially outlines the basic features of
the two-class BabCoNN algorithm briefly presenting the main concepts and strate-
gies used by BabCoNN when learning and classifying and then presents a detailed de-
scription of the multiclass MBabCoNN algorithm divided into two parts: learning
the neural network and using the network learnt for classifying previously unseen
patterns. Section 5 presents and discusses the results of 16 algorithms; four of them
are versions of PRM and BCP for multiclass tasks and the other 12 are variants of
the basic multiclass algorithms used, namely: MTower, MPyramid, MUpstart,
MTiling, MPerceptron-Cascade and MBabCoNN in eight knowledge domains from
the UCI Repository [8]. The Conclusion section ends the chapter by presenting a
summary of the main results highlighting a few possible research lines to investi-
gate, aiming at improving the MBabCoNN algorithm.
Unlike conventional algorithms, which require the network architecture to be fixed before training
can begin, constructive neural network (CoNN) algorithms allow the network ar-
chitecture to be constructed simultaneously with the learning process; both sub-
processes, learning and constructing the network, are interdependent.
Constructive neural network (CoNN) algorithms do not assume fixed network
architecture before training begins. The main characteristic of a CoNN algorithm
is the dynamic construction of the network's hidden layer(s), which occurs simul-
taneously with training. A description of a few well-known CoNN algorithms can
be found in [9] and [10]; among the most well-known are: Tower and Pyramid
[11], Tiling [12], Upstart [13], Perceptron-Cascade [14], Pti and Shift [15].
$$b_1 = \frac{\sum_{i=1}^{k_1} \alpha_i \, E1_i}{\sum_{i=1}^{k_1} \alpha_i} \quad \text{and} \quad b_2 = \frac{\sum_{i=1}^{k_2} \alpha_i \, E2_i}{\sum_{i=1}^{k_2} \alpha_i} \qquad (1)$$
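The barycenters of Equation 1 can be computed directly; since the coefficient symbol did not survive extraction, the sketch below takes the pattern weights as an explicit argument and assumes uniform weights by default.

```python
def barycenter(patterns, weights=None):
    """Weighted barycenter of a set of k patterns, as in Equation 1: the
    componentwise weighted sum divided by the sum of the weights."""
    k = len(patterns)
    if weights is None:
        weights = [1.0] * k                  # assumed default: uniform weights
    total = sum(weights)
    n = len(patterns[0])
    return [sum(weights[j] * patterns[j][i] for j in range(k)) / total
            for i in range(n)]
```

Computing `barycenter` once per class yields the two reference points b1 and b2 used by the BabCoNN family.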
The difference between the Tower and Pyramid networks (and their M-class versions) lies in the connections. In a Pyramid network each newly added
hidden neuron has connections with all the previously added hidden ones as well
as with the input neurons.
The two-class Upstart algorithm [13] constructs the neural network as a binary
tree of TLUs and it is governed by the addition of new hidden neurons, specialized
in correcting wrongly-on or wrongly-off errors made by the previously added neu-
rons. A natural extension of this algorithm for multiclass tasks would be an algo-
rithm that constructs M binary trees, each one responsible for the learning of one
of the M classes found in the training set. This approach, however, would not take
into account a possible relationship that might exist between the M different
classes. The MUpstart proposal, described in [23], tries to circumvent the problem
by grouping the created hidden neurons in a single hidden layer. Each hidden neu-
ron is created aiming at correcting the most frequent error (wrongly-on or
wrongly-off) committed by a single neuron among the M output neurons. The hid-
den neurons are trained with patterns labeled with two classes only and they can
fire 0 or 1. Each hidden neuron is directly connected to every neuron in the output
layer. The input layer is connected to the hidden neurons as well as to the output
neurons.
The Tiling algorithm [12] constructs a neural network where hidden nodes are
added to a layer in a similar fashion to laying tiles. Each hidden layer in a Tiling
network has a master neuron and a few ancillary neurons. The output layer has
only one master neuron. Tiling constructs a neural network in successive layers
such that each new layer has a smaller number of neurons than the previous layer.
Similarly to this approach, the MTiling method, as proposed in [24], constructs a
multi-layer neural network where the first hidden layer has connections to the in-
put layer and each subsequent hidden layer has connections only to the previous
hidden layer. Each layer has master and ancillary neurons with the same functions
they perform in a Tiling network i.e., the master neurons are responsible for classi-
fying the training patterns and the ancillary ones are responsible for making the
layer faithful. The role played by the ancillary neurons in a hidden layer is to
guarantee that the layer does not produce the same output for any two training pat-
terns belonging to different classes. In the MTiling version the process of adding a
new layer is very similar to the one implemented by Tiling. However, while the
Tiling algorithm adds only one master neuron per layer, the MTiling adds M mas-
ter neurons (where M is the number of different classes in the training set).
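The faithfulness requirement enforced by the ancillary neurons can be stated as a simple check: a layer is faithful when no two training patterns belonging to different classes produce the same vector of layer outputs. A minimal sketch (the function name and the data are ours, for illustration only):

```python
def is_faithful(layer_outputs, labels):
    """A layer is faithful when no two training patterns from different
    classes produce the same vector of layer outputs."""
    seen = {}  # maps an output vector to the class that produced it
    for out, label in zip(layer_outputs, labels):
        key = tuple(out)
        if key in seen and seen[key] != label:
            return False   # two different classes collide on one output
        seen[key] = label
    return True

# Outputs of a two-neuron layer (master + one ancillary) for four patterns;
# the only repeated output, (1, 1), comes from the same class, so it is faithful.
outs = [(1, 1), (1, -1), (-1, 1), (1, 1)]
labels = ["a", "b", "c", "a"]
print(is_faithful(outs, labels))
```

When the check fails, Tiling-style algorithms add another ancillary neuron trained to split the colliding patterns, and repeat until the layer is faithful.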
The Perceptron Cascade algorithm [14] is a neural constructive algorithm that
constructs a neural network with an architecture resembling the one constructed
by the Cascade Correlation algorithm [25] and it uses the same approach for
correcting the errors adopted by the Upstart algorithm [13]. Unlike the Cascade Correlation, however, the Perceptron Cascade uses the Perceptron (or any of its variants) for training individual TLUs. Like the Upstart algorithm, the Perceptron
Cascade starts the construction of the network by training the output neuron and
hidden neurons are added to the network similarly to the process adopted by the
Cascade Correlation: each new neuron is connected to both the output and input
neurons and has connections with all hidden neurons previously added to the net-
work. The MPerceptron-Cascade version, proposed in [22], is very similar to the
A Feedforward Constructive Neural Network Algorithm 151
MUpstart described earlier in this section; the main difference between them lies in the neural network architecture each one induces: the MPerceptron-Cascade adds the new hidden neurons in new layers while the MUpstart adds them in one single layer.
representing the hidden neurons. The function bcp( ) stands for the BCP algo-
rithm, used for training individual neurons. The function removeClassifiedPat-
terns( ) removes from the training set the patterns that were correctly classified
by the last added neuron and bothClasses( ) is a Boolean function that returns
true if the current training set still has patterns belonging to both classes and
false otherwise.
Due to the way the learning phase is conducted by BabCoNN, each hidden neu-
ron of the network is trained using patterns belonging to a region of the training
space (i.e., the one defined by the patterns that were misclassified by the previous
hidden neuron added to the network). This particular aspect of the algorithm has
the effect of introducing an undesirable redundancy, in the sense that a pattern
may be correctly classified by more than one hidden neuron. This has been sorted
out by implementing a classification process where the neurons of the hidden layer
have a particular way of firing their output.
procedure BabCoNN_learner(E)
begin
  output ← bcp(E)
  nE ← removeClassifiedPatterns(E)
  h ← 0
  while bothClasses(nE) do
  begin
    h ← h + 1
    hiddenLayer[h] ← bcp(nE)
    nE ← removeClassifiedPatterns(nE)
  end
end procedure.
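The pseudocode above can be transcribed almost line by line. In the sketch below a plain perceptron stands in for the BCP procedure that BabCoNN uses to train each TLU, and the helper names mirror the pseudocode; it is an illustration under that substitution, not the authors' implementation:

```python
import numpy as np

def bcp_stub(patterns, labels, epochs=100):
    """Stand-in for bcp(): trains one TLU. A perceptron is used here only
    for illustration; BabCoNN proper trains each TLU with BCP."""
    X = np.asarray(patterns, dtype=float)
    y = np.asarray(labels)                 # labels in {+1, -1}
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:     # misclassified: perceptron update
                w += yi * xi
                b += yi
    return w, b

def classify(neuron, x):
    w, b = neuron
    return 1 if np.dot(x, w) + b > 0 else -1

def remove_classified(neuron, patterns, labels):
    """removeClassifiedPatterns(): keep only the misclassified patterns."""
    kept = [(p, l) for p, l in zip(patterns, labels) if classify(neuron, p) != l]
    return [p for p, _ in kept], [l for _, l in kept]

def babconn_learner(patterns, labels):
    output = bcp_stub(patterns, labels)                  # output neuron
    pats, labs = remove_classified(output, patterns, labels)
    hidden = []
    while len(set(labs)) == 2:                           # bothClasses(nE)
        neuron = bcp_stub(pats, labs)
        hidden.append(neuron)
        pats, labs = remove_classified(neuron, pats, labs)
    return output, hidden

# Usage: an AND-like, linearly separable problem needs no hidden neurons.
pats = [(0, 0), (0, 1), (1, 0), (1, 1)]
labs = [-1, -1, -1, 1]
out, hidden = babconn_learner(pats, labs)
```

Each hidden neuron is thus trained only on the region of the training space left misclassified by its predecessor, exactly the property the text discusses next.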
both, W and the bias. For each class, the region is defined as the hypersphere whose radius is given by the largest distance from the corresponding barycenter to any correctly classified pattern of that class.
To exemplify how a hidden neuron behaves during the classification phase, let each Yi = (yi1, yi2), i ∈ {1, 2, 3, 4} be a given pattern to be classified. As can be seen in Figure 2, four situations may occur:
(1) The new pattern (Y1) is in the positive classification region of the hidden neuron. The pattern Y1 is classified as positive by the neuron, which fires +1;
(2) The new pattern (Y2) is in the positive region, but lying on the other side of the hyperplane; this would make the neuron classify Y2 as negative. However, the neuron will fire the value 0 since there is no guarantee that the pattern is negative;
(3) The new pattern (Y3) is not part of any region; in this case the neuron fires the value 0 independently of the classification given by the hyperplane it represents;
(4) The new pattern (Y4) is in the negative classification region of the hidden neuron. The pattern Y4 is classified as negative and the neuron fires the value −1. Note that the regions may overlap with each other and, eventually, a pattern may lie in both regions. When that happens, the hidden neuron (as implemented by the version used in the experiments described in Section 5) assigns the pattern the class given by the hyperplane.
Fig. 2 A hidden BabCoNN neuron during classification: the hyperplane H, defined by W, separates the positive and negative training patterns; b1 and b2 are the barycenters of the positive and negative regions; the new patterns Y1, Y2, Y3 and Y4 illustrate the four situations described above.
procedure BabCoNN_classifier(X)
{X is the pattern to be classified}
begin
  for i ← 1 to h do
  begin
    C ← classification(hiddenLayer[i], X)
    Bp ← belongsToPositive(hiddenLayer[i], X)
    Bn ← belongsToNegative(hiddenLayer[i], X)
    if (C = 1 and Bp) then Hlc[i] ← 1
    else if (C = −1 and Bn) then Hlc[i] ← −1
    else Hlc[i] ← 0
  end
  sum ← 0
  for j ← 1 to h do
    sum ← Hlc[j] + sum
  if sum ≠ 0 then return sum / |sum|
  else return classification(output, X)
end procedure.
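The firing rule of a hidden BabCoNN neuron described in situations (1) to (4) can be sketched as follows; the region test is a hypersphere check around each barycenter, and the helper names are ours:

```python
import math

def hidden_fire(x, w, bias, pos_region, neg_region):
    """pos_region and neg_region are (barycenter, radius) pairs.
    Fires +1 or -1 only when the hyperplane decision agrees with the
    region the pattern falls in; otherwise fires 0."""
    c = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else -1
    in_pos = math.dist(x, pos_region[0]) <= pos_region[1]
    in_neg = math.dist(x, neg_region[0]) <= neg_region[1]
    if in_pos and in_neg:      # overlapping regions: trust the hyperplane
        return c
    if c == 1 and in_pos:      # situation (1)
        return 1
    if c == -1 and in_neg:     # situation (4)
        return -1
    return 0                   # situations (2) and (3)

# Example: hyperplane x1 = 0 (positive side x1 > 0), a positive region
# around (2, 0) and a negative region around (-5, 0):
w, bias = (1.0, 0.0), 0.0
pos, neg = ((2.0, 0.0), 3.0), ((-5.0, 0.0), 1.0)
print(hidden_fire((2.0, 0.0), w, bias, pos, neg))   # situation (1): fires +1
print(hidden_fire((-0.5, 0.0), w, bias, pos, neg))  # situation (2): fires 0
print(hidden_fire((0.0, 9.0), w, bias, pos, neg))   # situation (3): fires 0
print(hidden_fire((-5.0, 0.0), w, bias, pos, neg))  # situation (4): fires -1
```

Summing such hidden outputs and falling back on the output neuron when the sum is zero reproduces the voting scheme of the classification procedure above.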
procedure MBabCoNNLearner(E)
begin
  currentAccuracy ← 0
  previousAccuracy ← 0
  output ← MTluTraining(E) {output layer with M neurons for an M-class problem}
  currentAccuracy ← evaluateNetwork(E)
  h ← 0 {hidden neuron index}
  while (currentAccuracy > previousAccuracy) and (currentAccuracy < 1) do
  begin
    highest_wrongly-on_error(E, WrongNeuron, Wrongly-onClass)
    twoClassesE ← createTrainingSet(WrongNeuron, Wrongly-onClass, E)
    h ← h + 1
    hiddenLayer[h] ← bcp(twoClassesE) {hidden BabCoNN neuron}
    previousAccuracy ← currentAccuracy
    currentAccuracy ← evaluateNetwork(E)
  end
  if currentAccuracy < 1 then
  begin
    remove(hiddenLayer, h)
    h ← h − 1
  end
end procedure.
procedure highest_wrongly-on_error(E, WrongNeuron, Wrongly-onClass)
begin
  {initializing error matrix}
  for i ← 1 to M do
    for j ← 1 to M do
      outputErr[i,j] ← 0
  {collecting errors made by output neurons in training set E = {E1, E2, ..., EN}}
  for i ← 1 to N do
  begin
    predClass ← MBabCoNN(Ei)
    if predClass ≠ class(Ei)
      then outputErr[predClass, class(Ei)] ← outputErr[predClass, class(Ei)] + 1
  end
  {identifying which neuron makes the highest number of wrongly-on errors within a class}
  highWrong ← 0
  highErr ← 0
  highWrongly-onClass ← 0
  for i ← 1 to M do
    for j ← 1 to M do
      if outputErr[i,j] > highErr
        then begin
          highErr ← outputErr[i,j]
          highWrong ← i
          highWrongly-onClass ← j
        end
  WrongNeuron ← highWrong
  Wrongly-onClass ← highWrongly-onClass
end procedure.
Fig. 5 Pseudocode for determining the neuron responsible for the highest number of
wrongly-on misclassifications as well as for the corresponding misclassified class.
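The procedure of Fig. 5 amounts to filling an M×M error matrix and taking its largest off-diagonal entry. A compact sketch in Python (here a list of predictions stands in for the MBabCoNN(Ei) calls; the names are ours, for illustration only):

```python
def highest_wrongly_on(predicted, actual, m):
    """Returns (neuron, wrongly_on_class): the output neuron that makes the
    most wrongly-on errors and the true class it most often fires on.
    Classes are numbered 1..m, as in the pseudocode."""
    err = [[0] * (m + 1) for _ in range(m + 1)]   # err[pred][true]
    for p, a in zip(predicted, actual):
        if p != a:
            err[p][a] += 1
    best = (0, 0, 0)                              # (count, neuron, class)
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if err[i][j] > best[0]:
                best = (err[i][j], i, j)
    return best[1], best[2]

# Neuron 2 fires wrongly-on for three class-3 patterns:
pred = [1, 2, 2, 2, 2, 3]
true = [1, 2, 3, 3, 3, 1]
print(highest_wrongly_on(pred, true, 3))  # (2, 3)
```

The next hidden BabCoNN neuron would then be trained on a two-class set built from the patterns of the wrongly-on class and the class the faulty neuron represents.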
Fig. 6 Successive stages of MBabCoNN learning on the training set {#1, #2, #3, #4, #5, #6} of a three-class problem with input neurons x1 and x2 and output neurons u1, u2 and u3 (#P: pattern id; C: correctly classified; W: wrongly classified). Initially only patterns #1 and #6 are correctly classified; after hidden neuron h1 is added, patterns #2 and #3 also become correctly classified; after the addition of h2, pattern #4 is correctly classified; finally the network makes no mistakes and training is finished.
procedure MBabCoNN_classifier(X)
{X: new pattern}
{MBabCoNN network with M output neurons}
begin
  result ← 0, counter ← 0, neuronIndex ← 0
  j ← 1
  while (j ≤ M) and (counter < 2) do
  begin
    OutputClassification[j] ← classification(output[j], X) {retrieves 1 or −1}
    if OutputClassification[j] = 1 then
    begin
      counter ← counter + 1
      neuronIndex ← j
    end
    j ← j + 1
  end
  if ((counter = 1) and not hasDependencies(output[neuronIndex]))
    then result ← class(neuronIndex)
  else
  begin
    for j ← 1 to M do
    begin
      sum ← 0
      for k ← 1 to h do {h is the number of hidden neurons}
        if isConnected(k, j) then {verifies connection between hidden neuron k and output j}
          sum ← sum + classification(hiddenLayer[k], X) {BabCoNN-like neuron}
      hiddenClassification[j] ← sum
    end
    result ← class(greatest(hiddenClassification))
    {returns the class associated with the index of the greatest value in hiddenClassification}
  end
end procedure.
Fig. 7 MBabCoNN network with input neurons x1 and x2, hidden neurons h1 and h2, and output neurons u1, u2 and u3; the hidden neurons are connected to u2 and u3 only.
In Fig. 7, two of the output neurons, u2 and u3, have dependencies (have connections with hidden neurons) and neuron u1 does not. If a new pattern is to be classified, the classification procedure checks if the output given by the node(s) that have no dependencies (in this example, u1) is +1; if that is the case, the new pattern is assigned the class represented by u1; otherwise, the classification procedure takes into consideration only the sum of the outputs of the hidden neurons.
In cases where all the hidden neurons fire the value 0, the classification procedure ignores the hidden neurons and takes into account the information given by the output neurons only. If, however, the output neuron that classifies the pattern has dependencies, the output result will be the sum of the outputs of all hidden neurons; if the sum is 0, the output neuron will be in charge of classifying the pattern.
The three output neurons, u1, u2 and u3, in the MBabCoNN network of Fig. 8 have dependencies. Each output node has two connections with the added hidden neurons. A pattern to be classified will result in three outputs, one from each of the three hidden nodes (+1, −1 or 0), which will be multiplied by the connection
Fig. 8 MBabCoNN network where the three output neurons have dependencies.
weight, producing values +1, −1 or 0. Each output neuron will sum the input received from the hidden neurons and the pattern will be assigned the class represented by the output neuron with the highest score.
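This scoring can be illustrated numerically; the hidden outputs and connection weights below are invented for the sake of the example:

```python
# Hidden-neuron outputs for one pattern, and each output neuron's
# connection weights to the hidden layer (0 means no connection).
hidden_out = [1, -1, 0]            # h1 fires +1, h2 fires -1, h3 fires 0
weights = {                        # invented weights for u1, u2, u3
    "u1": [1, -1, 0],
    "u2": [-1, 1, 1],
    "u3": [0, 1, -1],
}

# Each output neuron sums the weighted hidden outputs; the highest score wins.
scores = {u: sum(w * o for w, o in zip(ws, hidden_out))
          for u, ws in weights.items()}
winner = max(scores, key=scores.get)
print(scores, winner)  # u1 scores 1*1 + (-1)*(-1) + 0*0 = 2 and wins
```

Here u1 receives +1 from h1 and, through its −1 connection, another +1 from h2, so the pattern is assigned the class represented by u1.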
Runs with the various learning procedures were carried out on the same train-
ing sets and evaluated on the same test sets. The cross-validation folds were the
same for all the experiments in each domain. For each domain, each learning pro-
cedure was run considering one, ten, a hundred and a thousand iterations; only the
best test accuracy among these iterations for each algorithm is presented. All the
results obtained with MBabCoNN and the other algorithms (and their variants) are
presented in tables 2 to 9, organized by knowledge domain.
The following abbreviations were adopted for presenting the tables: #I: number
of iterations, TR training set, TE testing set. The accuracy (Acc) is given as a per-
centage followed by the standard deviation value. The Absolute Best (AB) col-
umn gives the best performance of the learning procedure (in TE) over the ten
runs and the Absolute Worst (AW) column gives the worst performance of the
learning procedure (in TE) over the ten runs; #HN represents the number of hid-
den nodes; AB(HN) gives the smallest number of hidden nodes created and
AW(HN) gives the highest number of hidden nodes created.
Obviously the PRMWTA, PRMI, BCPWTA and BCPI do not have values for
#HN, AB(HN) and AW(HN) because the networks they create only have input
and output layers.
In relation to the results obtained in the experiments shown in tables 2 to 9, it
can be said that as far as accuracy in test sets is concerned, MBabCoNNP has
shown the best performance in four out of eight domains, namely the Iris, E. Coli,
Wine and Zoo. In the Balance domain, although its result is very close to the best
Table 2 Iris
Table 3 E. Coli
Table 4 Glass
Table 5 Balance
Table 6 Wine
Table 7 Zoo
Table 8 Car
result (obtained with MTilingP), it is worth noting that MTilingP created 28.1 hid-
den neurons on average while MBabCoNN created only 3.5. A similar situation
occurred in the Car domain.
All the algorithms with a performance higher than 80% induced networks big-
ger than the network induced by MBabCoNNP, especially the MTilingP, the
MPyramid and the MTowerP, although the accuracy values of the three were very
close to those of MBabCoNN. In the Glass domain, MBabCoNNP is ranked third
considering only accuracy; however, when taking into account the standard
deviation as well, it can be said that the MBabCoNNP and MPCascadeP (second
position in the rank) are even. In the last domain, Image Segmentation, MBab-
CoNNP accuracy was average while the best performance was obtained with the
MPCascadeP.
In relation to the versions that used BCPWTA for training the output neurons, MBabCoNNB outperformed all the other algorithms in the eight domains. The results, however, were inferior to those obtained using the PRMWTA for training the output nodes. This fact is due to the particular characteristics of the two training approaches; in general the BCPWTA does not perform as well as the PRMWTA. Future work concerning BCPWTA needs to be done in order for this algorithm to be considered an option for network construction.
Now, considering the PRMWTA versions of all algorithms, it is easy to see that the test accuracies are more uniform. In order to have a clearer view
of the algorithm performances, Table 10 presents the values for the average
ranks considering the test accuracy for all PRMWTA based CoNN algorithms. In
the table the results are ranked in ascending order of accuracy. Ties in accuracy
were sorted out by averaging the corresponding ranks.
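The average ranks of Table 10 can be reproduced mechanically: within each domain, rank the algorithms by test accuracy (rank 1 for the most accurate, with tied ranks averaged), then average each algorithm's ranks across domains. A sketch with invented accuracies:

```python
def ranks_with_ties(values):
    """Rank values in descending order (1 = best); tied values share
    the average of the ranks they would occupy."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# accuracies[d][a]: accuracy of algorithm a on domain d (invented numbers)
accuracies = [[0.95, 0.92, 0.95],
              [0.80, 0.85, 0.70]]
per_domain = [ranks_with_ties(row) for row in accuracies]
avg_rank = [sum(r[a] for r in per_domain) / len(accuracies) for a in range(3)]
print(per_domain, avg_rank)
```

In the first domain the two algorithms tied at 0.95 share rank (1+2)/2 = 1.5, exactly the tie-handling described above; the algorithm with the smallest average rank is the overall best.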
According to Demšar [26], average ranks provide a fair comparison of the algorithms. Taking into account the values obtained with the average ranks it can be
said that, as far as the eight datasets are concerned, MBabCoNN is the best choice
among the six algorithms. As can be seen in Table 10, MBabCoNNP obtained the
smallest value, followed by MPCascadeP and MUpstartP. MBabCoNNP was
ranked last in only one domain (Car); in the Car domain, however, the test accura-
cies among the six algorithms were very close, i.e. the maximum difference was
about 3.0%.
It is worth noticing that the three algorithms ranked first in the average rankings construct the network by first adding the output neurons and then correcting their misclassifications by adding two-class hidden neurons. Their good performance may be used to corroborate the efficiency of this technique. Based on
the empirical results obtained, it can be said that the MBabCoNN algorithm is a
good choice among the multiclass CoNN algorithms available.
Table 10 Average rank over PRMWTA based algorithms concerning test accuracy
5 Conclusions
This chapter proposes the multiclass version, MBabCoNN, of a recently proposed
constructive neural network algorithm named BabCoNN, which is based on the
geometric concept of convex hull and uses the BCP algorithm for training individ-
ual TLUs added to the network during learning. The chapter presents the accuracy
results of learning experiments conducted in eight multiclass knowledge domains,
using the MBabCoNN implemented in two different versions: MBabCoNNP and
MBabCoNNB, versus five well-known multiclass algorithms (each implemented
in two versions as well). Both versions of the MBabCoNN use the BCP for train-
ing the hidden neurons and differ from each other in relation to the algorithm used
for training their output neurons (PRMWTA and BCPWTA respectively).
As far as results in eight knowledge domains are concerned, it can (easily) be
observed that all algorithms performed better when using PRMWTA for training
the output neurons. This may occur because BCPWTA is not a good strategy for
168 J.R. Bertini Jr. and M. do Carmo Nicoletti
training M (>2) classes. Now considering the PRMWTA versions, it can be said
the MBabCoNNP version has shown superior average performance in relation to
both accuracy in test sets and the size of the induced neural network. This work
has established MBabCoNN as a good option among other CoNNs for multiclass domains.
Acknowledgments. To CAPES and FAPESP for funding the work of the first author and
for the project financial help granted to the second author, and to Leonie C. Pearson for
proofreading the first draft of this chapter.
References
[1] Mitchell, T.M.: Machine learning. McGraw-Hill, USA (1997)
[2] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification. John Wiley & Sons, USA
(2001)
[3] Nilsson, N.J.: Learning machines. McGraw-Hill Systems Science Series, USA (1965)
[4] Elizondo, D.: The linear separability problem: some testing methods. IEEE Transactions on Neural Networks 17(2), 330–344 (2006)
[5] Bertini Jr., J.R., Nicoletti, M.C.: A constructive neural network algorithm based on the geometric concept of barycenter of convex hull. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J. (eds.) Computational Intelligence: Methods and Applications, 1st edn., vol. 1, pp. 1–12. Academic Publishing House EXIT, Warsaw (2008)
[6] Bertini Jr., J.R., Nicoletti, M.C.: MBabCoNN - a multiclass version of a constructive neural network algorithm based on linear separability and convex hull. In: Kůrková, V., Neruda, R., Koutník, J. (eds.) ICANN 2008, Part II. LNCS, vol. 5164, pp. 723–733. Springer, Heidelberg (2008)
[7] Poulard, H.: Barycentric correction procedure: A fast method for learning threshold unit. In: WCNN 1995, Washington, DC, vol. 1, pp. 710–713 (1995)
[8] Asuncion, A., Newman, D.J.: UCI machine learning repository. University of Cali-
fornia, School of Information and Computer Science, Irvine (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
[9] Parekh, R.G.: Constructive learning: inducing grammars and neural networks. Ph.D.
Dissertation, Iowa State University, Ames, Iowa (1998)
[10] Nicoletti, M.C., Bertini Jr., J.R.: An empirical evaluation of constructive neural network algorithms in classification tasks. Int. Journal of Innovative Computing and Applications (IJICA) 1, 2–13 (2007)
[11] Gallant, S.I.: Neural network learning & expert systems. The MIT Press, Cambridge
(1994)
[12] Mézard, M., Nadal, J.: Learning feedforward networks: the tiling algorithm. J. Phys. A: Math. Gen. 22, 2191–2203 (1989)
[13] Frean, M.: The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Computation 2, 198–209 (1990)
[14] Burgess, N.: A constructive algorithm that converges for real-valued input patterns. International Journal of Neural Systems 5(1), 59–66 (1994)
[15] Amaldi, E., Guenin, B.: Two constructive methods for designing compact feedforward networks of threshold units. International Journal of Neural Systems 8(5), 629–645 (1997)
[16] Poulard, H., Labreche, S.: A new threshold unit learning algorithm, Technical Report
95504, LAAS (December 1995)
[17] Poulard, H., Estève, D.: A convergence theorem for barycentric correction procedure, Technical Report 95180, LAAS-CNRS, Toulouse (1995)
[18] Bertini Jr., J.R., Nicoletti, M.C., Hruschka Jr., E.R.: A comparative evaluation of con-
structive neural networks methods using PRM and BCP as TLU training algorithms.
In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pp. 3497–3502. IEEE Press, Los Alamitos (2006)
[19] Bertini Jr., J.R., Nicoletti, M.C., Hruschka Jr., E.R., Ramer, A.: Two variants of the
constructive neural network Tiling algorithm. In: Proceedings of The Sixth International Conference on Hybrid Intelligent Systems (HIS 2006), pp. 49–54. IEEE Computer Society, Washington (2006)
[20] de Berg, M., van Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational ge-
ometry: algorithms and applications, 2nd edn. Springer, Berlin (2000)
[21] Parekh, R.G., Yang, J., Honavar, V.: Constructive neural network learning algorithms
for multi-category real-valued pattern classification. TR ISU-CS-TR97-06, Iowa State
University, IA (1997)
[22] Parekh, R.G., Yang, J., Honavar, V.: Constructive neural network learning algorithm
for multi-category classification. TR ISU-CS-TR95-15a, Iowa State University, IA
(1995)
[23] Parekh, R.G., Yang, J., Honavar, V.: MUpstart - a constructive neural network learning algorithm for multi-category pattern classification. In: ICNN 1997, vol. 3, pp. 1920–1924 (1997)
[24] Yang, J., Parekh, R.G., Honavar, V.: MTiling - a constructive network learning algorithm for multi-category pattern classification. In: Proc. of the World Congress on Neural Networks, pp. 182–187 (1996)
[25] Fahlman, S., Lebiere, C.: The cascade correlation architecture. In: Advances in Neural Information Processing Systems, vol. 2, pp. 524–532. Morgan Kaufmann, San Mateo (1990)
[26] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)
Analysis and Testing of the m-Class RDP
Neural Network
1 Introduction
The RDP for 2-class classification problems was introduced in [12]. This topol-
ogy is a generalisation of the single layer perceptron topology (SLPT) developed
David A. Elizondo and Ralph Birkenhead
School of Computing, De Montfort University, The Gateway, Leicester,
LE1 9BH, United Kingdom
e-mail: {elizondo,rab}@dmu.ac.uk
Juan M. Ortiz-de-Lazcano-Lobato
School of Computing, University of Malaga, Bulevar Louis Pasteur, 35, Malaga, Spain
e-mail: jmortiz@lcc.uma.es
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 171–192.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
172 D.A. Elizondo, J.M. Ortiz-de-Lazcano-Lobato, and R. Birkenhead
1.1 Preliminaries
We use the following standard notions:
Sm stands for the set of permutations of {1, ..., m}.
If u = (u1, ..., ud), v = (v1, ..., vd) ∈ IRd, then uT v stands for u1v1 + ... + udvd; and u(j) = uj (i.e. u(j) is the j-th component of u).
π{i1,...,ik}(u) = (ui1, ..., uik) and by extension, if S ⊆ IRd then π{i1,...,ik}(S) = {π{i1,...,ik}(x) | x ∈ S}.
Let r ∈ IR, Adj(u, r) = (u1, ..., ud, r) and by extension, if S ⊆ IRd, Adj(S, r) = {Adj(x, r) | x ∈ S}.
Im(E, F) = {(x1, ..., xd, xd+1) ∈ F | (x1, ..., xd) ∈ E} is defined for E ⊆ IRd and F ⊆ IRd+1.
P(w, t) stands for the hyperplane {x ∈ IRd | wT x + t = 0} of IRd.
Definition 2. Two subsets X and Y of IRd are said to be linearly separable if there exists a hyperplane P(w, t) of IRd, such that (∀x ∈ X, wT x + t > 0 and ∀y ∈ Y, wT y + t < 0). In the following we will denote the fact that X and Y are LS by X || Y, or X || Y (P(w, t)) if we want to specify the hyperplane which linearly separates X and Y.
This paper is divided into four sections. The m-class generalisation of the RDP neu-
ral network, based on a notion of linear separability for m classes, is presented in
section two. In this section also, some of the notions used throughout this paper are
introduced. In section three, the procedure used to evaluate the generalisation of the m-class RDP model is presented. Six machine learning benchmarks (Glass, Wine, Zoo, Iris, Soybean, and Wisconsin Breast Cancer) were used [3] and datasets were generated using
cross validation. The method is compared with Backpropagation and Cascade Cor-
relation in terms of their level of generalisation. A summary and some conclusions
are presented in section four.
Definition 3. Let X1, ..., Xm ⊆ IRd and a0 < a1 < ... < am. X1, ..., Xm are said to be linearly separable relatively to the ascending sequence of real numbers a0, ..., am if ∃σ ∈ Sm, ∃w ∈ IRd, ∃t ∈ IR such that ∀i, ∀x ∈ Xσ(i), ai−1 < wT x + t < ai.
Remarks
Let X1, ..., Xm ⊆ IRd and a0 < a1 < a2 < ... < am;
X1, ..., Xm are linearly separable relatively to a0, ..., am iff CH(X1), ..., CH(Xm) are linearly separable relatively to a0, ..., am.
Let σ ∈ Sm.
Put: X = Adj(Xσ(1), a0) ∪ Adj(Xσ(2), a1) ∪ ... ∪ Adj(Xσ(m), am−1),
Y = Adj(Xσ(1), a1) ∪ Adj(Xσ(2), a2) ∪ ... ∪ Adj(Xσ(m), am); then, X1, ..., Xm are linearly separable relatively to a0, ..., am by using σ iff X || Y. In other words, we reduce the problem of linear separability for m classes to the problem of linear separability for 2 classes. We do this by augmenting the dimension of the input vectors with the ascending sequence a0, ..., am.
If X1 || X2 (P(w, t)) and δ = Max({|wT x + t| ; x ∈ (X1 ∪ X2)}), then X1, X2 are linearly separable relatively to −δ, 0, δ.
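The reduction described in the second remark can be made concrete: each pattern of class σ(k) is augmented once with a_{k−1} (building X) and once with a_k (building Y). A small sketch of the construction (the LS test itself is omitted; the function names are ours):

```python
def adj(patterns, r):
    """Adj(S, r): append the coordinate r to every pattern in S."""
    return [tuple(p) + (r,) for p in patterns]

def reduce_to_two_class(classes, a):
    """classes: the m pattern sets, already ordered by the chosen
    permutation sigma; a: the ascending sequence a0 < a1 < ... < am.
    Class sigma(k+1) is augmented with a[k] in X and a[k+1] in Y."""
    X, Y = [], []
    for k, S in enumerate(classes):
        X.extend(adj(S, a[k]))
        Y.extend(adj(S, a[k + 1]))
    return X, Y

c1, c2, c3 = [(0.0, 0.0)], [(1.0, 1.0)], [(2.0, 0.0)]
X, Y = reduce_to_two_class([c1, c2, c3], [-1.0, 0.0, 1.0, 2.0])
print(X)  # [(0.0, 0.0, -1.0), (1.0, 1.0, 0.0), (2.0, 0.0, 1.0)]
print(Y)  # [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 0.0, 2.0)]
```

The three classes are then linearly separable relatively to a0, ..., a3 (using this ordering) iff the augmented sets X and Y are linearly separable in IR^{d+1}.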
Definition 5. An m-SLPT with the weight w ∈ IRd, the threshold t ∈ IR, the values v1, v2, ..., vm ∈ IR and the characteristic (c, h) ∈ IR × IR+ (c represents the value corresponding to the starting hyperplane, and h a chosen distance between hyperplanes, which we will call the step size) has the same topology as the 2-class SLPT.
So, there exists x ∈ X1^i ∪ ... ∪ Xm^i such that {x} || (Si \ {x}).
Proof. Let w = (0, ..., 0, 1, ..., 1) ∈ IR^(d+i) with d times 0 and i times 1, and let t = −(kX1(i)b2 + ... + kXm−1(i)bm + kXm(i)bm−1). Thus, ∀j < m, ∀x ∈ Xj^i, wT x + t = bj = b + (j − 1)h, and ∀x ∈ Xm^i, wT x + t = bm = b + (m − 1)h. Let a0 = b − h/2 and ai = a0 + ih for 1 ≤ i ≤ m; thus ∀j ≤ m, ∀x ∈ Xj^i, aj−1 < wT x + t < aj.
So, X1^i, ..., Xm^i are linearly separable by the hyperplane P(w, t)
3 Comparison Procedure
The six machine learning benchmark data sets used in the comparison study identi-
fied in section 1 are described briefly.
The Glass benchmark relates to the classification of types of glass in criminological investigations. The glass found at the scene of a crime can be used as evidence. This benchmark consists of nine inputs and six classes (Table 2). The dataset contains a total of 214 samples.
The Wine dataset contains results of a chemical analysis of wines grown in the same region in Italy derived from three different crops. The analysis determined the quantities of 13 constituents found in each of the three types of wines (Table 3).
Table 7 Inputs and outputs used in the Wisconsin Breast Cancer classification problem.
The Zoo benchmark data set contains 17 Boolean-valued attributes with a type of animal as output (Table 5). A total of 101 samples are included (mammals, birds, reptiles, fish, frogs, insects and sea shells).
The Iris dataset classifies a plant as being an Iris Setosa, Iris Versicolour or Iris
Virginica. The dataset describes every iris plant using four input parameters (Table
6). The dataset contains a total of 150 samples with 50 samples for each of the three
classes. All the samples of the Iris Setosa class are linearly separable from the rest
of the samples (Iris Versicolour and Iris Virginica). Some of the publications that used this benchmark include [7], [8], [2] and [4].
The Soybean classification problem contains data for the disease diagnosis of
the Soybean crop. The dataset describes the different diseases using symptoms. The
original dataset contains 19 diseases and 35 attributes. The attribute list was limited to those attributes that had non-trivial values in them (Table 4). Thus there were only 20 out of the 35 attributes included in the tests. Only 15 of the 19 diseases have no missing values. Therefore, only these 15 classes were used for the comparisons.
The Wisconsin Breast Cancer dataset [9, 1, 15] consists of a binary classification
problem to distinguish between benign and malignant breast cancer. The data set
contains 699 instances and 9 attributes (Table 7). The class distribution is: Benign
458 instances (65.5 %), and Malignant 241 instances (34.5 %).
The technique of cross validation was applied to split the benchmarks into train-
ing and testing data sets. The datasets were randomly divided into n equal sized
testing sets that were mutually exclusive [14]. The remaining samples were used
to train the networks. In this study, the classification benchmark data sets were di-
vided into ten equally sized data sets. On one hand sixty percent of the samples
were used for training the networks and the remaining forty percent were used for
testing purposes. On the other hand the training dataset consisted of eighty per-
cent of the samples and the remaining twenty percent were used for the testing
dataset.
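The splitting scheme just described can be sketched as a plain n-fold partition into mutually exclusive test sets, each complemented by its training set (an illustration, not the authors' exact code; samples that do not fit an exact division are simply left out of the test folds):

```python
import random

def n_fold_splits(samples, n, seed=0):
    """Randomly partition the sample indices into n equally sized,
    mutually exclusive test sets; each fold's training set is the
    complement of its test set."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)           # fixed seed: reproducible folds
    fold_size = len(idx) // n
    folds = [idx[k * fold_size:(k + 1) * fold_size] for k in range(n)]
    for test in folds:
        train = [i for i in idx if i not in set(test)]
        yield train, test

# Ten folds over twenty samples: 2 test samples per fold, 18 for training.
for train, test in n_fold_splits(list(range(20)), n=10):
    assert len(test) == 2 and not set(train) & set(test)
```

Running all n folds lets every sample appear in exactly one test set, which is the mutual-exclusivity property cited from [14] above.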
The simplex algorithm was used in this study for testing for linear separability. This algorithm was remarkably faster than the Perceptron algorithm when searching for LS subsets. Other algorithms for testing linear separability include the Class of Linear Separability method [5] and the Fisher method (see [6] for a survey of methods for testing linear separability).
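Testing linear separability with linear programming can be sketched as a feasibility problem: find w and t with w·x + t ≥ 1 on one set and w·y + t ≤ −1 on the other (strict separation can always be rescaled to margin 1). The formulation below is a standard one, not necessarily the authors' exact simplex setup, and assumes SciPy is available:

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, Y):
    """True iff point sets X and Y admit a separating hyperplane,
    decided via LP feasibility with a zero objective."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    d = X.shape[1]
    # Variables: w (d entries) and t. linprog enforces A_ub @ v <= b_ub:
    #   -(w.x + t) <= -1  for x in X,   (w.y + t) <= -1  for y in Y.
    A = np.vstack([np.hstack([-X, -np.ones((len(X), 1))]),
                   np.hstack([ Y,  np.ones((len(Y), 1))])])
    b = -np.ones(len(X) + len(Y))
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0            # feasible (status 0) means separable

print(linearly_separable([(0, 0), (0, 1)], [(2, 0), (2, 1)]))  # True
print(linearly_separable([(0, 0), (1, 1)], [(0, 1), (1, 0)]))  # XOR: False
```

A feasible program terminates in polynomial time in practice, which is consistent with the speed advantage over Perceptron training reported above.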
These results provide a good basis for further developing this study and com-
paring the effects of using single or multiple output neurons for multiple class
classification problems using the m-class RDP method and Backpropagation and
Cascade Correlation. After describing the experimental setup, some conclusions are
presented in the next section.
Table 8 Results obtained with the m-class, and backpropagation, using the Glass data set
benchmark in terms of the level of generalisation with 60% of the data used for training and
40% for testing.
Table 9 Results obtained with the m-class, and backpropagation, using the Glass data set
benchmark in terms of the level of generalisation with 80% of the data used for training and
20% for testing.
unseen data and the number of neurons needed for each method to solve the classi-
fication problems (i.e. the size of the topology).
As specified before, the m-class RDP uses a single output neuron for multiple
classes. Backpropagation and Cascade Correlation are tested using two different
topologies. The first one uses a unique output neuron and is named BP 1out and
CC 1out in the tables. The second type of topology uses as many neurons in the
output layer as the number of classes in the data set (Backprop and CC Mout in the
tables). Only the first type of topology is used when the dataset defines a binary
classification problem such as the Wisconsin Breast Cancer dataset.
Table 10 Results obtained with the m-class, and backpropagation, using the Wine data set
benchmark in terms of the level of generalisation with 60% of the data used for training and
40% for testing.
Table 11 Results obtained with the m-class, and backpropagation, using the Wine data set
benchmark in terms of the level of generalisation with 80% of the data used for training and
20% for testing.
Overall, considering all the results obtained in Tables 8 to 19 in terms of the generalisation obtained using the m-class RDP, the method is broadly comparable with CC and BP, but has slightly poorer and more variable results. While it does generally perform better than the other methods when they are used with a single output neuron, it is arguable that the nature of the data makes this an inappropriate choice of topology for a BP or CC network.
Considering the size of the network produced (tables 20 to 31), the number of
neurons in an m-class RDP is usually significantly lower than in the corresponding
182 D.A. Elizondo, J.M. Ortiz-de-Lazcano-Lobato, and R. Birkenhead
Table 12 Results obtained with the m-class RDP and backpropagation methods on the Zoo data set benchmark, in terms of the level of generalisation, with 60% of the data used for training and 40% for testing.
Table 13 Results obtained with the m-class RDP and backpropagation methods on the Zoo data set benchmark, in terms of the level of generalisation, with 80% of the data used for training and 20% for testing.
BP and CC networks with multiple output neurons. The single-output-neuron BP and CC networks sometimes have fewer neurons but, as discussed above, this is probably an inappropriate architecture for the data. This motivates future research exploring a multiple-output architecture for the m-class RDP model.
Table 14 Results obtained with the m-class RDP and backpropagation methods on the Iris data set benchmark, in terms of the level of generalisation, with 60% of the data used for training and 40% for testing.
Table 15 Results obtained with the m-class RDP and backpropagation methods on the Iris data set benchmark, in terms of the level of generalisation, with 80% of the data used for training and 20% for testing.
Table 16 Results obtained with the m-class RDP and backpropagation methods on the Soybean data set benchmark, in terms of the level of generalisation, with 60% of the data used for training and 40% for testing.
Table 17 Results obtained with the m-class RDP and backpropagation methods on the Soybean data set benchmark, in terms of the level of generalisation, with 80% of the data used for training and 20% for testing.
Table 18 Results obtained with the m-class RDP and backpropagation methods on the Wisconsin Breast Cancer data set benchmark, in terms of the level of generalisation, with 60% of the data used for training and 40% for testing.
Table 19 Results obtained with the m-class RDP and backpropagation methods on the Wisconsin Breast Cancer data set benchmark, in terms of the level of generalisation, with 80% of the data used for training and 20% for testing.
Table 20 Results obtained with the m-class RDP and backpropagation methods on the Glass data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 60% of the data used for training and 40% for testing.
Table 21 Results obtained with the m-class RDP and backpropagation methods on the Glass data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 80% of the data used for training and 20% for testing.
Table 22 Results obtained with the m-class RDP and backpropagation methods on the Wine data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 60% of the data used for training and 40% for testing.
Table 23 Results obtained with the m-class RDP and backpropagation methods on the Wine data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 80% of the data used for training and 20% for testing.
Table 24 Results obtained with the m-class RDP and backpropagation methods on the Zoo data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 60% of the data used for training and 40% for testing.
Table 25 Results obtained with the m-class RDP and backpropagation methods on the Zoo data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 80% of the data used for training and 20% for testing.
Table 26 Results obtained with the m-class RDP and backpropagation methods on the Iris data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 60% of the data used for training and 40% for testing.
Table 27 Results obtained with the m-class RDP and backpropagation methods on the Iris data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 80% of the data used for training and 20% for testing.
Table 28 Results obtained with the m-class RDP and backpropagation methods on the Soybean data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 60% of the data used for training and 40% for testing.
Table 29 Results obtained with the m-class RDP and backpropagation methods on the Soybean data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 80% of the data used for training and 20% for testing.
Table 30 Results obtained with the m-class RDP and backpropagation methods on the Wisconsin Breast Cancer data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 60% of the data used for training and 40% for testing.
Table 31 Results obtained with the m-class RDP and backpropagation methods on the Wisconsin Breast Cancer data set benchmark, in terms of topology size (number of hidden/intermediate neurons), with 80% of the data used for training and 20% for testing.
Active Learning Using a Constructive Neural
Network Algorithm
1 Introduction
A central issue when applying feed-forward neural networks to classification or prediction problems is the selection of an adequate architecture [1, 2, 3].
José L. Subirats
Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga,
Campus de Teatinos S/N, 29071 Málaga, Spain
e-mail: jlsubirats@lcc.uma.es
Leonardo Franco
Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga,
Campus de Teatinos S/N, 29071 Málaga, Spain
e-mail: lfranco@lcc.uma.es
Ignacio Molina
Departamento de Tecnología Electrónica, Universidad de Málaga,
Campus de Teatinos S/N, 29071 Málaga, Spain
e-mail: aimc@dte.uma.es
José M. Jerez
Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga,
Campus de Teatinos S/N, 29071 Málaga, Spain
e-mail: jja@lcc.uma.es
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 193–206.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
194 J.L. Subirats et al.
refer to [19] for previous work in the field. Instance selection can also be used to reduce the size of the dataset in order to speed up the training process, and can also lead to prototype selection when the selected data are very much reduced. The approach taken in this chapter is to use the proposed instance selection method as a pre-processing step, as a way of improving the generalization ability of predictive algorithms. The method is developed inside a recently introduced constructive neural network algorithm named C-Mantec [20] (Competitive MAjority Network Trained by Error Correction) and leads to an improvement in the generalization ability of the algorithm, also permitting more compact neural network architectures to be obtained. The reduced, filtered datasets are also tested with other standard classification methods, such as standard multilayer perceptrons, decision trees and support vector machines, analyzing the generalization ability obtained. This chapter is organized as follows: next we give details about the C-Mantec constructive neural network algorithm; in Section 3 the method for eliminating noisy instances is introduced, followed by experiments, results and conclusions.
φ = (Σ_i ψ_i w_i) − b.  (1)
Note that the definition of the synaptic potential includes the value of the thresh-
old or bias, b, as this will be useful because for wrongly classified inputs the
As said above, the synaptic modification rule used by the C-Mantec algorithm is the thermal perceptron rule, for which the change of the weights, Δw_i, is given by the following equation:

Δw_i = (t − S) ψ_i (T/T0) exp{−|φ|/T},  (3)

where t is the target value of the example being considered, S represents the actual output of the neuron and ψ_i is the value of the input unit i. T is a parameter introduced in the thermal perceptron definition, named temperature, T0 is the starting temperature value and φ is the synaptic potential defined in Eq. 1. For correctly classified examples the factor (t − S) is equal to 0, and then no synaptic weight changes take place. The thermal perceptron rule can be seen as a modification of the standard perceptron rule in which the change of weights is scaled by a factor m equal to:

m = (T/T0) exp{−|φ|/T}.  (4)
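As a minimal sketch, the update of Eqs. 1, 3 and 4 can be written as follows for a single thermal perceptron. The function and variable names are ours, and the neuron output is taken in {−1, +1}, which is an assumption of this sketch:

```python
import math

def thermal_update(w, b, psi, t, T, T0):
    """One thermal perceptron update (Eqs. 1, 3, 4) for a single example.

    w: weights, b: bias, psi: input vector, t: target in {-1, +1},
    T: current temperature, T0: initial temperature.
    Returns the updated weights and the scaling factor m.
    """
    phi = sum(p * wi for p, wi in zip(psi, w)) - b   # Eq. 1, bias included
    S = 1 if phi >= 0 else -1                        # actual neuron output
    m = (T / T0) * math.exp(-abs(phi) / T)           # Eq. 4
    # Eq. 3: the change is zero when the example is correctly classified.
    new_w = [wi + (t - S) * p * m for wi, p in zip(w, psi)]
    return new_w, m
```

Note that m shrinks both as the temperature drops and as |φ| grows, so updates become progressively more conservative.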
At the single neuron level the C-Mantec algorithm uses the thermal perceptron rule, but at the global network level the C-Mantec algorithm incorporates competition between the neurons, making the learning procedure more efficient and permitting more compact architectures to be obtained [20]. The main novelty introduced in the new C-Mantec algorithm is the fact that once a new neuron is added to the network, the existing synaptic weights are not frozen, as is the standard procedure in constructive algorithms. Instead, after an input instance is presented to the network, all existing neurons can learn the incoming information by modifying their weights in a competitive way, in which only one neuron will learn the incoming information. The norm in standard constructive algorithms is to freeze the weights not connected to the last added neurons in order to preserve the stored information; in the C-Mantec algorithm this is not necessary, as the thermal perceptron is a quite conservative learning algorithm, and also because the C-Mantec algorithm incorporates a parameter gfac that further controls the size of the allowed changes in synaptic weights, in particular when the temperature is large and such large changes would otherwise be allowed at the single neuron level by the thermal perceptron.
The C-Mantec algorithm generates architectures with a single hidden layer of neurons. The output neuron of the network computes the majority function of the activation values of the hidden units, and thus the set of weights connecting the hidden neurons with the output is fixed from the beginning and is not modified during
Active Learning Using a Constructive Neural Network Algorithm 197
the learning procedure. As the output neuron computes the majority of the hidden layer activations, the network functions correctly when, for every instance in the training set, the output of more than half of the hidden units coincides with the instance's target value.
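The majority output rule can be sketched in one line, assuming hidden-unit activations in {−1, +1} and an odd number of hidden units (both assumptions of this sketch):

```python
def network_output(hidden_outputs):
    """Majority function over hidden-unit activations in {-1, +1}."""
    return 1 if sum(hidden_outputs) > 0 else -1
```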
As mentioned before, the algorithm also incorporates a parameter named the growing factor, gfac, as its adjustment affects the size of the resulting architecture. When an instance is presented and the output of the network does not coincide with the target value, a neuron in the hidden layer will be selected to learn it if certain conditions are met. The selected neuron will be the one with the lowest value of φ among those neurons whose output differs from the target, but only if its value of m (see Eq. 4) is larger than the gfac value set at the beginning of the learning process. Thus, the gfac parameter prevents the learning of misclassified examples that would involve large weight modifications, as for high values of T the thermal perceptron rule would not avoid these large changes, which can destabilize the algorithm. After a neuron modifies its weights, its internal temperature value is lowered. In the case in which there are no neurons available to learn a wrongly classified instance, a new neuron is added to the network and this unit will learn the current input, ensuring the convergence of the algorithm. After a new unit is added to the network, the temperature T of all neurons is reset to the initial value T0 and the learning process continues until all training examples are correctly classified. Fig. 1 shows a pseudocode of the algorithm, summarizing its most important steps, and Fig. 2 shows a flow diagram of the algorithm.
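One learning step under the competition and gfac rules described above can be sketched as follows. The `Neuron` class, the cooling schedule and the tie-breaking by |φ| are illustrative assumptions of ours, not the authors' exact implementation:

```python
import math

class Neuron:
    """A thermal perceptron unit with its own temperature."""
    def __init__(self, n_inputs, T0):
        self.w = [0.0] * n_inputs
        self.b = 0.0
        self.T = T0
        self.T0 = T0

    def phi(self, psi):
        # Eq. 1: synaptic potential, bias included.
        return sum(p * w for p, w in zip(psi, self.w)) - self.b

    def output(self, psi):
        return 1 if self.phi(psi) >= 0 else -1

    def m(self, psi):
        # Eq. 4: temperature-dependent learning factor.
        return (self.T / self.T0) * math.exp(-abs(self.phi(psi)) / self.T)

def learn_instance(hidden, psi, target, gfac, T0, cooling=0.99):
    """One C-Mantec-style step for an instance the network misclassifies."""
    wrong = [n for n in hidden if n.output(psi) != target]
    candidates = [n for n in wrong if n.m(psi) > gfac]
    if candidates:
        # Competition: the neuron closest to the decision boundary learns.
        win = min(candidates, key=lambda c: abs(c.phi(psi)))
        delta, scale = target - win.output(psi), win.m(psi)
        win.w = [w + delta * p * scale for w, p in zip(win.w, psi)]
        win.b -= delta * scale       # the bias enters phi with a minus sign
        win.T *= cooling             # lower the winner's temperature
    else:
        # No neuron wants to learn: grow the network, reset temperatures.
        new = Neuron(len(psi), T0)
        new.w = [target * p for p in psi]   # the new unit learns this input
        hidden.append(new)
        for n in hidden:
            n.T = n.T0
```

The gfac test `n.m(psi) > gfac` is what blocks updates that would be too large, and the growth branch is what guarantees convergence.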
Regarding the setting of the two parameters of the algorithm, T0 and gfac, several experiments have shown that the C-Mantec algorithm is quite robust against changes in these two parameters, and finding suitable values is not difficult. The parameter T0 (the initial temperature) ensures that a certain number of learning iterations will take place, permitting an initial phase of global exploration of the weight values, as for high temperature values larger changes are more easily accepted.

[Fig. 2: flow diagram of the C-Mantec algorithm. While training examples remain to be learned, an instance is presented; if the output equals the target the loop continues; otherwise, if some neuron is willing to learn the instance it does so, and if not, a new neuron is added and all temperatures are reset.]

The value of the parameter gfac affects the size of the final architecture, and it has been observed that different values are needed in order to optimize the algorithm either towards more compact architectures or towards a better generalization ability.
The convergence of the algorithm is ensured because the learning rule is very conservative in its changes, preserving the knowledge acquired by the neurons, and by the fact that newly introduced units learn at least one input example. Tests performed with noise-free Boolean functions using the C-Mantec algorithm show that it generates very compact architectures, with fewer neurons than existing constructive algorithms [20]. However, when the algorithm was tested on real datasets, it was observed that a larger number of neurons was needed because the algorithm overfits noisy examples. To avoid this overfitting problem, the method introduced in the next section is developed in this work.
Fig. 3 Schematic drawing of the resonance effect that occurs when noisy examples are present in the training set. A thermal perceptron will learn the good examples, represented at the left of the figure, but will correctly classify only one of the noisy samples. Further learning iterations in which the neuron tries to learn the wrongly classified example will produce an oscillation of the separating hyperplane. The number of times the synaptic weights are adjusted upon presentation of an example can be used to detect noisy inputs.
Fig. 4 The effect of adding attribute noise. Top: Generalization ability as a function of the level of attribute noise for the modified Pima Indians diabetes dataset, for the C-Mantec algorithm applied with and without the filtering stage. Bottom: The number of neurons of the generated architectures as a function of the level of noise. The maximum number of neurons was set to 101.
Fig. 5 The effect of adding class noise to the dataset. Top: Generalization ability as a function of the level of class noise for the modified Pima Indians diabetes dataset, for the cases of implementing the filtering stage and of using the whole raw dataset. Bottom: The number of neurons of the generated architectures for the two mentioned implementations of the C-Mantec algorithm.
and if the number of presentations of an example exceeds the mean by more than two standard deviations, it is removed from the training set. The removal of examples is made on-line as the architecture is constructed, and a final phase is carried out in which no removal of examples is allowed.
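The two-standard-deviation criterion can be sketched as follows; the function name and the dictionary interface are ours:

```python
from statistics import mean, stdev

def filter_noisy(counts):
    """Flag examples whose presentation count is far above average.

    counts: mapping from example id to the number of times its presentation
    triggered a weight adjustment.  An example is flagged as noisy when its
    count exceeds the mean by more than two standard deviations.
    """
    values = list(counts.values())
    mu, sigma = mean(values), stdev(values)
    return {k for k, c in counts.items() if c > mu + 2 * sigma}
```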
To test the new method for the removal of noisy examples, a noise-free dataset was created from a real dataset, and then controlled noise was added to the attributes (input variables) and to the class (output) in separate experiments, to analyze whether there is any evident difference between the two cases [23]. Knowing the origin of
the noise is an interesting issue with practical applications, as it can help to de-
tect the sources of noise and consequently help to eliminate it. The dataset chosen
for this analysis is the Pima Indians Diabetes dataset, selected because it has been
widely studied and also because it is considered a difficult set with an average gen-
eralization ability around 75%. To generate the noise-free dataset, the C-Mantec
algorithm was run with a single neuron that correctly classified approximately 70% of the dataset, and then the noise-free dataset was constructed by passing the whole set of inputs through this network to obtain the noise-free outputs. Two
different experiments were carried out: in the first one, noise was added to the at-
tributes of the dataset and the performance of the C-Mantec algorithm was analyzed
with and without the procedure for noisy examples removal. In Fig. 4 (top) the gen-
eralization ability for both cases is shown for a level of noise between 0 and 0.8 and
the results are the average over 100 independent runs. For a given value of added noise, x, the input values were modified by a random uniform value between −x and x. The bottom graph shows the number of neurons in the generated architectures when the filtering process was and was not applied, as a function of the added
attribute noise. It can be clearly seen that the removal of the noisy examples helps
to obtain much more compact architectures while a better generalization ability is
observed. The second experiment consisted of adding noise to the output values; the results are shown in Fig. 5. In this case the noise level indicates the probability of replacing the class value by a random binary value, 0 or 1.
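The two noise-injection procedures can be sketched as follows; the function names are ours, and the class-noise variant assumes binary labels, as in the Diabetes dataset:

```python
import random

def add_attribute_noise(X, x, rng=random):
    """Perturb each input value by a uniform random amount in [-x, x]."""
    return [[v + rng.uniform(-x, x) for v in row] for row in X]

def add_class_noise(y, p, rng=random):
    """With probability p, replace each label by a random binary value."""
    return [rng.randint(0, 1) if rng.random() < p else label for label in y]
```

Note that under class noise a flipped label may by chance equal the original, so the effective corruption rate is roughly p/2 for balanced binary labels.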
From the experiments carried out with the two types of noise introduced to the
Diabetes dataset we can observe that the resonance effect helps to detect and elimi-
nate the noisy instances in both cases, helping to increase the generalization ability,
even if the change is not enough to recover the generalization ability obtained in
the noise-free case. It can also be observed that the size of the neural architectures
obtained after the removal of the noisy instances is much lower than the size of the
architectures needed for the noisy cases. Also, it has to be said that the experiments
did not lead to a way of differentiating the sources of noise, as the results obtained
for the two noise-contaminated datasets considered were not particularly different.
We tested the noise filtering abilities of the method introduced in this work using the C-Mantec constructive algorithm on a set of 11 well-known benchmark functions [24]. The set of analyzed functions contains six two-class functions and five multi-class problems with up to 19 classes. The C-Mantec algorithm was run with a maximum number of iterations of 50,000 and an initial temperature value (T0) equal to the number of inputs of the analyzed function. It is worth noting that different tests showed that the algorithm is quite robust to changes in these parameter values. The results are shown in Table 1, which reports the number of neurons of the obtained architectures and the generalization ability obtained, including the standard deviation values, computed over 100 independent runs. The last column of Table 1 shows, as a comparison, the generalization ability values
Table 1 Results for the number of neurons and the generalization ability obtained with the
C-Mantec algorithm using the data filtering method introduced in this work. The last column
shows the results from [25] (See text for more details).
Table 2 Results for generalization ability obtained using standard multilayer perceptrons
(MLP), decision trees (C4.5) and support vector machines (SVM) algorithms using both the
filtered and original datasets (See the text for more details).
datasets, the generalization ability observed decreases with the filtered instances on average by approximately 0.61%, though in 2 out of the 6 cases considered the prediction improved.
Regarding the generalization ability obtained by the different methods, we first note that the average generalization ability for the 6 functions shown in Table 2 is 87.29 ± 3.72 for the C-Mantec algorithm with the active learning procedure incorporated. Thus, the best method on this limited set of 6 functions turns out to be the SVM approach, closely followed by the constructive C-Mantec algorithm and by the MLP, while C4.5 came last with a lower generalization ability.
The number of instances in the filtered datasets was on average 2.73% smaller than in the original sets, with the smallest filtered sets corresponding to datasets for which the generalization ability was lower, such as the Diabetes dataset. The standard deviation of the results shown in Tables 1 and 2 is computed over 5 randomly selected splits, using 75% of the examples for training the models and the remaining 25% for testing the generalization ability.
5 Discussion
We introduced in this chapter a new method for filtering noisy examples using a recently developed constructive neural network algorithm. The new C-Mantec algorithm generalizes very well on noise-free datasets but has been shown to overfit noisy datasets; thus, a filtering scheme for noisy instances has been implemented. The filtering method devised is based on the observation that noisy examples need more weight updates than regular ones. This resonance effect permits these instances to be distinguished and eliminated in an on-line procedure. The simulations performed show that the generalization ability and the size of the resulting networks are much improved after the removal of the noisy examples. A comparison against previously reported values obtained using standard feed-forward neural networks [25] showed that the generalization ability was on average 2.1% larger, indicating the effectiveness of the C-Mantec algorithm implemented with the new filtering stage. The introduced
method of data selection can also be used as a pre-processing stage for other prediction algorithms, and for this reason a second comparison was carried out using three well-known predictive algorithms: MLP, C4.5 decision trees and SVM. The results shown in Table 2 indicate that the instance selection procedure appears to work quite well with the MLP and less well with the other two algorithms. It is possible, given the neural nature of the C-Mantec algorithm, that the filtering stage developed works better with neural-based algorithms, but further studies are needed to draw a final conclusion. Overall, we have observed that the active learning procedure implemented using the new C-Mantec algorithm works very efficiently at avoiding overfitting problems, and that results comparable to those obtained using MLPs and SVMs can be obtained with a constructive neural network algorithm.
Acknowledgements. The authors acknowledge support from CICYT (Spain) through grants TIN2005-02984 and TIN2008-04985 (including FEDER funds) and from Junta de Andalucía through grants P06-TIC-01615 and P08-TIC-04026. Leonardo Franco acknowledges support from the Spanish Ministry of Science and Innovation (MICINN) through a Ramón y Cajal fellowship.
References
1. Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan/IEEE Press (1994)
2. Lawrence, S., Giles, C.L., Tsoi, A.C.: What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation. Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, Univ. of Maryland (1996)
3. Gomez, I., Franco, L., Subirats, J.L., Jerez, J.M.: Neural Networks Architecture Selection: Size Depends on Function Complexity. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS, vol. 4131, pp. 122–129. Springer, Heidelberg (2006)
4. Mezard, M., Nadal, J.P.: Learning in feedforward layered networks: The tiling algorithm. J. Physics A 22, 2191–2204 (1989)
5. Frean, M.: The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation 2, 198–209 (1990)
6. Parekh, R., Yang, J., Honavar, V.: Constructive Neural-Network Learning Algorithms for Pattern Classification. IEEE Transactions on Neural Networks 11, 436–451 (2000)
7. Subirats, J.L., Jerez, J.M., Franco, L.: A New Decomposition Algorithm for Threshold Synthesis and Generalization of Boolean Functions. IEEE Transactions on Circuits and Systems I 55, 3188–3196 (2008)
8. Nicoletti, M.C., Bertini, J.R.: An empirical evaluation of constructive neural network algorithms in classification tasks. International Journal of Innovative Computing and Applications 1, 2–13 (2007)
9. Reed, R.: Pruning algorithms – a survey. IEEE Transactions on Neural Networks 4, 740–747 (1993)
10. Smieja, F.J.: Neural network constructive algorithms: trading generalization for learning efficiency? Circuits, Systems, and Signal Processing 12, 331–374 (1993)
11. Bramer, M.A.: Pre-pruning classification trees to reduce overfitting in noisy domains. In: Yin, H., Allinson, N.M., Freeman, R., Keane, J.A., Hubbard, S. (eds.) IDEAL 2002. LNCS, vol. 2412, pp. 7–12. Springer, Heidelberg (2002)
12. Hawkins, D.M.: The problem of overfitting. Journal of Chemical Information and Computer Sciences 44, 1–12 (2004)
13. Angelova, A., Abu-Mostafa, Y., Perona, P.: Pruning training sets for learning of object categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 1, pp. 494–501 (2005)
14. Cohn, D., Atlas, L., Ladner, R.: Improving Generalization with Active Learning. Machine Learning 15, 201–221 (1994)
15. Cachin, C.: Pedagogical pattern selection strategies. Neural Networks 7, 175–181 (1994)
16. Kinzel, W., Rujan, P.: Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473–477 (1990)
17. Franco, L., Cannas, S.A.: Generalization and Selection of Examples in Feedforward Neural Networks. Neural Computation 12(10), 2405–2426 (2000)
18. Sanchez, J.S., Barandela, R., Marques, A.I., Alejo, R., Badenas, J.: Analysis of new techniques to obtain quality training sets. Pattern Recognition Letters 24, 1015–1022 (2003)
19. Jankowski, N., Grochowski, M.: Comparison of Instance Selection Algorithms I. Algorithms Survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
20. Subirats, J.L., Franco, L., Jerez, J.M.: Competition and Stable Learning for Growing Compact Neural Architectures with Good Generalization Abilities: The C-Mantec Algorithm (2009) (in preparation)
21. Frean, M.: A thermal perceptron learning rule. Neural Computation 4, 946–957 (1992)
22. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408 (1958)
23. Zhu, X., Wu, X.: Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev. 22, 177–210 (2004)
24. Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine (1998)
25. Prechelt, L.: Proben1 – A Set of Benchmarks and Benchmarking Rules for Neural Network Training Algorithms. Technical Report (1994)
26. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, CA (1992)
27. Shawe-Taylor, J., Cristianini, N.: Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000)
28. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco (2000), http://www.cs.waikato.ac.nz/ml/weka
Incorporating Expert Advice into
Reinforcement Learning Using Constructive
Neural Networks
Abstract. This paper presents and investigates a novel approach to using expert
advice to speed up the learning performance of an agent operating within a rein-
forcement learning framework. This is accomplished through the use of a
constructive neural network based on radial basis functions. It is demonstrated that
incorporating advice from a human teacher can substantially improve the perform-
ance of a reinforcement learning agent, and that the constructive algorithm pro-
posed is particularly effective at aiding the early performance of the agent, whilst
reducing the amount of feedback required from the teacher. The use of construc-
tive networks within a reinforcement learning context is a relatively new area of
research in itself, and so this paper also provides a review of the previous work in
this area, as a guide for future researchers.
1 Introduction
Reinforcement learning is a learning paradigm in which an autonomous agent
learns to execute actions within an environment in such a way as to maximise the
reinforcement which it receives from the environment. Scaling reinforcement
learning to large, complex problems is an ongoing area of research, and this paper deals with the combination of two approaches which have been applied to this scaling issue: the use of constructive neural networks, and the incorporation of
human guidance into the learning process. Specifically we propose and empiri-
cally test a novel algorithm for utilising human advice within a reinforcement
learning system, which exploits the properties of a particular constructive neural
network known as the Resource Allocating Network.
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 207–224.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
208 R. Ollington, P. Vamplew, and J. Swanson
As many readers of this volume may not previously be familiar with reinforcement
learning, Section 2 will present a brief introduction to this learning paradigm;
for a more thorough review, the reader is directed to the excellent textbook by
Sutton and Barto (1998). This section will also discuss the relationship between
reinforcement learning and function approximation methods such as neural net-
works, whilst Section 3 will review the previous research on utilising constructive
neural networks within a reinforcement learning context. The following sections
examine how a reinforcement learning agent can be aided in learning a difficult
task. Section 4 reviews existing approaches to guiding an agent during learning,
whilst Section 5 proposes a new algorithm for incorporating human advice into the
learning process, based on the use of a constructive radial basis function network.
Section 6 documents experiments carried out to assess the effectiveness of this
new algorithm relative to alternative approaches. Section 7 offers conclusions and
suggestions for future work.
1 In fact, whilst the majority of reinforcement learning systems use a scalar reward signal,
tasks which require the agent to balance multiple conflicting objectives may be dealt with
by using a vector reward, with an element for each distinct objective; see, for example,
Vamplew et al. (2008).
each hidden neuron, and if this is found to be high the neuron is decomposed, and
replaced by two neurons with narrower widths. This algorithm is particularly ef-
fective in problems where the value function is smooth in some areas but with
large local variations in other regions of the state space.
All of the systems described above use neurons with radial basis functions, where
each neuron responds only to inputs within localised regions of the state-space.
The application in reinforcement learning of constructive networks based on neu-
rons with non-local response functions has been minimal, despite this style of sys-
tem being widely and successfully applied within the supervised learning field.
Many constructive algorithms have been proposed in the supervised learning lit-
erature, but amongst the most widely adopted has been Cascade-Correlation
(Cascor) (Fahlman and Lebiere, 1990). This constructive algorithm, based on non-
localised neurons, has been shown to equal or outperform fixed-architecture net-
works on a wide range of supervised learning tasks (Fahlman and Lebiere, 1990;
Waugh 1995) and its possible utility for reinforcement learning was first identified
by Tesauro (1993). Despite this, the first work using a cascade constructive network
for reinforcement learning appears to be that of Rivest and Precup (2003) and
Bellemare, Precup and Rivest (2004). Possibly this is explained by the added
difficulty in incorporating this style of network into a reinforcement learning
environment. Whereas the RAN was initially designed for on-line learning where the
network is updated after each input, Cascade-Correlation is usually used in con-
junction with batch training algorithms such as Quickprop. The approach used by
this group was to modify the reinforcement learning process so as to allow direct
application of the Cascor algorithm. They propose a learning algorithm with two
alternating stages. In the first stage the agent selects and executes actions, and
stores the input state and the target value generated via TD in a cache. Once the
cache is full, a network is trained on the cached examples, using the standard
Cascor algorithm. Once the network has been trained, the cache is cleared and the
algorithm returns to the cache-filling phase.
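A minimal sketch of this two-stage scheme (Python; the names `predict` and `batch_train`, the cache size, and the discount factor are illustrative assumptions, not details taken from the papers):

```python
class CachedCascadeTrainer:
    """Sketch of the alternating cache-then-train scheme: act and store
    TD targets until the cache is full, then batch-train a
    Cascade-Correlation network on the cached examples and clear it."""

    def __init__(self, network, cache_size=1000, gamma=0.95):
        self.network = network   # assumed to expose predict() and batch_train()
        self.cache = []
        self.cache_size = cache_size
        self.gamma = gamma

    def step(self, state, action, reward, next_state):
        # Stage 1: store the input state and its TD target in the cache.
        td_target = reward + self.gamma * max(self.network.predict(next_state))
        self.cache.append((state, action, td_target))
        if len(self.cache) >= self.cache_size:
            # Stage 2: train on the cached examples with standard Cascor,
            # then clear the cache and return to the cache-filling phase.
            self.network.batch_train(self.cache)
            self.cache = []
```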
In contrast, Vamplew and Ollington (2005) adapted the training algorithm for the
cascade network to use simple gradient descent backpropagation, thereby allowing
it to be trained in a fully on-line fashion within the temporal difference algorithm
with no need for caching. They propose the use of parallel candidate training
whereby the weights for both the output and candidate neurons in the cascade net-
work are updated after each interaction with the environment. This eliminates the
need to maintain a cache, and ensures that the policy is updated immediately after
the results of each action are known. However a possible disadvantage of parallel
candidate training is the issue of moving targets. The temporal difference errors
used as targets for the candidate nodes are themselves changing during training as
the weights of the output neurons are adapted. Therefore the task facing the candi-
date neurons is more complex than it would be if these values were static.
Nissen (2007) provides the most extensive examination so far of the use of
cascade networks for reinforcement learning, and introduces algorithms which
attempt to find a middle ground between the two previous approaches. Nissen's
that they would perform in that situation. Such a system has been used success-
fully by Clouse and Utgoff (1992). However, incorporating this form of advice
directly into the agent is not as straightforward. Clouse and Utgoff (1992) incor-
porated the advice by forcing the agent to perform the suggested action, and pro-
viding an additional bonus reward to encourage similar behaviour in the future.
This method has two main disadvantages. Firstly, there is no guarantee that the
bonus reward provided will be enough to make this action more favourable than
alternative actions. Secondly, it is likely that such a method will result in
incorrect action values being learnt by the agent, which must then be unlearnt
when the expert stops providing advice.
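For comparison, the bonus-reward scheme just described can be sketched as follows (Python; the tabular Q-function and the step size are illustrative assumptions — the systems discussed here use function approximators rather than tables):

```python
def advised_update(q, state, advised_action, reward, bonus, alpha=0.1):
    """Bonus-reward advice: the agent is forced to execute the advised
    action, and the received reward is augmented with a bonus to
    encourage similar behaviour in future. `q` maps (state, action)
    pairs to value estimates."""
    key = (state, advised_action)
    q[key] = q.get(key, 0.0) + alpha * ((reward + bonus) - q.get(key, 0.0))
    return q[key]
```

Note the first disadvantage directly: whether the advised action becomes preferred depends entirely on the bonus being large enough relative to the other action values.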
Therefore we have investigated an alternative approach to providing advice
based on directly adding neurons to a constructive network when advice is pro-
vided to the agent. Section 5 describes this new algorithm, whilst Section 6 pro-
vides experimental results comparing it against an agent learning without advice,
and an agent learning via the bonus reward approach to advice-giving.
initialise network, initialise current state s
repeat
    compute Q(s, a):
        ∀j: h_j = exp(−‖s − c_j‖² / σ_j²)
        ∀a: Q(s, a) = Σ_j w_ja · h_j
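The Q-value computation of the Resource Allocating Network — Gaussian hidden units h_j = exp(−‖s − c_j‖²/σ_j²) feeding linear action outputs — can be sketched as follows (Python; the storage layout `weights[j][a]` for the hidden-to-output weights is an assumption):

```python
import math

def q_values(state, centres, widths, weights):
    """Q-values of a radial basis function network: each hidden unit j
    responds with h_j = exp(-||s - c_j||^2 / sigma_j^2), and the Q-value
    of action a is the weighted sum over hidden units of w[j][a] * h_j."""
    hidden = [
        math.exp(-sum((s - c) ** 2 for s, c in zip(state, cj)) / (sj ** 2))
        for cj, sj in zip(centres, widths)
    ]
    n_actions = len(weights[0])
    return [sum(weights[j][a] * hidden[j] for j in range(len(hidden)))
            for a in range(n_actions)]
```

Because each Gaussian responds only near its centre, the network gives the localised responses described in Section 3.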
Fig. 2 The test environment, showing the track boundaries, and the waypoint configuration.
Fig. 3 The input and output structure of the Resource Allocating Network used in all trials.
conclusions on the relative effectiveness of an RL agent that receives advice and
one that does not.
The simulation problem domain is a car navigation task loosely based on the
Race Track game (Gardner 1973). Adaptations of this problem domain have also
been used in other advice systems (Clouse and Utgoff 1992, Clouse 1995). The
track can be described as being of any length and shape with a left boundary, a
right boundary, a start line, a finish line and way-points in between. Figure 2
depicts an outline of the track employed in the simulation environment. In our
adaptation, the objective is to drive the car from the start line to the finish line,
while avoiding collisions with the track walls and minimising lap time. The reward signal
combines a positive reward based on progress around the track and a negative
term for collisions.
Figure 3 shows the input and output configuration of an agent RAN for this
task. There are 12 inputs for state information and 9 outputs for action Q-values.
The first nine inputs correspond to a set of sensors that record the relative place-
ment of the car between track boundaries. The sensor configuration can be seen in
Figure 4. The tenth input value corresponds to the car's speed, where a positive
value means that the car is travelling forward and a negative value means the car
is travelling in reverse. The eleventh input is the direction to the nearest way-
point. The final input corresponds to the direction to the second nearest way-point.
These directional values were provided to encourage smoother movement between
way-points.
Five trials were run for each of three different learning agents: one using the new
method of providing advice (referred to as the difference method), an agent receiving
bonus rewards for advised actions, and an agent receiving no advice. In each
trial, the agent received advice for the first 250,000 steps. Performance
was then monitored for a further 250,000 steps, during which standard Q-learning
Fig. 5 Cumulative reward received by networks trained using the difference and bonus
advice methods, and by a network trained with no advice. Advice was provided for the first
250,000 steps only. (Error bars show 95% confidence intervals).
was performed with no advice being provided. The same system parameters were
used for all trials.
Figure 5 compares the performance of these three agents based on cumulative
reward. The results show that the new method produces more rapid early learning
than either an agent receiving no advice or the bonus reward agent. However, the
bonus reward agent does learn a better final solution to the problem, as indicated
by the upward trend of its curve in Figure 5. This may be due to the difference
agent exploring less and in particular exploring fewer difficult situations and
therefore, when the teacher is no longer available, not knowing how to deal with
these difficult situations when they arise. Another possible reason for this is over-
fitting. As will be discussed below, the difference method tends to produce a
much larger network, which may be resulting in overfitting.
Figure 6 illustrates the number of advice actions provided by the human teacher
during the first 250,000 training steps. Clearly the difference method places a
much lower load on the teacher than the bonus method. This is due to the teacher
being more satisfied with the performance of the agent and therefore not feeling
the need to provide advice as frequently.
As noted above, the difference method results in a much larger network than
the bonus method or an agent trained without advice, as shown in Figure 7. This is not
surprising as this method does not require the error novelty criterion of the RAN
to be satisfied when advice is given to the network.
Finally, Figure 8 shows the number of collisions experienced over time. This is
important since, while the teacher did not have an accurate idea of the reward re-
ceived at each time step, the concept of avoiding collisions is intuitive to a human
teacher in this context. As with the cumulative reward indicator, the difference
Fig. 6 Cumulative average of advice actions given to networks trained using the difference
and bonus methods. (Error bars show 95% confidence intervals).
Fig. 7 Network growth for the difference and bonus advice methods, and for a network
trained with no advice. Advice was provided for the first 250,000 steps only. (Error bars
show 95% confidence intervals).
agent performs considerably better than the bonus reward agent in terms of number
of collisions while advice is being provided. In contrast to cumulative reward,
however, the difference agent performs as well as the bonus
agent, and better than the control agent, even after advice from the teacher has
Fig. 8 Cumulative average of track collisions for networks trained using the difference and
bonus advice methods, and for a network trained with no advice. Advice was provided for
the first 250,000 steps only. (Error bars show 95% confidence intervals).
ceased. The strong performance of the no-advice agent on this metric is accounted
for by its tendency to drive slowly, which reduces the size of the penalty
imposed when a collision occurs.
It should be noted that, for one of the five difference method trials, almost double
the number of hidden nodes was added during the advice period (compared to
the next highest of the other trials), as a result of the teacher providing more than
double the amount of advice. This agent also performed the most poorly in terms
of cumulative reward and number of collisions. If this trial were excluded from
the results, we would expect to see even better results for the difference agent.
7 Conclusion
The results presented here have shown significant benefits from using human advice
to guide the learning of a reinforcement learning agent for this particular
learning problem. Both agents which were provided with advice during training
performed significantly better than the pure reinforcement-learning agent. The use
of the constructive Resource Allocating Network for the function approximation
component of the Q-learning algorithm enabled the implementation of a new
method for directly incorporating expert advice into the training procedure. This
new difference method of advice-giving resulted in better performance early in
training than for competing methods, while at the same time placing a lesser load
on the teacher.
Once the teacher stopped providing advice however, the results were mixed.
While the agent was able to maintain good performance in terms of catastrophic
References
Anderson, C.W.: Q-learning with hidden unit restarting. Advances in Neural Information
Processing Systems 5, 81–88 (1993)
Baird, L.: Residual algorithms: Reinforcement learning with function approximation. In:
Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37
(1995)
Bellemare, M., Precup, D., Rivest, F.: Reinforcement Learning Using Cascade-Correlation
Neural Networks, Technical Report RL-3.04, McGill University, Canada (2004)
Clouse, J.: Learning from an automated training agent. In: Working Notes of the ICML
1995 Workshop on Agents that Learn from Other Agents (1995)
Clouse, J., Utgoff, P.: Two kinds of training information for evaluation function learning.
In: Proceedings of the Ninth Annual Conference on Artificial Intelligence, pp. 596–600
(1991)
Clouse, J., Utgoff, P.: A teaching method for reinforcement learning. In: Proceedings of the
Ninth International Workshop on Machine Learning, pp. 92–101 (1992)
Coulom, R.: Feedforward Neural Networks in Reinforcement Learning Applied to High-
dimensional Motor Control. In: International Conference on Algorithmic Learning
Theory, pp. 402–413. Springer, Heidelberg (2002)
Crites, R.H., Barto, A.G.: Improving Elevator Performance Using Reinforcement Learning.
Advances in Neural Information Processing Systems 8, 1017–1023 (1996)
Fahlman, S.E., Lebiere, C.: The Cascade-Correlation Learning Architecture. In: Touretzky,
D.S. (ed.) Advances in Neural Information Processing II, pp. 524–532. Morgan
Kaufmann, San Francisco (1990)
Gardner, M.: Mathematical games: Fantastic patterns traced by programmed worms.
Scientific American 229(5), 116–123 (1973)
Girgin, S., Preux, P.: Incremental Basis Function Expansion in Reinforcement Learning
using Cascade-Correlation Networks, Research Report No. 6505, Institut National de
Recherche en Informatique et en Automatique, Lille, France (2008)
Großmann, A., Poli, R.: Continual Robot Learning with Constructive Neural Networks. In:
Birk, A., Demiris, J. (eds.) EWLR 1997. LNCS (LNAI), vol. 1545, pp. 95–108.
Springer, Heidelberg (1998)
Jun, L., Duckett, T.: Q-Learning with a Growing RBF Network for Behavior Learning in
Mobile Robotics. In: Proc. IASTED International Conference on Robotics and
Applications, Cambridge, USA (2005)
Kretchmar, R.M., Anderson, C.W.: Comparison of CMACs and RBFs for local function
approximators in reinforcement learning. In: IEEE International Conference on Neural
Networks, pp. 834–837 (1997)
Maclin, R., Shavlik, J.: Creating advice-taking reinforcement learners. Machine
Learning 22(1), 251–281 (1996)
Maclin, R., Shavlik, J., Torrey, L., Walker, T., Wild, E.: Giving advice about preferred
actions to reinforcement learners via knowledge-based kernel regression. In: Proceedings
of the 20th National Conference on Artificial Intelligence, pp. 819–824 (2005)
Nechyba, M.C., Bagnell, J.A.: Stabilizing Human Control Strategies Through Reinforce-
ment Learning. In: Proceedings of the IEEE Hong Kong Symp. on Robotics and Control
(1999)
Nissen, S.: Large Scale Reinforcement Learning using Q-SARSA(λ) and Cascading Neural
Networks, M.Sc. Thesis, Department of Computer Science, University of Copenhagen,
Denmark (2007)
Papudesi, V., Huber, M.: Learning from reinforcement and advice using composite reward
functions. In: Proceedings of the 16th International FLAIRS Conference, pp. 361–365
(2003)
Perkins, T.J., Precup, D.: Using Options for Knowledge Transfer in Reinforcement
Learning, Technical Report 99-34, Department of Computer Science, University of
Massachusetts (1999)
Platt, J.: A Resource-Allocating Network for Function Interpolation. Neural Computation 3,
213–225 (1991)
Randlov, J., Alstrom, P.: Learning to Drive a Bicycle Using Reinforcement Learning and
Shaping. In: International Conference on Machine Learning, pp. 463–471 (1998)
Rivest, F., Precup, D.: Combining TD-learning with Cascade-correlation Networks. In:
Twentieth International Conference on Machine Learning, Washington DC, pp. 632–639
(2003)
Rummery, G., Niranjan, M.: On-line Q-Learning Using Connectionist Systems, Technical
report, Cambridge University Engineering Department (1994)
Santos, J.M., Touzet, C.: Exploration tuned reinforcement function. Neurocomputing 28(1-
3), 93–105 (1999)
Šter, B., Dobnikar, A.: Adaptive Radial Basis Decomposition by Learning Vector
Quantisation. Neural Processing Letters 18, 17–27 (2003)
Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine
Learning 3, 9–44 (1988)
Sutton, R.S.: Generalisation in reinforcement learning: Successful examples using sparse
coarse coding. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in
Neural Information Processing Systems 8, pp. 1038–1044. The MIT Press, Cambridge
(1996)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (1998)
Tesauro, G.J.: Temporal difference learning and TD-Gammon. Communications of the
ACM 38(3), 58–68 (1995)
Thrun, S., Schwartz, A.: Issues in Using Function Approximation for Reinforcement
Learning. In: Proceedings of the Fourth Connectionist Models Summer School, Hillsdale,
NJ (December 1993)
Vamplew, P., Ollington, R.: Global Versus Local Constructive Function Approximation for
On-Line Reinforcement Learning. In: Zhang, S., Jarvis, R.A. (eds.) AI 2005. LNCS
(LNAI), vol. 3809, pp. 113–122. Springer, Heidelberg (2005)
Vamplew, P., Yearwood, J., Dazeley, R., Berry, A.: On the Limitations of Scalarisation for
Multi-objective Reinforcement Learning of Pareto Fronts. In: Wobcke, W., Zhang, M.
(eds.) AI 2008. LNCS (LNAI), vol. 5360, pp. 372–378. Springer, Heidelberg (2008)
Watkins, C., Dayan, P.: Q-learning. Machine Learning 8(3), 279–292 (1992)
Waugh, S.G.: Extending and benchmarking Cascade-Correlation, PhD thesis, Department
of Computer Science, University of Tasmania, Australia (1995)
Yingwei, L., Sundararajan, N., Saratchandran, P.: Performance evaluation of a sequential
minimal radial basis function (RBF) neural network learning algorithm. IEEE
Transactions on Neural Networks 9(2), 308–318 (1998)
A Constructive Neural Network for Evolving a
Machine Controller in Real-Time
1 Introduction
Typically constructive neural networks are used to solve classification problems. It
has been shown that using this type of network results in lower computation
requirements, smaller topologies, faster learning and better classification [5, 9].
Additionally, [4] shows that certain constructive neural networks can always be evolved
to a stage in which they can classify 100% of the training data correctly.
For machine controllers, classification is only a secondary issue, but still an
important one, as will be discussed later. The main task of the machine is to select a
suitable action in a given situation. It need not be the best possible action, but it must certainly
Andreas Huemer
Institute Of Creative Technologies, De Montfort University, Leicester, UK
e-mail: ahuemer@dmu.ac.uk
David Elizondo
Centre for Computational Intelligence, De Montfort University, Leicester, UK
e-mail: elizondo@dmu.ac.uk
Mario Gongora
Centre for Computational Intelligence, De Montfort University, Leicester, UK
e-mail: mgongora@dmu.ac.uk
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 225–242.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
be suitable. Much more important is that the action is selected in the required time:
machines, especially mobile robots, must act in real-time. In [10] real-time systems
are discussed and a definition is provided.
One approach to enabling a machine to act in real-time is to create the controller in
advance and then use it on the machine without further intervention. This is the
traditional way, but it works only for simple applications, or requires considerable
development time for more complex applications.
As machines are required to fulfil increasingly complex tasks, people have looked
for various possible solutions to this problem. On the one hand, methods were
sought to speed up the development of the controller, which is then used on the
machine. The other idea was to improve the controller when already in use.
Both approaches are useful and can also be combined, and have resulted in methods
inspired by nature. Evolutionary algorithms are inspired by the evolution of life
and can be used to improve the controller. For example, [23] presents a very effective
method for evolving neural networks that can be used to control robots.
However, the power of this approach is limited, because it does not work with a
single controller but with a whole population of controllers. More controllers require
more computational power, which either may not be available or, if available, may
slow down the embedded computer, thereby making this option too expensive. Of
course there could also be a population of machines and not only a population of
controllers, but the increasing expense in material and energy is obvious in this case.
Alternatively, machines or their controllers could be simulated on a computer,
but this only transfers several problems from the robot to the computer, and adds
the difficulty of simulating the complex environments in which robots must often act.
On-line learning, which means adapting the machine controller while it is running,
can help to overcome the problems of evolutionary methods, but it does not
replace them. The remainder of this chapter is dedicated to on-line learning.
In Sect. 2 we summarise the history of on-line learning up to the current state-of-
the-art methods and we identify remaining problems tackled in this chapter.
Section 3 shows the basic characteristics of a novel constructive machine con-
troller. The crucial issue of how feedback can be defined, how it is fed into the
controller and how it is used for basic learning strategies is discussed in Sect. 4.
Section 5 shows how the feedback is used for the more sophisticated methods of
growing new neurons and connections.
A simulation of a mobile robot was used to test the novel methodology. The
robot learns to wander around in a simulated environment avoiding obstacles, such
as walls. The results are presented in Sect. 6, and Sect. 7 draws some conclusions
and outlines future lines of work.
in different situations. The controller decides which parts of the machine to adjust
(output) in which situation (input). The complexity of the controller increases with
the number of different situations that can be sensed and the number of different
actions that can be chosen.
We do not cover the history of control technologies (e.g. mechanical or electronic
control) here but we discuss control concepts with respect to control methods. Also
remote control is not given explicit consideration, because we interpret the remote
signals as an input to the actual controller.
Early controllers were developed so that they behaved in a strictly predefined way as
long as no error occurred. These controllers are still needed and it is hard to imagine
why they should not be needed in the future.
However, each decision of the machine is planned and programmed manually
and the number of situations that have to be considered grows exponentially:
N_I = ∏_v N_v    (1)
where the number of all input combinations N_I is the product of the number of
states N_v that have to be considered for each input variable. Because of this, the development time
or the number of errors increases dramatically with the number of states that have
to be differentiated for an input variable and especially with the number of input
variables. This is without even considering the output side.
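To make Eq. (1) concrete (Python; the state counts are illustrative):

```python
def input_combinations(states_per_variable):
    """Eq. (1): the number of input situations N_I is the product of the
    number of states of each input variable."""
    n = 1
    for states in states_per_variable:
        n *= states
    return n
```

Each additional input variable multiplies the total, which is the growth described above: three variables with 3, 4 and 5 states give 60 situations, and adding one more variable with 10 states gives 600.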
Consequently, there have been different approaches to tackle this problem. One
idea was to adapt the environment in which the machine is situated. For example,
by situating a mobile robot in a laboratory where the ground is level and the
lighting is constant and standardised, the number of situations the controller has to
consider is minimised.
The problem is that mobile robots and other machines are usually not needed in a
quasi-perfect environment. Many robots are however required to fulfil increasingly
complex tasks in our real and complex world. The good news is that many of those
robots need not act in a perfect way but just in a way that is good enough. There is
some room for error and it is possible to concentrate on minimising the time required
to develop a controller.
Controllers that have a fixed behaviour for regular operations have one impor-
tant advantage and one important drawback by definition. The advantage is that
the machines do not change their behaviour in known situations: their behaviour
is predictable. The drawback is that the machines do not change their behaviour
in unknown situations either, which may result in intolerable errors. Increasing
the complexity of the environment increases the probability of intolerable errors
unless the time for developing a controller to work in it is increased
significantly.
autonomously. In addition to the main task, the action selection, we also discuss
classification. In a machine controller, classification can be used to reduce the size
of the network.
A novel method is presented which not only includes the classification task and
the action selection task in a single neural network, but it is also capable of defining
the classes autonomously, and it can grow new neurons and connections to learn to
select the correct actions from the controller's experience, based on reward values.
Positive and negative reward values constitute the feedback, which can be generated
internally, without human intervention. Alternatively, it is possible for a human to
feed the controller with feedback during runtime, acting as a teacher.
The initial neural network consists of a minimum number of neurons and no
connections. The designer of the controller need not concern him/herself with the
topology of the network. Only the interface from the controller to the machine has
to be defined, which reduces the design effort considerably.
The interface includes:
- Input neurons, which convert sensory input into values that can be sent to other
neurons. For our experiments we used spiking neurons, so the sensory values are
converted into spikes by the input neurons.
- Output neurons, which convert the output of the neural network into values that
can be interpreted by the actuators.
- A reward function, which feeds feedback into the machine controller.
Fig. 1 (a) Neural network topology after a short simulation period. For clarity not all
connections are shown. Layer A consists of the input neurons, which are connected to
sensors A to F of the robot shown in (b). Layer D contains the motor neurons, connected to
G and H in (b).
If there are no specialised neurons that represent a certain class in the subsequent
layers in the network, all neurons of this class have to connect to the next layer
separately and not with a single connection from the specialised class neuron.
There is no problem if one neuron connects only to one other neuron. In this case no
additional connections are required. However, when a neuron connects to a second
neuron, only one additional connection is made instead of more connections from
all the neurons of a class. Also the total number of neurons is reduced if the neurons
represent the possible combinations of neurons in the previous layer, because in a
subsequent layer only the class neurons have to be considered.
To achieve an efficient topology along with action selection and classification,
in a single neural network, we separate the connections into two parts: artificial
dendrites and axons. An axon of the presynaptic neuron is connected to a dendrite
of the postsynaptic neuron. This separation is needed for excitatory connections,
which implement operations similar to the logical AND and OR operations. For
inhibitory connections the separation is not necessary, because they represent an
operation similar to the logical NOT operation.
A dendrite has a low threshold potential and is activated when only one presy-
naptic neuron (or a few neurons) have fired a spike via their axons. All presynaptic
neurons are handled equally at this point (logical OR operation) and represent neu-
rons which are combined into one class. An axon weight defines to what degree one
presynaptic neuron belongs to the class.
The neuron finally fires only if a certain combination of dendrites has fired (logi-
cal AND operation). For this operation, the threshold of the neuron potential, which
is modified by the dendrites, is very high. This causes a neuron to select an action for
a certain combination of input signals. The dendrite weight defines the importance
of a class in the action selection process. In our model a single neuron can fulfil both
tasks, classification and action selection, and it learns both tasks automatically.
In the following we show the computations performed when signals travel from
one neuron to another.
Input of a dendrite:
I_d = Σ_a O_a⁺ w_a⁺    (2)
where I_d is the dendrite input, O_a⁺ is the output of an excitatory axon, which is 1
if the presynaptic neuron has fired and 0 otherwise, and w_a⁺ is the weight of the same
excitatory axon, which must be a value between 0 and 1.
Output of a dendrite:
O_d = 1 / (1 + e^(−b(I_d − θ_d)))    (3)
where O_d is the dendrite output and I_d its input. θ_d is a threshold value for the
dendrite. b is an activation constant and defines the abruptness of activation.
Input of a neuron:
I_j = Σ_d O_d w_d − Σ_a O_a⁻ w_a⁻    (4)
where I_j is the input of the postsynaptic neuron j, O_d is the output of a dendrite, w_d
is the weight of this dendrite, O_a⁻ is the output of an inhibitory axon and w_a⁻ is the
weight of this inhibitory axon. Dendrite weights and axon weights are in the range
[0, 1] and all dendrite weights add up to 1.
Change of neuron potential:
P_j(t + 1) = λ · P_j(t) + I_j    (5)
where the new neuron potential P_j(t + 1) is calculated from the potential of the
last time step t, P_j(t), and the current contribution of the neuron input I_j. λ is a
constant between 0 and 1 for recovering the resting potential (which is 0 in this
case) with time.
The postsynaptic neuron is activated when its potential reaches the threshold θ_j
and becomes a presynaptic neuron for neurons which its own axons are connected
to. After firing, the neuron resets its potential to the resting state. In contrast to
similar neuron models that are for example summarised by [15], a refractory period
is not implemented here.
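The signal flow of Eqs. (2)–(5) can be condensed into a minimal Python sketch; the parameter values used below (θ_d = 0.5, β = 10, λ = 0.9, θ_j = 1) are illustrative assumptions rather than tuned experimental values.

```python
import math

def dendrite_output(axon_outputs, axon_weights, theta_d=0.5, beta=10.0):
    """Eqs. (2)-(3): sum the weighted excitatory axon outputs and squash
    the result with a steep sigmoid (the logical-OR-like dendrite)."""
    i_d = sum(o * w for o, w in zip(axon_outputs, axon_weights))
    return 1.0 / (1.0 + math.exp(-beta * (i_d - theta_d)))

def neuron_input(dendrite_outputs, dendrite_weights, inh_outputs, inh_weights):
    """Eq. (4): weighted dendrite outputs minus weighted inhibitory axon outputs."""
    exc = sum(o * w for o, w in zip(dendrite_outputs, dendrite_weights))
    inh = sum(o * w for o, w in zip(inh_outputs, inh_weights))
    return exc - inh

def update_potential(p, i_j, lam=0.9, theta_j=1.0):
    """Eq. (5): decay towards the resting potential (0) and integrate the input.
    Returns (new_potential, fired); firing resets the potential."""
    p = lam * p + i_j
    if p >= theta_j:
        return 0.0, True
    return p, False
```

A neuron whose decayed potential plus input crosses θ_j fires and resets; as stated above, no refractory period is modelled.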
connection weights), creating new connections and creating new neurons (growing
mechanisms).
In our experiments we have used a single global reward value, which represents
a positive rewarding or negative aversive feedback depending on the machines
performance in the task that it has been assigned. The objective is to create a network
which maximises positive feedback.
Depending on current measurements like fast movement, crashes, recovering
from crashes, or the energy level, the reward value is set from −1 (very bad) to
1 (very good). The challenge is to provide a useful value in all situations. For example,
as experiments have shown (see Sect. 6), a reward function that is not capable of
providing enough positive feedback may result in a machine malfunction, because
despite all of its effort to find a good action, it is not evaluated properly. Also uni-
form positive feedback may result in a similar situation because of a lack of contrast.
The reward value ρ(t), which is the result of the reward function at time t, has to
be back-propagated to all neurons, where it is analysed and used for the adaptation
mechanisms. To do so, the neurons of each layer, starting with the output layer and
going back to the input layer, calculate their own reward value. The value of the
output neurons is equivalent to the global reward value. All other neurons calculate
their reward as follows:
ρ_i(t) = (Σ ρ_{j+}(t) − Σ ρ_{j−}(t)) / (N_+ + N_−)    (6)
where ρ_i(t) is the reward value of the presynaptic neuron i at time step t. ρ_{j+}(t) is
the reward value of a postsynaptic neuron that has an excitatory connection from
neuron i, while ρ_{j−}(t) refers to a postsynaptic neuron that has an inhibitory connection
from neuron i. N_+ is the number of postsynaptic neurons of the first kind and N_− is
the number of the other postsynaptic neurons.
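The layer-by-layer reward propagation of Eq. (6) reduces to a small helper; the sketch below averages the postsynaptic rewards, counting excitatory targets positively and inhibitory targets negatively.

```python
def neuron_reward(exc_post_rewards, inh_post_rewards):
    """Eq. (6): reward of a presynaptic neuron from the rewards of its
    postsynaptic neurons (excitatory targets counted +, inhibitory -)."""
    n_plus, n_minus = len(exc_post_rewards), len(inh_post_rewards)
    if n_plus + n_minus == 0:
        return 0.0  # no postsynaptic neurons: nothing to propagate
    return (sum(exc_post_rewards) - sum(inh_post_rewards)) / (n_plus + n_minus)
```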
Adaptation of an excitatory axon weight:
w_{a+}(t + 1) = w_{a+}(t) + η_a · ρ_j(t) · α_i · α_j    (7)
where w_{a+}(t) and w_{a+}(t + 1) are the axon weights before and after the adaptation.
η_a is the learning factor for axons and ρ_j(t) is the current reward value of the post-
synaptic neuron. α_i and α_j represent the recent activity of the presynaptic and the
postsynaptic neuron. In our experiments α_i was kept between −1 and 0 for very little
activity and from 0 to 1 for more activity; α_j is kept between 0 and 1. For positive
reward, much activity in the presynaptic neuron strengthens the axon weight if the
postsynaptic neuron was also active, but little presynaptic activity weakens the axon
weight. A negative reward value reverses the direction of change.
Adaptation of a dendrite weight (always excitatory):
w_d(t + 1) = w_d(t) + η_d · ρ_j(t) · α_d · α_j    (8)
where w_d(t) and w_d(t + 1) are the dendrite weights before and after the adaptation.
η_d is the learning factor for dendrites. α_d represents the recent dendrite activity
(which joins the activity of the connected axons) and is kept between 0 and 1. ρ_j(t)
and α_j are discussed with Eq. 7. When all dendrite weights of a neuron are adapted,
they are normalised to add up to 1 again because of the dependencies between the
weights (see Sect. 3).
Adaptation of an inhibitory axon weight (axon connected to the neuron):
w_{a−}(t + 1) = w_{a−}(t) − η_a · ρ_j(t) · α_i · α_j    (9)
where w_{a−}(t) and w_{a−}(t + 1) are the axon weights before and after the adaptation.
η_a, ρ_j(t), α_i and α_j are discussed in Eq. 7. Axons that are part of local inhibition
were not changed in our experiments. Also, if α_i is negative, the weight was kept
unchanged. An inhibitory axon is strengthened if it was not able to prevent bad
feedback, and it is weakened if it tried to prevent good feedback.
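The adaptation rules described above can be sketched as reward-modulated multiplicative updates. The exact functional form below is an assumption consistent with the behaviour described in the text (positive reward plus joint activity strengthens a weight; low presynaptic activity, α_i < 0, weakens it); the clipping to [0, 1] reflects the stated weight ranges.

```python
def adapt_excitatory_axon(w, eta_a, rho_j, alpha_i, alpha_j):
    """Reward-modulated Hebbian-style update for an excitatory axon weight,
    clipped to the stated range [0, 1]."""
    w = w + eta_a * rho_j * alpha_i * alpha_j
    return min(1.0, max(0.0, w))

def adapt_dendrites(weights, eta_d, rho_j, alphas_d, alpha_j):
    """Update every dendrite weight of a neuron, then renormalise so the
    weights add up to 1 again (see Sect. 3)."""
    new = [min(1.0, max(0.0, w + eta_d * rho_j * a_d * alpha_j))
           for w, a_d in zip(weights, alphas_d)]
    total = sum(new)
    return [w / total for w in new] if total > 0 else list(weights)
```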
Fig. 2 When a spiking neural network controls a machine, there are two short periods where
the assignment of the feedback to the corresponding action is problematic. Whenever an
action is active for a longer period the feedback can be assigned correctly to the active neurons
that are responsible for the current action.
proved themselves. For a dendrite with more than one axon it may be worth remem-
bering an activation combination. A single axon may still be interesting if there was
bad feedback, because the axon should have been inhibitory in this case. A list of
axons, that had an influence on the activation of the neuron, is kept. If this com-
bination is not yet stored in an existing preceding neuron, a new neuron is created
and each axon of the list is copied. However, each of the new axons is connected to
its own dendrite to record the combination. The new neuron will only be activated
when the same combination is activated again. When the reward potential is posi-
tive, the new neuron is connected to the existing one by a new axon to the currently
evaluated dendrite (neurons in layer B in Fig. 1.a and neurons in Fig. 3). A nega-
tive reward potential results in the addition of a new inhibitory axon to the existing
neuron (neuron in layer C in Fig. 1.a).
Generally, the new neuron is then inserted into the layer previous to the existing
postsynaptic neuron as shown in Fig. 3.a. The relative position of the new neuron
will be similar to the relative position of the existing one.
However, if one of the axons to the new neuron has its source in the layer the
new neuron should be inserted into, a new layer is created in front of the layer of
the existing postsynaptic neuron as shown in Fig. 3.b. This way the feed-forward
structure which our methods are based on can be preserved.
Once the new neuron is inserted, local inhibition will be generated. Experiments
have shown that local inhibition makes learning much faster and much more reli-
able (see Sect. 6). Hence, new inhibitory axons are created to and from each new
neuron. This inhibitory web has been automatically created within one layer in our
experiments. In more sophisticated future developments this web should perhaps be
limited to certain areas within a layer. Nested layers with neuron containers, which
Fig. 3 (a) The new neuron n is created and active axons (a) are used to create new connections
(b) to it. Then the new neuron is connected to the existing one (c). In (b) the new layer H2 is
created before neuron n is inserted, because the output neuron already has a connection to a
neuron in layer H1.
A Constructive Neural Network for Evolving a Machine Controller in Real-Time 237
are basically supported by our implementation but are not used for the growing
mechanisms yet, could help with this task.
6 Experiments
6.1 Setup
Our novel method for the autonomous creation of machine controllers was tested in
a simulation of a mobile robot which moves using differential steering, as shown in
Fig. 1.b. The initial neural network consists of 12 input neurons (2 for each
sensor) and 4 output neurons (2 for each motor, see Fig. 1.a).
The input neurons are fed by values from 6 sonar sensors as shown in Fig. 1.b,
each sensor feeds the input of 2 neurons. The sonar sensors are arranged so that 4
scan the front of the robot and 2 scan the rear as shown in the figure. The distance
value is processed so that one input neuron fires more frequently as the measured
distance increases and the other neuron connected to the same sensor fires more
frequently as the distance decreases.
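This dual sensor encoding can be sketched as follows; the linear mapping and the maximum rate are illustrative assumptions.

```python
def distance_to_rates(distance, max_distance, max_rate=100.0):
    """Encode one sonar reading as two firing rates: the first neuron fires
    faster as the measured distance grows, the second as it shrinks."""
    d = min(max(distance, 0.0), max_distance) / max_distance  # normalise to [0, 1]
    return d * max_rate, (1.0 - d) * max_rate  # (far_rate, near_rate)
```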
For the actuator control, the output connections are configured so that the more
frequently one of the output neurons connected to each motor fires, the faster this
motor will try to drive forward. The more frequently the other output neuron con-
nected to the same motor fires, the faster that motor will try to turn backwards. The
final speed at which each motor will drive is calculated by the difference between
both neurons.
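The differential decoding of the two output neurons per motor can be sketched as follows; the gain is an assumed scaling constant.

```python
def motor_speed(forward_rate, backward_rate, gain=0.01):
    """Final motor speed: proportional to the difference between the firing
    rates of the 'forward' and 'backward' output neurons of that motor."""
    return gain * (forward_rate - backward_rate)
```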
Fig. 5 The three curves show the reward values the robot received in simulation runs of the
same duration and starting from the same position. In Run 1 only forward movement is
rewarded. The robot learns to hold its position to minimise negative feedback. In Run 2
activity reward was introduced. In Run 3 the same reward algorithm is used as in Run
2, but in Run 3 no or only small forward movement is punished. The robot learns to receive
positive feedback with time which makes it more stable.
With this experimental setup the robot should learn to wander around in the sim-
ulated environment shown in Fig. 4 while avoiding obstacles.
The original robot's bumpers are included in the simulation and are used to detect
collisions with obstacles, and to penalise significantly the reward values when such
a collision occurs. The reward is increased continuously as the robot
travels farther during its wandering behaviour. Backward movement is only accept-
able when recovering from a collision, therefore it will only be used to increase
the robot's reward value in that case, while it is used to decrease this value for all
other cases. As time increases, linear forward movement will receive higher positive
reward and this will discourage circular movement.
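A reward function with these properties might look as follows; the thresholds and magnitudes are illustrative assumptions, not the exact values used in the experiments.

```python
def reward(crashed, recovering, forward_speed, backward_speed):
    """Shape the global reward in [-1, 1] for the wandering task."""
    if crashed:
        return -1.0                      # collisions are penalised heavily
    if backward_speed > 0.0:
        # backward movement only helps while recovering from a collision
        return 0.5 if recovering else -0.5
    return min(1.0, forward_speed)       # faster forward travel earns more reward
```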
6.2 Results
This section discusses the challenges when finding an appropriate reward function.
Additionally, we show the importance of local inhibition for the reliability of the
learning methods. Finally, we present some results considering the performance and
the topology of the constructive neural network.
The reward function that we have used in our experiments delivers −1 in the
case where the robot crashes into an obstacle. Backward movement is punished
Fig. 6 Run 1 shows the development of the reward value without local inhibition. This
contrasting method increases the produced reward values significantly in Run 2.
(negative value). There are two features that are not implemented in all of the three
simulation runs of Fig. 5: First, no or only small forward movement is punished;
second, backward movement is rewarded (positive value), if the robot received bad
feedback for a while, to keep the robot active.
Figure 6 shows the importance of local inhibition. Without local inhibition the
simulation run did not produce a single phase in which significant positive feedback
was received. Only short periods of positive reward can be identified where the robot
acted appropriately by chance. Local inhibition increases the contrast of spiking
patterns, which makes single neurons and hence single motor actions more powerful
and the assignment of reward to a certain spiking pattern more reliable.
Table 1 shows the results of a test of 50 simulation runs. In many cases the robot
was able to learn the wandering task with the ability to avoid obstacles. Each run of
the test sample was stopped after 20000 control cycles (processing the whole neural
network in each cycle).
The neural network contained no hidden neurons and no connections at the begin-
ning. Connections for local inhibition were created when the controller was started.
The speed values of the table are given in internal units of the simulation.
References
1. Alnajjar, F., Murase, K.: Self-organization of Spiking Neural Network Generating Autonomous Behavior in a Real Mobile Robot. In: Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation, vol. 1, pp. 1134–1139 (2005)
2. Alnajjar, F., Bin Mohd Zin, I., Murase, K.: A Spiking Neural Network with Dynamic Memory for a Real Autonomous Mobile Robot in Dynamic Environment. In: 2008 International Joint Conference on Neural Networks (2008)
3. Daucé, E., Henry, F.: Hebbian Learning in Large Recurrent Neural Networks. Movement and Perception Lab, Marseille (2006)
4. Elizondo, D., Birkenhead, R., Taillard, E.: Generalisation and the Recursive Deterministic Perceptron. In: International Joint Conference on Neural Networks, pp. 1776–1783 (2006)
5. Elizondo, D., Fiesler, E., Korczak, J.: Non-ontogenetic Sparse Neural Networks. In: International Conference on Neural Networks, vol. 26, pp. 290–295. IEEE, Los Alamitos (1995)
6. Florian, R.V.: Reinforcement Learning Through Modulation of Spike-timing-dependent Synaptic Plasticity. Neural Computation 19(6), 1468–1502 (2007)
7. Fritzke, B.: Fast Learning with Incremental RBF Networks. Neural Processing Letters 1(1), 2–5 (1994)
8. Fritzke, B.: A Growing Neural Gas Network Learns Topologies. In: Advances in Neural Information Processing Systems, vol. 7, pp. 625–632 (1995)
9. Gómez, G., Lungarella, M., Hotz, P.E., Matsushita, K., Pfeifer, R.: Simulating Development in a Real Robot: On the Concurrent Increase of Sensory, Motor, and Neural Complexity. In: Proceedings of the Fourth International Workshop on Epigenetic Robotics, pp. 119–122 (2004)
10. Greenwood, G.W.: Attaining Fault Tolerance through Self-adaption: The Strengths and Weaknesses of Evolvable Hardware Approaches. In: Zurada, J.M., Yen, G.G., Wang, J. (eds.) WCCI 2008. LNCS, vol. 5050, pp. 368–387. Springer, Heidelberg (2008)
11. Hebb, D.O.: The Organization of Behaviour: A Neuropsychological Approach. John Wiley & Sons, New York (1949)
12. Izhikevich, E.M.: Which Model to Use for Cortical Spiking Neurons? IEEE Transactions on Neural Networks 15(5), 1063–1070 (2004)
13. Izhikevich, E.M.: Solving the Distal Reward Problem through Linkage of STDP and Dopamine Signaling. Cerebral Cortex 10, 1093–1102 (2007)
14. Jaeger, H.: The Echo State Approach to Analysing and Training Recurrent Neural Networks. GMD Report 148, German National Research Institute for Computer Science (2001)
15. Katic, D.: Leaky-integrate-and-fire und Spike Response Modell. Institut für Technische Informatik, Universität Karlsruhe (2006)
16. Kohonen, T.: Self-organization and Associative Memory, 3rd printing. Springer, Heidelberg (1989)
17. Liu, J., Buller, A.: Self-development of Motor Abilities Resulting from the Growth of a Neural Network Reinforced by Pleasure and Tension. In: Proceedings of the 4th International Conference on Development and Learning, pp. 121–125 (2005)
18. Liu, J., Buller, A., Joachimczak, M.: Self-motivated Learning Agent: Skill-development in a Growing Network Mediated by Pleasure and Tensions. Transactions of the Institute of Systems, Control and Information Engineers 19(5), 169–176 (2006)
19. Maass, W., Natschlaeger, T., Markram, H.: Real-time Computing without Stable States: A New Framework for Neural Computation Based on Perturbations. Neural Computation 14(11), 2531–2560 (2002)
20. Maes, P., Brooks, R.A.: Learning to Coordinate Behaviors. In: AAAI, pp. 796–802 (1990)
21. Martinetz, T.M., Schulten, K.J.: A Neural-Gas Network Learns Topologies. In: Kohonen, T., Mäkisara, K., Simula, O., Kangas, J. (eds.) Artificial Neural Networks, pp. 397–402 (1991)
22. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. The MIT Press, Cambridge (2001)
23. Stanley, K.O., D'Ambrosio, D., Gauci, J.: A Hypercube-Based Indirect Encoding for Evolving Large-Scale Neural Networks. Accepted to appear in Artificial Life journal (2009)
24. Tajine, M., Elizondo, D.: The Recursive Deterministic Perceptron Neural Network. Neural Networks 11, 1571–1588 (1998)
25. Vreeken, J.: Spiking Neural Networks, an Introduction. Intelligent Systems Group, Institute for Information and Computing Sciences, Utrecht University (2003)
Avoiding Prototype Proliferation in Incremental
Vector Quantization of Large Heterogeneous
Datasets
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 243–260.
springerlink.com © Springer-Verlag Berlin Heidelberg 2009
244 H.F. Satizabal, A. Perez-Uribe, and M. Tomassini
1 Introduction
Processing information from large databases has become an important issue since
the emergence of the new large scale and complex information systems (e.g., satel-
lite images, bank transaction databases, marketing databases, internet). Extracting
knowledge from such databases is not an easy task due to the execution time and
memory constraints of current systems. Nonetheless, the need to use this information
to guide decision-making processes is imperative.
Classical data mining algorithms exploit several approaches in order to deal with
this kind of dataset [8, 3]. Sampling, partitioning or hashing the dataset drives the
process to a split and merge, hierarchical or constructive framework, giving the
possibility of building large models by assembling (or adding) smaller individual
parts. Another possibility to deal with large datasets is incremental learning [9]. In
this case, the main idea is to transform the modelling task into an incremental task1
by means of a sampling or partitioning procedure, and the use of an incremental
learner that builds a model from the single samples of data (one at a time).
Moreover, large databases contain a lot of redundant information. Thus, having
the complete set of observations is not mandatory. Instead, selecting a small set of
prototypes containing as much information as possible would give a more feasible
approach to tackle the knowledge extraction problem. One well known approach
to do so is Vector Quantization (VQ). VQ is a classical quantization technique that
allows the modelling of a distribution of points by the distribution of prototypes or
reference vectors. Using this approach, data points are represented by the index of
their closest prototype. The codebook, i.e. the collection of prototypes, typically has
many entries in high density regions, and discards regions where there is no data [1].
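In code, representing each data point by the index of its closest prototype amounts to a nearest-neighbour search over the codebook:

```python
def quantize(points, codebook):
    """Map each point to the index of its nearest prototype (squared
    Euclidean distance suffices for the argmin)."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [min(range(len(codebook)), key=lambda i: sqdist(p, codebook[i]))
            for p in points]
```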
A widely used algorithm implementing VQ in an incremental manner is Grow-
ing Neural Gas (GNG) [7]. This neural network is part of the group of topology-
representing networks which are unsupervised neural network models intended to
reflect the topology (i.e. dimensionality, distribution) of an input dataset [12]. GNG
generates a graph structure that reflects the topology of the input data manifold
(topology learning). This data structure has a dimensionality that varies with the di-
mensionality of the input data. The generated graph can be used to identify clusters
in the input data, and the nodes by themselves could serve as a codebook for vector
quantization [5].
In summary, building a model from a large dataset could be done by splitting
the dataset in order to make the problem an incremental task, then applying an
incremental learning algorithm performing vector quantization in order to obtain
a reduced set of prototypes representing the whole set of data, and then using the
resulting codebook to build the desired model.
Growing Neural Gas suffers from prototype proliferation in regions with high
density due to the absence of a parameter stopping the insertion of units in
sufficiently-represented2 areas. This stopping criterion could be based on a local
1 A learning task is incremental if the training examples used to solve it become available
over time, usually one at a time [9].
2 Areas with low quantization error.
Step 5: Move s1 and its direct topological neighbours towards ξ by fractions ε_b and ε_n,
respectively, of the total distance:
Step 6: If s1 and s2 are connected by an edge, set the age of this edge to zero. If such an
edge does not exist, create it
Step 7: Remove edges with an age larger than a_max. If this results in units having no
emanating edges, remove them as well
Step 8: If the number of input signals generated so far is an integer multiple of a param-
eter λ, insert a new unit as follows:
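Steps 5–7 can be sketched as one adaptation routine; the data structures (a dict of unit positions, a dict mapping edges to ages) are an illustrative choice.

```python
def gng_adapt(x, units, edges, s1, s2, eps_b=0.05, eps_n=0.005, a_max=100):
    """One GNG adaptation step (Steps 5-7): move the winner s1 and its
    topological neighbours towards the input x, refresh the s1-s2 edge,
    and drop stale edges. `units` maps unit id -> position (a list);
    `edges` maps frozenset({i, j}) -> age."""
    # Step 5: move the winner and its neighbours towards x
    for k in range(len(x)):
        units[s1][k] += eps_b * (x[k] - units[s1][k])
    for e in list(edges):
        if s1 in e:
            n = next(iter(e - {s1}))
            for k in range(len(x)):
                units[n][k] += eps_n * (x[k] - units[n][k])
            edges[e] += 1            # age all edges emanating from s1
    # Step 6: reset (or create) the edge between s1 and s2
    edges[frozenset({s1, s2})] = 0
    # Step 7: remove edges older than a_max and any now-isolated units
    for e in [e for e, age in edges.items() if age > a_max]:
        del edges[e]
    connected = {u for e in edges for u in e}
    for u in [u for u in units if u not in connected]:
        del units[u]
```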
Figure 2 shows the position and distribution of the 200 cells of the resulting
structure. As we can see, the distribution of each one of the variables is reproduced
by the resulting group of prototypes.
The GNG algorithm is a vector quantizer which places prototypes by perform-
ing entropy maximization [6]. This approach, while allowing the vector prototypes
Fig. 2 Positions of neurons of the GNG model. a) Position of the neuron units. b) Distribution
of X. c) Distribution of Y.
Parameter   ε_b    ε_n     a_max   λ     α     d
Value       0.05   0.005   100     100   0.5   0.9
Step 8: If the number of input signals generated so far is an integer multiple of a param-
eter λ, insert a new unit as follows:
Determine the distance between the unit q with the maximum accumulated
error and its neighbour f with the largest error variable:
dist = ‖w_q − w_f‖    (8)
Else,
Insert a new unit r halfway between q and its neighbour f with the largest
error variable:
w_r = 0.5 (w_q + w_f)    (11)
Insert edges connecting the new unit r with units q and f, and remove the
original edge between q and f.
Decrease the error variables of q and f by multiplying them with a constant
α. Initialize the error variable of r with the new value of the error variable
of q.
In step 8, the original algorithm selects the unit q with the highest accumulated
error and its neighbour f having also the highest error (see equation 4). Knowing
the distance between unit q and unit f (equation 8), and taking the quantization error
qE as the radius of each unit, one can change step 8 of the original algorithm
proposed by Fritzke as shown in table 3. Hence, we propose to insert a new unit as
in the original version (equation 11), but only if there is enough space between unit
q and unit f (equation 10).
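Since the inequality of equation 10 is not reproduced above, the sketch below assumes the natural reading: treat each unit's quantization error as its radius and insert the new unit (Eq. 11) only when the distance of Eq. 8 leaves enough room, with sp controlling the tolerated superposition. The exact test is an assumption based on the description in the text.

```python
def maybe_insert(w_q, w_f, qe_q, qe_f, sp=0.75):
    """Sketch of the modified Step 8: insert a new unit halfway between
    q and f only if there is enough space between them; otherwise skip."""
    dist = sum((a - b) ** 2 for a, b in zip(w_q, w_f)) ** 0.5   # Eq. (8)
    if dist > sp * (qe_q + qe_f):                               # enough space?
        return [0.5 * (a + b) for a, b in zip(w_q, w_f)]        # Eq. (11)
    return None                                                 # skip insertion
```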
controlling the region of influence of each unit, and the latter by controlling the
superposition of units. These natural meanings allow them to be tuned according to
the requirements of each specific application.
In a less obvious sense, parameter h controls the distribution of units between
high and low density areas, modulating the distribution-matching property of the
algorithm. In order to do so, parameter h modulates the signal that drives the inser-
tion of new units in the network only if the best matching neuron fulfills a given
quantization error condition. In this way, the algorithm does not insert unnecessary
units in well represented zones, even if the local error measure increases due to high
data density. Some examples exploring this parameter are given in section 4.
4 Toyset Experiments
In section 2, a non-uniform distribution of points in two dimensions was used to
train a GNG network. Figure 2 shows a high concentration of prototypes in the zone
with higher density due to the property of density matching of the model. This is
an excellent result if we do not have any constraint on the amount of prototypes.
In fact, having more prototypes increases the execution time of the algorithm since
Fig. 3 Results of the modified algorithm varying parameter h (qE = 0.1 and sp = 0.75). a)
Using the original GNG algorithm b) Using h = 1.00 c) Using h = 0.75 d) Using h = 0.50 e)
Using h = 0.25 f) Using h = 0.00.
Fig. 4 The original and the modified version of GNG trained with a dataset like the one
used by Cselenyi [4]. a) Training data. b) Algorithm of GNG by Fritzke. c) Modified GNG
algorithm, qE = 0.1, h = 0.1, sp = 0.5.
there are more units to evaluate each time a new point is evaluated, and this is not
desirable if we have a very large dataset. Moreover, we apply vector quantization
in order to reduce the number of points to process by choosing a suitable codebook,
and therefore, redundant prototypes are not desirable. This section shows how the
proposed modification for controlling prototype proliferation allows us to overcome
this situation. Two experiments with controlled toysets should help in the testing
and understanding of the modified algorithm.
Figure 3 shows the results of the modified algorithm when trained with the data
distribution shown in figure 1, and how parameter h effectively controls
the proportion of units assigned to regions with high and low densities. In this case
the parameters qE and sp were kept constant (qE = 0.1 and sp = 0.75) since their
effects are more global and depend less on the data distribution. The rest of the
parameters were set as shown in table 2.
Another interesting test consists of using a dataset similar to the one proposed
by Martinetz [13] in the early implementations of this kind of network (neural gas,
growing cell structures, growing neural gas) [5]. This distribution of data has been
used by several researchers [4, 5, 7, 12, 13] in order to show the ability of the
topology-preserving networks in modelling the distribution and dimensionality of
data. The generated dataset shown in figure 4 a) presents two different levels of den-
sities for points situated in three, two and one dimension, and has points describing
a circle.
When this dataset is used, the model has to deal with data having different di-
mensionalities, different densities and different topologies. Figures 4 b) and 4 c)
show the position of the units of two GNG networks, one of them using the original
algorithm and the other one using the modified version. Both structures preserve
the topology of the data in terms of dimensionality by placing and connecting units
depending on local conditions. Conversely, the two models behave differently in
terms of the distribution of the data. The codebook of the original GNG algorithm
reproduces the distribution of the training data by assigning almost the same quan-
tity of data points to each vector prototype. In the case of the modified version, the
parameter h set to 0.1 makes the distribution of prototypes more uniform due to the
fact that the insertion of new units is conditioned with the quantization error. Other
parameters were set as shown in table 2.
Fig. 5 Histogram of the quantization error for a large dataset. a) Fritzke's original GNG
algorithm. b) Modified GNG algorithm, qE = 1 °C, h = 0.1, sp = 0.5.
the resulting model. Instead of representing better these areas, our approach is to
avoid prototype proliferation in regions with regular conditions in order to better
represent heterogeneous zones (e.g., mountains). Figure 6.b) shows that the modi-
fied version of GNG places less prototypes in flat areas (i.e., high density regions)
than the original version (Figure 6.a), and assigns more prototypes (i.e., cluster
centres) to the lower density points belonging to mountain areas (i.e., low density
regions).
Fig. 7 Error distributions after each incremental step. Each line shows the evolution of the
quantization error for a given subset when adding new knowledge to the model.
learner by using the individual parts. This scenario was tested by using the complete
dataset mentioned in section 5.
The complete dataset of climate with 1,336,025 data points and forty-eight di-
mensions was divided in an arbitrary way (i.e. from north to south) into 7 parts: six
parts of 200,000 observations, and one final part having 136,025 data points. These
individual subsets were used in order to incrementally train a model using our mod-
ified version of the GNG algorithm. The parameters used are shown in table 4.
Fig. 8 Error distributions after each training-repetition step. Each line shows the evolution of
the quantization error for a given subset.
Fig. 9 Histogram of the quantization error for the codebook obtained with the modified
version of GNG, after feeding in the whole climate dataset
Figure 7 shows the evolution of the quantization error after adding each one of
the seven subsets. Each horizontal line of boxplots represents the error for a given
subset, and each step means the fact of training the network with a new subset. As
Fig. 10 Prototype boundaries of the final GNG network projected on the geographic map.
can be seen in figure 7, the quantization error for a subset presented in previous
steps increases in further steps with the addition of new knowledge. Such behaviour
suggests that, even if the GNG algorithm performs incrementally, the addition of
new data can produce interference with the knowledge already stored within the
network. This undesirable situation happens if the new data are close enough to the
existing prototypes to make them move, forgetting the previous information.
In order to overcome this weakness of the algorithm, a sort of repetition policy
was added to the training procedure, as follows. The GNG network was trained sev-
eral times with the same sequence of subsets, from the first to the seventh subset. At
each iteration4 the algorithm had two possibilities: choosing a new observation from
the input dataset, or taking one of the prototype vectors from the current codebook
as input. The former option, which we named exploration, could be taken with a
4 Each time a new point is going to be presented to the network.
pre-defined probability. Figure 8 shows the evolution of the quantization error for
five training-repetition steps of the algorithm, using an exploration level of 50%.
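The repetition policy can be sketched as a small generator that, at each iteration, either presents a fresh observation (exploration) or rehearses a stored prototype; an exploration level of 50% corresponds to exploration=0.5.

```python
import random

def training_stream(subset, codebook, exploration=0.5, rng=random):
    """Repetition policy: at each iteration present either a new observation
    from the current subset (exploration) or a stored prototype (repetition)."""
    for x in subset:
        if rng.random() < exploration or not codebook:
            yield x                        # explore: a new data point
        else:
            yield rng.choice(codebook)     # repeat: rehearse stored knowledge
```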
As can be seen in figure 8, the quantization error for each one of the seven subsets
decreases with each iteration of the algorithm. Such behaviour means that the GNG
network is capable of avoiding catastrophic forgetting of knowledge when using
repetition. Moreover, after five iterations the network reaches a low quantization
error over the whole dataset. This result is shown in figure 9.
The resulting network has 14,756 prototype vectors in its codebook, which represents
1.1% of the total amount of pixels in the database. The number of prototypes
is larger than in the case of the temperature because of the larger dimensionality of
the observations (i.e. forty-eight dimensions instead of twelve). Moreover, precipitation
data have a wider range than temperature, increasing the area where prototype
vectors should be placed.
Figure 10 shows the Voronoi region of each prototype projected over the two-
dimensional geographic map. As in the previous case, one can see that prototypes
are distributed over the whole map, and they are more concentrated in mountain
zones, as desired.
Finally, after quantizing the dataset of 1,336,025 observations, the sets of similar
climate zones could be found by analyzing the 14,756 prototypes obtained from the
GNG network. This compression in the amount of data to be analyzed is possible
due to the existence of redundancy in the original dataset. In other words, pixels with
similar characteristics are represented by a reduced number of vector prototypes,
even if they are located in regions which are not geographically adjacent.
6 Conclusions
Nowadays, there is an increasing need for dealing with large datasets. A large dataset
can be split or sampled in order to divide the modelling task into smaller subtasks
that can be merged in a single model by means of an incremental learning tech-
nique performing vector quantization. In our case, we chose the Growing Neural
Gas (GNG) algorithm as the vector quantization technique. GNG allows us to get
a reduced codebook to analyse, instead of analysing the whole dataset. Growing
Neural Gas is an excellent incremental vector quantization technique, allowing us
to preserve the topology and the distribution of a set of data.
However, in our specific application, we found it necessary to modulate the topology-matching property of the GNG algorithm in order to control the distribution of units between zones of high and low density. To achieve this, we modified the original algorithm proposed by Fritzke by adding three new parameters: two controlling the quantization error and the number of neuron units in the network, and one controlling the distribution of these units. The modified version retains the topology-preservation property but, contrary to the original version, permits modulating the distribution-matching capabilities of the original algorithm.
Avoiding Prototype Proliferation 259
These changes allow the quantization of datasets having high contrasts in density while keeping the information of low-density areas, using a limited number of prototypes.
Moreover, we tested the modified algorithm on the task of quantizing a heterogeneous set of real data. First, the difference in the distribution of prototypes between the original and the modified version was tested using the classical modelling approach, where the whole set of data is available during the training process. By doing so, we verified that the modified version modulates the insertion of prototypes in high-density regions of the data. Finally, we used the modified version of the algorithm to perform an incremental modelling task over a larger version of the former dataset. A repetition policy had to be added to the incremental training procedure in order to carry out this test. This repetition strategy allowed the GNG network to remember previous information, preventing catastrophic forgetting caused by new data interfering with the already stored knowledge.
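The repetition policy can be pictured as interleaving the stored prototypes with each incoming data chunk, so that previously learned structure keeps being rehearsed. A schematic sketch, with the GNG step abstracted away behind a placeholder `train_step` of our own:

```python
import random

def incremental_training(chunks, train_step, prototypes=None, repeat_ratio=0.5):
    """Train over a stream of data chunks, replaying stored prototypes
    with each new chunk so that new data does not overwrite the
    previously learned structure (the repetition policy)."""
    prototypes = list(prototypes or [])
    for chunk in chunks:
        # mix a fraction of remembered prototypes into the current chunk
        n_rep = int(repeat_ratio * len(chunk))
        replay = random.sample(prototypes, min(n_rep, len(prototypes)))
        prototypes = train_step(list(chunk) + replay, prototypes)
    return prototypes

# Toy 'train_step': the codebook is just the sorted set of distinct values seen.
toy_step = lambda data, protos: sorted(set(data))
codebook = incremental_training([[1, 2], [3, 4]], toy_step)
```

With a real GNG `train_step`, the replayed prototypes keep the network anchored to regions no longer present in the current chunk.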
References
1. Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2004)
2. Bouchachia, A., Gabrys, B., Sahel, Z.: Overview of Some Incremental Learning Algorithms. In: IEEE International Fuzzy Systems Conference (FUZZ-IEEE 2007), pp. 1–6 (2007)
3. Bradley, P., Gehrke, J., Ramakrishnan, R., Srikant, R.: Scaling mining algorithms to large databases. Commun. ACM 45, 38–43 (2002)
4. Cselenyi, Z.: Mapping the dimensionality, density and topology of data: The growing adaptive neural gas. Computer Methods and Programs in Biomedicine 78, 141–156 (2005)
5. Fritzke, B.: Unsupervised ontogenic networks. In: Handbook of Neural Computation, ch. C 2.4. Institute of Physics, Oxford University Press (1997)
6. Fritzke, B.: Goals of Competitive Learning. In: Some Competitive Learning Methods (1997), http://www.neuroinformatik.rub.de/VDM/research/gsn/JavaPaper/ (Cited October 26, 2008)
7. Fritzke, B.: A Growing Neural Gas Learns Topologies. In: Advances in Neural Information Processing Systems, vol. 7. MIT Press, Cambridge (1995)
8. Ganti, V., Gehrke, J., Ramakrishnan, R.: Mining very large databases. Computer 32, 38–45 (1999)
9. Giraud-Carrier, C.: A note on the utility of incremental learning. AI Commun. 13, 215–223 (2000)
10. Heinke, D., Hamker, F.H.: Comparing neural networks: a benchmark on growing neural gas, growing cell structures, and fuzzy ARTMAP. IEEE Transactions on Neural Networks 9, 1279–1291 (1998)
11. Hijmans, R., Cameron, S., Parra, J., Jones, P., Jarvis, A.: Very High Resolution Interpolated Climate Surfaces for Global Land Areas. Int. J. Climatol. 25, 1965–1978 (2005)
12. Martinetz, T., Schulten, K.: Topology representing networks. Neural Networks 7, 507–522 (1994)
13. Martinetz, T., Schulten, K.: A neural gas network learns topologies. Artificial Neural Networks, 397–402 (1991)
260 H.F. Satizabal, A. Perez-Uribe, and M. Tomassini
Tuning Parameters in Fuzzy Growing
Hierarchical Self-Organizing Networks
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 261–279.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
262 M.A. Barreto-Sanz et al.
network and the quality of the prototypes; in addition the motivation and
the theoretical basis of the algorithm are presented.
1 Introduction
We live in a world full of data. Every day we are confronted with the handling
of large amounts of information. This information is stored and represented
as data, for further analysis and management. One of the essential means in
dealing with data is to classify or group it into categories or clusters. In fact, as one of the most ancient activities of human beings [1], classification plays a very important role in the history of human development. In order to learn about a new object or distinguish a new phenomenon, people always try to look for the features that can describe it, and further compare it with other known objects or phenomena based on their similarity or dissimilarity, generalized as proximity, according to some standards or rules.
In many cases classification must be done without a priori knowledge of the classes into which the dataset is divided (unlabeled patterns). This kind of classification is called clustering (unsupervised classification). On the contrary, discriminant analysis (supervised classification) starts from a collection of labeled patterns; the problem is then to label a newly encountered, unlabeled pattern. Typically, the given labeled patterns are used to learn descriptions of classes, which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters as well, but these category labels are data driven; that is, they are obtained solely from the data [15, 23].
Even though unsupervised classification presents many advantages over supervised classification¹, it is a subjective process in nature. As pointed out by Backer and Jain [2], in cluster analysis a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create interesting clusters), such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups. Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories); there is, however, no universally agreed-upon definition of a cluster [8].
Thus, methodologies to evaluate clusters at different levels of abstraction in order to find interesting patterns are useful; such methodologies could help to improve the analysis of cluster structure by creating representations that facilitate the selection of clusters of interest. Methods for tree-structure representation and data abstraction have been used for this task, revealing the topology and organization of clusters.
¹ For instance, no extensive prior knowledge of the dataset is required, and it can detect natural groupings in feature space.
Tuning Parameters in FGHSON 263
On the one hand, hierarchical methods are used to help explain the inner organization of datasets, since the hierarchical structure imposed by the data produces a separation of clusters that is mapped onto different branches. Hierarchical clustering algorithms organize data into a hierarchical structure according to a proximity matrix. The results of hierarchical clustering are usually depicted by a binary tree or dendrogram. The root node of the dendrogram represents the whole data set and each leaf node is regarded as a data object. The intermediate nodes describe to what extent the objects are proximal to one another, and the height of the dendrogram usually expresses the distance between each pair of objects or clusters, or between an object and a cluster. The ultimate clustering results can be obtained by cutting the dendrogram at different levels. This representation provides very informative descriptions and visualizations of the potential data clustering structures, and is especially useful when real hierarchical relations exist in the data, as in data from evolutionary research on different species of organisms. Therefore, this hierarchical organization enables the analysis of complicated structures as well as the exploration of the dataset at multiple levels of detail [23].
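Cutting the tree at different heights to obtain flat clusterings can be illustrated with a naive agglomerative (single-linkage) merge loop; this is our own toy sketch, not an algorithm from the chapter:

```python
import numpy as np

def single_linkage_labels(X, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single linkage) until n_clusters remain. Stopping at a
    given n_clusters corresponds to cutting the dendrogram at a level."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance: closest pair across the clusters
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Two well-separated groups of points; cutting at two clusters recovers them.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = single_linkage_labels(X, 2)
```

Asking for six clusters instead would cut the tree at the leaves, giving singletons: the same tree yields clusterings at every level of detail.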
On the other hand, data abstraction permits the extraction of a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic processing (so that a machine can perform further processing efficiently) or is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid of the cluster [7]. Soft competitive learning methods [11] are employed for data abstraction in a self-organizing way. These algorithms attempt to distribute a number of vectors (prototype vectors) in a potentially low-dimensional space. The distribution of these vectors should reflect (in one of several possible ways) the probability distribution of the input signals, which in general is not given explicitly but only through sample vectors. Two principal approaches have been used for this purpose. The first is based on a fixed network dimensionality (e.g. Kohonen maps [16]). In the second approach, no fixed dimensionality is imposed on the network; hence, the network can automatically find a suitable structure and size through a controlled growth process [19].
Different approaches have been introduced in order to combine the tree-structure capabilities of hierarchical methods with the advantages of the soft competitive learning methods used for data abstraction [20, 13, 6, 22, 12, 18], obtaining networks capable of representing the structure of clusters and their prototypes in a hierarchical self-organizing way. These networks are able to grow and adapt their structure in order to represent the characteristics of clusters in the most accurate manner. Although these hybrid models provide satisfactory results, they generate crisp partitions of the datasets. The crisp
2 Methods
FKCNs [4] integrate the idea of fuzzy membership from Fuzzy C-Means (FCM) with the updating rules of SOM, thus creating a self-organizing algorithm that automatically adjusts the size of the updated neighborhood during the learning process, which usually terminates when the FCM objective function is minimized. The update rule for the FKCN algorithm can be given as:

$$W_i(t) = W_i(t-1) + \frac{\sum_{k=1}^{n} \alpha_{ik}(t)\,\big(Z_k - W_i(t-1)\big)}{\sum_{k=1}^{n} \alpha_{ik}(t)}, \qquad \alpha_{ik}(t) = \big(U_{ik}(t)\big)^{m(t)} \quad (1)$$

where $m(t)$ is an exponent like the fuzzification index in FCM and $U_{ik}(t)$ is the membership value of the compound $Z_k$ to cluster $i$. Both of these quantities vary at each iteration $t$ according to:

$$m(t) = m_0 - t\,\Delta m, \qquad \Delta m = (m_0 - 1)/t_{max} \quad (2)$$

$$U_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{\lVert Z_k - W_i \rVert}{\lVert Z_k - W_j \rVert} \right)^{2/(m-1)} \right]^{-1}, \qquad 1 \le k \le n, \; 1 \le i \le c \quad (3)$$
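The membership computation of equation (3) can be sketched numerically as follows; the NumPy formulation is our own:

```python
import numpy as np

def fcm_memberships(Z, W, m=2.0, eps=1e-12):
    """Fuzzy c-means style memberships U[i, k] of observation Z[k]
    to cluster prototype W[i], as in equation (3). eps guards the
    division when an observation coincides with a prototype."""
    # pairwise distances ||Z_k - W_i||, shape (c, n)
    d = np.linalg.norm(W[:, None, :] - Z[None, :, :], axis=2) + eps
    # ratio[i, j, k] = (d[i, k] / d[j, k]) ** (2 / (m - 1))
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

Z = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
W = np.array([[0.0, 0.0], [1.0, 1.0]])
U = fcm_memberships(Z, W)
```

Each column of U sums to one: an observation sitting on a prototype gets membership close to 1 there, while one equidistant from both prototypes gets 0.5 each.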
d. If $E_t < \varepsilon$, stop.
² From the perspective of neural networks, it represents a neuron or a prototype vector; the number of neurons or prototype vectors will thus be equal to the number of clusters.
Both growing processes are modulated by three parameters that regulate the breadth (growth of the layers), depth (hierarchical growth) and membership degree of data to the prototype vectors.
The FGHSON works as follows:
Fig. 1 (a) Hierarchical structure showing the prototype vectors and FKCNs created in each layer for a hypothetical case. (b) Membership degrees in each layer, corresponding to the network shown in the diagram. The parameter α (the well-known α-cut) represents the minimal degree of membership required for an observation to be part of the dataset represented by a prototype vector; the group of data with the desired membership to a prototype will be used for the training of a new FKCN in the next layer (depth process). In this particular diagram the dataset is one-dimensional (represented by the small circles below the membership plot) in order to simplify the example
The value of qe0 will help to measure the minimum quality of data representa-
tion of the prototype vectors in the subsequent layers. Succeeding prototypes
have the task of reducing the global representation error qe0 .
After the expansion process and the creation of the new FKCNs, the breadth process described in stage 2 begins with the newly established FKCNs, for instance FKCN2 and FKCN3 in figure 1(a). The methodology for adding new prototypes, as well as the termination criterion of the breadth process, is essentially the same as that used in the first layer. The difference between the training processes of the FKCNs in the first layer and all subsequent layers is that only a fraction of the whole input data is selected for training. This portion of data is selected according to a minimal membership degree (α). This parameter (an α-cut) represents the minimal degree of membership required for an observation to be part of the dataset represented by a prototype vector. Hence, α is used as a selection parameter, so all the observations represented by Wi have to fulfill expression (11), where Uik is the degree of membership of the element Zk of the dataset to the cluster i. As an example, figure 1(b) shows the membership functions of the FKCNs in each layer, and how α is used as a selection criterion to divide the dataset.
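The α-cut selection step can be sketched as follows (the names are ours; U would come from the membership computation of equation (3)):

```python
import numpy as np

def alpha_cut(Z, U, i, alpha):
    """Select the observations whose membership to prototype i is at
    least alpha; they become the training set of the next-layer FKCN."""
    return Z[U[i] >= alpha]

# Memberships of 4 observations to 2 prototypes (rows = prototypes).
U = np.array([[0.9, 0.6, 0.2, 0.5],
              [0.1, 0.4, 0.8, 0.5]])
Z = np.array([[0.0], [0.2], [0.9], [0.5]])
subset = alpha_cut(Z, U, i=0, alpha=0.5)
```

Raising alpha shrinks each subset, so fewer (but more strongly associated) observations are handed down to the next layer.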
At the end of the creation of layer two, the same procedure described in step 2 is applied to build layer 3, and so forth.
The training process of the FGHSON terminates when no prototypes require further expansion. Note that this training process does not necessarily lead to a balanced hierarchy, i.e., a hierarchy with equal depth in each branch. Rather, the specific distribution of the input data is modeled by a hierarchical structure, where some clusters require deeper branching than others.
3 Experimental Testing
categories. The second layer (figure 2(c)) reaches a more fine-grained description of the dataset, placing prototypes in almost all of the data distribution and adding prototypes in the zones where more representation was needed. Finally, in figure 2(d), it is possible to observe an overpopulation of prototypes in the middle of the cloud of Virginica and Versicolor observations. This occurs because this part of the dataset presents observations with ambiguous membership in the previous layer; several prototypes are therefore placed in this new layer for proper representation, permitting those observations to obtain a higher membership to their new prototypes. The outcome of the process is a more accurate representation of this zone.
Fig. 3 Distribution of the prototype vectors (represented by black points). (a) First layer. (b) Second layer. (c) Third layer.
triplet (α, τ1 and τ2) an FGHSON was trained using the following fixed parameters: tmax = 100 (maximum number of iterations), ε = 0.0001 (termination criterion) and m0 = 2 (fuzzification parameter).
Several variables were recorded in order to measure the quality of the networks created for every FGHSON generated: the number of hierarchical levels of the obtained network, the number of FKCNs created at each level, and finally the quantization error by prototype and level. The analysis of these values allows discovery of the relationships between the parameters (α, τ1 and τ2) and the topology of the networks (represented in this experiment by the levels reached by each network and the number of FKCNs created). In addition, it is possible to observe the relationship between the quantization errors of prototypes by level and the parameters of the algorithm. This makes it possible to find values of the parameters that allow us to build the most accurate structure, based on the number of prototypes, the quantization error and the number of levels present in the network.
Due to the large amount of information involved, a graphical representation of the obtained data was used in order to facilitate visualization of the results. For this, 3D plots were used as follows: the parameter τ1 (which regulates the breadth of the networks) and the parameter τ2 (which regulates the depth of the hierarchical architecture) are shown on the x-axis and y-axis respectively. The z-axis shows the number of levels in the hierarchy (see figure 4 and figure 5). Each 3D plot corresponds to one fixed value of α. Hence, eight 3D plots represent the eight different values evaluated for α, and each 3D plot contains 100 possible combinations of the pair (τ1, τ2) for a specific α. Therefore, 800 networks were generated, and their (α, τ1, τ2) values and levels were analyzed and plotted.
Furthermore, additional information was added to the 3D plots. The number of FKCNs created per level is represented by a symbol in the 3D plots (see figure 4 and figure 5, left side). The highest quantization error among the prototypes that were expanded is shown in a new group of 3D plots; in other words, this is the error of the prototype that is the father of the prototypes in that level⁴. The rounded value of the quantization error is shown as a mark in the 3D plot for each triplet of values, in each level (see figure 4 and figure 5, right side).
Examining the obtained results, there are some interesting findings related to the quantization error and the topology of the network. For instance, figure 4 and figure 5 show the different networks created. It can be seen that for values of τ2 above 0.3 the model generates networks with just one level, so an interesting area to explore lies between the values τ2 = 0.1, 0.2 and 0.3. With respect to the quantization error (figure 4 and figure 5, right side), for almost all values of α, the lowest quantization error with the lowest number of
⁴ For this reason, in level one all the values are 291 (see figure 4 and figure 5, right side), because the father prototype that is expanded has the same quantization error for all networks; in the case of the first level this error is called qe0, as described in section 2.3.
Fig. 4 The figure shows 3D plots of the results obtained using α = 0.1, 0.2, 0.3 and 0.4. On the left side it is possible to observe the levels obtained for each triplet (α, τ1, τ2); in addition, the number of FKCNs created for each level is represented by a symbol. On the right side the highest quantization error of the prototypes that were expanded is shown; in other words, this prototype is the father (with the highest quantization error) of the prototypes in that level. In the special case of the first level all the values are 291, because the father prototype that is expanded has the same quantization error for all the networks; in the case of the first level this error is called qe0, as described in section 2.3.
Fig. 5 The figure shows 3D plots of the results obtained using α = 0.5, 0.6, 0.7 and 0.8; in addition, the number of FKCNs created for each level is represented by a symbol. On the right side the highest quantization error of the prototypes that were expanded is shown; in other words, this prototype is the father (with the highest quantization error) of the prototypes in that level.
Fig. 6 Structures obtained tuning the model with the values (a) α = 0.1, τ1 = 0.2, τ2 = 0.1; (b) α = 0.3, τ1 = 0.2, τ2 = 0.1; and (c) α = 0.6, τ1 = 0.2, τ2 = 0.1. In this figure it is possible to observe the distribution of the prototype vectors; the prototypes of the first level are represented by circles and the prototypes of the second level are represented by triangles.
the first level of the hierarchy; these prototypes represent four classes⁵. The prototypes of this layer thus represent the three classes of iris, and in addition they take the problematic region between Versicolor and Virginica as a fourth class. Furthermore, new prototypes are created in the second layer in order to obtain a more accurate representation of the dataset, creating a proliferation of prototypes. This phenomenon is due to the low α (0.1) selected: since the quantity of elements represented by each prototype is large (due to the low membership requirement, a lot of data can be members of one prototype), many prototypes are necessary to reach a low quantization error.
In the next example, with α = 0.3, τ1 = 0.2, τ2 = 0.1, it is possible to observe (figure 6(b)) the three prototypes created in the first level. In this case the number of prototypes matches the number of classes in the iris dataset. Nevertheless, there is (as in the previous example) an abundance of prototypes in the Virginica-Versicolor group. But in this case the number of prototypes is lower compared with the preceding example, showing how α affects the quantity of prototypes created.
Finally, in the last example, with α = 0.6, τ1 = 0.2, τ2 = 0.1, three prototypes are created in the first level of the network, matching the classes of the Iris dataset (figure 6(c)); additionally, in the second layer one of the previous prototypes is expanded into three prototypes in order to represent the fuzzy areas of the data set. This last network presents the lowest values of vector quantization error, levels of hierarchy, and number of FKCNs, so it is possible to select it as the most accurate topology, following the previously defined premise that the most accurate network must present the lowest number of levels, the lowest number of FKCNs, and the lowest quantization error.
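The selection premise above can be read as a lexicographic criterion: fewest levels first, then fewest FKCNs, then lowest quantization error. A toy sketch of that reading (the chapter itself compares the candidates by inspection, and the numbers below are invented for illustration):

```python
def select_topology(candidates):
    """Pick the network with the fewest levels, then the fewest FKCNs,
    then the lowest quantization error. This is one simple reading of
    the premise; weighted trade-offs could be used instead."""
    return min(candidates, key=lambda c: (c["levels"], c["n_fkcn"], c["qe"]))

# Hypothetical summaries of three trained networks.
nets = [
    {"alpha": 0.1, "levels": 2, "n_fkcn": 9, "qe": 0.8},
    {"alpha": 0.3, "levels": 2, "n_fkcn": 4, "qe": 0.9},
    {"alpha": 0.6, "levels": 2, "n_fkcn": 2, "qe": 0.7},
]
best = select_topology(nets)
```

With these invented numbers the criterion would pick the network trained with the highest α, mirroring the outcome described in the text.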
4 Conclusion
The Fuzzy Growing Hierarchical Self-Organizing Networks are fully adaptive networks able to hierarchically represent complex datasets. Moreover, they allow a fuzzy clustering of the data, allocating more prototype vectors or
⁵ It is known that there are three classes (iris Setosa, Virginica and Versicolor), but the fourth arises in an area where Versicolor and Virginica present similar characteristics.
References
1. Anderberg, M.: Cluster Analysis for Applications. Academic, New York (1973)
2. Backer, E., Jain, A.: A clustering performance measure based on fuzzy set decomposition. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-3(1), 66–75 (1981)
3. Baraldi, A., Alpaydin, E.: Constructive feedforward ART clustering networks - Part I and II. IEEE Trans. Neural Netw. 13(3), 645–677 (2002)
4. Bezdek, J., Tsao, K., Pal, R.: Fuzzy Kohonen clustering networks. In: IEEE Int. Conf. on Fuzzy Systems, pp. 1035–1043 (1992)
5. Burzevski, V., Mohan, C.: Hierarchical Growing Cell Structures. Tech. Report, Syracuse University (1996)
6. Doherty, J., Adams, G., Davey, N.: TreeGNG - hierarchical topological clustering. In: Proc. Euro. Symp. Artificial Neural Networks, pp. 19–24 (2005)
7. Diday, E., Simon, J.C.: Clustering analysis. In: Digital Pattern Recognition, pp. 47–94. Springer, Heidelberg (1976)
8. Everitt, B., Landau, S., Leese, M.: Cluster Analysis. Arnold, London (2001)
9. Fischer, G., van Velthuizen, H.T., Nachtergaele, F.O.: Global agro-ecological zones assessment: methodology and results. Interim Report IR-00-064. IIASA, Laxenburg, Austria and FAO, Rome (2000)
10. Fritzke, B.: Growing cell structures: a self-organizing network for unsupervised and supervised learning. Neural Networks 7(9), 1441–1460 (1994)
11. Fritzke, B.: Some competitive learning methods, Draft Doc. (1998)
12. Taniichi, H., Kamiura, N., Isokawa, T., Matsui, N.: On hierarchical self-organizing networks visualizing data classification processes. In: Annual Conference, SICE 2007, pp. 1958196 (2007)
13. Hodge, V., Austin, J.: Hierarchical growing cell structures: TreeGCS. IEEE Transactions on Knowledge and Data Engineering 13(2), 207–218 (2001)
14. Huntsberger, T., Ajjimarangsee, P.: Parallel Self-organizing Feature Maps for Unsupervised Pattern Recognition. Int. J. General Sys. 16, 357–372 (1989)
15. Jain, A., Murty, M., Flynn, P.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
16. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43(1), 59–69 (1982)
17. Lampinen, J., Oja, E.: Clustering properties of hierarchical self-organizing maps. J. Math. Imag. Vis. 2(2-3), 261–272 (1992)
18. Luttrell, S.: Hierarchical self-organizing networks. In: Proceedings of the 1st IEE Conference on Artificial Neural Networks, London, UK, pp. 2–6. British Neural Network Society (1989)
19. Martinetz, T., Schulten, K.: Topology representing networks. Neural Networks 7(3), 507–522 (1994)
20. Merkl, D., He, H., Dittenbach, M., Rauber, A.: Adaptive hierarchical incremental grid growing: An architecture for high-dimensional data visualization. In: Proc. Workshop on SOM, Advances in SOM, pp. 293–298 (2003)
21. Miikkulainen, R.: Script recognition with hierarchical feature maps. Connection Science 2, 83–101 (1990)
22. Rauber, A., Merkl, D., Dittenbach, M.: The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks 13(6), 1331–1341 (2002)
23. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645–678 (2005)
Self-Organizing Neural Grove: Efficient
Multiple Classifier System with Pruned
Self-Generating Neural Trees
Hirotaka Inoue
Abstract. Multiple classifier systems (MCS) have become popular during the last
decade. Self-generating neural tree (SGNT) is a suitable base-classifier for MCS
because of the simple setting and fast learning capability. However, the computation
cost of the MCS increases in proportion to the number of SGNTs. In an earlier
paper, we proposed a pruning method for the structure of the SGNT in the MCS to
reduce the computational cost. In this paper, we propose a novel pruning method
for more effective processing and we call this model self-organizing neural grove
(SONG). The pruning method is constructed from both an on-line and an off-line
pruning method. Experiments have been conducted to compare the SONG with an
unpruned MCS based on SGNT, an MCS based on C4.5, and the k-nearest neighbor
method. The results show that the SONG can improve its classification accuracy as well as reduce the computation cost.
1 Introduction
Classifiers need to find hidden information in the large amount of given data effectively and must classify unknown data as accurately as possible [1]. Recently, to improve the classification accuracy, multiple classifier systems (MCS) such as neural network ensembles, bagging, and boosting have been used for practical data mining applications [2, 3, 4, 5]. In general, the base classifiers of an MCS use traditional models such as neural networks (backpropagation networks and radial basis function networks) [6] and decision trees (CART and C4.5) [7].
Neural networks have great advantages of adaptability, flexibility, and universal nonlinear input-output mapping capability. However, to apply these neural
Hirotaka Inoue
Kure National College of Technology, 2-2-11 Agaminami, Kure,
Hiroshima 737-8506, Japan
e-mail: hiro@kure-nct.ac.jp
L. Franco et al. (Eds.): Constructive Neural Networks, SCI 258, pp. 281–291.
springerlink.com
© Springer-Verlag Berlin Heidelberg 2009
282 H. Inoue
networks, it is necessary that human experts determine the network structure and
some parameters, and it may be quite difficult to choose the right network structure
suitable for a particular application at hand. Moreover, a long training time is required to learn the input-output relation of the given data. These drawbacks prevent neural networks from being used as the base classifier of the MCS in practical applications.
Self-generating neural trees (SGNTs) [8] have a simple network design and learn at high speed. SGNTs are an extension of the self-organizing maps (SOM) of Kohonen [9] and utilize competitive learning. These capabilities make the SGNT a suitable base classifier for the MCS. In order to improve the accuracy of the SGNT, we proposed ensemble self-generating neural networks (ESGNN) for classification [10] as one such MCS. Although the accuracy of ESGNN improves by using various SGNTs, the computational cost (that is, the computation time and the memory capacity) increases in proportion to the number of SGNTs in the MCS.
In an earlier paper [11], we proposed a pruning method for the structure of the SGNT in the MCS to reduce the computational cost. In this paper, we propose a
novel MCS pruning method for more effective processing, and we call this model a self-organizing neural grove (SONG). This pruning method comprises two stages. At the first stage, we introduce an on-line pruning method to reduce the computational cost by using class labels in learning. At the second stage, we optimize the structure of the SGNT in the MCS to improve the generalization capability by pruning the redundant leaves after learning. In the optimization stage, we introduce
a threshold value as a pruning parameter to decide which subtree's leaves to prune, and estimate it using 10-fold cross-validation [12]. After the optimization, the SONG can improve its classification accuracy as well as reduce the computational cost.
Bagging [2] is used as a resampling technique for the SONG.
In this work, we investigate the performance improvement of the SONG by comparing it with an MCS based on C4.5 [13] using ten problems from the UCI machine learning repository [14]. Moreover, we compare the SONG with k-nearest neighbor
(k-NN) [15] to investigate the computational cost and the classification accuracy.
The SONG demonstrates higher classification accuracy and faster processing speed
than k-NN on average.
The rest of the paper is organized as follows: the next section shows how to construct the SONG. Then Section 3 is devoted to some experiments to investigate its performance. Finally, we present some conclusions and outline plans for future work.
$$w_{jk} \leftarrow w_{jk} + \frac{1}{c_j}\,(e_{ik} - w_{jk}), \qquad 1 \le k \le m. \quad (1)$$
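Equation (1) keeps each node's weights equal to the running mean of the examples it covers; in code (our own sketch, with c the node's current example count):

```python
def update_weights(w, e, c):
    """Move node weights w toward example e by 1/c, so that w remains
    the running mean of the c examples attached to the node (eq. 1)."""
    return [wk + (ek - wk) / c for wk, ek in zip(w, e)]

w = [0.0, 0.0]
w = update_weights(w, [1.0, 2.0], 1)   # first example: mean is the example
w = update_weights(w, [3.0, 4.0], 2)   # now the mean of both examples
```

After the two updates w equals the component-wise mean of [1, 2] and [3, 4], which is exactly the invariant the SGNT maintains at every node.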
Input:
  A set of training examples E = {e_i}, i = 1, ..., N.
  A distance measure d(e_i, w_j).
Program Code:
  copy(n_1, e_1);
  for (i = 2, j = 2; i <= N; i++) {
    n_win = choose(e_i, n_1);
    if (leaf(n_win)) {
      copy(n_j, w_win);
      connect(n_j, n_win);
      j++;
    }
    copy(n_j, e_i);
    connect(n_j, n_win);
    j++;
    prune(n_win);
  }
Output:
  Constructed SGNT by E.
After all training data are inserted into the SGNT as leaves, each leaf has a class label as its output, and the weights of each node are the averages of the corresponding weights of all its leaves. The topology of the whole SGNT network reflects the given feature space. For more details on how to construct and run the SGNT, see [8]. Note that, to optimize the structure of the SGNT effectively, we remove the threshold value of the original SGNT algorithm in [8], which controls the number of leaves based on distance, because of the trade-off between the memory capacity and the classification accuracy. In order to avoid this problem, we introduce a new pruning method in the sub-procedure prune(n_win). We use the class label to prune leaves: for the leaves whose parent node is n_win, if all leaves belong to the same class, then these leaves are pruned and the parent node is given the class.
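The class-label pruning rule of prune(n_win) can be sketched as follows, using a simple dictionary representation of nodes (our own, not the chapter's data structure):

```python
def prune(node):
    """On-line pruning rule: if all children of node are leaves carrying
    the same class label, drop the leaves and give node that label."""
    children = node["children"]
    if children and all(not c["children"] for c in children):
        labels = {c["label"] for c in children}
        if len(labels) == 1:
            node["children"] = []
            node["label"] = labels.pop()
    return node

# A winner node whose two leaf children agree on class 0 gets collapsed.
n_win = {"label": None,
         "children": [{"label": 0, "children": []},
                      {"label": 0, "children": []}]}
prune(n_win)
```

Collapsing unanimous subtrees this way removes units without changing any classification decision, which is why it reduces memory at no cost in accuracy.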
Fig. 4 An example of the SONG pruning algorithm: (a) a two-dimensional classification problem with two equal circular Gaussian distributions, (b) the structure of the unpruned SGNT, (c) the structure of the pruned SGNT (α = 1), and (d) the structure of the pruned SGNT (α = 0.6). The shaded plane is the decision region of class 0 by the SGNT and the dotted line shows the ideal decision boundary
Fig. 5 An example of the SONG's decision boundary (K = 25): (a) α = 1, and (b) α = 0.6. The shaded plane is the decision region of class 0 by the SONG and the dotted line shows the ideal decision boundary
SONG: Efficient MCS with Pruned Self-Generating Neural Trees 287
In the above example, we used all the training data to construct the SGNT. The structure of the SGNT changes with the order of the training data. Hence, we can construct the SONG from the same training data by changing the input order.
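Building an ensemble by re-ordering the same training data, and combining the members by voting, can be sketched as follows (`train` stands in for the SGNT construction; the helper names are ours):

```python
import random

def build_ensemble(train, data, k, seed=0):
    """Build k base classifiers from the same data by shuffling the
    presentation order, as the SONG builds its K SGNTs."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(k):
        order = list(data)
        rng.shuffle(order)            # a different input order per member
        ensemble.append(train(order))
    return ensemble

def majority_vote(classifiers, x):
    """Combine the ensemble's answers for input x by majority voting."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

# With an identity 'train', each member is just a permutation of the data.
orders = build_ensemble(lambda d: d, list(range(5)), k=3)
```

Because each order yields a different tree, the members disagree near the boundary, and voting smooths the combined decision region.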
To show how well the SONG is optimized by the pruning algorithm, we show an example of the SONG on the same problem used above. Figure 5(a) and Figure 5(b) show the decision region of the SONG for α = 1 and α = 0.6, respectively. We set the number of SGNTs K to 25. The result in Figure 5(b) is a better estimate of the ideal decision region than the result in Figure 5(a). We investigate the pruning method on more complex problems in the next section.
3 Experimental Results
We investigate the computational cost (the memory capacity and the computation
time) and the classification accuracy of the SONG with bagging on ten benchmark
problems from the UCI machine learning repository [14]. Table 2 presents a summary
of the datasets.
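Bagging, used for all ensembles in this section, trains each member on a bootstrap sample of the training set. A minimal sketch of the sampling step (the classifier trained on each sample is omitted):

```python
import random

def bootstrap_sample(data, rng):
    """One bagging sample: N points drawn from the N training points
    uniformly with replacement (Breiman [2])."""
    return [rng.choice(data) for _ in range(len(data))]

def bagged_training_sets(data, K, seed=0):
    """One bootstrap sample per ensemble member (K = 25 in the experiments)."""
    rng = random.Random(seed)
    return [bootstrap_sample(data, rng) for _ in range(K)]
```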
We evaluate how the SONG is pruned using 10-fold cross-validation on the ten
benchmark problems. In this experiment, we use a modified Euclidean distance
measure for the SONG and k-NN. Since the performance of the SONG is not sensitive
to the threshold value, we evaluate a range of threshold values
from 0.5 to 1 in steps of 0.05: {0.5, 0.55, 0.6, . . ., 1}. We set the number of SGNTs K in
the SONG to 25 and execute 100 trials, changing the sampling order of each training
set. All experiments in this section were performed on an UltraSPARC workstation
with a 900 MHz CPU, 1 GB RAM, and Solaris 8.
Table 3 shows the average memory requirement and classification accuracy over
100 trials for the SONG. As the memory requirement, we count the number of units,
i.e., the sum of the root, the nodes, and the leaves of the SGNT. Optimizing the SONG
reduces the average memory requirement by between 65% and 96.6% and improves
the classification accuracy by 0.1% to 2.9%. This confirms that the
Table 2 Brief summary of the datasets. N is the number of instances, m is the number of
attributes
Dataset N m classes
balance-scale 625 4 3
breast-cancer-w 699 9 2
glass 214 9 6
ionosphere 351 34 2
iris 150 4 3
letter 20000 16 26
liver-disorders 345 6 2
new-thyroid 215 5 3
pima-diabetes 768 8 2
wine 178 13 3
288 H. Inoue
Table 3 The average memory requirement and classification accuracy of 100 trials for the
bagged SGNT in the SONG. The standard deviation of the classification accuracy is given in
brackets (×10⁻³)
Table 4 The improved performance of the pruned MCS and the MCS based on C4.5 with
bagging
SONG can be effectively used for all datasets with regard to both the computational
cost and the classification accuracy.
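The unit count used as the memory measure above can be expressed, for a hypothetical dict-based tree node, as:

```python
def count_units(node):
    """Memory measure of Table 3: units = root + internal nodes + leaves."""
    return 1 + sum(count_units(c) for c in node.get("children", []))
```

For example, a root with two children, one of which has a single child of its own, counts four units.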
To evaluate the SONG's performance, we compare it with an MCS based on C4.5.
We set the number of classifiers K in each MCS to 25 and construct both MCSs
by bagging. Table 4 shows the improved performance of the SONG and the MCS
based on C4.5. The results for the SGNT and the SONG are averages over 100 trials.
The SONG performs better than the MCS based on C4.5 on 6 of the 10 datasets. Although
the MCS based on C4.5 degrades the classification accuracy for iris, the SONG
improves the classification accuracy on all problems. The SONG is therefore an
efficient MCS in terms of both its scalability to large-scale datasets and its robust
improvement of generalization on noisy datasets, comparable to the MCS based on C4.5.

Table 5 The classification accuracy, the memory requirement, and the computation time of
ten trials for the best pruned SONG and k-NN
To show the advantages of the SONG, we compare it with k-NN on the same problems.
The best classification accuracy over 100 trials with bagging was chosen. For
k-NN, we choose the best accuracy with k in {1, 3, 5, 7, 9, 11, 13, 15, 25}, selected by
10-fold cross-validation. All methods are compiled using gcc with optimization
level -O2 on the same workstation.
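The selection of k can be sketched with a plain exhaustive k-NN (a simplified stand-in using ordinary Euclidean distance rather than the modified distance described above):

```python
import math
from collections import Counter

def knn_predict(train, x, k):
    """Exhaustive k-NN: majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def best_k_by_cv(data, ks, folds=10):
    """Pick k by cross-validated accuracy, as done for Table 5."""
    def accuracy(k):
        hits = 0
        for i in range(folds):
            test = data[i::folds]       # simple interleaved folds
            train = [p for j, p in enumerate(data) if j % folds != i]
            hits += sum(knn_predict(train, x, k) == y for x, y in test)
        return hits / len(data)
    return max(ks, key=accuracy)
```

As the text notes, this selection must rerun cross-validation once per candidate k, which is exactly the cost that becomes prohibitive on datasets the size of letter.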
Table 5 shows the classification accuracy, the memory requirement, and the computation
time achieved by the SONG and k-NN. Although compression methods exist
for k-NN [16], they take enormous computation time to construct an effective model,
so we use exhaustive k-NN in this experiment. Since k-NN does not discard any
training samples, the size of this classifier equals the training set size. The results
for k-NN are the averages obtained by 10-fold cross-validation, the same experimental
procedure as adopted for the SONG. We present the results for each category in turn.
First, with regard to classification accuracy, the SONG is superior to k-NN on
8 of the 10 datasets and gives a 1.6% improvement on average. Second, in terms of
the memory requirement, even though the SONG includes the root and the internal nodes
generated by the SGNT generation algorithm, it requires less memory than k-NN on
all problems. Although Table 5 reports the total memory used across all K
SGNTs, we release the memory of each SGNT after each trial and reuse it,
so the peak memory requirement is bounded by
the size of a single SGNT. Finally, in terms of computation time, although the
SONG pays K times the cost of a single SGNT to construct the model and classify
the unknown data, its average computation time is lower than that of k-NN. The
SONG is slower than k-NN on small datasets such as glass, ionosphere, and iris.
However, it is faster than k-NN on large datasets such as balance-scale, letter, and
pima-diabetes. In the case of letter, in particular, the SONG is about 2.4 times faster
than k-NN. We need to repeat 10-fold cross-validation
many times to select the optimum values of the threshold and k, and this evaluation consumes
much computation time on large datasets such as letter. Therefore, the SONG, based
on the fast and compact SGNT, is useful and practical for large datasets. Moreover,
the SONG is capable of parallel computation because each classifier behaves independently.
In conclusion, the SONG is a practical method for large-scale data mining
compared with k-NN.
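The independence of the K classifiers noted above makes the training step embarrassingly parallel. A sketch using a thread pool (`build_sgnt` is again a stand-in taking the data and a per-member seed; for CPU-bound training a process pool would avoid the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def train_parallel(data, build_sgnt, K=25, workers=4):
    """Train the K ensemble members concurrently. Each member depends only
    on its own seed and the shared (read-only) training data, so no
    synchronization between workers is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: build_sgnt(data, seed), range(K)))
```

`map` returns results in submission order, so the member list is deterministic even though execution is concurrent.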
4 Conclusions
In this paper, we proposed a new pruning method for the MCS based on the SGNT,
called the SONG, and evaluated its computational cost and accuracy. We
introduced on-line and off-line pruning methods and evaluated the SONG by 10-fold
cross-validation. Experimental results showed that the memory requirement
is significantly reduced and that, by using the pruned SGNT as the base classifier of the
SONG, the accuracy is increased. The SONG is a useful and practical MCS for classifying
large datasets. In future work, we will study incremental learning and parallel
and distributed processing of the SONG for large-scale data mining.
References
1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco (2000)
2. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
3. Schapire, R.E.: The strength of weak learnability. Machine Learning 5(2), 197–227 (1990)
4. Quinlan, J.R.: Bagging, Boosting, and C4.5. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR, August 4–8, 1996, pp. 725–730. AAAI Press / The MIT Press (1996)
5. Rätsch, G., Onoda, T., Müller, K.R.: Soft margins for AdaBoost. Machine Learning 42(3), 287–320 (2001)
6. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, New York (1995)
7. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, New York (2000)
8. Wen, W.X., Jennings, A., Liu, H.: Learning a neural tree. In: Proceedings of the International Joint Conference on Neural Networks, Beijing, China, November 3–6, 1992, vol. 2, pp. 751–756 (1992)
9. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1995)
10. Inoue, H., Narihisa, H.: Improving generalization ability of self-generating neural networks through ensemble averaging. In: Terano, T., Liu, H., Chen, A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 177–180. Springer, Heidelberg (2000)
11. Inoue, H., Narihisa, H.: Optimizing a multiple classifier system. In: Ishizuka, M., Sattar, A. (eds.) PRICAI 2002. LNCS (LNAI), vol. 2417, pp. 285–294. Springer, Heidelberg (2002)
12. Stone, M.: Cross-validation: A review. Math. Operationsforsch. Statist. Ser. Statistics 9(1), 127–139 (1978)
13. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
14. Blake, C., Merz, C.: UCI repository of machine learning databases (1998)
15. Patrick, E.A., Fischer, F.P.: A generalized k-nearest neighbor rule. Information and Control 16(2), 128–152 (1970)
16. Zhang, B., Srihari, S.N.: Fast k-nearest neighbor classification using cluster-based trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(4), 525–528 (2004)