Further volumes of this series can be found on our homepage: springeronline.com

Vol. 161. N. Nedjah, L. de Macedo Mourelle (Eds.)
Evolvable Machines, 2005
ISBN 3-540-22905-1

Vol. 162. R. Khosla, N. Ichalkaranje, L.C. Jain
Design of Intelligent Multi-Agent Systems, 2005
ISBN 3-540-22913-2

Vol. 163. A. Ghosh, L.C. Jain (Eds.)
Evolutionary Computation in Data Mining, 2005
ISBN 3-540-22370-3

Vol. 164. M. Nikravesh, L.A. Zadeh, J. Kacprzyk (Eds.)
Soft Computing for Information Processing and Analysis, 2005
ISBN 3-540-22930-2

Vol. 165. A.F. Rocha, E. Massad, A. Pereira Jr.
The Brain: From Fuzzy Arithmetic to Quantum Computing, 2005
ISBN 3-540-21858-0

Vol. 169. C.R. Bector, Suresh Chandra
Fuzzy Mathematical Programming and Fuzzy Matrix Games, 2005
ISBN 3-540-23729-1

Vol. 170. Martin Pelikan
Hierarchical Bayesian Optimization Algorithm, 2005
ISBN 3-540-23774-7

Vol. 171. James J. Buckley
Simulating Fuzzy Systems, 2005
ISBN 3-540-24116-7

Vol. 172. Patricia Melin, Oscar Castillo
Hybrid Intelligent Systems for Pattern Recognition Using Soft Computing, 2005
ISBN 3-540-24121-3

Vol. 173. Bogdan Gabrys, Kauko Leiviskä, Jens Strackeljan (Eds.)
Do Smart Adaptive Systems Exist?, 2005
ISBN 3-540-24077-2
Bogdan Gabrys
Kauko Leiviskä
Jens Strackeljan (Eds.)

Do Smart Adaptive Systems Exist?
Best Practice for Selection and Combination of Intelligent Methods

Springer
Bogdan Gabrys
Bournemouth University
School of Design, Engineering & Computing
Poole House, Talbot Campus, Fern Barrow
Poole, BH12 5BB, U.K.
E-mail: bgabrys@bournemouth.ac.uk

Jens Strackeljan
Universität Magdeburg
Institut für Mechanik
Lehrstuhl für Technische Dynamik
Universitätsplatz 2, 39106 Magdeburg, Germany
E-mail: jens.strackeljan@mb.uni-magdeburg.de

Kauko Leiviskä
University of Oulu
Department of Process Engineering
Control Engineering Laboratory
P.O. Box 4300, 90014 Oulu, Finland
E-mail: kauko.leiviska@oulu.fi
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com

© Springer-Verlag Berlin Heidelberg 2005
Printed in The Netherlands

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: by the authors and TechBooks using a Springer LaTeX macro package
Cover design: E. Kirchner, Springer Heidelberg
Printed on acid-free paper
Preface
At first sight, the title of the present book may seem somewhat unusual,
since it ends with a question: Do Smart Adaptive Systems Exist? We have
deliberately chosen this form for the title, because the content of this book is
intended to elucidate two different aspects: First of all, we wish to define what
is meant by the term “Smart Adaptive Systems”. Furthermore, the question
asked in the title also implies that the applications described must be critically
examined to determine whether they satisfy the requirements imposed on a
smart adaptive system. Many readers will certainly have an intuitive notion of
the concepts associated with the terms “smart” and “adaptive”. Some readers
will probably also think of applications from their own field of work which they
consider to be both smart and adaptive.
Is there any need for a book of this kind? Is an attempt to provide a defin-
ition of terms and to describe methods and applications in this field necessary
at all? Two years ago, we answered this question with an unambiguous “yes”
and also started the book project for this reason. The starting point was the
result of joint activities among the authors in the EUNITE network, which
is dedicated to the topic of smart adaptive systems. EUNITE, the European Network on Intelligent Technologies for Smart Adaptive Systems, was a European Network of Excellence that started in 2001 and ended in mid-2004. It was concerned with intelligent technologies, including neural networks, fuzzy systems, methods from machine learning, and evolutionary computing, which have recently led to many successful industrial applications. Terms and definitions have been the subject of intensive discussions within the scope of this
network. These discussions were necessary because the existence of a gener-
ally accepted definition as a working basis is a prerequisite for joint activity
among the members of a network consisting of scientists and representatives
from industry. Finding such a definition proved to be quite difficult, especially
because of the multiplicity of highly personal opinions which could not be ex-
pressed in concise and conclusive form without contradiction. We hope that
this book will provide an idea of the consensus which has been reached within
EUNITE. Since a large number of European experts in the fields of computer
1
Do Smart Adaptive Systems Exist? – Introduction
B. Gabrys
This chapter serves as an introduction to the book and especially to its first part, entitled "From methods to applications". It begins with a description of the motivations and driving forces behind the compilation of this book and the work within the European Community concerned with smart adaptive systems as the main theme of the EUNITE Network of Excellence. This is followed by a short account of the individual intelligent technologies within the scope of EUNITE and their potential combinations, which are perceived as good candidates for constructing systems with a degree of adaptiveness and intelligence. The chapters in the first part of the book cover some of these intelligent technologies, such as artificial neural networks, fuzzy expert systems, machine and reinforcement learning, evolutionary computing and various hybridizations, in more detail. As proved throughout the life span of EUNITE, it was not an easy task to agree on definitions of what adaptive and smart systems are, and therefore some compromises had to be reached. As a result, the definitions of three levels of adaptivity and some interpretations of the word smart adopted within EUNITE are given first. We then look at more general requirements for intelligent (smart) adaptive systems of the future and discuss some of the issues using the example of the Evolving Connectionist Systems (ECOS) framework. Within this general framework and the scope of the book, the remaining sections concentrate on a short description of existing methods for adaptation and learning, issues to do with model selection and combination, and the conflicting goals of having systems that can adapt to new/changing environments while at the same time having provable stability and robustness characteristics. Pointers to the relevant chapters discussing the highlighted issues in much greater detail are provided throughout this introductory chapter.
1.1 Introduction
Rapid development in computer and sensor technology, no longer limited to highly specialised applications but widespread and pervasive across a wide range of business and industry, has facilitated the easy capture and storage of immense amounts of data. Examples of such data collection include medical history data in health care, financial data in banking, point-of-sale data in retail, plant monitoring data based on the instant availability of various sensor readings in many industries, or airborne hyperspectral imaging data in natural resources identification, to mention only a few. However, with increasing computer power available at affordable prices and the availability of vast amounts of data, there is an increasing need for robust methods and systems which can take advantage of all the available information.
In essence there is a need for intelligent and smart adaptive methods, but do they really exist? Are there any existing intelligent techniques which are more suitable for certain types of problems than others? How do we select those methods, and can we be sure that the method of choice is the best for solving our problem? Do we need a combination of methods and, if so, how best to combine them for different purposes? Are there any generic frameworks and requirements which would be highly desirable for solving data-intensive and non-stationary problems? All these questions and many others have been the focus of research vigorously pursued in many disciplines, and some of them will be addressed in this book.
One of the more promising approaches to constructing smart adaptive sys-
tems is based on intelligent technologies including artificial neural networks
[5, 25, 45], fuzzy systems [4, 31, 41, 46, 54, 55], methods from machine learn-
ing [8, 14, 15, 35, 44], parts of learning theory [51] and evolutionary com-
puting [16, 24, 26, 30] which have been especially successful in applications
where input-output data can be collected but the underlying physical model
is unknown. The incorporation of intelligent technologies has been used in
the conception and design of complex systems in which analytical and expert
systems techniques are used in combination. Viewed from a much broader
perspective, the above mentioned intelligent technologies are constituents of
a very active research area called soft computing (SC) (the terms compu-
tational intelligence and hybrid intelligent systems are also frequently used)
[1, 3, 6, 27, 28, 29, 32, 33, 34, 36, 39, 50, 56]. According to Zadeh [56], who
coined the term soft computing, the most important factor that underlies the
marked increase in machine intelligence nowadays is the use of soft comput-
ing to mimic the ability of the human mind to effectively employ modes of
reasoning that are approximate rather than exact. Unlike traditional hard
computing based on precision, certainty, and rigor, soft computing is tolerant
of imprecision, uncertainty and partial truth. The primary aim of soft comput-
ing is to exploit such tolerance to achieve tractability, robustness, a high level
of machine intelligence, and a low cost in practical applications. Although the
fundamental inspirations for each of the constituent intelligent technologies
are quite different, they share the common ability to improve the intelligence
of systems working in an uncertain, imprecise and noisy environment, and
since they are complementary rather than competitive, it is frequently advan-
tageous to employ them in combination rather than exclusively.
This realisation has also been reflected in various European initiatives,
which have concentrated on bringing together communities working with and
potentially benefiting from the intelligent technologies. The identification of
the potential benefits from integration of intelligent methods within four
thematic networks of excellence for machine learning (MLNet), neural net-
works (NeuroNet), fuzzy logic (Erudit) and evolutionary computing (EvoNet)
initially led to the conception of the cluster on Computational Intelligence and Learning (CoIL) [50] and the subsequent launch of EUNITE.
But before continuing with the theme of integration and hybridization within the context of smart adaptive systems, let us take a brief look at these intelligent technologies and their frequently nature-inspired origins.
Artificial neural networks (ANN) [5, 25, 45] have been inspired by biological neural systems, and the earliest ideas date back to 1943, when McCulloch and Pitts introduced a model of an artificial neuron. Since then many researchers have been exploring artificial neural networks as a novel and very powerful non-algorithmic approach to information processing. The underlying presumption of investigations involving ANNs has been the notion that simulating the connectionist architectures of our brains, supplemented by various learning algorithms, can lead to the emergence of mechanisms simulating intelligent behaviour. As highlighted in Chap. 4, which discusses ANNs in much greater detail, they are not truly SAS despite being based on models of a truly adaptive system – the brain. However, without oversimplifying, in the context of SAS probably the most important features of ANNs are the various learning algorithms, and Chap. 4 discusses a number of such algorithms developed for different types of ANNs and falling into two main groupings:
supervised and unsupervised learning algorithms. On the other hand, one of
Machine learning [8, 14, 15, 35, 44] has its roots in artificial intelligence, and one of its central goals from its conception has been the reproduction of human learning performance and the modelling of intelligence. While modelling human learning seems to be a secondary issue in the current stream of ML research, the recent focus has been on the analysis of very large data sets, known as Data Mining and Information Retrieval, using varying sources and types of data (please see Chaps. 5 and 17 for further details). What distinguishes ML approaches, and can be considered one of their main strengths, are the rigorous, principled and statistically based methods for model building, selection and performance estimation, which are related to the most fundamental issue of data overfitting versus generalisation when building models directly from data. More details can be found in Chap. 5 and some additional discussion will be provided in further sections of this chapter. Chapter 5 will also provide some discussion of reinforcement learning, which is particularly useful in non-stationary problems and dynamically changing environments. Reinforcement learning has been frequently quoted as the third major group of learning algorithms in addition to supervised and unsupervised approaches.
One of the basic traits of intelligent behaviour is the ability to reason and draw conclusions. Conventional AI research focused on capturing and manipulating knowledge in a symbolic form. One of the most successful practical outcomes of conventional AI research was knowledge-based or expert systems, which proved especially useful in some narrowly defined areas. The modelling of human reasoning and uncertainty was also the main motivation for the development of fuzzy set theory by Zadeh [54] and its subsequent modifications and extensions by a large number of other researchers [4, 31, 41, 46]. In contrast to classical set theory, which is based on two-valued logic (true or false), fuzzy set theory is based on degrees of truth (membership values) and provides a systematic framework and calculus for dealing with uncertain and ambiguous data and information. Fuzzy sets provided a powerful extension to classical expert systems. Fuzzy inference systems (FIS) have proved to be an extremely powerful and successful tool for the effective modelling of human expertise in some specific applications. While one of the main advantages of fuzzy expert systems is their interpretability and ability to cope with uncertain and ambiguous data, one of their serious drawbacks, as far as SAS are concerned, is a complete lack of adaptability to changing environments. Further details concerning fuzzy expert systems can be found in Chap. 6.
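To make the degrees-of-truth idea above concrete, here is a minimal, purely illustrative sketch of fuzzy inference in Python; the membership functions, the two rules and the weighted-average defuzzification are assumptions chosen for this example and are not the specific systems of Chap. 6.

```python
# Minimal fuzzy inference sketch (illustrative assumptions only):
# two rules over a single "temperature" input, triangular memberships,
# and a weighted-average (Sugeno-style) defuzzification.

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def infer(temperature):
    # Degrees of truth for the fuzzy sets "warm" and "hot" (hypothetical shapes).
    warm = tri(temperature, 15.0, 22.0, 30.0)
    hot = tri(temperature, 25.0, 35.0, 45.0)
    # Rule consequents given as crisp fan-speed levels in the range 0..100.
    rules = [(warm, 40.0), (hot, 90.0)]
    total_weight = sum(w for w, _ in rules)
    if total_weight == 0.0:
        return 0.0  # no rule fires
    # Weighted-average defuzzification of the fired rules.
    return sum(w * out for w, out in rules) / total_weight

if __name__ == "__main__":
    for t in (18.0, 27.0, 38.0):
        print(t, round(infer(t), 1))
```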
Neuro-fuzzy systems [2, 18, 19, 20, 21, 22, 23, 27, 34, 37, 38] are probably
the most extensively researched combination of intelligent techniques with
a number of demonstrated successes. The main motivation for such a combination is that the learning ability of neural networks complements the interpretability of fuzzy systems and their ability to deal with uncertain and imprecise data. Further details on learning algorithms for fuzzy systems are covered in Chap. 7.
Some of the other commonly used hybridizations include the use of evolutionary algorithms for the optimization of the structures and parameters of both
neural networks (evolutionary neural networks [53]) and fuzzy systems (evo-
lutionary fuzzy systems [40]). A combination of three or more techniques in
the quest for smart adaptive systems is not uncommon as also illustrated in
Chaps. 8 and 9.
However, as correctly pointed out in [1], hybrid soft computing frameworks are relatively young, even compared to the individual constituent technologies, and a lot of research is required to understand their strengths and weaknesses. Nevertheless, hybridization and combination of intelligent technologies within a flexible open framework like ECOS [28], discussed in the following sections and in Chap. 9, seems to be the most promising direction for achieving truly smart and adaptive systems today.
But what exactly do we mean by smart adaptive systems? To put some more context behind this quite broad phrase, which can mean many different things in different situations, let us give some definitions which, though clearly not ideal and potentially debatable, will hopefully clarify and focus a little more the material covered in the following chapters.
As part of the work within EUNITE, the following formal definitions of the words "adaptive" and "smart" have been adopted in the context of smart adaptive systems.
Since there are systems with different levels of adaptivity in existence, for our purposes the word "adaptive" has been defined on the following three levels, which also imply increasingly challenging applications:
And in fact Kasabov [27] listed seven major requirements of future intelli-
gent systems and proposed an open framework called Evolving Connectionist
System (ECOS) [27, 28], some advanced aspects of which are discussed in
Chap. 9. We will now use both the requirements and the ECOS framework as
the basis for our discussion as, though with a number of open problems still
to be solved, they are the closest yet to the level 3 smart adaptive system as
defined in the previous section.
Seven major requirements for future intelligent/smart adaptive systems:
1. SAS should have an open, extendable and adjustable structure. This means that the system has to have the ability to create new inputs and outputs, connections, modules etc. while in operation. It should also be able to accommodate, in an incremental way, all the data that is known and will become known in the future about the problem.
2. SAS should adapt in an on-line, incremental, life-long mode where new data is used as soon as it is available.
3. SAS should be able to learn fast from large amounts of data, ideally in a "one-pass" training mode.
4. SAS should be memory-based, with an ability to add, retrieve and delete individual pieces of data and information.
5. SAS should be able to improve its performance via active interaction with other systems and with the environment in a multi-modular, hierarchical fashion.
6. SAS should adequately represent space and time at their different scales: inner spatial representation, short- and long-term memory, age, forgetting, etc.
7. SAS should be able to self-improve, analyse its own performance and explain what it has learnt about the problem it is solving.
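Read purely as an illustration, the seven requirements above can be mapped onto a minimal programming interface; the sketch below is an assumption-laden reading of requirements 1–4 and 7, with hypothetical class and method names that do not correspond to any framework described in this book.

```python
# Hypothetical, minimal interface suggested by requirements 1-4 and 7 above.
# Names and structure are illustrative assumptions, not an existing framework.
from abc import ABC, abstractmethod
from typing import Any, Dict, Sequence

class SmartAdaptiveSystem(ABC):
    @abstractmethod
    def add_module(self, name: str, module: Any) -> None:
        """Req. 1: extend the open structure with a new module while in operation."""

    @abstractmethod
    def partial_fit(self, inputs: Sequence[Dict[str, float]],
                    targets: Sequence[float]) -> None:
        """Req. 2 and 3: incremental, one-pass style learning on newly arrived data."""

    @abstractmethod
    def store_example(self, example: Dict[str, float]) -> int:
        """Req. 4: memory-based operation; returns an id so the stored example
        can later be retrieved or deleted."""

    @abstractmethod
    def explain(self) -> str:
        """Req. 7: report what the system has learnt about the problem so far."""
```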
While many of the above issues have been acknowledged and addressed by
researchers working with intelligent technologies since their conceptions, the
focus tends to be on the individual or small subsets of the above listed points.
The reason for this is that any of the issues listed are often difficult enough in
themselves and either the structure, learning or the knowledge representation
of the existing techniques are not adequate or flexible enough to cover all of
them. And, as correctly pointed out in [27], it is not likely that a truly smart adaptive system can be constructed if all of the above seven requirements are not met, and therefore radically new methods and systems are required.
One such attempt at proposing a model or a general framework that ad-
dresses, at least in principle, all seven requirements is the ECOS framework.
ECOS are multi-level, multi-modular, open structures which evolve in time through interaction with the environment.
There are five main parts of ECOS, which include:
1. Presentation part. In this part, input filtering, preprocessing, feature selection, input formation, etc. are performed. What is interesting, and something that would naturally be a consequence of acquiring new data and
One of the most underrated but absolutely critical phases of designing SAS is data preparation and preprocessing [43]. In order to highlight the importance of this stage, as part of the introductory material to the main two parts of the book, Chap. 3 has been entirely dedicated to these issues.
Why are they so important? Since the changing environments requiring adaptation usually manifest themselves in changes in the number and type of observed variables (see requirement 1 and the presentation part of ECOS), it is particularly important to know the implications these changes could have on preprocessing (e.g. dealing with noise, missing values, scaling, normalisation, etc.) and data transformation techniques (e.g. dimensionality reduction, feature selection, non-linear transformations, etc.), which in turn have an enormous impact on the selection and usage of appropriate modelling techniques. This is even more critical because this stage in a SAS application should be performed in an automatic fashion, while currently, in practice, the vast majority of the data cleaning, preprocessing and transformation is carried out off-line and usually involves considerable human expertise.
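As a small sketch of the kind of automatic, on-line preprocessing argued for above (and not the specific procedures of Chap. 3), the following incremental standardizer updates its statistics as data arrives and imputes missing values with the running mean; the choice of Welford's update and of mean imputation are assumptions made for illustration.

```python
# Sketch of on-line preprocessing: an incremental standardizer that updates
# its statistics as data arrives and imputes missing values (None) with the
# running mean. Illustrative only.
import math

class OnlineStandardizer:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def update(self, x):
        """Incorporate one new observation; missing values are ignored."""
        if x is None:
            return
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        """Standardize a value; missing values are imputed with the running mean."""
        if x is None:
            x = self.mean
        std = math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 1.0
        return (x - self.mean) / std if std > 0 else 0.0

scaler = OnlineStandardizer()
for value in [4.0, None, 5.5, 6.1, 3.9]:
    scaler.update(value)
    print(round(scaler.transform(value), 3))
```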
Not surprisingly, adaptation and learning algorithms should play a prominent role in SAS and, in fact, in one form or another are represented in all seven requirements.
There are currently two main groups of search, adaptation and learning
algorithms: local (often gradient based) search/learning algorithms and global
(stochastic) approaches.
The main learning algorithms in the first group have been developed over
the years in statistics, neural networks and machine learning fields [4, 5, 13, 44,
45, 52]. Many of the methods are based on the optimization/minimization of a
suitable cost, objective or error function. Often those algorithms also require
the cost function to be differentiable and many gradient based optimization
techniques have been used to derive learning rules on such basis. Unfortunately
such differentiable cost functions do not always exist.
Evolutionary algorithms and other stochastic search techniques [16, 24, 26, 30] provide a complementary and powerful means of global search, often used where gradient-based techniques fail. As mentioned earlier, these global searches are usually less computationally efficient than gradient-based learning techniques.
Due to their complementary nature it is more and more common that both
types of algorithms are used in combination for SAS as illustrated in Chaps. 8
and 9.
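To make the contrast between the two groups concrete, the sketch below minimizes the same one-dimensional quadratic cost first with a gradient step and then with a simple mutation-based stochastic search; the cost function, learning rate and mutation scale are arbitrary choices made for illustration.

```python
# Illustrative comparison of the two groups of learning algorithms described
# above: a gradient-based update and a simple (1+1)-style stochastic search.
# The quadratic cost and all step sizes are arbitrary assumptions.
import random

def cost(w):
    return (w - 3.0) ** 2          # differentiable cost with minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # analytic gradient of the cost

# Local, gradient-based search.
w = 10.0
for _ in range(50):
    w -= 0.1 * grad(w)             # fixed learning rate of 0.1
print("gradient descent:", round(w, 3))

# Global, stochastic search: keep a mutation only if it improves the cost.
random.seed(0)
w = 10.0
for _ in range(500):
    candidate = w + random.gauss(0.0, 0.5)
    if cost(candidate) < cost(w):
        w = candidate
print("stochastic search:", round(w, 3))
```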
As highlighted above the different forms and modes of adaptation and learning
are pivotal in creating SAS and span all seven requirements. However, all
those learning procedures are closely associated with the chosen SAS model
representation. This is reflected in requirements 1, 4, 5, 6 and 7 focusing
on the need for open, extendable and adjustable structures, memory based,
multi-modular, hierarchical and interpretable representations of the data and
knowledge stored and manipulated in SAS frameworks.
Though the representation and memory parts of the ECOS framework do not stipulate the use of any particular models, and in principle any number and mixture of ANN, ML or HS based modules could be used, in an attempt to cover all those requirements neuro-fuzzy models have emerged as the most prominent and most often used realisations of the ECOS framework.
One reason, highlighted in Chap. 7, is that neuro-fuzzy approaches with their rule-based structure lead to interpretable solutions, while the ability to incorporate expert knowledge and learn directly from data at the same time allows them to bridge the gap between purely knowledge-based methods (like the ones discussed in Chap. 6) and purely data-driven methods (like, for instance, the ANNs covered in Chap. 4).
There are many other desirable features of the fuzzy set/rule base repre-
sentation with respect to the requirements for SAS discussed in Chaps. 7, 8
and 9.
Having said that, there are many methods from statistics, ANN, ML, FS and HS that have equivalent functional characteristics (classification methods are one typical example), and choosing among them for different applications can pose a considerable problem.
So are there any good guidelines or criteria for selection of suitable meth-
ods? A number of examples can be found in Chaps. 10 to 17 which attempt
to answer exactly this question from different application domain points of
view but let us now consider some important issues and procedures that can
be applied for model selection and combination.
One of the main problems with the flexible-structure models which need to be used in the SAS framework is the suitable selection of appropriate models and their combination and fusion. There are a number of potential criteria that can be used for model selection, but probably the most commonly used is the performance of the models as estimated during the overall selection process, followed by the second criterion of interpretability or transparency of the models. Please see Chap. 15 for more details of the trade-off between accuracy and interpretability.
Among the strongest, principled and statistically based approaches for model error estimation are the various methods based on statistical resampling and cross-validation [52], which are commonly used in the ML community but have not been as rigorously applied with some other intelligent methods.
The performance on the data for which models have been developed (the training data) can often be misleading, and minimization of the error on the training set can lead to the problem of data overfitting and poor generalisation performance on unseen future data. This is quite critical, but since cross-validation approaches can only be used off-line with known, labelled data sets, there can be problems with their direct application to dynamically changing environments and some compromises have to be made.
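A minimal sketch of the resampling idea referred to above: plain k-fold cross-validation used off-line to estimate and compare the error of two candidate models. The data, the two candidate models and the fold count are invented for illustration; standard ML libraries provide more complete versions of this procedure.

```python
# Minimal k-fold cross-validation sketch for comparing candidate models.
# Data, models and the fold count are invented purely for illustration.
import random

def k_fold_error(model_fit, model_predict, xs, ys, k=5):
    """Average squared error over k folds; model_fit returns a fitted state."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        state = model_fit([xs[i] for i in train], [ys[i] for i in train])
        err = [(model_predict(state, xs[i]) - ys[i]) ** 2 for i in fold]
        errors.append(sum(err) / len(err))
    return sum(errors) / k

# Candidate 1: predict the training mean (a deliberately weak baseline).
fit_mean = lambda xs, ys: sum(ys) / len(ys)
pred_mean = lambda state, x: state

# Candidate 2: least-squares line fitted to the training data.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return (my - b * mx, b)
pred_line = lambda state, x: state[0] + state[1] * x

random.seed(1)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 3.0) for x in xs]
print("mean model CV error:", round(k_fold_error(fit_mean, pred_mean, xs, ys), 2))
print("line model CV error:", round(k_fold_error(fit_line, pred_line, xs, ys), 2))
```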
One of the most common problems and challenges with the ECOS framework and its flexible structure is that a large number of parameters need to be adjusted continuously (see Chap. 9) for the model to work well. This adjusting of the parameters is not a trivial but in fact a very critical process.
One approach to reducing the influence of individual parameters and improving the overall stability and performance of the system is to use multi-component, aggregated or combined methods like the ensembles of neural networks discussed in Chap. 4 or a number of other multi-classifier, multi-predictor etc. methods [7, 9, 12, 18, 48]. The adaptation in such multi-component structures can be much more difficult than with individual models, though.
In general, however, the combination of models can be seen as a good way of generating more robust, less parameter-sensitive solutions.
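As a toy illustration of the combination idea (not the specific ensembles of Chap. 4), the sketch below averages the outputs of three hypothetical base predictors whose individual systematic errors partly cancel; all predictors and numbers are invented.

```python
# Sketch of model combination: averaging the outputs of several simple base
# predictors reduces sensitivity to any single model's parameters and biases.
# The three base predictors are invented for illustration.
import random

def make_noisy_predictor(bias, noise):
    """Returns a hypothetical base predictor with its own systematic error."""
    return lambda x: 2.0 * x + bias + random.gauss(0.0, noise)

random.seed(7)
predictors = [make_noisy_predictor(b, 1.0) for b in (-1.5, 0.5, 1.0)]

def ensemble_predict(x):
    outputs = [p(x) for p in predictors]
    return sum(outputs) / len(outputs)   # simple unweighted averaging

x, true_y = 4.0, 8.0
individual_errors = [abs(p(x) - true_y) for p in predictors]
print("individual errors:", [round(e, 2) for e in individual_errors])
print("ensemble error:   ", round(abs(ensemble_predict(x) - true_y), 2))
```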
Another significant challenge is the fusion of different types of models (e.g.
models for recognition of face, gestures, sound etc.) and the adaptation of the
fusion methods themselves.
Self-optimization of whole hierarchies and multiple units at the same time is currently a topic of challenging and interesting research, and some examples are illustrated in Chaps. 8 and 9.
1.6 Conclusions
References
1. A. Abraham, “Intelligent Systems: Architectures and Perspectives”, Recent Ad-
vances in Intelligent Paradigms and Applications, Abraham A., Jain L. and
Kacprzyk J. (Eds.), Studies in Fuzziness and Soft Computing, Springer Verlag
Germany, Chap. 1, pp. 1-35, Aug. 2002
2. A. Abraham, “EvoNF: A Framework for Optimization of Fuzzy Inference Sys-
tems Using Neural Network Learning and Evolutionary Computation”, 2002
IEEE International Symposium on Intelligent Control (ISIC’02), Canada, IEEE
Press, pp. 327-332, 2002
3. B. Azvine and W. Wobcke, “Human-centered Intelligent Systems and Soft Com-
puting”, BT Technology Journal, vol. 16, no. 3, July 1998
4. J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms,
Plenum, New York, 1981
5. C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press:
Oxford, 1995
6. P. Bonissone, Y-T. Chen, K. Goebel, P.S. Khedkar, “Hybrid Soft Computing
Systems: Industrial and Commercial Applications”, Proc. Of the IEEE, vol. 87,
no. 9, Sep. 1999
7. L. Breiman, “Bagging predictors”, Machine Learning, vol. 24, no. 2, pp. 123-140,
1996
8. T.G. Dietterich, “Machine Learning Research: Four Current Directions”, AI
Magazine, vol. 18, no. 4, pp. 97-136, 1997
9. T.G. Dietterich, “An Experimental Comparison of Three Methods for Con-
structing Ensembles of Decision Trees: Bagging, Boosting, and Randomization”,
Machine Learning, vol. 40, pp. 139-157, 2000
10. C. Doom, “Get Smart: How Intelligent Technology Will Enhance Our World”,
CSC’s Leading Edge Forum Report (www.csc.com), 2001
11. J. Doyle, T. Dean et al., “Strategic Directions in Artificial Intelligence”, ACM
Computing Surveys, vol. 28, no. 4, Dec. 1996
12. H. Drucker, C. Cortes, L.D. Jackel, Y. LeCun, and V. Vapnik, “Boosting and
other ensemble methods”, Neural Computation, vol. 6, pp. 1289-1301, 1994
13. R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons,
Inc., 2000
14. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., Ad-
vances in Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI/MIT
Press, 1996
15. U. Fayyad, G.P. Shapiro, and P. Smyth, "The KDD process for extracting useful
knowledge from volumes of data", Commun. ACM, vol. 39, pp. 27-34, 1996
16. D.B. Fogel, Evolving neural networks, Biol. Cybern., vol. 63, pp. 487-493, 1990
17. B. Gabrys and L. Petrakieva, “Combining labelled and unlabelled data in the
design of pattern classification systems”, International Journal of Approximate
Reasoning, vol. 35, no. 3, pp. 251-273, 2004
18. B. Gabrys, “Learning Hybrid Neuro-Fuzzy Classifier Models From Data: To
Combine or not to Combine?”, Fuzzy Sets and Systems, vol. 147, pp. 39-56,
2004
19. B. Gabrys, “Neuro-Fuzzy Approach to Processing Inputs with Missing Values
in Pattern Recognition Problems”, International Journal of Approximate Rea-
soning, vol. 30, no. 3, pp. 149-179, 2002
20. B. Gabrys, “Agglomerative Learning Algorithms for General Fuzzy Min-Max
Neural Network”, the special issue of the Journal of VLSI Signal Processing
Systems entitled “Advances in Neural Networks for Signal Processing”, vol. 32,
no. 1/2, pp. 67-82, 2002
21. B. Gabrys, “Combining Neuro-Fuzzy Classifiers for Improved Generalisation
and Reliability”, Proceedings of the Int. Joint Conference on Neural Networks,
(IJCNN’2002) a part of the WCCI’2002 Congress, ISBN: 0-7803-7278-6, pp.
2410-2415, Honolulu, USA, May 2002
22. B. Gabrys and A. Bargiela, “General Fuzzy Min-Max Neural Network for Clus-
tering and Classification”, IEEE Transactions on Neural Networks, vol. 11, no. 3,
pp. 769-783, 2000
23. B. Gabrys and A. Bargiela, “Neural Networks Based Decision Support in Pres-
ence of Uncertainties”, ASCE J. of Water Resources Planning and Management,
vol. 125, no. 5, pp. 272-280, 1999
24. D.E. Goldberg, Genetic Algorithms in Search, Optimization and machine Learn-
ing, Addison-Wesley, Reading, MA, 1989
25. S. Haykin, Neural Networks. A Comprehensive Foundation, Macmillan College
Publishing Company, New York, 1994
26. J.H. Holland, Adaptation in natural and artificial systems, The University of
Michigan Press, Ann Arbor, MI, 1975
27. N. Kasabov and R. Kozma (Eds.), Neuro-Fuzzy Techniques for Intelligent In-
formation Systems, Physica-Verlag, 1999
28. N. Kasabov, Evolving connectionist systems – methods and applications in
bioinformatics, brain study and intelligent machines, Springer Verlag, London-
New York, 2002
29. R. Khosla and T. Dillon, Engineering Intelligent Hybrid Multi-Agent Systems,
Kluwer Academic Publishers, 1997
30. J.R. Koza, Genetic Programming, MIT Press, 1992
31. L.I. Kuncheva, Fuzzy Classifier Design, Physica-Verlag Heidelberg, 2000
32. C.T. Leondes (Ed.), Intelligent Systems: Technology and Applications (vol. 1
– Implementation Techniques; vol. 2 – Fuzzy Systems, Neural Networks, and
Expert Systems; vol. 3 – Signal, Image and Speech Processing; vol. 4 – Database
and Learning Systems; vol. 5 – Manufacturing, Industrial, and Management
Systems; vol. 6 – Control and Electric Power Systems), CRC Press, 2002
33. L.R. Medsker, Hybrid Intelligent Systems, Kluwer Academic Publishers, 1995
34. S. Mitra and Y. Hayashi, “Neuro-Fuzzy Rule Generation: Survey in Soft Com-
puting Framework”, IEEE Trans. on Neural Networks, vol. 11, no. 3, pp. 748-
768, May 2000
35. T.M. Mitchell, “Machine learning and data mining”, Commun. ACM, vol. 42,
no. 11, pp. 477-481, 1999
36. S. Mitra, S.K. Pal and P. Mitra, “Data Mining in Soft Computing Framework:
A Survey”, IEEE Trans. on Neural Networks, vol. 13, no. 1, pp. 3-14, 2002
37. D. Nauck and R. Kruse, "A neuro-fuzzy method to learn fuzzy classification rules
from data”, Fuzzy Sets and Systems, vol. 89, no. 3, pp. 277-288, 1997
38. D. Nauck and R. Kruse. “Neuro-fuzzy systems for function approximation”,
Fuzzy Sets and Systems, vol. 101, pp. 261-271, 1999
39. S. Pal, V. Talwar, P. Mitra, “Web Mining in Soft Computing Framework: Rel-
evance, State of the Art and Future Directions”, To appear in IEEE Trans. on
Neural Networks, 2002
40. W. Pedrycz (Ed.), Fuzzy Evolutionary Computation, Kluwer Academic Pub-
lishers, USA, 1997
41. W. Pedrycz and F. Gomide, An Introduction to Fuzzy Sets: Analysis and Design,
The MIT Press, 1998
42. L. Petrakieva and B. Gabrys, “Selective Sampling for Combined Learning from
Labelled and Unlabelled Data”, In the book on “Applications and Science in Soft
Computing” published in the Springer Series on Advances in Soft Computing,
Lotfi, A. and Garibaldi, J.M. (Eds.), ISBN: 3-540-40856-8, pp. 139-148, 2004
43. D. Pyle, Data Preparation for Data Mining, Morgan Kaufman Publishers, 1999
44. J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufman, San
Mateo, CA, 1993
45. B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University
Press, 1996
46. E.H. Ruspini, P.P. Bonissone, and W. Pedrycz (Eds.), Handbook of Fuzzy Com-
putation, Institute of Physics, Bristol, UK, 1998
47. D. Ruta and B. Gabrys, "An Overview of Classifier Fusion Methods", Computing and Information Systems (Ed. Prof. M. Crowe), University of Paisley, vol. 7, no. 1, pp. 1-10, 2000
48. D. Ruta and B. Gabrys, "Classifier Selection for Majority Voting", Information Fusion, Special Issue on Diversity in Multiple Classifier Systems, vol. 6, no. 1, pp. 63-81, 2004
49. R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT
Press, 1998
50. G. Tselentis and M. van Someren, “A Technological Roadmap for Computa-
tional Intelligence and Learning”, Available at www.dcs.napier.ac.uk/coilweb/,
June 2000
51. V. Vapnik, The nature of statistical learning theory, Springer Verlag, New York,
1995
52. S.M. Weiss and C.A. Kulikowski, Computer Systems That Learn: Classification
and Prediction Methods from Statistics, Neural Nets, Machine Learning, and
Expert Systems, Morgan Kaufman Publishers, Inc., 1991
53. X. Yao, “Evolving Artificial Neural Networks”, Proc. of the IEEE, vol. 87, no. 9,
pp. 1423-47, Sep. 1999
54. L. Zadeh, “Fuzzy sets”, Information and Control, vol. 8, pp. 338-353, 1965
55. L. Zadeh, “Fuzzy logic and the calculi of fuzzy rules and fuzzy graphs: A precis”.
Int. J. Multiple-Valued Logic, vol. 1, pp. 1-38, 1996
56. L. Zadeh, “Soft Computing and Fuzzy Logic”, IEEE Software, pp. 48-56, Nov.
1994
2
Problem Definition – From Applications
to Methods
K. Leiviskä
Are there any general features that make some applications especially suitable for Smart Adaptive Systems (SAS)? These systems are now technologically available, even though commercial applications are still few. And what is essential when starting to decide, define and design this kind of application? What makes it different from the usual IT project? Does adaptation itself mean increased complexity or a need for special care? This chapter tries to answer these questions at a general level under the title "Problem Definition – From Applications to Methods", which also serves as an introduction to the second part of the book. Each case has its own requirements, but some common themes come out. The author's own experience is in industrial systems, which is, inevitably, visible in the text.
Section 2.1 looks at adaptation from the Control Engineering point of view; it is an area where practical applications have existed for four decades and some analogies exist with other application areas. Intelligent methods are also used there, and therefore we can speak about genuine SAS applications in control. Section 2.2 looks at the need for adaptation in different disciplines and tries to characterise fruitful application areas together with their economic potential. Section 2.3 lists some topics that should be kept in mind when defining the SAS problem. Section 2.4 is a short summary.
2.1 Introduction
Many applications of intelligent methods, including fuzzy logic, neural networks, methods from machine learning, and evolutionary computing, have recently been launched, especially in cases where an explicit analytical model is difficult to obtain. Examples come from the domains of industrial production, economy, healthcare, transportation, and customer service applications in communications, especially the Internet. While being powerful and contributing to increased efficiency and economy, most solutions using intelligent methods lack one important property: they are not adaptive (or not adaptive
that is becoming more and more common in every industry sets the most
stringent requirements.
In many economic systems, e.g. in banking and insurance, varying customer demands and profiles have the same kind of effect on the need to adapt to changing situations. Security reasons, e.g. fighting fraud, also come into the picture here. In healthcare, each human is a case of his or her own, requiring individual medication and care in different stages of life. Safety seems to work in two opposite directions in healthcare: increasing safety would require adaptation, but in many cases, instead of automatic adaptation, a system tailored for each patient is needed.
All the abovementioned needs seem to come together on the Internet. There are many customers and customer groups with different needs. Every individual would like to have the system tailored to his/her own needs. And the safety aspect must be taken into account in every application.
The effect can be approached from many directions, but only the economic effects are accounted for here. They usually offer the strongest criteria for the selection of adaptive systems, usually also coming before the need itself. Of course, the effects on users, customers, image, etc. cannot be underestimated.
The economic benefits of SAS seem to come from the same sources as other IT investments:
• Increased throughput, better quality and higher efficiency, which directly affect the economy of the company and can usually be expressed in money.
• Better quality of operations and services, which affects the market position and brings economic benefits.
• Improved working conditions and safety, which increase staff motivation and also improve the company image.
• Improved overall performance.
There are some common features in companies and actors that can gain from using SAS¹:
• Tightening quality requirements, characteristic of high-quality products and services, together with requirements to customise the products depending on the customer [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].
• Increasing throughput, meaning high capacity requirements on the process or mass productisation, and services offered to big audiences (e.g. on the Internet) or to a high number of customers (air traffic) [15, 16, 17, 18, 19, 20].
• Complicated functional structures, e.g. industries with a high number of processes, mills, customers, and products, or services offered to a high number of customers with varying customer profiles, needs and capabilities in using IT applications [21, 22, 23, 24, 25, 26, 27, 28, 29, 30].
• Capital-intensive investments including high economic risk in production decisions [31, 32, 33].
¹ The references given here are from the three Eunite Annual Symposia.
2.3.1 DETECT
DETECT means monitoring of the signals (in a wide sense) that tell when the adaptation activity is triggered. First, the signals must be named. In technical applications, this means the definition of external measurements, grade change times, etc. In other applications, e.g. the development of certain features in customer profiles is followed. The procedures to assure data quality and security must be defined in the same connection, together with possible thresholds or safety limits that must be exceeded before the adaptation is triggered.
Adaptation may also be triggered in the case of deterioration in the system
performance. The performance factor should be defined together with the
limits that cannot be exceeded without adaptation in the system structure or
parameters.
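As a schematic illustration of the DETECT step, and not a method prescribed in this chapter, the sketch below monitors a running performance (error) signal and raises an adaptation trigger when its recent average exceeds a safety limit; the window length, threshold and simulated error stream are assumptions.

```python
# Schematic DETECT step: watch a performance signal and trigger adaptation
# when its recent average deteriorates beyond a threshold. The window size,
# threshold and simulated error stream are illustrative assumptions.
from collections import deque

class DeteriorationDetector:
    def __init__(self, window=20, threshold=1.5):
        self.window = deque(maxlen=window)  # most recent performance values
        self.threshold = threshold          # limit that must not be exceeded

    def update(self, error):
        """Add a new error measurement; return True if adaptation is triggered."""
        self.window.append(error)
        if len(self.window) < self.window.maxlen:
            return False                    # not enough evidence yet
        return sum(self.window) / len(self.window) > self.threshold

detector = DeteriorationDetector(window=10, threshold=1.5)
stream = [1.0] * 30 + [2.2] * 15            # simulated rise in the error level
for step, err in enumerate(stream):
    if detector.update(err):
        print("adaptation triggered at step", step)
        break
```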
2.3.2 ADAPT
ADAPT means the mechanism used in changing the system. The calculation
routines or methods must be defined in the early stage of the design. Adapta-
tion is in many cases based on models, and the conventional stages of model
2.3.3 IMPLEMENT
IMPLEMENT means the way in which the actual change in the system is carried out, and these mechanisms must be defined carefully case by case. In technical systems, IMPLEMENT is usually integrated with ADAPT, but in web-based service applications it would mean that the information shown or the activities allowed are adapted based on changes in the customer structure or in the services required. It is quite curious to see that safety plays a remarkable role in both kinds of applications. In technical systems the question is how to secure the system integrity and prevent changes that would be harmful or impossible to realize. In other systems, changes that would endanger e.g. customer privacy should not be allowed.
2.3.4 INDICATE
The act of adaptation should be transparent to the user. He/she should know when changes are made and how they affect the system performance. This is especially important in cases where the user can interfere with the system operation and, through his/her own activity, cancel the changes made by the adaptive system. This element calls for careful user interface design and user involvement in the early design phases of the whole system. It also underlines the need for user training.
The ways to indicate are many. In simple process controllers, a green lamp shows that the adaptive controller is on. In other systems, the changes may be recorded in the system log. In some cases, the user's acceptance may also be asked for before implementing the change.
2.3.5 EVALUATE
2.4 Summary
This chapter introduced some special themes necessary in defining the SAS
problem. Many points of view are possible in deciding, defining and designing
Smart Adaptive Systems. The need for adaptivity and its costs and benefits come first. Next, some practical aspects that make a SAS project different from the usual IT project must be recognized.
One last comment is necessary. The availability of systems and tools is
crucial for gaining further ground for smart adaptive systems. Pioneers are, of
course, needed, but established, in-practice-proven, commercial systems are
needed in the market. This fact is clearly visible when looking at the life cycle
of any new paradigm from an idea to the product.
References²
1. Garrigues, S., Talou, T., Nesa, D. (2001): Neural Networks for QC in Automo-
tive Industry: Application to the Discrimination of Polypropylene Materials by
Electronic Nose based on Fingerprint Mass Spectrometry. Eunite 2001.
2. Rodriguez, M.T., Ortega, F., Rendueles, J.L., Menendez, C. (2002): Determi-
nation of the Quality of Hot Rolled Steel Strips With Multivariate Adaptive
Methods. Eunite 2002.
3. Oestergaard, J-J. (2003): FuzEvent & Adaptation of Operators Decisions.
Eunite 2003.
4. Rodriguez, M.T., Ortega, F., Rendueles, J.L., Menéndez, C. (2003): Combina-
tion of Multivariate Adaptive Techniques and Neural Networks for Prediction
and Control of Internal Cleanliness in Steel Strips. Eunite 2003.
5. Leiviskä, K., Posio, J., Tirri, T., Lehto, J. (2002): Integrated Control System
for Peroxide Bleaching of Pulps. Eunite 2002.
6. Zacharias, J., Hartmann, C., Delgado, A. (2001): Recognition of Damages on
Crates of Beverages by an Artificial Neural Network. Eunite 2001.
7. Linkens, D., Kwok, H.F., Mahfouf, M., Mills, G. (2001): An Adaptive Approach
to Respiratory Shunt Estimation for a Decision Support System in ICU. Eunite
2001.
8. Chassin, L. et al. (2001): Simulating Control of Glucose Concentration in Sub-
jects with Type 1 Diabetes: Example of Control System with Inter-and Intra-
Individual Variations. Eunite 2001.
9. Mitzschke, D., Oehme, B., Strackeljan, J. (2002): User-Dependent Adaptability
for Implementation of an Intelligent Dental Device. Eunite 2002.
10. Ghinea, G., Magoulas, G.D., Frank, A.O. (2002): Intelligent Multimedia Trans-
mission for Back Pain Treatment. Eunite 2002.
² Note: References are to the three Eunite Annual Symposium CDs: Eunite 2001 Annual Symposium, Dec. 13–14, 2001, Puerto de la Cruz, Tenerife, Spain; Eunite 2002 Annual Symposium, Sept. 19–21, 2002, Albufeira, Algarve, Portugal; Eunite 2003 Annual Symposium, July 10–11, 2003, Oulu, Finland.
11. Emiris, D.M., Koulouritis, D.E. (2001): A Multi-Level Fuzzy System for the
Measurement of Quality of Telecom Services at the Customer Level. Eunite
2001.
12. Nürnberger, A., Detyniecki, M. (2002): User Adaptive Methods for Interactive
Analysis of Document Databases. Eunite 2002.
13. Raouzaiou, A., Tsapatsoulis, N., Tzouvaras, V., Stamou, G., Kollias, S. (2002):
A Hybrid Intelligence System for Facial Expression Recognition. Eunite 2002.
14. Balomenos, T., Raouzaiou, A., Karpouzis, K., Kollias, S., Cowie, R. (2003):
An Introduction to Emotionally Rich Man-Machine Intelligent Systems. Eunite
2003.
15. Arnold, S., Becker, T., Delgado, A., Emde, F., Follmann, H. (2001): The Virtual
Vinegar Brewer: Optimizing the High Strength Acetic Acid Fermentation in a
Cognitive Controlled, Unsteady State Process. Eunite 2001.
16. Garcia Fernendez, L.A., Toledo, F. (2001): Adaptive Urban Traffic Control using
MAS. Eunite 2001.
17. Hegyi, A., de Schutter, B., Hellendoorn, H. (2001): Model Predictive Control for
Optimal Coordination of Ramp Metering and Variable Speed Control. Eunite
2001.
18. Niittymäki, J., Nevala, R. (2001): Traffic Signals as Smart Control System –
Fuzzy Logic Based Traffic Controller. Eunite 2001.
19. Jardzioch, A.J., Honczarenko, J. (2002). An Adaptive Production Controlling
Approach for Flexible Manufacturing System. Eunite 2002.
20. Kocka, T., Berka, P., Kroupa, T. (2001): Smart Adaptive Support for Selling
Computers on the Internet. Eunite 2001.
21. Macias Hernandez, J.J. (2001): Design of a Production Expert System to Im-
prove Refinery Process Operations. Eunite 2001.
22. Guthke, R., Schmidt-Heck, W., Pfaff, M. (2001): Gene Expression Based Adap-
tive Fuzzy Control of Bioprocesses: State of the Art and Future Prospects.
Eunite 2001.
23. Larraz, F., Villaroal, R., Stylios, C.D. (2001): Fuzzy Cognitive Maps as Naphta
Reforming Monitoring Method. Eunite 2001.
24. Woods, M.J. et al. (2002): Deploying Adaptive Fuzzy Systems for Spacecraft
Control. Eunite 2002.
25. Kopecky, D., Adlassnig, K-P., Rappelsberger, A. (2001): Patient-Specific Adap-
tation of Medical Knowledge in an Extended Diagnostic and Therapeutic Con-
sultation System. Eunite 2001.
26. Palaez Sanchez, J.I., Lamata, M.T., Fuentes, F.J. (2001): A DSS for the Urban
Waste Collection Problem. Eunite 2001.
27. Fernandez, F., Ossowski, S., Alonso, E. (2003): Agent-based decision support
services. The case of bus fleet management. Eunite 2003.
28. Hernandez, J.Z., Carbone, F., Garcia-Serrano, A. (2003): Multiagent architec-
ture for intelligent traffic management in Bilbao. Eunite 2003.
29. Sterrit, R., Liu, W. (2001): Constructing Bayesian Belief Networks for Fault
Management in Telecommunications Systems. Eunite 2001.
30. Ambrosino, G., Logi, F., Sassoli, P. (2001): eBusiness Applications to Flexible
Transport and Mobility Services. Eunite 2001.
31. Koulouriotis, D.E., Diakoulakis, I.E., Emiris, D.M. (2001): Modeling Investor
Reasoning Using Fuzzy Cognitive Maps. Eunite 2001.
32. Danas, C.C. (2002): The VPIS System: A New Approach to Healthcare logistics.
Eunite 2002.
3
Data Preparation and Preprocessing
D. Pyle
The problem for all machine learning methods is to find the most probable
model for a data set representing some specific domain in the real world.
Although there are several possible theoretical approaches to the problem of
appropriately preparing data, the only current answer is to treat the problem
as an empirical problem to solve, and to manipulate the data so as to make
the discovery of the most probable model more likely in the face of the noise,
distortion, bias, and error inherent in any real world data set.
3.1 Introduction
A model is usually initially fitted to (or discovered in) data that is especially
segregated from the data to which the model will be applied at the time of its
execution. This is usually called a training data set. This is not invariably the
case, of course. It is common practice to fit a model to a data set in order to
explain that specific data set in simplified and summarized terms – those of the
model. Such a model is a simplification and summarization of the data that is
more amenable to human cognition and intuition than the raw data. However,
adaptive systems are generally not intended primarily to model a specific
data set for the purposes of summary. Adaptive systems, when deployed, are
(among other things) usually intended to monitor continuous or intermittent
data streams and present summarized inferences derived from the training
data set that are appropriate and justified by the circumstances present in
the execution, or run-time data. Thus, any modification or adjustment that
is made to the training data must equally be applicable to the execution
data and produce identical results in both data sets. If the manipulations
are not identical, the result is operator-induced error – bias, distributional
nonstationarity, and a host of unnecessary problems – that are introduced
directly as a result of the faulty run-time data manipulation. Transformations
of raw data values are generally needed because the raw algorithms prevalent
today cannot “digest” all types of data. For instance, in general most popular
• Modifying range.
• Modifying distributions.
• Finding an appropriate numeric representation of categories.
• Finding an appropriate categorical representation of numeric values, either
continuous or discrete.
• Finding an appropriate representation for nulls or missing values.
Following this section, the chapter provides a separate section for each
of these topics followed by a short section of practical considerations and a
summary section. One or more – or conceivably none – of the transformations
mentioned may be needed, and the need for transformation depends princi-
pally on the requirements of the algorithm that is to be applied in fitting or
discovering the needed model. Further, any such data manipulations need to
consider the computational complexity involved in applying them at run time
since many applications are time sensitive, and too much complexity (i.e.,
too slow to produce) may render even optimal transformations impractical or
irrelevant by the time they are available. They may also need to be robust in
the face of greater variance than that experienced in the training data set.
• Linear modes
These include actually linear relationships and transformably linear rela-
tionships (those which are curvilinear but can be expressed in linear form).
• Functional modes
These include non-linear and other relationships that can be expressed as
mathematical functions. (A function produces a unique output value for
a unique set of input values.) These also include information transmitted
through interaction effects between the variables of the input battery.
• Disjoint modes
These include relationships that cannot be represented by functions. Ex-
amples of disjoint modes include non-convex clusters and relationships
that are non-contiguous, and perhaps orthogonal or nearly so, in adjacent
parts of the input space.
s = 1 / (1 + e^(−x))                                  (3.2)

where x = (ν − ν_mean) / (λ σ / (2π)), ν_mean is the mean of ν and σ the standard deviation.
Figure 3.1 shows the effect of softmax scaling. Similarly to the logistic
function, regardless of the value of the input, the output value lies between 0
and 1. The value of λ determines the extent of the linear range. For purposes
of illustration the figure shows a much shorter linear section of the softmax
curve than would be used in practice. Usually 95%–99% of the curve would
be linear so that only out-of-training-range values would be squashed.
Squashing out-of-range values, using this or some similar technique, places
them in the range over which the model was trained – but whether this is a
valid approach to dealing with the issue depends on the needs of the model.
Out-of-range values are still values that were not present in the training data,
and are values which the model has not characterized during its creation,
whether the values are squashed back inside the training range or not. Range
manipulation, including the use of out-of-range bins and squashing, has no
effect on the information content of the data set. Measures of information
are, in general, insensitive to the information transmission mode whether it
be linear or not, and introducing potential curvilinearities or nonlinearities
makes no difference to the total information transmitted. It does, of course,
change the mode of information transmission, and that may – almost certainly
will – have an impact on the modeling tool chosen.
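A small sketch of the softmax (logistic) squashing described above, following the formulation in (3.2); the value of the linear-range parameter λ and the sample data are arbitrary choices made for illustration.

```python
# Softmax scaling sketch following Eq. (3.2): values near the training mean
# pass through an almost linear range, while out-of-training-range values are
# squashed into (0, 1). The lambda value and sample data are arbitrary choices.
import math
import statistics

def softmax_params(values, lam=6.0):
    """Estimate mean and standard deviation from the training data."""
    return statistics.mean(values), statistics.stdev(values), lam

def softmax_scale(v, mean, std, lam):
    x = (v - mean) / (lam * std / (2.0 * math.pi))
    return 1.0 / (1.0 + math.exp(-x))

training = [10.0, 12.0, 11.5, 9.8, 10.7, 11.1]
mean, std, lam = softmax_params(training)
print([round(softmax_scale(v, mean, std, lam), 3) for v in training])
# An out-of-training-range run-time value is squashed close to, but below, 1.
print(round(softmax_scale(15.0, mean, std, lam), 6))
```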
Fig. 3.3. Amount of information transmitted through the 3 modes for the unmodified CA data set
Fig. 3.5. Distribution histograms for the two variables "veraz" and "deposits", before (left) and after (right) continuous redistribution
3.5.1 Coding
Fig. 3.6. Information transmitted by the categorical variables in the cars data set
suffer tremendously in the presence of increased variable count (for many well
founded reasons not discussed here), and this violates the first principle of
data preparation – first, do no harm. For that reason alone, category coding
is not further discussed here.
3.5.2 Numeration
Arbitrary Numeration
Fig. 3.7. (a) Information transmission modes for arbitrary category numeration in
cars data. (b) Information transmission modes in the cars data set after unsupervised
numeration of the categorical variables. (c) Information transmission modes in the
cars data set after supervised numeration of the categorical variables. (d) Amount
of information transmitted through the 3 modes after optimally binning the CA
data set
Principled Numeration
more ways to get it wrong than there are to get it right, so chance favors this
sort of outcome.) Supervised numeration of categories, the second method of
principled numeration, results in the information transmission modes shown
in Fig. 3.7(c).
The improvement – that is, the increase in linear mode information trans-
mission – is no accident in Fig. 3.7(c). In this case, it would almost be possible
to use linear regression to fit a model. There is an apparently irreducible 6%
of information transmitted through the disjoint mode, and 9% of the infor-
mation required for a perfect model is not available at all, but the functional
mode information has been reduced to almost nothing (1%).
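One simple way to realise supervised numeration is to replace each category with the mean of a 0/1 output observed for that category, so that the numbers assigned to the categories are ordered by their relationship with the output. The sketch below shows that idea; it is not necessarily the exact procedure behind Fig. 3.7(c), and the variable and category names are invented.

import numpy as np

def supervised_numeration(categories, target):
    # Map every category label to the mean of the (0/1) output observed
    # for records carrying that label.
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    mapping = {c: target[categories == c].mean() for c in np.unique(categories)}
    return np.array([mapping[c] for c in categories]), mapping

cats = np.array(["sport", "sedan", "van", "sport", "van", "sedan"])
buyer = np.array([1, 0, 0, 1, 1, 0])
numbers, mapping = supervised_numeration(cats, buyer)
print(mapping)   # e.g. {'sedan': 0.0, 'sport': 1.0, 'van': 0.5}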
There are many ways of selecting an appropriate bin count. Many, if not most
ways of selecting the number of bins seem to rely on intuition and are not
designed to produce a specifically measurable effect. While some methods may
produce good estimates for a useful bin count, these are essentially arbitrary
choices since there is no measurable effect that the binning is intended to
produce, or more appropriately, there is no optimizable effect. For instance,
binning to get a better model gives no indication of how to measure when the
best model is achieved since it may always be possible that another bin count
may produce a better model, and there is no way to tell without actually
trying them all! In contrast, a few methods of selecting bin count do rely
on measurable and optimizable metrics. This section briefly discusses some
representative options:
Purely Arbitrary
Purely arbitrary bin count selection relies essentially on the opinion of the
modeler. It seems common practice to choose 10 to 20 bins: although some
recommend more, few arbitrary choices seem to reach 100. This count seems
to be chosen for ease of interpreting the “meaning” of the bins in terms of the
model, and low bin count seems to be chosen to make the explanation easier
to understand.
Principled Arbitrary
This approach uses some (usually statistical) parameter describing the variable
to determine a bin count. Variance-based bin count determination is one example,
where the number of bins is chosen as Round(3.5 · σ · n^{1/3}). This
produces a bin count that depends on the standard deviation of the variable (σ) and the
number of records (n) available. However, although the bin count is determined
by characteristics of the variable, there is no measure of how “good”
this binning is, nor even of what it means to have a good, better, or best binning.
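The rule just quoted is easily computed; the few lines below reproduce the 222-bin figure worked out for the “deposits” variable later in this section.

import numpy as np

def variance_based_bin_count(values):
    # Round(3.5 * sigma * n^(1/3)), as quoted above.
    values = np.asarray(values, dtype=float)
    return int(round(3.5 * values.std() * len(values) ** (1.0 / 3.0)))

# With sigma = 2.95 and n = 9936 the rule gives 222 bins,
# matching the worked example later in this section.
print(round(3.5 * 2.95 * 9936 ** (1.0 / 3.0)))   # 222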
Principled
This approach uses a criterion to objectively determine the quality of the bin
count. SNR bin count is an example of this approach. SNR is an acronym
for “Signal to Noise Ratio”. Information theory provides an approach to mea-
sure the “Signal” and the “Noise” in a variable. Binning, as noted earlier,
reduces the amount of information available for transmission. Noise tends to
be present in small-scale fluctuations in data. Increasing the bin count transmits
more information but also more noise: the total amount of information transmitted
increases while the signal-to-noise ratio decreases. Somewhere as the bin count
increases from 1 there is an optimum balance between the two, and this is
illustrated in Fig. 3.8 for one of the variables in the CA data set.
Figure 3.8 shows the various components involved in determining the SNR
bin count. Smoothing reduces variance in information transmitted and in the
signal to noise ratio, and it is the smoothed versions that are used to determine
the signal gain. For the illustrated variable, the optimal point (that is the
point at which the maximum signal is passed for the minimum amount of
noise included) is indicated at 29 bins. However, other variables in this data
set have different optimal bin counts. It is worth comparing the SNR bin
count range with the number of bins indicated as suitable using the variance-
based binning. For the variable illustrated, “deposits”, which has a standard
deviation of 2.95 and with 9,936 records, this is:
3.5 \cdot \sigma \cdot n^{1/3} = 3.5 \cdot 2.95 \cdot 9936^{1/3} = 222 \text{ bins}.    (3.3)
Although Fig. 3.8 does not show a bin count of 222 bins, it is intuitively
apparent that such a number of bins is almost certainly a poor choice if
optimizing signal gain (maximum signal for minimum noise) is an objective.
some arrangements work better than others³. Typically, bin boundaries are
arranged using unsupervised positioning. And as with discovering an appropri-
ate bin count, there are arbitrary, principled arbitrary and principled methods
of bin boundary positioning.
Arbitrary
Principled Arbitrary
These bin boundary assignment methods are fairly popular, and they can
work well. Typical methods include the following (a short sketch follows the list):
• Equal Range, where the bins all have the same width, 1/n of the variable’s
range for n bins. Bins may not contain the same number of records, and in
extreme cases (for example, in the presence of extreme outliers) many or most
of the bins may be unpopulated.
• Equal Frequency, where each bin holds the same number of records as far as
possible, so that each bin contains roughly 1/n of the records for n bins. Bin
boundaries may then span very differently sized sub-ranges of the variable’s
total range.
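The sketch below contrasts the two boundary schemes on an invented variable with a single extreme outlier, to show the failure modes mentioned in the two items above.

import numpy as np

def equal_range_edges(values, n_bins):
    # Boundaries that split the variable's range into equal-width bins.
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, n_bins + 1)

def equal_frequency_edges(values, n_bins):
    # Boundaries placed at quantiles so each bin holds roughly the same
    # number of records.
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

data = np.append(np.random.normal(0.0, 1.0, 1000), 50.0)  # one extreme outlier
print(equal_range_edges(data, 10))      # most bins end up unpopulated
print(equal_frequency_edges(data, 10))  # bins span very differently sized ranges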
Principled
Supervised bin boundary placement must be principled since it uses the prin-
ciple of supervision to determine optimum positioning.
³ “work better” is taken to mean transmits more signal information while rejecting noise information, at the same time as transmitting maximum information through the least complex modes.
Principled
Relaxation:
Binning Summary
Binning is a useful and valuable tool in the armory of data preparation. How-
ever, the ease of binning is not an unmixed blessing because it is very easy
to do more harm than good with binning. See, for instance, Fig. 3.9, which
shows the result of one recommended method of binning applied to the CA
data set, and compare the results with Figs. 3.4 and 3.7(d).
⁴ A free version of the demonstration binning and numerating tool is available for download from the author’s personal web site, www.modelandmine.com. This is a reduced-capability version that has been taken from Powerhouse to allow experimentation with some of the techniques discussed in this chapter.
The binning has done a lot of damage. Compared with the original data
(Fig. 3.3), the data set no longer transmits sufficient information to define
the output completely – in fact, 8% of the necessary information is not trans-
mitted. Thus, using this binning, it is not possible, even in principle, to com-
pletely define the output – or in other words, it is no longer possible to get a
perfect model. The linear mode information transmitted has increased slightly
from 26% to 27%, but the functional mode transmission has fallen from 3% to
a fraction of 1%. Disjoint mode transmission has apparently decreased from
71% to 65%. However, recall that 8% is untransmitted using this binning
practice. To all intents and purposes, of the information that is transmitted
29% is through the linear mode and 71% is through the disjoint mode. This
is altogether not a useful binning, especially as compared with what a better
binning practice achieves as shown in Fig. 3.7(d). Yet, the binning strategy
adopted in this example is taken from the recommended practices published
by a major statistical and mining tool vendor for use with their tools.
When carefully applied, and when appropriate techniques for discovering
an appropriate bin count and for placing bin boundaries are used, binning
reduces noise, increases signal, moves information transmission from more to
less complex modes of transmission, and promotes ease of modeling. When in-
appropriately or arbitrarily applied, binning may increase noise, reduce signal,
or move information transmission into more complex modes. As with so many
tools in life, binning is a two-edged sword. Careful and principled application
is needed to get the most benefit from it.
Missing values, or nulls, present problems for some tools, and are thought
not to for others. However, the issue is not so straightforward. Tools that have
a problem usually require a value to be present for all variables in order to
generate an output. Clearly, some value has to be used to replace the null –
or the entire record has to be ignored. Ignoring records at the very least re-
duces the information available for defining output states, and may result in
introducing very significant bias, rendering the effectively available residual
data set unrepresentative of the domain being modeled. From an information
transmission perspective, the replacement value should be the one that least
disturbs the existing information content of a data set. That is, it should add
no information and should not affect the proportion of information transmit-
ted through any mode. This is true because, from one perspective, null values
are explicitly not there and should, thus, explicitly have no contribution to
make to information content. That being the case, any replacement value that
is used purely for the convenience of the modeling tool should likewise make no
contribution to the information content of the data set. All such a replacement
should do is to make information that is present in other variables accessible
to the tool. However, that is not the whole story. Usually (invariably in the
author’s experience) nulls do not occur at random in any data set. Indeed,
the presence of a null very often depends on the values of the other variables
in a data set, so even the presence of a null is sensitive to the multivariate
distribution, and this is just as true for categorical variables as for
numeric variables. Thus, it may be important to explicitly capture the infor-
mation that could be transmitted by the pattern with which nulls occur in a
data set – and it is this pattern that is potentially lost by any tool that ignores
missing values. By doing so, it may ignore a significant source of information.
For instance, the “credit” data set (see note 1) has many nulls. In fact, some
variables are mainly null. The output variable in this data set is “buyer” –
an indication of offer acceptance. Ignoring completely any non-null values and
considering exclusively the absence/presence of a value results in the infor-
mation transmission shown in Fig. 3.10. (In other words, a data set is created
that has the same variable names and count as in the original, but where
the original has a value the new data set has a “1”, and where the original
has a null the new data set has a “0”, except for the output variable that
is retained in its original form.) The figure shows that, perhaps surprisingly,
the mere absence or presence of a value in this data set transmits 33% of the
information needed to define the output – information that might be ignored
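A sketch of the transformation just described: every input value is replaced by a presence flag, while the output variable is kept in its original form. The record layout and variable names below are invented for illustration.

def null_pattern_dataset(records, output_name):
    # Replace every input value by 1 (present) or 0 (null), keeping the
    # output as-is, so that only the pattern of nulls carries information.
    flagged = []
    for rec in records:
        new_rec = {}
        for name, value in rec.items():
            new_rec[name] = value if name == output_name else (0 if value is None else 1)
        flagged.append(new_rec)
    return flagged

records = [{"income": 42000, "veraz": None, "buyer": 1},
           {"income": None,  "veraz": 3,    "buyer": 0}]
print(null_pattern_dataset(records, "buyer"))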
Principled Arbitrary
However, many if not most current methods for null replacement suggested
as good practice are principled arbitrary methods. One such practice, for
instance, is to replace a numeric variable null with the mean value of the vari-
able, or the most common class. The idea behind this method is that using
the mean is the most representative univariate value. Unfortunately for any
method that recommends any single value replacement of nulls in a variable,
it is not the univariate distribution that is important. What is important is
not distorting the multivariate distribution – and any single value replace-
ment disregards the multivariate distribution – indeed, it explicitly changes
the multivariate distribution. Naturally, this introduces new and completely
spurious information into the multivariate distribution. Now the credit data
set, whether missing values are replaced or not, has so much variability al-
ready present that it is already carrying the theoretical maximum amount of
information in the input battery, so replacing missing values makes no appar-
ent difference to the quantity of information in the data set. Metaphorically,
the bucket is already full, and measuring the level of the content of the bucket
before and after garbage has been added makes no difference to the level. Full
is full. However, the quality of the content is affected. As Fig. 3.11 shows,
a null replacement value imputed to maintain the multivariate distribution
(right) is more effective than using the univariate mean (left) at transmitting
information through the less complex modes.
This requires using multivariate imputation – in other words, taking the values
of the variables that are not null into account when imputing a replacement
value. Linear multivariate imputation works well – and curvilinear multivari-
ate imputation works better. However, these are far more complex techniques
Fig. 3.11. Effect on information transmission mode in the credit data set of replacing
missing numeric values with the mean (left) and a multivariate-sensitive value (right)
than using a univariate mean, and in a run-time data set the more complex
the technique used, generally the longer the null replacement takes to gen-
erate. The optimal value (from an information transmission perspective) is
a value that least disturbs the information content of a data set. In numeric
variables, linear multivariate imputation is a reasonable compromise that can
be applied on a record-by-record basis, and is reasonably fast. For categorical
variables, the equivalent of the univariate mean is the most frequent category.
The categorical equivalent of multivariate imputation is the selection of the
most representative category given the values that are not null, whether nu-
meric or categorical, in the other variables. The combination of including the
null value pattern information along with a principled null value imputation
makes additional information available. Figure 3.12 shows the information
transmission modes for the combined data set.
Fig. 3.12. Combined missing value pattern information with principled information
neutral null replacements in the credit data set
What is clear in Fig. 3.12 is that, at least in the credit data set, the pattern
with which nulls occur is certainly something that should be investigated
since it has such a large effect on the modes of information transmission. Of
course, there are many questions raised: why should this be, is it significant,
is it valid and can it be used, for instance.
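For numeric variables, the linear multivariate imputation discussed above can be sketched as follows: the variable with nulls is regressed, by ordinary least squares, on the remaining variables, and the fitted value is used wherever the variable is null. This is only a minimal single-variable sketch; a production version would treat every variable and records with several simultaneous nulls, and the least-squares fit merely stands in for whatever linear model is preferred.

import numpy as np

def linear_impute(X, target_col):
    # Fill np.nan values in column target_col using a linear least-squares
    # fit on the other columns (fitted on the records that are complete).
    X = np.array(X, dtype=float)
    others = [c for c in range(X.shape[1]) if c != target_col]
    known = ~np.isnan(X[:, target_col])
    complete_inputs = ~np.isnan(X[:, others]).any(axis=1)
    fit_rows = known & complete_inputs
    A = np.column_stack([X[fit_rows][:, others], np.ones(fit_rows.sum())])
    coef, *_ = np.linalg.lstsq(A, X[fit_rows, target_col], rcond=None)
    fill_rows = ~known & complete_inputs
    B = np.column_stack([X[fill_rows][:, others], np.ones(fill_rows.sum())])
    X[fill_rows, target_col] = B @ coef
    return X

data = [[1.0, 2.0, 10.0], [2.0, 1.0, 11.0], [3.0, 4.0, np.nan], [4.0, 3.0, 14.0]]
print(linear_impute(data, target_col=2))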
[Fig. 3.13 (schematic): check the variables → null replacement → depending on whether the chosen algorithm requires numbers or categories, numerate the categories or categorize the numbers → create the model.]
It may happen that there are simply too many categories for the chosen al-
gorithm to handle. Few category oriented tools can handle ZIP codes, for
instance, or text messages coded into flagged states but which can run into
hundreds or thousands of codes. A useful option is to reduce the number of
categories by associating similar categories together into bins. Some binning
Unfortunately space permits only a brief sketch of the preparation options and
this section summarizes a possible approach based on the earlier content of
the chapter. In fact there remains a lot for a modeler to do after preparation
and before modeling, Fig. 3.13 notwithstanding. For sure any modeler will
need to survey the final data set, determine the confidence that the data set
represents the multivariate distribution, determine how much of the necessary
information is transmitted, select a subset of variables for the actual model,
and many other tasks. Discussing those tasks is not itself part of preparing
the data, and so it is beyond the scope of this chapter. However, each of those
steps may well cause a return to one of the preparation stages for further
adjustments. To some degree preparation is an iterative process, and without
question it is an integrated part of the whole modeling process even though
it is discussed separately here.
3.9 Summary
This chapter has used information theory as a lens to look at the effects of
different data preparation techniques and practices. It is a useful tool inasmuch
as it is possible to use information transmission measurements that directly
relate to the ease (even the possibility in many cases) of creating models of
data using available algorithms. If the algorithms available were insensitive to
the mode of transmission of information, data preparation – at least of the
types discussed here – would not be needed. However, at least currently, that
is not the case.
The chapter has provided a brief look at five of the main issues that face
any practitioner trying to create a model of a data set: modifying range, modi-
fying distributions, representing categories as numbers, representing numbers
as categories and replacing nulls. In all cases, there are poor practices and
better practices. In general, the better practices result in increasing the pro-
portion of the information in a data set that is transmitted through the less
complex modes. However, nomenclature may deceive: yesterday’s “best practices”
may have been better than those they replaced, but although
still perhaps labeled “best practices” they are not necessarily still the best
available option. Time, knowledge and technology move on. Applying the
best practices is not always straightforward, and indeed as just mentioned,
what constitutes a current best practice has changed in the light of recent
developments in data modeling tools, and not all practitioners are aware of
the effects on model performance that different practices have. Given the pace
of development in all areas of life, data preparation is itself part of an adap-
tive system – one that adapts to take advantage of developing knowledge and
tools. Whatever developments are yet to be discovered, applying principled
approaches to data preparation provides the modeler who uses them with
a key ingredient in empirically fitting the most probable model to the data
available.
4 Artificial Neural Networks
C. Fyfe
4.1 Introduction
Artificial neural networks are one of a group of new methods intended
to emulate the information processors which we find in biology. The underlying
presumption of creating artificial neural networks (ANNs) is that the expertise
which we humans exhibit is due to the nature of the hardware on which our
brains run. Therefore if we are to emulate biological proficiencies in areas such
as pattern recognition we must base our machines on hardware (or simulations
of such hardware) which seems to be a silicon equivalent to that found within
our heads.
First we should be clear about what the attractive properties of human
neural information processing are. They may be described as:
• Biological information processing is robust and fault-tolerant: early on
in life¹, we have our greatest number of neurons yet though we daily lose
¹ Actually several weeks before birth.
Fig. 4.1. Left: A diagrammatic representation of a real neuron. Right: The artificial
neuron. The weights model the synaptic efficiencies. Some form of processing not
specified in the diagram will take place within the cell body
thought to have different efficiencies and that these efficiencies change during
the neuron’s lifetime. We will return to this feature when we discuss learning.
We generally model the biological neuron as shown in Fig. 4.1 (right). The
inputs are represented by the input vector x and the synapses’ efficiencies
are modelled by a weight vector w. Therefore the single output value of this
neuron is given by
y = f\Big(\sum_i w_i x_i\Big) = f(w \cdot x) = f(w^T x)    (4.1)
Notice that if the weight between two neurons is positive, the input neuron’s
effect may be described as excitatory; if the weight between two neurons is
negative, the input neuron’s effect may be described as inhibitory.
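Equation (4.1) amounts to only a few lines of code; the logistic activation below is just one possible choice of f.

import numpy as np

def neuron_output(w, x, f=lambda a: 1.0 / (1.0 + np.exp(-a))):
    # Single artificial neuron: weighted sum of the inputs passed through
    # an activation function f, as in equation (4.1).
    return f(np.dot(w, x))

w = np.array([0.5, -1.0, 0.25])   # synaptic efficiencies (positive = excitatory)
x = np.array([1.0, 2.0, 4.0])     # input vector
print(neuron_output(w, x))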
Therefore we can see that the single neuron is an extremely simple process-
ing unit. The power of neural networks is believed to come from the accumu-
lated power of adding many of these simple processing units together – i.e.
we throw lots of simple and robust power at a problem. Again we may be
thought to be emulating nature, as the typical human has on the order of a hundred
billion neurons. We will often imagine the neurons to be acting in concert in
layers such as in Fig. 4.2.
In this figure, we have a set of inputs (the input vector, x) entering the
network from the left-hand side and being propagated through the network
via the weights till the activation reaches the output layer. The middle layer
is known as the hidden layer as it is invisible from outside the net: we may
not affect its activation directly.
Fig. 4.2. A typical artificial neural network consisting of 3 layers of neurons and 2
connecting layers of weights
Control: For example, to control the movement of a robot arm (or truck, or
any non-linear process) by learning what inputs (actions) will have the correct
outputs (results).
We first set the scene with two early networks and then discuss the most
common current supervised networks, the multilayered perceptron and radial
basis function networks.
The simple perceptron was first investigated by Rosenblatt. For a partic-
ular input pattern, xP , we have an output oP and target tP . The algorithm
is:
1. begin with the network in a randomised state: the weights between all
neurons are set to small random values (between −1 and 1).
2. select an input vector, xP , from the set of training examples
3. propagate the activation forward through the weights in the network to
calculate the output oP = w.xP .
4. if oP = tP , (i.e. the network’s output is correct) return to step 2.
5. else change the weights according to \Delta w_i = \eta x_i^P (t^P - o^P), where η is a
learning rate (a small sketch of these steps follows).
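A minimal version of this loop is given below. A thresholded (step-function) output and a fixed learning rate are assumed, since the description above leaves both open, and the training data is a small linearly separable toy mapping.

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=50, rng=np.random.default_rng(0)):
    # Steps 1-5 above, with targets coded as -1/+1 and a step output.
    w = rng.uniform(-1.0, 1.0, X.shape[1])            # 1. small random weights
    for _ in range(epochs):
        for xp, tp in zip(X, t):                      # 2. pick a training example
            op = 1.0 if np.dot(w, xp) >= 0 else -1.0  # 3. forward pass
            if op != tp:                              # 4. correct output: skip
                w += eta * xp * (tp - op)             # 5. delta update
    return w

# Logical OR on -1/+1 inputs, with a constant bias input in the first column
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1], dtype=float)
w = train_perceptron(X, t)
print(np.sign(X @ w))   # should reproduce t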
where the fraction is included due to inspired hindsight. Now, if our Adaline
is to be as accurate as possible, we wish to minimise the squared error. To
minimise the error, we can find the gradient of the error with respect to the
weights and move the weights in the opposite direction. Formally, \Delta^P w_j = -\gamma \frac{\partial E^P}{\partial w_j}. This rule is called the Least Mean Square error (LMS) or Delta rule or Widrow-Hoff rule. Now, for an Adaline with a single output, o,
\frac{\partial E^P}{\partial w_j} = \frac{\partial E^P}{\partial o^P} \cdot \frac{\partial o^P}{\partial w_j}    (4.3)
and because of the linearity of the Adaline units, \frac{\partial o^P}{\partial w_j} = x_j^P. Also, \frac{\partial E^P}{\partial o^P} = -(t^P - o^P), and so \Delta^P w_j = \gamma (t^P - o^P)\, x_j^P. Notice the similarity between
this rule and the perceptron learning rule; however, this rule has far greater
applicability in that it can be used for both continuous and binary neurons.
This has proved to be a most powerful rule and is at the core of many current
supervised learning methods.
The Perceptron (and Adaline) proved to be powerful learning machines
but there were certain mappings which were (and are) simply impossible using
these networks. Such mappings are characterised by being linearly inseparable.
Now it is possible to show that many linearly inseparable mappings may be
modelled by multi-layered perceptrons; this indeed was known in the 1960s but
what was not known was a rule which would allow such networks to learn the
mapping. Such a rule appears to have been discovered independently several
times but has been spectacularly popularised by the PDP(Parallel Distributed
Processing) Group [30] under the name backpropagation.
An example of a multi-layered perceptron (MLP) is shown in Fig. 4.2.
Activity in the network is propagated forwards via weights from the input layer
to the hidden layer where some function of the net activation is calculated.
Then the activity is propagated via more weights to the output neurons. Now
two sets of weights must be updated – those between the hidden and output
layers and those between the input and hidden layers. The error due to the
first set of weights is clearly calculable by the previously described LMS rule;
however, now we require to propagate backwards that part of the error due
to the errors which exist in the second set of weights and assign the error
proportionately to the weights which cause it.
We may have any number of hidden layers which we wish since the method
is quite general; however, the limiting factor is usually training time which
can be excessive for many-layered networks. In addition, it has been shown
that networks with a single hidden layer are sufficient to approximate any
continuous function (or indeed any function with only a finite number of dis-
continuities) provided we use (non-linear) differentiable activation functions
in the hidden layer.
Activation Functions
The most popular activation functions are the logistic function, \frac{1}{1+\exp(-t)}, and
the tanh() function. Both of these functions satisfy the basic criterion that
they are differentiable. In addition, they are both monotonic and have the
important property that their rate of change is greatest at intermediate values
and least at extreme values. This makes it possible to saturate a neuron’s
output at one or other of their extreme values. The final point worth noting
is the ease with which their derivatives can be calculated (a short numerical check follows the list):
• if f(x) = tanh(bx), then f′(a) = b(1 − f(a) · f(a));
• similarly, if f(x) = 1/(1 + exp(−bx)), then f′(a) = b · f(a)(1 − f(a)).
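These identities are easy to confirm numerically; the check below assumes b = 1.

import numpy as np

def logistic(x, b=1.0):
    return 1.0 / (1.0 + np.exp(-b * x))

def tanh_act(x, b=1.0):
    return np.tanh(b * x)

a, b = np.linspace(-3.0, 3.0, 7), 1.0
# derivatives expressed through the function values themselves
d_logistic = b * logistic(a) * (1.0 - logistic(a))
d_tanh = b * (1.0 - tanh_act(a) * tanh_act(a))
# compare against central finite differences
eps = 1e-6
print(np.allclose(d_logistic, (logistic(a + eps) - logistic(a - eps)) / (2 * eps)))
print(np.allclose(d_tanh, (tanh_act(a + eps) - tanh_act(a - eps)) / (2 * eps)))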
It is folk wisdom that convergence is faster when tanh() is used rather than the
logistic function. Note that in each case the target function must be within
the output range of the respective functions. If you have a wide spread of
values which you wish to approximate, you must use a linear output layer.
The initial values of the weights will in many cases affect the final converged
network’s values. Consider an energy surface with a number of energy wells;
then, if we are using a batch training method, the initial values of the weights
constitute the only stochastic element within the training regime. Thus the
network will converge to a particular value depending on the basin in which
the original vector lies. There is a danger that, if the initial network values are
sufficiently large, the network will initially lie in a basin with a small basin
of attraction and a high local minimum; this will appear to the observer as a
network with all weights at saturation points (typically 0 and 1 or +1 and −1).
It is usual therefore to begin with small weights uniformly distributed inside
a small range. Reference [22], page 162, recommends the range (-\frac{2.4}{F_i}, +\frac{2.4}{F_i}), where F_i is the number of weights into the ith unit.
The basic backprop method described above is not known for its fast speed of
convergence. Note that though we could simply increase the learning rate, this
tends to introduce instability into the learning rule causing wild oscillations in
the learned weights. It is possible to speed up the basic method in a number of
ways. The simplest is to add a momentum term to the change of weights. The
basic idea is to make the new change of weights large if it is in the direction
of the previous changes of weights while if it is in a different direction make
it smaller. Thus we use \Delta w_{ij}(t+1) = (1 - \alpha)\,\delta_j o_i + \alpha \Delta w_{ij}(t), where α
determines the influence of the momentum. Clearly the momentum parameter
α must be between 0 and 1. The second term is sometimes known as the “flat
spot avoidance” term since the momentum has the additional property that it
helps to slide the learning rule over local minima (see below).
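The momentum rule itself is only a line of code; in the sketch below, delta_j and o_i stand for whatever error term and presynaptic output the layer in question provides, and the values are invented.

def momentum_step(prev_dw, delta_j, o_i, alpha=0.9):
    # Delta w(t+1) = (1 - alpha) * delta_j * o_i + alpha * Delta w(t):
    # updates in the same direction as before are reinforced, reversals damped.
    return (1.0 - alpha) * delta_j * o_i + alpha * prev_dw

dw = 0.0
for step in range(3):                 # the same gradient three times in a row
    dw = momentum_step(dw, delta_j=0.5, o_i=1.0)
    print(dw)                         # grows towards the steady value delta_j * o_i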
Local Minima
Error descent is bedevilled by local minima. You may read that local minima
are not much of a problem for ANNs, in that a network’s weights will typically
converge to solutions which, even if they are not globally optimal, are good
enough. There is as yet little analytical evidence to support this belief. A
heuristic often quoted is to ensure that the initial (random) weights are such
that the average input to each neuron is approximately unity (or just below
it). This suggests randomising the initial weights of neuron j around the value
\frac{1}{\sqrt{N}}, where N is the number of weights into the jth neuron. A second heuristic
is to introduce a little random noise into the network either on the inputs or
with respect to the weight changes. Such noise is typically decreased during
the course of the simulation. This acts like an annealing schedule.
that it discourages the use of large weights in that a single large weight may
be decayed more than a lot of small weights. More complex decay routines
can be found which will encourage small weights to disappear.
The number of hidden nodes has a particularly large effect on the generali-
sation capability of the network: networks with too many weights (too many
degrees of freedom) will tend to memorise the data; networks with too few
will be unable to perform the task allocated to them. Therefore many algorithms
have been derived to create neural networks with a smaller number of hidden
neurons.
In a typical radial basis function (RBF) network, the input layer is simply a
receptor for the input data. The crucial feature of the RBF network is the
function calculation which is performed in the hidden layer. This function
performs a non-linear transformation from the input space to the hidden-
layer space. The hidden neurons’ functions form a basis for the input vectors
and the output neurons merely calculate a linear (weighted) combination of
the hidden neurons’ outputs.
An often-used set of basis functions is the set of Gaussian functions whose
mean and standard deviation may be determined in some way by the input
data (see below). Therefore, if φ(x) is the vector of hidden neurons’ outputs
when the input pattern x is presented and if there are M hidden neurons, then
where the centres ci of the Gaussians will be determined by the input data.
Note that the terms ‖x − c_i‖ represent the Euclidean distance between the
inputs and the ith centre. For the moment we will only consider basis functions
with λi = 0. The output of the network is calculated by
where w is the weight vector from the hidden neurons to the output neuron.
We may train the network now using the simple LMS algorithm in the
usual way. Therefore we can either change the weights after presentation of
the ith input pattern xi by
\Delta w_k = -\frac{\partial E^i}{\partial w_k} = e_i \, \phi_k(x^i)    (4.5)
or, since the operation of the network after the basis function activation has
been calculated is wholly linear, we can use fast batch update methods. Ref-
erence [22] is particularly strong in discussing this network.
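A compact sketch of the whole arrangement is given below: Gaussian basis activations are computed in the hidden layer, the centres are simply fixed on a randomly chosen subset of the training data, and the output weights are trained with the LMS rule of equation (4.5). The basis-function width, learning rate and toy regression task are all assumptions made for illustration.

import numpy as np

def rbf_design(X, centres, width=1.0):
    # Gaussian basis activations phi_k(x) = exp(-||x - c_k||^2 / (2 width^2)).
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def train_rbf_lms(X, y, centres, width=1.0, eta=0.05, epochs=200):
    Phi = rbf_design(X, centres, width)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_i, y_i in zip(Phi, y):
            e_i = y_i - np.dot(w, phi_i)   # error for this pattern
            w += eta * e_i * phi_i         # LMS step, as in equation (4.5)
    return w

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, (60, 1))        # toy 1-d regression problem
y = np.sin(X[:, 0])
centres = X[rng.choice(len(X), 10, replace=False)]
w = train_rbf_lms(X, y, centres)
print(np.abs(rbf_design(X, centres) @ w - y).mean())   # mean training error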
If we have little data, we may have no option but to position the centres of
our radial basis functions at the data points. However such problems may
be ill-posed and lead to poor generalisation. If we have more training data,
several solutions are possible:
1. Choose the centres of the basis functions randomly from the available train-
ing data.
2. Choose to allocate each point to a particular radial basis function (i.e. such
that the greatest component of the hidden layer’s activation comes from a
particular neuron) according to the k-means rule. In k-means, the centres
of the neurons are moved so that each remains the average of the inputs
allocated to it.
3. We can use a generalisation of the LMS rule:
\Delta c_i = -\frac{\partial E}{\partial c_i}    (4.6)
This unfortunately is not guaranteed to converge (unlike the equivalent
weight change rule) since the cost function E is not convex with respect to
the centres and so a local minimum is possible.
The question now arises as to which is best for specific problems, multilayered
perceptrons or radial basis networks. Both RBFs and MLPs can be shown to
be universal approximators i.e. each can arbitrarily closely model continuous
functions. There are however several important differences:
1. The neurons of an MLP generally all calculate the same function of the
neurons’ activations e.g. all neurons calculate the logistic function of their
weighted inputs. In an RBF, the hidden neurons perform a non-linear map-
ping whereas the output layer is always linear.
2. The non-linearity in MLPs is generally monotonic; in RBFs we use a radi-
ally decreasing function.
3. The argument of the MLP neuron’s function is the vector product w.x of
the input and the weights; in an RBF network, the argument is the distance
between the input and the centre of the radial basis function, ‖x − w‖.
4. MLPs perform a global calculation whereas RBFs find a sum of local out-
puts. Therefore MLPs are better at finding answers in regions of the input
space where there is little data in the training set. If accurate results are
required over the whole training space, we may require many RBFs i.e.
many hidden neurons in an RBF network. However because of the local
nature of the model, RBFs are less sensitive to the order in which data is
presented to them.
5. MLPs must pass the error back in order to change weights progressively.
RBFs do not do this and so are much quicker to train.
4.4.5 Applications
Hebbian learning is so-called after Donald Hebb [23] who in 1949 conjectured:
When an axon of a cell A is near enough to excite a cell B and re-
peatedly or persistently takes part in firing it, some growth process
or metabolic change takes place in one or both cells such that A’s
efficiency as one of the cells firing B, is increased.
In the sort of feedforward neural networks we have been considering, this
would be interpreted as the weight between an input neuron and an output
neuron is very much strengthened when the input neuron’s activation when
passed forward to the output neuron causes the output neuron to fire strongly.
We can see that the rule favours the strong: if the weights between inputs and
outputs are already large (and so an input will have a strong effect on the
outputs) the chances of the weights growing are large.
More formally, consider the simplest feedforward neural network, which has
a set of input neurons with associated input vector, x, and a set of output
neurons with associated output vector, y. Then we have, as before, y_i = \sum_j w_{ij} x_j,
where now the (Hebbian) learning rule is defined by \Delta w_{ij} = \eta x_j y_i. That is,
the weight between each input and output neuron is increased proportional
to the magnitude of the simultaneous firing of these neurons.
Now we can substitute into the learning rule the value of y calculated by the
feeding forward of the activity to get \Delta w_{ij} = \eta x_j \sum_k w_{ik} x_k = \eta \sum_k w_{ik} x_k x_j.
Writing the learning rule in this way emphasises the statistical properties of
the learning rule i.e. that the learning rule identifies the correlation between
different parts of the input data’s vector components. It does however, also
show that we have a difficulty with the basic rule as it stands which is that we
have a positive feedback rule which has an associated difficulty with lack of
stability. If we do not take some preventative measure, the weights will grow
without bound. Such preventative measures include
1. Clipping the weights i.e. insisting that there is a maximum, wmax and
minimum wmin within which the weights must remain.
2. Specifically normalising the weights after each update, i.e. \Delta w_{ij} = \eta x_j y_i
is followed by
w_{ij} = \frac{w_{ij} + \Delta w_{ij}}{\sqrt{\sum_k (w_{ik} + \Delta w_{ik})^2}}
which ensures that the weights into each output neuron have length 1.
3. Having a weight decay term within the learning rule to stop it growing too
large, e.g. \Delta w_{ij} = \eta x_j y_i - \gamma(w_{ij}), where the decay function γ(w_{ij}) represents
a monotonically increasing function of the weights.
4. Create a network containing a negative feedback of activation.
The last is the method we will use in the subsequent case study.
y_i = \sum_{j=1}^{N} W_{ij} x_j, \quad \forall i    (4.7)
e_j = x_j - \sum_{i=1}^{M} W_{ij} y_i    (4.8)
reveal such structure is to project the data onto a lower dimensional space and
then look for structure in this lower dimensional projection by eye. However
we need to determine what constitutes the best subspace onto which the data
should be projected.
Now the typical projection of high dimensional data will have a Gaussian
distribution [8] and so little structure will be evident. This has led researchers
to suggest that what they should be looking for is a projection which gives a
distribution as different from a Gaussian as possible. Thus we typically define
an index of “interestingness” in terms of how far the resultant projection
is from a Gaussian distribution. Since the Gaussian distribution is totally
determined by its first two moments, we usually sphere the data (make it zero
mean and with covariance matrix equal to the identity matrix) so that we
have a level playing field to determine departures from Gaussianity.
Two common measures of deviation from a Gaussian distribution are based
on the higher order moments of the distribution. Skewness is based on the nor-
malised third moment of the distribution and basically measures if the dis-
tribution is symmetrical. Kurtosis is based on the normalised fourth moment
of the distribution and measures the heaviness of the tails of a distribution.
A bimodal distribution will often also have a negative kurtosis and therefore
kurtosis can signal that a particular distribution shows evidence of clustering.
Whilst these measures have their drawbacks as measures of deviation from
normality (particularly their sensitivity to outliers), their simplicity makes
them ideal for explanatory purposes.
The only difference between the PCA network and this EPP network is
that a function of the output activations is calculated and used in the simple
Hebbian learning procedure. We have for N dimensional input data and M
output neurons
s_i = \sum_{j=1}^{N} w_{ij} x_j    (4.10)
e_j = x_j - \sum_{k=1}^{M} w_{kj} s_k    (4.11)
r_i = f\Big(\sum_{j=1}^{N} w_{ij} x_j\Big) = f(s_i)    (4.12)
\Delta w_{ji} = \eta_t r_i e_j    (4.13)
             = \eta_t f\Big(\sum_{k=1}^{N} w_{ik} x_k\Big)\Big(x_j - \sum_{l=1}^{M} w_{lj} \sum_{p=1}^{N} w_{lp} x_p\Big)    (4.14)
where ri is the value of the function f() at the ith output neuron.
To determine which f() to choose, we decide which feature we wish to
emphasise in our projection and use the derivative of the function which would
maximise the feature in the projection. For example, clusters in a data set
are often characterised by having low kurtosis, which is measured by E(y^4),
the expected value of the (normalised) fourth moment. Thus we might choose
f(y) = −y^3 as our function. In practice, robust functions similar to the powers
of y are often used, and so we would choose in the above f(y) = tanh(y).
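The update equations (4.10)–(4.13) translate directly into code. The sketch below runs a single output neuron over invented five-dimensional data containing one clearly non-Gaussian (bimodal) direction, using the f(y) = tanh(y) choice just mentioned; the learning rate, the data and the number of passes are arbitrary, and the example is meant only to show the structure of the update rather than a tuned experiment.

import numpy as np

def epp_step(w, x, eta, f=np.tanh):
    s = w @ x                          # (4.10)  feed forward
    e = x - w.T @ s                    # (4.11)  residual via negative feedback
    r = f(s)                           # (4.12)  non-linearly transformed output
    return w + eta * np.outer(r, e)    # (4.13)  Hebbian-style update

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
X[:, 0] = np.sign(X[:, 0]) + 0.1 * X[:, 0]   # a bimodal (clustered) direction
X = (X - X.mean(axis=0)) / X.std(axis=0)     # crude sphering of the data

w = 0.01 * rng.standard_normal((1, 5))       # one output neuron
for _ in range(3):                           # a few passes through the data
    for x in X:
        w = epp_step(w, x, eta=0.01)
print(w / np.linalg.norm(w))   # the direction the single output has settled on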
This type of network has been used in exploratory data analysis [29] for
remote sensing data, financial data and census data. It has also been applied
[19, 20] to the difficult “cocktail-party problem” – the extraction of a single
signal from a mixture of signals when few assumptions are made about the
signals, the noise or the mixing method.
One of the non-biological aspects of the basic Hebbian learning rule is that
there is no limit to the amount of resources which may be given to a synapse.
This is at odds with real neural growth in that it is believed that there is a limit
on the number and efficiency of synapses per neuron. In other words, there
comes a point during learning in which if one synapse is to be strengthened,
another must be weakened. This is usually modelled as a competition for
resources.
In competitive learning, there is a competition between the output neurons
to fire. Such output neurons are often called winner-take-all units. The aim
of competitive learning is to cluster the data. However, as with the Hebbian
learning networks, we provide no correct answer (i.e. no labelling information)
to the network. It must self-organise on the basis of the structure of the input
data.
The basic mechanism of simple competitive learning is to find a winning
unit and update its weights to make it more likely to win in future should a
similar input be given to the network. We first have a competition between
the output neurons and then
Note that the change in weights is a function of the difference between the
weights and the input. This rule will move the weights of the winning neuron
directly towards the input. If used over a distribution, the weights will tend to
the mean value of the distribution since ∆wij → 0 ⇐⇒ wij → E(xj ), where
E(.) indicates the ensemble average.
Probably the three most important variations of competitive learning are
1. Learning Vector Quantisation [26]
2. The ART models [4, 5]
3. The Kohonen feature map [27]
As a case study, we will discuss the last of these.
The interest in feature maps stems directly from their biological importance.
A feature map uses the “physical layout” of the output neurons to model
some feature of the input space. In particular, if two inputs x1 and x2 are
close together with respect to some distance measure in the input space, then
if they cause output neurons ya and yb to fire respectively, ya and yb must
be close together in some layout of the output neurons. Further we can state
that the opposite should hold: if ya and yb are close together in the output
layer, then those inputs which cause ya and yb to fire should be close together
in the input space. When these two conditions hold, we have a feature map.
Such maps are also called topology preserving maps.
Examples of such maps in biology include:
• the retinotopic map, which takes input from the retina (at the eye) and maps it onto the visual cortex (back of the brain) in a two dimensional map;
• the somatosensory map, which maps our touch centres on the skin to the somatosensory cortex;
• the tonotopic map, which maps the responses of our ears to the auditory cortex.
Each of these maps is believed to be determined genetically but refined by
usage. E.g. the retinotopic map is very different if one eye is excluded from
seeing during particular periods of development.
Kohonen’s algorithm [27] is exceedingly simple – the network is a simple
2-layer network and competition takes place between the output neurons; how-
ever now not only are the weights into the winning neuron updated but also
the weights into its neighbours. Kohonen defined a neighbourhood function
f (i, i∗ ) of the winning neuron i∗ . The neighbourhood function is a function of
the distance between i and i∗ . A typical function is the Difference of Gaussians
function; thus if unit i is at point ri in the output layer then
f(i, i^*) = a \exp\Big(\frac{-|r_i - r_{i^*}|^2}{2\sigma^2}\Big) - b \exp\Big(\frac{-|r_i - r_{i^*}|^2}{2\sigma_1^2}\Big)    (4.15)
where rk is the position in neuron space of the kth centre: if the neuron space is
1 dimensional, rk = k is a typical choice; if the neuron space is 2 dimensional,
rk = (xk , yk ), its two dimensional Cartesian coordinates.
Results from an example experiment are shown in Fig. 4.3. The experiment
consists of a neural network with two inputs and twenty five outputs. The
two inputs at each iteration are drawn from a uniform distribution over the
square from −1 to 1 in two directions. The algorithm is
1. Select at random an input point.
2. There is a competition among the output neurons. That neuron whose
weights are closest to the input data point wins the competition:
Fig. 4.3. A one dimensional mapping of the two dimensional input space
where
f(i, i^*) = a \exp\Big(\frac{-|r_i - r_{i^*}|^2}{2\sigma^2}\Big) - b \exp\Big(\frac{-|r_i - r_{i^*}|^2}{2\sigma_1^2}\Big)    (4.18)
4. Go back to the start until some termination criterion has been met.
Kohonen typically keeps the learning rate constant for the first 1000 iterations
or so and then slowly decreases it to zero over the remainder of the experiment
(we can be talking about 100 000 iterations for self-organising maps).
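A compact version of the loop above is sketched below for a one-dimensional map of 25 output neurons trained on the unit square, as in the experiment of Fig. 4.3. A single Gaussian neighbourhood (rather than the difference of Gaussians of (4.15)) and simple linearly decaying learning-rate and width schedules are assumed to keep the sketch short.

import numpy as np

def train_som_1d(data, n_out=25, iters=20000, eta0=0.5, sigma0=5.0,
                 rng=np.random.default_rng(0)):
    W = rng.uniform(-0.1, 0.1, (n_out, data.shape[1]))
    positions = np.arange(n_out, dtype=float)           # r_k = k in neuron space
    for t in range(iters):
        x = data[rng.integers(len(data))]               # 1. random input point
        winner = np.argmin(((W - x) ** 2).sum(axis=1))  # 2. competition
        frac = t / iters
        eta = eta0 * (1.0 - frac)                       # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5             # shrinking neighbourhood
        h = np.exp(-(positions - winner) ** 2 / (2.0 * sigma ** 2))
        W += eta * h[:, None] * (x - W)                 # 3. update winner and neighbours
    return W

data = np.random.default_rng(1).uniform(-1.0, 1.0, (5000, 2))   # the square
W = train_som_1d(data)
print(W[:5])   # neighbouring neurons end up with neighbouring weight vectors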
In [27], a whole chapter is given over to applications of the SOM; examples
include text mining, specifically of web pages, a phonetic typewriter, clustering
of financial variables, robot control and forecasting.
3. It is often best to decrease the learning rate during training. This smoothes
out the effect of individual data items and accords well with stochastic
gradient descent theory.
Fig. 4.4. Left: Two SOMs trained on the same data; it is difficult/impossible to
find a joint map. Right: Two SOMs trained on the same data when the centres were
initialised to lie on the first Principal Component
4.7 Conclusion
In this chapter, we have reviewed some of the major types of artificial
neural networks. We have suggested that the major differentiation within the
artificial neural network field is between those networks which use supervised
learning and those networks which use unsupervised learning. We have given
some examples of each but any good textbook in the field (and we particularly
recommend [22]) will give many more examples.
Finally, we have discussed the use of ensemble methods, which use one
of several standard statistical techniques to combine sets of artificial neural
networks. The resulting networks are shown to be more powerful than the
individual networks and illustrate the use of hybrid techniques in this field.
References
1. L. Breiman. Using adaptive bagging to debias regressions. Technical Report
547, Statistics Dept, University of California at Berkeley, February 1999.
2. L. Breiman. Arcing the edge. Technical Report 486, Statistics Dept, University of
California, Berkeley.
3. C. J. C. Burges. A tutorial on support vector machines for pattern recognition.
Knowledge Discovery and Data Mining, 2(2), 1998.
4. Gail A. Carpenter and Stephen Grossberg. ART 2: Self-organization of stable
category recognition codes for analog input patterns. Applied Optics, 26:4919–
4930, 1987.
5. Gail A. Carpenter and Stephen Grossberg. ART 3: Hierarchical search using chem-
ical transmitters in self-organizing pattern recognition architectures. Neural
Networks, 3:129–152, 1990.
6. D. Charles and C. Fyfe. Modelling multiple cause structure using rectification
constraints. Network: Computation in Neural Systems, 9:167–182, May 1998.
7. D. Charles, C. Fyfe, J. Koetsier, and D. MacDonald. Unsupervised neural networks
for the identification of minimum overcomplete basis in visual data. Neurocom-
puting, 2001.
8. Persi Diaconis and David Freedman. Asymptotics of graphical projections. The
Annals of Statistics, 12(3):793–815, 1984.
9. R. Dybowski and S. Roberts. Clinical Applications of Artificial Neural Networks,
chapter Confidence Intervals and Prediction Intervals for Feed-Forward Neural
Networks, pages 1–30. Cambridge University Press, 2003.
10. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a sta-
tistical view of boosting, technical report. Technical report, Statistics Dept,
Stanford University, 1998.
11. Jerome H Friedman. Exploratory projection pursuit. Journal of the American
Statistical Association, 82(397):249–266, March 1987.
12. C. Fyfe. PCA properties of interneurons. In From Neurobiology to Real World
Computing, ICANN 93, pages 183–188, 1993.
13. C. Fyfe. Introducing asymmetry into interneuron learning. Neural Computation,
7(6):1167–1181, 1995.
14. C. Fyfe. Radial feature mapping. In International Conference on Artificial
Neural Networks, ICANN95, Oct. 1995.
15. C. Fyfe. A comparative study of two neural methods of exploratory projection
pursuit. Neural Networks, 10(2):257–262, 1997.
5.1 Introduction
Adaptive systems are able to change their behaviour according to their en-
vironment. Adaptive systems can be viewed as having two functions: (1) a basic,
standard function and (2) the optimisation of performance of the standard
function with respect to a performance goal. In this chapter we review meth-
ods and systems that perform this combination of tasks using methods from
Machine Learning and Reinforcement Learning. We focus on the methods
on which such systems are based and the conditions that make the methods
effective.
Machine Learning provides methods for different forms of adaptivity. We
can make the following main distinctions:
Learning to Data Mining. The best-known model is CRISP [7]. The CRISP
process model distinguishes the following steps:
1. Problem understanding,
2. Characterising the tasks,
3. Designing the data acquisition component,
4. Designing the performance component,
5. Designing the adaptation component,
6. Implementing the adaptive system,
7. Evaluating the adaptive system,
8. Deployment.
In this section we discuss these steps in more depth. In Sects. 5.3.3 and 5.4
we illustrate the process model with examples.
Problem understanding amounts first of all to defining the tasks for the adap-
tive system: the performance task and the adaptation task. For both of these
the input, output and function need to be specified. An important point is the
type of feedback that is input for the adaptation task. Consider the following
example. Suppose that we want to build a system that learns to recognise
diseases, as a simple case of an adaptive system.
Performance task:
input: description of patient
output: disease
Adaptation task:
input: descriptions of patients with diseases
output: prediction model (for performance task).
Performance task:
input: description of patient
output: best treatment
Adaptation task:
input: descriptions of patients -> best treatments
output: prediction model.
Compound task:
Performance task 1:
input: description of patient + treatment
output: effect
Performance task 2:
input: description of patient + treatment + effect
output: best treatment
Adaptation task:
Data Cleaning
Then, in the data cleansing step the data has to be assessed and approved to
be of the required quality for the actual learning. Often, data is not collected
with the purpose of using it for a learning technique. In industry, data is often
collected for controlling a process. Data formats may change over time, sam-
pling frequency may differ between attributes and other data quality defects
may occur. Consequently, the data suffers from missing values and contains
records with contradictory or complementary information. Before being suit-
able for application of a learning technique, some of these defects have to
be repaired. Missing values, for instance, may lead to removal from the data
set, or to assigning a “don’t know” value. Moreover, data enrichment may be
possible: adding information from specific sources to the data set.
Finally, in the feature selection step, the contribution of individual at-
tributes in the data to the task to achieve (classification, prediction or char-
acterization) is assessed, leading to the removal of non-contributing attributes
from the data set.
Feature Selection
Deciding which features are relevant is of key importance for effective adaptive
systems. This is because redundant features may produce spurious (parts of)
models when the amount of data is limited. Eliminating features that are
almost certain to have no effect will improve the speed at which the system
will adapt correctly.
Feature Transformation
For the choice of a method, the type of data is important. Most problems
involve data that can be represented as feature vectors or values on a fixed
number of variables. The main scale types are: numerical, ordinal and nominal
(also called categorical or symbolic). There are methods for transforming the
scale or the scale type. For example, a numerical variable (effect of a treat-
ment) can be represented as a set of ordered labels (e.g. complete recovery,
partial recovery, partial recovery and additional new symptoms, no recovery,
getting worse). It can also be represented as a binary choice (improved/not
improved) or as a numerical scale. Scales can be reduced by merging values.
It is also possible to construct a scale from categorical labels. For example, if
the country of birth of patients is known then the values of this feature can be
scaled by their association with the predicted variable, resulting in an ordered
(or even numerical) variable. The main principles that govern the transfor-
mations are: (1) remove redundant information and (2) satisfy requirements
of the method that will be applied. If transformations can be reliably made
from prior knowledge about the domain, this is preferred.
Feature Construction
Relations between features can also be exploited. If two features are strongly
correlated in the data, they can be (normalised and) added to create a sin-
gle new feature. Features with a “nested” structure (e.g. male/female and
pregnant/not-pregnant) can be reformulated as a single feature (male/female-
pregnant/female-not-pregnant). If features have many values then they can
be mapped to a less fine scale. Feature selection and feature construction can
be done from prior knowledge or from initial analysis of the data.
Technical Evaluation
• Train-and-test:
split data in train and test set, train the classifier on the train data, and
estimate its accuracy by running it on the test data. If the amount of data is
small relative to the strength and complexity of effects then this procedure
will underestimate the accuracy because both learning and testing will be
unstable.
• Cross validation (see the sketch after this list):
Divide the data into M sub-samples. For each sub-sample i, train the classifier
on all the other sub-samples, and test on sub-sample i. The estimated
error rate is the average of the error rates over the M sub-samples. The classifier
to exploit is trained from the entire data set.
• Resampling:
Draw from the data set a new data set with replacement, the sample being
of the same size as the data set. Train the classifier on the drawn set, and
estimate the error on the unused parts. Averaging the performance on a
large number of experiments gives the final estimate.
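The cross-validation procedure, for instance, can be wrapped around any fit/predict pair, as sketched below; the nearest-class-mean classifier and the generated data are only stand-ins for illustration.

import numpy as np

def cross_val_accuracy(fit, predict, X, y, m=5, rng=np.random.default_rng(0)):
    # M-fold cross validation: train on all folds but one, test on the
    # held-out fold, and average the fold accuracies.
    folds = np.array_split(rng.permutation(len(X)), m)
    scores = []
    for i in range(m):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        model = fit(X[train], y[train])
        scores.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(scores))

def fit(X, y):                         # stand-in learner: nearest class mean
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = list(model)
    d = np.stack([((X - model[c]) ** 2).sum(axis=1) for c in classes], axis=1)
    return np.array(classes)[d.argmin(axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(cross_val_accuracy(fit, predict, X, y, m=5))   # near 1 for well-separated classes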
Application Evaluation
In this stage the system is delivered and integrated in its environment. Be-
sides implementation work this will need planning and involvement of the
user environment because the system may need time to adapt and improve
performance, which is not what users expect.
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}    (5.2)
If we now interpret A as “user likes document” and B as a property of the
document, then we can use this formula to calculate P(A|B) from P(B|A),
P(A) and P(B). P(A) can be estimated from observations. We have no single
usable property of documents, so instead we use many properties: the pres-
ence of a word in the document. The Naive Bayesian classifier is based on
the assumption that the probabilities P (word|interesting) for different words
are independent of each other. In that case we can use the following variation
of formula (5.1):
P(\mathrm{interesting} \mid W_1 \cap W_2 \cap \cdots \cap W_n) = \frac{P(\mathrm{interesting})\,P(W_1 \mid \mathrm{interesting}) \cdots P(W_n \mid \mathrm{interesting})}{P(W_1) \cdots P(W_n)}    (5.3)
These probabilities can be estimated from data. When more data are col-
lected these estimates will become more accurate.
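A sketch of the word-based Naive Bayesian classifier just described is given below. Add-one smoothing and log-probabilities are used to keep the product well behaved (assumptions not stated in the text), and the classification compares P(class) · Π P(word|class) across the classes, which is proportional to (5.3) because the denominator is the same for every class. The toy documents are invented.

import math
from collections import Counter

def train_naive_bayes(docs, labels):
    # Estimate P(class) and word counts per class from labelled documents.
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for words, c in zip(docs, labels):
        counts[c].update(words)
        vocab.update(words)
    return prior, counts, vocab

def classify(words, prior, counts, vocab):
    # Highest P(class) * prod_i P(word_i | class), with add-one smoothing.
    scores = {}
    for c in prior:
        total = sum(counts[c].values())
        score = math.log(prior[c])
        for w in words:
            score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["neural", "network", "learning"], ["stock", "market", "crash"],
        ["deep", "learning", "network"], ["market", "shares", "fall"]]
labels = ["interesting", "boring", "interesting", "boring"]
prior, counts, vocab = train_naive_bayes(docs, labels)
print(classify(["neural", "learning"], prior, counts, vocab))   # "interesting"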
Decision trees are a classic family of methods that learn models in the form of trees. Associated
with the nodes in the trees are data variables, and the branches correspond to
values (or value intervals for numerical variables). Leaves are associated with
classes, and objects are classified by traversing the tree. The tree is constructed
by selecting the best predicting variable, splitting the data by the selected
variable, and repeating this recursively for each subset. A problem is
that the process divides the data into ever smaller subsets: at some point the
amount of data is too small to decide about extending the tree. Some criteria
for stopping the process are evaluation against a separate dataset and statistical
significance of tree extensions.
Detailed descriptions can be found in the literature e.g. [19] or [30] and a
number of commercial tools are available that usually include procedures for
evaluating trees and for displaying and exporting trees.
• A set of states
• A set of actions
• A dynamic system
• A scalar evaluation
• An objective
Here the state is the variable that can change over time and that has to be
controlled. The dynamics of the system describes how the state changes, given
the past states and actions. There should be a conditional dependency between
the actions and the state change, so that the actions can be used to control the
system. In automatic control the actions are selected as a function of at least
the current state. In RL terminology the function that computes that action is
referred to as the policy. The scalar evaluation is the reinforcement. This can
be regarded as a reward or punishment of the current state and action. The
objective indicates the criterion based on which the policy is being improved.
So the criterion indicates whether the reinforcement should be interpreted as reward or punishment.
Since the system is dynamic, an action taken now can have consequences for future reinforcements; therefore the objective of RL is usually expressed as the sum or the average of the future reinforcements. RL solves this problem, the temporal credit assignment problem, by propagating evaluations back in time to all past actions that could be responsible for these evaluations.
RL in Theory
A special class of problems are those that can be formulated as a Markov De-
cision Problem (MDP). Here the states s are elements from a finite set S, and
an action a can be chosen from a finite set A(s). The state transition probability is given for all state/action/next-state combinations: P^a_{ss'} = Pr(s'|s, a). This specifies the probability of s' being the next state when action a is taken in state s. The expected reinforcement received at a certain moment only
depends on current state/action/next-state combination, so it is a stationary
process. The reinforcement that is received at time step k is indicated by rk .
Depending on whether r indicates a reward or punishment the objective is
to minimize or maximize the (discounted) sum of future reinforcements. The
value function maps the (discounted) sum of future reinforcements to the state
space. When s is the state at time step k the value is given by:
V^π(s) = Σ_{i=k}^{∞} γ^{i−k} r_i    (5.4)
Here γ ∈ [0, 1] is the discount factor that makes sure that the sum is
finite. The policy π determines for all states which action to take, and in
(5.4) it indicates what actions will be taken in future states. Because the state transition P^a_{ss'} is conditionally independent of the past (the Markov property), the future only depends on the current state and all actions taken in the future. In other words, the values V^π(s) for all states do exist and can be computed when P^a_{ss'}, π and all reinforcements are known. The objective is now to find that policy for which the value for all states is minimal or maximal. A classical solution to the MDP is Dynamic Programming (DP) [3, 5], in which the state transition probabilities P^a_{ss'} are used to compute the optimal value function.
Once the optimal value function is given the optimal policy can be found by
selecting for each state that action that has the highest expected value for the
next time step. In RL the value function is not computed off-line. Instead the
values for each state are estimated on-line by controlling the system using a
policy. The estimation is based on the temporal difference [22]. The idea is
that the current value should equal the reinforcement received plus the value
of the next state.
According to (5.4):
V^π(s_k) = r_k + γ Σ_{i=k+1}^{∞} γ^{i−k−1} r_i = r_k + γ V^π(s_{k+1})    (5.5)
The same should hold for the approximated value function V̂ . If it does
not, the estimated value of the current state is adjusted based on the estimated
value of the next state plus the reinforcement received.
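A minimal sketch of this temporal-difference idea for tabular value estimation is shown below; the episode format, learning rate, discount factor and toy environment are assumptions made for illustration.

```python
import numpy as np

def td0_value_estimation(episodes, n_states, gamma=0.9, alpha=0.1):
    """Tabular TD(0): each episode is a list of (state, reward, next_state) steps."""
    V = np.zeros(n_states)                      # estimated value function
    for episode in episodes:
        for s, r, s_next in episode:
            target = r + gamma * V[s_next]      # reinforcement plus estimated value of next state
            V[s] += alpha * (target - V[s])     # move the current estimate towards the target
    return V

# toy usage: a 3-state chain 0 -> 1 -> 2 with a reward of 1 on the final transition
episodes = [[(0, 0.0, 1), (1, 1.0, 2)]] * 200
print(td0_value_estimation(episodes, n_states=3))
```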
RL in Practice
• Is learning required?
If a good control policy can be designed manually, there is no need to
improve it using RL.
• Is the state transition function known?
If the state transition function is known, training can be performed in simulation and after training the policy can be transferred to the real system. If the state transition function is not known, simulations are not possible and only model-free methods such as Q-learning can be applied. There are also RL methods that estimate the state transition function and use this to speed up the learning process.
• Is there a finite and small enough set of states?
In theory RL stores for each state the value and decides for each state
the action. In case of a continuous state space this is no longer possible.
It is possible to discretize the continuous state space, but usually this
requires some knowledge about the problem domain. Alternatively, a function approximator such as a neural network can be used to represent the value function and the policy (a tabular sketch follows this list).
• Is enough data available?
RL may require many epochs of training, which is no problem if the sys-
tem is simulated on a computer. However, when learning is applied to a
real system it may be that the system itself is too slow. In that case gen-
eralization over the state and action spaces or the reuse of data may be required to learn fast enough.
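The sketch below illustrates the points above for the tabular case: a continuous state variable is discretized into a small number of states and a Q-function is learned without any model of the state transitions. The functions env_reset and env_step stand for the system being controlled and must be supplied by the user; all parameter values are illustrative.

```python
import numpy as np

def discretize(x, bins):
    """Map a continuous state variable onto a finite set of state indices."""
    return int(np.digitize(x, bins))

def q_learning(env_reset, env_step, bins, n_actions,
               episodes=500, gamma=0.95, alpha=0.1, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy.
    env_reset() -> initial continuous state; env_step(a) -> (next_state, reward, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((len(bins) + 1, n_actions))
    for _ in range(episodes):
        s = discretize(env_reset(), bins)
        for _ in range(100):                         # bounded episode length
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            x_next, r, done = env_step(a)
            s_next = discretize(x_next, bins)
            # model-free update: only observed transitions and reinforcements are used
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
            if done:
                break
    return Q
```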
A difficulty is that some pilot actions were anticipatory rather than reactive.
Some actions were taken not as a reaction to some measurable change but be-
cause the pilot anticipated new conditions. No action was taken to deal with
this problem. Top-down induction of decision trees was selected as method
because this produces relatively simple controllers that can be reviewed by
experts and that was sufficient for constructing an adequate controller. The
model was constructed offline by a standard tool for decision tree learning
(C 4.5, now succeeded by C 5.0; see [19] for a detailed description of C 4.5).
This tool can generate a decision tree and a procedure for applying this tree
(in C) that can be run by itself or integrated into a system that connects with
the measurements and controls. This was found to work very well. Experi-
ments showed that the synthesized controller could not be distinguished from human pilots for these manoeuvres. The system was not deployed but acted as a
demonstration of the technology that was followed up by a project to explore
this further.
the highest rating was used to recognise the components of addresses. The
grammar adaptation system used a hill climbing search strategy and opera-
tors that add and delete terms in the grammar rules and modify the ratings
of grammar rules. Grammars were evaluated by the number of addresses that
they can recognise. The results show that using a dataset of 459 addresses,
the system constructed a grammar that correctly recognised 88% of the ad-
dresses. To evaluate this, a grammar was constructed manually by a linguist.
The accuracy of the automatically adapted grammar is equal to the accuracy
of a grammar that was constructed manually by a human language engineer.
The grammar constructed by the system was somewhat more complex than
the “human” grammar. Presumably this can be avoided by adding a premium
for simplicity to the evaluation of candidate grammars.
The World Wide Web contains about 800 million webpages and that number
is increasing by the day. Finding the right information becomes harder, finding
interesting sites might become impossible. Try searching with AltaVista for
websites about “machine learning”. You will probably find over 4 million hits!
Even if you only tried 1% of these and took 1 minute to visit each site, it would take you about 8.5 months of continuous surfing to visit them
all.
In this section we summarise the development of an early recommender
system, Syskill & Webert [18]. Here we focus on the problem of assisting a
person to find information that satisfies long-term, recurring goals (such as
finding information on machine learning applications in medicine) rather than short-
term goals (such as finding a paper by a particular author). The “interesting-
ness” of a webpage is defined as the relevance of the page with respect to the
user’s long-term information goals. Feedback on the interestingness of a set
of previously visited sites can be used to learn a profile that would predict
the interestingness of unseen sites. In this section, it will be shown that re-
vising profiles results in more accurate classifications, particularly with small
training sets. For a general overview of the idea, see Fig. 5.1.
The user classifies visited sites into “interesting” and “not interesting”.
This website collection is used to create a user profile, which is used by the
adaptive web browser to annotate the links on a new website. The results are
shown to the user as in Fig. 5.2. The main problem to be addressed in this
adaptive system is how to give advice to the user about the interestingness
Fig. 5.2. The adaptive web browser shows its advice to the user
of a website, and on what data this advice should be based. The only data
available is a list of websites visited by the user, and a corresponding list of
ratings. Using this data, the system must be able to predict whether a webpage
is interesting or not. And if it’s wrong, it has to learn from its mistakes.
Data Preparation
The available data consists of HTML files that the user visited in the past.
These files contain HTML markup tags and actual text. For this purpose,
the HTML markup can be deleted because it plays only a minor role for
understanding the content and its meaning. Everything between HTML tags
(< and >) is therefore omitted. Very short words are often very frequent and
not too discriminative between documents and therefore words with a length
smaller than three are ignored. In addition about 600 frequent longer words
(taken from a “stoplist”) were also omitted.
Selecting single words has its drawbacks: phrases like “computer science” and “science fiction” both contribute a count for the word “science”, although the meaning of this word depends on its context. To be able to compare
words to each other in a later stage, all words are transformed to lower case.
Cleaning the HTML page shown in the Data Design stage in this way, we obtain a list of
words. Further processing is possible. For example, the words “application”
and “applications” appear both. One could argue that they refer to the same
concept and that merging them will give better estimates of probabilities
because the data for both words can be pooled together.
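A minimal sketch of this preparation step is shown below; the regular expressions are one possible way to strip the markup, and the tiny stoplist is only a stand-in for the roughly 600-word stoplist mentioned above.

```python
import re

STOPLIST = {"the", "and", "with", "this", "that", "from", "have"}   # tiny stand-in

def page_to_words(html):
    text = re.sub(r"<[^>]*>", " ", html)         # drop everything between < and >
    words = re.findall(r"[a-z]+", text.lower())  # transform to lower case, keep letters only
    return [w for w in words if len(w) >= 3 and w not in STOPLIST]

print(page_to_words("<html><body>Machine Learning applications in <b>medicine</b></body></html>"))
# ['machine', 'learning', 'applications', 'medicine']
```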
When the user uses the adaptive web browser, the learning method will
predict which links on the current page are interesting, by reading ahead those
links and analyzing their content. If the user follows some of the links, he can
enter his opinion about the suggested links. If he agrees, the browser records that the prediction succeeded; if he does not, the prediction has failed. The success or failure of the prediction can then be used to improve the predictions. In this way, the browser learns from the user what type of sites should be recommended and what type of sites should be avoided.
The user can also provide words that they consider good indicators for websites of their interest. In this way, the
browser uses prior knowledge about the domain without having to learn it.
Also, using lexical knowledge can enhance performance further: knowledge about relations between words can improve the quality of the list of most informative words, for example by removing non-informative words.
One RL approach able to deal with continuous state and actions spaces is
based on Linear Quadratic Regulation (LQR) [6, 12, 15, 29]. Here the state
and actions are represented by vectors and the state transition function is
given by a linear mapping from current state and action to the next state.
The reinforcements represent cost and are computed as quadratic functions of
state and actions. The value function or Q-function for such a problem is again a quadratic function. In the case of a Q-function the improved linear policy can be derived directly from the estimated parameters of the Q-function.
For the estimation of these parameters it is possible to create an input vector containing all quadratic combinations of the elements of the state and action vectors. The Q-function is then formed by a linear mapping from this input vector to a scalar value. The temporal difference can now be rewritten such that a linear least squares estimate of the parameters of the Q-function can be obtained.
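One way to set up such an estimate is sketched below: the input vector is built from all quadratic combinations of state and action, and the parameters are obtained by linear least squares on the temporal difference between successive samples. This LSTD-style formulation is an illustrative assumption, not necessarily the exact formulation used in the cited work, and the data arrays are assumed to be given.

```python
import numpy as np

def quadratic_features(s, a):
    """All quadratic combinations of the elements of the state and action vectors."""
    z = np.concatenate([s, a])
    return np.outer(z, z)[np.triu_indices(len(z))]     # upper triangle avoids duplicate terms

def estimate_q_parameters(states, actions, rewards, next_states, next_actions, gamma=0.95):
    """Least-squares estimate of w in Q(s, a) = w . phi(s, a), using the temporal
    difference Q(s_k, a_k) ~ r_k + gamma * Q(s_{k+1}, a_{k+1}) for every sample k."""
    phi = np.array([quadratic_features(s, a) for s, a in zip(states, actions)])
    phi_next = np.array([quadratic_features(s, a) for s, a in zip(next_states, next_actions)])
    A = phi - gamma * phi_next
    w, *_ = np.linalg.lstsq(A, np.asarray(rewards, dtype=float), rcond=None)
    return w
```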
The dynamics of many systems, like the mobile robot, do not have a linear
state transition function. The LQR approach cannot be applied, because it would lead to very unreliable results. In [24] a feed-forward neural network is used instead of the linear mapping that forms the quadratic Q-function. The network has one linear output unit and the hidden units have hyperbolic tangent activation functions. If all the weights of the hidden units are very
small the network forms a linear mapping and the represented Q-function is
quadratic. This is how the network is initialized. The network is trained based
on the same error as in the least squares estimation. During training certain
weights of hidden units can become larger and the Q-function is no longer a
quadratic function. Obtaining the improved policy is a bit more complicated
than in the LQR case. A two step approach can be used where first the
hidden layer is ignored to form a global linear policy that can be corrected in
the second step by using the hidden layers. This results in a nonlinear policy.
Evaluation
Using the above described configuration the robot was placed in a large area.
A random linear policy plus noise for exploration was used to make the robot
move. The robot started at four different initial positions where it ran for 20
seconds to gather the data that was used to train the neural network. The
nonlinear policy was extracted from the trained network and tested on the
robot. For initial positions close to the line it started turning smoothly in
the right direction. For the position far from the line the robot first turned
in the direction of the line. Here it had a slight overshoot and turned a bit too much. Closer to the line the robot started to turn to follow the line again. Only 80 seconds of running the system was sufficient to learn the correct behavior.
A control policy for a tracking task of a two-link robot arm is optimized using RL in [16].
The control policy should take into account possible different loads on the
arm and ensure stability while minimizing the tracking error. The purpose
of [16] is to introduce a stable fuzzy controller that can be tuned using RL.
The dynamics of the robot arm is described by a nonlinear function, which
makes it hard to obtain an optimal controller. A policy that tracks well for
a heavy load may cause instability when the load is reduced. In spite of the
difficult nonlinear model, the robot arm is an engineered artifact that is not
completely unknown. This allows for a fuzzy controller to compensate for the
nonlinearities. RL is only needed for the improvement of the performance.
The closed loop behavior can be analyzed and conditions for the tunable
parameters can be given. These conditions can be taken into account so that
during learning the system remains always stable.
Approach
RL methods to tune fuzzy controllers were developed [4, 14] and used [11].
These approaches use the original architecture from [2], where an actor forms the policy and a critic the approximation of the value function. Instead of a function approximator for the actor, a fuzzy controller with tunable parameters is used. When learning is on-line, the risk of an unstable closed loop should be avoided. In [1] this is solved by using two policies, one stabilizing controller and one that is improved using Q-learning. The second policy is used except when instability may occur.
In [16] a fuzzy controller is used to compensate for the nonlinearities in the system. This is possible because the kinematic model of the robot arm is known. To improve the performance, the defuzzified output is weighted with parameters that are updated based on the critic. By choosing the right fuzzy inference model, a closed loop configuration can be obtained for which the parameters can never cause any instability.
Evaluation
The result is a fuzzy controller that provides stability and performs well as
a tracker.
• The dynamics of man-made machines are usually not completely unknown. In that case, use the prior knowledge that is available.
• In on-line learning the stability of the closed loop is very important. If possible, modify the learning mechanism in such a way that instability cannot occur.
In [9] RL is used to make an agent learn to play the computer game Digger.
The player controls a machine that digs tunnels and the objective of the
game is to collect all the emeralds while avoiding or shooting monsters. The
purpose of [9] is to demonstrate that human guidance can be exploited to speed up learning, especially in the initial phase of learning.
The state representation of a game like Digger is rather difficult. The position of the player, the monsters and the presence of emeralds should be encoded in the state. Also, the monsters can only move in tunnels dug by the player, so the tunnels have to be encoded as well. This makes the state space very large and virtually impossible to explore. However, most parts of the state space will be irrelevant. In the original game the machine is controlled by a human player, and this is used as guidance in the learning process. The policies of human players are a more reasonable starting point than random initial policies.
Approach
The approach uses relational reinforcement learning [10], in which the Q-function is represented by a first order logic decision tree that maps structural descriptions to real numbers. For Digger this means that predicates like “visibleMonster” and “nearestEmerald” could be used. Learning was performed using different numbers of guidance traces.
Result
The result was that the game of Digger could be played after learning. The effect of using guidance was a bit different than expected: the initial learning speed did not increase much, but instead guidance improved the final performance a little.
5.5 Conclusion
Acknowledgements
References
1. P.E. An, S. Aslam-Mir, M. Brown and C.J. Harris. A reinforcement learning
approach to on-line optimal control. In Proceedings of the International Con-
ference on Neural Networks, 1994.
2. A.G. Barto, R.S. Sutton and C.W. Anderson. Neuronlike adaptive elements that
can solve difficult learning control problems. IEEE Transactions on Systems,
Man and Cybernetics, 1983.
3. R. Bellman. Dynamic Programming. Princeton University Press, 1957.
4. H.R. Berenji and P. Khedkar. Learning and tuning fuzzy logic controller through
reinforcements. IEEE Trans. on Neural Networks, 1992.
5. D.P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models.
Prentice-Hall, 1987.
6. S.J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In
Advances in Neural Information Processing Systems, 1993.
7. P. Chapman., J. Clinton, T. Khabaza, T. Reinartz and R. Wirth. The
CRISP-DM Process Model. Technical report, Crisp Consortium, 1999,
http://www.crisp-dm.org/.
8. P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8: 341-362, 1992.
9. K. Driessens and S. Dzeroski. Integrating experimentation and guidance in
relational reinforcement learning. In Proceedings of the Nineteenth International
Conference on Machine Learning, pages 115-122. Morgan Kaufmann Publishers,
Inc, 2002.
10. S. Dzeroski, L. De Raedt and K. Driessens. Relational reinforcement learning.
Machine Learning, 43(1): 7-52, 2001.
11. D. Gu and H. Hu. Reinforcement learning of fuzzy logic controller for quadruped
walking robots. In Proceedings of 15th IFAC World Congress, 2002.
12. S.H.G. ten Hagen and B.J.A. Kröse. Linear quadratic regulation using reinforce-
ment learning. In F. Verdenius and W. van den Broek, editors, Proc. of the 8th
Belgian-Dutch Conf. on Machine Learning, pages 39-46, Wageningen, October
1998. BENELEARN-98.
13. T. Jaakkola, M.I. Jordan and S. P. Singh. On the convergence of stochastic
iterative dynamic programming algorithms. Neural Computation, 1994.
14. L. Jouffe. Fuzzy inference system learning by reinforcement methods. IEEE
Trans. on System, Man and Cybernetic, 28(3), 1998.
15. T. Landelius. Reinforcement learning and distributed local model synthesis.
PhD thesis, Linköping University, 1997.
16. C.-K. Lin. A reinforcement learning adaptive fuzzy controller for robots. Fuzzy
Sets and Systems, 137(3): 339-352, 2003.
17. D. Michie and C. Sammut. Behavioural clones and cognitive skill models. In K.
Furukawa, D. Michie and S. Muggleton, editors, Machine Intelligence 14. Oxford
University Press, Oxford, 1995.
18. M.J. Pazzani and D. Billsus. Learning and revising user profiles: The iden-
tification of interesting web sites. Machine Learning, 27(3): 313-331, 1997.
19. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
20. C. Sammut. Automatic construction of reactive control systems using symbolic
machine learning. The Knowledge Engineering Review, 11:27-42, 1996.
21. W.D. Smart and L.P. Kaelbling. Effective reinforcement learning for mobile
robots. In Proceedings of the International Conference on Robotics and Au-
tomation, 2002.
22. R.S. Sutton. Learning to predict by the methods of temporal differences. Ma-
chine Learning, 1988.
23. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT
Press, 1998.
24. S. ten Hagen and B. Kröse. Neural Q-learning. Neural Computing and Appli-
cations, 12(2): 81-88, November 2003.
25. G. Tesauro. Temporal difference learning in TD-gammon. In Communications
of the ACM, 1995.
26. T. van der Boogaard and M. van Someren. Grammar induction in the domain of
postal addresses. In Proceedings Belgian-Netherlands Machine Learning Work-
shop Benelearn, Brussels, 2004. University of Brussels.
27. Ch. J.C.H. Watkins and P. Dayan. Technical note: Q learning. Machine Learn-
ing, 1992.
28. S.M. Weiss and N. Indurkhya. Predictive Data-Mining. A Practical Guide.
Morgan Kaufmann Publishers, San Francisco, California, 1997.
29. P.J. Werbos. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 1990.
30. I.H. Witten and E. Frank. Data-Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann Publishers, San
Francisco, 2000.
6
Fuzzy Expert Systems
J.M. Garibaldi
In this chapter, the steps necessary to develop a fuzzy expert system (FES)
from the initial model design through to final system evaluation will be pre-
sented. The current state-of-the-art of fuzzy modelling can be summed up
informally as “anything goes”. What this actually means is that the devel-
oper of the fuzzy model is faced with many steps in the process each with
many options from which selections must be made. In general, there is no
specific or prescriptive method that can be used to make these choices; there are simply heuristics (“rules-of-thumb”) which may be employed to help guide
the process. Each of the steps will be described in detail, a summary of the
main options available will be provided and the available heuristics to guide
selection will be reviewed.
The steps will be illustrated by describing two case studies: one will be a
mock example of a fuzzy expert system for financial forecasting and the other
will be a real example of a fuzzy expert system for a medical application. The
expert system framework considered here is restricted to rule-based systems.
While there are other frameworks that have been proposed for processing
information utilising fuzzy methodologies, these are generally less popular in
the context of fuzzy expert systems.
As a note on terminology, the term model is used to refer to the abstract
conception of the process being studied and hence fuzzy model is the notional
representation of the process in terms of fuzzy variables, rules and methods
that together define the input-output mapping relationship. In contrast, the
term system (as in fuzzy expert system) is used to refer to the embodiment, re-
alisation or implementation of the theoretical model in some software language
or package. A single model may be realised in different forms, for example, via
differing software languages or differing hardware platforms. Thus it should
be realised that there is a subtle, but important, distinction between the eval-
uation of a fuzzy model of expertise and the evaluation of (one or more of) its
corresponding fuzzy expert systems. A model may be evaluated as accurately
capturing or representing the domain problem under consideration, whereas
its realisation as software might contain bug(s) that cause undesired artefacts
in the output. This topic will be further explored in Sect. 6.7.
It will generally be assumed that the reader is familiar with fuzzy theory,
methods and terminology – this chapter is not intended to be an introductory
tutorial, rather it is a guide to currently accepted best practice for building a
fuzzy expert system. For a simple introductory tutorial the reader is referred
to Cox [8]; for a comprehensive coverage of fuzzy methods see, for example,
Klir and Yuan [22], Ruspini et al [34] or Kasabov [21].
The central question for this chapter is “what are smart adaptive fuzzy
expert systems?” In current state-of-the-art it is not possible to automatically
adapt a system created in one application area to address a novel application
area. Automatic tuning or optimisation techniques may be applied to (in some
sense) adapt a given fuzzy expert system to particular data (see Sect. 6.8).
Real “smart adaptive” systems are presently more likely to be found in neuro-
fuzzy or hybrid systems covered in subsequent chapters. In each new applica-
tion area, a fuzzy expert system must effectively be “hand-crafted” to achieve
the desired performance. Thus the creation of good fuzzy expert systems is
an art that requires skill and experience. Hints and tips to assist those new
to this area in making appropriate choices at each stage will be provided.
6.1 Introduction
The generic architecture of a fuzzy expert system showing the flow of data
through the system is shown in Fig. 6.1 (adapted from Mendel [26]). The
general process of constructing such a fuzzy expert system from initial model
design to system evaluation is shown in Fig. 6.2. This illustrates the typical
process flow as distinct stages for clarity but in reality the process is not
usually composed of such separate discrete steps and many of the stages,
although present, are blurred into each other.
Fig. 6.1. Generic architecture of a fuzzy expert system: crisp inputs x are fuzzified, processed by the inference engine using the rules, and defuzzified to give crisp outputs y = f(x)
Fig. 6.2. Typical process of constructing a fuzzy expert system: data preparation, choice of inference methodology and defuzzification method, definition of linguistic variables, membership functions and rules, followed by evaluation, with structure and parameter optimisation loops
Once the problem has been clearly specified (see Chap. 2), the process of
constructing the fuzzy expert system can begin. Invariably some degree of data
preparation and preprocessing is required, and this stage has been discussed
in detail in Chap. 3. The first major choice the designer has to face is whether
to use the Mamdani inference method [24] or the Takagi-Sugeno-Kang (TSK)
method [37, 39]. The essential difference in these two methodologies is that
the result of Mamdani inference is one or more fuzzy sets which must (almost
always) then be defuzzified into one or more real numbers, whereas the re-
sult of TSK inference is one or more real functions which may be evaluated
directly. Thus the choice of inference methodology is linked to the choice of
defuzzification method. Once the inference methodology and defuzzification
method have been chosen, the process of enumerating the linguistic variables
necessary can commence. This should be relatively straightforward if the prob-
lem has been well specified and is reasonably well understood. If this is not
the case, then the decision to construct a fuzzy expert system may not be ap-
propriate. The next stage of deciding the necessary terms with their defining
membership functions and determining the rules to be used is far from trivial
however. Indeed, this stage is usually the most difficult and time consuming
of the whole process.
After a set of fuzzy membership functions and rules has been established
the system may be evaluated, usually by comparison of the obtained output
against some desired or known output using some form of error or distance
function. However, it is very rare that the first system constructed will perform
at an acceptable level. Usually some form of optimisation or performance
tuning of the system will need to be undertaken. Again, there are a multitude
of options that a designer may consider for model optimisation. A primary
distinction illustrated in Fig. 6.2 is the use of either parameter optimisation
in which (usually) only aspects of the model such as the shape and location
of membership functions and the number and form of rules are altered, or
structure optimisation in which all aspects of the system including items such
as the inference methodology, defuzzification method, or number of linguistic
variables may be altered. In general, though, there is no clear distinction.
Some authors consider rule modification to be structure optimisation, while
others parameterise the rules.
As stated earlier, two case studies will be used through the rest of this chapter
to provide a grounding for the discussions. The two case studies are now briefly
introduced.
The problem is to design a fuzzy expert system to predict (advise) when to buy
or sell shares in an American company based on three sources of information:
1. the past share price of the company itself (share pr),
2. the Euro/Dollar exchange rate (xchg rate), and
3. the FTSE (Financial Times Stock Exchange) share index (the UK stock
index) (FTSE).
Clearly, this is an artificial scenario as the Dow Jones index would almost
certainly be used in any real predictor of an American company, but the case
study is designed to be illustrative rather than realistic.
Childbirth is a stressful experience for both mother and infant. Even during
normal labour every infant is being regularly deprived of oxygen as mater-
nal contractions, which increase in frequency and duration throughout labour
until delivery, restrict blood supply to the placenta. This oxygen deprivation
can lead to fetal “distress”, permanent brain damage and, in the extreme,
may result in fewer rules. As an example, consider the two case-studies pre-
sented. In the financial forecasting example, there would appear to be little /
no need for uncertainty representation in the output. If the output is above a
threshold, the advice will be to buy shares; if the output is below a threshold,
the advice will be to sell shares held. Although there is no specific require-
ment for real-time operation, TSK inference might be utilised as a Mamdani
system would just introduce an “unnecessary” step of defuzzification. In the
umbilical acid-base expert system, there was a specific requirement both for a
representation of uncertainty in the output and for the potential of obtaining a
linguistic rather than numeric output (see Sect. 6.6 for information on obtain-
ing linguistic output). There was no real-time constraint and hence Mamdani
inference was clearly indicated.
A great deal has been written about the theoretical properties and practical
choices of operators necessary to carry out fuzzy operations of set intersection
(AND) and union (OR). It is well established that fuzzy intersections are
represented by a class of functions that are called triangular norms or T-norms
and that fuzzy unions are represented by triangular conorms or T-conorms.
A T-norm ⊗ is a binary function:
⊗ : II² → II
where II represents the set of real numbers in the unit interval [0, 1], that satisfies the following conditions: for any a, b, c ∈ II,
(i) 1 ⊗ a = a (identity)
(ii) (a ⊗ b) ⊗ c = a ⊗ (b ⊗ c) (associativity)
(iii) a ⊗ b = b ⊗ a (commutativity)
(iv) a ⊗ b ≤ a ⊗ c, if b ≤ c (monotonicity)
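For concreteness, the two most commonly used T-norms (minimum and product) and their dual T-conorms can be written as below, together with a numerical check of the listed conditions on sample values; this is only an illustration of the definitions above.

```python
# Common T-norms and their dual T-conorms, applied to membership values in [0, 1].
t_min  = lambda a, b: min(a, b)         # minimum (Zadeh) T-norm
t_prod = lambda a, b: a * b             # product (probabilistic) T-norm
s_max  = lambda a, b: max(a, b)         # maximum T-conorm, dual of min
s_sum  = lambda a, b: a + b - a * b     # probabilistic sum, dual of product

a, b, c = 0.3, 0.7, 0.9
for t in (t_min, t_prod):
    assert abs(t(1.0, a) - a) < 1e-12                   # (i)   identity
    assert abs(t(t(a, b), c) - t(a, t(b, c))) < 1e-12   # (ii)  associativity
    assert t(a, b) == t(b, a)                           # (iii) commutativity
    assert t(a, b) <= t(a, c)                           # (iv)  monotonicity (b <= c)
```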
The parameters c, σ and a determine the centre and spread of Gaussian or sigmoidal membership functions. Again, during development of the umbilical acid-
base FES it was found both that clinicians preferred the appearance of non-
piecewise linear membership functions (in fact, a combination of sigmoidals
and double-sigmoidals were used) and that these membership functions gave
a slight performance gain. Many authors, particularly in the context of fuzzy
control, prefer to use piecewise linear functions, probably due to their ease of
calculation, whereas others such as Mendel [26] prefer Gaussians because they
are more simply analysable (e.g. differentiable) and hence are more suitable
for automated tuning methodologies.
In the context of a fuzzy expert system, the source of membership func-
tions would ideally be domain experts. However, again, there is no generally
accepted method for eliciting membership functions from experts. An excel-
lent survey of methods for membership function elicitation and learning is
given by Krishnapuram [23]. The methods include:
• Membership functions based on perceptions – in which human sub-
jects are polled or questioned.
• Heuristic methods – guessing!
• Histogram-based methods – in which membership functions are some-
how matched to histograms of the base variables obtained from real data.
• Transformation of probability distributions to possibility distri-
butions – if membership functions are considered numerically equivalent
to possibility distributions and probability distributions are available, then
methods are available to transform the probability distributions to possi-
bility distributions.
• Neural-network-based methods – standard feedforward neural net-
works can be used to generate membership functions from labelled training
data.
• Clustering-based methods – Clustering algorithms such as fuzzy c-
means [4] can be used.
One further observation will be added. Most elicitation methods concen-
trate on direct or indirect elicitation of the membership functions themselves –
i.e. they operate on the input side of the FES. During knowledge elicitation
for the umbilical acid-base FES, it was found that such methods led to poor
performance. That is, if an expert indicated membership functions directly,
then the overall agreement of the FES with that expert was poor. In contrast,
it was found that if the expert indicated desirable or undesirable features in
the output consequent sets for specific (training) cases, then the input mem-
bership functions could be adjusted by a sort of informal back-propagation
methodology to achieve better agreement with the expert.
• Neuro-fuzzy approaches (see Chaps. 7 & 9) or, for example, the methods
of Jang [16] and Wang and Mendel [41].
6.6 Defuzzification
Once the fuzzy reasoning has been completed it is usually necessary to present
the output of the reasoning in a human understandable form, through a
process termed defuzzification. There are two principal classes of defuzzifi-
cation, arithmetic defuzzification and linguistic approximation. In arithmetic
defuzzification a mathematical method is used to extract the single value in
the universe of discourse that “best” (in some sense) represents the arbitrar-
ily complex consequent fuzzy set (Fig. 6.5). This approach is typically used
in areas of control engineering where some crisp result must be obtained. In
linguistic approximation the primary terms in the consequent variable’s term
set are compared against the actual output set in a variety of combinations
until the “best” representation is obtained in natural language. This approach
might be used in expert system advisory applications where human users view
the output, although its actual usage has been relatively rare in the literature.
The two most popular methods of arithmetic defuzzification are the centre-
of-gravity (COG) algorithm and the mean-of-maxima algorithm. For the con-
sequent set A = µ1 /x1 + µ2 /x2 + . . . + µN /xN , the centre-of-gravity algorithm
provides a single value by calculating the imaginary balance point of the shape
of the membership:
x_g = Σ_{i=1}^{N} (µ_i · x_i) / Σ_{i=1}^{N} µ_i    (6.4)
The mean-of-maxima algorithm is based on the maximum membership value,
x_m = max_i µ_i    (6.5)
and calculates the mean of all the maxima if more than one maximum is found.
An illustration of the output of these two methods is shown in Fig. 6.6. Un-
fortunately, both these methods have problems. Firstly, they both obviously
lose information by trying to represent a complex fuzzy shape as a single
scalar number. The COG method is insensitive to the overall height of the
fuzzy consequent set, and the mean-of-maxima is prone to discontinuities in
output, as only a small change in shape (for instance if there are two similar
sized peaks) can cause a sudden large change in output value.
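Both methods are straightforward to implement on a discretized consequent set, as the sketch below shows for equations (6.4) and (6.5); the example output set is made up.

```python
import numpy as np

def centre_of_gravity(x, mu):
    """Equation (6.4): the imaginary balance point of the membership shape."""
    return np.sum(mu * x) / np.sum(mu)

def mean_of_maxima(x, mu):
    """Equation (6.5): the mean of all points where the membership is maximal."""
    return x[mu == mu.max()].mean()

# an invented consequent set on the universe 0..20
x = np.linspace(0, 20, 41)
mu = np.clip(1 - np.abs(x - 14) / 5, 0, None)      # triangular output set centred on 14
print(centre_of_gravity(x, mu), mean_of_maxima(x, mu))
```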
A number of alternative parameters can also be calculated to provide more
information on the shape of the output as well as its location. A variety of such
parameters are described, and illustrated through the example fuzzy output
sets, A, B, C, D, E, and F , as shown in Fig. 6.7.
Fig. 6.6. An example output fuzzy set with its centre of gravity and mean of maxima defuzzified values
Fig. 6.7. Example output fuzzy sets A, B, C, D, E and F, shown as membership value over the universe of discourse
Normalised Area
The area of the output set normalised to its maximum value is given by:
area = (Σ_{i=1}^{N} µ_i) / N    (6.6)
This gives a value of 1.0 for the unknown set (µ = 1 across the universe of
discourse), a value of 0.0 for the undefined set (µ = 0 across the universe
of discourse), and would give a minimal value (≈0) for a fuzzy singleton. In
Fig. 6.7 the output fuzzy sets B and C have the same COG (xg = 40) and
the same membership at this point (µg = 0.80), but set B has a larger area
and hence a larger uncertainty.
Fuzzy Entropy
Yet another measure, termed the entropy of a fuzzy set is defined by:
S = (Σ_{i=1}^{N} (−µ_i log_2(µ_i) − (1 − µ_i) log_2(1 − µ_i))) / N    (6.7)
This is normalised to its maximum value to give a value between zero and
one, and provides an indication of the lack of information contained in the
output set in terms of the distance away from the extremes of µ = 0.0 and
µ = 1.0. It therefore gives a value of 0.0 for the unknown and undefined sets,
and gives a value of 1.0 for the indeterminate set (µ = 0.5 across the universe
of discourse). Similarly to the normalised area, it too gives a minimal value
for fuzzy singletons.
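The normalised area (6.6) and the fuzzy entropy (6.7) can be computed in the same discretized setting; the small epsilon below only guards against log_2(0) at µ = 0 or µ = 1 and is an implementation detail, not part of the formula.

```python
import numpy as np

def normalised_area(mu):
    """Equation (6.6): area of the output set divided by the size of the universe."""
    return np.sum(mu) / len(mu)

def fuzzy_entropy(mu, eps=1e-12):
    """Equation (6.7), normalised to lie between zero and one."""
    m = np.clip(mu, eps, 1 - eps)
    s = -m * np.log2(m) - (1 - m) * np.log2(1 - m)
    return np.sum(s) / len(mu)

mu = np.full(50, 0.5)                                # the indeterminate set
print(normalised_area(mu), fuzzy_entropy(mu))        # 0.5 and (approximately) 1.0
```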
One option is to compute a distance measure δ between the memberships µ_i of the output set and the membership grades η_i of the currently considered linguistic approximation – the minimum value of δ will determine the best match. Alternatively, the degree of overlap, γ, of two
fuzzy sets, A and B, can be calculated by dividing the area of intersection by
the area of the union of the sets:
γ = (A ∩ B) / (A ∪ B)    (6.9)
to give a value between zero (for disparate sets) and one (for coincidental
sets) – the maximum value of γ will determine the best match.
A search is then initiated to find the best match whilst attempting to limit
the complexity of the combination of terms, in order to produce comprehen-
sible output. For example, although the linguistic combination not extremely
cold and fairly medium and medium and fairly hot might produce a better
match than medium and fairly hot for Fig. 6.6, the latter term would be
preferred due to its relative simplicity.
6.7 Evaluation
Many authors have used the terms verification, validation, assessment and
evaluation in differing and inconsistent manners in the literature [29, 30]. In
this section the following terminology, designed specifically for the European
Advanced Informatics in Medicine (AIM) project [10], is adopted:
• verification is the process of ensuring that the expert system is functioning
according to its specification,
• validation is the process of ensuring that the knowledge embedded within
the expert system is an accurate representation of the domain, and
• assessment is the process of determining the effect that the expert system
has in the real-world setting – this can be further split into two further
sub-tasks:
1. human factors assessment – determining whether the system is useful
to and usable by its target users, and
2. performance assessment – determining whether the system makes a
measurable difference (improvement) when deployed.
Evaluation is a global term that refers to the collective processes of verifica-
tion, validation and assessment.
In an ideal framework of scientific investigation the entire evaluation
methodology would be established and fixed a priori. That is to say, the data,
experiments, statistical tests and acceptable performance measures to be used
in deciding whether an expert system was acceptable would all be decided be-
fore construction of the system commenced. This is very rarely the case in
practical expert system development and fuzzy expert systems are no excep-
tion. It is not possible to cover the process of system evaluation thoroughly
here; the reader is referred to [29, 30] for general guidance.
For the financial forecasting case study, an obvious performance measure is the profit (or loss!) made when trading shares using real share prices. Either past data that has not been used for system training and testing could be used for evaluation or, better, real data collected after the point at which the system can be shown to have been fixed. Often some variation of mean-
squared-error (MSE) between actual output and ideal (or desired) output is
utilised.
However, frequently in expert systems (particularly medical expert sys-
tems) there is no objective measure of correct performance and other means
must be used. The solution in such cases is usually to compare the output of
the (fuzzy) expert system against human expert opinion on a range of data.
An indirect measure of performance is then created. If the expert opinion and
the expert system output are categorical (e.g. classification into one of a number
of specified disease categories) then the proportion of correct matches can be
used. It is better to use a statistical method to correct for chance agreement
such as the Kappa statistic [6], which can be used to measure either exact
agreements only or can be modified to allow for partial agreement(s) [7]. In
the case of the umbilical acid-base FES, a set of 50 difficult cases (see below)
was selected for evaluation. Clinical experts and the fuzzy expert system were
then asked to rank those cases from worst to best in terms of their indication
of the infant’s state of health. Spearman rank order correlation, effectively a
form of MSE, was then used to measure agreement.
Imagine, as is often the case, that the result of an FES is a number on a
continuous scale obtained by, for example, centre-of-gravity defuzzification. If
this number is representing, say, diagnosis of the presence of a disease, then
what threshold should be used to indicate that the disease is indeed present?
Often an arbitrary threshold is chosen (e.g. 50 on a scale 0 . . . 100). However,
a better solution is to use a technique known as Receiver Operating Char-
acteristic (ROC) curves [9, 38] in which the threshold is continuously varied
in order to achieve the best agreement. Note that, if ROC analysis is used, a
method of generating accurate confidence intervals should be employed [40].
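As an illustration of these measures, the sketch below uses scikit-learn's implementations of the Kappa statistic and ROC analysis on invented expert ratings and FES outputs; it is not tied to either of the case studies.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_curve, roc_auc_score

expert = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # expert's categorical opinion (invented)
fes_class = np.array([1, 0, 1, 0, 0, 0, 1, 1])     # FES output mapped to the same categories

# chance-corrected agreement between the two categorical ratings
print("kappa:", cohen_kappa_score(expert, fes_class))

# ROC analysis: sweep the threshold over the continuous (defuzzified) FES output
fes_score = np.array([80, 20, 65, 45, 30, 10, 90, 55])   # e.g. COG output on a 0..100 scale
fpr, tpr, thresholds = roc_curve(expert, fes_score)
print("AUC:", roc_auc_score(expert, fes_score))
```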
Picking suitable data to be used for evaluation purposes is another tricky
area. Ideally, suitable data should cover the range of output possibilities of
the fuzzy expert system in a systematic manner. But how can this be known
until after evaluation has been carried out? There is no simple answer to this.
Probably the best solution, if possible, is to have an independent, acknowl-
edged domain expert pick a suitable set of cases. However, it is often difficult to find such an expert, and it is undesirable, although technically necessary, to then exclude this expert from the evaluation exercise itself.
6.11 Conclusions
References
1. E. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines: A Sto-
chastic Approach to Combinatorial Optimization and Neural Computing. John
Wiley & Sons, New York, 1989.
2. A. Abraham. EvoNF: A framework for optimisation of fuzzy inference systems
using neural network learning and evolutionary computation. In Proceedings of
the 17th IEEE International Symposium on Intelligent Control (ISIC’02), pages
327–332, IEEE Press, 2002.
3. C. Baroglio, A. Giordana, M. Kaiser, M. Nuttin, and R. Piola. Learning control
for industrial robots. Machine Learning, 2/3:221–250, 1996.
4. J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms.
Plenum, New York, 1981.
5. J.L. Castro, J.J. Castro-Schez, and J.M. Zurita. Learning maximal structure
rules in fuzzy logic for knowledge acquisition in expert systems. Fuzzy Sets and
Systems, 101:331–342, 1999.
6. J. Cohen. A coefficient of agreement for nominal scales. Educational Psycho-
logical Measurement, 20:37–46, 1960.
7. J. Cohen. Weighted kappa: Nominal scale agreement with provision for scaled
disagreement or partial credit. Psychological Bulletin, 70:213–220, 1968.
8. E. Cox. The Fuzzy Systems Handbook: A Practitioner’s Guide to Building,
Using and Maintaining Fuzzy Systems. AP Professional, San Diego, CA, second
edition, 1998.
9. J.P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, New
York, 1975.
10. R. Engelbrecht, A. Rector, and W. Moser. Verification and validation. In
E.M.S.J. van Gennip and J.L. Talmon, editors, Assessment and Evaluation of
Information Technologies, pages 51–66. IOS Press, 1995.
11. J.M. Garibaldi and E.C. Ifeachor. Application of simulated annealing fuzzy
model tuning to umbilical cord acid-base interpretation. IEEE Transactions on
Fuzzy Systems, 7(1):72–84, 1999.
12. J.M. Garibaldi and E.C. Ifeachor. The development of a fuzzy expert system
for the analysis of umbilical cord blood. In P. Szczepaniak, P.J.G. Lisboa, and
J. Kacprzyk, editors, Fuzzy Systems in Medicine, pages 652–668. Springer-
Verlag, 2000.
13. J.M. Garibaldi and R.I. John. Choosing membership functions of linguistic
terms. In Proceedings of the 2003 IEEE International Conference on Fuzzy
Systems (FUZZ-IEEE 2003), pages 578–583, St. Louis, USA, 2003. IEEE, New
York.
14. J.M. Garibaldi, J.A. Westgate, and E.C. Ifeachor. The evaluation of an ex-
pert system for the analysis of umbilical cord blood. Artificial Intelligence in
Medicine, 17(2):109–130, 1999.
15. J.M. Garibaldi, J.A. Westgate, E.C. Ifeachor, and K.R. Greene. The develop-
ment and implementation of an expert system for the analysis of umbilical cord
blood. Artificial Intelligence in Medicine, 10(2):129–144, 1997.
16. J.-S.R. Jang. ANFIS: Adaptive-network-based fuzzy inference system. IEEE
Transactions on Systems, Man, and Cybernetics, 23(3):665–685, 1993.
17. J.-S.R. Jang. Structure determination in fuzzy modeling: A fuzzy CART ap-
proach. In Proceedings of IEEE International Conference on Fuzzy Systems
(FUZZ-IEEE’94), pages 480–485, Orlando, FL, 1994.
18. R.I. John. Embedded interval valued fuzzy sets. Proceedings of FUZZ-IEEE
2002, pages 1316–1321, 2002.
19. R.I. John, P.R. Innocent, and M.R. Barnes. Neuro-fuzzy clustering of radi-
ographic tibia image data using type-2 fuzzy sets. Information Sciences, 125/1-
4:203–220, 2000.
20. N.N. Karnik and J.M. Mendel. Operations on type-2 fuzzy sets. Fuzzy Sets and
Systems, 122:327–348, 2001.
21. N.K. Kasabov. Foundations of Neural Networks, Fuzzy Systems and Knowledge
Engineering. MIT Press, Cambridge, Massachussets, 1996.
22. G.J. Klir and B. Yuan. Fuzzy Sets and Fuzzy Logic: Theory and Applications.
Prentice-Hall, Upper Saddle River, NJ, 1995.
44. L.A. Zadeh. From computing with numbers to computing with words – from
manipulation of measurements to manipulation of perceptions. IEEE Transac-
tions on Circuits and Systems, 45(1):105–119, 1999.
45. L.A. Zadeh. A prototype-centered approach to adding deduction capability to
search engines – the concept of protoform. Proceedings of NAFIPS 2002, pages
523–525, 2002.
46. National Research Council Canada, Institute for Information Technology, Fuzzy
CLIPS Website, http://www.iit.nrc.ca/IR public/fuzzy.
47. Python Website, http://www.python.org.
7
Learning Algorithms for Neuro-Fuzzy Systems
D.D. Nauck
In this chapter we look at techniques for learning fuzzy systems from data.
These approaches are usually called neuro-fuzzy systems, because many of
the available learning algorithms used in that area are inspired by techniques
known from artificial neural networks. Neuro-fuzzy systems are generally accepted as hybrid approaches, although they rarely combine a neural network with a fuzzy system as the name would suggest. However, they combine techniques of both areas and therefore the term is justified.
Methods for learning fuzzy rules are important tools for analysing and
explaining data. Fuzzy data analysis – meaning here the application of fuzzy
methods to the analysis of crisp data – can lead to simple, inexpensive and
user-friendly solutions. The rule based structure and the ability of fuzzy sys-
tems to compress data and thus reducing complexity leads to interpretable
solutions which is important for business applications. While many approaches
to learning fuzzy systems from data exist, fuzzy solutions can also accommo-
date prior expert knowledge in the form of simple-to-understand fuzzy rules –
thus closing the gap between purely knowledge-based and purely data-driven
methods. This chapter reviews several basic learning methods to derive fuzzy
rules in a data mining or intelligent data analysis context. We present one al-
gorithm in more detail because it is particularly designed for generating fuzzy
rules in a simple and efficient way, which makes it very useful for inclusion in applications.
7.1 Introduction
Modern businesses gather vast amounts of data daily. For example, data about
their customers, use of their products and use of their resources. The com-
puterisation of all aspects of our daily lives and the ever-growing use of the
Internet make it ever easier to collect and store data. Nowadays customers
expect that businesses cater for their individual needs. In order to personalise
services, intelligent data analysis (IDA) [3] and adaptive (learning) systems
are required. Simple linear statistical analysis as it is mainly used in today’s
businesses cannot model complex dynamic dependencies that are hidden in
the collected data. IDA goes one step further than today’s data mining ap-
proaches and also considers the suitability of the created solutions in terms of usability, comprehension, simplicity and cost. The intelligence in IDA comes
from the expert knowledge that can be integrated in the analysis process, the
knowledge-based methods used for analysis and the new knowledge created
and communicated by the analysis process.
Whatever strategy businesses pursue today, cost reduction is invariably at the heart of it. In order to succeed they must know the performance of their processes and find means to optimise them. IDA provides means to combine process knowledge with the collected data. Learning systems based
on IDA methods can continuously optimise processes and also provide new
knowledge about business processes. IDA is therefore an important aspect in
modern knowledge management and business intelligence.
In addition to statistical methods, today we also have modern intelli-
gent algorithms based on computational intelligence and machine learning.
Computational intelligent methods like neuro-fuzzy systems and probabilistic
networks or AI methods like decision trees or inductive logic programming
provide new, intelligent ways for analysing data. Research in data analysis
more and more focuses on methods that allow both the inclusion of available
knowledge and the extraction of new, comprehensible knowledge about the
analysed data.
Methods that learn fuzzy rules from data are a good example for this type
of research activity. A fuzzy system consists of a collection of fuzzy rules
which use fuzzy sets [70] to specify relations between variables. Fuzzy rules
are also called linguistic rules, because fuzzy sets can be conveniently used for
describing linguistic expressions like, for example, small, medium or large.
We interpret fuzzy systems as convenient models to linguistically represent
(non-linear) mappings [71]. The designer of a fuzzy system specifies character-
istic points of an assumed underlying function by encoding them in the form
of simple fuzzy rules. This function is unknown except for those characteristic
points. The fuzzy sets that are used to linguistically describe those points ex-
press the degree of indistinguishability of points that are close to each other
[26, 33]. For parts of the function where characteristic points are not known,
but where training data is available, fuzzy rules can be conveniently mined
from the data by a variety of learning algorithms.
The advantages of applying a fuzzy system are the simplicity and the
linguistic interpretation of the approach. This allows for the inexpensive and
fast development and maintenance of solutions and thus enables us to solve
problems in application areas where rigorous formal analysis would be too
expensive and time-consuming.
In this chapter we review several approaches to learn fuzzy rules from data
in Sect. 7.2 and explain one particular algorithm (NEFCLASS) in more detail
in Sect. 7.3. We close the chapter with an example of applying this learning
algorithm to a data set.
7.2.1 Cluster-Oriented
and Hyperbox-Oriented Fuzzy Rule Learning
Cluster-oriented methods try to group the training data into clusters and
use them to create rules. Fuzzy cluster analysis [5, 7] can be used for this
task by searching for spherical or hyperellipsoidal clusters. The clusters are
multidimensional (discrete) fuzzy sets which overlap. An overview on several
fuzzy clustering algorithms can be found, for example, in [16].
Each fuzzy cluster can be transformed into a fuzzy rule, by projecting the
degrees of membership of the training data to the single dimensions. Thus
for each cluster and each variable a histogram is obtained that must be ap-
proximated either by connecting the degrees of membership by a line, by a
convex fuzzy set, or – more preferably – by a parameterised membership func-
tion that should be both normal and convex and fits the projected degrees of
memberships as well as possible [28, 63]. This approach can result in forms of
membership functions which are difficult to interpret (see Fig. 7.1).
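The projection step itself can be sketched as follows: given the membership degrees of the training data in one cluster (assumed to come from some fuzzy clustering algorithm), project them onto a single variable and approximate them by a parameterised, here triangular, membership function. The fitting rule and the support threshold of 0.05 are arbitrary illustrative choices.

```python
import numpy as np

def fit_triangle(x_values, memberships):
    """Approximate projected membership degrees by a triangular fuzzy set (a, b, c):
    the centre b is the membership-weighted mean, the feet a and c span the support."""
    b = np.sum(memberships * x_values) / np.sum(memberships)
    support = x_values[memberships > 0.05]         # ignore near-zero memberships
    return support.min(), b, support.max()

def triangle(x, a, b, c):
    """Evaluate the fitted triangular membership function at x."""
    left = (x - a) / max(b - a, 1e-12)
    right = (c - x) / max(c - b, 1e-12)
    return float(np.clip(min(left, right), 0.0, 1.0))

# toy usage: memberships of seven training points in one cluster, projected on variable x
x_vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
u = np.array([0.0, 0.2, 0.7, 1.0, 0.6, 0.3, 0.0])
a, b, c = fit_triangle(x_vals, u)
print((a, b, c), triangle(4.0, a, b, c))
```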
This procedure also causes a loss of information because the Cartesian
product of the induced membership functions does not reproduce a fuzzy
cluster exactly (see Fig. 7.2). This loss of information is strongest in the case
of arbitrarily oriented hyperellipsoids. To make this problem easier to handle,
it is possible to search for axes–parallel hyperellipsoids only [29].
The fuzzy rule base obtained by projecting the clusters is usually not easy
to interpret, because the fuzzy sets are induced individually for each rule
Fig. 7.1. Creation of a fuzzy set from projected degrees of membership
Fig. 7.2. If clusters in the form of hyperellipsoids are projected to obtain fuzzy
rules, a loss of information occurs and unusual fuzzy partitions can be obtained
(Fig. 7.2). For each feature there will be as many different fuzzy sets as there
are clusters. Some of these fuzzy sets may be similar, yet they are usually not
identical. For a good interpretation it is necessary to have a fuzzy partition
of few fuzzy sets where each clearly represents a linguistic term.
The loss of information that occurs in projecting fuzzy clusters can be
avoided, if the clusters are hyperboxes and parameterised multidimensional
membership functions are used to represent a cluster. As in fuzzy cluster
analysis the clusters (hyperboxes) are multidimensional overlapping fuzzy
sets. The degree of membership is usually computed in such a way that the
projections of the hyperboxes on the individual variables are triangular or
trapezoidal membership functions (Fig. 7.3).
Hyperbox-oriented fuzzy rule learning is usually supervised. For each pat-
tern of the training set that is not covered by a hyperbox, a new hyperbox is
created and the output of the training pattern (class information or output
value) is attached to this new hyperbox. If a pattern is incorrectly covered
by a hyperbox with different output, this hyperbox is shrunk. If a pattern is
Fig. 7.5. Searching for hyperboxes to create fuzzy rules and fuzzy sets
analysis, where the clusters are hyperboxes which are aligned on a grid. It can
therefore also be interpreted as hyperbox-oriented. Because the data space
is structured by predefined fuzzy sets it can also be viewed as a structure-
oriented learning method as discussed in Sect. 7.3. The clusters are given by
membership functions and not vice versa. Therefore the rule base obtained
by grid clustering can be well interpreted, because the fuzzy rules do not use
individual fuzzy sets.
It is also possible to use neural networks to create fuzzy rule bases. RBF
networks can be used to obtain a (Sugeno-type) fuzzy rule base. An RBF net-
work uses multi-dimensional radial basis functions in the nodes of its hidden
layer. Each of these functions can be interpreted as a fuzzy cluster. If the
RBF network is trained by a gradient descent procedure it adjusts the loca-
tion of the radial basis functions and – depending on the type of network –
also their size and orientation. A fuzzy rule base is determined after training
by projecting the radial basis functions.
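The following sketch illustrates this reading of a trained Gaussian RBF network as a (zero-order) Sugeno-type rule base; the data layout, the zero-order consequents and the variable names are assumptions made for the illustration, not a prescribed conversion routine.

import numpy as np

def rbf_to_sugeno_rules(centres, widths, weights):
    """Read a trained Gaussian RBF network as a zero-order Takagi-Sugeno rule
    base by projecting every radial basis function onto the individual input
    dimensions. centres, widths: arrays of shape (k, n); weights: (k,) output
    weights of the hidden nodes."""
    rules = []
    for c, s, w in zip(centres, widths, weights):
        # antecedent: one Gaussian fuzzy set (centre, width) per input variable
        antecedent = [("x%d" % (i + 1), float(ci), float(si))
                      for i, (ci, si) in enumerate(zip(c, s))]
        rules.append({"if": antecedent, "then": float(w)})
    return rules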
Pedrycz and Card suggested a way to linguistically interpret Kohonen’s
self-organizing feature maps (SOM) [31] in order to create a fuzzy rule base
[58]. A feature map is used to perform a cluster analysis on the training data
and thus to reduce the data set to a set of prototypes represented by neurons.
The prototypes are then used to create a fuzzy rule base by a structure-
oriented approach as they are discussed in Sect. 7.3.
Kosko suggested a cluster-oriented approach to fuzzy rule learning that is
based on his FAM model (Fuzzy Associative Memory) [32]. Kosko uses a form
of adaptive vector quantisation that is not topology preserving as in the case
Fig. 7.6. Structure-oriented approaches use initially defined fuzzy sets (e.g. small, medium, large) to structure the data space by overlapping hyperboxes, which represent fuzzy rules
To apply the Wang & Mendel algorithm all variables are partitioned by
fuzzy sets. For this purpose equidistant overlapping triangular or trapezoidal
membership functions are usually used. By this means the feature space is
partitioned by overlapping multidimensional fuzzy sets whose support is a
hyperbox. Rules are created by selecting those hyperboxes that contain data.
Wang and Mendel designed their algorithm to create fuzzy systems for func-
tion approximation. In order to mediate between different output values for
the same combination of input values, they used weighted rules. In [67] a
proof can be found that this algorithm can create fuzzy rule bases that can
approximate any real continuous function over a compact set to an arbitrary
accuracy.
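A minimal sketch of the Wang & Mendel rule-generation idea described above is given below; the data structures and the resolution of conflicting rules via a rule weight follow the description in the text, while the function and parameter names are illustrative, not the original implementation of [66, 67].

import numpy as np

def wang_mendel(X, y, input_mfs, output_mfs):
    """Minimal sketch of Wang & Mendel rule generation. X: (s, n) inputs,
    y: (s,) outputs; input_mfs[i] is the list of membership functions
    (callables) for variable x_i, output_mfs the list for the output."""
    rules = {}   # antecedent (tuple of winning MF indices) -> (consequent, weight)
    for p, t in zip(X, y):
        # for every variable select the fuzzy set with the highest membership
        ante = tuple(int(np.argmax([mf(v) for mf in mfs]))
                     for v, mfs in zip(p, input_mfs))
        cons = int(np.argmax([mf(t) for mf in output_mfs]))
        # rule weight: product of the memberships of all winning fuzzy sets
        w = output_mfs[cons](t)
        for i, j in enumerate(ante):
            w *= input_mfs[i][j](p[i])
        # conflicting rules (same antecedent) are resolved by the higher weight
        if ante not in rules or w > rules[ante][1]:
            rules[ante] = (cons, w)
    return rules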
Higgins & Goodman [15] suggested a variation of the Wang & Mendel al-
gorithm in order to create fuzzy partitions during rule creation by refining the
existing partitions. The algorithm begins with only one membership function
for each variable, such that the whole feature space is covered by one large
hyperbox. Subsequently, new membership functions are inserted at points of
maximum error by refining the fuzzy partitions of all variables. Then the old
rules are discarded and a new set of rules is generated based on the new fuzzy
partitions. This procedure is iterated until a maximum number of fuzzy sets
is created or the error decreases below some threshold. This algorithm was
created in order to compensate for a drawback of the Wang & Mendel al-
gorithm, which has problems modelling extreme values of the function to be
approximated. However, the Higgins & Goodman algorithm tends to fit outliers
because it concentrates on areas with large error.
Fuzzy decision trees are another approach to structure-oriented fuzzy rule
learning. Induction of decision trees [19, 20] is a very popular approach in
(a) replace the functions used in the fuzzy system (like min and max) by
differentiable functions, or
(b) do not use a gradient-based neural learning algorithm but a better-suited
procedure.
• When real world data sets must be analysed we often have to deal with dif-
ferent types of variables on different scales, i.e. nominal scales (categorical
or symbolic data), ordinal scales, or interval and ratio scales (both metric).
Fuzzy systems do not depend on the type of the data they process, i.e. they
can work with numerical and non-numerical variables. A neuro-fuzzy algorithm
that addresses combinations of numerical and non-numerical data
is discussed in [12] and [42].
• In processing real world data we often must deal with missing values.
Many learning algorithms cannot cope with this problem and simply delete
incomplete patterns from the learning problem. This, however, can lead to
a substantial or even unacceptable loss of training data. An approach to
learning fuzzy rules if the data contain missing values is described in [41].
• For learning fuzzy controllers usually reinforcement learning [24] – a special
form of supervised learning – is used. This type of learning uses reinforce-
ment signals instead of target output values, which are typically unknown
for control problems. For an overview on reinforcement learning in fuzzy
systems see [16]. Recent advances can be found in [57].
where s is the cardinality of the training data set and qi is the number of fuzzy
sets provided for variable xi. If the training data has a clustered structure and
concentrates only in some areas of the data space, then the number of rules
will be much smaller than the theoretically possible number of rules. The
actual number of rules will normally be bounded by criteria defined by the user,
such as “create no more than k rules” or “create so many rules that at least p%
of all training data are covered”.
The suitability of the rule base depends on the initial fuzzy partitions. If
there are too few fuzzy sets, groups of data that should be represented by
different rules might be covered by a single rule only. If there are more fuzzy
sets than necessary to distinguish different groups of data, too many rules will
be created and the interpretability of the rule base decreases. The example
in Fig. 7.6 shows three clusters of data that are represented by the following
three rules:
if x is small then y is large
if x is medium then y is small
if x is large then y is medium
In Algorithms 7.1 – 7.3 we present procedures for structure-oriented fuzzy
rule learning in classification or function approximation problems. The algo-
rithms are implemented in the neuro-fuzzy approach NEFCLASS. The algo-
rithms use the following notations:
• L̃: a set of training data (fixed learning problem) with |L̃| = s, which represents a classification problem where patterns p ∈ IRn are to be assigned to m classes C1, . . . , Cm, with Ci ⊆ IRn.
• (p, t) ∈ L̃: a training pattern consists of an input vector p ∈ IRn and a
target vector t ∈ [0, 1]m . The target vector represents a possibly vague
classification of the input pattern p. The class index of p is given by the
index of the largest component of t: class(p) = argmaxj {tj }.
• R = (A, C): a fuzzy classification rule with antecedent ant(R) = A and consequent con(R) = C, where A = (µ_j1^(1), . . . , µ_jn^(n)) and C is a class. We use both R(p) and A(p) to denote the degree of fulfilment of rule R (with antecedent A) for pattern p, i.e. R(p) = A(p) = min{µ_j1^(1)(p1), . . . , µ_jn^(n)(pn)}.
• µ_j^(i): the jth fuzzy set of the fuzzy partition of input variable xi. There are qi fuzzy sets for variable xi.
• cA : a vector with m entries to represent the accumulated degrees of mem-
bership to each class for all patterns with A(p) > 0; cA [j] is the jth entry
of cA .
• PR ∈ [−1, 1]: a value representing the performance of rule R:

  PR = (1/s) Σ_{(p,t)∈L̃} (−1)^c R(p),  with c = 0 if class(p) = con(R) and c = 1 otherwise.   (7.1)
At first, the rule learning algorithm detects all rule antecedents that cover
some training data and creates a list of antecedents. In the beginning this list
is either empty, or it contains antecedents from rules given as prior knowledge.
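To make the notation above concrete, the following sketch computes the performance measure (7.1) for one candidate rule; the Rule container and the layout of the membership functions are assumptions for the illustration, not the NEFCLASS data structures.

from collections import namedtuple

Rule = namedtuple("Rule", ["antecedent", "consequent"])  # MF indices, class label

def rule_performance(rule, data, mfs):
    """Compute the performance P_R of (7.1) for one candidate classification
    rule. data: list of (pattern, class_label) pairs; mfs[i][j]: j-th fuzzy set
    of input variable x_i (a callable)."""
    total = 0.0
    for p, cls in data:
        # degree of fulfilment R(p): minimum over the antecedent memberships
        r = min(mfs[i][j](p[i]) for i, j in enumerate(rule.antecedent))
        total += r if cls == rule.consequent else -r   # (-1)^c R(p)
    return total / len(data)                           # divide by s = |L~|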
Algorithm 7.2 Select the best rules for the rule base
SelectBestRules
(* The algorithm determines a rule base by selecting the best rules from *)
(* the list of rule candidates created by Algorithm 7.1. *)
1: k = 0; stop = false;
2: repeat
3: R = argmax_R {PR};
4: if fixed rule base size then
5: if (k < kmax ) then
6: add R to rule base;
7: delete R from list of rule candidates;
8: k = k + 1;
9: else
10: stop = true;
11: end if
12: else if (all patterns must be covered) then
13: if (R covers some still uncovered patterns) then
14: add R to rule base;
15: delete R from list of rule candidates;
16: if (all patterns are now covered) then
17: stop = true;
18: end if
19: end if
20: end if
21: until stop
Algorithm 7.3 Select the best rules per class for the rule base
SelectBestRulesPerClass
(* The algorithm determines a rule base by selecting the best rules for each *)
(* class from the list of rule base candidates created by Algorithm 7.1. *)
1: k = 0; stop = false;
2: repeat
3: for all classes C do
4: if (∃R : con(R) = C) then
5: R = argmax_{R: con(R)=C} {PR};
6: if (fixed rule base size) then
7: if (k < kmax ) then
8: add R to rule base;
9: delete R from list of rule candidates;
10: k = k + 1;
11: else
12: stop = true;
13: end if
14: else if (all patterns must be covered) then
15: if (R covers some still uncovered patterns) then
16: add R to rule base;
17: delete R from list of rule candidates;
18: end if
19: if (all patterns are now covered) then
20: stop = true;
21: end if
22: end if
23: end if
24: end for
25: until stop
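A compact sketch of the fixed-size branch of the rule selection (cf. Algorithm 7.2) might look as follows; the function names are illustrative.

def select_best_rules(candidates, performance, k_max):
    """Repeatedly move the candidate with the highest performance value P_R
    into the rule base until k_max rules have been selected. `performance`
    maps a rule to its value according to (7.1)."""
    pool = list(candidates)
    rule_base = []
    while pool and len(rule_base) < k_max:
        best = max(pool, key=performance)
        rule_base.append(best)
        pool.remove(best)
    return rule_base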
rules is restricted by some value given by the user. The latter method does
not need to process the training data again. Only if the rule base size is
determined automatically must the training patterns be processed again, until
so many rules have been selected that all patterns are covered by rules.
If the number of rules is restricted the rule learning algorithm is not very
much influenced by outliers. Rules that are only created to cover outliers have
a low performance value and will not be selected for the rule base.
The performance of the selected rule base depends on the fuzzy parti-
tions that are provided for the input (and output) variables. To increase the
performance the fuzzy sets should be tuned by a suitable algorithm.
Fig. 7.7. To increase the degree of membership for the current input pattern, the original representation of the fuzzy set (centre) assumes the representation on the right; to decrease the degree of membership, it assumes the representation on the left
error measure [44, 40]. ε is a small positive number, e.g. ε = 0.01, that is used
to also train rules with a degree of fulfilment of 0 or 1 to a small extent. Thus
we compensate for the absence of adaptable consequent parameters in fuzzy
classifiers. This means that a fuzzy cluster that corresponds to a rule can be
moved even if it is located in an area of the input space with no data, or if it
exactly matches certain outliers.
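The following heuristic sketches the idea behind Fig. 7.7 of shifting a triangular fuzzy set towards or away from the current input; it is only an illustration of the principle under assumed names, not the exact update formulas of [44, 40].

def shift_triangle(a, b, c, x, sigma, increase=True):
    """Shift a triangular fuzzy set (a, b, c) towards the current input x to
    increase its degree of membership, or away from x to decrease it.
    sigma is a small learning rate (an assumption of this sketch)."""
    direction = 1.0 if increase else -1.0
    delta = direction * sigma * (x - b)     # signed shift towards/away from x
    return a + delta, b + delta, c + delta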
In order to further tune the structure of a fuzzy system, pruning strategies
can be applied. Pruning is well-known from neural networks [14, 53] and deci-
sion tree learning [60]. For fuzzy classifiers like NEFCLASS simple heuristics
exist that try to remove variables and rules from the rule base [40, 52].
• The rule learning algorithms in this section can be used as an example for
implementing a neuro-fuzzy system. They are kept deliberately simple in
order to facilitate learning of interpretable fuzzy systems by understanding
what is going on in the algorithm.
• Hyperbox-oriented fuzzy rule learning algorithms are similarly simple and
also easy to implement [2].
• Free implementations of neuro-fuzzy systems like NEFCLASS [11] or AN-
FIS [19] can be found on the Internet. Just enter these names into a search
engine.
• In order to obtain interpretable solutions fuzzy set learning must be con-
strained and some modifications to a fuzzy set after a learning step may
have to be “repaired” in order to control overlapping [16, 40].
Fig. 7.8. Initial membership functions (sm, lg) for the variables of the WBC data set
x2 appeared only once in a rule base during cross validation. The training
protocol reveals that it was not pruned from the final rule base, because the
error slightly increases and one more misclassification occurs if x2 is removed.
The mean error that was computed during cross validation is 5.86% (min-
imum: 2.86%, maximum: 11.43%, standard deviation: 2.95%). The 99% con-
fidence interval for the estimated error is computed as 5.86% ± 2.54%. This
provides an estimation for the error on unseen data processed by the final
classifier created from the whole data set.
On the training set with all 699 cases the rules of the final classifier cause 40
misclassifications (5.72%), i.e. 94.28% of the patterns are classified correctly.
There are 28 errors and 12 unclassified patterns which are not covered by one
of the two rules. The confusion matrix for this result is given in Table 7.1.
If one more misclassification can be tolerated, we can also delete variable x2
from both rules. In this case 41 patterns are misclassified (32 errors and 9
unclassified patterns).
Table 7.1. The confusion matrix of the final classifier obtained by NEFCLASS

                                    Predicted class
  Actual class   Malignant       Benign          Not classified   Sum
  malignant      215 (30.76%)    15 (2.15%)      11 (1.57%)       241 (34.48%)
  benign         13 (1.86%)      444 (63.52%)    1 (0.14%)        458 (65.52%)
  sum            228 (32.62%)    459 (65.67%)    12 (1.72%)       699 (100.00%)
The linguistic terms small and large for each variable are represented by
membership functions that can be well associated with the terms, even though
they intersect at a slightly higher membership degree than 0.5 (Fig. 7.9).
Fig. 7.9. The membership functions (sm, lg) for the three variables used by the final classifier: Uniformity of Cell Size, Uniformity of Cell Shape and Bare Nuclei
7.5 Conclusions
systems. Obviously, we could only touch the surface of fuzzy classifier design
and many other approaches exist. For a more comprehensive discussion of
fuzzy classifiers the book by Kuncheva [10] is recommended.
References
1. Hamid R. Berenji and Pratap Khedkar. Learning and tuning fuzzy logic con-
trollers through reinforcements. IEEE Trans. Neural Networks, 3:724–740, Sep-
tember 1992.
2. Michael Berthold. Mixed fuzzy rule formation. Int. J. Approximate Reasoning,
32:67–84, 2003.
3. Michael Berthold and David J. Hand, editors. Intelligent Data Analysis: An
Introduction. Springer-Verlag, Berlin, 1999.
4. Michael Berthold and Klaus-Peter Huber. Constructing fuzzy graphs from
examples. Int. J. Intelligent Data Analysis, 3(1), 1999. Electronic journal
(http://www.elsevier.com/locate/ida).
5. James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algo-
rithms. Plenum Press, New York, 1981.
6. James C. Bezdek, Eric Chen-Kuo Tsao, and Nikhil R. Pal. Fuzzy Kohonen
clustering networks. In Proc. IEEE Int. Conf. on Fuzzy Systems 1992, pages
1035–1043, San Diego, CA, 1992.
7. J.C. Bezdek, J.M. Keller, R. Krishnapuram, and N. Pal. Fuzzy Models and
Algorithms for Pattern Recognition and Image Processing. The Handbooks on
Fuzzy Sets. Kluwer Academic Publishers, Norwell, MA, 1998.
8. Christian Borgelt and Rudolf Kruse. Attributauswahlmaße für die Induktion von
Entscheidungsbäumen. In Gholamreza Nakhaeizadeh, editor, Data Mining. The-
oretische Aspekte und Anwendungen, number 27 in Beiträge zur Wirtschaftsin-
formatik, pages 77–98. Physica-Verlag, Heidelberg, 1998.
9. Xavier Boyen and Louis Wehenkel. Automatic induction of fuzzy decision trees
and its application to power system security assessment. Fuzzy Sets and Sys-
tems, 102(1):3–19, 1999.
10. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and
Regression Trees. Wadsworth International, 1984.
11. James J. Buckley and Yoichi Hayashi. Fuzzy neural networks: A survey. Fuzzy
Sets and Systems, 66:1–13, 1994.
12. James J. Buckley and Yoichi Hayashi. Neural networks for fuzzy systems. Fuzzy
Sets and Systems, 71:265–276, 1995.
13. A. Grauel, G. Klene, and L.A. Ludwig. Data analysis by fuzzy clustering meth-
ods. In A. Grauel, W. Becker, and F. Belli, editors, Fuzzy-Neuro-Systeme’97 –
Computational Intelligence. Proc. 4th Int. Workshop Fuzzy-Neuro-Systeme’97
(FNS’97) in Soest, Germany, Proceedings in Artificial Intelligence, pages 563–
572, Sankt Augustin, 1997. infix.
14. Simon Haykin. Neural Networks. A Comprehensive Foundation. Macmillan
College Publishing Company, New York, 1994.
15. C. Higgins and R. Goodman. Learning fuzzy rule-based neural networks for
control. Advances in Neural Information Processing Systems, 5:350–357, 1993.
16. Frank Höppner, Frank Klawonn, Rudolf Kruse, and Thomas Runkler. Fuzzy
Cluster Analysis. Wiley, Chichester, 1999.
53. R. Neuneier and H.-G. Zimmermann. How to train neural networks. In Tricks
of the Trade: How to Make Algorithms Really Work, LNCS State-of-the-Art-
Survey. Springer-Verlag, Berlin, 1998.
54. Andreas Nürnberger, Aljoscha Klose, and Rudolf Kruse. Discussing cluster
shapes of fuzzy classifiers. In Proc. 18th International Conf. of the North Amer-
ican Fuzzy Information Processing Society (NAFIPS99), pages 546–550, New
York, 1999. IEEE.
55. Andreas Nürnberger, Christian Borgelt, and Aljoscha Klose. Improving naïve
Bayes classifiers using neuro-fuzzy learning. In Proc. 6th International Con-
ference on Neural Information Processing – ICONIP’99, pages 154–159, Perth,
Australia, 1999.
56. Andreas Nürnberger, Aljoscha Klose, and Rudolf Kruse. Analysing borders
between partially contradicting fuzzy classification rules. In Proc. 19th Inter-
national Conf. of the North American Fuzzy Information Processing Society
(NAFIPS2000), pages 59–63, Atlanta, 2000.
57. Andreas Nürnberger, Detlef Nauck, and Rudolf Kruse. Neuro-fuzzy control
based on the NEFCON-model: Recent developments. Soft Computing, 2(4):168–
182, 1999.
58. Witold Pedrycz and H.C. Card. Linguistic interpretation of self-organizing
maps. In Proc. IEEE Int. Conf. on Fuzzy Systems 1992, pages 371–378, San
Diego, CA, 1992.
59. J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
60. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, San
Mateo, CA, 1993.
61. P.K. Simpson. Fuzzy min-max neural networks – part 1: Classification. IEEE
Trans. Neural Networks, 3:776–786, 1992.
62. P.K. Simpson. Fuzzy min-max neural networks – part 2: Clustering. IEEE
Trans. Fuzzy Systems, 1:32–45, February 1993.
63. M. Sugeno and T. Yasukawa. A fuzzy-logic-based approach to qualitative mod-
eling. IEEE Trans. Fuzzy Systems, 1:7–31, 1993.
64. Nadine Tschichold Gürman. RuleNet – A New Knowledge-Based Artificial
Neural Network Model with Application Examples in Robotics. PhD thesis,
ETH Zürich, 1996.
65. Petri Vuorimaa. Fuzzy self-organizing map. Fuzzy Sets and Systems, 66:223–
231, 1994.
66. Li-Xin Wang and Jerry M. Mendel. Generating rules by learning from examples.
In International Symposium on Intelligent Control, pages 263–268. IEEE Press,
1991.
67. Li-Xin Wang and Jerry M. Mendel. Generating fuzzy rules by learning from
examples. IEEE Trans. Syst., Man, Cybern., 22(6):1414–1427, 1992.
68. W.H. Wolberg and O.L. Mangasarian. Multisurface method of pattern separa-
tion for medical diagnosis applied to breast cytology. Proc. National Academy
of Sciences, 87:9193–9196, December 1990.
69. Yufei Yuan and Michael J. Shaw. Induction of fuzzy decision trees. Fuzzy Sets
and Systems, 69(2):125–139, 1995.
70. Lotfi A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
71. Lotfi A. Zadeh. Fuzzy logic and the calculi of fuzzy rules and fuzzy graphs: A
precis. Int. J. Multiple-Valued Logic, 1:1–38, 1996.
8
Hybrid Intelligent Systems:
Evolving Intelligence in Hierarchical Layers
A. Abraham
8.1 Introduction
In recent years, several adaptive hybrid soft computing [27] frameworks have
been developed for model expertise, decision support, image and video seg-
mentation techniques, process control, mechatronics, robotics and complicated
automation tasks. Many of these approaches use a combination of differ-
ent knowledge representation schemes, decision making models and learning
The antecedent of the fuzzy rule defines a local fuzzy region, while the
consequent describes the behavior within the region via various constituents.
The consequent constituent can be a membership function (Mamdani model)
or a linear equation (first order Takagi-Sugeno model) [22].
Adaptation of fuzzy inference systems using evolutionary computation
techniques has been widely explored [2, 4, 7, 8, 18, 20]. The automatic
adaptation of membership functions is popularly known as self-tuning. The
genome encodes parameters of trapezoidal, triangle, logistic, hyperbolic-
tangent, Gaussian membership functions and so on.
The evolutionary search of fuzzy rules can be carried out using three ap-
proaches [10]. In the first (Michigan approach), the fuzzy knowledge base is
adapted as a result of the antagonistic roles of competition and cooperation
of fuzzy rules. Each genotype represents a single fuzzy rule and the entire
population represents a solution. The second method (Pittsburgh approach)
evolves a population of knowledge bases rather than individual fuzzy rules.
Genetic operators serve to provide a new combination of rules and new rules.
The disadvantage is the increased complexity of the search space and the addi-
tional computational burden, especially for online learning. The third method
(iterative rule learning approach) is similar to the first, with each chromosome
representing a single rule, but contrary to the Michigan approach, only the
best individual is considered to form part of the solution, the remaining chro-
mosomes in the population are discarded. The evolutionary learning process
builds up the complete rule base through an iterative learning process.
In a neuro-fuzzy model [4], there is no guarantee that the neural network
learning algorithm will converge and that the tuning of the fuzzy inference
system (determining the optimal parameter values of the membership
functions, fuzzy operators and so on) will be successful. A distinct feature of evolutionary fuzzy
We present the Evolving Neuro Fuzzy (EvoNF) model which optimizes the
fuzzy inference system using a meta-heuristic approach combining neural net-
work learning and evolutionary computation. The proposed technique could
be considered as a methodology to integrate neural network learning, fuzzy
inference systems and evolutionary search procedures [2, 8].
The evolutionary searches of the membership functions, the rule base and the
fuzzy operators progress on different time scales to adapt the fuzzy inference system
according to the problem environment. Figure 8.2 illustrates the general in-
teraction mechanism with the evolutionary search of a fuzzy inference system
(Mamdani, Takagi-Sugeno etc.) evolving at the highest level on the slowest
time scale. For each evolutionary search of fuzzy operators (for example, best
combination of T-norm, T-conorm and defuzzification strategy), the search
for the fuzzy rule base progresses at a faster time scale in an environment de-
cided by the fuzzy inference system and the problem. In a similar manner, the
evolutionary search of membership functions proceeds at a faster time scale
(for every rule base) in the environment decided by the fuzzy inference system,
fuzzy operators and the problem. Thus, the fuzzy inference system evolves
on the slowest time scale while the quantity and type of the membership
functions evolve at the fastest rate. The function
of the other layers could be derived similarly. The hierarchy of the different
adaptation layers (procedures) relies on prior knowledge. For example, if there
is more prior knowledge about the knowledge base (if-then rules) than the in-
ference mechanism then it is better to implement the knowledge base at a
higher level. If a particular fuzzy inference system best suits the problem, the
computational task could be reduced by minimizing the search space. The
chromosome architecture is depicted in Fig. 8.3.
The architecture and the evolving mechanism could be considered as a
general framework for adaptive fuzzy systems, that is a fuzzy model that can
change membership functions (quantity and shape), rule base (architecture),
fuzzy operators and learning parameters according to different environments
without human intervention.
Referring to Fig. 8.3 each layer (from fastest to slowest) of the hierarchical
evolutionary search process has to be represented in a chromosome for suc-
cessful modelling of EvoNF. The detailed functioning and modelling process
is as follows.
Layer 1: The simplest way is to encode the number of membership func-
tions per input variable and the parameters of the membership functions.
Figure 8.4 depicts the chromosome representation of n bell membership func-
tions specified by its parameters p, q and r. The optimal parameters of the
membership functions located by the evolutionary algorithm will be further
fine-tuned by the neural network learning algorithm. A similar strategy could
be used for the output membership functions in the case of a Mamdani fuzzy
inference system. Experts may be consulted to estimate the MF shape-forming
parameters and thereby to constrain the search space of the MF parameters.
In our experiments the angular coding method proposed by Cordón et al. [10]
was used to represent the rule consequent parameters of the Takagi-Sugeno
inference system.
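A minimal sketch of such a Layer-1 chromosome for bell membership functions is shown below; the parameter ranges, the assignment of p, q, r to width, slope and centre, and the use of normalised inputs in [0, 1] are assumptions of the sketch.

import numpy as np

def init_layer1_chromosome(n_inputs, n_mfs, rng=None):
    """Illustrative Layer-1 chromosome: every input variable gets n_mfs bell
    membership functions, each described by the three parameters p, q, r
    mentioned in the text."""
    rng = rng or np.random.default_rng()
    p = rng.uniform(0.05, 0.5, size=(n_inputs, n_mfs))   # width (assumed role)
    q = rng.uniform(1.0, 3.0, size=(n_inputs, n_mfs))    # slope (assumed role)
    r = rng.uniform(0.0, 1.0, size=(n_inputs, n_mfs))    # centre (assumed role)
    return np.stack([p, q, r], axis=-1).ravel()          # flat real-coded genome

def bell_mf(x, p, q, r):
    """Generalised bell membership function 1 / (1 + |(x - r) / p|^(2q))."""
    return 1.0 / (1.0 + np.abs((x - r) / p) ** (2.0 * q))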
Layer 2. This layer is responsible for the optimization of the rule base.
This includes deciding the total number of rules, representation of the an-
tecedent and consequent parts. Depending on the representation used (Michigan,
Pittsburgh, iterative learning and so on), the number of rules grows rapidly
with an increasing number of variables and fuzzy sets. The simplest way is
that each gene represents one rule, and “1” stands for a selected and “0” for
a non-selected rule. Figure 8.5 displays such a chromosome structure repre-
sentation. To represent a single rule a position dependent code with as many
elements as the number of variables of the system is used. Each element is
a binary string with a bit per fuzzy set in the fuzzy partition of the vari-
able, meaning the absence or presence of the corresponding linguistic label
in the rule. For a three input and one output variable, with fuzzy partitions
composed of 3, 2, 2 fuzzy sets for input variables and 3 fuzzy sets for output
variable, the fuzzy rule will have a representation as shown in Fig. 8.6.
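The position-dependent code can be illustrated with the following snippet; the concrete linguistic labels chosen for the example rule are assumptions, not those of Fig. 8.6.

# Layer-2 encoding for the example in the text: three inputs with fuzzy
# partitions of 3, 2 and 2 sets and one output partition of 3 sets.
# Each element is a bit string with one bit per fuzzy set; a set bit means
# the corresponding linguistic label appears in the rule.
partitions = [3, 2, 2, 3]                      # x1, x2, x3, output y

# "if x1 is A2 and x2 is B1 and x3 is C2 then y is D3" (labels assumed)
rule = [[0, 1, 0], [1, 0], [0, 1], [0, 0, 1]]

# A rule base of m rules additionally carries one selection gene per rule:
# "1" = rule selected, "0" = rule not selected (cf. Fig. 8.5).
rule_base = [(1, rule)]                        # list of (selection bit, rule)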
Layer 3. In this layer, a chromosome represents the different parameters of
the T-norm and T-conorm operators. Real number representation is adequate
Fig. 8.5. Representation of the entire rule base consisting of m fuzzy rules
In this section, we will examine the application of the proposed EvoNF model
to approximate the export behavior of multi-national subsidiaries. Several
specific subsidiary features identified in international business literature are
particularly relevant when seeking to explain Multi-National Company (MNC)
subsidiary export behavior. Our purpose is to model the complex export pat-
tern behavior using a Takagi-Sugeno fuzzy inference system in order to de-
termine the actual volume of Multinational Corporation Subsidiaries (MCS)
export output (sales exported) [11]. Malaysia has been pursuing an economic
strategy of export-led industrialization. To facilitate this strategy, foreign in-
vestment is courted through the creation of attractive incentive packages.
These primarily entail taxation allowances and more liberal ownership rights
for investments. The quest to attract foreign direct investment (FDI) has
proved to be highly successful. The bulk of investment has gone into export-
oriented manufacturing industries. For simulations we have used data provided
Table 8.1. Parameter settings used for the EvoNF experiments

  Population size                 40
  Maximum no. of generations      35
  FIS                             Takagi-Sugeno
  Rule antecedent MF              2 MFs (parameterised Gaussian) per input
  Rule consequent parameters      linear parameters
  Gradient descent learning       10 epochs
  Rank-based selection            0.50
  Elitism                         5%
  Starting mutation rate          0.50
from a survey of 69 Malaysian MCS. Each corporation subsidiary data set was
represented by product manufactured, resources, tax protection, involvement
strategy, financial independence and supplier relationship.
We used the popular grid partitioning method to generate the initial rule
base [24]. This partitioning strategy works well when only a small number of
inputs is involved, since it requires only a small number of MFs for each input.
We used 90% of the data for training and the remaining 10% for testing and
validation purposes. The initial populations were randomly created based on
the parameters shown in Table 8.1. We used an adaptive mutation operator,
which decreases the mutation rate as the algorithm greedily proceeds in the
search space [3]. The parameters mentioned in Table 8.1 were decided after a
few trial-and-error runs. Experiments were repeated 3 times and the
average performance measures are reported. Figure 8.7 illustrates the meta-
learning approach for training and test data combining evolutionary learning
and gradient descent technique during the 35 generations.
The 35 generations of the meta-learning approach created 76 Takagi-Sugeno-type
fuzzy if-then rules, compared to 128 rules using the conventional
grid-partitioning method. We also used a feed-forward neural network with
Fig. 8.7. Convergence of the training and test RMSE during the 35 generations of evolutionary learning
Table 8.2. Training and test performance of the different intelligent paradigms for the export output

  Paradigm         RMSE (train)   RMSE (test)   CC
  EvoNF            0.0013         0.012         0.989
  Neural Network   0.0107         0.1261        0.946
12 hidden neurons (single hidden layer) to model the export output for the
given input variables. The learning rate and momentum were set at 0.05 and
0.2 respectively and the network was trained for 10,000 epochs using BP.
The network parameters were decided after a trial and error approach. The
obtained training and test results are depicted in Table 8.2 (RMSE = Root
Mean Squared Error, CC = correlation coefficient).
Our analysis on the export behavior of Malaysia’s MCS reveals that the
developed EvoNF model could learn the chaotic patterns and model the be-
havior using an optimized Takagi Sugeno FIS. As illustrated in Fig. 8.8 and
Table 8.2, EvoNF could approximate the export behavior within the toler-
ance limits. When compared to a direct neural network approach, EvoNF
performed better in terms of both lower RMSE and higher correlation coefficient.
Fig. 8.8. Test results showing the export output (scaled values) for 13 MNCs with respect to the desired values
One of the widely used clustering methods is the fuzzy c-means (FCM) al-
gorithm developed by Bezdek [9]. FCM partitions a collection of n vectors
xi , i = 1, 2 . . . , n into c fuzzy groups and finds a cluster center in each group
such that a cost function of dissimilarity measure is minimized. To accom-
modate the introduction of fuzzy partitioning, the membership matrix U is
allowed to have elements with values between 0 and 1. The FCM objective
function takes the form
  J(U, c1, . . . , cc) = Σ_{i=1}^{c} Ji = Σ_{i=1}^{c} Σ_{j=1}^{n} u_{ij}^m d_{ij}^2
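One FCM iteration under this objective can be sketched as follows, assuming Euclidean distances d_ij between cluster centre c_i and sample x_j and the standard alternating update equations (which are not spelled out in the text above).

import numpy as np

def fcm_step(X, centers, m=2.0):
    """One fuzzy c-means iteration. X: (n, d) data, centers: (c, d) cluster
    centres. Returns the updated membership matrix U of shape (c, n) and the
    updated centres."""
    # distances d_ij between centre i and sample j
    d = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2) + 1e-12
    # membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
    U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)
    # centre update: c_i = sum_j u_ij^m x_j / sum_j u_ij^m
    Um = U ** m
    centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, centers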
data are fed to a Takagi-Sugeno fuzzy inference system to analyze the Web
server access trend patterns. The if-then rule structures are learned using an
iterative learning procedure [10] by an evolutionary algorithm and the rule
parameters are fine-tuned using gradient decent algorithm. The hierarchical
distribution of i-Miner is depicted in Fig. 8.10. The arrow direction indicates
the hierarchy of the evolutionary search. In simple words, the optimization
of the clustering algorithm progresses at a faster time scale at the lowest level in
an environment decided by the fuzzy inference system and the problem environment.
  E = Σ_{k=1}^{N} (dk − xk)^2   (8.1)
where dk is the kth component of the rth desired output vector and xk is
the kth component of the actual output vector by presenting the rth input
vector to the network. The gradients of the rule parameters to be optimized,
namely ∂E/∂Pn for the consequent parameters Pn of all rules Rn, and ∂E/∂σi and
∂E/∂ci for the premise parameters of all fuzzy sets Fi (σ and c represent the width
and center of a Gaussian MF), are to be computed. As far as rule parameter
learning is concerned, the key difference between i-Miner and the EvoNF
approach is in the way the consequent parameters were determined.
Once the three layers are represented in a chromosome structure C, then
the learning procedure could be initiated as defined in Sect. 8.2.
The hybrid framework described in Sect. 8.3 was used for Web usage mining
[1]. The statistical/text data were generated by the log file analyzer from 1 January
2002 to 7 July 2002. Selecting useful data is an important task in the data pre-
processing block. After some preliminary analysis, we selected the statistical
data comprising domain byte requests, hourly page requests and daily page
requests as the focus of the cluster models for finding Web users’ usage patterns.
It is also important to remove irrelevant and noisy data in order to build
a precise model. We also included an additional input “index number” to
distinguish the time sequence of the data. The most recently accessed data
were indexed higher while the least recently accessed data were placed at
the bottom. Besides the inputs “volume of requests” and “ volume of pages
(bytes)” and “index number”, we also used the “cluster information” provided
by the clustering algorithm as an additional input variable. The data was re-
indexed based on the cluster information. Our task is to predict the Web traffic
volume on an hourly and daily basis. We used the data from 17 February 2002
to 30 June 2002 for training and the data from 1 July 2002 to 6 July 2002 for
testing and validation purposes.
The performance is compared with self-organizing maps (alternative for
FCM) and several function approximation techniques like neural networks,
linear genetic programming and Takagi-Sugeno fuzzy inference system (to
predict the trends). The results are graphically illustrated and the practical
significance is discussed in detail.
The initial populations were randomly created based on the parameters
shown in Table 8.3. Choosing good reproduction operator values is often a
challenging task. We used a special mutation operator, which decreases the
mutation rate as the algorithm greedily proceeds in the search space [3]. If the
allelic value xi of the ith gene ranges over the domain [ai, bi], the mutated
gene x′i is drawn from this interval according to
  x′i = xi + ∆(t, bi − xi) if ω = 0,  and  x′i = xi − ∆(t, xi − ai) if ω = 1   (8.2)
where ω represents an unbiased coin flip with p(ω = 0) = p(ω = 1) = 0.5, γ is a
random number drawn uniformly from [0, 1], and

  ∆(t, x) = x (1 − γ^((1 − t/tmax)^b))   (8.3)
Table 8.3. Parameter settings used for the i-Miner experiments

  Population size                          30
  Maximum no. of generations               35
  Fuzzy inference system                   Takagi-Sugeno
  Rule antecedent membership functions     3 MFs per input variable (parameterised Gaussian)
  Rule consequent parameters               linear parameters
  Gradient descent learning                10 epochs
  Rank-based selection                     0.50
  Elitism                                  5%
  Starting mutation rate                   0.50
and t is the current generation and tmax is the maximum number of genera-
tions. The function ∆ computes a value in the range [0, x] such that the prob-
ability of returning a number close to zero increases as the algorithm proceeds
with the search. The parameter b determines the impact of time on the prob-
ability distribution ∆ over [0, x]. Large values of b decrease the likelihood of
large mutations in a small number of generations. The parameters mentioned
in Table 8.3 were decided after a few trial and error approaches (basically by
monitoring the algorithm convergence and the output error measures). Ex-
periments were repeated 3 times and the average performance measures are
reported. Figures 8.12 and 8.13 illustrate the meta-learning approach combining
evolutionary learning and the gradient descent technique during the 35
generations.
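A sketch of the adaptive (non-uniform) mutation of (8.2)-(8.3) is given below; the parameter names and the assumption that γ is drawn uniformly from [0, 1] on every call are made explicit in the code.

import random

def nonuniform_mutation(x, a, b, t, t_max, b_exp=5.0):
    """Adaptive mutation in the sense of (8.2)-(8.3): the step size shrinks as
    the generation counter t approaches t_max. b_exp plays the role of the
    parameter b in (8.3)."""
    def delta(step_t, span):
        gamma = random.random()               # gamma ~ U[0, 1] (assumption)
        return span * (1.0 - gamma ** ((1.0 - step_t / t_max) ** b_exp))

    if random.random() < 0.5:                 # omega = 0
        return x + delta(t, b - x)
    return x - delta(t, x - a)                # omega = 1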
Fig. 8.12. Training RMSE of the one-day-ahead and average hourly trend predictions during the 35 generations of evolutionary learning
Table 8.4 summarizes the performance of the developed i-Miner for train-
ing and test data. i-Miner trend prediction performance is compared with
ANFIS [13], Artificial Neural Network (ANN) and Linear Genetic Program-
ming (LGP). The Correlation Coefficient (CC) for the test data set is also
Fig. 8.13. Test RMSE of the one-day-ahead and average hourly trend predictions during the 35 generations of evolutionary learning
Table 8.4. Training and test performance of i-Miner and the compared methods

                 Daily (1 day ahead)              Hourly (1 hour ahead)
  Method         RMSE train   RMSE test   CC      RMSE train   RMSE test   CC
  i-Miner        0.0044       0.0053      0.9967  0.0012       0.0041      0.9981
  ANFIS          0.0176       0.0402      0.9953  0.0433       0.0433      0.9841
  ANN            0.0345       0.0481      0.9292  0.0546       0.0639      0.9493
  LGP            0.0543       0.0749      0.9315  0.0654       0.0516      0.9446
Fig. 8.14. Test results of the daily requests (volume of requests in thousands vs. day of the week) for i-Miner, FIS, ANN and LGP compared with the actual volume of requests

Fig. 8.15. Test results of the average hourly trends for 6 days (Web traffic trends vs. hour of the day) for FIS, ANN and LGP
Fig. 8.18 depicts the volume of visitors according to domain names from a
cluster developed using the evolutionary FCM approach. Detailed discussion
on the knowledge discovered from the data clusters is beyond the scope of this
chapter.
• Most of the hints and tips given for the design of EvoNF are applicable to
i-Miner.
• Data pre-processing is an important issue for optimal performance. In most
cases, normalization or scaling would suffice.
Fig. 8.16. Hourly Web data and the developed cluster centres (volume of requests vs. hour of the day)

Fig. 8.17. Daily Web data and the developed cluster centres (volume of requests vs. day of the week)
Fig. 8.18. Hourly visitor information according to the domain names from an FCM
cluster
8.4 Conclusions
This chapter has presented some of the architectures of hybrid intelligent
systems involving fuzzy clustering algorithms, neural network learning, fuzzy
inference systems and evolutionary computation. The key idea was to demon-
strate the evolution of intelligence in hierarchical layers. The developed hybrid
intelligent systems were applied to two real world applications illustrating the
importance of such complicated approaches. For the two applications consid-
ered, the hybrid models performed better than the individual approaches. We
were able to improve the performance (low RMSE and high CC) and at the
same time we were able to substantially reduce the number of rules. Hence
these approaches might be extremely useful for hardware implementations.
Hybrid intelligent systems have many important practical applications
in science, technology, business and commerce. Compared to the individual
intelligent constituents hybrid intelligent frameworks are relatively young. As
the strengths and weakness of different hybrid architectures are understood,
it will be possible to use them more efficiently to solve real world problems.
Integration issues range from different techniques and theories of computa-
tion to problems of exactly how best to implement hybrid systems. Like most
biological systems which can adapt to any environment, adaptable intelligent
systems are required to tackle future complex problems involving huge data
volume. Most of the existing hybrid soft computing frameworks rely on several
user-specified network parameters. For the system to be fully adaptable,
performance should not be heavily dependent on user-specified parameters.
The real success in modelling the proposed hybrid architectures will di-
rectly depend on the genotype representation of the different layers. The
population-based collective learning process, self-adaptation, and robustness
are some of the key features. Evolutionary algorithms require considerable
computational effort, especially for problems involving complexity and huge
data volume. Fortunately, evolutionary algorithms work with a population of
References
1. Abraham A., Business Intelligence from Web Usage Mining, Journal of Infor-
mation and Knowledge Management (JIKM), World Scientific Publishing Co.,
Singapore, Volume 2, No. 4, pp. 1-15, 2003.
2. Abraham A., EvoNF: A Framework for Optimization of Fuzzy Inference Systems
Using Neural Network Learning and Evolutionary Computation, 2002 IEEE
International Symposium on Intelligent Control (ISIC’02), Canada, IEEE Press,
pp. 327-332, 2002.
3. Abraham A., i-Miner: A Web Usage Mining Framework Using Hierarchical In-
telligent Systems, The IEEE International Conference on Fuzzy Systems FUZZ-
IEEE’03, IEEE Press, pp. 1129-1134, 2003.
4. Abraham A., Neuro-Fuzzy Systems: State-of-the-Art Modeling Techniques,
Connectionist Models of Neurons, Learning Processes, and Artificial Intelli-
gence, LNCS 2084, Mira J. and Prieto A. (Eds.), Springer-Verlag Germany,
pp. 269-276, 2001.
5. Abraham A., Intelligent Systems: Architectures and Perspectives, Recent Ad-
vances in Intelligent Paradigms and Applications, Abraham A., Jain L. and
Kacprzyk J. (Eds.), Studies in Fuzziness and Soft Computing, Springer Verlag
Germany, Chap. 1, pp. 1-35, 2002.
6. Abraham A., Meta-Learning Evolutionary Artificial Neural Networks, Neuro-
computing Journal, Elsevier Science, Netherlands, Vol. 56c, pp. 1-38, 2004.
7. Abraham A. and Nath B., Evolutionary Design of Fuzzy Control Systems – An
Hybrid Approach, The Sixth International Conference on Control, Automation,
Robotics and Vision, (ICARCV 2000), CD-ROM Proceeding, Wang J.L. (Ed.),
ISBN 9810434456, Singapore, 2000.
8. Abraham A. and Nath B., Evolutionary Design of Neuro-Fuzzy Systems – A
Generic Framework, In Proceedings of The 4-th Japan-Australia Joint Workshop
on Intelligent and Evolutionary Systems, Namatame A. et al (Eds.), Japan,
pp. 106-113, 2000.
9. Bezdek J.C., Pattern Recognition with Fuzzy Objective Function Algorithms,
New York: Plenum Press, 1981.
10. Cordón O., Herrera F., Hoffmann F., and Magdalena L., Genetic Fuzzy Systems:
Evolutionary Tuning and Learning of Fuzzy Knowledge Bases, World Scientific
Publishing Company, Singapore, 2001.
11. Edwards R., Abraham A. and Petrovic-Lazarevic S., Export Behaviour Mod-
eling Using EvoNF Approach, The International Conference on Computational
Science (ICCS 2003), Springer Verlag, Lecture Notes in Computer Science- Vol-
ume 2660, Sloot P.M.A. et al (Eds.), pp. 169-178, 2003.
12. Hall L.O., Ozyurt I.B., and Bezdek J.C., Clustering with a Genetically Op-
timized Approach, IEEE Transactions on Evolutionary Computation, Vol. 3,
No. 2, pp. 103-112, 1999.
13. Jang J.S.R., ANFIS: Adaptive-Network-Based Fuzzy Inference System, IEEE
Transactions on Systems, Man, and Cybernetics, Vol. 23, No. 3, pp. 665-685,
1993.
9.1 Introduction
Many real-world problems, such as biological data processing, are continuously
changing non-linear processes that require fast, adapting non-linear systems
capable of following the process dynamics and discovering the rules of these
changes. Since the 80’s, many Artificial Neural Network (ANN) models such
as the popular multilayer perceptrons (MLP) and radial basis network (RBF)
have been proposed to capture the process non-linearity that often fail clas-
sical linear systems. However, most of these models are designed for batch
learning and black-box operation. Only few are capable of performing on-line,
incremental learning and/or knowledge discovery along with self-optimising
their parameters as they learn incrementally.
Aiming to tackle these weaknesses of common ANN models, a new class of
on-line self-evolving networks called Evolving Connectionist Systems (ECOS)
was proposed [17, 18, 21, 34]. ECOS are capable of performing the follow-
ing functions: adaptive learning, incremental learning, lifelong learning, on-
line learning, constructivist structural learning that is supported by biological
facts, selectivist structural learning and knowledge-based learning. Incremen-
tal learning is supported by the creation and the modification of the number
and the position of neurons and their connection weights in a local problem
space as new data comes. Fuzzy rules can be extracted for knowledge discovery.
Derivatives of ECOS, including Evolving Fuzzy Neural Network (EFuNN) [18],
Evolving Classification Function (ECF), Evolving Clustering Method for Clas-
sification (ECMC) [17], Dynamic Evolving Neural-Fuzzy Inference System
(DENFIS) [21] and ZISC [34] have been applied to speech and image recogni-
tion, brain signal analysis and modeling dynamic time-series prediction, gene
expression clustering and gene regulatory network inference [17, 19].
Evolutionary computation (EC) comprises robust, global optimisation methods
[3, 4, 12, 13, 23, 36] that have been widely applied to various neuro-fuzzy
systems for prediction and control [8, 9, 10, 11, 16, 24, 27, 29, 33, 38, 39].
While ECOS models were designed to be self-evolving systems, it is our goal
to extend this capability through applying EC to perform various on-line
and off-line optimisation of the control parameters. In this work, such ex-
ample applications are illustrated. EC, in the form of Genetic Algorithms
(GA) and Evolutionary Strategies (ES), are applied to the ECOS models of
EFuNN, ECF and ECMC. All methods are illustrated on benchmark examples
of Mackey-Glass and Iris data [6] to demonstrate the performance enhance-
ment with EC.
The paper is organised as follows. Section 9.2 presents the main principles
of ECOS and some of their models – EFuNN, DENFIS, ECF, ECMC, while
Sect. 9.3 presents the principle of Evolutionary Computation (EC) and in par-
ticular, Evolutionary Strategies (ES) and Genetic Algorithm (GA), which are
the two forms of EC employed in this work. Sections 9.4 and 9.5 present the
application of ES for on-line and off-line parameter optimisation of EFuNN [7],
while Sect. 9.6 presents the application of GA to off-line parameter optimi-
sation of ECF [20, 22]. Section 9.7 applies GA to off-line optimisation of
the normalisation ranges of input variables and to feature weighting for the
EFuNN and ECMC models [37]. Conclusions and outlook for future research
are discussed in Sect. 9.8.
An ECOS is a neural network that operates continuously in time and adapts its struc-
ture and functionality through a continuous interaction with the environment
and with other systems according to: (i) a set of parameters that are subject
to change during the system operation; (ii) an incoming continuous flow of
information with unknown distribution; (iii) a goal (rationale) criteria (also
subject to modification) that is applied to optimise the performance of the
system over time. ECOS have the following characteristics [17, 18, 21, 34]:
1. They evolve in an open space, not necessarily of fixed dimensions.
2. They are capable of on-line, incremental, fast learning – possibly through
one pass of data propagation.
3. They operate in a life-long learning mode.
4. They learn as both individual systems, and as part of an evolutionary
population of such systems.
5. They have evolving structures and use constructive learning.
6. They learn locally via local partitioning of the problem space, thus allowing
for fast adaptation and process tracing over time.
7. They facilitate different kind of knowledge representation and extraction,
mostly – memory-based, statistical and symbolic knowledge.
In this work, we study the application of Evolutionary Computation (EC)
to the self-optimisation of the ECOS models of EFuNN, ECF and ECMC.
These three ECOS models are briefly described in the following sections.
Evolving Fuzzy Neural Networks (EFuNNs) are ECOS models that evolve
their nodes (neurons) and connections through supervised incremental learn-
ing from input-output data pairs. A simple version of EFuNN is shown in
Fig. 9.1 [18]. It has a five-layer structure. The first layer contains the input
nodes that represent the input variables; the second layer contains the input
fuzzy membership nodes that represent the membership degrees of the input
values to each of the defined membership functions; the third layer contains
the rule nodes that represent cluster centers of samples in the problem space
and their associated output function; the fourth layer contains output fuzzy
membership nodes that represent the membership degrees to which the out-
put values belong to defined membership functions and finally, the fifth layer
contains the output nodes that represent output variables.
EFuNN learns local models from data through clustering of the data into
rule nodes and associating a local output function for each cluster. Rule nodes
evolve from the input data stream to cluster the data. The first layer connec-
tion weights W 1(rj ), j = [1, 2, . . . , no. rule nodes] represent the co-ordinates
of the nodes rj in the input space and the second layer connection weights
W 2(rj ) represents the corresponding local models (output functions). Clus-
ters of data are created based on similarity between data samples and ex-
isting rule nodes in both the input fuzzy space and the output fuzzy space
with the following methods. Samples that have a distance, measured in local
normalised fuzzy distance, to an existing cluster center (rule node) rj of less
than a threshold Rmax in the input space and Emax in the output space are
allocated to the same cluster and are used to update cluster centre and radius.
Let din and dout denote the fuzzified input variable and output variable re-
spectively, N exj the number of data associated with rj , l1 and l2 the learning
rate for W 1(.) and W 2(.) respectively. The centres are adjusted according to
the following error correction equations:
  W1(rj)^(t+1) = W1(rj)^(t) + (l1 / (Nexj + 1)) (din − W1(rj)^(t))   (9.1)

  W2(rj)^(t+1) = W2(rj)^(t) + (l2 / (Nexj + 1)) (dout − W2(rj)^(t))   (9.2)
The learning rates l1 and l2 control how fast the connection weights adjust
to the dynamics of the data. Samples that do not fit into existing clusters
form new clusters as they arrive in time. This procedures form the kernel of
EFuNN’s incremental learning: that cluster centres are continuously adjusted
according to the predefined learning rates according to new data samples, and
new clusters are created incrementally.
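The error-correction updates (9.1) and (9.2) can be sketched as follows; the array-based representation of the fuzzified sample and the default learning rates are assumptions made for the illustration.

import numpy as np

def efunn_update(W1_j, W2_j, n_ex_j, d_in, d_out, l1=0.1, l2=0.1):
    """Update the input and output weights of one rule node according to
    (9.1) and (9.2). W1_j, W2_j: current centre of rule node r_j in the fuzzy
    input and output space (numpy arrays); n_ex_j: number of examples already
    associated with r_j; d_in, d_out: the fuzzified new sample."""
    W1_j = W1_j + l1 / (n_ex_j + 1) * (d_in - W1_j)
    W2_j = W2_j + l2 / (n_ex_j + 1) * (d_out - W2_j)
    return W1_j, W2_j, n_ex_j + 1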
During learning, EFuNN creates a local output function for each cluster
that is represented as the W 2(·) connection weights. Each cluster thus rep-
resents a local model that can be described by a local fuzzy rule with an
antecedent – the cluster area, and a consequent – the output function applied
to data in this cluster, for example:
if x1 is High (0.7) and x2 is Low (0.9) then
y is High (0.8) (* radius of the input cluster 0.3, number of examples in
the cluster 13 *)
end if
where High and Low are fuzzy membership functions defined on the range of
the variables for x1 , x2 , and y. The number and the type of the membership
functions can either be deduced from the data through learning algorithms,
or can be predefined based on human knowledge.
EFuNN can be trained either in incremental (on-line) mode, in which data
arrive one at a time and local learning is performed upon the arrival of each
datum, or in batch (off-line) mode, in which the whole data set is available
for global learning. The former follows the local learning procedures described
above, whereas the latter requires clustering algorithms like k-means clustering
and/or the Expectation-Maximisation algorithm.
In this work, EC is applied to EFuNN in three main areas: first, for opti-
mising the learning rates l1 and l2 in the incremental learning mode (Sect. 9.4);
second, for optimising the fuzzy MFs in the batch learning mode (Sect. 9.5)
and last, for performing feature weighting and feature selection (Sect. 9.7).
1. If all training data vectors have been entered into the system, complete the
learning phase; otherwise, enter a new input vector from the data set.
2. Find all existing rule nodes with the same class as the input vector’s class.
3. If there is no such rule node, create a new rule node and go back to step
(1). The position of the new rule node is the same as the current input
vector in the input space, and the radius of its influence field is set to the
minimum radius Rmin .
4. For those rule nodes that are in the same class as the input vector’s class,
if the input vector lies within node’s influence field, update both the node
position and influence field. Suppose that the field has a radius of R(j) and
the distance between the rule node and the input vector is d; the increased
radius is R(j),new = (R(j) + d)/2 and the rule node moves to a new position
situated on the line connecting the input vector and the rule node.
5. For those rule nodes that are in the same class as the input vector’s class,
if the input vector lies outside of the node’s influence, increase this field if
possible. If the new field does not include any input vectors from the data
set that belong to a different class, the increment is successful, the rule
node changes its position and the field increases. Otherwise, the increment
fails and both rule node and its field do not change. If the field increment is
successful or the input vector lies within an associated field, go to step (1);
otherwise, a new node is created according to step (3) and go to step (1).
The learning procedure takes only one iteration (epoch) to retain all input
vectors at their positions in the input/output space. The classification of new
input vectors is performed in the following way:
1. The new input vector is entered and the distance between it and all rule
nodes is calculated. If the new input vector lies within the field of one or
more rule nodes associated with one class, the vector belongs to this class.
2. If the input vector does not lie within any field, the vector will belong
to the class associated with the closest rule node. However, if no rules
are activated, ECF calculates the distance between the new vector and
the M closest rule nodes out of a total of N rule nodes. The average
distance is calculated between the new vector and the rule nodes of each
class. The vector is assigned the class corresponding to the smallest average
distance. This averaging procedure is the so-called M -of-N prediction or
classification.
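A simplified sketch of this classification procedure, including the M-of-N fallback, is given below; the Euclidean distance, the node representation and the tie handling are assumptions made for the illustration, not the exact ECF implementation.

import numpy as np

def ecf_classify(x, nodes, M=3):
    """Classify x with a list of (position, radius, class_label) rule nodes.
    M is the M-of-N parameter used when no influence field is activated."""
    d = np.array([np.linalg.norm(np.asarray(x) - np.asarray(pos))
                  for pos, _, _ in nodes])
    inside = [i for i, (_, r, _) in enumerate(nodes) if d[i] <= r]
    if inside:                                   # within one or more fields
        return nodes[inside[int(np.argmin(d[inside]))]][2]
    # otherwise: average distance to the M closest rule nodes of each class
    closest = np.argsort(d)[:M]
    avg = {}
    for i in closest:
        avg.setdefault(nodes[i][2], []).append(d[i])
    return min(avg, key=lambda c: np.mean(avg[c]))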
In this work EC is applied to ECF for optimising the main control para-
meters of ECF, which include Rmin , Rmax , M -of-N and nM F (the number
of fuzzy membership functions for fuzzifying input data) (Sect. 9.6) and to
ECMC for feature weighting (Sect. 9.7).
In this work, Evolutionary Computation (EC) [4, 12, 13] is applied to im-
plement self-optimisation of ECOS. EC represents a general class of global
optimisation algorithms that imitate the evolutionary process explained by
Darwinian Natural Selection. Its models include Evolutionary Strategies (ES)
proposed by Schwefel et al. in the mid 1960s, Genetic Algorithms (GA) pro-
posed by Holland in 1975, and Genetic Programming (GP) by Koza in 1992 [4].
EC searches with multiple search points and requires only the function evalu-
ation of each point. It is therefore robust and convenient for optimising com-
plex problems that lack derivative information and contain multi-modality,
which are the features that defeat classical, derivative-based search methods.
However, the large number of function evaluations causes EC to be compu-
tationally intensive. Thus although the earliest form of EC has existed since
the mid 1960s, EC received research attention only after the availability of
high-speed computers in the last two decades.
[Fig. 9.2: the general EC cycle – population initialisation, then repeated
reproduction (1. recombination, 2. mutation) of the parent population,
followed by selection of a new generation of parents]
Here we present a brief description of Genetic Algorithm (GA) and Evolu-
tionary Strategies (ES), which are the two forms of EC employed in this work.
Both GA and ES adopt the same algorithmic structure shown in Fig. 9.2 that
resembles the natural evolutionary process, searching with a set of solutions
called the population. In canonical GA, each solution (called an individual)
is expressed in a binary code sequence called string or chromosome and each
bit of the string is called a gene, following the biological counterparts. For
real number solutions, each individual can be encoded through the standard
binary code or Gray’s code. In ES, each individual is the real number solution
itself. No coding is required. Each iteration, called a generation, consists of
three phases: reproduction, evaluation and selection.
In the reproduction phase, the current population called the parent popu-
lation is processed by a set of evolutionary operators to create a new popula-
tion called the offspring population. The evolutionary operators include two
main operators: mutation and recombination, both imitate the functions of
their biological counterparts. Mutation causes independent perturbation to a
parent to form an offspring and is used for diversifying the search. It is an
asexual operator because it involves only one parent. In GA, mutation flips
each binary bit of an offspring string at a small, independent probability pm
(which is typically in the range [0.001, 0.01] [2]). In ES, mutation is the addi-
tion of a zero-mean Gaussian random vector with covariance Σ to a parent
individual to create the offspring. Let sPA and sOF denote the parent and
offspring vectors; they are related through the Gaussian mutation
sOF = sPA + z, where z ∼ N(0, Σ).
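As a small illustration of the two operators (a sketch only; the parameter values are placeholders):

    import numpy as np
    rng = np.random.default_rng()

    def ga_mutation(bits, pm=0.005):
        """GA: flip each bit of the offspring string with a small probability pm."""
        bits = np.asarray(bits)
        flip = rng.random(len(bits)) < pm
        return np.where(flip, 1 - bits, bits)

    def es_mutation(s_parent, cov):
        """ES: add a zero-mean Gaussian vector with covariance cov to the parent."""
        s_parent = np.asarray(s_parent, dtype=float)
        return s_parent + rng.multivariate_normal(np.zeros(len(s_parent)), cov)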
Due to EC’s robustness and ease of use, it has been applied to many areas
of ANN training; in particular, network weights and architecture optimisa-
tion [10, 11, 39], regularisation [8], feature selection [14] and feature weight-
ing [15]. It is the goal of this work to extend these applications to ECOS. As
ECOS, unlike conventional ANNs, are capable of fast, incremental learning and
of adapting their network architectures to new data, there is greater flexibility
in applying EC; two such areas are identified here:
1. ECOS’s fast learning cycle reduces the computation time of fitness evalu-
ation, which in turn speeds up the overall EC search. Thus EC can afford
a larger population and a longer evolutionary time to achieve consistent
convergence.
2. ECOS’s capability of incremental learning raises the opportunity of apply-
ing on-line EC to optimise the dynamic control parameters.
Reported implementations of on-line EC typically use a small population
size (6-50), high selective pressure (use of elitist selection) and a low selec-
tion ratio, which aim to minimise computational intensity and to accelerate
convergence (at the cost of less exploration of the search space), respectively.
Their applications involve optimising only a small number of parameters
(2-10), which are often the high-level tuning parameters of a model. For
applications involving a larger number of parameters, parallel EC can be used
to accelerate the process. Due to the fast learning cycle of ECOS and the
relatively small data set used in our experiment, our on-line EC suffices in
terms of computational speed using a small population size and high selection
pressure.
The on-line EC is implemented with Evolutionary Strategies (ES). Each
individual s(k) = (s1(k), s2(k)) is a 2-vector candidate solution for (l1, l2). Since there
are only two parameters to optimise, ES requires only a small population and
a short evolutionary time. In this case, we use µ = 1 parent and λ = 10
offspring and a small number of generations of genmax = 20. We set the
initial values of s to (0.2, 0.2) in the first run and use the previous best
solution as the initial values for subsequent runs. Using a non-randomised
initial population encourages a more localised optimisation and hence speeds
up convergence. We use simple Gaussian mutation with standard deviation
σ = 0.1 (empirically determined) for both parameters to generate new points.
For selection, we use the high selection pressure (µ + λ) scheme to accelerate
convergence, which picks the best µ of the joint pool of µ parents and λ
offspring to be the next generation parents.
The fitness function (or the optimisation objective function) is the prediction
error over the last nlast data points, obtained by taking the EFuNN model at
(t − nlast) and letting it perform incremental learning and prediction over these
nlast points. The smaller nlast, the faster the learning rates adapt, and vice
versa. Since the effect of changing the learning
rates is usually not expressed immediately but after a longer period, the fit-
ness function can be noisy and inaccurate if nlast is too small. In this work
we set nlast = 50. The overall algorithm is as follows:
1. Initialisation. Set gen = 0 and initialise the µ parent individual(s) with the
   previous best solution (or the given initial values in the first run).
2. Reproduction. Randomly select one of the µ parents, sPA(r), to undergo
   Gaussian mutation to produce a new offspring:
   sOF(i) = sPA(r) + z(i), where z(i) ∼ N(0, σ²I), i = 1, 2, . . . , λ
3. Fitness Evaluation. Apply each of the λ offspring to the EFuNN model at
(t − nlast ) to perform incremental learning and prediction using data in
[t − nlast , t]. Set the respective prediction error as fitness.
4. Selection. Perform (µ + λ) selection
5. Termination. Increment gen. Stop if gen ≥ genmax, otherwise go to step 2.
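The whole loop can be sketched as below. This is illustrative only: evaluate_fitness stands for the incremental-learning prediction error described above and is not part of the original text.

    import numpy as np
    rng = np.random.default_rng()

    def online_es(evaluate_fitness, s_init, mu=1, lam=10, sigma=0.1, gen_max=20):
        """(mu + lambda)-ES over the two learning rates; lower fitness is better."""
        parents = [np.array(s_init, dtype=float) for _ in range(mu)]    # step 1
        for _ in range(gen_max):
            offspring = [parents[rng.integers(mu)]
                         + rng.normal(0.0, sigma, size=len(s_init))
                         for _ in range(lam)]                           # step 2
            pool = parents + offspring
            fitness = [evaluate_fitness(s) for s in pool]               # step 3
            best = np.argsort(fitness)[:mu]                             # step 4: (mu+lambda)
            parents = [pool[i] for i in best]
        return parents[0]

Called with s_init = (0.2, 0.2) and a fitness routine of the kind described above, this corresponds to the (1 + 10) scheme with genmax = 20 used here.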
Fig. 9.3. Evolution of the best fitness and the learning rates over 15 generations
Figure 9.3 shows an example of the evolution of the least RMSE (used
as the fitness function) and the learning rates. The RMSE decreases mono-
tonically, which is characteristic of (µ + λ) selection because the best
individual is always kept. The optimal learning rates are reached quickly,
after 14 generations.
Figure 9.4 shows the dynamics of the learning rate over the period
t = [500,1000]. Both learning rates l1 and l2 vary considerably over the entire
course with only short stationary moments, showing that they are indeed dy-
namic parameters. The average RMSE for on-line prediction one step ahead
obtained with and without on-line EC are 0.0056 and 0.0068 respectively,
showing that on-line EC is effective in enhancing EFuNN’s prediction perfor-
mance during incremental learning.
2. Reproduction. Randomly select one of the µ parents, sPA(r), to undergo
   Gaussian mutation to produce a new offspring:
   sOF(i) = sPA(r) + z(i), where z(i) ∼ N(0, σ²I), i = 1, 2, . . . , λ
Resample if the membership hierarchy constraint is violated.
3. Fitness Evaluation. Apply each of the λ offspring to the EFuNN model at
(t − nlast ) to perform incremental learning and prediction using data in
[t − nlast , t]. Set the respective prediction error as fitness.
4. Selection. Perform (µ + λ) selection
5. Termination. Increment gen. Stop if gen ≥ genmax , otherwise go to step 2.
Fig. 9.5. (a) The evolution of the best fitness from the off-line ES. (b) Initial
membership functions and EC-optimised membership functions, (c) Frequency dis-
tribution of the first input variable
Rmax : the maximum radius of the receptive hypersphere of the rule nodes.
If, during the training process, a rule node is adjusted such that the radius
of its hyper-sphere becomes larger than Rmax then the rule node is left
unadjusted and a new rule node is created.
Rmin : the minimum radius of the receptive hypersphere of the rule nodes. It
becomes the radius of the hypersphere of a new rule node.
nM F : the number of membership functions used to fuzzify each input variable.
M -of-N : if no rules are activated, ECF calculates the distance between the
new vector and the M closest rule nodes. The average distance is calculated
between the new vector and the rule nodes of each class. The vector is
assigned the class corresponding to the smallest average distance.
Table 9.1. The number of bits used by GA to represent each parameter and the
range into which each bit string is decoded
The experiments for GA are run using a population size of 10 for 10 gen-
erations (empirically we find that 10 generations are sufficient for GA conver-
gence, see Fig. 9.4 and Fig. 9.5 shown later). For each individual solution, the
initial parameters are randomised within the predefined range. Mutation rate
pm is set to 1/ltot , which is the generally accepted optimal rate for unimodal
functions and the lower bound for multi-modal functions, yielding an average
of one bit inversion per string [1, 2]. Two-point crossover is used. Rank-based
selection is employed and an exponentially higher probability of survival is
assigned to high fitness individuals. The fitness function is determined by
the classification accuracy. In the control experiments, ECF is performed us-
ing the manually optimised parameters, which are Rmax = 1, Rmin = 0.01,
nM F = 1, M -of-N = 3. The experiments for GA and the control are repeated
50 times and between each run the whole data set is randomly split such that
50% of the data was used for training and 50% for testing. Performance is
determined as the percentage of correctly classified test data. The statistics
of the experiments are presented in Table 9.2.
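The encoding step behind Table 9.1 can be sketched as follows. The bit widths and parameter ranges below are placeholders (the actual values of Table 9.1 are not reproduced here); the decoding itself, from a binary string to a value inside a range, is the standard GA mapping.

    import numpy as np

    # hypothetical bit widths and ranges for (Rmax, Rmin, M-of-N, nMF)
    GENES = [("Rmax", 6, 0.1, 1.0), ("Rmin", 6, 0.01, 0.2),
             ("M_of_N", 3, 1, 8), ("nMF", 3, 1, 8)]

    def decode(chromosome):
        """Map a binary chromosome to the four ECF control parameters."""
        params, pos = {}, 0
        for name, nbits, lo, hi in GENES:
            bits = chromosome[pos:pos + nbits]
            value = int("".join(map(str, bits)), 2) / (2 ** nbits - 1)  # in [0, 1]
            params[name] = lo + value * (hi - lo)
            pos += nbits
        return params

    l_tot = sum(nbits for _, nbits, _, _ in GENES)
    pm = 1.0 / l_tot    # mutation rate 1/ltot, about one bit flip per string

Integer-valued parameters such as M-of-N and nMF would of course be rounded after decoding.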
The results show that there is a marked improvement in the average
accuracy in the GA-optimised network.
Table 9.2. The results of the GA experiment repeated 50 times and averaged and
contrasted with a control experiment
Fig. 9.6. Neucom module that implements GA for the optimisation of ECF
[Fig. 9.7: evolution of the parameters nMF, M-of-N, Rmin and Rmax over 40
generations]
Finally, Fig. 9.7 shows the evolution of the parameters over 40 generations.
Each parameter converges quickly to its optimal value within the first 10
generations, showing the effectiveness of the GA implementation.
The above WDN method is illustrated in the next section on two case-study
ECOS and on two typical problems, namely EFuNN for time series
prediction, and ECMC for classification.
In the present paper, EFuNN is applied to the time series prediction. Improved
learning with the WDN method is demonstrated on the Mackey-Glass (MG)
time series prediction task [33]. In the experiments, 1000 data points, from
t = 118 to 1117, were extracted for predicting the 6-steps-ahead output value.
The first half of the data set was taken as the training data and the rest as
the testing data.
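For reference, the MG benchmark series is commonly generated from the Mackey-Glass delay differential equation dx/dt = 0.2 x(t−τ)/(1 + x(t−τ)^10) − 0.1 x(t); a simple Euler-integration sketch is given below. The delay τ = 17, the initial value and the simple input/target split are common benchmark choices, not restated from the chapter.

    import numpy as np

    def mackey_glass(n=1200, tau=17, dt=1.0, x0=1.2):
        """Euler integration of dx/dt = 0.2 x(t-tau)/(1 + x(t-tau)^10) - 0.1 x(t)."""
        x = np.full(n + tau, float(x0))
        for t in range(tau, n + tau - 1):
            x[t + 1] = x[t] + dt * (0.2 * x[t - tau] / (1 + x[t - tau] ** 10)
                                    - 0.1 * x[t])
        return x[tau:]

    series = mackey_glass()
    X, y = series[:-6], series[6:]   # values x(t) and 6-steps-ahead targets x(t+6)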
The following parameters are set in the experiments for the EFuNN model:
Rmax = 0.15; Emax = 0.15 and nM F = 3. The following GA parameter values
are used: for each input variable, the weight values from 0.16 to 1 are mapped
onto a 4-bit string; the number of individuals in a population is 12; the mutation
rate is 0.001; the termination criterion (the maximum number of GA generations)
is 100; the RMSE on the training data is used as the fitness function. The
optimised weight values, the number of the rule nodes created by EFuNN with
such weights, the training and testing RMSE and the control experiments are
shown in Table 9.3.
Table 9.3. Comparison between EFuNN without WDN and EFuNN with WDN
With the use of the WDN method, better prediction results are obtained
with a significantly smaller number of rule nodes (clusters) evolved in the EFuNN
models. This is because of the better clustering achieved when the different
variables are weighted according to their relevance.
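A hedged sketch of how such weights could enter the distance computation used for clustering (the chapter's exact WDN formulation is given earlier in the chapter and is not restated here; the function below is an illustrative stand-in):

    import numpy as np

    def weighted_distance(x, centre, w):
        """Euclidean distance with per-variable weights w, so that each input
        variable contributes according to its GA-optimised relevance."""
        return np.linalg.norm(w * (x - centre))

    # e.g. a variable weighted 0.16 contributes far less than one weighted 1.0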
In this section, the ECMC with WDN is applied to the Iris data for both
classification and feature weighting/selection. All experiments in this section
are repeated 50 times with the same parameters and the results are averaged.
50% of the whole data set is randomly selected as the training data and the
rest as the testing data. The following parameters are set in the experiments
for the ECMC model: Rmin = 0.02; each of the weights for the four normalised
input variables is a value from 0.1 to 1, and is mapped into a 6-bit binary
string.
The following GA parameters are used: number of individuals in a pop-
ulation 12; mutation rate pm = 0.005; termination criterion (the maximum
epochs of GA operation) 50; fitness function is determined by the number of
created rule nodes.
The final weight values, the number of rule nodes created by ECMC and
the number of classification errors on the testing data, as well as the con-
trol experiment are shown in the first two rows of Table 9.4 respectively.
Table 9.4. Comparison between ECMC without WDN and ECMC with WDN
Results show that the weight of the first variable is much smaller than the
weights of the other variables. Now using the weights as a guide to prune away
the least relevant input variables, the same experiment is repeated without
the first input variable. As shown in the subsequent rows of Table 9.4, this
pruning operation slightly reduces test errors. However, if another variable is
removed (i.e. the total number of input variables is 2) test error increases. So
we conclude that for this particular application the optimum number of input
variables is 3.
9.9 Acknowledgements
The research presented in the paper is funded by the New Zealand Founda-
tion for Research, Science and Technology under grant NERF/AUTX02-01.
Software and data sets used in this chapter, along with some prototype demo
systems, can be found at the Web site of the Knowledge Engineering and
Discovery Research Institute, KEDRI – www.kedri.info.
References
1. T. Baeck, D. B. Fogel, and Z. Michalewicz. Evolutionary Computation II.
Advanced algorithms and operators, volume 2. Institute of Physics Publishing, Bristol,
2000.
2. T. Baeck, D. B. Fogel, and Z. Michalewicz. Evolutionary Computation I. Basic
algorithms and operators, volume 1. Institute of Physics Publishing, Bristol,
2000.
J. Strackeljan
10.1 Introduction
1. Monitoring
Measurable variables are checked with regard to tolerances, and alarms are
generated for the operator.
2. Automatic protection
In the case of a dangerous process state, the monitoring system automati-
cally initiates an appropriate counteraction.
3. Monitoring with fault diagnosis
Based on measured variables, features are determined and a fault diagnosis
is performed; in advanced systems, decisions are made for counteractions.
1. Sensor technology
For this purpose, the measurement of machine vibrations, for instance,
has proved to be a diagnostic approach very well suited for determining
variations in machine behaviour. However, the temperature, pressure, or
other process parameters, as well as oil analysis with respect to the number
of suspended particles, etc., can also be employed for monitoring.
2. Analogue data pre-processing
The signals from the sensors must frequently be conditioned by offset elim-
ination, filtering, etc.
3. A/D conversion
As a rule, a sensor generates an analogue voltage or current signal, which
is then converted to a digital signal for further use.
4. Feature calculation and/or feature extraction
In the next step, status parameters (features) must be determined from
the digital signal or from the available individual values; these should be
correlated as unambiguously as possible with the monitoring task (a minimal
feature-calculation sketch follows this list).
1. Adapting
Ability to modify the system behaviour to fit the environment, new locations
and process changes. This aspect is identified as the most important feature
of a smart adaptive monitoring system.
2. Sensing
Ability to acquire information from the surrounding world and to respond to it
with consistent behaviour. Chemical and nuclear power plant monitoring are
large-scale sensing systems. They accept input from hundreds of sensors to
regulate temperature, pressure and power throughout the plant. Although
many different sensors already exist, more and better sensors remain a
continuous need. Laser vibrometers, which allow a precise measurement of
displacement and velocity, fiber-optic sensors and micro-electronic sensors
should be mentioned as examples of promising new sensor technologies.
Along with a proliferation of sensors comes a greatly increased need to
combine, or fuse, data from multiple sensors for more effective monitoring.
Needs range from processing multiple data streams from a simple array
of identical sensors to handling data from sensors based on entirely different
physical phenomena, operating asynchronously at vastly different rates.
3. Inferring
Ability to solve problems using embedded knowledge and to draw conclusions.
This expert knowledge is in general a combination of theoretical under-
standing and a collection of heuristic problem-solving rules that experience
has shown to be effective. A smart monitoring system should be able to
detect sensor faults to prevent the use of nonsensical input values for a
classification.
4. Learning
Ability to learn from experience to improve the system performance. Learn-
ing and adaptive behaviour are closely coupled, and without the imple-
mentation of learning strategies adaptation will not work.
If monitoring problems are already being solved sufficiently well today without
adaptive behaviour, a logical question is why there is a demand for smart
systems at all, and whether this demand can be justified.
For this purpose, the type of monitoring which is currently applied in in-
dustry must first be considered. From a methodical standpoint, this kind of
monitoring no longer satisfies all of the requirements which must be imposed
on a modern monitoring system. The concept of preventive maintenance im-
plies the application of techniques for the early detection of faults and thus
the implementation of appropriate maintenance measures in due time. Main-
tenance of a machine should not depend merely on a specified operating time
or distance, or on a specified time interval. This revolution in maintenance
strategy has already occurred in the automotive field during the past five
years. The notion that an automobile must be inspected for the first time af-
ter 15,000 km or one year, whichever comes first, after the original purchase, is
obsolete. As far as monitoring of machines in industry is concerned, however, a
change of this kind has hardly taken place at all in practice. The potential for
increasing the operating time of machines by decreasing the frequency of fail-
ures and for entirely avoiding unplanned down time is evident here. However,
the acceptance of such monitoring methods is possible only if the probability
of false alarms can be kept extremely low. Changes in the process or in the
machine which do not result from a fault must be distinguished from those
which do result from a fault. Precisely this adaptivity at level 1 still presents
serious problems for many systems, however. If a given system functions cor-
rectly, weeks or even months are often necessary for transferring this property
to similar machines. Support for the expert in charge by the system itself
is severely limited or completely absent. The requirement for human experts
and the associated labour costs still severely restrict the acceptance of mon-
itoring systems. On the other hand, precisely this situation offers a special
opportunity for those who develop and supply monitoring systems, since an
adaptive system becomes independent of particular applications and can thus
provide specialised solutions at acceptable prices. For controlling the qual-
ity of products during flexible manufacturing, the systems for monitoring the
machines and products must also be flexible. In this case, flexibility can be
equated to adaptivity.
In this connection, a further reason for the hesitation to apply advanced
monitoring systems should be mentioned. In this case, however, the problem is
not so much of a technical nature. During the past 15 years, the allegedly “ul-
timate” system has repeatedly been offered to industry as a solution to moni-
toring problems. Many managers invested money and were then disappointed
because the programs did not perform as promised. At present, genuine inno-
vations would be feasible with the application of computational intelligence
methods, but the willingness to invest again has decreased markedly.
Fig. 10.3. Trend line showing manually set alarm levels and adaptive levels (bold
lines) and the real overall vibration measurement data of a rotating machine
Fig. 10.4. Mask of a power spectrum and alert circles if more than one feature is
required
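The adaptive alarm levels of Fig. 10.3 can be approximated, for illustration only, by letting the threshold follow a rolling estimate of the normal signal level; the method actually behind the figure is not specified here.

    import numpy as np

    def adaptive_alarm_level(x, window=50, k=3.0):
        """Alarm threshold that tracks the recent mean + k standard deviations
        of the overall vibration level x (1-D array)."""
        level = np.empty(len(x), dtype=float)
        for t in range(len(x)):
            recent = np.asarray(x[max(0, t - window):t + 1], dtype=float)
            level[t] = recent.mean() + k * recent.std()
        return level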
Figure 10.5 shows, with reference to the detection of cracks in a rotating shaft,
a typical rule which can be regarded as state of the art. Curves represent the
relative values obtained by means of first (f), second (2 × f) and third (3 × f)
running speed components and the RMS value when the measurements are
performed using shaft displacement. The relationship between the structure
and the amplitudes of a vibration signal and the cause of the defect in the
Fig. 10.5. Trends in some characteristic values for the detection of cracks in rotating
shafts
[Figure: time-frequency resolution of the Fourier, Short-Time Fourier and wavelet
transforms]
10.5 Applications
Many research papers describe new developments in the field of condition
monitoring using advanced techniques. The applications presented here were
selected because the author was involved in the design of the monitoring
concept and its realization.
Fig. 10.7. Typical bearing fault in the outer race of a roller bearing
class describes a situation status where the machine does not operate. In
Fig. 10.8 the evolution of the membership values is plotted over a period of
19 hours. The membership m1 for class Ω1 is plotted in the upper section of
Fig. 10.8. This membership is determined in a fuzzy manner, but only the
values 0 and 1 are employed for indicating membership in the "shut-down"
class. The lower section of Fig. 10.8 shows the class membership values for
the classes Ω1 to Ω4 .
Fig. 10.10. Time record showing a typical fault signature (left) and a signal of a
defective roller bearing with cavitation interference (right) – learning samples for
the diagnosis of roller bearings
The Human Interface Supervision System (HISS) [16] monitors dif-
ferent plants in all fields of industry where a great hazard prevails, where
unacceptable working conditions exist, or where long distances make it dif-
ficult for humans to check plants or processes at regular intervals. HISS is
an automatic self-learning supervision system with integrated video and au-
dio supervision and an electronic nose. The HISS project was born in order
to find a solution to this problem by imitating human sense organs – eyes,
ears, nose – with the help of sensors. All sensor signals are integrated by an
educable logic in such a way that events in technical-process facilities which
cannot be identified by conventional process-control systems can be logically
evaluated and the resulting information passed on (Fig. 10.11).
Fig. 10.12. Typical sensor constellation for the monitoring of a sweet gas plant.
Microphones and acceleration sensors are used for the detection of leaks and machine
faults
for adapting and adjusting the system to a changing environment has been
reduced to an acceptable level.
The HISS Logic unit comprises a facility model in which the tasks and
characteristics of the process facility (e.g. whether it is a sweet or acid gas
facility) are stored [13]. In an event-related database new situations are stored
and are available for comparison with current measurements. The consideration
of process variables is important: an increase in the speed of production
leads to a change in the background noise of the facility. If the process variables
are included in the Logic unit, this change does not lead to an alarm signal.
Information is pre-processed in the individual components and transferred
via interfaces to the central computer unit. In the central computer unit the
situation is registered, evaluated and a decision about further processing is
made (Fig. 10.13).
HISS Logic receives information from each of the three measurement sys-
tems if one of these systems has detected a fault. The Logic unit compresses
the information with respect to time and location parameters and gives an
appropriate message to the operator. Two cases must be distinguished: either
the system recognizes the situation which has led to the fault, in which case
it can issue a message which points directly to this situation, or it does not
recognize the situation, in which case the message is "unknown scenario". In this
case the operator has to give feedback, stating what the current situation
is, and the system uses this feedback in order to learn. If this or a similar
situation occurs again in the future, the system recognizes it
and can also give an expert message. By and by, HISS Logic thus acquires an
increasingly comprehensive case database and consequently an ever greater
knowledge about the facility, and its experience increases over time.
Essentially, finite element programs (FEM) were employed for the pur-
pose. Since boundary conditions, such as the variation of rotational speed
and the dimensions of machine components, can also be taken into account on
the computer, the adaptive behaviour can be simulated.
2. On the basis of this result, machine states are generated with artificially
introduced faults, such as roller bearing damage, alignment faults, gear
defects in transmissions.
The data from the fault simulation were compared with data from very
simple test rigs, for which the specific implementation of defective machine
components is feasible, in contrast to real industrial plants. The basic idea
of this approach is not new in itself, but the objective of the project was to
achieve a new dimension in the quality of the simulation. From a conceptual
standpoint, promising trial solutions have certainly resulted from this project,
but these are concerned essentially with integration strategies for the differ-
ent data sets to yield a standardised learning set. However, the overall result
can never be better than the simulation which is taken as the basis. Precisely this
weakness limits the applicability of FEM in the field of machine monitoring at
present. Modelling of the extremely complex, frequently non-linear processes
of contact between machine components in motion is feasible only in simple
cases. As an example consider the contact of a roller bearing with pitting
in the track. In principle, an expert can also determine the structure of the
signal thus generated without the application of FEM programs. A periodic
excitation of natural vibrations in the system always occurs when a rolling
element is in contact with the uneven surface of the track. In practical tests,
1. Neural networks (NN)
Neural networks are certainly suited for applications
in machine monitoring. Numerous reports on successful applications have
been published. Because of their capability of operating with noisy signals
and of representing the non-linear relationships between sources of error
and the resulting signal, neural networks are predestined for this field of ap-
plication. A further advantage is the high potential for generalisation, that
is, the capability of reaching a sensible decision even with input data which
have not been learned. In contrast to closed-loop control applications, it is
not necessary to cover the entire input data space for classification tasks
with neural networks. However, the representation of a classification deci-
sion is not transparent. As the level of network complexity increases, the
interpretability decreases drastically. Because of this disadvantage, it has
proven expedient to avoid working with large networks which can repre-
sent a large number of faults; instead, smaller networks should be taught
to deal with a single fault or fault group. There are many different kinds
of neural networks available. Especially learning vector quantization (LVQ),
which is based on prototypes, is particularly suited for applications where
The data mining technique described by McGarry in [10] involves
the process of training several RBF networks on vibration data and
then extracting symbolic rules from these networks as an aid to understanding
the RBF network's internal operation and the relationships of the parameters
associated with each fault class.
A criticism of neural network architectures is their susceptibility to so-
called catastrophic interference, which should be understood as the tendency to
forget previously learned data when presented with new patterns – a problem
which should never occur when a neural network is used as a classification
algorithm in condition monitoring applications. To avoid this, some authors
have described neural network architectures with a kind of memory. In general,
two different concepts are promising: either the network possesses a context unit
which can store patterns for later recall, or the network combines high levels
of recurrence with some form of back-propagation [2].
1. Are enough data available for all (or at least as many as possible) conditions
or qualities?
This question must be answered in order to decide, for instance, whether
automatic learning of the classifiers can be accomplished with neural net-
works. For a neural network which is employed as a monitoring system, the
situation is more favourable than for one employed as a process controller.
With a monitoring system, an unlearned condition results either in no in-
dication at all or, in the worst case, in a false indication, but the process
itself is not affected. If only data for a “good” condition are available, these
data can still be employed as starting material for a monitoring operation.
All significant deviations from this condition are then detected. The adap-
tive system then evolves in the course of the operation with the continuing
occurrence of new events.
2. How many features are available?
Features should be closely related to the monitored object. How many fea-
tures of this kind can be provided? The number of features above which a
feature selection becomes necessary can be decided only by considering the
current monitoring task. As a provisional value, 10 features can be taken. If
more features are available, a selection should be carried out. Note the close
relationship between the number of features and the number of samples
required by many classifiers. If a reduction of dimensionality is necessary,
prefer feature selection if possible. New features, for instance those calculated
by a principal component analysis, are difficult to interpret both for an expert
in the field and even more so for an operator.
3. Can every set of process or machine data be unambiguously correlated with
a class attribute (quality or damage)?
If this is possible, methods of “supervised learning” should be applied. If
this is not possible, or if class information is available only for portions
of the data sets, cluster methods, such as Kohonen networks, should be
employed. The problem is especially difficult if the task involves a very
large number of available features, and the associated data bases cannot
be classified by an expert. In such cases, two questions must be considered
in parallel: the question of feature selection and that of clustering. Nearly
all methods for the selection of features utilise the classification efficiency
of the learning or test data set as selection criterion. However, if no class
designation is available, the classification efficiency is not calculable, and
so-called filter methods must then be employed. As a rule, these methods
yield features which are decidedly less efficient (a wrapper-selection sketch is
given after this list).
4. Do I have enough theoretical background, system information and suit-
able software to carry out a precise simulation which should include fault
simulation?
The development of a model-based prognostic and diagnostic system re-
quires a proven methodology to create and validate physical models that
capture the dynamic response of the system under normal and faulted
conditions [4]. There are a number of model-update algorithms which could
improve the accuracy of an FEM model, but one should never underestimate
the workload and expert knowledge necessary for the realization of
the corresponding experiments and program usage.
5. What is of better quality or more reliable? The system model or the re-
sponse of your sensors indicating changes in the machine or process?
Having both would be the best situation, because you are then free to decide
to use a forward model which calculates a prognosis of the system response
and compares the estimate with the actual sensor signal (Fig. 10.15).
This procedure allows an estimation of the time to failure of the system.
Fig. 10.15. Prognostic and diagnostic approach for a monitoring task [1]
Additionally, the difference between sensor and model will give useful
information about possible faults. For the inverse problem the quality of
the model is less important, because the describing features and all condi-
tion information are extracted from the sensor signal independently of
the model. For the diagnosis of roller bearings we have a rough model
of the signal structure, but in most cases not enough information for
the prediction.
6. Do I have a safety-relevant application?
Until now there have been a number of problems concerning self-adaptive moni-
toring systems in safety-relevant applications. These limitations result
from problems which could occur during the evaluation and testing phase of
such systems in all conceivable situations. The behaviour today and in the
future, after a system adjustment, could differ significantly. This touches legal
aspects, because national and international testing institutions might have
problems certifying such a system. Today, in nuclear power plants, advanced
self-learning and adaptive systems are additional information sources but
in general not the trigger for an automatic shut-down.
7. The cost factor. Do I have the sensor technology and communication infrastruc-
ture for on-line monitoring?
Current commonly used sensor technology only permits the most rudimen-
tary form of signal processing and analysis within the sensor. The use of
advanced data analysis techniques will need more signal data. This means
that large quantities of data must be transmitted from the sensor to a sep-
arate data collector for subsequent processing and analysis. If permanent
on-line vibration monitoring is required – and for adaptive techniques it
is absolutely necessary – then at present, for anything other than an overall
vibration alarm, the cost of providing the required communications in-
frastructure and the data collection and analysis equipment far outweighs
the benefits to be obtained [9].
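To make the selection criterion mentioned in point 3 concrete, a minimal wrapper-style selection sketch is given below; scikit-learn is assumed to be available, and both the classifier and the greedy forward search are illustrative choices, not those of the chapter.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features=10):
        """Greedy wrapper selection: keep adding the feature that most improves
        the cross-validated classification efficiency."""
        selected, best_score = [], 0.0
        while len(selected) < max_features:
            candidates = [f for f in range(X.shape[1]) if f not in selected]
            scores = {f: cross_val_score(KNeighborsClassifier(),
                                         X[:, selected + [f]], y, cv=5).mean()
                      for f in candidates}
            f_best = max(scores, key=scores.get)
            if scores[f_best] <= best_score:
                break                      # no further improvement
            selected.append(f_best)
            best_score = scores[f_best]
        return selected, best_score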
10.8 Conclusions
References
1. Adams D.E. (2001) Smart diagnostics, prognostics, and self healing in next
generation structures, Motion Systems Magazine, April 2001.
2. Addison J.F.D., Wermter St., Kenneth J McGarry and John MacIntyre (2002)
Methods for Integrating Memory into Neural Networks Applied to Condition
Monitoring, 6th IASTED International Conference Artificial Intelligence and
Soft Computing (ASC 2002), Banff, Alberta, Canada July 17-19, 2002.
3. Adgar A., Emmanouilidis C., MacIntyre J. et al. (1998) The application of
adaptive systems in condition monitoring, International Journal of Condition
Monitoring and Engineering Management, Vol. 1, pp. 13-17.
4. Begg C.D., Byington, C.S., Maynard K.P. (2000) Dynamic Simulation of me-
chanical fault transition, Proc. of the 54th Meeting of the Society for Machinery
Failure Prevention, pp. 203-212.
5. Bojer T., Hammer B. and Koers C. (2003) Monitoring technical systems with
prototype based clustering. In: M.Verleysen (ed.), European Symposium on Ar-
tificial Neural Networks’2003, pp. 433-439, 2003
6. CSC (2001) Get smart. How Intelligent Technology will enhance our world. CSC
consulting. www.csc.com
7. Debabrata P. (1994) Detection of Change in Process Using Wavelets, Proc.
of the IEEE-SP Int. Symposium on Time-Frequency and Time-Scale Analysis,
pp. 174-177.
K. Leiviskä
11.1 Introduction
Fault recovery is a case for (fuzzy) expert systems. Based on fault location
and cause, corrective actions are recommended to the user. This stage is not
included in any of the cases below. In most cases it is left for the operator to
choose or it is obvious from the problem statement.
Environmental changes, together with the changes in quality and produc-
tion requirements lead to the need for adaptation. For instance, in nozzle
clogging problem in Sect. 11.2.2, change in the steel grade changes the pat-
terns of single faults and it can also introduce totally new faults. Reasons for
faults can be traced back to earlier stages of production, to dosing of different
chemicals, bad control of process operations, and also to different routing of
customer orders through the production line. In this case, the new steel grade
change would require new models to be taken into use.
Fault diagnosis is in many cases essential to quality control and optimising
the performance of the machinery. For example, the overall performance of pa-
per machines can be related to the operating efficiency, which varies depending
on breaks and the duration of the repair operations caused by the web break.
Leiviskä [1] analysed several cases for using intelligent methods in the ap-
plications developed in Control Engineering Laboratory, University of Oulu.
Two applications in model based diagnosis from process industries were con-
sidered: nozzle clogging in continuous casting of steel [4] and web break indi-
cator for the paper machine [5]. These cases are revisited here from another
perspective together with two further examples from electronics production;
screw insertion machine [6] and PCB production [7].
The runnability of the paper machine depends on the amount of web breaks
in comparison with the paper machine speed [5]. The paper web breaks when
the strain on it exceeds the strength of paper. When the runnability is good,
the machine can be run at the desired speed with the least possible amount of
breaks. The web break sensitivity stands for the probability of the web breaking,
predicting the amount of breaks during one day. Paper web breaks commonly
account for 2–7 percent of the total production loss, depending on the paper
machine type and its operation. According to statistics only 10–15 percent of
web breaks have a distinct reason. Most of the indistinct breaks are due
to dynamic changes in the chemical process conditions already before the
actual paper machine. In this area the paper making process is typically non-
linear with many long delays that change in time and with process conditions,
there are process feedbacks at several levels, there are closed control loops,
there exist factors that cannot be measured and there are strong interactions
between physical and chemical factors [8].
The aim of this case study is to combine on-line measurements and ex-
pert knowledge and develop the web break sensitivity indicator. The indicator
would give the process operator continuous information on the web break sen-
sitivity in an easily understandable way. Being able to follow the development
of the break risk would give the operator a possibility to react to its changes
in advance and therefore avoid breaks. In this way, the efficiency of the paper
machine could be increased and also the process stability would improve with
fewer disturbances. In this case a lot of mill data from several years’ operation
was available.
Paper machines produce a great amount of different paper grades with
varying raw materials and process conditions. This leads to the fact that
reasons for web breaks can vary from one grade to another. This calls for
adaptive systems. Learning capability would also shorten the start-up period
for new machines.
Submerged entry nozzle connects the tundish and the mold in the continuous
casting of steel (Fig. 11.1) [4]. It is a tube with a diameter of about 15 cm and
it divides the molten steel into two directions on both sides of the nozzle.
The nozzle transfers molten steel from the tundish to the mold and separates
steel from the atmosphere.
Steel is cast in batches (heats). Usually one cast series includes 3–6 suc-
cessive heats. Casting of a heat takes about 35–50 minutes. Each series is
designed to consist of the same steel quality. The nozzle is changed after each
series and the new series is started with a new nozzle. About 360–720 tons
of steel goes through a nozzle during its lifetime. Nozzle clogging stops the
Fig. 11.1. Submerged entry nozzle in the continuous casting process [4]
cast and causes production losses and extra work. The risk of nozzle clogging
increases all the time so that when about 360 tons of steel has been cast, the
nozzle condition must be checked carefully. The operators estimate, based on
vision and the casting plan, if the clogging should be removed or the nozzle
changed in the middle of the cast series.
This case study aims to develop tools for predicting the occurrence of
the nozzle clogging phenomenon. It is known that better management of cast-
ing brings along considerable potential for production increase and quality
improvement. Also in this case a lot of mill data was available.
Also in this case, the production of a big variety of grades calls for adaptive
systems. In its simplest form, it would mean building of separate models for
few grade categories.
Once again, a change in components or even in the whole design of the PCB leads
to the need for adaptation. This is also a very strict requirement because of
short series production that gives only a short time for learning.
11.3 Solutions
Model-based diagnosis is a flexible tool for making the decisions early enough
to guarantee smooth operation. Intelligent methods provide additional tools
for generating, analysing, modelling and visualising in different stages in di-
agnosis. Expert knowledge is necessary in applications where only limited
amount of data is available. Neural networks are useful in extracting features
in data-intensive applications and in tuning of the systems. As mentioned be-
fore, fuzzy systems are conventional explanatory tools for diagnosis. Fig 11.2
shows the origin of data in model-based diagnosis.
The web break indicator problem is highly complex, and various techniques
are available for handling different aspects of the system [8, 9, 10, 11]:
1. Principal component analysis (PCA) has been used in data reduction and
interpretation, i.e. to find a small number of principal components which can
reproduce most of the information. This technique can reveal interesting
relationships, but it has difficulties with non-linearity and variable delays.
2. Fuzzy causal modelling provides methods for extracting expert knowledge
and also rules can be generated. This is a suitable method for restricted
special cases, but the influence diagrams become very complex resulting in
huge rule bases.
In this case, the web break sensitivity indicator was developed as a Case-
Based Reasoning type application [12] with linguistic equations approach [13]
and fuzzy logic (Fig. 11.3). The case base contains modelled example cases
classified according to the amount of the related breaks. For example, breaks
can be classified to five categories, depending on how many breaks occur
during a period of one day: No breaks (0), a few breaks (1–2), normal (3–4),
many breaks (5–10) and a lot of breaks (>10). For example, if there are five
cases in each category the case base includes 25 cases as a whole. Each case
includes one or more models written in the form of linguistic equations [13]
and describing the interaction between process variables. This hierarchy is
described in Fig. 11.4.
The revise stage means the calculation of the difference between the predicted
and the real break sensitivity. The tested case gives the information on the web
break sensitivity together with a degree of quality. Learning takes place when the
quality degree is poor. In the retain stage new potential cases are modelled with
the break class information, and the learned new case is saved in the case base. This
case is an example of data-intensive modelling. A total of 73 vari-
ables were considered and the modelling data covered 300 days. Data was
analysed in the periods of 24 hours and in the real application the number
of break categories was 10. Case selection was done interactively. The main
analysis tool was the FuzzEqu Toolbox in the Matlab environment [5].
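The retrieve-and-reuse idea behind the case base can be sketched in a generic way; the linguistic-equations details and the FuzzEqu tooling are not reproduced here, and the distance measure below is an illustrative stand-in.

    import numpy as np

    def predict_break_class(window_features, case_base):
        """case_base: list of (case_features, break_class) pairs, one per stored case.
        Retrieve the most similar case and reuse its break class as the prediction."""
        w = np.asarray(window_features, dtype=float)
        distances = [np.linalg.norm(w - np.asarray(f, dtype=float))
                     for f, _ in case_base]
        best = int(np.argmin(distances))
        return case_base[best][1], distances[best]

    # revise/retain: if the predicted class differs badly from the observed one,
    # the current 24 h window would be modelled and added to the case base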
Fig. 11.6. Nozzle clogging detection based on the stopping rod position and neural
network model [4]
Modelled variables were chosen with correlation analysis [4]. The amount of
data limits the size of the network. The number of training parameters should
not exceed the number of training points. In practice, network modelling is
difficult if the number of data points is less than 60, because training requires
30 points at minimum and almost the same number is needed for testing. With
five inputs and 30 training points, a conventional network can include only
5 neurons. In addition to normal testing procedures, cross-testing considered
applying the model developed for one caster to the data from another [4]. Only
grade 1 showed reasonable results. This confirms that different variables affect
the nozzle clogging in different casters and adaptation is needed in moving
the model from one caster to another.
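The sizing argument can be made explicit with a quick parameter count for a one-hidden-layer network (a back-of-the-envelope sketch, not the authors' exact architecture):

    def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
        """Weights and biases of a fully connected one-hidden-layer network."""
        return (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

    # with 5 inputs and around 30 training points, only a handful of hidden
    # neurons keeps the parameter count in the same order as the data set size
    for h in range(1, 6):
        print(h, mlp_parameter_count(5, h))   # e.g. 4 hidden neurons -> 29 parameters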
Modelling used only data from cases where clogging occurred. Models were
also tested using data from corresponding successful heats [4]. The result was
as expected. It is impossible to tell how many tons could have been cast in
successful cases, if casting had continued with the same nozzle. Therefore, the
cast tons given by the models remain in these cases lower than actual. The
result is important from the safety point of view: models never predict too
high cast tons.
There have recently been a few attempts to investigate the use of the soft
computing techniques for the screw insertion observation [14, 15, 16]. However,
the research has mainly focused on classification between successful and failed
insertions. In this case, the identification of the fault type was also required.
This case study consists of quality monitoring and fault identification
tasks. According to Fig. 11.7, residuals of the data-based normal model are
utilised for quality monitoring using fuzzy statistical inference. A learning al-
gorithm is applied to the parameter estimation of the normal model. Specified
fault models use fuzzy rules in fault detection.
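The residual-based idea, comparing the measured signal against a learned normal model and flagging deviations, can be sketched generically; the fuzzy statistical inference used in the actual system is replaced here by a plain threshold for brevity.

    import numpy as np

    def monitor_insertion(signal, expected, sigma, k=3.0):
        """Quality monitoring on residuals: signal is the measured torque/pressure
        curve, expected the output of the learned normal model, sigma the residual
        spread estimated from fault-free insertions."""
        residual = np.asarray(signal, dtype=float) - np.asarray(expected, dtype=float)
        abnormal = np.abs(residual) > k * sigma
        return bool(abnormal.any()), residual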
The system was developed both for electric and pneumatic screwdrivers.
In the electric case, a simulated torque signal was used and in the pneumatic case
the measured pneumatic signal from the actual insertion machine was used.
In the latter case, a full two-level factorial design program was carried out
together with three center-point runs [6]. The system was programmed and
tested on-line in the Matlab environment.
Fuzzy rule bases were constructed to achieve a generic fault detection and
isolation database for both signal types. In case of the torque signal, the fuzzy
rule base was developed on the basis of expert knowledge and theory. In the
other case, real experiments were conducted to define a fault model – a rule
base for the pneumatic screw insertions. The fault models seem to detect
known faults. It was also found out during the tests that the normal model
detects unknown faults – i.e. faults not represented in the fault models. The
operation of the normal model was also robust against unknown situations
and smaller changes in process parameters. This is very important, because
process conditions usually change when products change. Also when starting
new production lines, this approach seems to work after short ramp-up times.
Fig. 11.7. The structure of quality monitoring and fault detection scheme for screw
insertion process [6]
then to define test strategies using for example fault trees [18, 19]. Expert sys-
tems often work well only with a very simple PCB architecture and capturing
of human expertise is a difficult task in changing environments. The system
presented in this case relies only partly on expert knowledge. In order to be
flexible the system must be able to correct “inaccurate expert knowledge”
and learn from experience. This is achieved by using the Linguistic Equations
approach as in Sect. 11.3.1 [7].
For the analysis, the PCB board is divided into distinct areas and mea-
surements describing the condition of the area (OK or faulty) are identified.
Then linguistic equations tying together the measurements are formed. This
includes also the tuning of membership definitions [7, 13]. All these stages can
be based on data or on expert knowledge. By a reasonable combination of
both approaches, it was found that even data from 15 prototype PCBs was
enough for tuning the system.
The use of the systems resembles very much the case in Sect. 11.3.1. The
actual measurements from the functional testing are introduced to the system.
The system calculates the equations one by one. If the equations are “close”
to zero, the area is in normal condition. If not, there is some fault in this
area, which is reported to the user. The “closeness” depends on the standard
deviations of the linguistic variables [7].
Data intensive cases require a lot of data pre-processing. Two cases can be
separated: preparation of the modelling data set and assuring similar data
quality for the on-line use of the system. Both are difficult in complicated
cases.
Preparation of the modelling data set requires versatile tools. Data con-
sistency must be guaranteed and in many cases more than removing of out-
liers and replacing missing values is needed. Simulation and balance calcula-
tions can be used in assuring the correctness of data. Combining information
from several databases requires also good time-keeping in information sys-
tems. Time-labelled values (e.g. from the mill laboratory) require extra care.
In practice, a lot of pre-processing has already been done in automation and
monitoring systems. Sometimes it is difficult to tell what kind of filtering has
already been done when data has passed through several systems before intro-
ducing it to the fault diagnosis system. Anyhow, each filtering may have an
effect to the value used in models. Also different systems have different means
to handle e.g. missing values, outliers, and cumulated values. Experience has,
however, shown that the original measured value is in many cases difficult, or
even impossible, to access.
Scaling is needed almost always. It is elementary with neural networks
and the starting point in applying the Linguistic Equations approach. The same
scaling must be installed in the on-line systems.
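A trivial but easily forgotten point is that the scaling fitted on the modelling data must be reused unchanged on-line; a minimal sketch:

    import numpy as np

    def fit_scaling(X_train):
        """Store the min/max of the modelling data set."""
        return X_train.min(axis=0), X_train.max(axis=0)

    def apply_scaling(x, lo, hi):
        """Scale new on-line data with the stored parameters, never refitted."""
        return (x - lo) / np.where(hi > lo, hi - lo, 1.0)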
11.4.4 Training
It was very confusing to find in Case 2 that after data preparation the data-
intensive problem was converted almost to a "too few data" problem. This must
be checked in all cases and the number of parameters in models should be
carefully chosen. Automatic tools for training exist in abundance; they all
have their special features and must be studied before applying them.
When combining data intensive systems with expert knowledge the tool re-
quirements are increased. A tool is needed that easily combines the user-given
rules with automatically generated ones, that gives possibilities to simulate
and evaluate the different rules and rule bases and finally optimally combines
them. This process is complicated by the fact that in fault diagnosis the size
of the rule base easily increases to tens and hundreds, even thousands, of rules
and the conventional way of presenting the rules loses its visibility.
11.4.5 Testing
It is clear that testing must be done with independent data, preferably from
a totally different data set compared with the training data. Before going
on-line, the system must be tested on a test bench using simulators.
11.4.6 Evaluation
The best technical evaluation strategy is to run the system in an on-line envi-
ronment, using real mill systems for data acquisition and the user interface, for a
longer time. This is not always possible, and of the four cases above only
Cases 1 and 4 were tested on-line at the mill site. Case 2 was tested at the
development site, but with the data collected from the mill, and Case 3 with
simulated and experimentally acquired data. In off-site testing, it is difficult
to evaluate all data preparation and user interface tools in the correct way.
In order to estimate the economic effects of the diagnosis system, the
baseline must be defined. This might be difficult in cases where products or
production are changing or a totally new production line is started. Then the
earlier experiences or available theoretical values for e.g. the defect tendency
must be used.
11.4.7 Adaptation
All cases above show the need for adaptation, but only Cases 1 and 3 show how
this adaptation can be realised – so they are real Smart Adaptive Systems.
Anyway, intelligent model-based systems offer a good chance for adaptation.
The question is, when it is necessary and how to do it in practice. Fuzzy
logic gives the most versatile possibilities and adaptation can consider rule
bases, membership functions or separate tuning factors. Adaptation itself can
also utilise rule-based paradigm (See chapter on Adaptive Fuzzy Controllers).
11.5 Summary
This chapter described four different cases where model-based diagnosis has
been applied in industrial systems. Two cases came from process industry and
two from manufacturing industries. In spite of totally different application
areas, synergies are easy to see and similar approaches can be used in both
areas. It is also clear that both areas can learn from each other.
Fuzzy logic seems to be the tool that most easily applies to the diagnosis
task. It also offers good possibilities for adaptation, especially when applied
together with Case-Based Reasoning.
12
Design of Adaptive Fuzzy Controllers
K. Leiviskä and L. Yliniemi

This chapter shows how to design adaptive fuzzy controllers for process control. First, the general design methods with the necessary steps are reviewed, and then a simpler case, the adaptive tuning of the scaling factors of a fuzzy logic controller, is presented. A case study concerning the adaptive fuzzy control of a rotary dryer illustrates the approach.
Fuzzy logic is particularly suitable for process control if no model exists for
the process, or if the model is too complicated to handle, highly non-linear or sensitive
in the operating region. As conventional control methods are in most cases
inadequate for complex industrial processes, fuzzy logic control (FLC) is one
of the most promising control approaches in these cases.
The design of the fuzzy controller follows the general principles shown in
Chap. 5, but the on-line application and the dynamic behaviour required of the
closed-loop system impose some additional requirements that are commented upon in
this chapter. The need for adaptive tuning is one of these.
The terminology used in this chapter deserves some comments, especially
for non-control specialists. Control means the mechanism needed for two dif-
ferent tasks: to keep the system in its target state or to move it from one target
state to another. In control textbooks, this target state is usually called the set
point and we speak about stabilising control and trajectory control, depending
on the task in question.
In technical systems we can separate two control principles. In feedback
control, the controlled variable is measured, its value is compared with its
set point, and the control action (the change in the manipulated variable) is
calculated based on this deviation, the control error. In feedforward control,
the main disturbance variable is measured and the manipulated variable is set
according to the changes in it. The common problem
of room temperature control gives good examples of both principles. If
the heat input to the room is controlled based on the room temperature itself,
we speak about feedback control, and if it is set according to the outside
temperature, we have a case of feedforward control. (Note that the authors
come from the North.)
12.1 Introduction
There are several possibilities to implement fuzzy logic in control. The controller
itself can be replaced by a fuzzy model, the fuzzy logic controller. A fuzzy
model can also be used to replace the feedforward controller, or it can be run
in parallel with or in series with the conventional controller, thus helping to take
some special situations into account. A fuzzy model can also be used in the on-line
tuning of a conventional controller or of the fuzzy controller itself, as shown later in
this chapter.
Figure 12.1 shows the block diagram of the two-term fuzzy logic controller
used as an example later in the text; two-term means that only the control error
and the change in error are taken into account. The following notation is used:
sp(k) is the set point, e(k) is the error, ce(k) is the change in error, Ge, Gce
and Gcu are the tunable scaling factors, se(k) is the scaled error, sce(k) is
the scaled change in error, scu(k) is the crisp increment in the manipulated
variable, cu(k) is the de-scaled increment in the manipulated variable, u(k) is
the actual manipulated variable, u(k − 1) is the previous manipulated variable
and y(k) is the controlled variable. The variable k refers to the discrete control
interval. The fuzzy logic controller (FLC) in Fig. 12.1 consists of blocks
for fuzzification, fuzzy reasoning and defuzzification, as shown in Fig. 12.1 in
Chap. 5. The on-line tuning problem can be defined as follows: “Select the
scaling factors and/or the parameters of the FLC so that the requirements set for
the controlled variable y(k) are fulfilled in the changing process conditions”.
The solution of this tuning problem leads to adaptive fuzzy control. The following
relationships hold between the variables shown in Fig. 12.1, with the
notation as above:
The error is the difference between the set point and the controlled variable
e(k) = sp(k) − y(k) (12.1)
The change in the error is the deviation between two successive errors
ce(k) = e(k) − e(k − 1) (12.2)
If the set point does not change, combining (12.1) and (12.2) gives for the
change in the error
ce(k) = y(k − 1) − y(k) (12.3)
For the scaling, the following equations can be written: se(k) = Ge e(k), sce(k) = Gce ce(k) and cu(k) = Gcu scu(k); the actual manipulated variable is then obtained as u(k) = u(k − 1) + cu(k).
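To make the signal flow around the FLC of Fig. 12.1 concrete, the following minimal Python sketch implements one control interval: error computation, scaling, a call to the fuzzy inference block, de-scaling and the update of the manipulated variable. The function fuzzy_inference stands for the fuzzification–reasoning–defuzzification block of Chap. 5, and the default scaling factor values are arbitrary assumptions, not values from the case study.

```python
def two_term_flc_step(sp, y, y_prev, u_prev, fuzzy_inference,
                      Ge=1.0, Gce=1.0, Gcu=1.0):
    """One control interval of the two-term FLC shown in Fig. 12.1.

    fuzzy_inference is a callable mapping the scaled error and the scaled
    change in error to the crisp increment scu(k); Ge, Gce and Gcu are the
    tunable scaling factors (the default values are assumptions).
    """
    e = sp - y                       # e(k) = sp(k) - y(k), Eq. (12.1)
    ce = y_prev - y                  # ce(k) for a constant set point, Eq. (12.3)
    se, sce = Ge * e, Gce * ce       # scaled inputs se(k), sce(k)
    scu = fuzzy_inference(se, sce)   # crisp increment from the rule base
    cu = Gcu * scu                   # de-scaled increment cu(k)
    u = u_prev + cu                  # actual manipulated variable u(k)
    return u

# e.g. with a dummy linear stand-in for the fuzzy inference block:
# u = two_term_flc_step(80.0, 78.5, 78.0, 10.0, lambda se, sce: 0.1 * se + 0.05 * sce)
```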
The case study at the end of this chapter concerns the adaptive fuzzy
control of a rotary dryer and is based on [3]. Reference [4] deals with the
same tuning method, but a PID-controller is used instead of a PI-controller and
an improved supervisory control strategy is also tested.
expert systems, and the results shown there are not repeated here. This section
introduces some general design approaches; more detailed discussion
and practical guidelines for the design of adaptive fuzzy controllers are given in
Sect. 12.3. Ali and Zhang [5] have compiled the following list of questions
for constructing a rule-based fuzzy model for a given system:
• How to define membership functions? How to describe a given variable by
linguistic terms? How to define each linguistic term within its universe of
discourse and membership function, and how to determine the best shape
for each of these functions?
• How to obtain the fuzzy rule base? In modelling of many engineering prob-
lems, usually, nobody has sufficient experience to provide a comprehensive
knowledge base for a complex system that cannot be modelled physically,
and where experimental observations are insufficient for statistical mod-
elling. Moreover, human experience is debatable and almost impossible to
verify absolutely.
• What are the best expressions for performing union and intersection operations?
In other words, which particular t-norm and s-norm should be used for a particular inference?
• What is the best defuzzification technique for a given problem?
• How to reduce the computational effort of operating with fuzzy sets, which
is normally much slower than operating with crisp numbers?
• How to improve computational accuracy of the model? Being fuzzy does
not mean inaccurate. At least, accuracy should be acceptable by the nature
of the engineering problem.
Isomursu [6] and later Frantti [7] have also approached the design of fuzzy
controllers using basically the general fuzzy modelling approach. They have
modified the systematics for fuzzy model design originally given in [8] for on-line
adaptation purposes. The approach given in [7] divides the design of fuzzy
controllers into the following parts (a minimal code sketch of these steps is given after the list):
• Data pre-processing and analysis. Assuring that data used in the design
fulfils the requirements set by the design methods. This means mostly
eliminating or replacing erroneous measurement values and compensating
for process time delays.
• Automatic generation of membership functions. Making the fast adapta-
tion and tuning of fuzzy model (controller in this case) possible. On-line
use requires fast and robust algorithms.
• Normalization (Scaling). The input values of the physical domain are nor-
malized to some appropriate interval.
• Fuzzification. Representation of input values in the fuzzy domain.
• Reasoning. Interpretation of the output from the fired rules.
• Defuzzification. Crisp representation of the fuzzy output.
• Denormalization (De-scaling). Normalized output value is scaled back to
the physical domain.
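The following minimal Python sketch, referred to above, walks through the normalization, fuzzification, reasoning, defuzzification and denormalization steps for a toy two-input controller. The linguistic terms, the rule base and the scaling ranges are illustrative assumptions and are not taken from any of the cited design methods.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Linguistic terms on the normalized interval [-1, 1]; the shoulders are
# approximated by triangles that extend beyond the interval.
TERMS = {"N": (-2.0, -1.0, 0.0), "ZE": (-1.0, 0.0, 1.0), "P": (0.0, 1.0, 2.0)}
# Toy rule base: (error term, change-of-error term) -> output term
RULES = {("N", "N"): "N", ("N", "ZE"): "N", ("N", "P"): "ZE",
         ("ZE", "N"): "N", ("ZE", "ZE"): "ZE", ("ZE", "P"): "P",
         ("P", "N"): "ZE", ("P", "ZE"): "P", ("P", "P"): "P"}

def flc(e_phys, ce_phys, e_range=10.0, ce_range=5.0, u_range=2.0):
    e = np.clip(e_phys / e_range, -1.0, 1.0)     # normalization (scaling)
    ce = np.clip(ce_phys / ce_range, -1.0, 1.0)
    z = np.linspace(-1.0, 1.0, 201)              # discretized output universe
    agg = np.zeros_like(z)
    for (te, tce), tu in RULES.items():          # fuzzification + max-min reasoning
        w = min(tri(e, *TERMS[te]), tri(ce, *TERMS[tce]))
        agg = np.maximum(agg, np.minimum(w, tri(z, *TERMS[tu])))
    cog = float((z * agg).sum() / agg.sum()) if agg.sum() > 0 else 0.0  # defuzzification
    return cog * u_range                         # denormalization (de-scaling)

print(flc(3.0, -1.0))
```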
system is realisable. The size of the rule base depends on the number of
membership functions chosen for each input variable. Complete rule
bases are used and their tuning is mostly done experimentally. Some data-based
methods exist, but problems may arise concerning the consistency
and completeness of the rule base. Graphical tools are of great help in
designing both the membership functions and the rule bases.
The definition of fuzzy relations, compositional operators and the inference
mechanism, together with the defuzzification strategy, follows the guidelines given
in Chap. 5.
Although the field of adaptive fuzzy logic control (FLC) is relatively new, a number of
methods for designing adaptive FLCs have been developed. Daugherity et al.
[10] described a self-tuning fuzzy controller (STFLC) where the scaling factors
of the inputs are changed in the tuning procedure. The process to which the
tuning method was applied was a simple gas-fired water heater, and the aim
was to replace an existing PID-controller with a fuzzy controller.
Zheng et al. [11] presented an adaptive FLC consisting of two rule sets:
The lower rule set, called the control rule set, forms a general fuzzy controller.
The upper rule sets, called tuning rule sets, adjusted the control rule set and
membership functions of input and output variables. Lui et al. [12] introduced
a novel self-tuning adaptive resolution (STAR) fuzzy control algorithm. One of
the unique features was that the fuzzy linguistic concepts change constantly
in response to the states of input signals. This was achieved by modifying
the corresponding membership functions. This adaptive resolution capability
was used to realise a control strategy that attempts to minimise both the rise
time and the overshoot. This approach was applied to a simple two-input,
one-output fuzzy controller.
Jung et al. [13] presented a self-tuning fuzzy water level controller based on
the real-time tuning of the scaling factors for the steam generator of a nuclear
power plant. This method used an error ratio as a tuning index and a variable
reference tuning index according to the system response characteristics to
improve the tuning performance. Chiricozzi et al. [14] proposed a new gain self-tuning
method for PI-controllers based on the fuzzy inference mechanism. The
purpose was to design an FLC to adapt the PI-controller parameters on-line.
Miyata et al. [15] proposed the generation of piecewise membership functions in
fuzzy control by the steepest descent method. This algorithm was tested
on the travelling control for parallel parking of an autonomous mobile robot.
Chen and Lin [16] presented a methodology to tune the initial membership
functions of a fuzzy logic controller for controlled systems. These membership
functions of the controller output were adjusted according to the performance
index of sliding mode control.
Mudi and Pal [17] presented a simple but robust model for self-tuning
fuzzy logic controllers (FLCs). The self-tuning of an FLC was based on adjusting
the output scaling factor of the FLC on-line by fuzzy rules according to
the current trend of the controlled process. Chung et al. [18] proposed a self-tuning
PI-type fuzzy controller with a smart and easy structure; the tuning
scheme allows the scaling factors to be tuned with only seven rules. The basic idea
of the adaptive fuzzy method developed by Ramkumar and Chidambaram [19]
is to parameterise a Ziegler–Nichols-like tuning formula by two parameters and
then to use an on-line fuzzy inference mechanism to tune the parameters of
a PI-controller. This method includes a fuzzy tuner with the error in the
process output as input and the tuning parameters as outputs. The idea of
using the Ziegler–Nichols formula was originally presented in [20].
If different control actions are required at different error levels, the input variable to
the tuning algorithm is the error. If the controller performance depends on
some other variable, e.g. the production rate, this variable is chosen as the
input variable. In the latter case we usually speak about gain scheduling.
The tuning output variables are, in this case, the incremental changes of the
scaling factors.
• Definition of the tuning rule base. This is once again case-dependent and
can be based on experience or theoretical considerations.
• Selection of the defuzzification method. Usually, the centre of gravity/area
method is used.
• Testing with simulations.
In order to achieve the best possible control there are a number of con-
stants, which vary from one method to another. The selection of them calls for
knowledge and experience. Their effects can also be studied using simulations.
• The selection of tuning inputs and their membership functions calls for
process knowledge. The control error is the most common one because in
non-linear cases bigger errors require stronger control actions to get the
controlled variable inside its favourable region. If some other variable is
chosen, it must be reliably measured and its effect on the control performance
must be known.
• Also here, as few membership functions as possible should be chosen. This
keeps the rule bases reasonably small. Because actual processes are
always non-linear, rule bases are usually asymmetrical (see Sect. 12.4). Expert
knowledge helps, but usually the design proceeds by trial and error.
• Graphical design environments help. Drawing control surfaces helps in finding
possible discontinuities in rule bases or in the definitions of membership
functions, and also in evaluating local control efforts; one way to do this is sketched below.
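One way to carry out the control-surface check mentioned in the last item is sketched below; flc stands for any two-input fuzzy controller function (for example the toy pipeline sketched earlier in this chapter), and matplotlib is just one possible plotting backend, so both are assumptions rather than part of the cited design methods.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_control_surface(flc, e_range=(-10.0, 10.0), ce_range=(-5.0, 5.0), n=41):
    """Evaluate u = flc(e, ce) on a grid and plot it to reveal discontinuities."""
    e = np.linspace(e_range[0], e_range[1], n)
    ce = np.linspace(ce_range[0], ce_range[1], n)
    U = np.array([[flc(ei, cj) for cj in ce] for ei in e])
    plt.contourf(ce, e, U, levels=20)
    plt.colorbar(label="controller output u")
    plt.xlabel("change in error ce(k)")
    plt.ylabel("error e(k)")
    plt.title("FLC control surface")
    plt.show()
```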
The experimental work described in this section was done with a direct air-
heated, concurrent pilot plant rotary dryer shown in Fig. 12.3 [4]. The screw
conveyor feeds the solid, calcite (more than 98% CaCO3 ), from the silo into
a drum of length 3 m and diameter 0.5 m. The drum is slightly inclined hori-
zontally and insulated to eliminate heat losses, and contains 20 spiral flights
for solids transport. Two belt conveyors transfer the dried solids back into the
silo for wetting. Propane gas is used as the fuel. The fan takes the flue gases
to the cyclone, where dust is recovered. The dryer can operate in a concurrent
or counter-current manner.
The dryer is connected to an instrumentation system for control experi-
ments. In addition to measurements of temperature and flow of the solids and
drying air, the input and output moisture of the solids is measured continu-
ously by the IR-analysers. The flows of fuel and secondary air are controlled
to keep the input temperature of the drying air at the desired value. The velocity
of the solids is controlled by the rotational speed of the screw conveyor. It is
also possible to control the delay time of the dryer by the rotational speed of
the drum.
It is difficult to control a rotary dryer due to the long time delays involved.
Accidental variations in the input variables, such as in the moisture content,
temperature or flow of the solids, will disturb the process for long periods of
time until they are observed in the output variables, especially in the output
moisture content. Therefore, pure feedback control is inadequate for keeping
moisture content. Therefore, pure feedback control is inadequate for keeping
the most important variable to be controlled, the output moisture content of
the solids, at its set point with acceptable variations. Increasing demands for
uniform product quality and for economic and environmental aspects have
necessitated improvements in dryer control. Interest has been shown in recent
years in intelligent control systems based on expert systems, fuzzy logic or
neural nets for eliminating process disturbances at an early stage.
The basic objectives for the development of dryer control are [3]:
• To maintain the desired product moisture content in spite of disturbances
in drying operation,
• To maximize production with optimal energy use and at minimal costs so
that the costs of investment in automation are reasonable compared with
other equipment costs,
Fig. 12.4. Different input and output variables of the rotary dryer [4]
• To avoid overdrying, which increases energy costs and can cause thermal
damage to heat-sensitive solids, and
• To stabilize the process.
Figure 12.4 shows the main input and output variables of the dryer.
In this case the method described in [17] is used. This is a simple but
robust method for self-tuning fuzzy logic controllers. The self-tuning of the FLC
is based on adjusting the scaling factor of the manipulated variable (Gcu in
Fig. 12.1) on-line, using a fuzzy rule base, according to the current state of
the controlled process. The rule base for tuning the scaling factor was defined
based on the error (e) and the change of error (∆e) using the most common
and unbiased membership functions. The design procedure follows the steps
shown in the previous section.
In this case the design procedure goes as follows:
1. Definition of the performance requirements of the controller. The aim of
the STFLC is to improve the performance of the existing fuzzy controllers
by utilizing the operator’s expert knowledge.
2. Definition of the fuzzy controller. The main controlled variable is the out-
put moisture of solids and the main manipulated variable is the input
temperature of drying air, which correlates with the fuel flow. The velocity
of solids, which correlates with the rotational speed of the screw conveyor,
can be used as an auxiliary manipulated variable, but it is not included
here. The main disturbances are the variations in the input moisture of
solids and the feed flow. The existing fuzzy controllers [21] were used.
3. Definition of the tuning variables. In the method of Mudi and Pal [17] the
only tuned variable is the scaling factor for the manipulated variable, which
is called the gain updating factor, Gcu. The input variables to the tuner
Fig. 12.7. Membership functions for the gain updating factor [3]
are the control error (e) and the change in error (∆e). The number and
location of the membership functions were chosen experimentally. According
to Figs. 12.5 and 12.6, five membership functions for the control error and three for
the change of error were found best in this case. Triangular membership
functions were chosen. The output of the tuner is the gain updating
factor with five membership functions (Fig. 12.7).
Table 12.1. The rule base for gain updating factor [3]
e\∆e N ZE P
VN VB M S
N VB B S
M B ZE B
P S M B
VP S M VB
4. Definition of the tuning rule base. The rules, based on an analysis of the
controlled response and on experience, are shown in Table 12.1; a simple lookup-table sketch of this rule base is given below.
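To illustrate how the rule base of Table 12.1 can be evaluated at run time, the sketch below stores it as a lookup table with singleton output values. The numeric values attached to the linguistic labels and the crisp selection of the labels for e and ∆e are assumptions made for illustration only; the actual tuner uses the membership functions of Figs. 12.5–12.7 and fuzzy inference.

```python
# Table 12.1: (error label, change-of-error label) -> gain updating factor label
TUNING_RULES = {
    ("VN", "N"): "VB", ("VN", "ZE"): "M",  ("VN", "P"): "S",
    ("N",  "N"): "VB", ("N",  "ZE"): "B",  ("N",  "P"): "S",
    ("M",  "N"): "B",  ("M",  "ZE"): "ZE", ("M",  "P"): "B",
    ("P",  "N"): "S",  ("P",  "ZE"): "M",  ("P",  "P"): "B",
    ("VP", "N"): "S",  ("VP", "ZE"): "M",  ("VP", "P"): "VB",
}
# Assumed singleton values for the gain updating factor labels
GAIN_VALUE = {"ZE": 0.0, "S": 0.25, "M": 0.5, "B": 0.75, "VB": 1.0}

def gain_updating_factor(e_label, de_label):
    """Crisp shortcut through Table 12.1 (the real tuner interpolates fuzzily)."""
    return GAIN_VALUE[TUNING_RULES[(e_label, de_label)]]

# The factor then modulates the output scaling factor of the FLC, e.g.
# cu(k) = (Gcu * alpha) * scu(k) with alpha = gain_updating_factor(e_label, de_label)
print(gain_updating_factor("VN", "N"))   # -> 1.0 (strong control action)
```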
Fig. 12.8. The behaviour of the output moisture of solids; the self-tuned direct FLC
and the direct FLC for a step change in the input moisture of solids from 2.4 m-%
to 3.4 m-%
Fig. 12.9. The behaviour of the output moisture of solids; the self-tuned hybrid
PI-FLC and the hybrid PI-FLC for a step change in the input moisture of solids
from 2.4 m-% to 3.4 m-% [3]
If both input variables are big and have the same sign, the gain updating factor is big (upper left and lower right corners of the table).
This means that strong control actions are carried out to get the controlled
variable inside the recommended region. If, on the other hand, both input
variables are big but have opposite signs (the situation is getting better), the
gain updating factor is small (upper right and lower left corners). If both
variables are small, the gain updating factor is zero.
12.5 Conclusions
This chapter has introduced the design of adaptive fuzzy controllers. These
controllers are used in the direct control of process variables, in this way
replacing conventional PI-controllers. Good results have been obtained with
rather simple structures that are also easy to install and maintain. However,
their design includes quite a few manual stages that require expert knowledge.
References
1. Leiviskä K. Control Methods. In “Process Control” edited by K. Leiviskä. Book
14 in Series on Papermaking Science and Technology, Fapet Oy, Finland, 1999.
2. Corripio A.B. Digital Control Techniques. AIChE, Washington, Series A, vol.
3, 1982.
13
Optimal Design Synthesis of Component-based Systems using Intelligent Techniques
P.P. Angelov, Y. Zhang and J.A. Wright
13.1 Introduction
Fig. 13.1. Flow chart of the design synthesis process: get the requirements from the building system design brief, initialise a population of genomes, apply problem-specific mutation, evaluate the individuals, perform stochastic ranking on the current population and, when the termination criteria are satisfied, generate the report(s); a supervisory GUI oversees the process
Fig. 13.2. Membership function (degree of acceptance) for the zone temperature requirement formulation, defined over the temperature axis by Tmin, Tset and Tmax
The design synthesis problem has been formulated as a fuzzy (soft) constraint
satisfaction problem, which in general can be described as:
Subject to:
hconfiguration.i(Xcomp, Ytopo) = 0 (13.2)
henergybalance.j(Xcomp, Ytopo, Zparam) ≅ 0 (13.3)
hoperation.l(Xcomp, Ytopo, Zparam) ≅ 0 (13.4)
Where:
Xcomp = (x1, . . . , xn) denotes the vector of integer variables which represents the set of components of a configuration;
Ytopo = (y1, . . . , yn+m) is the vector of integer variables representing the topology of a configuration;
Zparam = (z1, . . . , zm+1+k) denotes the vector of real variables which represents the operational settings of the system under a design condition, including airflow rates and coil duties;
hCond denotes the nominal occurrence (hours) of a design condition in a typical year;
Qcomp denotes the duty (kW) on a component under the design condition;
Wcirc denotes the fan power (kW) for air circulation in the system under a design condition.
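A compact way of holding these three groups of decision variables in an implementation is sketched below. The field names, types and example values are illustrative assumptions and not the genome encoding actually used in the ACG software.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DesignGenome:
    """Decision variables of the design synthesis problem (illustrative only)."""
    comp_set: List[int]       # Xcomp: integer codes of the components used
    topology: List[int]       # Ytopo: integer encoding of how components connect
    flow_duty: List[float]    # Zparam: airflow rates and coil duties (one design condition)

genome = DesignGenome(comp_set=[0, 1, 2, 7, 11],
                      topology=[2, 4, 7, 11, 0],
                      flow_duty=[0.31, 0.89, 0.03])
```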
Membership functions have been used for the definition of the zone temperature requirements
as more adequate representations of the human perception, which tolerates
slight violations of the standard requirements.
µ = 0, if T ≤ Tmin
µ = (Tmin − T)/(Tmin − Tset), if Tmin < T ≤ Tset
µ = (T − Tmax)/(Tset − Tmax), if Tset < T < Tmax
µ = 0, if T ≥ Tmax
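Under the assumption that the temperatures are plain floating-point values, this degree-of-acceptance function can be transcribed directly as follows.

```python
def zone_temperature_acceptance(T, T_min, T_set, T_max):
    """Triangular degree of acceptance of a zone temperature T (peak at T_set)."""
    if T <= T_min or T >= T_max:
        return 0.0
    if T <= T_set:
        return (T_min - T) / (T_min - T_set)
    return (T - T_max) / (T_set - T_max)

# Example with assumed limits: set point 22 degC, tolerated band 20-24 degC
print(zone_temperature_acceptance(21.0, 20.0, 22.0, 24.0))   # -> 0.5
```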
The last, third set of (inequality) constraints concerns the working range
of components of the HVAC system for a feasible operating condition. Depending
on the type of the component, these constraints can include feasible
ranges of airflow, temperature, and humidity. Taking the feasible range of the air
temperature of the airflow entering a cooling coil as an example, the degree of
acceptance can be formulated by a trapezoidal membership function, as shown
in Fig. 13.3:
µ = 0, if T ≤ Tmin
µ = (Tmin − T)/(Tmin − Tlb), if Tmin < T < Tlb
µ = 1, if Tlb ≤ T ≤ Tub
µ = (Tmax − T)/(Tmax − Tub), if Tub < T < Tmax
µ = 0, if T ≥ Tmax
The aggregation of the fuzzy constraints (13.3), (13.4) is performed by the min
operator, which represents the logical “AND”.
Fig. 13.3. Trapezoidal membership function (degree of acceptance) for the feasible range of air temperature, defined by Tmin, Tlb, Tub and Tmax
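Keeping the same plain-float representation, the trapezoidal acceptance of Fig. 13.3 and the min-aggregation of several soft constraints can be sketched as follows; the limits and the constraint values in the example are purely illustrative.

```python
def trapezoidal_acceptance(T, T_min, T_lb, T_ub, T_max):
    """Trapezoidal degree of acceptance for a working-range constraint (Fig. 13.3)."""
    if T <= T_min or T >= T_max:
        return 0.0
    if T < T_lb:
        return (T_min - T) / (T_min - T_lb)
    if T <= T_ub:
        return 1.0
    return (T_max - T) / (T_max - T_ub)

def overall_acceptance(degrees):
    """Aggregate the degrees of the soft constraints with min (logical AND)."""
    return min(degrees)

# Two illustrative working-range constraints evaluated for a candidate design
degrees = [trapezoidal_acceptance(16.0, 10.0, 14.0, 20.0, 26.0),   # -> 1.0
           trapezoidal_acceptance(24.0, 10.0, 14.0, 20.0, 26.0)]   # -> 0.33
print(overall_acceptance(degrees))   # limited by the least satisfied constraint
```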
The optimization problem thus combines fuzzy constraints and a crisp objective (13.1). The stochastic ranking method
[14] is used for the numerical solution of this fuzzy optimization problem. It selects
probabilistically between the minimization of the energy consumption
and the improvement of the degree of acceptance for an HVAC design.
It should be mentioned that the proposed approach is open and easily allows
any group of constraints to be added or modified.
A simplified HVAC system design problem is set up for testing the capability of
this approach in synthesizing optimal designs. A virtual small office building
that contains two air-conditioned zones is located in a continental climate where
summer is hot and winter is cold. Three typical design conditions are selected
to represent the operation in summer, winter and the swing season, respectively.
A conventional system design is selected from a general design manual as the
benchmark of performance. Figure 13.4 shows the configuration of the conventional
design with its optimal operational setting for the summer condition.
The genome encoding is shown in Fig. 13.5.
An optimal design for the given set of design conditions is then automatically
generated using the software (ACG) written in Java. Figure 13.6 shows
one of the generated designs, together with its operational setting in the summer
condition. The genome of the design is also shown in Fig. 13.7. It should
be mentioned that the configuration of the synthesized design is significantly
different from the conventional one: the two zones are arranged successively
rather than in parallel. This allows better air distribution and energy
reuse in the system.
The annual energy consumption associated with the conventional and generated
configurations is compared in Table 13.1. The operation throughout a
Fig. 13.4. Configuration of the conventional design with its optimal operational setting for the summer condition
Fig. 13.5. Genome encoding of the conventional design (CompSet, Topology and FlowDuty vectors for the three design conditions)
year would result in a significant energy saving (3575 kWhr, or 25%). It should
be noted, however, that the actual saving highly depends on the particular
design specification and therefore this figure should not be generalized.
Two sets of experiments are performed in order to evaluate the ability of the
approach to adapt the synthesized design to changes in the design requirements
and/or weather conditions. The first set involves a small change in the
zone requirement. The temperature of zone (11) is reduced from 24◦C to 22◦C,
which is very common in reality when an occupant turns down the preferred room
temperature. The new optimal operational setting for the generated design is
Fig. 13.6. One of the generated designs with its operational setting in the summer condition
Fig. 13.7. Genome encoding of the generated design (CompSet and Topology vectors)
shown in Fig. 13.8, where the new zone temperature and the redistributed
loads on the cooling coils are highlighted.
Three experiments have been performed to illustrate the ability of the
approach to adapt the previously optimized operational conditions in the new
optimization task:
Table 13.1. Annual energy consumption (kWhr) of the conventional and the generated configurations
Design condition, hours | Conventional design (total) | Generated design (total) | Saving
Cond1 – Summer, 650.0 | 10321.8  265.7  (10587.5) | 6658.9  484.8  (7143.7) | 3443.8
Cond2 – Winter, 650.0 | 3368.7  105.8  (3474.4) | 3061.4  147.7  (3209.1) | 265.3
Cond3 – Swing, 1300.0 | 0.0  296.2  (296.2) | 0.0  430.3  (430.3) | −134.1
Fig. 13.8. The generated design with its new optimal operational setting after the zone temperature requirement is reduced to 22◦C (redistributed loads on the cooling coils)
The experiments are repeated 10 times each, and the averaged results are
shown in Fig. 13.9. The third experiment performs significantly better than
the other two in terms of the convergence of the objective (energy consumption)
and the average number of generations required to find the first feasible
(without constraint violation) solution. The second experiment has an intermediate
performance, though the difference in the average number of generations
required to find the first feasible solution is insignificant compared to the
first experiment.
[Plot of the objective (energy consumption, kW) against the generation number for runs started with a seeded population and with an existing population; 22 generations before the first feasible solution]
Fig. 13.9. Adaptation of the design synthesis to a small change in the design
requirements
Fig. 13.10. The generated design with its optimal operational setting after the larger change in the design requirements
[Plot of the objective (energy consumption, kW) against the generation number for a run started with a seeded population; 66 generations before the first feasible solution]
Fig. 13.11. Adaptation of the design synthesis to a larger change in the design
requirements
All of the three experiments described above have been performed on the
new design settings. The characteristics of the performance in all three experiments
(Fig. 13.11) are similar to those observed when the change in the
design requirements is small (Fig. 13.9). However, for this set of experiments,
it takes the EA significantly more generations to find the first feasible
solution than in the corresponding experiments of the first set.
• Other search algorithms can be suitable for part of the problem (gradient-
based search for the physical variables optimization; integer programming
or combinatorial search for the configuration optimization), but there is
no single algorithm or method that outperforms the others for the overall
problem of design synthesis;
• It is very important to properly formulate the constraints in this optimization
problem; it is a highly constrained one, and the result is only one
single optimal configuration or a very small number of configurations with
similar performance. The constraints can be formulated using fuzzy sets theory
and fuzzy optimization, respectively [15, 16];
• Be careful in formulating the parameters of the membership functions
of the fuzzy constraints. They can be logical physical values (zero flow
rate, outside temperature, etc.) or tolerances (Tmin, Tmax) based on
experience; e.g. the comfort temperature lies in the range
[20◦C, 22◦C], it is not a fixed number;
• There is no significant difference in the results with respect to the type of
membership functions (because the optimization leads to a single solution
or a small number of optimal solutions); therefore, use the Gaussian or
triangular type, which are convenient for differentiation and encoding;
• It has been demonstrated that different configurations are needed for different
seasons, but this can come as a result of switching off some of the
capacities (coils) during one season and other coils during the other
season (compare Fig. 13.8 and Fig. 13.10, including the parameter values);
• It has been demonstrated that using the best solution from a similar problem
speeds up the search significantly (Fig. 13.9 and Fig. 13.11,
the line in the middle), and it has also been demonstrated on a practical
example that using the whole population from a similar problem
speeds up the search even more (the same figures, the line at the bottom).
13.6 Conclusions
In this chapter we have considered a special case of application of intelligent
techniques, namely evolutionary algorithms and fuzzy optimization for au-
tomatic synthesis of the design of a component-based system. We have also
considered the problem of adaptation of the solutions found and the ways it
can be done “smartly”, maximizing the reuse of the previous knowledge and
search history. Some useful hints and tips for practical use of these techniques
in their application to different problems have been given.
The approach we have presented for automatic synthesis of optimal design
of component-based systems is novel and original. It can be applied to similar
problems in the automatic design of complex engineering systems, such as petrochemical
pipelines, electronic circuit design, etc. The complex problem of design
synthesis, which involves human decisions at its early stages, is treated as a
soft constraints satisfaction problem. EA and problem-specific operators are
used to numerically solve this fuzzy optimization problem. Their use is mo-
tivated by the complex nature of the problem, which results in the use of
different types of variables (real and integer) that represent both the physical
and the topological properties of the system. The ultimate objective of this
fuzzy optimization problem is to design a feasible and efficient system.
The important problem of the design adaptation to the changed speci-
fications (constraints) has also been considered in detail. This adds to the
flexibility of the approach and its portability. The approach which is pre-
sented in this chapter has been tested with real systems and is realized in
Java. It fully automates the design process, although interactive supervision
by a human-designer is possible using a specialized GUI.
Acknowledgement
This particular problem of the optimal HVAC design synthesis has been sup-
ported by the American Society for Heating Refrigeration and Air-Condition-
ing Engineering (ASHRAE) through the RP-1049 research project.
References
1. Angelov P P, Zhang Y, Wright J A, Buswell R A, Hanby V I (2003) Automatic
Design Synthesis and Optimization of Component-based Systems by Evolution-
ary Algorithms, In: Lecture Notes in Computer Science 2724 Genetic and Evo-
lutionary Computation (E. Cantu-Paz et al. Eds.): Springer-Verlag, pp. 1938–
1950.
2. Angelov P P, Hanby V I, Wright J A (1999) An Automatic Generator of HVAC
Secondary Systems Configurations, Technical Report, Loughborough University,
UK.
3. Michalewicz Z (1996) Genetic Algorithms + Data Structures = Evolution Pro-
grams. 3rd edn. Springer-Verlag, Berlin Heidelberg New York.
4. Blast (1992) Blast 3.0 User Manual, Urbana-Champaign, Illinois, Blast Support
Office, Department of Machine and Industrial Engineering, University of Illinois.
5. York D A, Cappiello C C (1981) DOE-2 Engineers manual (Version 2.1.A),
Lawrence Berkeley Laboratory and Los Alamos National Lab., LBL-11353 and
LA-8520-M.
6. Park C, Clarke D R, Kelly G E (1985) An Overview of HVACSIM+, a Dynamic
Building/HVAC/Control System Simulation Program, In: 1st Ann. Build En-
ergy Sim Conf, Seattle, WA.
7. Klein S A, Beckman W A, Duffie J A (1976) TRNSYS – a Transient Simulation
Program, ASHRAE Trans 82, Pt. 2.
8. Sahlin P (1991) IDA – a Modelling and Simulation Environment for Building
Applications, Swedish Institute of Applied Mathematics, ITH Report No 1991:2.
9. Wright J A, Hanby V I (1987) The Formulation, Characteristics, and Solution
of HVAC System Optimised Design Problems, ASHRAE Trans 93, pt. 2.
10. Hanby V I, Wright J A (1989) HVAC Optimisation Studies: Component Mod-
elling Methodology, Building Services Engineering Research and Technology, 10
(1): 35–39.
11. Sowell E F, Taghavi K, Levy H, Low D W (1984) Generation of Building Energy
Management System Models, ASHRAE Trans 90, pt. 1: 573–586.
12. Volkov A (2001) Automated Design of Branched Duct Networks of Ventilated
and air-conditioning Systems, In: Proc. of the CLIMA 2000 Conference, Napoli,
Italy, September 15–18, 2001.
13. Bentley P J (2000) Evolutionary Design by Computers, John Willey Ltd.,
London.
14. Runarsson T P, Yao X (2000) Stochastic Ranking for Constrained Evolutionary
Optimisation, IEEE Transactions on Evolutionary Computation, 4(3): 284–294.
15. Angelov P (1994) A Generalized Approach to Fuzzy Optimization, International
Journal of Intelligent Systems, 9(4): 261–268.
16. Filev D, Angelov P (1992) Fuzzy Optimal Control, Fuzzy Sets and Systems,
v. 47 (2): 151–156.
14
Intelligent Methods in Finance Applications:
From Questions to Solutions
M. Nelke
14.1 Introduction
Smart adaptive systems are systems developed with the aid of intelligent
technologies, including neural networks, fuzzy systems, methods from machine
learning and evolutionary computing, and they aim to develop adaptive behaviour
that converges faster, more effectively and in a more appropriate way
than standard adaptive systems. These improved characteristics are due
either to learning and/or reasoning capabilities or to the intervention of and/or
interaction with smart human controllers or decision makers. Adaptation is
referred to in two ways: the first has to do with robustness, i.e. systems
that adapt in a changing environment; the second has to do with portability,
i.e. how a system can be transferred to a similar site with minimal changes.
The need for adaptation in real application problems is growing.
The end-user is not technology dependent: he just wants his problem to be
solved [3].
The area of Business Intelligence has been one of the most active ones in
recent years. Business Intelligence is a term describing new technologies and
concepts to improve decision making in business by using data analysis and
fact-based systems. Business Intelligence encompasses terminology including
Executive Information Systems, Decision Support Systems, Enterprise Information
Systems, Management Support Systems and OLAP (On-Line Analytical
Processing), as well as technologies such as Data Mining, Data Visualization
and Geographic Information Systems. It also includes the enabling technologies
of Data Warehousing and Middleware that are critical components of
many Business Intelligence systems. The need to develop customer relationship
models, to mine large business databases and to find effective interfaces is among
the hot topics in the area. Still, the business environment is characterized by
complexity and is subject to external factors that make traditional models
operate only under ideal conditions. The problem of building systems that
adapt to the users and to particular situations is present more than ever, as
competition in this area is pressing for effective and aggressive solutions.
Intelligent technologies, such as fuzzy logic and neural networks, have long since
emerged from the ivory towers of science to the commercial benefit of
many companies. In a time of ever more complex operations and tighter re-
sources conventional methods are all too often not enough to maintain a com-
petitive advantage. The technology race has accelerated its pace and impacts
globally. This has led to increasing competitiveness, with many companies
having seen themselves marooned. Intelligent technologies can be applied to
the efficient solution of problems in many business and industrial areas. Both
fuzzy technologies as well as neural networks are well proven and are already
bringing rewards to those who have sought to adopt them.
Recent years have witnessed fundamental changes in the financial service
sector with competition between banking and credit institutions intensifying
as a result of the changing markets. The products and services offered by
the various financial institutions have become interchangeable, as a result of
which their perceived images have lost their sharp focus. Today, there are
hardly any remaining banking institutions which specialize within only one
specific customer group or with only one traditional banking product.
Against this background, the need to cater for the specific requirements
of target customer groups has become the guiding principle behind the busi-
ness strategies adopted by banking institutions. The importance of a trusting
relationship between a bank and its customers, and the fostering of this re-
lationship, have also come to be recognized as decisive factors in financial
transactions. Accordingly, the building and nurturing of this trust between
bank and customer is now at the center of a bank’s strategic planning.
It is not simply the provision of services by the bank to the customer, but
moreover the fulfillment of a customer’s needs with the right banking products
and services which has come to be seen as important. These customer needs,
and subsequently the provision of the appropriate products and services to
the customer, are dependent on the personal attributes of the individual, e.g.
his age, income, assets and credit worthiness, see Sect.14.3.2.
Due to the steady growth of the number of business transactions on one
side and of insolvencies on the other side, credit insurance plays an increasing
role in the economy. Since the number of proposals which have to be decided
by credit insurance companies is also steadily growing, automated systems
for credit analysis are needed. Systems already in operation cannot model expert
behavior well enough. Therefore, for the optimization of the credit analysis and for
making objective decisions via an automated software system, an approach
based on Business Intelligence technology is used, see Sect. 14.3.1.
The requirements for computer systems used for advanced analyses in finance
applications are derived from the necessity of adapting to
the dynamic changes in the customer's behavior and situation, and from the targets
of increasing automation and better performance, saving costs and fulfilling
external directives (e.g. laws and regulations for credits).
To develop a successful business solution based on a smart adaptive system,
first of all it is necessary to focus on questions. Questions are asked by the
end-user who wants to benefit from the system during his (daily) business.
His questions like
• What will be my benefits in terms of time/cost savings or increased mar-
gins?
• Do the existing interfaces support an easy integration into our existing
IT-systems?
• Can I compare different parameter settings in the way of simulation of
business scenarios?
• Will I understand the behavior of the smart adaptive system?
• How much consultancy do I need when changes on the system are required
caused by new boundary conditions?
lead to basic requirements for the system development:
• Technology, e.g.
– use of well-known, established methods and technologies
– supporting all necessary steps of the system development (interfaces,
model development, model analysis, verification)
• Understanding, e.g.
– rule-based models are easier to understand than non-linear formulas
– reports showing the business results instead of data tables
– user interfaces for end-users based on their business processes
• Adaptation, e.g.
– data pre-processing finding appropriate parameter pre-settings
– flexibility to adapt and compare different models to find the best one
– running the same models over time for dynamic processes
In the next section the data analysis process as a path to developing a smart
adaptive system solution is described.
Will the storage systems and interfaces used (files, databases, . . . ) work under
growing security and extensibility needs?
The selection of appropriate methods depends on the required technical
performance (modeling results, computing time) and on business needs, e.g. the
presentation of the results and the interpretability of the models. From the point of
execution, most technologies can be compared and they perform well. But based
on the available raw data, the model building process differs. For example,
if no training data is available, a knowledge-based approach with a fuzzy rule
base derived from an expert by questioning is a possible choice. If training data
is available, is there also test data available? That means, can we use a supervised
method for classification because we know the classes to model, or
do we first have to find the clusters with an unsupervised technique and label
them using expert knowledge (Fig. 14.3)? In general, rule-based methods like
decision trees or fuzzy rule base models allow an easy understanding by using
a model representation with readable rules and should be preferred when results
are to be presented to non-data-analysis experts.
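As a small illustration of the point about readable rules, the sketch below trains a decision tree on a synthetic toy data set and prints it as a set of if-then rules; scikit-learn and the toy data are assumptions for illustration and are not the tools or data used in the applications described later.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic toy data: [monthly income, capital] -> advisory level (0 = standard, 1 = premium)
X = [[2000, 5000], [3000, 10000], [8000, 60000], [12000, 90000],
     [2500, 8000], [9000, 80000], [1500, 2000], [11000, 120000]]
y = [0, 0, 1, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# export_text renders the trained tree as human-readable if/else rules
print(export_text(tree, feature_names=["income", "capital"]))
```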
The following remarks and comments may help the inexperienced designer
to choose the appropriate methods:
• Decision Tree: This method recursively partitions the data so as to derive
a decision tree (a set of decision rules) for predicting outputs from inputs.
The decision tree is constructed top-down, starting with a root node and
incrementally adding variable-testing decision branches. The process con-
tinues until decisions can be added to the tip nodes correctly classifying all
the given examples. Typically, decision trees have continuous or discrete
inputs and discrete outputs. Training and test data has to be provided,
and the model can be described in rules.
• Neural Network: The concept of artificial neural networks is to imitate
the structure and workings of the human brain by means of mathematical
models. Three basic qualities of the human brain form the foundations of
most neural network models: knowledge is distributed over many neurons
within the brain, neurons can communicate (locally) with one another
and the brain is adaptable. In the case of supervised learning, in addi-
tion to the input patterns, the desired corresponding output patterns are
also presented to the network in the training phase. The network calcu-
lates a current output from the input pattern, and this current output is
compared with the desired output. An error signal is obtained from the
difference between the generated and the required output. This signal is
then employed to modify the weights in accordance with the current learn-
ing rule, as a result of which the error signal is reduced. The best-known
and most commonly employed network model here is the multilayer per-
ceptron with backpropagation learning rule. It has continuous or discrete
inputs, continuous outputs. Training and test data has to be provided. In
the case of unsupervised learning, the network is required to find classifica-
tion criteria for the input patterns independently. The network attempts
during the modeling phase (clusters and outliers). A log file helps to analyze
consistency checks at any stage of the modeling process.
If you use new technologies, compare the results on known solutions with
other (traditional) methods to make your customers trust the technology.
To find proper parameter settings for a special task, start from an existing
solution with one-dimensional parameter variation. Tools can support the
model evaluation and analysis, see Fig. 14.4 and [4].
What is important? Success factors are a target-oriented conceptual design
and detailed specifications, a real-world data base which allows good-performing
models to be developed using expert knowledge about the business process
and the modeling technologies, as well as teamwork between computer scientists,
data analysts and the business users.
After successful modeling, the implementation and integration of the
system have to consider security, safety and maintenance aspects.
In the next section some real world examples are presented.
14.3 Applications
interviewing some credit managers to find out their decision procedures. The
features and rules have been determined. The execution of the first rule set
was not as expected, but after an analysis of the deviations on a small number
of cases using a neural network, its connection weights helped to modify the
rule base structure and parameters, leading to a successful solution.
Financial services providers need to comply with private customers' requirements
for an individual support service and, at the same time,
to make best use of the advisory resources; that means putting consultancy
capacity where it is most cost effective. Usually, due to the present assignment
of the clients, financial services providers are not operating at full capacity,
and the available potential is not fully exploited. In order to classify the
customers into different customer advisory levels, a data-based analysis of the
clients by income, capital and securities is applied. Following the request for
transparency and objectivity of the customer segmentation, an automated solution
has been developed which applies intelligent technologies such as Fuzzy
Set Theory for analyzing customer profiles and discovering new information
from the data. The solution offers a flexible adaptation of all parameters and
also considers the dynamic changes of a client's environment. Compared to
euro and capital of 74.000 euro). Typical common selections are based on simple
queries on the crisp limits (income more than 70.000 euro or capital more than
75.000 euro), whereas the designed scoring method combines both income and
capital using a parameterized inference (adjustable between the maximum and the
sum of the values for income and capital).
At the beginning, different information is collected for each client, covering
demographic, organizational and asset data. Master features like income, capital,
securities and others are derived from these. Single data sets belonging
to members of the same family are associated into one client data set representing
the family association. Depending on the structure of the family (number
of adults, number of children) and the age, the different master features are
scaled by parameterized functions. For example, an income of 3.000 EURO
earned by a young executive can be treated in the same way as the income
of 5.000 EURO of a middle-aged manager. Treating age and family status as
input features for the model (and not as parameters) makes it possible to use one
adaptive model for the scoring of different customer unions.
For each scaled master feature a partial score is calculated using a membership
function from Fuzzy Set Theory. A client with a monthly income of
5.000 EURO will get a score of 0.7, a monthly income of 10.000 EURO will
receive a score of 0.9, and a monthly income of 25.000 EURO or more will be
scored with a value of 1.0 (see the example in Fig. 14.9). Each feature is related
to a set of membership functions for different ages of customers. Based on the
partial scores of each master feature, a final score for each client is calculated
using a parameterized inference function. This score has a value between 0
and 100. The clients are assigned to the advisory levels and the related organization
units depending on the score value and the available capacity for the
consultancy service on that level; the system also offers the opportunity to use
separate conditions to handle exceptions from the standard assignment by score.
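A hedged sketch of this scoring step is given below: a piecewise-linear membership function reproduces the income scores quoted above, and a single parameter gamma blends the aggregation between the maximum and the (capped) sum of the partial scores. The anchor point below 5.000 EURO, the blending operator and all other numeric details are assumptions; the membership functions and inference actually used in mit.finanz.ks are not reproduced here.

```python
import numpy as np

# Partial score for the monthly income (EURO); the anchor at 0 is an assumption
INCOME_POINTS = [0.0, 5000.0, 10000.0, 25000.0]
INCOME_SCORES = [0.0, 0.7, 0.9, 1.0]

def income_score(income):
    """Piecewise-linear membership: 5,000 -> 0.7, 10,000 -> 0.9, >= 25,000 -> 1.0."""
    return float(np.interp(income, INCOME_POINTS, INCOME_SCORES))

def final_score(partial_scores, gamma=0.5):
    """Aggregation adjustable between the maximum (gamma = 0) and the capped sum (gamma = 1)."""
    combined = (1 - gamma) * max(partial_scores) + gamma * min(1.0, sum(partial_scores))
    return 100.0 * combined       # final score on the 0..100 scale

# Illustrative client: partial scores for income and capital
print(final_score([income_score(10000.0), 0.6], gamma=0.3))
```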
capital and income. Based on this, the estimated effort for the consultancy
service is calculated and used during the assignment to optimize the capacity
utilization.
The application is based on a distributed system architecture for off-line
use. It has two main components: a database with flexible interfaces to
allow the data exchange with the financial services provider's systems, and the
mit.finanz.ks software application for the data pre-processing, scoring, assignment
and reporting. Two other components support the main system: for the
data pre-processing, a client association identification tool to generate additional
association information and to support data administration, and for the
evaluation and use of the new assignment in the organization, a web-based solution
to support the consultants in executing the clients' transition from one
advisory level to another. The needs of financial services providers to comply
with private customers' requirements for an individual support service and,
at the same time, to make best use of the advisory resources in the framework
of customer relationship management can be fulfilled with the mit.finanz.ks
approach. The flexible adaptation of data selection, data pre-processing and
data analysis is followed by a reporting facility. The design of the interactive
and user-friendly graphical interface enables easy browser access for the
end-users, as confirmed by different financial services providers in Germany.
The main advantages to be noticed are:
• Efficient customer care capacities
• Increase of the number of attended clients and efficient use of customer care
capacities
Generic lessons learned from these applications and the experience with the
customers (system users) are discussed in the following. Is there a market place
for smart adaptive systems? Yes! Looking at the systems and projects devel-
oped so far, the current situation for applications of smart adaptive systems in
financial business is determined by headlines like “competition requires cost
reductions” and “mass marketing is out of date, customers demand individual
service”. The marketing budgets are limited, but increasing the market share
is only possible by detailed knowledge of market and customers. That is why
solutions like
• customer segmentation
• customer scoring and identification of future buyers of a certain product
• support of sales representatives
• yield management
• response forecast for mail-order houses
• migration analysis/churn management
• product placement and basket analysis
are required. These solutions are a market place for system developers if they
can point out the benefits of smart adaptive systems in those fields. The
guiding principle of soft computing – “exploit the tolerance for imprecision,
uncertainty and partial truth to achieve tractability, robustness and low solution
costs” – can contribute here because, unlike traditional approaches, soft computing
aims at an accommodation with the pervasive imprecision of the real world.
What was important? Teamwork right from the start and support of the
whole process. A team of computer scientists, data analysts and business users
has to work together, covering requirement analysis and definition of project
goals, allocation of data, modeling, validation of the results, software integra-
tion, testing, training of staff and integration into daily workflow. Especially
14.4 Conclusions
The design of a successful smart adaptive system in finance applications is a
complex iterative process. Having adaptation as the main focus here means
that such systems should offer solutions whose results are independent of
(minor) changes in the existing environment, covering
• data (providing, pre-processing, handling, storing),
• features and modeling techniques,
• interfaces/software design.
For a successful design of a smart adaptive system, you should answer yes
to the following questions:
• Have you planned to set up a raw prototype for the whole system at an
early stage covering all single steps of the whole process? Each step should
be treated separately in detail when necessary as shown in the established
rapid prototyping e.g. of automotive industry. Give an overview of the
whole system as early as possible.
• Will the selected data sources continue providing all required information
in the future? If it is planned to build a model for periodic use, check
whether all the required input data will be available in the correct format, scale,
period and data range.
• Do the interfaces offer fast and flexible access and do they consider
security aspects? Use encryption to make sure that information is hidden
from anyone for whom it is not intended, even those who can see the
encrypted data. If necessary, add a user administration to give access
only to authorized people.
• Can you use model parameters as input features for a more general model
to reduce manual adaptation efforts? Instead of building different models
e.g. for car buying behavior in the four seasons, build one model and
include the season as additional input parameter.
• Is the presentation of the results easy to understand also for non-experts?
This is very important to show the benefit of a model and to convince
customers to buy your system. Especially when risk evaluations or licensing
procedures are carried out, a detailed explanation of the model for non-
experts may be necessary.
• Does the evaluation phase in a real-world environment show the expected
results and a performance that fulfills the user requirements? You may work with
complex scientific modeling, colored graphics and nice user interfaces, but
at every stage keep the focus on the problem to be solved (management by objectives).
Artificial approaches must learn from user experience during the system devel-
opment process.
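As a concrete (and purely illustrative) reading of the question about reusing one model across seasons, the sketch below encodes the season as an input feature of a single model instead of training four separate seasonal models; the data, column names and the use of scikit-learn are assumptions of this example, not part of the chapter.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "income":   [30, 45, 52, 28, 60, 41, 35, 48],
    "season":   ["winter", "spring", "summer", "autumn",
                 "winter", "summer", "spring", "autumn"],
    "buys_car": [0, 1, 1, 0, 1, 0, 1, 0],
})
# one-hot encode the season so a single model covers all four seasons
X = pd.get_dummies(df[["income", "season"]], columns=["season"])
model = LogisticRegression().fit(X, df["buys_car"])
print(model.predict(X.iloc[:2]))
```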
15
Neuro-Fuzzy Systems for Explaining Data Sets
D.D. Nauck
15.1 Introduction
Our modern world is constantly growing in complexity. This complexity man-
ifests itself, for example, in the form of information overload. Nowadays, we
perceive a vast amount of information from channels like e-mail and other
Internet-based services, hundreds of TV channels, telecommunication devices
etc. Businesses have to deal with huge amounts of data they collect within
their organization or from transactions with customers. It is paramount for
businesses to put the collected data to good use [2]. Examples are dynamic
workforce management based on internally collected data or customer rela-
tionship management based on transaction data gathered from Internet click-
streams or more mundane transactions like those at supermarket check-outs.
In order to transform data into valuable information an intelligent ap-
proach to data analysis is required. We do not simply require models for
prediction which we would blindly follow and let them determine our busi-
ness decisions. In order to properly facilitate our decision making processes
we also require models for understanding, i.e. models that provide us with
explanations. In this chapter we discuss how neuro-fuzzy systems can be used
to generate explanations about data.
In data analysis applications an interpretable model of the data is espe-
cially important in areas
• where humans usually make decisions, but where machines can now sup-
port the decision making process or even take full responsibility for it,
• where prior knowledge is to be used in the data analysis process and the
modification of this knowledge by a learning process must be checked,
• where solutions must be explained or justified to non-experts.
The application that we discuss in this chapter belongs to the last area.
At BT Exact we have developed ITEMS, an intelligent data analysis system
for estimating and visualizing the travel patterns of a mobile workforce [1,
5, 6]. The application allows users to track journey times and to generate
explanations why a particular journey was possibly late, for example. We use
the neuro-fuzzy approach NEFCLASS to generate such explanations.
In Sect. 15.2 we discuss some aspects of interpretable fuzzy systems and in
Sect. 15.3 we consider the special circumstances of generating models for ex-
planation instead of prediction. Section 15.4 explains how we use NEFCLASS
to build explanatory rules and in Sect. 15.5 we present an example in the
context of ITEMS before we conclude the chapter.
15.2 Interpretable Fuzzy Systems
When we want to use fuzzy rules to explain data, we have to consider questions like:
• What kind of explanations does the user want and can understand?
• How can we build such rules quickly in interactive applications?
• How do we present the explanations?
In this chapter we only consider Mamdani-type fuzzy rules of the form
if $x_1$ is $\mu_1$ and $\ldots$ and $x_n$ is $\mu_n$ then $\ldots$,
i.e. fuzzy rules that use operators like “or” and “not” or linguistic hedges
are not discussed here. The consequent of a fuzzy rule can be a linguistic
term (regression) or a class label (classification). The following list describes
a number of desirable features of an interpretable fuzzy rule base.
• Avoiding redundancy, i.e. there should not be a pair of rules $(A, C)$ and
$(A', C')$ with $A' \subseteq A \wedge C' = C$.
• Avoiding exceptions, i.e. there should not be a pair of rules $(A, C)$ and
$(A', C')$ with $A' \subseteq A \wedge C' \neq C$ (a programmatic check of both
conditions is sketched after this list).
Although exceptions can be a way to reduce the number of rules they can
make the rule base harder to read.
• Fuzzy partitions should be “meaningful”, i.e. fuzzy sets should be normal
and convex and must keep their relative positions during learning.
• There are two possible objectives when searching for an interpretable fuzzy
rule base. We can look for global or local interpretability. Global inter-
pretability means we can understand the (small) rule base completely,
while local interpretability means we can at least understand a number of
rules that apply at a time to a particular input situation. Depending on
the application, local interpretability – i.e. understanding why we obtain
a particular output from the rule base – may be sufficient.
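The redundancy and exception conditions from the list above can be checked mechanically when antecedents are represented as sets of conditions. The following sketch uses illustrative data and names; it is not taken from the chapter.

```python
def redundancies_and_exceptions(rules):
    """rules: list of (antecedent, consequent) pairs; an antecedent is a
    frozenset of (variable, linguistic term) conditions."""
    redundant, exceptions = [], []
    for a, c in rules:
        for a2, c2 in rules:
            if a2 < a:  # A' is a proper subset of A, i.e. a more general rule
                (redundant if c2 == c else exceptions).append(((a, c), (a2, c2)))
    return redundant, exceptions

# toy rule base: the second rule is an exception to the first
r1 = (frozenset({("start_hour", "small")}), "late")
r2 = (frozenset({("start_hour", "small"), ("destination", "End1")}), "on time")
print(redundancies_and_exceptions([r1, r2]))
```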
When neuro-fuzzy systems are applied in data analysis the usual target is
to build a reasonably accurate predictor with a reasonably interpretable rule
base. Obviously, there is a trade-off between accuracy and interpretability and
depending on the application the solution will concentrate more on prediction
than interpretability, or vice versa.
In order to obtain a model that can be used for prediction we have to
avoid over-generalization. This means we usually split the available data
into training and validation sets. In order to obtain a useful estimate of the
error on unseen data this procedure is usually repeated several times (n-fold
cross-validation).
When we build a model for explanation we have a different target. The
model will not (and must not) be used for prediction. The idea of the model
is to provide a summary of the currently available data in a meaningful way.
In this sense an explanatory model is like a descriptive statistic of a sample,
where we are only interested in describing the sample and not the underlying
population.
That means to build an explanatory model we will use all available data
and do not worry about over-generalization. Actually, we try to get the best
possible fit to the data because that means we obtain the most accurate
description of the data. We will also restrict the degrees of freedom for our
model in order to obtain a small, simple and interpretable model. Obviously
this leads to the same dilemma as for predictive models: the trade-off between
accuracy and interpretability.
The demand for a predictive model usually does not impose strict time
constraints on the model building process. The only real requirement is that
the model is available before the first prediction is due. Explanations, however,
are more likely to be demanded in a real time process. A typical scenario is
that a user explores some data and demands an explanation for some findings.
It is then not acceptable to wait a long time for a model. The model building
process must be completed within seconds or minutes at most.
This time constraint forbids an extensive search for the smallest, most in-
terpretable model with an acceptable accuracy. That means in a real world
terpretable model with an acceptable accuracy. That means in real-world
applications there is typically no time for building a model, checking its in-
terpretability and then restarting the model building with different parameters
until we have reached the best result. However, it may be possible to do this
in parallel, if the computing resources are available.
Therefore, we are interested in learning approaches that either try to build
small models from the beginning or are capable of pruning a large model as
part of the learning process.
Rule induction methods based on decision trees try to build small models
by selecting the most promising variables first in the hope of not requiring
all variables for the final model [7, 19]. Another option is to use a structure-
oriented fuzzy rule learning method [16] as it is implemented in the neuro-
fuzzy system NEFCLASS (see also Chap. 7). This approach first detects all
rules that are induced by a data set, selects the best rules based on a rule
performance measure, trains the fuzzy sets, and then prunes the rule base. For
ITEMS we have implemented explanation facilities based on (crisp) decision
trees [20] and neuro-fuzzy rules. Users can select which version of rules they
prefer.
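To illustrate the rule-detection and rule-selection steps of such a structure-oriented method, here is a simplified sketch in the spirit of, but not identical to, the NEFCLASS learning algorithm; all names and data structures are assumptions of this example.

```python
from collections import defaultdict

def detect_rules(data, labels, partitions, max_rules=10):
    """data: list of feature vectors; labels: class per vector;
    partitions[i]: dict mapping a linguistic term to its membership
    function for variable i. Returns the best-scoring rules."""
    perf = defaultdict(float)
    for x, c in zip(data, labels):
        # rule detection: antecedent = best matching fuzzy set per variable
        ante = tuple(max(partitions[i], key=lambda t: partitions[i][t](v))
                     for i, v in enumerate(x))
        # rule performance: accumulated activation of this antecedent per class
        act = min(partitions[i][t](v) for i, (t, v) in enumerate(zip(ante, x)))
        perf[(ante, c)] += act
    # rule selection: keep the best consequent per antecedent, then the
    # max_rules rules with the highest performance overall
    best = {}
    for (ante, c), score in perf.items():
        if score > best.get(ante, ("", -1.0))[1]:
            best[ante] = (c, score)
    rules = sorted(((a, c, s) for a, (c, s) in best.items()),
                   key=lambda r: r[2], reverse=True)
    return rules[:max_rules]
```

Training the fuzzy sets and pruning the rule base, the remaining steps mentioned above, would follow on top of such an initial rule base.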
If enough computing resources are available, it is possible to generate sev-
eral different explanatory models in parallel. The system can then try to
measure the interpretability based on approaches like the minimum descrip-
tion length principle [9] that can also be applied to neuro-fuzzy learning [8].
Another possibility to compare different rule based models is to consider the
number of relevant parameters. For a rule-based classifier with $m$ classes and $r$ rules
this can be expressed by the complexity measure
$$\mathrm{comp} = \frac{m}{\sum_{i=1}^{r} n_i}\,,$$
where $n_i$ is the number of variables used in the $i$th rule. This measure is 1 if the
classifier contains only one rule per class using one variable each, and it approaches
0 the more rules and variables are used. We assume that at least one rule per class is
required, i.e. systems with a default rule must be suitably recoded.
When we want to compare fuzzy rule bases we also want to measure the
quality of the fuzzy partitions. Usually, a fuzzy partitioning is considered to
be “good”, if it provides complete coverage (i.e. membership degrees add up
to 1 for each element of the domain) and has only a small number of fuzzy
sets. If we assume that the domain Xi (i ∈ {1, . . . , n}) of the ith variable is
partitioned by $p_i$ fuzzy sets $\mu_i^{(1)}, \ldots, \mu_i^{(p_i)}$, then we can measure the degree of
coverage provided by the fuzzy partition over $X_i$ by the coverage index
$$\mathrm{cov}_i = \frac{\int_{X_i} \hat{h}_i(x)\,dx}{N_i}\,, \quad (15.1)$$
where
$$\hat{h}_i(x) = \begin{cases} h_i(x) & \text{if } 0 \le h_i(x) \le 1\,,\\[4pt] \dfrac{p_i - h_i(x)}{p_i - 1} & \text{otherwise}\,, \end{cases} \qquad h_i(x) = \sum_{k=1}^{p_i} \mu_i^{(k)}(x)\,,$$
with $N_i = \int_{X_i} dx$ for continuous domains. For discrete finite domains we have
$N_i = |X_i|$ and replace the integral in (15.1) by a sum.
We can see that $\mathrm{cov}_i \in [0, 1]$, where $\mathrm{cov}_i = 0$ means that the variable is
either not partitioned or that we have a crisp partition such that all $\mu_i^{(k)} = X_i$.
Both extreme cases mean that the variable would be considered as “don't care”
and would not appear in any rule. Complete coverage is indicated by $\mathrm{cov}_i = 1$.
Note that a partition that covers only 10% of all values gets approximately
the same score as a partition where 90% of the whole domain is completely
covered by all sets. This feature may be disputable and has to be considered
when applying the index.
In order to penalize partitions with a high granularity we can use the
partition index
$$\mathrm{part}_i = \frac{1}{p_i - 1}\,,$$
assuming that $p_i \ge 2$, because otherwise we would not consider that variable.
We use $\overline{\mathrm{cov}}$ to denote the average normalized coverage and $\overline{\mathrm{part}}$ to denote
the average normalized partition index over all variables actually used in the
classifier.
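To make these indices concrete, here is a minimal sketch (in Python, with illustrative names; not part of NEFCLASS) that computes the complexity, coverage and partition indices for one variable of a fuzzy classifier over a discretized domain.

```python
import numpy as np

def complexity(num_classes, vars_per_rule):
    # comp = m / sum of variables used by the r rules: 1 for one
    # single-variable rule per class, approaching 0 for larger rule bases
    return num_classes / sum(vars_per_rule)

def coverage(fuzzy_sets, domain):
    # discrete version of (15.1) with N_i = |X_i|
    p = len(fuzzy_sets)
    h = np.sum([[mu(x) for x in domain] for mu in fuzzy_sets], axis=0)
    h_hat = np.where(h <= 1.0, h, (p - h) / (p - 1))
    return float(h_hat.mean())

def partition_index(p):
    # penalizes fine granularity: 1 for two fuzzy sets, smaller for more
    return 1.0 / (p - 1)

# toy example: a variable on [0, 10] partitioned by three triangular fuzzy sets
def tri(a, b, c):
    return lambda x: max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

domain = np.linspace(0.0, 10.0, 101)
fuzzy_sets = [tri(-5, 0, 5), tri(0, 5, 10), tri(5, 10, 15)]
print(complexity(num_classes=3, vars_per_rule=[1, 1, 1]))              # 1.0
print(coverage(fuzzy_sets, domain), partition_index(len(fuzzy_sets)))  # ~1.0, 0.5
```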
Finally, we can use the interpretability index
$$I = \mathrm{comp} \cdot \overline{\mathrm{cov}} \cdot \overline{\mathrm{part}}\,.$$
• Fuzzy set learning: a small SSE (sum of squared errors) usually indicates
that the classifications are nearly crisp, and a small number of misclassifications
is an obvious target. In order to obtain meaningful fuzzy sets, the learning
algorithm is constrained. For example, fuzzy sets must not change their relative
position to each other and must always overlap.
• Automatic exhaustive pruning: NEFCLASS includes four pruning
strategies that try to delete variables, rules, terms and fuzzy sets from
the rule base. In order to obtain a rule base that is as small as possible, all
four strategies are applied one after the other in an exhaustive way
to all possible parameters. For the sake of speed the fuzzy sets are not
retrained after each pruning step, but only once after pruning.
In order to start the learning process NEFCLASS requires initial fuzzy
partitions for all variables. This part is not yet fully automated, because for
numeric variables a number of fuzzy sets must be specified. For the discussed
scenario this should be done by the application designer. We are currently
working on automating this process as well. For symbolic variables, NEF-
CLASS can determine the initial fuzzy partitions automatically during rule
learning [12].
When we execute NEFCLASS in an application environment with no user
intervention, then we must try to balance the required computation time with
the quality of the rule base. While rule learning and pruning are usually fast,
fuzzy set learning can take a lot of time, because it requires many iterations
through the training data set. Fuzzy set learning is influenced by some para-
meters (learning rate, batch/online learning, look ahead mode for trying to
escape local minima) and by the already mentioned constraints for guaran-
teeing meaningful fuzzy sets.
For running fuzzy set learning automatically we can select a small learning
rate (e.g. 0.1) together with batch learning to avoid oscillation and select a
small look ahead value (10 epochs) which continues training beyond a local
minimum in the hope of escaping it again. The number of epochs is usually
chosen in such a way that a minimum number of learning steps is computed
(e.g. 100) and that learning is stopped after, for example, 30 seconds. This time
has to be set according to the application scenario. If users of the application
tend to inspect other features of the requested data first before
checking the fuzzy rules, learning may continue for longer. If the user mainly
waits for the rules, learning may have to be shorter.
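Such a policy can be written as a small wrapper around the epoch loop. The parameter values below mirror the examples in the text; the function train_epoch is a placeholder, not an actual NEFCLASS routine.

```python
import time

def run_fuzzy_set_learning(train_epoch, min_epochs=100, look_ahead=10,
                           time_budget_s=30.0):
    """train_epoch() performs one batch epoch (e.g. learning rate 0.1)
    and returns the epoch error."""
    start = time.time()
    best_err, epochs_since_best, epoch = float("inf"), 0, 0
    while True:
        err = train_epoch()
        epoch += 1
        if err < best_err:
            best_err, epochs_since_best = err, 0
        else:
            epochs_since_best += 1            # possibly stuck in a local minimum
        if epoch < min_epochs:
            continue                          # guarantee a minimum number of steps
        if time.time() - start > time_budget_s:
            break                             # time budget depends on the scenario
        if epochs_since_best > look_ahead:
            break                             # look-ahead exhausted, give up
    return best_err
```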
We must also balance the constraints we impose on the learning algorithm.
We could enforce strict constraints, for example that the membership degrees for each
element must add up to 1.0 [16, 18]. However, strict constraints tend to prevent
the system from reaching an acceptable classification performance and usually
require inspection of the learning outcome and repeated trials, for example,
with different numbers of fuzzy sets. In an automated scenario this is not
possible, therefore we use only the above-mentioned less strict constraints.
• Our objective was to embed a rule learner into a software platform such
that it can perform fully automatically, is transparent to the user, and
produces a rule base very quickly (in under a minute). If you have a similar
objective, then look for approaches that have few parameters and simple
learning algorithms. Decision tree learners are a good choice, for example
(see the sketch after this list).
• The data mining software package Weka [21] contains some rule learners
that you can use in your own applications.
• Several NEFCLASS implementations are also available on the Internet
[11].
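If Weka is not an option, the same fast, nearly parameterless behaviour can be obtained from other libraries. A minimal sketch with scikit-learn (our illustrative choice; it is not the tool used in ITEMS) that learns and prints crisp rules in a fraction of a second:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# export_text prints the tree as nested if-then rules
print(export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                       "petal_len", "petal_wid"]))
```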
Any organization with a large mobile workforce needs to ensure efficient uti-
lization of its resources as they move between tasks distributed over a large
geographical area. BT employs around 20000 engineers in the UK who provide
services for business and residential customers such as network maintenance,
line provision and fault repairs. In order to manage its resources efficiently
and effectively, BT uses a sophisticated dynamic scheduling system to build
proposed sequences of work for field engineers.
A typical schedule for a field engineer contains a sequence of time windows
for travel and task. To generate accurate schedules the system must have ac-
curate estimates for the time required to travel between tasks and estimates
for task duration. At BT Exact – BT’s research, technology and IT opera-
tions business – we have implemented a system that improves the accuracy of
travel time estimates by 30% compared to the previous system. The system
[1, 5, 6] contains an estimation module, a visual data mining module and an
explanation facility.
Note that the travel time is calculated as the difference between the time
a task is issued and the arrival time on site. Specifically, travel time includes
the time required to leave the current site, walk to the car-park, start the car,
drive to the destination site, park the car, and gain access to the premises
of the next customer. However, these individual activities are not logged;
only the start time of the journey and the start time of the actual work are
available. It is obvious that we have to deal with huge differences between
urban and rural areas, for example, just in finding a space to park.
15.5.2 Example
In the following we present a short scenario about how ITEMS uses NE-
FCLASS and a decision tree learner to generate rules for explaining travel
patterns. The data contains both numeric and symbolic data and we are us-
ing the algorithm described in [14]. For confidentiality reasons we can only
reveal parts of the result.
As an example we take a closer look at one model for one organizational
unit comprising 13 technicians, using three weeks of data. This is a typical
set of data analyzed by a unit manager. The data contains 10 input variables,
where five are symbolic. The data was classified into three classes: early, on
time and late. After rule learning, fuzzy set tuning and pruning, NEFCLASS
presented a rule base of seven rules using four variables (Technician, Time
(hour), Start Location, Destination). The decision tree learner created a rule
base with 66 rules with more than six variables on average.
The accuracy of the fuzzy rule base was 76.6% while the decision tree had
an accuracy of 70%. On our interpretability index introduced in Sect. 15.3
the fuzzy rule base obtained a score of
Let us take a closer look at the more interpretable rule base. The class
accuracies are early = 30%, late = 32%, on time = 97%. That means the rule
base is better at predicting the majority class (on time), which is no surprise.
Still, the rules are useful for explaining patterns. Let us concentrate on the
fuzzy rules that describe late journeys for this particular set of jobs. Groupn ,
Startn and Endn are fuzzy sets for the symbolic variables ID (technician),
Start (start location) and End (destination), where n is an index.
• If ID is Group1 and Start Hour is small and Start is Start1
and End is End1 then Late
• If ID is Group4 and Start is Start2 and End is End3 then Late
• If ID is Group5 and Start is Start3 and End is End4 then Late
Further analysis of the classification of individual jobs shows that 85% of
the misclassified late patterns have almost equal membership with late and
on time. In addition all late patterns have actually non-zero membership with
the class late.
When the user clicks on the graphical representation of a journey in the
graphical user interface, he will see all rules that fire for that particular record.
If the user is particularly interested in late travel, at least one of the above
presented rules will be displayed. Even if the pattern is misclassified, the rules
can still provide a useful explanation.
Note, that the rules are not meant for prediction, but for explanation only.
That is why we use all the data to generate the rules and do not use validation.
The rules represent a rule-based summary of the data. For example, if there is a
rule that correctly classifies a lot of late journeys, the manager can investigate
why this particular pattern is present in the data. Such a scenario can point
to problems in the prediction module or to specific situations like on-going
road works that are only indirectly reflected in the data. On the other hand,
if there is a late journey that cannot be classified correctly, this means that
it cannot be explained by the data and may be an outlier (exception) so that
no action is required.
Most of the rules shown above contain only symbolic variables and there-
fore use fuzzy sets that are represented as a list of value/membership pairs.
They are not as easily interpretable as, for example, the fuzzy set small for
the variable Start Hour in the first rule. However, close inspection of the fuzzy
sets reveals that, for example, technicians who were frequently late all have
high degrees of membership with the fuzzy sets Group1 , Group4 and Group5 .
This can help the manager in identifying, for example, individuals who may
require additional training or information about better routes, because they
are, for example, new on the job.
Two models $i$ and $j$ can also be compared by a weighted combination of their
interpretability scores $I$ and accuracies $A$:
$$c = w \cdot \frac{I_i}{I_j} + (1 - w) \cdot \frac{A_i}{A_j}\,.$$
15.6 Conclusions
Intelligent data analysis plays a crucial role in modern businesses [2]. Businesses
do not only require predictions based on data but also a deep under-
standing of the data that is collected from internal or external sources. Rule
based models that provide explanations can be a valuable tool in this area.
We feel that hybrid approaches like neuro-fuzzy systems are very well
suited for this task, because they tend to provide more intuitive rules than
crisp rule learners. For integrating any approach for explanation generation
into applications it is also very important that it works automatically and
fast. For this reason, and because of the scope of our application project, we
look only at NEFCLASS and decision trees.
While it is well known that decision trees are fast and run basically pa-
rameterless and automatically, our project also showed that a neuro-fuzzy
approach like NEFCLASS can compete in scenarios where explanatory rules
are required.
The learning algorithms of NEFCLASS are capable of generating explana-
tions about a data set selected by a user in a reasonably short time. In order
to further automate the generation of fuzzy models by NEFCLASS we will
study methods for automatically generating an appropriate number of fuzzy
sets as they are described, for example, in [22].
References
1. Ben Azvine, Colin Ho, Stewart Kay, Detlef Nauck, and Martin Spott. Estimating
travel times of field engineers. BT Technology Journal, 21(4): 33–38, October
2003.
2. Ben Azvine, Detlef Nauck, and Colin Ho. Intelligent business analytics – a tool
to build decision-support systems for eBusinesses. BT Technology Journal, 21(4):
65–71, October 2003.
3. Hugues Bersini, Gianluca Bontempi, and Mauro Birattari. Is readability com-
patible with accuracy? From neuro-fuzzy to lazy learning. In Fuzzy-Neuro Sys-
tems ’98 – Computational Intelligence. Proc. 5th Int. Workshop Fuzzy-Neuro-
Systems ’98 (FNS’98) in Munich, Germany, volume 7 of Proceedings in Artificial
Intelligence, pages 10–25, Sankt Augustin, 1998. infix.
4. J. Casillas, O. Cordon, F. Herrera, and L. Magdalena, editors. Trade-off be-
tween Accuracy and Interpretability in Fuzzy Rule-Based Modelling. Studies in
Fuzziness and Soft Computing. Physica-Verlag, Heidelberg, 2002.
5. Colin Ho and Ben Azvine. Mining travel data with a visualiser. In Proc. Inter-
national Workshop on Visual Data Mining at the 2nd European Conference on
Machine Learning (ECML’01) and 5th European Conference on Principles and
Practice of Knowledge Discovery in Databases (PKDD’01), Freiburg, 2001.
6. Colin Ho, Ben Azvine, Detlef Nauck, and Martin Spott. An intelligent travel
time estimation and management tool. In Proc. 7th European Conference on
Networks and Optical Communications (NOC 2002), pages 433–439, Darmstadt,
2002.
7. Cezary Z. Janikow. Fuzzy decision trees: Issues and methods. IEEE Trans.
Systems, Man & Cybernetics. Part B: Cybernetics, 28(1): 1–14, 1998.
8. Aljoscha Klose, Andreas Nürnberger, and Detlef Nauck. Some approaches to
improve the interpretability of neuro-fuzzy classifiers. In Proc. Sixth European
Congress on Intelligent Techniques and Soft Computing (EUFIT98), pages 629–
633, Aachen, 1998.
9. I. Kononenko. On biases in estimating multi-valued attributes. In Proc. 1st In-
ternational Conference on Knowledge Discovery and Data Mining, pages 1034–
1040, Montreal, 1995.
10. Ludmilla I. Kuncheva. Fuzzy Classifier Design. Springer-Verlag, Heidelberg,
2000.
16
Fuzzy Linguistic Data Summaries as a Human Consistent,
User Adaptable Solution to Data Mining
J. Kacprzyk and S. Zadrożny
16.1 Introduction
In this chapter we address the problem that may be exemplified as follows.
There is a small (or small-to-medium, SME for short) company that, as
all other companies and organizations, faces the problem of dealing with
data sets that are too large to be comprehensible by a human user. They know
that they need some data mining, but they are fully aware of their limitations.
Mainly, in comparison with larger and richer companies and organizations,
they need a simple and possibly inexpensive solution that is also as human
consistent as possible. They are aware that most of their employees are
not qualified computer specialists, as they cannot afford to hire such people,
and hence the solutions adopted should be as human consistent and intuitive
as possible, based as heavily as possible on the use of natural language.
Such solutions have to offer at least a basic adaptability with respect to the
interpretation of linguistic terms that are used to express data values and
relations between data. Another dimension of the adaptability may be con-
sidered from the perspective of data sources taken into account. The primary
data source for such data mining tasks is, of course, a database of the user.
However, in order to discover some interesting phenomena in data it may be
worthwhile to acquire some other data as well as no company operates in a
vacuum, separated from the outside world. The Internet seems to be such a
source of choice. Nowadays, it may still be difficult to get interesting, relevant
data from the Internet without careful planning and execution. However,
as soon as the promises of the Semantic Web become reality, it should be
fairly easy to arrange for automatic acquisition of data that is relevant for
our problem but does not have to be identified in advance. In many cases
such data may be easily integrated with our own data and provide the user
with interesting results. For example, coupling the data on sales per day with
weather information related to a given time period (that is not usually stored
in sales databases) may show some dependencies important for running the
business.
Generally, data summarization is still an unsolved problem in spite of
vast research efforts. Very many techniques are available but they are not
“intelligent enough”, and not human consistent, partly due to the fact that the
use of natural language is limited. This concerns, e.g., summarizing statistics,
exemplified by the average, median, minimum, maximum, α-percentile, etc.
which – in spite of recent efforts to soften them – are still far from being
able to reflect a real human perception of their essence. In this chapter we
discuss an approach to solve this problem. It is based on the concept of a
linguistic data (base) summary and has been originally proposed by Yager
[1, 2] and further developed by many authors (see, for instance, Kacprzyk
and Yager [3], and Kacprzyk, Yager and Zadrożny [4]). The essence of such
linguistic data summaries is that a set of data, say, concerning employees, with
(numeric) data on their age, sex, salaries, seniority, etc., can be summarized
linguistically with respect to a selected attribute or attributes, say age and
salaries, by linguistically quantified propositions, say “almost all employees are
well qualified”, “most young employees are well paid”, etc. Notice that such
simple, extremely human consistent and intuitive statements do summarize
in a concise yet very informative form what we may be interested in.
We present the essence of Yager’s [1, 2] approach to such summaries,
with its further extensions (cf. Kacprzyk and Yager [3], Kacprzyk, Yager and
Zadrożny [4, 5]) from the perspective of Zadeh’s computing with words and
perception paradigm (cf. Zadeh and Kacprzyk [6]) that can provide a general
theoretical framework which is implementable, as shown in works mentioned
above. In particular, we indicate the use of Zadeh’s concept of a protoform of
a fuzzy linguistic summary (cf. Zadeh [7], Kacprzyk and Zadrożny [8]) that
can provide a “portability” and “scalability” as meant above, and also some
“adaptivity” to different situations and needs by providing universal means
for representing quite general forms of summaries.
As an example we will show an implementation of the data summarization
system proposed for the derivation of linguistic data summaries in a sales
database of a computer retailer.
The basic philosophy of the approach and its algorithmic engine make use
of the computing with words and perception paradigm introduced by Zadeh
in the mid-1990s, and best and most comprehensively presented in Zadeh and
Kacprzyk’s [6] books. It may be viewed as a new paradigm, or “technology” in
the representation, processing and solving of various real life problems when a
human being is a crucial element. Such problems are omnipresent. The basic
idea and rationale of computing with words and perceptions is that since for a
human being natural language is the only fully natural way of communication,
then maybe it could be expedient to try to “directly” use (elements of) natural
language in the formulation, processing and solution of problems considered
to maintain a higher human consistence, hence a higher implementability.
Notice that the philosophy and justification of the computing with words
and perception paradigm are in line with the requirements and specifics of
problems considered, and solution concepts adopted in this paper.
A prerequisite for computing with words is to have some way to formally
represent elements of natural language used. Zadeh proposed to use here the
PNL (precisiated natural language). Basically, in PNL, statements about val-
ues, relations, etc. between variables are represented by constraints. In the
conventional case, a statement is, e.g., that the value of variable x belongs to
a set X. In PNL, statements – generally written as “x is Z” – may be dif-
ferent, and correspond to numeric values, intervals, possibility distributions,
verity distributions, probability distributions, usuality qualified statements,
rough sets representations, fuzzy relations, etc. For our purposes, the usuality
qualified representation will be of a special relevance. Basically, it says “x is
usually Z” that is meant as “in most cases, x is Z”. PNL may play various
roles among which crucial are: the description of perceptions, the definition
of sophisticated concepts, a language for perception based reasoning, etc.
Recently, Zadeh [7] introduced the concept of a protoform. For our pur-
poses, one should notice that most perceptions are summaries. For instance,
a perception like “most Swedes are tall” is some sort of a summary. It can be
represented in Zadeh’s notation as “most As are Bs”. This can be employed
for reasoning under various assumptions. For instance, if we know that “x is
A”, we can deduce that, e.g.“it is likely that x is B”. We can also ask about an
average height of a Swede, etc. One can go a step further, and define a proto-
form as an abstracted summary. In our case, this would be “QAs are Bs”.
Notice that we now have a more general, deinstantiated form of our point of
departure “most Swedes are tall”, and also of “most As are Bs”. Needless to
say that much of human reasoning is protoform based, and the availability of
such a more general representation is very valuable, and provides tools that
can be used in many cases. From the point of view of the problem class con-
sidered in this chapter, the use of protoforms may be viewed to contribute to
the portability, scalability and adaptivity in the sense mentioned above.
We discuss a number of approaches to mining of linguistic summaries.
First, those based on Kacprzyk and Zadrożny’s [9, 10] idea of an interactive
approach to linguistic summaries in which the determination of a class of
summaries of interest is done via Kacprzyk and Zadrożny’s [11, 12] FQUERY
for Access, a fuzzy querying add-on to Microsoft Access. It is shown that by
relating a range of types of linguistic summaries to fuzzy queries, with various
known and sought elements, we can arrive at a hierarchy of protoforms of
linguistic data summaries. Basically, there is a trade-off between the specificity
with respect to the summaries sought and the complexity of the corresponding
mining process. In the simplest case, data mining boils down directly to a
flexible querying process. In the opposite case, the concept of a linguistic
association rule along with well known efficient mining algorithms may be
employed. Also other approaches to linguistic summaries mining are briefly
discussed in Sect. 16.3.
The line of reasoning adopted here should convince the reader that the use
of a broadly perceived paradigm of computing with words and perceptions,
equipped with a newly introduced concept of a protoform, may be a proper
tool for dealing with situations when we have to develop and implement a sys-
tem that should perform “intelligent” tasks, be human consistent and human
friendly, and fulfill some other relevant requirements, e.g., be inexpensive,
easy to calibrate, portable, scalable, and able to somehow adapt to changing
conditions and requirements.
Then, the truth values (from [0, 1]) of (16.3) and (16.4) are calculated,
respectively, as
$$\mathrm{truth}(Q\,y\text{'s are } S) = \mu_Q\!\left(\frac{1}{n}\sum_{i=1}^{n}\mu_S(y_i)\right) \quad (16.6)$$
$$\mathrm{truth}(Q\,R\,y\text{'s are } S) = \mu_Q\!\left(\frac{\sum_{i=1}^{n}\bigl(\mu_R(y_i)\wedge\mu_S(y_i)\bigr)}{\sum_{i=1}^{n}\mu_R(y_i)}\right) \quad (16.7)$$
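Formulas (16.6) and (16.7) translate directly into code. In the sketch below the piecewise-linear definition of the quantifier "most" and the minimum as the "and" operation are common choices, not requirements of the approach; the data are invented.

```python
import numpy as np

def mu_most(x):
    # a common piecewise-linear model of "most": 0 below 0.3, 1 above 0.8
    return float(np.clip((x - 0.3) / 0.5, 0.0, 1.0))

def truth_q_s(mu_s, mu_q=mu_most):
    # (16.6): truth(Q y's are S)
    return mu_q(float(np.mean(mu_s)))

def truth_q_r_s(mu_r, mu_s, mu_q=mu_most):
    # (16.7): truth(Q R y's are S), with the minimum as the "and" operation
    mu_r, mu_s = np.asarray(mu_r), np.asarray(mu_s)
    return mu_q(float(np.sum(np.minimum(mu_r, mu_s)) / np.sum(mu_r)))

# membership of five employees in "young" (R) and "well paid" (S)
young     = [1.0, 0.8, 0.6, 0.1, 0.0]
well_paid = [0.9, 0.7, 0.8, 0.2, 0.9]
print(truth_q_s(well_paid))           # "most employees are well paid"
print(truth_q_r_s(young, well_paid))  # "most young employees are well paid"
```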
Both the fuzzy predicates S and R are assumed above to be of a rather
simplified, atomic form referring to just one attribute. They can be extended
to cover more sophisticated summaries involving some confluence of various
attribute values as, e.g., “young and well paid”. Clearly, when we try to linguis-
tically summarize data, the most interesting are non-trivial, human-consistent
summarizers (concepts) as, e.g.:
• productive workers,
• difficult orders, . . . ,
and it may easily be noticed that their proper definition may require a very
complicated combination of attributes as with, for instance: a hierarchy (not
all attributes are of the same importance for a concept in question), the at-
tribute values are ANDed and/or ORed, k out of n, most, . . . of them should
be accounted for, etc.
Recently, Zadeh [7] introduced the concept of a protoform that is highly
relevant in this context. Basically, a protoform is defined as a more or less
abstract prototype (template) of a linguistically quantified proposition. The
most abstract protoforms correspond to (16.3) and (16.4), while (16.1) and
(16.2) are examples of fully instantiated protoforms. Thus, evidently, proto-
forms form a hierarchy, where higher/lower levels correspond to more/less
abstract protoforms. Going down this hierarchy one has to instantiate partic-
ular components of (16.3) and (16.4), i.e., quantifier Q and fuzzy predicates S
and R. The instantiation of the former one consists in the selection of a quan-
tifier. The instantiation of fuzzy predicates requires the choice of attributes
together with linguistic values (atomic predicates) and a structure they form
when combined using logical connectives. This leads to a theoretically infi-
nite number of potential protoforms. However, for the purposes of mining of
linguistic summaries, there are obviously some limits on a reasonable size of
a set of summaries that should be taken into account. These result from the
limited capability of the user to interpret summaries as well as from
the computational point of view.
The concept of a protoform may be taken as a guiding paradigm for the
design of a user interface supporting the mining of linguistic summaries. It
may be assumed that the user specifies a protoform of linguistic summaries
sought. Basically, the more abstract the protoform, the less is assumed
about the summaries sought, i.e., the wider the range of summaries expected by
the user. There are two limit cases, where:
In the first case the system has to construct all possible summaries (with
all possible linguistic components and their combinations) for the context of a
given database (table) and present to the user those verifying the validity to a
degree higher than some threshold. In the second case, the whole summary is
specified by the user and the system has only to verify its validity. Thus, the
former case is usually more interesting from the point of view of the user but
at the same time more complex from the computational point of view. There
are a number of intermediate cases that may be more practical. In Table 16.1
basic types of protoforms/linguistic summaries are shown, corresponding to
protoforms of a more and more abstract form.
Basically, each of fuzzy predicates S and R may be defined by listing
its atomic fuzzy predicates (i.e., pairs of “attribute/linguistic value”) and
structure, i.e., how these atomic predicates are combined. In Table 16.1 S
(or R) corresponds to the full description of both the atomic fuzzy predicates
(referred to as linguistic values, for short) as well as the structure. For example:
is a protoform of Type 3.
In case of (16.8) the system has to select a linguistic quantifier (usually
from a predefined dictionary) that when put in place of Q in (16.8) makes the
resulting linguistically quantified proposition valid to the highest degree. In
case of (16.9), the linguistic quantifier as well as the structure of summarizer S
are given. The system has to choose a linguistic value to replace the question
mark (“?”) yielding a linguistically quantified proposition as valid as possible.
Note that this may be interpreted as the search for a typical salary in the
company.
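A Type 3 search like the one just described can be mimicked in a few lines: for the quantifier "most" and the structure "salary = ?", try each linguistic value from a (here invented) dictionary and keep the most valid one.

```python
import numpy as np

def mu_most(x):
    return float(np.clip((x - 0.3) / 0.5, 0.0, 1.0))

salaries = [1800, 2100, 2300, 2500, 2600, 2700, 5200]

def around(center, spread=600):          # illustrative triangular fuzzy value
    return lambda s: max(0.0, 1.0 - abs(s - center) / spread)

dictionary = {"low": around(1500), "medium": around(2500), "high": around(5000)}

# validity of "most employees earn a <term> salary", computed as in (16.6)
validity = {term: mu_most(float(np.mean([f(s) for s in salaries])))
            for term, f in dictionary.items()}
print(max(validity, key=validity.get))   # -> "medium", i.e. a typical salary
```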
with a truth degree being a function of the two components of the summary
that involve the truth (validity) T and the linguistic quantifier Q. In the
literature (cf., e.g., Dubois and Prade [19]) many possible interpretations of
fuzzy rules are considered. Some of them were directly discussed in
the context of linguistic summaries by some authors (cf. Sect. 16.3.3 in this
chapter).
Some authors consider the concept of a fuzzy functional dependency as
a suitable candidate for the linguistic summarization. The fuzzy functional
dependencies are an extension of the classical crisp functional dependencies
considered in the context of relational databases. The latter play a funda-
mental role in the theory of normalization. A functional dependency between
two sets of attributes {Ai } and {Bi } holds when the values of attributes {Ai }
fully determine the values of attributes {Bi }. Thus, a functional dependency
is a much stronger dependency between attributes than that expressed by
(16.10). The classical crisp functional dependencies are useless for data sum-
marization (at least in case of regular relational databases) as in a properly
designed database they should not appear, except the trivial ones. On the
other hand, fuzzy functional dependencies are of an approximate nature and
as such may be identified in a database and serve as linguistic summaries.
They may be referred to as extensional functional dependencies that may ap-
pear in a given instance of a database, in contrast to intensionally interpreted
crisp functional dependencies that are, by design, avoided in any instance of
a database. A fuzzy functional dependency may be exemplified with
The basic concept of a linguistic summary seems to be fairly simple. The main
issue is how to generate summaries for a given database. The full search of
the solution space is practically infeasible. In the literature a number of ways
to solve this problem have been proposed. In what follows we briefly overview
some of them.
The process of mining of linguistic summaries may be more or less au-
tomatic. At the one extreme, the system may be responsible for both the
construction and verification of summaries (which corresponds to Type 5
protoforms/summaries given in Table 16.1). At the other extreme, the user
proposes a summary and the system only verifies its validity (which corre-
sponds to Type 0 protoforms/summaries in Table 16.1). The former approach
seems to be more attractive and in the spirit of data mining meant as the
discovery of interesting, unknown regularities in data. On the other hand,
the latter approach obviously secures a better interpretability of the results.
Thus, we will discuss now the possibility to employ a flexible querying inter-
face for the purposes of linguistic summarization of data, and indicate the
implementability of a more automatic approach.
Referring to Table 16.1, we can observe that Type 0 as well as Type 1 lin-
guistic summaries may be easily produced by a simple extension of FQUERY
for Access. Basically, the user has to construct a query, a candidate sum-
mary, and it is to be determined which fraction of rows matches that query
(and which linguistic quantifier best denotes this fraction, in case of Type 1).
Type 3 summaries require much more effort as their primary goal is to deter-
mine typical (exceptional) values of an attribute (combination of attributes).
So, query/summarizer S consists of only one simple condition built of the
attribute whose typical (exceptional) value is sought, the “=” relational oper-
ator, and a placeholder for the value sought. For example, using: Q = “most”
and S = “age=?” we look for a typical value of “age”. From the computational
point of view Type 5 summaries represent the most general form considered:
fuzzy rules describing dependencies between specific values of particular at-
tributes.
The summaries of Type 1 and 3 have been implemented as an extension
to Kacprzyk and Zadrożny’s [22, 23, 24] FQUERY for Access.
$$A_1 \wedge A_2 \wedge \ldots \wedge A_n \longrightarrow A_{n+1} \quad (16.13)$$
and note that much earlier origins of that concept are mentioned in the work
by Hájek and Holeňa [17]).
Thus, such an association rule states that if in a database row all the attributes
from the set $\{A_1, A_2, \ldots, A_n\}$ take on the value 1, then the attribute
$A_{n+1}$ is also expected to take on the value 1. The algorithms proposed in the
literature for mining association rules are based on the following concepts and
definitions. A row in a database (table) is said to support a set of attributes
$\{A_i\}_{i \in I}$ if all attributes from the set take on the value 1 in this row. The support
of a rule (16.13) is the fraction of rows in the database (table) supporting the set of
attributes $\{A_i\}_{i \in \{1,\ldots,n+1\}}$. The confidence of a rule in
a database (table) is the fraction of rows supporting the set of attributes
$\{A_i\}_{i \in \{1,\ldots,n+1\}}$ among all rows supporting the set of attributes
$\{A_i\}_{i \in \{1,\ldots,n\}}$. The well known algorithms (cf. Agrawal and Srikant [25] and
Mannila et al. [26]) search for rules having a support above some minimal threshold
and a high value of the confidence measure. Moreover, these algorithms may easily be
adapted to non-binary valued data and to more sophisticated rules than the one shown in (16.13).
In particular, fuzzy association rules may be considered:
$$A_1 \text{ IS } R_1 \wedge A_2 \text{ IS } R_2 \wedge \ldots \wedge A_n \text{ IS } R_n \longrightarrow A_{n+1} \text{ IS } S \quad (16.14)$$
should take that into account. Basically, a scalar cardinality may be employed
(in the spirit of Zadeh’s calculus of linguistically quantified propositions).
Finally, each frequent itemset (i.e., with the support higher than a selected
threshold) is split (in all possible ways) into two parts treated as a conjunction
of atomic predicates and corresponding to the premise (predicate R in terms
of linguistic summaries) and consequence (predicate S in terms of linguistic
summaries) of the rule, respectively. Such a rule is accepted if its confidence
is higher than the selected threshold. Note that such an algorithm trivially
covers the linguistic summaries of type (16.3), too. For them the last step is
not necessary and each whole frequent itemset may be treated as a linguistic
summary of this type.
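With the scalar-cardinality reading mentioned above and the minimum as the conjunction, fuzzy support and confidence can be computed directly from the membership degrees of the rows. The data and names below are illustrative.

```python
import numpy as np

def fuzzy_support(memberships):
    # memberships: (rows x attributes) membership degrees of one itemset;
    # scalar cardinality of the row-wise conjunction, divided by the row count
    return float(np.min(memberships, axis=1).mean())

def fuzzy_confidence(premise, consequent):
    premise_deg = np.min(premise, axis=1)
    both = np.minimum(premise_deg, consequent)
    return float(both.sum() / premise_deg.sum())

# degrees to which five rows match "age IS young" and "salary IS high"
young = np.array([0.9, 0.8, 0.2, 0.0, 1.0])
high  = np.array([0.7, 0.9, 0.1, 0.3, 0.8])
itemset = np.column_stack([young, high])
print(fuzzy_support(itemset))                        # support of the itemset
print(fuzzy_confidence(young.reshape(-1, 1), high))  # "young -> high salary"
```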
Fuzzy association rules were studied by many authors including Lee and
Lee-Kwang [27] and Au and Chan [28]. Hu et al. [29] simplify the form of
fuzzy association rules sought by assuming a single specific attribute (class)
in the consequent. This leads to the mining of fuzzy classification rules. Bosc
et al. [30] argue against the use of scalar cardinalities in fuzzy association rule
mining. Instead, they suggest employing fuzzy cardinalities and propose an
approach for the calculation of rules’ frequencies. This is not a trivial problem
as it requires to divide the fuzzy cardinalities of two fuzzy sets. Kacprzyk,
Yager and Zadrożny [4, 22, 23, 24, 31] advocated the use of fuzzy association
rules for mining linguistic summaries in the framework of flexible querying
interface. Chen et al. [32] investigated the issue of generalized fuzzy rules
where a fuzzy taxonomy of linguistic terms is taken into account. Kacprzyk
and Zadrożny [33] proposed to use more flexible aggregation operators instead
of conjunction, but still in the context of fuzzy association rules.
George and Srikanth [34, 35] use a genetic algorithm to mine linguistic sum-
maries. Basically, they consider the summarizer in the form of a conjunction
of atomic fuzzy predicates and a void subpopulation. Then, they search for
two linguistic summaries referred to as a constraint descriptor (“most specific
generalization”) and a constituent descriptor (“most general specification”),
respectively. The former is defined as a compromise solution having both the
maximum truth (validity) and number of covered attributes (these criteria are
combined by some aggregation operator). The latter is a linguistic summary
having the maximum validity and covering all attributes. As in virtually all
other approaches, a dictionary of linguistic quantifiers and linguistic values
over domains of all attributes is assumed. This is sometimes referred to as a
domain or background knowledge. Kacprzyk and Strykowski [36, 37] have also
implemented the mining of linguistic summaries using genetic algorithms. In
their approach, the fitness function is a combination of a wide array of indices
assessing a validity/interestingness of given summary. These indices include,
e.g., a degree of imprecision (fuzziness), a degree of covering, a degree of appro-
priateness, a length of a summary, and yields an overall degree of validity (cf.
also Kacprzyk and Yager [3]). Some examples of this approach are presented
and discussed in Sect. 16.4 of this chapter.
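As a rough illustration of such a fitness function (the particular indices, weights and aggregation below are assumptions of this sketch, not the ones used by Kacprzyk and Strykowski):

```python
def summary_fitness(truth, degree_of_covering, length,
                    weights=(0.6, 0.3, 0.1), max_length=5):
    # weighted aggregation of validity indices; shorter summaries score higher
    w_truth, w_cover, w_short = weights
    brevity = 1.0 - (length - 1) / (max_length - 1)
    return w_truth * truth + w_cover * degree_of_covering + w_short * brevity

print(summary_fitness(truth=0.8, degree_of_covering=0.4, length=2))  # 0.675
```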
Rasmussen and Yager [38, 39] propose an extension, SummarySQL, to the
SQL language, an industry standard for querying relational databases, making
it possible to cover linguistic summaries. Actually, they do not address the
problem of mining linguistic summaries but merely of verifying them. The
user has to conceive a summary, express it using SummarySQL, and then
has it evaluated. In [39] it is shown how SummarySQL may also be used to
verify a kind of fuzzy gradual rules (cf. Dubois and Prade [40]) and fuzzy
functional dependencies. Again, the authors focus on a smooth integration of
a formalism for such rule expression with SQL rather than on the efficiency
of a verification procedure.
Raschia and Mouaddib [41] consider the problem of mining hierarchies of
summaries. Their understanding of summaries is slightly different than that
given by (16.4). Namely, their summary is a conjunction of atomic fuzzy pred-
icates (each referring to just one attribute). However, these predicates are not
defined by just one linguistic value but possibly by fuzzy sets of linguistic
values (i.e., fuzzy sets of higher levels are considered). It is assumed that both
linguistic values as well as fuzzy sets of higher levels based on them form
background knowledge provided by experts/users. The mining of summaries
(in fact what is mined is a whole hierarchy of summaries) is based on a con-
cept formation (conceptual clustering) process. The first step is, as usual, a
translation of the original tuples from the database into so-called candidate tuples.
This step consists in replacing the attribute values of the original tuples
with the best matching linguistic values defined over the respective
domains. Then, the candidate tuples obtained are aggregated to form final sum-
maries of various levels of hierarchy. This aggregation leads to possibly more
complex linguistic values (represented by fuzzy sets of a higher level). More
precisely, it is assumed that one candidate tuple is processed at a time. It
is inserted into appropriate summaries already present in the hierarchy. Each
tuple is first added to a top (root), most abstract summary, covering the whole
database (table). Then, the tuple is put into offspring summaries along the
selected branch in the hierarchy. In fact, a range of operations is considered
that may lead to a rearrangement of the hierarchy via the formation of new
node-summaries as well as splitting the old ones. The concept of a linguistic
quantifier does not directly appear in this approach. However, each summary
is accompanied by an index corresponding to the number of original tuples
covered by this summary.
summaries for a computer retailer. Basically, we will deal with its sales data-
base, and will only show some examples of linguistic summaries for some
interesting (for the user!) choices of relations between attributes.
The basic structure of the database is as shown in Table 16.2.
Linguistic summaries are generated using a genetic algorithm [36, 37]. We
will now give a couple of examples of resulting summaries. First, suppose that
we are interested in a relation between the commission and the type of goods
sold. The best linguistic summaries obtained are as shown in Table 16.3.
Table 16.3. Linguistic summaries expressing relations between the group of prod-
ucts and commission
Summary
About 1/3 of sales of network elements is with a high commission
About 1/2 of sales of computers is with a medium commission
Much sales of accessories is with a high commission
Much sales of components is with a low commission
About 1/2 of sales of software is with a low commission
About 1/3 of sales of computers is with a low commission
A few sales of components is without commission
A few sales of computers is with a high commission
Very few sales of printers is with a high commission
Table 16.4. Linguistic summaries expressing relations between the groups of prod-
ucts and times of sale
Summary
About 1/3 of sales of computers is by the end of year
About 1/2 of sales in autumn is of accessories
About 1/3 of sales of network elements is in the beginning of year
Very few sales of network elements is by the end of year
Very few sales of software is in the beginning of year
About 1/2 of sales in the beginning of year is of accessories
About 1/3 of sales in the summer is of accessories
About 1/3 of sales of peripherals is in the spring period
About 1/3 of sales of software is by the end of year
About 1/3 of sales of network elements is in the spring period
About 1/3 of sales in the summer period is of components
Very few sales of network elements is in the autumn period
A few sales of software is in the summer period
As we can see, the results can be very helpful, for instance while negotiating
commissions for various products sold.
Next, suppose that we are interested in relations between the groups of
products and times of sale. The best results obtained are as in Table 16.4.
Notice that in this case the summaries are much less obvious than in the
former case expressing relations between the group of product and commis-
sion. But, again, they provide very useful information.
Finally, let us show in Table 16.5 some of the obtained linguistic summaries
expressing relations between the attributes: size of customer, regularity of
customer (purchasing frequency), date of sale, time of sale, commission, group
of product and day of sale.
Table 16.5. Linguistic summaries expressing relations between the attributes: size
of customer, regularity of customer (purchasing frequency), date of sale, time of sale,
commission, group of product and day of sale
Summary
Much sales on Saturday is about noon with a low commission
Much sales on Saturday is about noon for bigger customers
Much sales on Saturday is about noon
Much sales on Saturday is about noon for regular customers
A few sales for regular customers is with a low commission
A few sales for small customers is with a low commission
A few sales for one-time customers is with a low commission
Much sales for small customers is for nonregular customers
Table 16.6. Linguistic summaries expressing relations between the attributes: group
of products, time of sale, temperature, precipitation, and type of customers
Summary
Very few sales of software in hot days to individual customers
About 1/2 of sales of accessories in rainy days on weekends by the end of the year
About 1/3 of sales of computers in rainy days to individual customers
Notice that the use of external data gives a new quality to possible linguis-
tic summaries. It can be viewed as providing a greater adaptivity to varying
conditions because the use of free or inexpensive data sources from the In-
ternet makes it possible to easily and quickly adapt the form and contents
of summaries to varying needs and interests. And all this comes at practically
no additional cost and effort.
References
1. R.R. Yager: A new approach to the summarization of data. Information Sciences,
28, pp. 69–86, 1982.
2. R.R. Yager: On linguistic summaries of data. In W. Frawley and
G. Piatetsky-Shapiro (Eds.): Knowledge Discovery in Databases. AAAI/MIT
Press, pp. 347–363, 1991.
3. J. Kacprzyk and R.R. Yager: Linguistic summaries of data using fuzzy logic.
International Journal of General Systems, 30, 133–154, 2001.
4. J. Kacprzyk, R.R. Yager and S. Zadrożny. A fuzzy logic based approach to
linguistic summaries of databases. International Journal of Applied Mathematics
and Computer Science, 10, 813–834, 2000.
5. J. Kacprzyk, R.R. Yager and S. Zadrożny. Fuzzy linguistic summaries of
databases for an efficient business data analysis and decision support. In W.
22. J. Kacprzyk and S. Zadrożny. Computing with words: towards a new genera-
tion of linguistic querying and summarization of databases. In P. Sinčak and J.
Vaščak (Eds.): Quo Vadis Computational Intelligence?, pp. 144–175, Springer-
Verlag, Heidelberg and New York, 2000.
23. J. Kacprzyk and S. Zadrożny. On a fuzzy querying and data mining interface,
Kybernetika, 36, 657–670, 2000.
24. J. Kacprzyk and S. Zadrożny. On combining intelligent querying and data
mining using fuzzy logic concepts. In G. Bordogna and G. Pasi (Eds.): Re-
cent Research Issues on the Management of Fuzziness in Databases, pp. 67–81,
Springer–Verlag, Heidelberg and New York, 2000.
25. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. Pro-
ceedings of the 20th International Conference on Very Large Databases, Santiago
de Chile, 1994.
26. H. Mannila, H. Toivonen and A.I. Verkamo. Efficient algorithms for discovering
association rules. In U.M. Fayyad and R. Uthurusamy (Eds.): Proceedings of the
AAAI Workshop on Knowledge Discovery in Databases, pp. 181–192, Seattle,
USA, 1994.
27. J.-H. Lee and H. Lee-Kwang. An extension of association rules using fuzzy sets.
Proceedings of the Seventh IFSA World Congress, pp. 399–402, Prague, Czech
Republic, 1997.
28. W.-H. Au and K.C.C. Chan. FARM: A data mining system for discovering
fuzzy association rules. Proceedings of the 8th IEEE International Conference
on Fuzzy Systems, pp. 1217–1222, Seoul, Korea, 1999.
29. Y.-Ch. Hu, R.-Sh. Chen and G.-H. Tzeng. Mining fuzzy association rules for clas-
sification problems. Computers and Industrial Engineering, 43, 735–750, 2002.
30. P. Bosc, D. Dubois, O. Pivert, H. Prade and M. de Calmes. Fuzzy summarization
of data using fuzzy cardinalities. Proceedings of IPMU 2002, pp. 1553–1559,
Annecy, France, 2002.
31. J. Kacprzyk and S. Zadrożny. On linguistic approaches in flexible querying and
mining of association rules. In H.L. Larsen, J. Kacprzyk, S. Zadrożny, T. An-
dreasen and H. Christiansen (Eds.): Flexible Query Answering Systems. Recent
Advances, pp. 475–484, Springer-Verlag, Heidelberg and New York, 2001.
32. G. Chen, Q. Wei and E. Kerre. Fuzzy data mining: discovery of fuzzy generalized
association rules. In G. Bordogna and G. Pasi (Eds.): Recent Issues on Fuzzy
Databases, pp. 45–66. Springer-Verlag, Heidelberg and New York, 2000.
33. J. Kacprzyk and S. Zadrożny. Linguistic summarization of data sets using as-
sociation rules. Proceedings of The IEEE International Conference on Fuzzy
Systems, pp. 702–707, St. Louis, USA, 2003.
34. R. George and Srikanth R. Data summarization using genetic algorithms and
fuzzy logic. In F. Herrera and J.L. Verdegay (Eds.): Genetic Algorithms and
Soft Computing, pp. 599–611, Springer-Verlag, Heidelberg, 1996.
35. R. George and R. Srikanth. A soft computing approach to intensional answering
in databases. Information Sciences, 92, 313–328, 1996.
36. J. Kacprzyk and P. Strykowski. Linguistic data summaries for intelligent de-
cision support. In R. Felix (Ed.): Fuzzy Decision Analysis and Recognition
Technology for Management, Planning and Optimization – Proceedings of EF-
DAN’99, pp. 3–12, Dortmund, Germany, 1999.
37. J. Kacprzyk and P. Strykowski. Linguitic summaries of sales data at a computer
retailer: a case study. Proceedings of IFSA’99, pp. 29–33, Taipei, Taiwan R.O.C,
vol. 1, 1999.
340 J. Kacprzyk and S. Zadrożny
38. D. Rasmussen and R.R. Yager. Fuzzy query language for hypothesis evalua-
tion. In Andreasen T., H. Christiansen and H. L. Larsen (Eds.): Flexible Query
Answering Systems, pp. 23–43, Kluwer, Boston, 1997.
39. D. Rasmussen and R.R. Yager. Finding fuzzy and gradual functional dependen-
cies with SummarySQL. Fuzzy Sets and Systems, 106, 131–142, 1999.
40. D. Dubois and H. Prade. Gradual rules in approximate reasoning. Information
Sciences, 61, 103–122, 1992.
41. G. Raschia and N. Mouaddib. SAINTETIQ: a fuzzy set-based approach to data-
base summarization. Fuzzy Sets and Systems, 129, 137–162, 2002.
17
Adaptive Multimedia Retrieval:
From Data to User Interaction
To improve today's multimedia retrieval tools, and thus the overall satisfaction of the user, it is necessary to develop methods that support the user in the retrieval process, e.g. by providing additional information about the search results as well as about the data collection itself, and by adapting the retrieval tool to the underlying data as well as to the user's needs and interests. In this chapter we give a brief overview of the state of the art and of current trends in research.
17.1 Introduction
During the last years several approaches have been developed that tackle
specific problems of the multimedia retrieval process, e.g. feature extraction
methods for multimedia data, problem specific similarity measures and inter-
active user interfaces. These methods enable the design of efficient retrieval
tools if the user is able to provide an appropriate query. However, in most
cases the user needs several steps in order to find the searched objects. The
main reasons for this are on the one hand, the problem of users to specify
their interests in the form of a well-defined query – which is partially caused
by inappropriate user interfaces –, on the other hand, the problem of extract-
ing relevant features from the multimedia objects. Furthermore, user specific
interests and search context are usually neglected when objects are retrieved.
To improve today's retrieval tools, and thus the overall satisfaction of the user, it is necessary to develop methods that support the user in the search process, e.g. by providing additional information about the search results as well as about the data collection itself, and by adapting the retrieval tool to the user's needs and interests.
In the following, we give an overview of methods that are used in retrieval systems for text and multimedia data. To give a guideline for the development of an integrated system, we describe methods that have been successfully used
how to aggregate and rank the results. Each of these steps should be well thought-out and designed. Another common way to help the user in the retrieval task is to structure the data set. Again, because of the volume of data, we are looking for automatic techniques. One family of algorithms, known as clustering techniques, makes it possible to find groups of similar objects in data. In Sect. 17.4, after a short introduction and an overview of clustering techniques, we focus on one particular method: the growing self-organizing map. The third way of supporting a user in his task is visualization. Sophisticated visual interaction can be very useful, but unfortunately it is rarely exploited. In Sect. 17.5, we first briefly present the state of the art of visualization techniques and then focus on self-organizing maps, which are particularly well suited for visualization. The final step of retrieval system design is the integration and coordination of the different tools. This step is usually considered as interface design. We dedicate Sect. 17.6 to presenting some important aspects of it.
Once we have designed the retrieval system based on the above-mentioned criteria, we can focus on personalization. The idea is to adapt the system to an individual user or a group of users. This adjustment to the user can be considered for every type of interaction. All personalization is based on a model of the user, which can be configured manually or learned by analyzing user behavior and user feedback. User feedback can be requested explicitly or – if the system is suitably designed – learned from the interaction between system and user. Finally, in Sect. 17.7 we discuss, as an exemplification, personalization methods for self-organizing maps, after a short review of the state of the art in user modeling.
One of the major difficulties that multimedia retrieval systems face is that they have to deal with different forms of data (and data formats), for instance text (e.g. plain text, HTML), hyperlinks (e.g. HTML, PDF), audio (e.g. WAV, MIDI), images (e.g. JPEG, GIF) and video (e.g. MPEG, AVI). Not only is this polymorphism a problem, but also the fact that the information is not directly available. It is not like a text file, which we can simply index using the words appearing in the text. Usually, we have to index media data by annotating it – manually or automatically – in order to interact or work with it.
Formally, indexing is the process of attaching content-based labels to the media. For instance, the existing literature on video indexing defines it as the process of extracting the temporal location of a feature and its value from the video data. Currently, indexing in the video domain is generally done manually. But since the indexing effort is directly proportional to the granularity of video access, and since the number of available videos grows and new applications demand fine-grained access to video, automation of the indexing process becomes more and more essential. Presently it is reasonable
From an image, we can extract color, texture, sketch, shape, objects and their spatial relationships. The color feature is typically represented by image histograms. The texture describes the contrast, the uniformity, the coarseness, the roughness, the frequency and the directionality. To obtain these features, either statistical techniques are used (autocorrelation, co-occurrence matrix) or spectral techniques, for instance the detection of narrow peaks in the spectrum. The sketch gives an image containing only the object outlines; it is usually obtained by a combination of edge detection, thinning, shrinking and other transformations of this type. The shape describes global features such as circularity, eccentricity and major axis orientation, but also local ones such as points of curvature, corner locations, turning angles and algebraic moments. Note that the images themselves can already be the result of an extraction, for instance the key-frames, which are representative images from the visual stream (see video segmentation). Unfortunately, it is not yet possible to reliably extract a semantic content description of an image, even though several research groups are working on this problem.
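As a simple illustration of such low-level features, a normalized color histogram can be computed in a few lines; the sketch below (Python with NumPy) assumes an RGB image given as an array and uses an illustrative 8 x 8 x 8 binning:

import numpy as np

def color_histogram(image, bins=8):
    # normalized RGB color histogram of an (H, W, 3) uint8 image array
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()           # normalization makes images of different sizes comparable

# hypothetical usage with a random image
img = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
feature = color_histogram(img)
print(feature.shape)                   # (512,) feature vector usable for similarity search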
17.2.2 Audio
From the audio stream, we can extract the following basic characteristics: The
loudness as the strength of sound, determined by the amplitude of the sound
We can simply extract all images and the sound track from a video, or we can be more specific with respect to the video format and extract information about its structure by video segmentation and by extracting a representative image (called a key-frame) from each segment. The purpose of video segmentation is to partition the video stream into basic units (shots) in order to facilitate indexing and browsing and to give the video a structure similar to paragraphs in a text document. Current techniques make it possible not only to find the shots, but also to describe the transitions between them, for instance fade in, fade out, dissolve or wipe.
Videos are stored either uncompressed or compressed, e.g. using MPEG or some other video compression method. The segmentation techniques used in the uncompressed domain are based on pixel-wise or histogram comparison [4]. In the compressed domain [5] they are based either on coefficient manipulations, such as the inner product or the absolute difference, or on the motion vectors. Key-frames are usually extracted to reduce the image processing to one image per shot. The idea is to find a representative frame (containing characteristic features) from the sequence of video frames. One simple method consists in extracting the first or the tenth frame [5]. More sophisticated methods look for local minima of motion or significant pauses [6]. Lately, expert systems with rules based on the camera motion have been proposed.
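A histogram-comparison shot-boundary detector of this kind can be sketched as follows; the frame representation, the number of bins and the threshold are illustrative and would have to be tuned to the material:

import numpy as np

def gray_histogram(frame, bins=64):
    # normalized gray-level histogram of a frame given as a 2D uint8 array
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.4):
    # return indices where the histogram difference between consecutive frames exceeds the threshold
    boundaries = []
    prev = gray_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = gray_histogram(frame)
        diff = 0.5 * np.abs(cur - prev).sum()     # histogram difference in [0, 1]
        if diff > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries

# hypothetical usage: two "shots" of random frames with different brightness
shot1 = [np.random.randint(0, 100, (90, 120), dtype=np.uint8) for _ in range(5)]
shot2 = [np.random.randint(150, 256, (90, 120), dtype=np.uint8) for _ in range(5)]
print(detect_shot_boundaries(shot1 + shot2))       # expected: [5]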
One important characteristic of video is its temporal aspect. We can easily extract the following motions: the camera motion describes the real movement of the camera (translation and rotation) as well as its apparent movement (zoom in and out). It is usually obtained either by studying the optical flow after dividing the video into several regions [7] or by studying the motion vectors [8]. The object motion describes the trajectory obtained by tracking an object on the screen [9, 10].
Extracting Text
Since text can be used very efficiently to retrieve objects from a collection (see Sect. 17.2.4 below), a lot of research is currently being done on synchronizing text with the video or extracting text from it. One research direction is to synchronize the written script with the video [11]. Another one is to extract the text by transcribing the audio channel [12]. This is usually done with complex speech recognition techniques on pre-processed data. A further interesting challenge is to extract the written information appearing on the screen [13, 14]. The idea is to locate the text on the screen and then to recognize it. These techniques are still first attempts, but very often synchronized text is already available and can be exploited.
17.2.4 Text
element is set to one if the corresponding word is used in the document and to zero if it is not. If a query is encoded in such a vector, this encoding results in a simple Boolean search. With Boolean encoding, all terms are considered equally important for a specific query or comparison. To improve performance, term weighting schemes are usually applied, where the weights reflect the importance of a word in a specific document of the considered collection. Large weights are assigned to terms that are used frequently in relevant documents but rarely in the whole document collection [23]. Thus a weight w_{ik} for a term k in document i is computed as the term frequency tf_{ik} times the inverse document frequency idf_k, which describes the term specificity within the document collection. In [22] a weighting scheme was proposed that has meanwhile proven its usability in practice. Besides term frequency and inverse document frequency – defined as idf_k := log(N/n_k) – a length normalization factor is used to ensure that all documents have equal chances of being retrieved independent of their lengths:

w_{ik} = \frac{tf_{ik} \cdot \log(N/n_k)}{\sqrt{\sum_{j=1}^{t} (tf_{ij})^2 \cdot (\log(N/n_j))^2}} .   (17.1)
For a more detailed discussion of the vector space model and weighting
schemes see, e.g. [15, 23, 24, 25].
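As a small illustration, the weighting scheme (17.1) can be implemented directly; the toy document collection below is, of course, purely illustrative:

import math
from collections import Counter

docs = [["fuzzy", "retrieval", "fuzzy", "sets"],          # toy, already tokenized documents
        ["image", "retrieval", "indexing"],
        ["fuzzy", "indexing", "of", "image", "data"]]

N = len(docs)
vocab = sorted(set(w for d in docs for w in d))
n = {k: sum(1 for d in docs if k in d) for k in vocab}    # document frequency n_k

def weight_vector(doc):
    # length-normalized tf-idf weights w_ik as in (17.1)
    tf = Counter(doc)
    raw = {k: tf[k] * math.log(N / n[k]) for k in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {k: (v / norm if norm > 0 else 0.0) for k, v in raw.items()}

w0 = weight_vector(docs[0])
print({k: round(v, 3) for k, v in w0.items() if v > 0})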
To reduce the number of words, and thus the dimensionality of the vector space description of the document collection, the size of the dictionary of words describing the documents can be reduced by filtering stop words and by stemming the words used. The idea of stop word filtering is to remove words that bear little or no content information, such as articles, conjunctions and prepositions. Furthermore, words that occur extremely often can be said to carry little information for distinguishing between documents, and words that occur very seldom are likely to be of no particular statistical relevance; both can be removed from the dictionary [26]. Stemming methods try to build the basic forms of words, i.e. strip the plural "s" from nouns, the "ing" from verbs, or other affixes. A stem is a natural group of words with
equal (or very similar) meaning. After the stemming process, every word is
represented by its stem in the vector space description. Thus a feature of a
document vector D_i now describes a group of words. A well-known stemming algorithm was originally proposed by Porter [27]. He defined a set of production rules to iteratively transform (English) words into their stems.
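Both preprocessing steps can be combined in a few lines of code. The sketch below uses the Porter stemmer implementation of the NLTK library (assuming it is installed) together with a small, purely illustrative stop word list:

from nltk.stem import PorterStemmer

stop_words = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is", "are"}   # small illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    # lowercase, drop stop words and map the remaining words to their stems
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("Stemming maps connecting and connections to the same stem"))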
Automatic Indexing
To further decrease the number of words used in the vector description, indexing or keyword selection algorithms can also be applied (see, e.g. [19, 28]). In this case, only the selected keywords are used to describe the documents. A simple but efficient method for keyword selection is to extract keywords based on their entropy. In the approach discussed in [29], for example, for each word a in the vocabulary the entropy as defined in [30] was calculated:

W(a) = 1 + \frac{1}{\ln(m)} \sum_{i=1}^{m} p_i(a) \cdot \ln(p_i(a)) \quad \text{with} \quad p_i(a) = \frac{n_i(a)}{\sum_{j=1}^{m} n_j(a)} ,   (17.3)

where n_i(a) is the frequency of word a in document i and m is the number of documents.
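As an illustration, the entropy-based weight (17.3) can be computed, for example, as follows (toy data):

import math

def entropy_weight(word, docs):
    # entropy-based keyword weight W(a) as in (17.3); docs is a list of tokenized documents
    m = len(docs)
    counts = [d.count(word) for d in docs]          # n_i(a): frequency of the word in document i
    total = sum(counts)
    if total == 0 or m < 2:
        return 0.0
    entropy = sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1.0 + entropy / math.log(m)

docs = [["fuzzy", "fuzzy", "sets"], ["image", "retrieval"], ["fuzzy", "image", "indexing"]]
print(round(entropy_weight("fuzzy", docs), 3))      # near 0 for evenly spread words, 1 for concentrated ones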
Given the current state of the art, reliable and efficient automatic indexing is only possible for the low-level characteristics presented above. But it is clear that any intelligent interaction with multimedia data should be based on a higher level of description. For instance, in the case of a video we are more interested in finding a dialog than in finding a series of alternating shots. In the case of an image you can always query for a specific color and texture, but is this really what you want to do?
Current intelligent systems use high-level indexing, for instance a predefined term index or even ontological categories. Unfortunately, these high-level indexing techniques are based on manual annotation. Hence, they can only be used for small quantities of new video and do not intelligently exploit the automatically extracted information.
In addition, the characteristics extracted by the automatic techniques are clearly not crisp attributes (color, texture), or they are defined with a degree of truth (camera motion: fade-in, zoom out), or they are imprecise (30% noise, 50% speech). Therefore, fuzzy techniques seem to be a promising approach for dealing with this kind of data (see, for example, [31]).
17.3 Querying
17.3.1 Ranking
Ranking is the process of ordering a result set with respect to given criteria. Usually a similarity measure is used that computes a numerical similarity value between a document and a given query. For text document collections, ranking based on the vector space model and the tf × idf weighting scheme (17.1) discussed in the previous section has proven to provide good results. Here the query is considered as a document, and the similarity is computed using the scalar product (cosine; see (17.2)) of the query and each document (see also [23]).
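A minimal sketch of such a cosine-based ranking, operating on already weighted (e.g. tf × idf) sparse vectors represented as dictionaries; the vectors and weights below are illustrative:

import math

def cosine(u, v):
    # cosine similarity of two sparse vectors given as {term: weight} dictionaries
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

# hypothetical, already weighted document and query vectors
documents = {"d1": {"fuzzy": 0.8, "retrieval": 0.4},
             "d2": {"image": 0.7, "retrieval": 0.5},
             "d3": {"fuzzy": 0.3, "image": 0.6, "indexing": 0.5}}
query = {"fuzzy": 1.0, "retrieval": 1.0}

ranking = sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)
print(ranking)          # documents ordered by decreasing similarity to the query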
If fuzzy modifiers are to be used to define vague queries, e.g. using quantifiers like "most of (term1, term2, . . . )" or "term1 and at least two terms of (term2, term3, . . . )", ranking methods based on ordered weighted averaging (OWA) operators can be used [32]. If the document collection also provides information about cross-references or links between the documents, this information can be used to compute the "importance" of a document within the collection. A link is in this case considered as a vote for a document. A good example is the PageRank algorithm [33], used as part of the ranking procedure of the web search engine Google. It has proven to provide rankings that place heavily linked documents for the given search criteria in high list ranks.
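The link-based part of such a ranking can be illustrated by a small power-iteration sketch of the PageRank idea; the link graph and the damping factor of 0.85 are illustrative choices, not the parameters of any actual engine:

def pagerank(links, damping=0.85, iterations=50):
    # simple power-iteration PageRank; links maps each page to the list of pages it links to
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:                               # dangling page: distribute its rank evenly
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank

# hypothetical link structure of four pages
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print({p: round(r, 3) for p, r in pagerank(links).items()})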
• Users are accustomed to querying systems. Therefore any retrieval system should have a simple keyword query interface.
• Very often the raw results of the similarity (or distance) computation are used directly to rank from high to low. However, a short analysis of the ranking process can greatly improve the quality of the system, especially if diverse features are combined (a small sketch of such a score combination follows this list).
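Such an analysis can be as simple as normalizing the scores of each feature to [0, 1] before aggregating them with explicit weights; a minimal sketch with hypothetical scores and weights:

def normalize(scores):
    # min-max normalization of a {doc: score} dictionary to the interval [0, 1]
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.5 for d, s in scores.items()}

def combine(score_lists, weights):
    # weighted sum of several normalized score dictionaries (the weights are a design choice)
    result = {}
    for scores, w in zip(score_lists, weights):
        for d, s in normalize(scores).items():
            result[d] = result.get(d, 0.0) + w * s
    return result

# hypothetical scores from two different features (e.g. text similarity and color similarity)
text_scores = {"d1": 0.9, "d2": 0.2, "d3": 0.6}
color_scores = {"d1": 10.0, "d2": 80.0, "d3": 40.0}
combined = combine([text_scores, color_scores], weights=[0.7, 0.3])
print(sorted(combined, key=combined.get, reverse=True))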
input and output layer of the neural network encode positions in the high-dimensional data space. Thus, every unit in the output layer represents a prototype.
Before the learning phase of the network, the two-dimensional structure of the output units is fixed and the weights are initialized randomly. During learning, the sample vectors are repeatedly propagated through the network. The weights of the most similar prototype w_s (the winner neuron) are modified such that the prototype moves toward the input vector w_i. As similarity measure, usually the Euclidean distance or the scalar product is used. The weights w_s of the winner neuron are modified according to the following rule: ∀i : w_s := w_s + δ · (w_i − w_s), where δ is a learning rate.
To preserve the neighborhood relations, prototypes that are close to the
winner neuron in the two-dimensional structure are also moved in the same
direction. The strength of the modification decreases with the distance from
the winner neuron. Therefore, the adaptation method is extended by a neigh-
borhood function v:
[Figure: insertion of a new unit m into a growing self-organizing map. Legend: x_i, y_i: weight vectors; x_k: weight vector of the unit with the highest error; m: new unit; α, β: smoothness weights. The new weight vector x_m for m is computed as

x_m = \frac{1}{n+1} \Big[ x_k + \alpha \cdot (x_k - y_k) + \sum_{i=0,\, i \neq k}^{n} \big( x_i + \beta \cdot (x_i - y_i) \big) \Big] . ]
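For illustration, one basic training step of such a map (winner search, neighborhood function, movement toward the input) can be sketched as follows; the Gaussian neighborhood and all parameter values are illustrative, and the growing mechanism is omitted:

import numpy as np

def som_train_step(prototypes, positions, x, learning_rate=0.1, radius=1.5):
    # prototypes: (units, dim) weight vectors; positions: (units, 2) coordinates of the units on the map
    winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))          # most similar prototype (winner neuron)
    grid_dist = np.linalg.norm(positions - positions[winner], axis=1)   # distance on the two-dimensional map
    v = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))                   # neighborhood function
    prototypes += learning_rate * v[:, None] * (x - prototypes)         # move winner and neighbors toward the input
    return prototypes

rng = np.random.default_rng(0)
positions = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)   # hypothetical 3 x 3 map
prototypes = rng.random((9, 5))                                                   # 5-dimensional data space
for x in rng.random((100, 5)):          # repeatedly propagate sample vectors through the network
    prototypes = som_train_step(prototypes, positions, x)
print(prototypes.round(2))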
If these units accumulate high errors, which means that the assigned patterns cannot be classified appropriately, this part of the map starts to grow. Even if the considered neuron is an inner neuron, the additional data pushes the previously assigned patterns to outer areas, for which new neurons are created. This can be interpreted as an increase in the number of data items belonging to a specific area or cluster in data space or, if text documents are assigned to the map, as an increased number of publications concerning a specific topic. Therefore, dynamic changes can also be visualized by comparing maps that were trained incrementally, e.g. with newly published documents [49].
of documents, usually text flags are used, which represent either a keyword or the document category. Colors are frequently used to visualize the density, e.g. the number of documents in an area, or the difference to neighboring documents, e.g. in order to emphasize borders between different categories. If three-dimensional projections are used, the number of documents assigned to a specific area can, for example, be represented by the z-coordinate.
Other Techniques
query and information extracted from the user profile. Current web search engines usually sort hits only with respect to the given query and further information extracted from the web, e.g. based on a link analysis such as PageRank (see [33]), a popularity weighting of the site, or specific keywords. This usually results in an appropriate ordering of the result sets for popular topics and well-linked sites. Unfortunately, documents dealing with special (sub)topics are ranked very low, and a user has to provide very specific keywords in order to find the information provided by these documents. Especially for this problem, a user-specific ranking could be very beneficial.
Another method to support a user in navigation is to structure result sets as well as the whole data collection using, e.g., clustering methods as discussed above. In this way, groups of documents with similar topics can be obtained (see, for example, [37]). If result sets are grouped according to specific categories, the user gets a better overview of the result set and can refine his search more easily based on the proposed classes.
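As a sketch of this idea, a result set represented by tf × idf vectors can be grouped with a standard clustering algorithm, here k-means as provided by scikit-learn (the documents and the number of clusters are illustrative):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical result set of short documents
results = ["fuzzy sets in information retrieval",
           "image retrieval with color histograms",
           "fuzzy association rules for databases",
           "color and texture features of images"]

vectors = TfidfVectorizer().fit_transform(results)                  # tf-idf vectors of the result set
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for doc, label in zip(results, labels):
    print(label, doc)                                               # similar topics receive the same cluster label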
Fig. 17.8. User Profiling and Application of User Models in Retrieval Systems
between the document and its current node, and symmetrically to decrease and increase the weights of dissimilar features.
Let y_i be the feature vector of a document i, s the source and t the target node, and x_s and x_t the corresponding prototypes; then w is computed as described in the following. First, we compute an error vector e for each object based on the distance to the prototypes:

e_{ji}^{k} = d_{ji}^{k} \quad \forall k , \quad \text{where} \quad d_{ji} = \frac{y_i - x_j}{\| y_i - x_j \|} .   (17.5)
If we want to ensure that an object is moved from the source node to the
target node using feature weights, we have to assign higher weights to features
that are more similar to the target than to the source node. Thus for each
object we compute the difference of the distance vectors
The global weight vector is finally computed iteratively. For the initial weight vector we choose w^{(0)} = w_1, where w_1 is a vector whose elements are all equal to one. Then we compute a new global weight vector w^{(t+1)} by an element-wise multiplication:
The local learning method is quite similar to the method described above. However, instead of modifying the global weight vector w, we modify local weights assigned to the source and the target nodes (denoted here w_s and w_t). As before, we first compute an error vector e for each document based on the distance to the prototypes, as defined in (17.5). Then we set all elements of the weight vectors w_s and w_t to one and compute local document weights w_{si} and w_{ti} by adding (subtracting) the error terms to (from) the neutral weighting scheme w_1. Then we compute the local weights iteratively, similarly to the global weighting approach:
w_s^{k(t+1)} = w_s^{k(t)} \cdot w_{si}^{k} \quad \forall k , \quad \text{with} \quad w_{si} = w_1 + \eta \cdot e_{si} ,   (17.8)

and

w_t^{k(t+1)} = w_t^{k(t)} \cdot w_{ti}^{k} \quad \forall k , \quad \text{with} \quad w_{ti} = w_1 - \eta \cdot e_{ti} ,   (17.9)
where η is a learning rate. The weights assigned to the target and source node
are finally normalized such that the sum over all elements equals the number
of features in the vector, i.e.
\sum_k w_s^k = \sum_k w_t^k = \sum_k 1 .   (17.10)
In this way the weights assigned to features that achieved a higher (lower)
error are decreased (increased) for the target node and vice versa for the
source node.
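A compact sketch of this local weighting step for the target node, following (17.9) and the normalization (17.10); as a simplifying assumption, the elementwise absolute value of the normalized distance vector (17.5) is used as the error term, and all vectors and the learning rate are illustrative:

import numpy as np

def update_local_weights(w_t, y, x_t, eta=0.1):
    # w_t: current weight vector of the target node, y: document feature vector,
    # x_t: prototype of the target node (all 1D NumPy arrays of equal length)
    d = (y - x_t) / np.linalg.norm(y - x_t)      # normalized distance vector, cf. (17.5)
    e = np.abs(d)                                # simplifying assumption: use the absolute value as error term
    w_doc = 1.0 - eta * e                        # document weight derived from the neutral weighting, cf. (17.9)
    w_t = w_t * w_doc                            # elementwise multiplicative update
    return w_t * len(w_t) / w_t.sum()            # normalization (17.10): weights sum to the number of features

w_t = np.ones(4)
y = np.array([0.9, 0.1, 0.4, 0.3])
x_t = np.array([0.8, 0.3, 0.4, 0.1])
print(update_local_weights(w_t, y, x_t).round(3))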
With the local approach we just modified weighting vectors of the source
and target nodes. However, as adjacent map nodes should ideally contain similar documents, one could demand that the weights should not change abruptly between nodes. Thus, it is a natural extension of this approach to modify the weight vectors of the neighboring map units accordingly, with a mechanism similar to that used in the learning of the map. The influence of the target node t on a map node n is described by the neighborhood factor

g_{tn} = \begin{cases} 1 - \frac{dist(n,t)}{r} & \text{if } dist(n,t) < r, \\ 0 & \text{otherwise,} \end{cases}

where dist(x, y) is the radial distance in nodes between nodes x and y. Depending on the radius r of the neighborhood function, the result would lie between the local approach (r = 0) and the global approach (r = ∞). In the following, we present such an extension.
As for the local approach, we have one weighting vector per node. Then – as before – we start by computing an error vector e for each object based on the distance to the prototypes, as defined in (17.5). Based on the error vectors e, the weight vectors of each node n are computed iteratively. For the initial weight vector w_n^{k(0)} we choose a vector where all elements are equal to one. We then compute a new local weight vector for each node by an element-wise multiplication:
Fig. 17.10. A Prototypical Image Retrieval System: Overview of the Image Collec-
tion by a Map and Selected Clusters of Images
We note here that the general approach has by construction the local and
global approach as limiting models. In fact, depending on the radius r of the
neighborhood function, we obtain the local approach for r → 0 and the global
approach for r → ∞.
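A sketch of this neighborhood-weighted extension, using the factor g_tn defined above to scale the update of each node; the map layout, error vector and parameters are illustrative, and the error term is again assumed to be nonnegative:

import numpy as np

def neighborhood_factor(dist, r):
    # g_tn: linearly decreasing influence of the target node within radius r
    if r == 0:
        return 1.0 if dist == 0 else 0.0         # r -> 0 recovers the purely local approach
    return max(0.0, 1.0 - dist / r)

def update_node_weights(weights, node_dists, e, eta=0.1, r=2.0):
    # weights: (nodes, features) array; node_dists: radial map distance of each node to the target node;
    # e: elementwise (nonnegative) error vector of the moved document
    for n in range(len(weights)):
        g = neighborhood_factor(node_dists[n], r)
        weights[n] *= 1.0 - g * eta * e          # elementwise multiplicative update, scaled by g_tn
        weights[n] *= len(e) / weights[n].sum()  # keep the sum equal to the number of features
    return weights

weights = np.ones((5, 4))                        # five nodes in a row, four features
node_dists = np.array([0.0, 1.0, 2.0, 3.0, 4.0]) # radial distances to the target node
e = np.array([0.3, 0.7, 0.0, 0.7])
print(update_node_weights(weights, node_dists, e).round(3))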
The methods described above have been integrated into a prototype for
image retrieval shown in Fig. 17.10. Further details can be found in [74].
• The user modelling step should be integrated at the very end of the design process. All other functionalities of the system should be tested and evaluated before.
• A user model should contain at least a list of features or keywords describing the user's interests.
• The user should have access to his profile in order to increase his confidence in the system.
• Particular care should be taken over how user feedback is obtained. Should the user be asked directly, or should we "spy" on him? One hint for answering this question is that most users do not like to be disturbed too much by questions. This particular aspect has been the reason for the failure of several otherwise carefully designed systems.
References
1. Bonastre, J.F., Delacourt, P., Fredouille, C., Merlin, T., Wellekens, C.J.: A
speaker tracking system based on speaker turn detection for NIST evaluation. In:
Proc. of ICASSP 2000, Istanbul (2000)
2. Santo, M.D., Percannella, G., Sansone, C., Vento, M.: Classifying audio streams
of movies by a multi-expert system. In: Proc. of Int. Conf. on Image Analysis
and Processing (ICIAP01), Palermo, Italy (2001)
3. Montacié, C., Caraty, M.J.: A silence/noise/music/speech splitting algorithm.
In: Proc. of ICSLP, Sydney, Australia (1998)
4. Idris, F., Panchanathan, S.: Review of image and video indexing techniques.
Journal of Visual Communication and Image Representation 8 (1997) 146–166
5. Zhang, H.J., Low, C.Y., Smoliar, S.W., Wu, J.H.: Video parsing, retrieval and
browsing: an integrated and content-based solution. In: Proc. of ACM Multi-
media 95 – electronic proc., San Francisco, CA (1995)
6. Yu, H.H., Wolf, W.: A hierarchical multiresolution video shot transition detec-
tion scheme. Computer Vision and Image Understanding 75 (1999) 196–213
7. Sudhir, G., Lee, J.C.M.: Video annotation by motion interpretation using optical
flow streams. Journal of Visual Communication and Image Representation 7
(1996) 354–368
8. Pilu, M.: On using raw MPEG motion vectors to determine global camera motion.
Technical report, Digital Media Dept. of HP Lab., Bristol (1997)
9. Lee, S.Y., Kao, H.M.: Video indexing – an approach based on moving object
and track. SPIE 1908 (1993) 81–92
10. Sahouria, E.: Video indexing based on object motion. Master’s thesis, UC
Berkeley, CA (1997)
11. Ronfard, R., Thuong, T.T.: A framework for aligning and indexing movies with
their script. In: Proc. of IEEE International Conference on Multimedia & Expo
(ICME 2003), IEEE (2003)
12. Potamianos, G., Neti, C., Luettin, J., Matthews, I.: Audio-visual automatic
speech recognition: An overview. In Bailly, G., Vatikiotis-Bateson, E., Perrier,
P., eds.: Issues in Visual and Audio-Visual Speech Processing. MIT Press (2004)
13. Jain, A.K., Yu, B.: Automatic text location in images and video frames. Pattern
Recognition 31 (1998) 2055–2076
14. Wu, V., Manmatha, R., Riseman, E.M.: TextFinder: An automatic system to
detect and recognize text in images. IEEE Transactions on Pattern Analysis
and Machine Intelligence 21 (1999) 1224–1229
15. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing.
Communications of the ACM 18 (1975) 613–620 (see also TR74-218, Cornell
University, NY, USA)
16. Robertson, S.E.: The probability ranking principle. Journal of Documentation
33 (1977) 294–304
17. van Rijsbergen, C.J.: A non-classical logic for information retrieval. The Com-
puter Journal 29 (1986) 481–485
18. Turtle, H., Croft, W.B.: Inference networks for document retrieval. In: Proc.
of the 13th Int. Conf. on Research and Development in Information Retrieval,
New York, ACM (1990) 1–24
19. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K.: Indexing by latent
semantic analysis. Journal of the American Society for Information Sciences 41
(1990) 391–407
20. Kaski, S.: Dimensionality reduction by random mapping: Fast similarity compu-
tation for clustering. In: Proc. of the International Joint Conference on Artificial
Neural Networks (IJCNN’98). Volume 1., IEEE (1998) 413–418
21. Isbell, C.L., Viola, P.: Restructuring sparse high dimensional data for effec-
tive retrieval. In: Proc. of the Conference on Neural Information Processing
(NIPS’98). (1998) 480–486
22. Salton, G., Allan, J., Buckley, C.: Automatic structuring and retrieval of large
text files. Communications of the ACM 37 (1994) 97–108
23. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval.
Information Processing & Management 24 (1988) 513–523
24. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison
Wesley Longman (1999)
25. Greiff, W.R.: A theory of term weighting based on exploratory data analy-
sis. In: 21st Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, New York, NY, ACM (1998)
26. Frakes, W.B., Baeza-Yates, R.: Information Retrieval: Data Structures & Algo-
rithms. Prentice Hall, New Jersey (1992)
27. Porter, M.: An algorithm for suffix stripping. Program (1980) 130–137
28. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and
Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco
(1999)
29. Klose, A., Nürnberger, A., Kruse, R., Hartmann, G.K., Richards, M.: Interactive
text retrieval based on document similarities. Physics and Chemistry of the
Earth, Part A: Solid Earth and Geodesy 25 (2000) 649–654
30. Lochbaum, K.E., Streeter, L.A.: Combining and comparing the effectiveness of
latent semantic indexing and the ordinary vector space model for information
retrieval. Information Processing and Management 25 (1989) 665–676
31. Detyniecki, M.: Browsing a video with simple constrained queries over fuzzy
annotations. In: Flexible Query Answering Systems FQAS’2000, Warsaw (2000)
282–287
32. Yager, R.R.: A hierarchical document retrieval language. Information Retrieval
3 (2000) 357–377
33. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search en-
gine. In: Proc. of the 7th International World Wide Web Conference, Brisbane,
Australia (1998) 107–117
34. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA (2001)
35. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering
techniques. In: KDD Workshop on Text Mining. (2000) (see also TR 00-034,
University of Minnesota, MN)
36. Nürnberger, A.: Clustering of document collections using a growing self-
organizing map. In: Proc. of BISC International Workshop on Fuzzy Logic and
the Internet (FLINT 2001), Berkeley, USA, ERL, College of Engineering, Uni-
versity of California (2001) 136–141
37. Roussinov, D.G., Chen, H.: Information navigation on the web by clustering and
summarizing query results. Information Processing & Management 37 (2001)
789–816
38. Mendes, M.E., Sacks, L.: Dynamic knowledge representation for e-learning ap-
plications. In: Proc. of BISC International Workshop on Fuzzy Logic and the
Internet (FLINT 2001), Berkeley, USA, ERL, College of Engineering, University
of California (2001) 176–181
39. Weigend, A.S., Wiener, E.D., Pedersen, J.O.: Exploiting hierarchy in text cat-
egorization. Information Retrieval 1 (1999) 193–216
40. Ruiz, M.E., Srinivasan, P.: Hierarchical text categorization using neural net-
works. Information Retrieval 5 (2002) 87–118
41. Wermter, S.: Neural network agents for learning semantic text classification.
Information Retrieval 3 (2000) 87–103
42. Teuteberg, F.: Agentenbasierte Informationserschließung im WWW unter Einsatz von künstlichen neuronalen Netzen und Fuzzy-Logik. Künstliche Intelligenz 03/02 (2002) 69–70
43. Benkhalifa, M., Mouradi, A., Bouyakhf, H.: Integrating external knowledge to
supplement training data in semi-supervised learning for text categorization.
Information Retrieval 4 (2001) 91–113
44. Vegas, J., de la Fuente, P., Crestani, F.: A graphical user interface for struc-
tured document retrieval. In Crestani, F., Girolami, M., van Rijsbergen, C.J.,
eds.: Advances in Information Retrieval, Proc. of 24th BCS-IRSG European
Colloqium on IR Research, Berlin, Springer (2002) 268–283
45. Kohonen, T.: Self-Organization and Associative Memory. Springer-Verlag,
Berlin (1984)
46. Nürnberger, A.: Interactive text retrieval supported by growing self-organizing
maps. In Ojala, T., ed.: Proc. of the International Workshop on Information
Retrieval (IR 2001), Oulu, Finland, Infotech (2001) 61–70
47. Fritzke, B.: Growing cell structures – a self-organizing network for unsupervised
and supervised learning. Neural Networks 7 (1994) 1441–1460
48. Alahakoon, D., Halgamuge, S.K., Srinivasan, B.: Dynamic self-organizing maps
with controlled growth for knowledge discovery. IEEE Transactions on Neural
Networks 11 (2000) 601–614
49. Nürnberger, A., Detyniecki, M.: Visualizing changes in data collections using
growing self-organizing maps. In: Proc. of International Joint Conference on
Neural Networks (IJCNN 2002), Piscataway, IEEE (2002) 1912–1917
50. Hearst, M.A., Karadi, C.: Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In:
Proc. of the 20th Annual International ACM SIGIR Conference, ACM (1997)
246–255
51. Spoerri, A.: InfoCrystal: A Visual Tool for Information Retrieval. PhD thesis,
Massachusetts Institute of Technology, Cambridge, MA (1995)
52. Hemmje, M., Kunkel, C., Willett, A.: LyberWorld – a visualization user interface
supporting fulltext retrieval. In: Proc. of ACM SIGIR 94, ACM (1994) 254–259
53. Fox, K.L., Frieder, O., Knepper, M.M., Snowberg, E.J.: Sentinel: A multiple
engine information retrieval and visualization system. Journal of the American
Society of Information Science 50 (1999) 616–625
54. Pu, P., Pecenovic, Z.: Dynamic overview technique for image retrieval. In: Proc.
of Data Visualization 2000, Wien, Springer (2000) 43–52
55. Havre, S., Hetzler, E., Perrine, K., Jurrus, E., Miller, N.: Interactive visualization of multiple query results. In: Proc. of IEEE Symposium on Information
Visualization 2001, IEEE (2001) 105–112
56. Lin, X., Marchionini, G., Soergel, D.: A self-organizing semantic map for infor-
mation retrieval. In: Proc. of the 14th International ACM/SIGIR Conference
on Research and Development in Information Retrieval, New York, ACM Press
(1991) 262–269
57. Honkela, T., Kaski, S., Lagus, K., Kohonen, T.: Newsgroup exploration with the
WEBSOM method and browsing interface. Technical report, Helsinki University
of Technology, Neural Networks Research Center, Espoo, Finland (1996)
58. Honkela, T.: Self-Organizing Maps in Natural Language Processing. PhD thesis,
Helsinki University of Technology, Neural Networks Research Center, Espoo,
Finland (1997)
59. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paattero, V.,
Saarela, A.: Self organization of a massive document collection. IEEE Transac-
tions on Neural Networks 11 (2000) 574–585
60. Merkl, D.: Text classification with self-organizing maps: Some lessons learned.
Neurocomputing 21 (1998) 61–77
61. Boyack, K.W., Wylie, B.N., Davidson, G.S.: Domain visualization using VxInsight for science and technology management. Journal of the American Society for Information Science and Technology 53 (2002) 764–774
62. Wise, J.A., Thomas, J.J., Pennock, K., Lantrip, D., Pottier, M., Schur, A.,
Crow, V.: Visualizing the non-visual: Spatial analysis and interaction with
information from text documents. In: Proc. of IEEE Symposium on Information
Visualization ’95, IEEE Computer Society Press (1995) 51–58
63. Small, H.: Visualizing science by citation mapping. Journal of the American
Society for Information Science 50 (1999) 799–813
64. Nielsen, J.: Usability Engineering. Morgan Kaufmann Publishers (1994)
65. Shneiderman, B., Byrd, D., Croft, W.B.: Sorting out searching: A user-interface
framework for text searches. Communications of the ACM 41 (1998) 95–98
66. Klusch, M., ed.: Intelligent Information Agents. Springer Verlag, Berlin (1999)
67. Rocchio, J.J.: Relevance feedback in information retrieval. In Salton, G., ed.: The
SMART Retrieval System. Prentice Hall, Englewood Cliffs, NJ (1971) 313–323
68. Robertson, S.E., Jones, K.S.: Relevance weighting of search terms. Journal of
the American Society for Information Science 27 (1976) 129–146
69. Wong, S.K.M., Butz, C.J.: A Bayesian approach to user profiling in information retrieval.
Technology Letters 4 (2000) 50–56
70. Somlo, G.S., Howe, A.E.: Incremental clustering for profile maintenance in
information gathering web agents. In: Proc. of the 5th International Conference
on Autonomous Agents (AGENTS ’01), ACM Press (2001) 262–269
71. Joachims, T., Freitag, D., Mitchell, T.M.: WebWatcher: A tour guide for the
world wide web. In: Proc. of the International Joint Conferences on Artifi-
cial Intelligence (IJCAI 97), San Francisco, USA, Morgan Kaufmann Publishers
(1997) 770–777
72. Jameson, A.: Modeling both the context and the user. Personal and Ubiquitous
Computing 5 (2001) 29–33
73. Nürnberger, A., Klose, A., Kruse, R.: Self-organising maps for interactive search
in document databases. In Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh,
L.A., eds.: Intelligent Exploration of the Web. Physica-Verlag, Heidelberg (2002)
119–135
74. Nürnberger, A., Klose, A.: Improving clustering and visualization of multimedia
data using interactive user feedback. In: Proc. of the 9th International Confer-
ence on Information Processing and Management of Uncertainty in Knowledge-
Based Systems (IPMU 2002). (2002) 993–999
75. Nürnberger, A., Detyniecki, M.: User adaptive methods for interactive analysis
of document databases. In: Proc. of the European Symposium on Intelligent
Technologies (EUNITE 2002), Aachen, Verlag Mainz (2002)