Lazaros Iliadis
Ilias Maglogiannis
Chrisina Jayne (Eds.)
Engineering Applications
of Neural Networks
20th International Conference, EANN 2019
Xersonisos, Crete, Greece, May 24–26, 2019
Proceedings
Communications
in Computer and Information Science 1000
Commenced Publication in 2007
Founding and Former Series Editors:
Phoebe Chen, Alfredo Cuzzocrea, Xiaoyong Du, Orhun Kara, Ting Liu,
Krishna M. Sivalingam, Dominik Ślęzak, Takashi Washio, and Xiaokang Yang
Editors
John Macintyre, David Goldman Informatics Centre, University of Sunderland, Sunderland, UK
Lazaros Iliadis, Democritus University of Thrace, Xanthi, Greece
Ilias Maglogiannis, University of Piraeus, Piraeus, Greece
Chrisina Jayne, Oxford Brookes University, Oxford, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
EANN 2019 Preface
It is a fact that (according to Google) in September 2015 the search term “machine
learning” (ML) became more popular than the term “artificial intelligence” (AI).
According to the Economist, “data is the new oil of the 21st century.” Today, we are
living through the revolution of deep learning (DEL), convolutional neural networks (CNN),
and big data (BD). DEL, ML, and AI can be considered as a set of nested Russian dolls:
DEL is a subset of ML, which is in turn a subset of AI.
In the following years, AI will become more widely available owing to the
explosion of cloud computing.
EANN is a mature international scientific conference held in Europe and well
established in the scientific area of AI. Its history is long and very successful, following
and spreading the evolution of intelligent systems.
The first event was organized in Otaniemi, Finland, in 1995. Since then, it has had a
continuous and dynamic presence as a major global but mainly European scientific
event. More specifically, it has been organized in Finland, UK, Sweden, Gibraltar,
Poland, Italy, Spain, Bulgaria, and Greece. It has always been technically supported by
the International Neural Network Society (INNS) and more specifically by the EANN
Special Interest Group.
Following a long-standing tradition, this volume belongs to the Springer CCIS
series, and it contains the papers that were accepted for oral presentation at the
20th EANN conference (EANN 2019) and at the First Workshop on Pervasive Intelligence
(PEINT). The diverse nature of the papers presented demonstrates the vitality of AI
algorithms and approaches, and it certainly proves the very wide range of neural
network and AI applications as well.
The event was held during May 24–26, 2019, in the Aldemar Knossos Royal
five-star hotel in Crete, Greece.
The response of the international scientific community to the EANN 2019 call for
papers was more than satisfactory, with 74 papers initially submitted. All papers were
peer reviewed by at least two independent academic referees. Where needed, a third
referee was consulted to resolve any potential conflicts. A total of 48.6% of the
submitted manuscripts (36 papers) were accepted to be published as full papers
(12 pages long) in the Springer CCIS proceedings. Owing to the high quality of the
submissions, the Program Committee decided that it should additionally accept five
more submissions to be published as short papers (10 pages long).
PEINT, which was organized under the framework of EANN 2019, also followed
the same review and acceptance ratio rules. More specifically, the workshop accepted
four full papers out of nine submissions (44.4%).
The First Workshop on Pervasive Intelligence (PEINT), on timely AI and ANN subjects,
was organized under the framework of EANN 2019.
The workshop format included three short presentations by the keynote speakers,
followed by an interactive Q&A session where the panel members and audience
engaged in a lively debate on the topics discussed.
The subjects of their presentations were the following:
John Macintyre: “The Future of AI – Existential Threat or New Revolution?”
Andrew Starr: “Practical AI for Practical Problems”
Four keynote speakers were invited to give lectures on timely aspects of artificial
neural networks and AI:
1. Professor Plamen Angelov, University of Lancaster, UK: “Empirical Approach:
How to Get Fast, Interpretable Deep Learning”
2. Dr. Evangelos Eleftheriou, IBM Fellow, Cloud and Computing Infrastructure,
Zurich Research Laboratory Switzerland: “In-memory Computing: Accelerating AI
Applications”
3. Dr. John Oommen, Carleton University, Ottawa, Canada: “The Power of the ‘Pursuit’
Learning Paradigm in the Partitioning of Data”
4. Professor Panagiotis Papapetrou, Stockholm University, Sweden: “Learning from
Electronic Health Records: From Temporal Abstraction to Time Series
Interpretability”
A three-hour tutorial on “Automated Machine Learning for Bioinformatics and
Computational Biology” was given by Professor Ioannis Tsamardinos (Computer
Science Department of the University of Crete, co-founder of Gnosis Data Analysis PC, a
university spin-off company, and Affiliated Faculty at IACM-FORTH) and Professor
Vincenzo Lagani (Ilia State University, Tbilisi, Georgia, and Gnosis Data Analysis PC
co-founder).
Numerous bioinformaticians, computational biologists, and life scientists in general
apply supervised learning techniques and feature selection in their research
work. The tutorial was addressed to this audience, intending to shield them against
methodological pitfalls, to inform them about new methodologies and tools emerging in
the field of Auto-ML, and to increase their productivity.
The papers accepted for the 20th EANN conference are related to the following
thematic topics:
• Deep learning ANN
• Genetic algorithms - optimization
• Constraints modeling
• ANN training algorithms
• Social media intelligent modeling
• Text mining/machine translation
• Fuzzy modeling
• Biomedical and bioinformatics algorithms and systems
• Feature selection
• Emotion recognition
• Hybrid intelligent models
• Classification-pattern recognition
Executive Committee
General Chairs
John Macintyre University of Sunderland, UK (Dean of the Faculty of Applied Sciences
and Pro Vice Chancellor of the University of Sunderland)
Chrisina Jayne Oxford Brookes University, UK (Head of the School of Engineering,
Computing and Mathematics)
Ilias Maglogiannis University of Piraeus, Greece (President of the IFIP WG12.5)
Program Chairs
Lazaros Iliadis Democritus University of Thrace, Greece
Elias Pimenidis University of the West of England, Bristol, UK
Workshop Chairs
Christos Makris University of Patras, Greece
Phivos Mylonas Ionian University, Greece
Spyros Sioutas University of Patras, Greece
Advisory Chairs
Andreas Stafylopatis National Technical University of Athens, Greece
Georgios Vouros University of Piraeus, Greece
Honorary Chairs
Vera Kurkova Czech Academy of Sciences, Czech Republic
Barbara Hammer Bielefeld University, Germany
Program Committee
Michel Aldanondo IMT Mines Albi, France
Athanasios Alexiou NGCEF, Australia
Ioannis Anagnostopoulos University of Central Greece, Greece
George Anastassopoulos Democritus University of Thrace, Greece
Costin Badica University of Craiova, Romania
Co-chairs
Dimitris K. Iakovidis University of Thessaly, Greece
Evaggelos Spyrou University of Thessaly, Greece
Program Committee
Stylianos Asteriadis University of Maastricht, The Netherlands
Charis Dakolia University of Thessaly, Greece
Kostas Delibasis University of Thessaly, Greece
Theodore Giannakopoulos Behavioural Signals, Greece
Enrique Hortal University of Maastricht, The Netherlands
Barna Iantovics Mures University, Romania
Maria Kozyri University of Thessaly, Greece
Artur Krukowski Intracom S.A. Telecom Solutions, Greece
Athanasios Loukopoulos University of Thessaly, Greece
Sara Paiva Applied Research Centre for Digital Transformation,
Portugal
Michalis Papakostas University of Texas at Arlington, USA
Stavros Perantonis National Center for Scientific Research Demokritos,
Greece
Panagiotis Papapetrou
Abstract. The first part of the talk will focus on data mining methods for
learning from Electronic Health Records (EHRs), which are typically perceived
as big and complex patient data sources. Using these sources, scientists strive
to predict patients’ progress, to understand and predict response to therapy,
to detect adverse drug effects, and to address many other learning tasks. Medical
researchers are also interested in learning from cohorts of population-based
studies and of experiments. Learning tasks include the identification of disease
predictors that can lead to new diagnostic tests and the acquisition of insights on
interventions. The talk will elaborate on data sources, methods, and case studies
in medical mining.
The second part of the talk will tackle the issue of interpretability and
explainability of opaque machine learning models, with focus on time series
classification. Time series classification has received great attention over the past
decade with a wide range of methods focusing on predictive performance by
exploiting various types of temporal features. Nonetheless, little emphasis has
been placed on interpretability and explainability. This talk will formulate the
novel problem of explainable time series tweaking, where, given a time series
and an opaque classifier that provides a particular classification decision for the
time series, the objective is to find the minimum number of changes to be
performed to the given time series so that the classifier changes its decision to
another class. Moreover, it will be shown that the problem is NP-hard. Two
instantiations of the problem will be presented. The classifier under investigation
will be the random shapelet forest classifier. Moreover, two algorithmic
solutions for the two problem instantiations will be presented along with simple
optimizations, as well as a baseline solution using the nearest neighbor classifier.
1 Introduction
The Pursuit Concept in LA: Absolutely expedient LA are absorbing, and
there is always a small probability that they do not converge to the best action.
Thathachar and Sastry recognized this phenomenon and proposed using Maximum
Likelihood Estimators (MLEs) to hasten the LA’s convergence. Such an
MLE-based update method would utilize estimates of the reward probabilities in
the update equations. At every iteration, the estimated reward vector was also
used to update the action probabilities, instead of updating them based only on the
RE’s feedback. In this way, the probabilities of choosing the actions with higher
reward estimates were increased, and those with lower estimates were significantly
reduced; using this, they proposed the family of estimator algorithms.
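As an illustration of this idea (a minimal sketch, not the authors' exact formulation — the function name, the linear move toward the best action, and the learning-rate parameter `lam` are our assumptions), a pursuit-style update maintains ML estimates of the reward probabilities and pulls the action-probability vector toward the best-estimated action:

```python
def pursuit_update(p, est, counts, chosen, reward, lam=0.05):
    """One pursuit-style LA iteration (illustrative sketch): update the ML
    estimate of the chosen action's reward probability, then move the action
    probability vector toward the action with the best current estimate."""
    rewards, pulls = counts
    pulls[chosen] += 1
    rewards[chosen] += reward
    est[chosen] = rewards[chosen] / pulls[chosen]       # ML estimate of reward prob.
    best = max(range(len(p)), key=lambda a: est[a])     # rank actions by estimate
    # linear move toward the unit vector of the best-estimated action
    return [(1 - lam) * pi + (lam if a == best else 0.0)
            for a, pi in enumerate(p)], est
```

Note that the update depends on the reward *estimates*, not only on the RE's most recent feedback, which is precisely what distinguishes the estimator family.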
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 3–16, 2019.
https://doi.org/10.1007/978-3-030-20257-6_1
4 A. Shirvani and B. J. Oommen
² To be consistent with the terminology of LA, we use the terms “action”, “class”, and “group” synonymously.
³ The OMA’s algorithms/figures are in [11], and omitted here in the interest of space.
To assess the partitioning accuracy and the convergence speed of any EPP
solution, there must be an “oracle” with a pre-defined number of classes, each
class containing an equal number of objects. The OMA’s goal is to migrate
the objects between its classes, using the incoming queries. E is characterized by
three parameters: (a) W, the number of objects, (b) R, the number of partitions,
and (c) a probability p quantifying how E pairs the elements in the query.
Every query presented to the OMA by E consists of two objects. E randomly
selects an initial class with probability 1/R, and it then chooses the first object in
the query from it, say, q1. The second element of the pair, q2, is then chosen with
probability p from the same class, and with probability (1 − p) from one
of the other classes uniformly, each of them being chosen with probability
1/(R − 1). Thereafter, it chooses a random element from the second class uniformly.
We assume that E generates an “unending” continuous stream of query pairs.

Table 1. Experimental results for the OMA done for an ensemble of 100 experiments
in which we have only included the results from experiments where convergence has
occurred.
The results of the simulations are given in Table 1, where in OMA_pX, X
refers to the probability specified above, W is the number of objects, W/R is
the number of objects per class, and R is the number of classes. The results are
given as a pair (a, b), where a refers to the number of iterations for the OMA to
reach the first correct classification and b refers to the case where the OMA has
fully converged. In all experiments, the number of states of the OMA was set to
10. Also, while the OMA’s convergence for a single run need not be monotonic,
an ensemble of runs displays a monotonically decreasing pattern with time.
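The query-generation process of the Environment E described above can be sketched as follows (the function names and the even-split representation of the ground-truth partition are our assumptions, not code from [11]):

```python
import random

def make_environment(W, R, p, seed=0):
    """Sketch of the Environment E: W objects split evenly into R ground-truth
    classes; each call of the returned function yields one query pair."""
    rng = random.Random(seed)
    per_class = W // R   # assumes W is divisible by R, as in the paper's oracle
    # ground-truth partition: class k owns objects [k*per_class, (k+1)*per_class)
    classes = [list(range(k * per_class, (k + 1) * per_class)) for k in range(R)]

    def next_query():
        k = rng.randrange(R)                 # initial class, probability 1/R
        q1 = rng.choice(classes[k])          # first object from that class
        if rng.random() < p:                 # with probability p: same class
            q2 = rng.choice([o for o in classes[k] if o != q1])
        else:                                # else: one of the other R-1 classes,
            other = rng.choice([j for j in range(R) if j != k])  # each w.p. 1/(R-1)
            q2 = rng.choice(classes[other])
        return q1, q2

    return next_query
```

Calling `make_environment(12, 3, 0.9)` yields an "unending" stream of query pairs in which roughly 90% of the pairs come from the same ground-truth class.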
where P (Rk ) is the probability that the first element, Ai , is chosen from the
group Rk , and P (Aj |Ai ) is the conditional probability of choosing Aj , which
is also from Rk , after Ai has been chosen. Since E chooses the elements of the
pairs from the other groups uniformly, with a possible re-numbering operation,
the matrix M∗ = [μ∗ (i, j)] is a block-diagonal matrix given by Eq. (1).
$$
\mathbf{M}^* =
\begin{bmatrix}
\mathbf{M}^*_1 & 0 & \cdots & 0 \\
0 & \mathbf{M}^*_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mathbf{M}^*_R
\end{bmatrix}
\quad (1)
$$
$$
\mathbf{M}^*_r =
\begin{bmatrix}
0 & \frac{R}{W(W-R)} & \cdots & \frac{R}{W(W-R)} \\
\frac{R}{W(W-R)} & 0 & \cdots & \frac{R}{W(W-R)} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{R}{W(W-R)} & \frac{R}{W(W-R)} & \cdots & 0
\end{bmatrix}
\quad (2)
$$
In a real-world scenario where E is noisy, i.e., the objects from the different
groups can be paired together in a query, the general form for M∗ is:
$$
\mathbf{M}^* =
\begin{bmatrix}
\mathbf{M}^*_1 & \theta & \cdots & \theta \\
\theta & \mathbf{M}^*_2 & \cdots & \theta \\
\vdots & \vdots & \ddots & \vdots \\
\theta & \theta & \cdots & \mathbf{M}^*_R
\end{bmatrix},
\quad (3)
$$
where θ and M∗r s are specified as per Eqs. (4) and (5).
Theorem 2. In the presence of noise in E, the entries of the pair ⟨Ai, Aj⟩ can
be selected from two distinct classes, and hence the matrix specifying
the probabilities of the accesses of the pairs obeys Eq. (3), where:
$$
\theta = \theta_o \cdot
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{bmatrix},
\quad (4)
$$

$$
\mathbf{M}^*_r = \theta_d \cdot
\begin{bmatrix}
0 & 1 & \cdots & 1 \\
1 & 0 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 0
\end{bmatrix},
\quad (5)
$$
where $0 < \theta_d < 1$ is the coefficient which specifies the accuracy of E, and $\theta_o$ is
related to $\theta_d$ as
$$
\theta_d = \frac{1 - \theta_o\left(W - \frac{W}{R}\right)}{\frac{W}{R} - 1}.
$$
Proof. The proof of the theorem is omitted here and can be found in [11].
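For concreteness, the structure of the noisy matrix in Eqs. (3)–(5) can be built numerically. The helper below is our own sketch (the function name is an assumption); it derives θ_d from θ_o exactly as in Theorem 2, so that each row of the resulting matrix sums to 1 over the W − 1 other objects:

```python
def build_query_matrix(W, R, theta_o):
    """Sketch of the block matrix M* from Eqs. (3)-(5): entry (i, j) is the
    probability that objects Ai and Aj appear together in a query.
    Assumes W is divisible by R; the ground-truth class of object i is i // m."""
    m = W // R                                   # objects per class, W/R
    theta_d = (1.0 - theta_o * (W - m)) / (m - 1)   # Theorem 2 relation
    same_class = lambda i, j: i // m == j // m
    M = [[0.0 if i == j else (theta_d if same_class(i, j) else theta_o)
          for j in range(W)] for i in range(W)]
    return M, theta_d
```

With θ_o = 0 this reduces to the noise-free block-diagonal case of Eq. (1), and each row sums to (W/R − 1)·θ_d + (W − W/R)·θ_o = 1, confirming the relation between θ_d and θ_o.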
In a real-world scenario, since E’s true statistical model is unknown, the expressions
in Eqs. (1) and (2) can only be estimated by observing a set of queries.
In the presence of noise, though, we need to devise a measurable quantity which
makes the algorithm capable of recognizing divergent pairs.
Observe that whenever a real query ⟨Ai, Aj⟩ appears, we will be able to obtain
a simple ML estimate of how frequently Ai and Aj are accessed concurrently.
Clearly, by virtue of the Law of Large Numbers, these underlying estimates will
converge to the corresponding probabilities of E actually containing the elements
Ai and Aj in the same group. As the number of queries processed becomes larger,
the quantities inside M∗_i become significantly larger than the quantities in
each of the θ matrices. From the plot of these estimates [11], one will observe that
the estimates corresponding to the matrix M∗_i have much higher values than
the off-diagonal entries. This implies that these off-diagonal entries represent
divergent queries which move the objects away from their accurate partitions.
Intuitively, the pursuit concept for the OPP can be best represented by a matrix
of size W × W where every entry captures the same statistical measure about
the stream of the input pairs. For the sake of simplicity, we use a simple averaging
and denote this matrix by P. Every block represents a pair, and the height of
the block is set to the frequency count of the reciprocal pair. To obtain the
average frequency of each pair, we let the OMA iterate for a sufficient time, say J
iterations, and at every incident we update the value of the matrix P accordingly.
In this way, at the end of the J-th iteration, we have simply estimated the
frequency of each pair. At this point, by observing the values of the matrix, the
user can determine an appropriate threshold (τ > 0) to be adopted as the accept-or-reject
policy for any future occurrence of this particular pair of objects. If we
permit the algorithm to collect a large enough number of pairs, we see that
$\exists\,\theta^* \mid \forall \theta_o \le \theta^*,\ \forall i,j:\ \mu^*_{i,j} \gg \theta_{i,j}$.
If we utilize a user-defined threshold, τ (which is reasonably close to 0), we
will be able to compare every estimate to τ and make a meaningful decision about
the identity of the query. In other words, by merely comparing the estimate to
τ we can determine whether a query pair ⟨Ai, Aj⟩ should be processed or, quite
simply, ignored. This leads us to algorithm POMA on Page 74 of [11], in which
every query which is inferred to be divergent is ignored. Otherwise, one invokes
the Reward and Penalty functions of the original OMA algorithm. The issue of
determining the parameters of the POMA algorithm is detailed in [11], and
omitted here in the interest of space.
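The estimate-and-threshold procedure just described can be sketched as follows (a simplified illustration: the function name and the plain relative-frequency averaging are our assumptions; the actual POMA algorithm is in [11]):

```python
from collections import Counter

def poma_filter(queries, J, tau):
    """Sketch of the pursuit filter in POMA: estimate the relative frequency
    of each (unordered) pair over the first J queries, then accept or reject
    subsequent queries against the user-defined threshold tau."""
    counts = Counter()
    history = iter(queries)
    for _ in range(J):                       # estimation phase (J iterations)
        a, b = next(history)
        counts[frozenset((a, b))] += 1

    def accept(a, b):
        # pairs whose estimated frequency is at most tau are deemed divergent
        return counts[frozenset((a, b))] / J > tau

    return accept
```

Queries rejected by `accept` are simply ignored; accepted queries are passed on to the OMA's usual Reward and Penalty functions.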
In the initial phase of the algorithm, the estimates for the queries are unavailable.
Thus, it only makes sense to consider every single query and to process
them using the OMA’s Reward and Penalty functions. Since the objects in each
class are equally likely to occur and the classes are equi-probable,
$k \ge \left(\left(\frac{W}{R}\right)^2 - \frac{W}{R}\right) \times R$
is chosen as the lower bound of the number of iterations
for any meaningful initialization.
We have compared our results with those presented in [5] and those reported
for the original OMA for various values of R and W. The number of states in
every action was set to 10, and convergence was deemed to have taken
place as soon as all the objects in the POMA fell within the last two internal
states. The results (specified using the same notation as in Table 1) are
outstanding and are summarized in Table 2. The simulation results are based on
an ensemble of 100 runs with different uncertainty values (i.e., values of p).
To observe the efficiency of the POMA, consider an easy-to-learn Environment
of 6 groups with 2 objects in each group and where p = 0.9. It took the
OMA 599 iterations to converge. As opposed to this, the POMA converged in
only 69 iterations, representing almost a ten-fold improvement. On the other hand,
given a difficult-to-learn Environment with 12 objects in 2 groups, the OMA
needs 6,506 iterations to converge, while the POMA required only 2,112,
which is more than a three-fold improvement.
Table 2. Experimental results for the POMA approach done for an ensemble of 100
runs.
1. Initial Boundary State Distribution: All of the objects are initially distributed
at the respective boundary states of their respective classes;
2. Redefinition of Internal States: They diminished the vulnerability of the
convergence criterion of the OMA by redefining the internal state to include
“the two innermost states of each class”, rather than a single innermost state;
3. Breaking the Deadlock: The original OMA possesses a deadlock-prone
infirmity (please see Section 4.3 of [11]) in which the machine can cycle
between two identical configurations by virtue of a sequence of query pairs.
This is especially evident in noise-free Environments. The EOMA remedies
this as follows. Given a query pair of objects ⟨Oi, Oj⟩, let us assume that Oi is
in the boundary state of its class, and Oj is in a non-boundary (internal) state of another
class. If there exists an object in the boundary state of Oj’s class, we propose
that it gets swapped with the boundary object Oi, so as to bring both of the
queried objects together in the same class. Simultaneously, a non-boundary
object has to be moved toward the boundary state of its class. Otherwise, if
there is no object in the boundary state of the class that contained Oj, the
algorithm performs identically to the OMA.
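The deadlock-breaking swap can be sketched as follows (the dictionary-based state representation, the function name, and the omission of the simultaneous move of a non-boundary object are our simplifying assumptions):

```python
def break_deadlock(state, oi, oj, boundary_depth):
    """Sketch of the EOMA deadlock-breaking rule. `state` maps each object to a
    [class, depth] pair, where depth == boundary_depth marks the boundary state.
    Returns True if a swap was performed; otherwise the caller falls back to
    the plain OMA Reward/Penalty behaviour."""
    ci, di = state[oi]
    cj, dj = state[oj]
    # rule applies only when Oi is at its boundary and Oj is internal elsewhere
    if not (di == boundary_depth and dj != boundary_depth and ci != cj):
        return False
    # look for an object sitting at the boundary state of Oj's class
    for ok, (ck, dk) in state.items():
        if ck == cj and dk == boundary_depth and ok != oj:
            # swap Oi with that object: the queried pair is now in one class
            state[oi], state[ok] = [cj, boundary_depth], [ci, boundary_depth]
            return True
    return False                      # no boundary object: behave as the OMA
```

The swap places Oi in Oj's class, so the query pair is together, while the displaced object takes Oi's former boundary position.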
Table 3. Experimental results for the Enhanced OMA (EOMA) done for an ensemble
of 100 runs.
The convergence of the EOMA with respect to time starts with a large number
of objects located in random partitions. This number steadily decreases with
time to a very small value. The curve is not monotonic for any given experiment,
but from the perspective of an ensemble, the behavior is much closer to monotonic.
Table 4. Experimental results for the PEOMA approach done for an ensemble of 100
runs.
Although the gain is not significant for simple problems and easy Environments,
it becomes remarkably high for complex partitioning experiments.
Table 5. Experimental results for the TPEOMA approach done for an ensemble of
100 runs.
7 Conclusions
In this paper we have shown how we can utilize the “Pursuit” concept to enhance
solutions to the general problem of partitioning. Unlike traditional Learning
Automata (LA), which work with the understanding that the actions are chosen
purely based on the “state” in which the machine is, the “Pursuit” concept has
been used to estimate the Random Environment’s (RE’s) reward probabilities
and to take these into consideration in the design of Estimator/Pursuit LA. These
utilize “cheap” estimates of the Environment’s reward probabilities to rank the
actions, allowing them to converge up to an order of magnitude faster. Thereafter,
when the action probability vector has to be updated, it is done not on the basis
of the Environment’s response alone, but also based on the ranking of these estimates.
In this paper we have shown how the “Pursuit” learning paradigm can be, and
has been, used in Object Partitioning. The results demonstrate that incorporating
this paradigm can hasten the partitioning by an order of magnitude. This paper
comprehensively describes all the Object Migration Automaton (OMA)-related
machines to date, including the Enhanced OMA [5]. It then incorporates the
Pursuit paradigm to yield the Pursuit OMA (POMA), the Pursuit Enhanced
OMA (PEOMA), and the Transitive Pursuit Enhanced OMA (TPEOMA).
Apart from the schemes themselves, the paper reports the experimental
results that have been obtained by testing them on benchmark Environments.
References
1. Godsil, C., Royle, G.F.: Algebraic Graph Theory, vol. 207. Springer, New York
(2013)
2. Biggs, N.: Algebraic Graph Theory. Cambridge University Press, Cambridge (1993)
3. Fayyoumi, E., Oommen, B.J.: Achieving microaggregation for secure statistical
databases using fixed-structure partitioning-based learning automata. IEEE Trans.
Syst. Man Cybern. Part B (Cybern.) 39(5), 1192–1205 (2009)
4. Freuder, E.C.: The object partition problem. Vision Flash, WP-(4) (1971)
5. Gale, W., Das, S., Yu, C.T.: Improvements to an algorithm for equipartitioning.
IEEE Trans. Comput. 39(5), 706–710 (1990)
6. Jobava, A.: Intelligent traffic-aware consolidation of virtual machines in a data
center. Master’s thesis, University of Oslo (2015)
7. Lanctot, J.K., Oommen, B.J.: Discretized estimator learning automata. IEEE
Trans. Syst. Man Cybern. 22(6), 1473–1483 (1992)
8. Mamaghani, A.S., Mahi, M., Meybodi, M.: A learning automaton based approach
for data fragments allocation in distributed database systems. In: 2010 IEEE 10th
International Conference on Computer and Information Technology (CIT), pp.
8–12. IEEE (2010)
9. Oommen, B.J., Ma, D.C.Y.: Stochastic automata solutions to the object partition-
ing problem. Carleton University, School of Computer Science (1986)
10. Oommen, B.J., Agache, M.: Continuous and discretized pursuit learning schemes:
various algorithms and their comparison. IEEE Trans. Syst. Man Cybern. Part B:
Cybern. 31(3), 277–287 (2001)
11. Shirvani, A.: Novel solutions and applications of the object partitioning problem.
Ph.D. thesis, Carleton University, Ottawa, Canada (2018)
12. Yazidi, A., Granmo, O.C., Oommen, B.J.: Service selection in stochastic environ-
ments: a learning-automaton based solution. Appl. Intell. 36(3), 617–637 (2012)
13. Amer, A., Oommen, B.J.: A novel framework for self-organizing lists in environ-
ments with locality of reference: lists-on-lists. Comput. J. 50(2), 186–196 (2007)
AI in Energy Management - Industrial
Applications
A Benchmark Framework to Evaluate
Energy Disaggregation Solutions
1 Introduction
In the modern world, one of the biggest problems humanity is facing is the
inadequate management of electrical energy. Overconsumption causes a series of
negative phenomena that have both economic and ecological impact, at individ-
ual and mass scale. Numerous factors affect this problem, such as the dramatic
increase in usage of electrical devices in recent decades, as well as the global pop-
ulation growth. It is apparent that extensive research is required to have better
methods of controlling electrical energy consumption. The existence and appli-
cation of smart meters already contribute in that regard. Energy disaggregation
can make further use of those, to enhance load monitoring capabilities.
Energy disaggregation, also known as non-intrusive load monitoring (NILM),
is the task of decomposing an aggregate energy signal into its sub-components, i.e.,
identifying the individual appliance signals from the whole energy consumption
of a house. It was first introduced by Hart [4]. By applying NILM methods, load
monitoring becomes easier and less costly, since only a single meter is required,
from which the device-level signals can be extracted; in contrast, ILM
(intrusive load monitoring) methods need multiple meters per house.
Therefore, ILM has a much higher economic cost and increased difficulty of
installing and configuring the meters, its only advantage being its guaranteed
higher accuracy.
Many researchers focus on finding solutions where a model is trained per
appliance, having as input the whole-house energy data and as output the appliance
consumption. Each work defines its own test cases and investigates
a set of metrics upon them. As a result, many different possible scenarios are
created, which complicates the comparison between proposed and existing
solutions. There is a lack of structure to follow when conducting the evaluation
of a new model. This creates the necessity for a well-defined set of experiments.
Such a set should include a variety of scenarios that cover a wide range of
aspects and goals for the model under evaluation, each scenario tied to a
specific purpose that reveals whether the model is suitable for the case.
In this paper we propose a benchmark framework for the evaluation of NILM
solutions. In addition, five ANNs are combined with the method of Stacking and then
evaluated with the proposed framework. This paper is structured as follows: in
Sect. 2 some of the most popular neural network solutions that are important for
this study are reviewed. In Sect. 3 the proposed categorization of experiments
is described. Section 4 expands upon the method of Stacking. Section 5 presents
the most important results produced from Stacked learning. Section 6 includes
conclusions from the experiments and a discussion of future work. The implementation
of Stacking and the five ANNs used, a detailed spreadsheet containing the
defined experiments for each category, as well as baseline results, can be found
in the following repository: https://github.com/symeonick15/NILM-Stacking.
2 Related Work
There have been many approaches to solving this problem, among which Machine
Learning (ML) has taken the lead of research in recent years. ML methods exhibit
great generalization capability on unseen environments, without the need of
prior information. Initially, Hart proposed a combinatorial optimization method,
which suffered from working only with devices that had a finite number of states.
Later Factorial Hidden Markov Models (FHMM) have become quite popular and
many developed techniques were based on them [1,7,15,19].
The study around NILM has recently turned towards implementing solutions
based on artificial neural networks. Deep and convolutional neural networks
have become dominant in many fields like Computer Vision [9] and Natural
Language Processing [3], and have been used successfully in many problems that
include time series data. Their ability to extract features and handle complex
data has led researchers to start developing NILM solutions based on them,
outperforming former approaches [2,5,10–12,14,18].
Kelly and Knottenbelt [5] developed three ANN architectures for the task of
energy disaggregation. The first was a denoising autoencoder that handles
the aggregate signal as a “noisy” series and filters out the noise (i.e., signals
from other devices) to extract the target appliance consumption. The second was
a recurrent neural network that used LSTM (long short-term memory) units
and had the ability to “remember” previously given inputs and use them for
prediction. The last architecture would find the start time, end time, and mean
consumption of the first activation in a given window. All three were trained
on the UK-DALE dataset, having as input the whole-house aggregate signal
and as target the appliance consumption. An FHMM approach and Hart’s
combinatorial optimization algorithm were also used for the same experiments;
in comparison, the ANNs outperformed these two methods.
In another study, Mauch and Yang [12] implemented a different architecture
with LSTM units for the energy disaggregation task. The proposed deep recurrent
network had as input the aggregate energy value at a specific timestamp
and as output the appliance consumption at the same timestamp. Among the
goals of this network were to automatically extract features from low-frequency
data and to generalize well on unseen buildings. The REDD dataset was used for
training and prediction, for two ON/OFF devices and one multi-state device.
Results were promising on both seen and unseen buildings.
Zhang et al. [18] proposed a different deep convolutional architecture that
they named sequence-to-point. The method takes its name from the main idea
of predicting the value of a single time point based on a sequence of input
values that has the time point at its midpoint. The input window used had
600 samples (1 h). The network achieved state-of-the-art results and had great
representation power, as it bases its predictions both on the past and the
future of a time point.
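The sequence-to-point framing described above can be sketched with a simple data-preparation helper (our own illustration, not the authors' code; the function name is an assumption):

```python
def seq2point_windows(aggregate, window):
    """Sketch of the sequence-to-point framing: each input is a window of the
    aggregate signal, and the target is the appliance reading at the window's
    midpoint (we return the midpoint indices so the caller can look up the
    corresponding appliance values)."""
    half = window // 2
    inputs, mid_idx = [], []
    for t in range(half, len(aggregate) - half):
        inputs.append(aggregate[t - half: t + half + 1])  # window centred on t
        mid_idx.append(t)                                 # predict the value at t
    return inputs, mid_idx
```

Because every window extends both before and after its midpoint, a model trained on these pairs can exploit the past and the future of each time point, which is the source of the architecture's representation power.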
Krystalakos et al. [10] used Gated Recurrent Units during their experiments
to improve current LSTM architectures. To decrease the computational cost and
memory demands, LSTM neurons were replaced with GRU, fewer neurons were
used and dropout layers were added. Furthermore, a sliding window approach
was tested. The network was trained and tested on UK-DALE and compared
directly to implementations based on previous architectures, among which a
modified seq2point more suitable for online prediction (smaller input windows).
The proposed architecture showed promising results, especially on multi-state
devices, and it had the same or better results than LSTM while being lighter.
Although no common benchmark was followed by these studies, many used the same
or a similar structure to that presented in the work of Kelly and Knottenbelt
[5]. Most also shared similarities in their tests, such as having the trained
model predict in an unseen house, aiming to evaluate specific aspects of the
models.
In a review of ML approaches for NILM by Nalmpantis and Vrakas [13],
it is stated that an objective and direct comparison between different methods
is very difficult. The reason is that there are various metrics, many available
datasets, different criteria, and a variety of methodologies that one can choose
from.
22 N. Symeonidis et al.
3 Purpose
In this paper, we present a categorization of experiments whose purpose is to
provide a unified way of comparing models during research. The scenarios of the
proposed categorization include specific train and test cases, each with an
explained purpose that explores an aspect of the model, plus a basic set of
metrics to evaluate performance, both as classification and as regression. As a
reference point, it can be further expanded with more cases, or have its
existing cases modified, to accommodate additional scenarios should future
research with a more specific goal (e.g. commercial buildings) require it. The
proposed benchmark is defined on a set of five appliances; however, the
scenarios included can easily be generalized to any device, or even to
different datasets.
Moreover, the effects of combining several neural networks with the method of
stacking are investigated by comparing them using the aforementioned taxonomy.
Stacking has been widely used in machine learning to combine many different
models in order to achieve better results than each model individually. The
basic idea is to follow this method in the hope of improving the performance of
existing neural network architectures, and to measure the actual impact it has
on them.
4 Taxonomy of Experiments
In this section, the proposed categorization of experiments is described, for
use in the evaluation of new energy disaggregation models. Specifically, this
method is suitable for evaluating techniques (e.g. ANNs) that take an aggregate
signal as input and predict the consumption of a specific appliance. Four main
categories of experiments are described, together with the purpose each serves.
Although these can be expanded and/or modified to suit the goals of individual
studies, we also define the datasets and appliances used in this study.
The two datasets used in this study are Reference Energy Disaggregation
Data Set (REDD) [8] and UK-DALE [6]. Both datasets are freely available, sup-
port low-frequency data, refer to domestic buildings and have several houses and
appliances for testing. They are also two of the most popular datasets in the field
of NILM. The target appliances are: fridge, kettle, microwave, washing machine,
and dishwasher. These have been used by many researchers because they cover a
wide range of the appliance types a house may include. For example, the kettle
is a simple ON/OFF device, while the dishwasher is multi-state with more
complex behavior. Furthermore, they account for most of a building's
consumption and appear in most houses, making them a better target for
evaluation.
A Benchmark Framework to Evaluate Energy Disaggregation Solutions 23
the base models. After each model is trained, the second part is given as an input
to each model to get their predictions. The predictions generated are then com-
bined into one matrix that will make up the training input of a different learner
(usually simpler), the meta-learner, while the target output is left as is (from
the second part). That way the stacked model learns from the predictions of the
base models, so it can combine them. After that, the stacked model is ready for
prediction. The prediction follows a similar procedure, where each base model
is given a copy of the input and then their predictions are given as an input to
the meta-model to generate the final prediction. Stacking is especially efficient
when base learners make different errors.
5.2 Implementation
The above procedure was implemented as a 2-step method for the stacking exper-
iments. During the first step, each of the neural networks was trained on a part
of the training set. Each trained model was also given the second part of the
train set and the test set as input for prediction, to generate the train and test
set of the stacked model respectively. The predictions were saved as intermedi-
ate files to be reused. In the second step, the prediction for the stack train was
loaded and aligned on their timestamps. Then they were scaled and used to fit
the selected meta-regressor. In the final phase, the predictions on the test set
were loaded in the same way and given as input to the meta-model to generate
the final predictions.
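The second step above can be sketched as follows; an illustrative Python fragment (the function names and synthetic alignment are our own), using scikit-learn as the text states, with an AdaBoost configuration mirroring the AB-3d meta-regressor described in the results:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

def fit_meta(base_preds, y_true):
    """Fit the meta-regressor on the aligned base-model predictions
    (step 2 of the stacking procedure)."""
    X = np.column_stack(base_preds)            # one column per base model
    scaler = StandardScaler().fit(X)           # scale before fitting
    meta = AdaBoostRegressor(                  # AB-3d-style configuration
        DecisionTreeRegressor(max_depth=3),
        n_estimators=25, learning_rate=0.1, loss="square")
    meta.fit(scaler.transform(X), y_true)
    return scaler, meta

def stacked_predict(scaler, meta, base_preds):
    """Combine base-model test predictions into the final prediction."""
    return meta.predict(scaler.transform(np.column_stack(base_preds)))
```

Each base model's test-set predictions are stacked column-wise in the same way before being passed to `stacked_predict`.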
The sampling rate for all data was one sample per 6 seconds. Only real data
were used, with no synthetic data generation. The code was written in Python.
The implementation of the networks used was based on a previous study by
Krystalakos et al. [10], and they were developed using Keras with a TensorFlow
backend on GPUs. NILMTK was used for loading and preprocessing the data during
base-model training and the stacking phase. Scikit-learn was used for the
meta-regressors.
The metrics used for the evaluation were F1, Relative Error in Total Energy
(RETE) and Mean Absolute Error (MAE).
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1}$$

$$\mathrm{RETE} = \frac{|\hat{E} - E|}{\max(\hat{E}, E)} \tag{2}$$

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} |\hat{y}_t - y_t| \tag{3}$$

where $\hat{E}$ is the total predicted energy, $E$ the total true energy,
$\hat{y}_t$ the consumption predicted at time $t$, and $y_t$ the true
consumption at time $t$.
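The three metrics can be implemented directly; a minimal Python/NumPy sketch (the function names are ours), including the division-by-zero case that produces 'nan' F1 entries when a model never crosses the activation threshold:

```python
import numpy as np

def rete(pred, true):
    """Relative Error in Total Energy: |E_hat - E| / max(E_hat, E)."""
    e_hat, e = pred.sum(), true.sum()
    return abs(e_hat - e) / max(e_hat, e)

def mae(pred, true):
    """Mean Absolute Error over the T predicted time steps."""
    return np.mean(np.abs(pred - true))

def f1(pred, true, threshold):
    """F1 score, treating 'consumption above threshold' as the ON class."""
    p, t = pred > threshold, true > threshold
    if p.sum() == 0 or t.sum() == 0:
        return float("nan")  # no ON predictions/targets -> division by zero
    tp = np.sum(p & t)
    precision, recall = tp / p.sum(), tp / t.sum()
    if precision + recall == 0:
        return float("nan")
    return 2 * precision * recall / (precision + recall)
```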
repository and was not presented here due to its size. Some results are high-
lighted to indicate that they were the best among those that were tested (best
among base and among stacked). The following are the meta-regressors referenced
in the result matrices: AdaBoost with a Decision Tree of depth 3, 25
estimators, learning rate 0.1, and 'square' loss (AB-3d); AdaBoost with a
Decision Tree, 30 estimators, and learning rate 0.5 (AB-30); AdaBoost with a
Decision Tree, 15 estimators, and learning rate 0.5 (AB-15); a Multi-Layer
Perceptron with one hidden layer of 100 neurons (MLP); a Decision Tree
Regressor of depth 5, 15% min split ratio, and 9% min leaf ratio (DT5); a
simple Decision Tree Regressor (DT); and a Gradient Boosting Regressor with 25
estimators and learning rate 0.5 (GB).
As can be seen among the ANNs, some F1 scores are 'nan'. In those cases the
predictions of the model were never above the activation threshold, leading to
division by zero. This happens mostly in generalization tasks, demonstrating
the difficulty of disaggregating the signal of an unseen building. On the other
hand, stacking seems to have the advantage of overcoming this problem.
Results for Category 1 experiments on the fridge are shown in Table 1. Among
base models, GRU was a clear winner. Stacking with AB-3d mostly improved F1,
while AB-30 had very good RETE and MAE. AB-3d had short trees, so it could not
be as accurate, but managed to hit the activation threshold better. AB-30 had
predictions closer to the ground-truth values, as can be seen in Fig. 1, but
seems a bit "unstable" (many spikes where it should have continuous values);
applying a smoothing technique on top of it might improve it further.
Generally, stacking has good results, especially with AB-30, which enhances
regression efficiency.
Table 2 shows the results of Category 3 experiments for the fridge. Here the
best RETE was not improved; however, it is still better than 3 out of 5 base
models. MAE is reduced, while DT5 also has a very good F1. As above, the short
tree (DT5) is better suited for classification, while also being less prone to
overfitting. In
general, tree-based models seem to work better with the fridge, probably due to
the simple, repetitive nature of its time series.
Fig. 1. Sample signal plots including: original, predicted from ANN, predicted from
Stacking.
Across the results, some models appeared more suited than others to different
scenarios. The advantage of stacking was that it could either improve on them,
or at least find a fine point between them, making it a robust solution. The
tested stacked models had good results mostly for disaggregating simple devices
(fridge, kettle), especially in same-house train-test scenarios. Tree-based
models were mainly the best combiners, while AdaBoost could further enhance
them at the risk of overfitting. This risk was made apparent in generalization
experiments (Categories 2-4).
An example scenario that uses stacking could involve a weak, fast solution that
produces online results, which is later combined with the output of other
models to produce more accurate final predictions. For generalization tasks the
improvements were smaller, and on complex devices stacking sometimes did not
succeed, possibly due to their complicated, time-dependent behaviour, along
with the number of states and functions they have. This suggests that other
meta-regressors may be better suited, possibly more complicated,
time-series-oriented techniques such as neural networks, which also have better
generalization abilities. Another form of ensemble learning or stacking would
also be interesting to test, such as meta decision trees [16]. A further method
could even be applied on top to smooth or filter the predicted signal.
Regarding the proposed benchmark and the categories of experiments defined,
there could also be other similar scenarios not included here. For example, one
could have a training set combining the two datasets used (UK-DALE, REDD) and
test on both of them, or even on a third.
References
1. Aiad, M., Lee, P.H.: Non-intrusive load disaggregation with adaptive estimations of
devices main power effects and two-way interactions. Energy Build. 130, 131–139
(2016). https://doi.org/10.1016/j.enbuild.2016.08.050. http://www.sciencedirect.
com/science/article/pii/S0378778816307472
2. Chen, K., Wang, Q., He, Z., Chen, K., Hu, J., He, J.: Convolutional sequence
to sequence non-intrusive load monitoring. J. Eng. 2018(17), 1860–1864 (2018).
https://doi.org/10.1049/joe.2018.8352
3. Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850 (2013)
4. Hart, G.W.: Nonintrusive appliance load monitoring. Proc. IEEE 80(12), 1870–
1891 (1992)
5. Kelly, J., Knottenbelt, W.: The UK-DALE dataset, domestic appliance-level elec-
tricity demand and whole-house demand from five UK homes. Sci. Data 2, 150007
(2015). https://doi.org/10.1038/sdata.2015.7
6. Kelly, J., Knottenbelt, W.: The UK-DALE dataset, domestic appliance-level elec-
tricity demand and whole-house demand from five UK homes. Sci. Data 2, 150007
(2015)
7. Kolter, J.Z., Jaakkola, T.: Approximate inference in additive factorial HMMs
with application to energy disaggregation, June 2018. https://doi.org/10.1184/
R1/6603563.v1
8. Kolter, J.Z., Johnson, M.J.: REDD: a public data set for energy disaggregation
research. In: Workshop on Data Mining Applications in Sustainability (SIGKDD),
San Diego, CA, vol. 25, pp. 59–62 (2011)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
10. Krystalakos, O., Nalmpantis, C., Vrakas, D.: Sliding window approach for online
energy disaggregation using artificial neural networks. In: Proceedings of the 10th
Hellenic Conference on Artificial Intelligence, SETN 2018, pp. 7:1–7:6. ACM, New
York (2018). https://doi.org/10.1145/3200947.3201011
11. Lange, H., Bergés, M.: The neural energy decoder: energy disaggregation by com-
bining binary subcomponents (2016)
12. Mauch, L., Yang, B.: A new approach for supervised power disaggregation by using
a deep recurrent LSTM network. In: 2015 IEEE Global Conference on Signal and
Information Processing (GlobalSIP), pp. 63–67. IEEE (2015)
13. Nalmpantis, C., Vrakas, D.: Machine learning approaches for non-intrusive load
monitoring: from qualitative to quantitative comparation. Artif. Intell. Rev. 1–27
(2018)
14. Paradiso, F., Paganelli, F., Giuli, D., Capobianco, S.: Context-based energy disag-
gregation in smart homes. Future Internet 8(1) (2016). https://doi.org/10.3390/
fi8010004. http://www.mdpi.com/1999-5903/8/1/4
15. Parson, O., Ghosh, S., Weal, M., Rogers, A.: Non-intrusive load monitoring using
prior models of general appliance types. In: Twenty-Sixth AAAI Conference on
Artificial Intelligence (2012)
16. Todorovski, L., Džeroski, S.: Combining classifiers with meta decision trees. Mach.
Learn. 50(3), 223–249 (2003)
17. Zeifman, M.: Disaggregation of home energy display data using probabilistic app-
roach. IEEE Trans. Consum. Electron. 58(1), 23–31 (2012)
18. Zhang, C., Zhong, M., Wang, Z., Goddard, N., Sutton, C.: Sequence-to-point learn-
ing with neural networks for non-intrusive load monitoring. In: Thirty-Second
AAAI Conference on Artificial Intelligence (2018)
19. Zhong, M., Goddard, N., Sutton, C.: Signal aggregate constraints in additive fac-
torial HMMs, with application to energy disaggregation. In: Advances in Neural
Information Processing Systems, pp. 3590–3598 (2014)
Application of Deep Learning Long
Short-Term Memory in Energy
Demand Forecasting
1 Introduction
The introduction of smart meter technology in recent years has shaped the
metering infrastructure industry, providing smart, efficient, and regular
monitoring of energy consumption. Smart meters have been deployed in almost all
residential and industrial applications around the world, including Australia,
giving the potential for metering intelligence and analytics [1, 2]. A smart
meter usually captures aggregate energy consumption in either a 15- or 30-min
window, creating a historic profile of energy consumption for each user.
Energy consumption is mainly driven by the actions and behaviours of energy
consumers, which are shaped by their preferences. Since consumers' preferences
are likely to change over time, this introduces uncertainties into the daily
energy consumption pattern. Other factors influencing energy consumption
include economic conditions, climate change, holidays, working days, time
periods, and social and behavioural aspects [3]. In addition, electricity
demand is rising with increasing population and the introduction of appliances
such as dishwashers and clothes dryers into the household. Another contributor
to the increase in energy consumption is climate change, which contributes to a
spike in energy consumption for approximately 40 h a year, or 0.5%, according
to a report by Powercor Australia [4]. The supply side may not be able to
handle the spike in demand if it has not been properly planned for. One way to
manage the spike in demand is to expand electrical assets such as generation,
distribution, and transmission infrastructure to accommodate the increase in
energy consumption. However, this increase in assets is not suitable, as the
cost of installing the electrical infrastructure far outweighs the benefit of
meeting a spike in demand that only occurs about 0.5% of the year. Another way
of managing the sudden increase in energy consumption is through energy
management strategies applied on the demand side [5, 6]. In order to properly
and effectively manage peak energy consumption in the short term, accurate load
forecasting is required.
Load forecasting is one of the most important analytics for the smart grid, as
it provides a prediction of the likely future energy demand, within a margin of
error, allowing for timely decision making [7]. The purpose of demand
forecasting depends on the prediction period: short, medium, or long term.
Short-, medium-, and long-term forecasting are defined as covering less than
one week, between one week and one year, and more than one year, respectively
[8]. Short-term forecasting can be used as an input into a demand-side
management framework to help better address high peak consumption due to
specific events such as a heat wave or blizzard, depending on the season.
Medium-term forecasting is mostly used for load scheduling and for maximizing
the utilization of power distribution and transmission assets, whereas
long-term forecasting focuses on identifying the period when demand will be
lowest, in order to plan maintenance or shutdowns for upgrades. Long-term
forecasting is also used to plan upgrades to the power network due to a
constant, permanent increase in demand that is not prone to seasonal or daily
fluctuations [9].
Many applications of load forecasting based on statistical and machine learning
technologies have been developed in the literature [10]; time series and
artificial neural networks are the most common techniques for short-term and
medium-term forecasting [11-13]. Other short-term forecasting approaches are
based on deep learning Long Short-Term Memory (LSTM), which has been shown in
[14, 15] to be effective compared with traditional approaches. LSTM has also
been widely used and proven effective in short- and medium-term forecasting
[16, 17].
A recent study by [18], which developed a high-precision ANN for load
forecasting (DeepEnergy), concluded that the proposed technique exceeds a
traditional LSTM network in terms of Mean Absolute Percentage Error (MAPE) for
a 3-day-ahead forecast. However, in their implementation of LSTM they did not
consider other features to effectively train the network. Moreover, they used
data from two different months for training and data from a third month for
forecasting, which may affect the prediction accuracy of LSTM, as consumption
can differ from one month to another. Another study, conducted by [2] using a
dynamic neural network for load forecasting, appears to achieve good accuracy;
however, similar to the work by [18], the experiment does not consider other
features to improve the accuracy of the model, depending solely on the historic
energy consumption.
In this paper we propose an LSTM deep learning model to forecast energy demand
for clusters of energy users in the short and medium term. The difference
between our work and the work in the literature lies in the way we pre-process
the raw data for training and in the use of two additional features, beyond the
historic energy consumption, to improve the forecasting process. In terms of
training the LSTM, most works in the literature use the past history of a time
series to forecast future values; in this work, however, we propose a
time-feature curve that defines a unique time instance across the week. In
doing this, the data can be shuffled during training to avoid overfitting. The
first trial is to forecast 3-day-ahead energy consumption in each month, and
the second trial is to forecast 15-day-ahead energy consumption in a year. This
work is anticipated to be an important step toward developing an LSTM model to
accurately forecast peak energy consumption in the short and medium term. The
model can be used by utilities to prepare for spikes in energy, and hence power
demand, during a heat wave.
2 Research Approach
where h_{t-1} is the output of the LSTM cell at the previous time step.
Similarly, the cell state c is updated at each step. At each time step, the
LSTM uses the time series to compute c_t and h_t, which are then fed into a
regression layer to predict the time-series value. We used the root mean square
error as the performance metric during training.
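For reference, the cell and gate updates referred to here follow the standard LSTM formulation of Hochreiter and Schmidhuber (the display equations were lost in this excerpt; this is the usual textbook form, not necessarily the authors' exact notation):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $x_t$ is the input at time $t$, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.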
3 Data Processing
Combining all data files into one produces a 3-by-10,698,912 matrix, where the
first, second, and third columns correspond to consumer label, consumption
time, and energy consumption, respectively. The next step is to pre-process the
raw data into a feature vector for clustering and then forecasting.
3.3 Clustering
Forecasting the load of an individual energy consumer is often impracticable in
the residential sector, as each consumer contributes only a small proportion of
the loading of the transformer or connection point, unless the consumer is a
large industrial load, which can be treated separately. As a result, we employ
a metric to cluster the dataset into an optimal number of clusters, thus
reducing the number of loads to a manageable set of time series for
forecasting. The metric is based on the eigenvalues of the correlation matrix
between the clusters. We use self-organizing maps to group energy consumers
into 2 to n clusters and deduce a cluster's representative profile by taking
the mean of all energy consumption profiles within the cluster. We then use the
metric to decide whether the number of clusters is optimal. We found that the
optimal number of clusters for this dataset is 4, which are plotted in Fig. 1.
The clustering technique is discussed in another paper and is outside the scope
of this work.
Fig. 1. Representative cluster consumption profile of 609 energy consumers within the dataset
The daily temperature affects consumers' behavior, leading them to use less or
more energy depending on how hot or cold the weather is at a specific time.
Temperatures in the range of 19 to 25 °C are comfortable and unlikely to
influence energy consumption; however, any temperature above or below the
comfortable range is likely to influence it. Figure 2a shows a typical plot of
a temperature profile across the day, and Fig. 3 shows how the clusters' energy
consumption changes during a hot day, when the temperature is higher than the
comfortable range, compared with a normal day. This is evident from the energy
consumption range of 0.1 to 1.2 kWh in Fig. 3a during a hot day, compared with
the range of 0.1 to 0.35 kWh during a normal day in Fig. 3c.
Time affects energy consumers, as it represents the instance when specific
energy is consumed. This is usually different for each consumer, since
consumption is affected by daily behavior such as the scheduling of regular
loads like laundry, dish washing, TV, and other electronic devices. Another
factor that impacts consumption is whether the house becomes unoccupied during
certain times of the day, when parents are at work and children at school; this
can change from one family to another. The day of the week is also important,
as consumer behavior is likely to differ on weekdays compared with weekends. As
a result, it is vital to propose a time feature where each time instance in a
7-day week is unique and each day of the week is unique as well. We construct
the feature vector from the time instances, where each day is given a number
from 1 to 7 (2-6 representing weekdays and 1 and 7 representing weekends), and
each 30-min interval in the day is given a number from 1 to 48, where 1
corresponds to 00:00 and 48 corresponds to 23:30. The time instance then
becomes a factor of day and time by appending the time number to the end of the
day number. For example, 8:00 am on Tuesday is written as 318, where 3
corresponds to Tuesday and 18 corresponds to 8:00 am. We then divide this
number by the highest value, which is 748, corresponding to Saturday at 23:30
(11:30 pm), to get a time feature value between 0 and 1. Figure 2b is a plot of
the time feature profile across the week.
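The encoding above can be sketched as follows; a Python illustration under our own reading of the scheme (in particular, we treat the interval as a two-digit field, so day 3, interval 5 becomes 305):

```python
def time_feature(day: int, interval: int) -> float:
    """Map (day of week 1-7, 30-min interval 1-48) to a unique scalar in (0, 1].
    Days 2-6 are weekdays; 1 and 7 are weekend days. The interval number is
    appended to the day number (assumed zero-padded to two digits) and the
    result divided by the maximum code, 748 (day 7, interval 48)."""
    assert 1 <= day <= 7 and 1 <= interval <= 48
    return (day * 100 + interval) / 748
```

For example, 8:00 am on Tuesday (day 3, interval 18) maps to 318/748, roughly 0.425, matching the worked example in the text.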
Fig. 2. (a) Temperature profile in °C across the day for 0 March 2015; (b) Time
feature profile across the week, giving each 30-min time instance a unique number
Fig. 3. (a) Clusters' energy consumption on a hot day; (b) The temperature
across the day for 13 January 2016; (c) Clusters' energy consumption on a
normal day; (d) The temperature across the day for 16 October 2015
4 Experimental Results
We conducted the simulation over three experiments, each testing a different
hypothesis. The first experiment tests the effectiveness of LSTM's short-term
forecasting capabilities by comparing the performance error of 3-day-ahead
forecasts for cluster 1 across different months of the year, using the historic
energy consumption along with either the temperature or the time feature, and
using all three features together. The second experiment uses both the time and
temperature features to forecast energy consumption for clusters 2, 3, and 4,
and the third experiment aims to forecast 15 days ahead in a year. In all cases
the dataset is divided in a ratio of 70, 10, and 10 for training, validation,
and testing, respectively. In the case of 3-day-ahead forecasts the
38 N. Al Khafaf et al.
Fig. 4. The mean absolute percentage error (a) and the root mean square error (b) for 3-day
ahead prediction for LSTM in each month and the average MAPE across all months for 2 and 3
features
Figure 5 shows the 3-day-ahead energy consumption forecasts for April, June,
March, and November. It can be observed from the March forecasts that none of
the feature combinations used for training was able to accurately predict the
second and third peak consumption. This can be attributed to the March raw data
coming from two different years, as shown in Table 1. However, the intensities
of the peak consumptions in April, June, and November are forecast with
acceptable accuracy. Overall, the forecasts using three features are more
accurate than those using two.
Fig. 5. 3-day ahead forecasts for 2 and 3 features in the months of March (a), April (b), June
(c) and November (d)
Fig. 6. 3-day ahead forecasts using 3 features in the months of February and September for
cluster 2 (a) and (b), cluster 3 (c) and (d) and cluster 4 (e) and (f)
figure that the energy consumption is forecast with an acceptable forecasting
error of 3.7624% and 3.61% in terms of MAPE and RMSE, respectively. A
15-day-ahead prediction can be very useful for utilities to determine whether
peak consumption is likely to occur during the 15 days, so they can plan for it
accordingly.
Fig. 7. 15 days ahead energy consumption forecast for cluster 4 with MAPE of 3.7624% and
RMSE of 3.61%
This work focuses on the potential of using deep learning LSTM to forecast 3-
and 15-day-ahead energy demand of different load profiles. The outcome of the
research suggests that LSTM is a strong architecture for both short- and
medium-term forecasting. We have also shown that defining effective features
improves the forecasting model. As load and energy demand is critical for
demand-side management, this work can provide utilities with the information
needed to make decisions on how to better manage peak energy demand, with an
error of 3.15%. As future work, we plan to improve the LSTM forecasting in the
short term, accurately forecast 3 months ahead by tweaking the existing
architecture, and develop an LSTM deep learning network for long-term
forecasting.
References
1. Alahakoon, D., Yu, X.: Smart electricity meter data intelligence for future energy systems: a
survey (2016)
2. Wang, Y., Chen, Q., Hong, T., Kang, C.: Review of smart meter data analytics: applications,
methodologies, and challenges. IEEE Trans. Smart Grid PP(99), 1 (2018)
3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
4. P. A. Citipower: Demand Side Management Strategy (2016). Accessed 1 Mar 2019
5. Barton, J., et al.: The evolution of electricity demand and the role for demand side
participation, in buildings and transport. Energy Policy 52(C), 85–102 (2013)
6. Fernández, M.R., García, A.C., Alonso, I.G., Casanova, E.Z.: Using the big data generated
by the smart home to improve energy efficiency management. Energy Effic. 9(1), 249–260
(2016)
7. Mirowski, P., Chen, S., Ho, T.K., Yu, C.N.: Demand forecasting in smart grids. Bell Labs
Tech. J. 18(4), 135–158 (2014)
8. Raza, M.Q., Khosravi, A.: A review on artificial intelligence based load demand forecasting
techniques for smart grid and buildings. Renew. Sustain. Energy Rev. 50, 1352–1372 (2015)
9. Zhao, K., Gan, L., Wang, H., Ye, A.: Application of combination forecast model in the
medium and long term power load forecast. Int. J. Comput. Sci. Issues (IJCSI) 9(5), 24
(2012)
10. Alahakoon, D., Yu, X.: Smart electricity meter data intelligence for future energy systems: a
survey. IEEE Trans. Ind. Inform. 12(1), 425–436 (2016)
11. Kwang-Ho, K., Hyoung-Sun, Y., Yong-Cheol, K.: Short-term load forecasting for special
days in anomalous load conditions using neural networks and fuzzy inference method. IEEE
Trans. Power Syst. 15(2), 559–565 (2000)
12. Verdu, S.V., Garcia, M.O., Senabre, C., Marin, A.G., Franco, F.J.G.: Classification, filtering,
and identification of electrical customer load patterns through the use of self-organizing
maps. IEEE Trans. Power Syst. 21(4), 1672–1682 (2006)
13. Pao, H.-T.: Forecasting electricity market pricing using artificial neural networks. Energy
Convers. Manag. 48(3), 907–912 (2007)
14. Hossen, T., Nair, A.S.: Residential load forecasting using deep neural networks
(DNN) (2018)
15. Liu, C., Jin, Z., Gu, J., Qiu, C.: Short-term load forecasting using a long short-term memory
network. In: 2017 IEEE PES Innovative Smart Grid Technologies Conference Europe
(ISGT-Europe), pp. 1–6 (2017)
16. Han, L., Peng, Y., Li, Y., Yong, B., Zhou, Q., Shu, L.: Enhanced deep networks for short-
term and medium-term load forecasting. IEEE Access 7, 4045–4055 (2019)
17. Kong, W., Dong, Z.Y., Jia, Y., Hill, D.J., Xu, Y., Zhang, Y.: Short-term residential load
forecasting based on LSTM recurrent neural network. IEEE Trans. Smart Grid 10(1), 841–
851 (2019)
18. Kuo, P.-H., Huang, C.-J.: A high precision artificial neural networks model for short-term
energy load forecasting. Energies 11(1), 213 (2018)
Modelling of Compressors
in an Industrial CO2 -Based Operational
Cooling System Using ANN for Energy
Management Purposes
1 Introduction
It is complex to design and operate an efficient building energy system that incor-
porates multiple elements of new and emerging technologies [7]. The increase in
building-integrated intermittent renewable energy production, local energy stor-
age, and micro-grid solutions provides the building operator with a multitude
of options in choosing the optimal operational mode of all the components at
any given time. Implementation of an Intelligent Energy Management System
(IEMS) is one way to automate this decision-making process in order to reduce
the total energy cost for a building [2,12–15]. An IEMS can be tasked to pre-
dict short- and long-term energy demand and local energy production in order
to continuously design an optimal schedule for all energy storage options, while
also considering energy price fluctuations and peak power tariffs.
For the heating and cooling demands, heat pumps and Cooling Systems (CS)
are widely accepted as an efficient way to produce thermal energy, with con-
tinual improvements being made to maximize efficiency [3]. In technologically
advanced food distribution warehouses, large-scale CS represent a large part
of the building's total energy demand. Changing operating conditions, such as
weather (including ambient temperature), flow of goods and building occupant
behaviour, will continuously impact CS performance [3,11]. This is especially
true for more environmentally friendly working fluids, such as CO2 , that have
made their relatively recent re-entry into the field of refrigeration technology
[8,9,11]. CO2 -based large scale multi-stage cooling systems are becoming com-
mon as natural refrigerants with low global warming potential and ozone deple-
tion potential replace synthetic refrigerants. One way to increase the efficiency of
these systems during operation is to produce thermal energy (heating or cooling)
at the most ideal operating conditions and store the energy in a thermal energy
storage (TES) for later use [1,10]. This requires an IEMS that is given accurate
energy measurements and individual system performance data. For the CS, this
includes cooling load which requires working fluid flow measurement. However,
because accurate CO2 flow measurements are difficult, energy measurement of
cooling demand supplied with CO2 as the working fluid of energy distribution,
typically in warehouses with large cooling and freezing storage areas, is usually
unavailable. Theoretical calculation of system efficiency, or Coefficient of Perfor-
mance (COP), is therefore necessary to determine system performance. However,
industrially sized CSs are usually unique and built from intellectual-property (IP) protected components that limit the system owner's and operator's options for
continual performance evaluation. In many cases, system performance at given
operating conditions can only be calculated by the supplier using a proprietary
model, but the details of the model itself are not shared. Therefore, openly avail-
able alternatives are necessary in order to model the system for performance
evaluation purposes to provide reliable input to an IEMS.
In this work, we use an Artificial Neural Network (ANN) to model the freezing
stage compressors of an industrial and operational two-stage CO2 -based CS.
ANNs have already shown promising results in performance prediction modelling
of heat pump technology [5], but in [5] the training data set consisted of a
Table 2. Energy system components; capacities in [kWp], [kW], [kWh], [m³] and [kW_thermal].
cooling energy produced during optimal operating conditions for the CS. In order
to store excess energy from the PV-plant, the electrical energy is converted to
thermal energy by the CS and stored in the CES as chilled water in a temperature
range between 7 °C and 15 °C. In the evening, when production from the PV-
plant is naturally reduced, the CES is discharged by directly supplying cooling
energy for ventilation air and IT-servers. Alternatively, the CES can be used
to optimize cooling energy production by charging and discharging based on
varying operating conditions, such as current and predicted cooling load, and
current and predicted ambient air temperature.
Fig. 2. The case-study cooling system as visualized in the building management system,
courtesy of IWMAC. Freezing compressors and CO2 distribution to evaporators roughly
outlined in the red dashed box, ANN model input values marked in bold (CF in %).
(Color figure online)
For the IEMS to make the optimal choice of operating mode for the TES,
the performance of the CS must be evaluated both at current and future oper-
ating conditions. In this work, ANN models for theoretical calculation of cooling
load and compressor power consumption based on available compressor data are explored. The models have been developed for the three identical semi-hermetic
reciprocating sub-critical compressors, denoted FM1, FM2 and FM3 within the
red dashed box in Fig. 2, but the whole CS will be modelled in future work.
Compressor performance calculation data for the 4CSL-12K compressors was
collected from the website of the manufacturer, Bitzer™. Theoretical values were
calculated using the following given equation (according to EN12900):
48 S. M. Opalic et al.
Tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))          (3)

Tansig(z) = 2 / (1 + e^(−2z)) − 1                  (4)

ReLU(z) = max(0, z)                                (5)

Sigmoid(z) = e^z / (1 + e^z)                       (6)
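For reference, the four activation functions of Eqs. (3)–(6) can be written as a short NumPy sketch (an illustration, not the authors' code). Note that Tanh (3) and Tansig (4) are algebraically the same function, written two different ways:

```python
import numpy as np

def tanh_act(z):
    # Eq. (3): Tanh(z) = (e^z - e^-z) / (e^z + e^-z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def tansig(z):
    # Eq. (4): Tansig(z) = 2 / (1 + e^(-2z)) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0

def relu(z):
    # Eq. (5): ReLU(z) = max(0, z)
    return np.maximum(0.0, z)

def sigmoid(z):
    # Eq. (6): Sigmoid(z) = e^z / (1 + e^z)
    return np.exp(z) / (1.0 + np.exp(z))

z = np.linspace(-3.0, 3.0, 7)
# Tanh and Tansig coincide everywhere
assert np.allclose(tanh_act(z), tansig(z))
```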
Fig. 3. Fully connected ANN with four neurons in the input layer and two neurons in
the output layer.
The models were programmed in Python 3.6 and the Keras library. Several
network configurations were tested by changing the amount of HLs, the amount
of neurons and the corresponding activation functions in each HL. Input values
were normalized by subtracting the mean and normalizing the variance using
Eqs. (7)–(10). The calculated values of μ and σ² for the training data set {Xi} were also applied to the validation data set using Eqs. (8) and (10).
μ = (1/m) Σ_{i=1}^{m} X_i                          (7)

X_i^(μ) = X_i − μ                                  (8)

σ² = (1/m) Σ_{i=1}^{m} (X_i^(μ))²                  (9)

X_i^(σ²) = X_i^(μ) / σ²                            (10)
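As an illustration (not the paper's code), Eqs. (7)–(10) can be sketched in NumPy; the key point is that the training-set statistics μ and σ² are reused unchanged on the validation set. Note that Eq. (10) as written divides by the variance σ² rather than the standard deviation:

```python
import numpy as np

def fit_normalizer(X_train):
    # Eqs. (7) and (9): per-feature mean and variance of the training set
    mu = X_train.mean(axis=0)
    var = ((X_train - mu) ** 2).mean(axis=0)
    return mu, var

def apply_normalizer(X, mu, var):
    # Eqs. (8) and (10): subtract the training mean, then divide by the
    # training variance (following Eq. (10) as written in the paper)
    return (X - mu) / var

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 4))  # toy stand-in for compressor data
X_val = rng.normal(5.0, 2.0, size=(20, 4))

mu, var = fit_normalizer(X_train)
X_train_n = apply_normalizer(X_train, mu, var)
X_val_n = apply_normalizer(X_val, mu, var)  # training statistics reused
```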
Table 5. Results - One seven neuron hidden layer ANN with different activation and
loss functions, and the best performing SVR model.
Fig. 4. Training loss comparison of the training error between Tanh-MSE-7, Tanh-MSE-45 and Tanh-MSE-45-2HL.
Fig. 5. Training loss comparison of the training and validation error between Tanh-MSE-45 and Tanh-MSE-25-3HL.
and the results fluctuate, indicating that the models are too complex for the underlying data.
Finally, to examine how the Tanh-MSE-45 model performs on completely new
data, some example calculations with Bitzer™ software were done using input
data that fall in between the values that were used to generate the training and
validation data sets. Specifically, instead of 5 °C steps for SGT ∈ {−30, −25, …, −5}, the values −12.5, −17.5 and −22.5 were used. Similarly, values for CF were set to 67, 63, 47, 43, 37 and 33. Table 6 shows these results as well as the % Squared Error (SE).
Table 6. Results - Tanh-MSE-45 model output compared with calculations done with
software from Bitzer™.
5 Conclusion
References
1. Arteconi, A., Hewitt, N., Polonara, F.: Domestic demand-side management (DSM):
role of heat pumps and thermal energy storage (TES) systems. Appl. Therm. Eng.
51(1), 155–165 (2013)
2. Chen, C., Duan, S., Cai, T., Liu, B., Hu, G.: Smart energy management system for
optimal microgrid economic operation. IET Renew. Power Gener. 5(3), 258–267
(2011)
3. Chua, K., Chou, S., Yang, W.: Advances in heat pump systems: a review. Appl.
Energy 87(12), 3611–3624 (2010)
4. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math.
Control, Signals Systems 2(4), 303–314 (1989)
5. Esena, H., Inallib, M., Sengurc, A., Esena, M.: Performance prediction of a ground-
coupled heat pump system using artificial neural networks. Expert Syst. Appl.
35(4), 1940–1948 (2008)
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). CoRR
abs/1412.6980
7. Manic, M., Amarasinghe, K., Rodriguez-Andina, J.J., Rieger, C.: Intelligent build-
ings of the future: cyberaware, deep learning powered, and human interacting.
IEEE Ind. Electron. Mag. 10(4), 32–49 (2016)
8. Neksa, P.: Co2 heat pump systems. Int. J. Refrig. 25(4), 421–427 (2002)
9. Neksa, P., Rekstad, H., Zakeri, G., Schiefloe, P.A.: Co2-heat pump water heater:
characteristics, system design and experimental results. Int. J. Refrig. 21(3), 172–
179 (1998)
10. Pardo, N., Montero, A., Martos, J., Urchueguı́a, J.: Optimization of hybrid - ground
coupled and air source - heat pump systems in combination with thermal storage.
Appl. Therm. Eng. 30(8), 1073–1077 (2010)
11. Sarkar, J., Bhattacharyya, S., Gopal, M.: Optimization of a transcritical co2 heat
pump cycle for simultaneous cooling and heating applications. Int. J. Refrig. 27(8),
830–838 (2004)
12. Remani, T., Jasmin, E.A., Ahamed, T.P.I.: Residential load scheduling with renew-
able generation in the smart grid: a reinforcement learning approach. IEEE Syst.
J. 6(99), 1–12 (2018)
13. Venayagamoorthy, G.K., Sharma, R.K., Gautam, P.K., Ahmadi, A.: Dynamic
energy management system for a smart microgrid. IEEE Trans. Neural Netw.
Learn. Syst. 27(8), 1643–1656 (2016)
14. Wen, Z., O’Neill, D., Maei, H.: Optimal demand response using device-based rein-
forcement learning. IEEE Trans. Smart Grid 6(5), 2312–2324 (2015)
15. Zhao, Z., Lee, W.C., Shin, Y., Song, K.B.: An optimal power scheduling method
for demand response in home energy management system. IEEE Trans.Smart Grid
4(3), 1391–1400 (2013)
Outlier Detection in Temporal Spatial Log
Data Using Autoencoder for Industry 4.0
1 Introduction
The use of cyber-physical systems (CPSs) in industry is on the rise [1]. CPSs consist of both physical components, e.g. sensors, and cyber components, e.g. measurement software [2]. Each component may produce massive amounts of log data [3], while also using
different log schemas. Log data used in this paper was extracted from a quality glass
inspection machine in production within a smart factory environment. Error detection
and problem solving in such an environment is a time-consuming task [2–5]. Log
schema evolution may even increase diversity.
An automatic outlier detection approach that can mark outliers in log data can mitigate complexity and speed up problem solving. Supervised machine learning approaches are inappropriate due to the lack of labelled log data. We apply an unsupervised autoencoder approach to find outliers in temporal spatial log data. Autoencoders are neural networks that compress and decompress data using functions that are
learned automatically from examples [6], in our case log data. Unsupervised outlier
detection methods rely on the fact that outliers are only a small portion of the entire
data [7]. This fact can be exploited within the autoencoder neural network. The more
examples of the same type of log data, the better the compression and
decompression/reconstruction of the log data will be. As a result, the reconstruction
error between input and output of the neural network will be higher in case of an
outlier.
We have successfully used an autoencoder approach for automatically detecting
outliers in the quality inspection machine environment without any domain-specific
knowledge. For this, we employed the mean squared error (MSE) function between the
input and output of the autoencoder network to find outliers in the temporal spatial log
data. The log data with the greatest MSE has been considered a potential outlier. With our method, we were able to mark multiple outliers in the provided log data, which were verified as outliers by a domain expert ex post.
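A minimal sketch of this MSE-based flagging idea (illustrative only; the array shapes and threshold value are assumptions, not taken from the paper):

```python
import numpy as np

def reconstruction_mse(X, X_hat):
    # per-log-line mean squared error between autoencoder input and output
    return ((X - X_hat) ** 2).mean(axis=1)

def flag_outliers(mse, threshold):
    # lines whose reconstruction error exceeds the threshold are marked
    return np.where(mse > threshold)[0]

# toy data standing in for encoded log lines and their reconstructions
X = np.zeros((5, 3))
X_hat = np.zeros((5, 3))
X_hat[2] = 0.9  # one badly reconstructed line

mse = reconstruction_mse(X, X_hat)
outliers = flag_outliers(mse, threshold=0.5)
```

In practice the threshold would be chosen relative to the error distribution of the bulk of the data rather than fixed in advance.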
The remainder of this paper is structured as follows. Section 2 introduces a review
of the related work done in outlier detection with autoencoders and within CPSs. In
Sect. 3 we present our autoencoder approach to find outliers in temporal spatial log
data. In Sect. 4 we outline our setup and evaluate our empirical results. Finally, we
conclude our work and point to future work.
2 Related Work
Outlier detection has been extensively studied by the research community and pub-
lished in comparative books and surveys [7–10]. In this paper, we focus on unsuper-
vised outlier detection in CPSs, especially on outlier detection techniques based on
autoencoders that are feasible on log data. As Chen et al. [11] recently stated,
autoencoders are appropriate for finding outliers. Several challenges within a CPS
environment exist that make outlier detection difficult. Two major challenges are the
mixed log data [2] and the highly dimensional feature space [2, 12].
Mixed data consists of numerical, categorical, and temporal attributes [8]. Harada
et al. [2] address this difficulty by pre-processing the log data of the CPS to enable
their proposed Local Outlier Factor (LOF) to process the data. We are aware of mixed
data difficulty and pre-process the log data to train an autoencoder neural network. We
took the autoencoder approach to face the second challenge, because an autoencoder is
able to identify high level correlations in features [6, 13]. Our approach is able to detect
outliers using a threshold on reconstruction error [11]. One of the first who applied
autoencoder approaches on event log data was Nolle et al. [14]. Nolle et al. [14] used
an autoencoder on an artificially generated business process event dataset (~50,000 events) to separate normal traces of data from anomalous ones using a threshold. We apply our autoencoder approach to a different domain, using a different autoencoder configuration and a real-world dataset (~1.2 million log lines). However, we facilitate the
idea of a threshold on the reconstruction error to classify outliers, as shown in [11, 14,
15], and test this approach on a real-world setting.
3 Background
f₁ : X → X̂                                        (1)

The autoencoder function (1) reconstructs the input from an unlabeled dataset X = {V_i ∈ W^(D×T)}, where V_i = {v_i^(t) ∈ W^d}_{t=1}^{T} represents, in our case, a multidimensional log line of length d ∈ D at time t over a set W = {d₁, d₂, …, d_n | d ∈ R ∨ d ∈ K ∨ d ∈ L}. The log line consists of numerical values R, messages from a language L, and categorical values K. The output domain X̂ is a probabilistic, not exact, reconstruction of the input domain X. The MSE function that will be minimized in order to maximize reconstruction quality is:

f₂ : (1/n) Σ_{i=1}^{n} (V_i − f₁(V_i; θ))²        (2)
4 Our Approach
We explain our approach in four steps. We start with a definition of our problem domain and the possible outliers that can appear in the environment. Next, we describe our autoencoder configuration and data strategy for training. Finally, we explain our outlier detection approach based upon the autoencoder.
As a result, the autoencoder has 183 input neurons and inverse output neurons (see
Fig. 2). We employ one hidden layer with 12 neurons between input and output layer.
All neurons are initialized by the Xavier uniform initializer [20], which draws weights from a uniform distribution over the range [−x, x], where x is defined as follows per neuron:
x = √(6 / (fan_in + fan_out))                      (3)
The initializer (3) was chosen after successful field trials. We choose the Rectified Linear Unit (ReLU) as the activation function for the encoding layer. ReLU is defined as follows [21]:

f(x) = max(0, x)                                   (4)

σ(x) = e^x / (1 + e^x)                             (5)
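A minimal NumPy sketch of the described 183–12–183 forward pass, assuming the sigmoid of Eq. (5) is applied on the output layer and omitting bias terms (both assumptions, not stated explicitly in the text above):

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_uniform(fan_in, fan_out):
    # Eq. (3): x = sqrt(6 / (fan_in + fan_out)); weights ~ U[-x, x]
    x = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-x, x, size=(fan_in, fan_out))

def relu(z):          # Eq. (4), encoding layer activation
    return np.maximum(0.0, z)

def sigmoid(z):       # Eq. (5)
    return 1.0 / (1.0 + np.exp(-z))

# 183 input neurons, one hidden layer with 12 neurons, 183 output neurons
W_enc = xavier_uniform(183, 12)
W_dec = xavier_uniform(12, 183)

def autoencode(v):
    h = relu(v @ W_enc)          # encode / compress
    return sigmoid(h @ W_dec)    # decode / reconstruct

v = rng.uniform(0.0, 1.0, size=(1, 183))  # toy stand-in for an encoded log line
v_hat = autoencode(v)
```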
Each log line gets extracted (v_i), transformed (Dx_i) and used as input for the autoencoder. Furthermore, the encoded log line (De_i) is used to reproduce the input (Dd_i). In addition, the MSE function is applied to Dx_i and Dd_i. As a result, the Dmse_i
We present our experimental results in this section. The experiments are done without
any domain-specific knowledge. In order to verify our results on unlabeled log data, we
conducted an interview with a domain expert ex ante.
The presented log data was captured on a glass inspection machine in production.
A conveyor delivers a thin glass layer towards the glass inspection machine, which
detects glass defects and rates material quality. A glass inspection machine consists of
cameras, lights, multiple sensors, hardware and software modules, which produce log
data. Figures 4 and 5 show the log files, after the dataset preparation step is done (see
Sect. 4.2).
Outliers O1-2 and O4-5 represent outliers that were unknown upfront. O1 and O4
were identified as a slight change of thickness in the inspection layer ex post and are
valid outliers. O2 and O5 appear out of order and not directly connected physical visual
outliers. A possible explanation is, that O2 and O5 can be seen as indicators for the
marked outlier areas M1 and M2. This remains unverified for the moment, a study on
more log data is planned.
Through the automatic outlier detection process we are able to mark outliers in log data. The marked log line can be mapped back to the original log line, which includes the timestamp. Therefore, the expert is enabled to do a more in-depth analysis of the log data around that timestamp. Thus, our method mitigates complexity and speeds up problem solving.
In this paper we presented an outlier detection method for temporal spatial log data that
can be used without domain-specific knowledge. We presented conversion rules that can be applied to log data independent of the application domain. Furthermore, we presented an autoencoder configuration. The configuration was found during field trials on the log data. In the end, we were able to find outliers in temporal spatial log data with our presented algorithm.
We found the outlier areas that were formerly known by the domain expert. In
addition, we were able to find four additional outliers. Two outliers were verified as
valid outliers ex post. The other outliers are in question; a study is planned to prove if
the outliers can be seen as indicators for a major outlier peak. Another study on the
autoencoder configuration will be conducted in order to prove if the configuration is
suitable in other domains and on other log data. Through our study we were able to
help a domain expert to identify more outliers in temporal spatial log data that were not
considered before in a pre-analysis of the log data.
We have shown that autoencoders can be successfully used as an outlier detection method on log data. In further studies we plan to also incorporate contextual outlier detection using semi-supervised methods. A limitation of this study is that our method relies on the fact that outliers are only a small portion of the whole log data; otherwise our autoencoder can reconstruct outliers with a smaller reconstruction error than normal log data, leading to a failure of our method. Another limitation is that we were only provided with log data of one production machine and received feedback from only one domain expert. We are aware of a potential bias and aim to minimize it by verifying our results with more log data and more domain experts in the future. Nevertheless, our method
can be used as an additional tool for quick inspection in temporal spatial log data. In
future this can speed up problem solving in a complex highly integrated connected CPS
environment and reduce the time span of cost-intensive downtimes and non-functional
CPSs.
Acknowledgements. This study takes place within the project ProDok 4.0, funded by the
German Ministry of Education and Research (BMBF) within the framework of the Services 2010
action plan under funding no. 02K14A110. Executive steering committee is the Karlsruher
64 L. Kaupp et al.
Institut für Technologie - Karlsruhe Institute of Technology (KIT). Project partners are
KUKA AG, ISRA VISION AG, dictaJet Ingenieurgesellschaft mbH and Hochschule Darmstadt -
University of Applied Sciences. Glass inspection machine (Type FS5D) logs were provided by ISRA VISION AG. All rights to the example log data remain solely with ISRA VISION AG. Usage must be requested.
References
1. Fu, Y., Zhu, J., Gao, S.: CPS information security risk evaluation system based on Petri Net.
In: DSC 2017, 2017 IEEE Second International Conference on Data Science in Cyberspace,
Proceedings, 26–29 June 2017, Shenzhen, China, pp. 541–548. IEEE, Piscataway (2017)
2. Harada, Y., Yamagata, Y., Mizuno, O., Choi, E.-H.: Log-based anomaly detection of CPS
using a statistical method. In: 8th IEEE International Workshop on Empirical Software
Engineering in Practice, IWESEP 2017, Proceedings, 13 March 2017, Tokyo, Japan, pp. 1–
6. Conference Publishing Services, IEEE Computer Society, Los Alamitos, California,
Washington, Tokyo (2017)
3. Nguyen, H., Cai, C., Chen, F.: Automatic classification of traffic incident’s severity using
machine learning approaches. IET Intell. Transp. Syst. 11, 615–623 (2017)
4. Hasani, Z.: Robust anomaly detection algorithms for real-time big data. Comparison of
algorithms. In: Stojanović, R. (ed.) 2017 6th Mediterranean Conference on Embedded
Computing (MECO). Including ECYPS 2017, Proceedings: Research Monograph, Bar,
Montenegro, 11th–15th June 2017, pp. 1–6. IEEE, Piscataway (2017)
5. Lu, X., Nagelkerke, M., van de Wiel, D., Fahland, D.: Discovering interacting artifacts from
ERP systems. IEEE Trans. Serv. Comput. 8, 861–873 (2015)
6. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural
networks. Science 313, 504–507 (2006)
7. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection. ACM Comput. Surv. 41, 1–58
(2009)
8. Aggarwal, C.C.: Outlier Analysis. Springer, Cham (2017). https://doi.org/10.1007/978-3-
319-47578-3
9. Ibidunmoye, O., Hernández-Rodriguez, F., Elmroth, E.: Performance anomaly detection and
bottleneck identification. ACM Comput. Surv. 48, 1–35 (2015)
10. Mehrotra, K.G., Mohan, C.K., Huang, H.: Anomaly Detection Principles and Algorithms.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67526-8
11. Chen, J., Sathe, S., Aggarwal, C., Turaga, D.: Outlier detection with autoencoder ensembles.
In: Chawla, N., Wang, W. (eds.) Proceedings of the 2017 SIAM International Conference on
Data Mining, pp. 90–98. Society for Industrial and Applied Mathematics, Philadelphia
(2017)
12. Tang, L.-A., et al.: Trustworthiness analysis of sensor data in cyber-physical systems.
J. Comput. Syst. Sci. 79, 383–401 (2013)
13. Protopapadakis, E., Voulodimos, A., Doulamis, A., Doulamis, N., Dres, D., Bimpas, M.:
Stacked autoencoders for outlier detection in over-the-horizon radar signals. Comput. Intell.
Neurosci. 2017, 5891417 (2017)
14. Nolle, T., Seeliger, A., Mühlhäuser, M.: Unsupervised anomaly detection in noisy business
process event logs using denoising autoencoders. In: Calders, T., Ceci, M., Malerba, D.
(eds.) Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19–
21, 2016, Proceedings, pp. 442–456. Springer, Cham (2016). https://doi.org/10.1007/978-3-
319-46307-0_28
15. Lu, W., et al.: Unsupervised sequential outlier detection with deep architectures. IEEE Trans.
Image Process. 26, 4321–4330 (2017)
16. Xiong, Y., Zuo, R.: Recognition of geochemical anomalies using a deep autoencoder
network. Comput. Geosci. 86, 75–82 (2016)
17. Sebestyen, G., Hangan, A.: Anomaly detection techniques in cyber-physical systems. Acta
Univ. Sapientiae Inform. 9, 116 (2017)
18. Qu, Y., et al.: Product-based neural networks for user response prediction over multi-field
categorical data (2018)
19. Rodríguez, P., Bautista, M.A., Gonzàlez, J., Escalera, S.: Beyond one-hot encoding: lower
dimensional target embedding. Image Vis. Comput. 75, 21–31 (2018)
20. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural
networks. In: Proceedings of the International Conference on Artificial Intelligence and
Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics (2010)
21. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In:
Proceedings of the 27th International Conference on International Conference on Machine
Learning, pp. 807–814. Omnipress, USA (2010)
22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014)
Reservoir Computing Approaches Applied
to Energy Management in Industry
1 Introduction
In the last decade, the environmental consciousness of society has increased and environmental regulation has become more severe in most European countries, pressuring industries and communities toward energy and resource efficiency. The concept of “resource” includes primary raw materials and energy, but also by-products, wastes, wastewater and off-gases.
In particular, the off-gases produced during the different industrial processes such
as the integrated steelmaking cycle are often very valuable, as they are a source of
energy, which can replace natural gas for the production of heat, electric energy or
steam. In integrated steelworks, these gases come from the main production steps, i.e.
coke production (coke ovens), pig iron production (Blast Furnace - BF) and pig iron
conversion to steel through oxygen insufflation (Basic Oxygen Furnaces - BOF).
Currently these gases are internally exploited but their management is often non-
optimal due to technical and process-related constraints as well as to discrepancies in
the plant production scheduling. The poor coordination in management of gas pro-
duction and consumption can lead to either over-production or under-production of off-
gases. In the first condition, the off-gases are stored in gasholders until they are full and
afterwards they are flared (or, in some facilities, the production is stopped), with
consequent emissions and economic losses. In the second case, the off-gases cannot
satisfy the internal demand and natural gas is exploited, with consequent consumption
of a primary source and associated costs. These issues are enhanced in intermittent
processes, such as the BOF.
The off-gases production is unavoidable, but their wastage can be reduced through
the development of solutions allowing their optimal management and exploitation, with
consequent reduction of economic and environmental impacts, of CO2 emissions and
of primary resources exploitation. In the past, some Decision Support Systems (DSS) were developed to support plant technicians in off-gas management, also incorporating sophisticated models computing the amount and energy content of the off-gases produced in the integrated steelmaking cycle (the steelmaking route which produces steel from virgin primary raw materials, i.e. basically iron ore and coke) [1–3]. However, this approach provides off-line indications, considers only a limited number of constraints, neglects power generation systems as gas users and does not consider dynamic prediction of process gas production and use. The ongoing EU-funded GASNET project aims at filling this gap through the development of a library of
models to forecast the production of main off-gases in terms of volume flowrate and
intrinsic gas energy and the related consumptions by the main consumers processes in
the time horizon of two hours with a sampling rate of 1 min. Some of these models
exploit Echo State Neural Networks (ESN) either in their standard or in their “deep”
version (DESN).
Some solutions can be found in literature, which exploit back propagation Neural
Networks (NN) [4], improved least squares Support Vector Machine (SVM) [5] or
multiple linear regression model [6] to forecast the Blast Furnace Gas (BFG) genera-
tion. However, none of them provides information about the energetic value of the
gases, as the CO and H2 content are not provided and thus the Net Calorific Value
(NCV) of the gas cannot be computed. On the other hand, no literature works are
available concerning the prediction of BOF Gas (BOFG). This gas has a higher NCV
with respect to BFG and, thus, is more interesting from a recovery point of view.
However, the converter process is more difficult to model, being highly dynamic. On
this subject, ESNs have proven their capability to efficiently model complex dynamic processes, being also characterized by a cost-efficient training procedure, which is a
fundamental element for the future deployment of the system in an industrial context.
In the proposed application, the design of the ESN-based models was supported by
a deep analysis and careful selection of the input data, which was fundamental in order
to select the most relevant variables and eliminate useless or redundant data.
The paper is organized as follows: Sect. 2 provides some theoretical background on ESN, Sect. 3 describes the necessary data pre-processing stages, Sect. 4 provides details on the developed models, while Sect. 5 presents and discusses the models' performance. Finally, Sect. 6 provides some concluding remarks and hints for future work.
68 V. Colla et al.
Moreover, the reservoir layers generate dynamics starting from an exciting input vector u(t) composed of n_I features. The state of the entire reservoir system is composed of the state of each reservoir, as follows:

x(t) = [x₁(t) … x_N(t)]                            (1)
Without loss of generality, we can assume that each reservoir comprises n neurons. The state of the first reservoir is computed as:

x₁(t) = f_x(W_in1 u(t) + W_r1 x₁(t−1) + W_f1 y(t−1) + ν₁(t))    (2)
where f_x is a nonlinear function (in general a tanh function), the input matrix W_in1 is an n × n_I matrix, the reservoir matrix W_r1 an n × n matrix, the feedback matrix W_f1 an n × n_y matrix, y is the output of the ESN and ν is a small-amplitude white noise.
The state of the i-th reservoir layer is computed starting from the state of the (i − 1)-th layer as:

x_i(t) = f_x(W_ini x_{i−1}(t) + W_ri x_i(t−1) + W_fi y(t−1) + ν(t))    (3)
where ρ(W_ri) is the spectral radius of each reservoir matrix. It is important to note that the condition is only sufficient and the ESP must be verified empirically for the designed network; in the majority of case studies, however, the necessary condition is also empirically sufficient.
In order to verify the ESP, once W_ri is initialized it has to be scaled to obtain the desired spectral radius ρ_i, as follows:

W_ri = (ρ_i / ρ(W_ri^rand)) · W_ri^rand            (6)
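The scaling of Eq. (6) and the state update of Eq. (2) can be sketched in NumPy. This is a toy reservoir without the feedback and noise terms (W_f = 0, ν = 0); the sizes and the target spectral radius 0.9 are arbitrary choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_inputs = 50, 3  # reservoir neurons, input features (arbitrary)

def scale_spectral_radius(W_rand, rho_target):
    # Eq. (6): W_r = (rho_target / rho(W_rand)) * W_rand
    rho = max(abs(np.linalg.eigvals(W_rand)))
    return (rho_target / rho) * W_rand

W_r = scale_spectral_radius(rng.normal(size=(n, n)), rho_target=0.9)
W_in = rng.uniform(-1.0, 1.0, size=(n, n_inputs))  # input scaling K = 1

def reservoir_step(x_prev, u):
    # Eq. (2) with f_x = tanh, dropping the feedback and noise terms
    return np.tanh(W_in @ u + W_r @ x_prev)

x = np.zeros(n)
u = rng.normal(size=n_inputs)
x = reservoir_step(x, u)
```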
Each reservoir input matrix W_in1 … W_inN is initialized randomly with weights in the range [−1, 1], and then scaled by a factor K_i, known as input scaling.
To verify the second requirement, firstly the reservoir system must have a sufficiently large number of neurons; secondly, both the number of layers and the synapse connections within the reservoir system must be designed ad hoc. Several heuristics and algorithms for the selection of these hyperparameters are based on cross-validation techniques, or on other less computationally costly techniques. Some interesting examples are described in [16], where several hints are suggested to set the hyperparameters in a shallow ESN, in [17], where a Bayesian optimization is used to determine a good set of hyperparameters, and in [18], where spectral analysis is exploited to determine the number of layers of the DESNs and Intrinsic Plasticity is used to increase the richness of nonlinear dynamics within each reservoir. Once each reservoir layer has been initialized and verifies the ESP, the weights are left untrained and remain stationary.
• Readout Training: the linear readout is trained with the objective of minimizing
the 2-norm of the training dataset error, as follows:
min_{W_y} ‖W_y X − Y_target‖₂
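A common way to realize this readout training is (optionally ridge-regularized) least squares. The following sketch is an illustration under that assumption, not the GASNET implementation; it recovers a known readout matrix from noiseless collected states:

```python
import numpy as np

def train_readout(X, Y_target, ridge=1e-6):
    # Solve min_W ||W X - Y_target||_2 in closed form, with a small ridge
    # term for numerical stability. X is (n_states, n_samples),
    # Y_target is (n_outputs, n_samples).
    n = X.shape[0]
    return Y_target @ X.T @ np.linalg.inv(X @ X.T + ridge * np.eye(n))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 200))     # toy stand-in for collected reservoir states
W_true = rng.normal(size=(1, 20))
Y_target = W_true @ X              # noiseless teacher outputs

W = train_readout(X, Y_target)
```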
3 Data Pre-processing
The number of collected variables that are somehow related to off-gas generation is huge (on the order of hundreds). In order to develop effective models, careful selection of the input variables for each model is fundamental: to this aim, plant staff experience and process knowledge need to be exploited together with appropriate pre-processing analysis, including outlier removal and feature space reduction.
variables, which indicate a linear correlation coefficient greater than 0.95 with a p value
lower than 0.05. Afterwards, the algorithm selects the minimum dominating set of the
built graph, according to the following definition: “Given a graph G = (V, E), a dominating set is defined as a subset V′ ⊆ V such that every vertex not in V′ is adjacent to at least one member of V′.” Redundant variables are then defined as those variables whose corresponding vertexes of the graph are not included in the minimal size dominating set.
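Since exact minimum dominating set computation is NP-hard, a greedy approximation is a common practical choice; the following sketch illustrates the idea on a toy correlation graph (the variable names and edges are hypothetical, not from the application):

```python
def greedy_dominating_set(vertices, edges):
    # Greedy approximation of a minimum dominating set: repeatedly pick
    # the vertex that dominates the most not-yet-covered vertices.
    adj = {v: {v} for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    uncovered = set(vertices)
    dom = set()
    while uncovered:
        best = max(vertices, key=lambda v: len(adj[v] & uncovered))
        dom.add(best)
        uncovered -= adj[best]
    return dom

# toy graph: B and C are correlated (> 0.95) with A; D is uncorrelated
vars_ = ["A", "B", "C", "D"]
corr_edges = [("A", "B"), ("A", "C")]
keep = greedy_dominating_set(vars_, corr_edges)
# vertices outside the dominating set are flagged as redundant
redundant = set(vars_) - keep
```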
After elimination of the redundant variables, a Variable Selection (VS) algorithm is
applied on the remaining ones in order to select only the variables, which actually affect
the considered process, since including irrelevant input variables can decrease system performance. VS approaches, which are commonly applied in Machine
Learning (ML), can be classified into 3 main categories:
• filter methods, which select variables considering the relation between input and
output of the system and independently from the learning algorithm;
• wrapper methods, which use the learning machine as a black box by selecting
variables on the basis on their predictive influence.
• embedded methods, which implement VS within the training process. Embedded
approaches show comparable performance to wrapper-based ones, but need ad-hoc
designed learning systems.
Generally, when dealing with large datasets, filter methods are preferable, as they
are less computationally cumbersome. However, if the dataset has a reasonable dimension,
wrapper and embedded methods are preferable, as they reach a higher accuracy
than filter approaches by taking the ML model into account.
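As an illustration of the filter category (not the method adopted in this work), a univariate filter with scikit-learn might look as follows; the synthetic dataset and the number of retained variables are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# The target depends only on variables 0 and 3; the rest are noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=300)

# Filter method: rank inputs by their univariate F-score against the output,
# independently of any downstream learning algorithm.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = selector.get_support(indices=True)  # indices of the retained inputs
```

On this toy problem the filter recovers exactly the two informative inputs without ever training the final model, which is what makes filters cheap on large datasets.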
The ideal VS method, in terms of search-space coverage, is the analysis of all combinations
of potential input variables, the so-called exhaustive search or brute-force method. This
approach belongs to the wrapper class, but it is not viable when a significant number of
variables is involved, since its computational time complexity is exponential. When the
number of potential input variables is large, evolutionary approaches can be an
appropriate compromise, allowing a large number of variable combinations to be covered. In this
application, an efficient approach based on Genetic Algorithms (GAs) is applied [25,
26]. Each variable subset is encoded by a binary chromosome of the GA (each
gene corresponds to an input variable, and its value is 1 if the associated variable is
included in the considered subset, 0 otherwise). The GA creates a population of several
chromosomes and estimates their goodness by evaluating a fitness function, which
corresponds to the model performance when trained on the selected input
variable subset. The best chromosomes are exploited, through the GA crossover and
mutation procedures, to create a new population: crossover generates the offspring
chromosome by randomly selecting gene values from the two parents, while mutation creates
new individuals by randomly flipping the binary value of a gene of the considered
chromosome. Once a new population is generated, the evaluation and reproduction
process is repeated until a stop condition is met, which, in this application, consists of
reaching a fixed number of iterations or a plateau of the fitness function.
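A minimal sketch of such a GA-based wrapper selection, assuming a simple synthetic regression task and a ridge model as the fitness evaluator; the population size, rates and stand-in model are illustrative choices, not the paper's settings:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = X[:, 1] + 0.5 * X[:, 4] + 0.05 * rng.normal(size=200)

def fitness(chrom):
    """Fitness = cross-validated score of the model trained on the
    variables whose genes are set to 1."""
    idx = np.flatnonzero(chrom)
    if idx.size == 0:
        return -np.inf
    return cross_val_score(Ridge(), X[:, idx], y, cv=3).mean()

def ga_select(n_vars=8, pop_size=20, n_gen=15, p_mut=0.1):
    pop = rng.integers(0, 2, size=(pop_size, n_vars))
    for _ in range(n_gen):
        scores = np.array([fitness(c) for c in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]          # keep the best half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            mask = rng.integers(0, 2, size=n_vars).astype(bool)
            child = np.where(mask, a, b)               # uniform crossover
            flip = rng.random(n_vars) < p_mut          # mutation: flip genes
            child = np.where(flip, 1 - child, child)
            children.append(child)
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(c) for c in pop])]
    return np.flatnonzero(best)

selected = ga_select()
```

Because the parents are carried over unchanged (elitism), any chromosome containing the truly informative variables persists once found, so the returned subset includes them.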
The redundancy analysis and VS together allow a subset of significant variables to be created.
Reservoir Computing Approaches Applied to Energy Management in Industry 73
4 Developed Models
In this section we present some exemplar models of industrial processes, which show the
effectiveness of the proposed approach in industrial applications. In particular, the BOF
process has been studied in depth with several modelling objectives:
• Model 1: prediction of the BOFG production;
• Model 2: prediction of the steam production by recovering the BOFG heat.
An ESN-based model is proposed to forecast the recoverable BOFG (i.e., volume
and heating power) and to provide a qualitative indication of the flareable and total BOFG
volume flows and heating powers, with a frequency of 1 min over a time horizon of 2 h.
The qualitative nature of the estimates of the flareable and total BOFG volume flows
and heating powers is due to the quality of the data associated with the flared gas,
which is not comparable to that of the data associated with the recovered gas
and does not allow the model to be trained to a good performance for these outputs.
Moreover, in order to ensure proper exploitation of the BOFG, knowledge of the
behavior of the recoverable gas with respect to the flareable gas (the part of the gas that
needs to be flared due to process and safety constraints) is of greater interest.
The model includes five different ESNs that need to be tuned separately, in order
to independently forecast the recoverable, flareable and total BOFG as well as the CO and
H2 contents of the gas. Each network is specialized and, thus, the computational burden in the
training phase is moderate. The predicted CO and H2 contents of the BOFG are
exploited to estimate its NCV, and the obtained NCV value together with the BOFG
volume allows the heating power to be computed.
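A sketch of this computation, under the common assumption that the NCV of a CO/H2 mixture is the volume-fraction-weighted sum of the component calorific values; the reference values (≈12.6 MJ/Nm³ for CO, ≈10.8 MJ/Nm³ for H2) are approximate literature figures, not values taken from the paper, and the operating point is invented:

```python
# Net calorific values of the combustible components, MJ per normal cubic metre
# (approximate literature values; assumptions, not figures from the paper).
NCV_CO = 12.6
NCV_H2 = 10.8

def bofg_heating_power(co_pct, h2_pct, volume_flow_knm3_h):
    """Heating power (MW) from predicted CO/H2 contents (vol.%) and
    BOFG volume flow (kNm^3/h)."""
    ncv = (co_pct / 100.0) * NCV_CO + (h2_pct / 100.0) * NCV_H2  # MJ/Nm^3
    flow_nm3_s = volume_flow_knm3_h * 1000.0 / 3600.0            # Nm^3/s
    return ncv * flow_nm3_s                                      # MJ/s = MW

power = bofg_heating_power(co_pct=60.0, h2_pct=3.0, volume_flow_knm3_h=100.0)
```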
The general scheme of the model is shown in Fig. 2.
The inputs of the model, which have been selected through VS and redundancy
analysis, are reported in Table 1 together with the related cluster and other useful information.
They mostly refer to the materials and oxygen fed to the BOF. The position of the
movable skirt is also considered, as it is an operating variable affecting the air
feed of the process.
Most of the inputs are shared among the different ESNs, while some inputs
associated with the charged materials are fed only to some of them.
The variable named "BOF Plant Unit State" is a Boolean representation of
the scheduling of the process: when it holds a unitary value, the converter is in operation
(oxygen blowing has started); otherwise, oxygen blowing is not occurring.
The second model forecasts the total steam production related to the scheduling of
the BOF process. The model has a forecasting horizon of 2 h ahead, starting from the
most correlated variables; the list of inputs is shown in Table 2. These variables affect
the transfer of the heat that is exploited for steam production. The dynamics of the
steam production are modeled through the novel DESN-based architecture.
Table 1. Input variables of the ESNs for the BOFG production forecasting (freq. 1 min; t = 0
stands for the current value, t = 0…120 stands for all values in the next 2 h).

| Variable                   | Meas. unit | Input cluster | Note       |
|----------------------------|------------|---------------|------------|
| Position of moveable skirt | mm         | Input 1       | t = 0      |
| Blowed Oxygen Volume Flow  | kNm³/h     | Input 1       | t = 0      |
| Recovered BOFG Vol. Flow   | kNm³/h     | Input 1       | t = 0      |
| Flared BOFG Volume Flow    | kNm³/h     | Input 1       | t = 0      |
| BOF Plant Unit State       | Boolean    | Input 1       | t = 0…120  |
| Charged Liquid Pig iron    | t          | Input 1       | t = 0…120  |
| Charged Lime               | t          | Input 2       | t = 0…120  |
| CO content in BOFG         | vol.%      | Input 3       | t = 0      |
| H2 content in BOFG         | vol.%      | Input 4       | t = 0      |
| Recycled briquetted sinter | t          | Input 4       | t = 0…120  |
Table 2. Input variables for the prediction of steam production in the BOF process.

| Variable                              | Measurement unit | Note       |
|---------------------------------------|------------------|------------|
| Position of moveable skirt            | mm               | t = 0      |
| Blowed Oxygen Volume Flow             | kNm³/h           | t = 0      |
| Recovered BOFG Volume Flow            | kNm³/h           | t = 0      |
| Flared BOFG Volume Flow               | kNm³/h           | t = 0      |
| BOF Plant Unit State                  | Boolean          | t = 0…120  |
| Production of steam in the BOF boiler | t/h              | t = 0      |
Before the training and the simulation of the models, outliers are removed.
All the presented ESN and DESN-based architectures are designed according to a
common approach, which can be summarized as follows:
$$\sqrt{\frac{1}{N_s}\sum_{i=1}^{N_s}\left(\tilde{y}_i - y_i\right)^2}$$

where N_s is the number of samples of the dataset, and y_i and ỹ_i are, respectively,
the i-th sample of the target and of the predicted output.
• The training algorithm is an offline ridge regression of the readout;
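A minimal sketch of an ESN trained in this way, with a leaky reservoir and an offline ridge-regression readout; the reservoir size, leak rate and spectral radius below are illustrative values, not the tuned ones of the paper, and a toy sine-prediction task stands in for the industrial data:

```python
import numpy as np

class ESN:
    """Minimal leaky echo state network with a ridge-regression readout."""

    def __init__(self, n_in, n_res=200, rho=0.9, leak=0.3, ridge=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in + 1))
        W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
        W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # rescale spectral radius
        self.W, self.leak, self.ridge = W, leak, ridge

    def _states(self, U):
        x = np.zeros(self.W.shape[0])
        states = []
        for u in U:
            pre = self.W_in @ np.concatenate(([1.0], u)) + self.W @ x
            x = (1 - self.leak) * x + self.leak * np.tanh(pre)  # leaky update
            states.append(x)
        return np.array(states)

    def fit(self, U, y):
        X = self._states(U)
        # Offline ridge regression of the readout weights
        A = X.T @ X + self.ridge * np.eye(X.shape[1])
        self.W_out = np.linalg.solve(A, X.T @ y)
        return self

    def predict(self, U):
        return self._states(U) @ self.W_out

# Toy task: one-step-ahead prediction of a sine wave
t = np.linspace(0, 20 * np.pi, 2001)
u, y = np.sin(t[:-1])[:, None], np.sin(t[1:])
esn = ESN(n_in=1).fit(u, y)
rmse = np.sqrt(np.mean((esn.predict(u) - y) ** 2))
```

Only the readout weights are learned; the input and reservoir matrices stay fixed after the spectral-radius rescaling, which is what keeps ESN training cheap compared with back-propagation through time.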
5 Experimental Results
In the case of BOFG, the prediction performance is very satisfactory in both cases
(i.e., shallow ESNs and DESNs): the BOFG model faithfully follows the amount of
recoverable gas and the associated heating power. This last quantity is evaluated by
exploiting the predicted CO and H2 contents of the gas, which are estimated by the model
with good accuracy. The forecast of flareable gas is less precise, but its qualitative
behavior is followed in a satisfactory way. Finally, the model provides an intermediate
accuracy for the prediction of the total BOFG volume flow and heating power, which
correspond to the predicted sum of the recoverable and flareable BOFG.
Figure 3 depicts the prediction of the most important outputs (i.e., recoverable BOFG
and the related heating power) using the shallow ESN, and shows the model's capability to
follow the intermittent behavior of the process under consideration without any delay.
Fig. 3. Comparison between real (black) and forecasted (red) values by the model based on
shallow ESN of recoverable BOFG gas: (a) volume flow; (b) heating power. (Color figure online)
The suitability of this model is reflected in the NRMSE, which lies in the range
7–14% for the recoverable gas volume flow and 6–13% for the related heating
power; the lowest error values correspond to predictions that are closer in time.
However, the error values are partly distorted, as in some cases the estimated
recoverable BOFG differs from the actually recovered portion. This can
occur in specific situations in which the gas was recoverable, but for some
reason (e.g., the gasholder was full, or other unusual conditions) the operators
decided to flare it anyway. Consequently, the error measures are not fully repre-
sentative of the accuracy of the models, which are considered appropriate for quantitatively
forecasting the gas amount and its intrinsic energy for recovery purposes.
Using DESNs in the BOFG model provides an almost identical value of the
NRMSE. In both cases the training time is reasonable, and the DESN required only about
75% of the training time of the shallow ESNs. On the other hand, the overall
computational burden of the DESN in this case study is higher than that of the shallow ESN.
The validation results for the second model are also very encouraging. An example
of the prediction of the next 120 min of BOF boiler steam production is shown in Fig. 4:
the trained model is able to predict the variable of interest 2 h ahead, with an
accuracy in the range 3.2–6% for the DESN-based approach and
3.5–6% for the shallow ESN. The error is low in the first few minutes of the
prediction and, as is normally expected, tends to increase as the prediction horizon
moves further into the future. Both the shallow ESN and the DESN show very good
performance also for the 2-h-ahead prediction, but in this case the DESN-based
model proves to be the best solution in terms of both accuracy and computation time.
Fig. 4. DESN-based prediction of BOF Steam production: real (blue) vs. forecasted values
(red). (Color figure online)
6 Conclusions
Acknowledgments. The work described in the present paper was developed within the project
entitled “Optimization of the management of the process gases network within the integrated
steelworks - GASNET” (Contract No. RFSR-CT-2015-00029) and received funding from the
Research Fund for Coal and Steel of the European Union, which is gratefully acknowledged. The
sole responsibility for the issues treated in the present paper lies with the authors; the Union is not
responsible for any use that may be made of the information contained therein.
78 V. Colla et al.
References
1. Porzio, G.F., et al.: Reducing the energy consumption and CO2 emissions of energy
intensive industries through decision support systems–an example of application to the steel
industry. Appl. Energy 112, 818–833 (2013)
2. Porzio, G.F., et al.: Process integration in energy and carbon intensive industries: an example
of exploitation of optimization techniques and decision support. Appl. Therm. Eng. 70(2),
1148–1155 (2014)
3. Porzio, G.F., Nastasi, G., Colla, V., Vannucci, M., Branca, T.A.: Comparison of multi-
objective optimization techniques applied to off-gas management within an integrated
steelwork. Appl. Energy 136, 1085–1097 (2014)
4. Zhang, Q., Gu, Y.L., Ti, W., Cai, J.J.: Supply and demand forecasting of blast furnace gas
based on artificial neural network in iron and steel works. Adv. Mat. Res. 443, 183–188
(2012)
5. Yang, L., He, K., Zhao, X., Lv, Z.: The prediction for output of blast furnace gas based on
genetic algorithm and LSSVM. In: IEEE 9th Conference on Industrial Electronics and
Applications, pp. 1493–1498 (2015)
6. Zhao, J., Wang, W., Liu, Y., Pedrycz, W.: A two-stage online prediction method for a blast
furnace gas system and its application. IEEE Trans. Control Syst. Tech. 19(3), 507–520
(2011)
7. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks.
In: International Conference on Machine Learning, pp. 1310–1318, February 2013
8. Schäfer, A.M., Zimmermann, H.G.: Recurrent neural networks are universal approximators.
Int. J. Neural Syst. 17(04), 253–263 (2007)
9. Jaeger, H.: The "echo state" approach to analysing and training recurrent neural networks -
with an erratum note. GMD Technical report 148, German National Research Center for
Information Technology, Bonn, Germany (2001)
10. Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy
in wireless communication. Science 304(5667), 78–80 (2004)
11. Grigoryeva, L., Ortega, J.P.: Echo state networks are universal. Neural Netw. 108, 495–508
(2018)
12. Gallicchio, C., Micheli, A., Pedrelli, L.: Deep reservoir computing: a critical experimental
analysis. Neurocomputing 268, 87–99 (2017)
13. Gallicchio, C., Micheli, A.: Echo state property of deep reservoir computing networks.
Cogn. Comput. 9(3), 337–350 (2017)
14. Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning, vol. 1. MIT Press,
Cambridge (2016)
15. Yildiz, I.B., Jaeger, H., Kiebel, S.J.: Re-visiting the echo state property. Neural Netw. 35, 1–
9 (2012)
16. Lukoševičius, M.: A practical guide to applying echo state networks. In: Montavon, G., Orr,
G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 659–
686. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_36
17. Maat, J.R., Gianniotis, N., Protopapas, P.: Efficient optimization of echo state networks for
time series datasets. In: International Joint Conference on Neural Networks (IJCNN), pp. 1–7
(2018)
18. Gallicchio, C., Micheli, A., Pedrelli, L.: Design of deep echo state networks. Neural Netw.
108, 33–47 (2018)
19. Grubbs, F.E.: Procedures for detecting outlying observations in samples. Technometrics 11,
1–21 (1969)
20. Knorr, E.M., Ng, R.: Algorithms for mining distance-based outliers in large datasets. In:
Proceedings of VLDB, pp. 392–403 (2003)
21. Cateni, S., Colla, V., Nastasi, G.: A multivariate fuzzy system applied for outliers detection.
J. Intell. Fuzzy Syst. 24(4), 889–903 (2013)
22. Cateni, S., Colla, V., Vannucci, M.: A fuzzy logic-based method for outliers detection. In:
Proceedings of the IASTED International Conference on Artificial Intelligence and
Applications, AIA 2007, pp. 561–566 (2007)
23. Mahalanobis, P.C.: On the generalized distance in statistics. Proc. Natl. Inst. Sci. India 4, 9–
55 (1936)
24. Grandoni, F.: A note on the complexity of minimum dominating set. J. Discrete Algorithms
4(2), 209–214 (2006)
25. Cateni, S., Colla, V., Vannucci, M.: General purpose input variables extraction: a genetic
algorithm based procedure GIVE a GAP. In: 9th International Conference on Intelligent
Systems Design and Applications, ISDA 2009, pp. 1278–1283 (2009)
26. Cateni, S., Colla, V., Vannucci, M.: A genetic algorithm-based approach for selecting input
variables and setting relevant network parameters of a SOM-based classifier. Int. J. Simul.
Syst. Sci. Technol. 12(2), 30–37 (2011)
Signal2Vec: Time Series Embedding
Representation
1 Introduction
A time series is a sequence of data points in temporal order, with values in a continuous space.
The ordering may even be unrelated to time, but it remains important. This type of data has
always attracted the interest of scientists in a vast range of areas, such as speech
recognition, finance, physics and biology. Some common tasks involving time
series are motif discovery, forecasting, source separation, subsequence matching,
anomaly detection and segmentation.
In time series problems, regardless of the approach, the performance of the solution
is heavily affected by the representation of the data. Representations can be classified
into data-adaptive, non-data-adaptive, model-based and data-dictated ones. The first
category includes techniques such as Adaptive Piecewise Constant Approximation [13],
Singular Value Decomposition [15], Symbolic Natural Language [23] and Symbolic
Aggregate approXimation [16]. Approaches belonging to the second category include the
Discrete Wavelet Transform [5], the spectral DFT [9], Piecewise Aggregate
Approximation [14] and Indexable Piecewise Linear Approximation [6]. Model-based
representations rely on statistical models, such as Markov Models and Hidden Markov
Models [19] and Auto-Regressive Moving Average models [7]. Finally, the most popular
data-dictated approach is Clipped [24].
In this paper a novel framework, named Signal2vec, is introduced. A similar
approach has been proposed by Nalmpantis et al. [20], where a model called
Energy2vec is used to create a hyperspace of energy embeddings. Energy2vec is
bound to the energy domain, is supervised, and its applicability is limited. On
the other hand, Signal2vec is a general, unsupervised model applicable
to any time series. It is inspired by Word2vec [17], which builds a vector space
maintaining the semantic and syntactic relations of the original words. Word2vec
has been applied to numerous textual or discrete sequences, in areas such as
recommendation systems [2,22], ranking of sets of entities [4,10], biology [1] and
others [26]. Signal2vec is the first attempt to extend the applicability of Word2vec
to any sequential data in continuous space.
The benefits and the drawbacks of the framework are discussed extensively
through a theoretical analysis. The framework is validated with experiments in
two different tasks: classification and single source separation. Both the analysis
and the experiments are based on energy data.
2 Signal2Vec
Signal2vec consists of two main steps: tokenization and the skip-gram model. The
former is a discretization process that transforms a continuous time series
into tokens; the latter transforms the sequence of tokens into embeddings.
Figure 1 illustrates the steps of the framework.
2.1 Tokenization
using the complexity of power draws [8]. Next, a k-nearest neighbors classifier
is trained to map values of the signal to tokens. The classifier can then be used to
tokenize time series from the same domain.
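A sketch of this tokenization step, assuming (as is common) that the token values are first extracted by clustering the signal values and then assigned by a k-NN classifier; the two-state toy signal and cluster count are invented:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# Toy appliance signal alternating between an OFF (~0 W) and an ON (~2000 W) state
signal = np.concatenate([rng.normal(0, 5, 500), rng.normal(2000, 20, 500)])

# Token extraction: cluster the sample values into discrete states
values = signal.reshape(-1, 1)
tokens = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(values)

# Token assignment: a k-NN classifier maps raw values to tokens and can
# then tokenize any other series from the same domain
knn = KNeighborsClassifier(n_neighbors=5).fit(values, tokens)
new_tokens = knn.predict(np.array([[1.0], [1990.0]]))
```

A previously unseen reading near 0 W and one near 2000 W land in different tokens, which is exactly the reuse across same-domain series described in the text.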
The algorithm is evaluated using household energy data. The tokens represent
the energy states of each appliance, because token extraction is applied to the
submetered data. Then, sequences of tokens are created for each appliance. The
final sequence is a concatenation of the appliance-specific sequences and corresponds
to the aggregated signal. In problems like power disaggregation, both
token extraction and token assignment would be applied directly to the aggregated
signal, because the submetered data are assumed to be unknown. The
analysis that follows is based on submetered data, in order to present meaningful
tokens that correspond to appliance states.
2.2 Skip-Gram
Signal2vec is based on Word2vec, which uses either the skip-gram model or continuous
bag-of-words (CBOW). Skip-gram predicts the words around the target word,
while CBOW predicts the target word given its neighbours. Both methods can be
applied with minor differences in the results. For consistency, skip-gram is
selected for all the experiments.
Following tokenization, a time series is mapped to a sequence of tokens; this
sequence is now called a corpus. If the tokens are not abstract states but reflect
real-world conditions of the time series, then the corpus is a human-readable description
of the total signal. The collection of the tokens constitutes a vocabulary. In order
to apply the skip-gram model, a context is also defined as the window to the
left and to the right of the target token. The objective of the algorithm is to
predict the context given a specific token. The architecture is a shallow neural
network with one hidden layer, trained with pairs of tokens: one token
is the target and the other belongs to its context. The network learns the
weights of the hidden layer from the frequencies with which each pair shows up. A more
formal definition of the objective function is given next.
Let tkn_1, tkn_2, …, tkn_T be a sequence of T training tokens. The objective
function then maximizes the average log probability

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le i \le c,\; i \ne 0} \log p(tkn_{t+i} \mid tkn_t), \qquad (1)$$

where c is the size of the context window.
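Training the objective in Eq. (1) amounts to generating (target, context) pairs from the token corpus; a minimal sketch of the pair generation, with an invented toy corpus and window size:

```python
def skipgram_pairs(corpus, c=2):
    """Yield (target, context) training pairs for the skip-gram objective:
    for each position t, every token within the window [-c, c], i != 0."""
    pairs = []
    for t, target in enumerate(corpus):
        for i in range(-c, c + 1):
            if i != 0 and 0 <= t + i < len(corpus):
                pairs.append((target, corpus[t + i]))
    return pairs

corpus = ["off", "on", "on", "off"]
pairs = skipgram_pairs(corpus, c=1)
```

Each pair contributes one log p(context | target) term to the sum in Eq. (1); the actual model then learns the hidden-layer weights from the frequencies of these pairs.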
3 Evaluation
3.1 Data and Tools
The source of the data is the UK-DALE dataset [12], which includes both the aggregated
power consumption and the individual power consumption of the appliances in a house. House
1 is selected because it has the most devices of all the houses. The chosen programming
language is Python. The toolkit NILMTK [3] is used for accessing
the database and preprocessing the energy data. In order to distribute computation
across many CPU cores and a GPU, the skip-gram model is implemented in
TensorFlow. TensorFlow comes with a suite of visualization tools called
TensorBoard, which is used to plot diagrams of the model, visualize the embeddings
and evaluate the model. TensorBoard's visualization tool uses PCA and t-SNE
to plot the embedding space in two or three dimensions. The evaluation
of the embedding space is mainly done with TensorBoard's similarity tool, which
supports both cosine and Euclidean distance.
embeddings in 2D, where the two groups gather in two different semicircles.
In 3D the clusters are clearer, because the two semicircles are separated into
two distinct shapes.
The similarity tool is used to find vectors that are close to each other. Both
Euclidean and cosine distance metrics are used, although there is no significant
difference in the results. For each similarity search, the respective plots
of the input vector and its five closest neighbours are compared. The results show
that embeddings with a small distance in the geometric space have similar
frequency diagrams. The results are very robust in terms of distinguishing high-
and low-frequency energy states: almost all similarity searches return
neighbouring vectors that correspond to the same frequency category. On the
other hand, vectors belonging to the same cluster cannot be distinguished, and
no characteristics were found to justify the results of a similarity search.
Figure 3 depicts the frequency distribution of the tokens derived from unsupervised
tokenization. There are two central points, forming two normal distributions.
Comparing the frequency distributions with the geometric properties of the
embedding spaces reveals a connection between the tokens and the embeddings:
skip-gram transferred the properties of the sequence of tokens to a multidimensional
space. Assuming that sequences of tokens can be translated into frequencies
of appliance usage, which in turn reflect human behaviour and habits, it can
be concluded that the constructed vector space encapsulates the energy profile
of the house.
produced from the aggregated energy signal and they are used to transform any
other energy time series.
used rarely. When calculating the average vector, any temporal patterns are lost,
whereas in PAA they are maintained. For future experiments, more sophisticated
methods can be evaluated, e.g., a weighted average vector. Finally, it is notable that
for larger time series the results on raw data get worse. This is explained
by the dimensionality explosion, which Signal2vec addresses by compressing all the
information into a single vector of constant dimensionality.
time series during 2014 is segmented into fixed windows. Signal2vec is applied to
each window and, finally, the average vectors are calculated. Each representative
vector is the input to a multilayer perceptron classifier with one hidden layer of
100 neurons. The fixed windows that have been tested correspond to 4, 8, 12
and 24 h. The labels are the appliances that were ON at least once during
the fixed time period. The results are very encouraging, especially when comparing
the performance of the same model identifying 6 and 12 appliances. Table 2
presents the results in detail. Additional experiments need to be carried out
for comparison with other models.
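The pipeline above (window-average vectors fed to an MLP with one hidden layer of 100 neurons, as in the text) can be sketched as follows; the synthetic per-token embeddings merely stand in for the actual Signal2vec output, and a single binary label replaces the multi-appliance labels:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
dim, n_windows = 16, 400

# Stand-ins for per-token Signal2vec embeddings inside each fixed window:
# class-0 windows cluster around +1, class-1 windows around -1 (synthetic).
labels = rng.integers(0, 2, size=n_windows)
window_vectors = []
for lab in labels:
    token_embeddings = rng.normal(1.0 if lab == 0 else -1.0, 0.5, size=(50, dim))
    window_vectors.append(token_embeddings.mean(axis=0))  # representative vector
X = np.array(window_vectors)

# Multilayer perceptron with one hidden layer of 100 neurons
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
clf.fit(X[:300], labels[:300])
accuracy = clf.score(X[300:], labels[300:])
```

Averaging the token embeddings compresses each window into one fixed-size vector, which is what makes a plain MLP applicable regardless of the window length.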
5 Conclusion
References
1. Asgari, E., Mofrad, M.R.: Continuous distributed representation of biological
sequences for deep proteomics and genomics. PloS one 10(11), e0141287 (2015)
2. Barkan, O., Koenigstein, N.: Item2Vec: neural item embedding for collaborative
filtering. In: 2016 IEEE 26th International Workshop on Machine Learning for
Signal Processing (MLSP), pp. 1–6. IEEE (2016)
3. Batra, N., et al.: NILMTK: an open source toolkit for non-intrusive load monitor-
ing. In: Proceedings of the 5th International Conference on Future Energy Systems,
pp. 265–276. ACM (2014)
4. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: Advances in Neural Information
Processing Systems, pp. 2787–2795 (2013)
5. Chan, K.P., Fu, A.W.C.: Efficient time series matching by wavelets. In: Proceedings
of the 15th International Conference on Data Engineering, 1999, pp. 126–133. IEEE
(1999)
6. Chen, Q., Chen, L., Lian, X., Liu, Y., Yu, J.X.: Indexable PLA for efficient simi-
larity search. In: Proceedings of the 33rd International Conference on Very Large
Data Bases, pp. 435–446. VLDB Endowment (2007)
7. Corduas, M., Piccolo, D.: Time series clustering and classification by the autore-
gressive metric. Comput. Stat. Data Anal. 52(4), 1860–1872 (2008)
8. Egarter, D., Pöchacker, M., Elmenreich, W.: Complexity of power draws for load
disaggregation (2015). arXiv preprint arXiv:1501.02954
9. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequence matching in
time-series databases, vol. 23. ACM (1994)
10. Garcia-Duran, A., Bordes, A., Usunier, N.: Composing relationships with transla-
tions. Ph.D. thesis, CNRS, Heudiasyc (2015)
11. Gutmann, M.U., Hyvärinen, A.: Noise-contrastive estimation of unnormalized sta-
tistical models, with applications to natural image statistics. J. Mach. Learn. Res.
13(Feb), 307–361 (2012)
12. Kelly, J., Knottenbelt, W.: The UK-DALE dataset, domestic appliance-level elec-
tricity demand and whole-house demand from five UK homes. Sci. Data 2, 150007
(2015)
13. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Locally adaptive dimen-
sionality reduction for indexing large time series databases. ACM Sigmod Rec.
30(2), 151–162 (2001)
14. Keogh, E.J., Pazzani, M.J.: A simple dimensionality reduction technique for fast
similarity search in large time series databases. In: Terano, T., Liu, H., Chen, A.L.P.
(eds.) PAKDD 2000. LNCS (LNAI), vol. 1805, pp. 122–133. Springer, Heidelberg
(2000). https://doi.org/10.1007/3-540-45571-X 14
15. Korn, F., Jagadish, H.V., Faloutsos, C.: Efficiently supporting ad hoc queries in
large datasets of time sequences. In: ACM Sigmod Record, vol. 26, pp. 289–300.
ACM (1997)
16. Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic rep-
resentation of time series. Data Min. Knowl. Disc. 15(2), 107–144 (2007)
17. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word rep-
resentations in vector space (2013). CoRR abs/1301.3781. http://arxiv.org/abs/
1301.3781
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems, pp. 3111–3119 (2013)
90 C. Nalmpantis and D. Vrakas
19. Minnen, D., Isbell, C.L., Essa, I., Starner, T.: Discovering multivariate motifs using
subsequence density estimation and greedy mixture learning. In: Proceedings of the
National Conference on Artificial Intelligence, vol. 22, p. 615. AAAI Press/MIT
Press, Menlo Park/Cambridge (2007)
20. Nalmpantis, C., Krystalakos, O., Vrakas, D.: Energy profile representation in vector
space. In: 10th Hellenic Conference on Artificial Intelligence SETN 2018. ACM
(2018)
21. Nalmpantis, C., Vrakas, D.: Machine learning approaches for non-intrusive load
monitoring: from qualitative to quantitative comparation. Artif. Intell. Rev. 1–27
(2018)
22. Ozsoy, M.G.: From word embeddings to item recommendation (2016). arXiv
preprint arXiv:1601.01356
23. Portet, F., et al.: Automatic generation of textual summaries from neonatal inten-
sive care data. Artif. Intell. 173(7–8), 789–816 (2009)
24. Ratanamahatana, C., Keogh, E., Bagnall, A.J., Lonardi, S.: A novel bit level time
series representation with implication of similarity search and clustering. In: Ho,
T.B., Cheung, D., Liu, H. (eds.) PAKDD 2005. LNCS (LNAI), vol. 3518, pp. 771–
777. Springer, Heidelberg (2005). https://doi.org/10.1007/11430919 90
25. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation
of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
26. Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: StarSpace: embed
all the things! (2017). arXiv preprint arXiv:1709.03856
Biomedical - Bioinformatics Modeling
Classification of Sounds Indicative
of Respiratory Diseases
1 Introduction
[Fig. 1 here: the input audio signal is split by Gabor filtering into 12 spectral bands
(CB 1–CB 12); each band is decomposed by a 3-level Wavelet Packet Transform
(Daubechies 1, i.e. Haar) into the coefficient sequences A1/D1 up to AAA3…DDD3, which
are segmented, autocorrelated, and integrated (area under the autocorrelation envelope)
to form the output feature vector.]
Fig. 1. The block diagram of proposed feature extraction module. Audio signals are
filtered and each spectral band is analyzed by a 3-level wavelet packet transform. After
segmenting and computing the area under the autocorrelation envelope, we obtain the
feature vector.
Table 1. The frequency limits used for perceptual wavelet packet integration analysis.
The idea behind this specific set is the production of a vector that provides
a complete analysis of the audio signal across different spectral areas, as
approximated by the WP decomposition. We should also take into account that respiratory
signals do not distribute their energy across the spectrum in a homogeneous
way; thus, a fine partitioning of the spectrum can offer relevant distinctive
information. Based on this observation, we designed a band-based signal analysis
with the frequency ranges denoted in Table 1. This division is achieved by
Gabor bandpass filters. Subsequently, three-level wavelet packets are extracted
from each spectral band. This decomposition level provides detailed information
regarding the signal characteristics in a specific band. Downsampling is
applied to each coefficient at each stage in order not to end up with double the
amount of data, as the Nyquist theorem dictates. The wavelet coefficients are then
segmented, and the area under the autocorrelation envelope is computed and normalized
by half the segment size. M normalized integration parameters are calculated
for each frame, where M is the total number of frequency bands multiplied
by the number of wavelet coefficients. This series of parameters comprises
the WP-Integration feature vector; the entire calculation process is depicted
in Fig. 1.
96 S. Ntalampiras and I. Potamitis
The WP-Integration parameters reflect the degree of variability of a specific
wavelet coefficient within a frequency band. Since the audio signals we aim to
classify exhibit great differences among these bands, we decided to utilize the
normalized autocorrelation envelope area.
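A rough numpy sketch of this per-band computation, with a hand-rolled Haar (Daubechies 1) packet transform standing in for the full Gabor + WPT chain and the absolute autocorrelation standing in for its envelope; the test signal and the treatment of each band as one segment are invented simplifications:

```python
import numpy as np

def haar_packet(x, levels=3):
    """Recursively split each band into pairwise averages (A) and pairwise
    differences (D), downsampling by 2 at every stage (Haar / Daubechies 1)."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nxt = []
        for b in bands:
            b = b[: len(b) // 2 * 2]                      # even length for pairing
            nxt.append((b[0::2] + b[1::2]) / np.sqrt(2))  # approximation
            nxt.append((b[0::2] - b[1::2]) / np.sqrt(2))  # detail
        bands = nxt
    return bands  # 2**levels coefficient sequences (A1/D1 ... AAA3/DDD3 paths)

def autocorr_area(seg):
    """Area under the absolute autocorrelation, normalized by half the
    segment size (a stand-in for the envelope-based area of the paper)."""
    seg = seg - seg.mean()
    ac = np.correlate(seg, seg, mode="full")[len(seg) - 1:]
    ac = ac / (ac[0] + 1e-12)                 # normalize so ac[0] = 1
    return np.abs(ac).sum() / (len(seg) / 2)

rng = np.random.default_rng(5)
x = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 1024)) + 0.1 * rng.normal(size=1024)
features = [autocorr_area(band) for band in haar_packet(x, levels=3)]
```

With 3 levels, each band yields 8 coefficient sequences, so one integration parameter per sequence; in the full scheme this is repeated for each of the 12 Gabor bands.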
The proposed framework relies on the Directed Acyclic Graph (DAG) logic [9],
i.e., the classification scheme is a graph denoted as G = {N, L}, where N =
{n_1, …, n_m} represents the nodes and L = {l_1, …, l_p} the links associating the
nodes. Each node in N is responsible for a binary classification task conducted
via a set of HMMs, which fit well the specifications of audio pattern recognition
tasks; hence the DAG-HMM notation.
The motivation behind creating such a graph-based classification system is
that, in this way, one is able to limit the problem space and design classification
algorithms for two mutually exclusive classes, rather than having to deal with
the entirety of the different classes at the same time. Essentially, the proposed
methodology breaks down any m-class classification problem into a series of
2-class classification problems.
DAGs can be seen as a generalization of the class of Decision Trees: the
redundancies and repetitions that may occur in different branches of the tree
can be handled more efficiently, since different decision paths may be merged.
In addition, DAGs are able to collect and conduct a series of tasks in an ordered
manner, subject to constraints requiring that certain tasks be performed earlier than
others. The sequential execution of tasks is particularly important and directly
related to the efficacy with which the overall task is addressed [22].
The DAG-HMM architecture used in this paper includes m(m − 1)/2 nodes
(m being the total number of classes), each associated with a two-class classifi-
cation problem. The connections between the nodes in G have a single
orientation and form no loops. As a result, each node of such a rooted DAG
has either 0 or 2 outgoing arcs.
The principal issue associated with the design of every DAG is the topological
ordering, i.e. ordering the nodes in a way that the starting endpoints of every
edge occur earlier than the corresponding ending endpoints. In the following, we
describe how such a topological ordering is discovered based on the Kullback-
Leibler divergence.
Classification of Sounds Indicative of Respiratory Diseases 97
Naturally, one would expect the performance of the DAG-HMM to depend on
the order in which the different classification tasks are conducted; this was also
evident from early experiments. This observation motivated constructing the
DAG-HMM so that “simple” tasks are executed earlier in the graph. In other
words, these are placed in the top nodes of the DAG-HMM, so that classes
responsible for a high amount of misclassification are discarded early in the
graph’s operation. To get an early indication of the degree of difficulty of a
classification task, we employed a metric representing the distance of the
involved classes in the probabilistic space, i.e. the Kullback-Leibler Divergence
(KLD) between per-class GMMs in the feature space. The basic motivation is
to place early in the DAG-HMM the tasks concerning classes with large KLD,
as they can be completed with high accuracy. The scheme determining the
topological ordering is illustrated in Fig. 2.
The KLD between two J-dimensional probability distributions A and B is
defined as [20]:

D(A||B) = \int_{\mathbb{R}^J} p(X|A) \log \frac{p(X|A)}{p(X|B)} \, dX    (1)
KLD provides an indication of how distant two models are in the proba-
bilistic space. It is important to note that the KLD as given in Eq. 1 is an
asymmetric quantity; a symmetric form is obtained by simply adding the
integrals in both directions, i.e. D_sym(A, B) = D(A||B) + D(B||A). Since the
per-class models are GMMs, for which the KLD has no closed form, the
integral is approximated via Monte Carlo sampling, which is reliable given
that the number of Monte Carlo draws is sufficiently large. During our
experiments we set n = 2000.

[Figure: a DAG-HMM node performing the “1 vs. 4” test on the remaining
classes {1, 2, 3, 4}; two HMMs emit log-likelihoods A and B, and a binary
decision follows the larger one, discarding “not 1” or “not 4”.]
It should be noted that the KLD between HMMs was not used, since computing
distances between HMMs of unequal lengths (common in this work, as HMMs
representing different classes may have different numbers of states) can be
significantly more computationally demanding without a corresponding gain
in modeling accuracy [5,23].
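The Monte Carlo estimate of the symmetric divergence between two per-class GMMs can be sketched as follows. This is a hypothetical NumPy implementation under our own assumptions (diagonal-covariance mixtures represented as a `(weights, means, variances)` tuple); only the n = 2000 draws figure comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_logpdf(x, weights, means, variances):
    # log p(x) under a diagonal-covariance GMM; x has shape (n, d)
    comps = []
    for w, m, v in zip(weights, means, variances):
        ll = -0.5 * np.sum((x - m) ** 2 / v + np.log(2 * np.pi * v), axis=1)
        comps.append(np.log(w) + ll)
    return np.logaddexp.reduce(comps, axis=0)

def gmm_sample(n, weights, means, variances):
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.normal(means[k], np.sqrt(variances[k])) for k in ks])

def mc_kld(a, b, n=2000):
    # D(A||B) ~= (1/n) * sum_i [log p(x_i|A) - log p(x_i|B)],  x_i ~ A
    x = gmm_sample(n, *a)
    return np.mean(gmm_logpdf(x, *a) - gmm_logpdf(x, *b))

def sym_kld(a, b, n=2000):
    # symmetric form: add the integrals in both directions
    return mc_kld(a, b, n) + mc_kld(b, a, n)
```

With a sufficiently large n, `sym_kld` approaches zero for identical models and grows with the separation of the class models in feature space.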
After computing the KLD for the different pairs of classes, i.e. reaching the
second stage depicted in Fig. 2, the KLD distances are sorted in decreasing
order. This reveals the topological ordering of the DAG-HMM, placing the
classification tasks of low difficulty at its top. Each node removes a class
from the candidate list until only one class is left, which comprises the
DAG-HMM prediction. The distance matrix elements can be seen as early
performance indicators of the task carried out by the corresponding node. The
proposed topological ordering places tasks likely to produce misclassifications
at the bottom of the graph. This process outputs a unique solution to the
topological sorting problem as it is usually met in the graph theory literature [3].
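The sorting step above amounts to ordering the binary tasks by decreasing divergence. A minimal sketch (the dictionary layout of the pairwise distances is our own assumption):

```python
def topological_order(kld):
    # kld: {(class_a, class_b): symmetric KLD between the two class models}.
    # The most separable pairs (the "easy" tasks) come first in the DAG.
    return sorted(kld, key=kld.get, reverse=True)
```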
The operation of the proposed DAG-HMM scheme is the following: after extract-
ing the features of the unknown audio signal, the first/root node is activated.
More precisely, the feature sequence is fed to the HMMs, which produce two
log-likelihoods showing the degree of resemblance between the training data of
each HMM and the unknown one. These are compared and the graph flow
continues on the path of the larger log-likelihood. It should be stressed that
the HMMs are optimized (in terms of number of states and Gaussian components)
so that they address the task of each node optimally. That said, a specific class
may be represented by HMMs with different parameters at different nodes of
the DAG-HMM.
An example of a DAG-HMM addressing a problem with four classes is illus-
trated in Fig. 3. The remaining classes for testing are mentioned beside each
node. Digging inside each node, Fig. 3 also shows the HMM-based sound classi-
fier responsible for activating the path of the maximum log-likelihood.
The operation of the DAG-HMM parallels working through a list of candidate
classes, where each level eliminates one class from the list. In more detail, in
the beginning the list includes all the potential audio classes. At each node
the feature sequence is matched against the respective HMMs, the model with
the lowest log-likelihood is erased from the list, and the DAG-HMM proceeds
to the part of the topology without the discarded class. This process terminates
when only one class remains in the list, which comprises the system’s prediction.
Hence, when the problem deals with m different classes, the DAG’s decision is
made after the evaluation of m − 1 nodes.
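The elimination procedure above can be sketched as follows. This is a simplified, hypothetical rendering, not the authors' implementation: `loglik(c, x)` stands in for scoring features `x` with the trained HMM of class `c`, and the sketch greedily re-picks the highest-KLD pair among the remaining classes rather than walking a precompiled graph.

```python
from itertools import combinations

def dag_hmm_predict(features, classes, loglik, kld):
    # classes: candidate labels; loglik(c, x): HMM log-likelihood stand-in;
    # kld[(a, b)] (with a < b): symmetric divergence between class models,
    # used so the "easiest" binary task is evaluated first.
    remaining = set(classes)
    while len(remaining) > 1:
        a, b = max(combinations(sorted(remaining), 2), key=lambda p: kld[p])
        loser = b if loglik(a, features) >= loglik(b, features) else a
        remaining.discard(loser)   # each node eliminates exactly one class
    return remaining.pop()         # m classes -> m - 1 binary decisions
```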
4 Experiments
This section explains (a) the dataset and (b) the parameterization of the
proposed solution for the classification of respiratory sounds, and (c) finally
presents and analyses the achieved results.
4.1 Dataset
The respiratory sound database comes from the challenge organized within the
International Conference on Biomedical Health Informatics in 2017 and it is
publicly available1 . The recordings span over several years. The database has a
total duration of 5.5 h and contains 6898 respiratory cycles, of which 1864 contain
crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, in 920
annotated audio samples coming from 126 subjects.
The cycles were annotated by respiratory experts as including crackles,
wheezes, a combination of them, or no adventitious respiratory sounds. The
recordings were collected using heterogeneous equipment, and their durations
range from 10 s to 90 s. In addition, the noise level in some respiration cycles
is high, representing real-life conditions very well. Finally, the training and
testing data are already defined by the challenge organizing committee. More
information regarding the dataset is available in [18].
1 https://bhichallenge.med.auth.gr/.
Table 2. The recognition rates for the proposed and contrasted methods.
4.3 Results
Table 2 depicts the rates achieved by two contrasted approaches as well as the
proposed one. We observe that the solution based on the DAG-HMM fed with
the WP-Integration feature set achieved the highest recognition rate, equal
to 50.1%. Interestingly, the inferred topological order suggested executing the
classification tasks in the following order: (a) crack+wheeze vs. normal,
(b) normal vs. wheeze, (c) crack vs. crack+wheeze, (d) crack vs. wheeze, (e)
crack+wheeze vs. wheeze, and (f) crack vs. normal.
Towards a more detailed picture of its classification capabilities, Table 3 tab-
ulates the confusion matrix. As we can see, the class identified with the highest
accuracy is wheeze with 64.5%, and second is normal with 63%. On the
contrary, crack sound events were the most misclassified, with the respective
rate being 36.7%.
Even though the achieved rate is the highest one reported in the litera-
ture, it is still far from satisfactory. Interestingly, when samples from crack,
2 Freely available at http://torch.ch/.
Table 3. The confusion matrix (in %) achieved by the proposed approach. The average
recognition rate is 50.1%.
Presented Responded
crack crack+wheeze normal wheeze
crack 36.7 3.1 58.5 1.6
crack+wheeze 3.1 38.4 57.9 0.6
normal 32.4 3.3 63 1.3
wheeze 0.5 3.2 31.8 64.5
5 Conclusions
Acknowledgment. This research was funded by the ELKE TEI Crete funds related
to the domestic project: Bioacoustic applications, number 80680.
References
1. Baluja, S., Covell, M.: Waveprint: efficient wavelet-based audio fingerprinting. Pat-
tern Recogn. 41(11), 3467–3480 (2008). https://doi.org/10.1016/j.patcog.2008.05.
006. http://www.sciencedirect.com/science/article/pii/S0031320308001702
2. Chen, S.H., Wu, H.T., Chen, C.H., Ruan, J.C., Truong, T.: Robust voice activ-
ity detection algorithm based on the perceptual wavelet packet transform. In: 2005
International Symposium on Intelligent Signal Processing and Communication Sys-
tems. IEEE (2005). https://doi.org/10.1109/ispacs.2005.1595342
3. Cook, S.A.: A taxonomy of problems with fast parallel algorithms. Inf. Control
64(1), 2–22 (1985). https://doi.org/10.1016/S0019-9958(85)80041-3. http://www.
sciencedirect.com/science/article/pii/S0019995885800413. International Confer-
ence on Foundations of Computation Theory
19. Serbes, G., Ulukaya, S., Kahya, Y.P.: An automated lung sound preprocessing
and classification system based on spectral analysis methods. In: Maglaveras, N.,
Chouvarda, I., de Carvalho, P. (eds.) Precision Medicine Powered by pHealth and
Connected Health. IP, vol. 66, pp. 45–49. Springer, Singapore (2018). https://doi.
org/10.1007/978-981-10-7419-6 8
20. Taylor, P.: The target cost formulation in unit selection speech synthesis. In:
Ninth International Conference on Spoken Language Processing, ICSLP, INTER-
SPEECH 2006, Pittsburgh, PA, USA, 17–21 September 2006 (2006). http://www.
isca-speech.org/archive/interspeech 2006/i06 1455.html
21. Torrence, C., Compo, G.P.: A practical guide to wavelet analysis. Bull. Am. Mete-
orol. Soc. 79, 61–78 (1998)
22. VanderWeele, T.J., Robins, J.M.: Signed directed acyclic graphs for causal infer-
ence. J. Roy. Stat. Soc.: Ser. B (Stat. Method.) 72(1), 111–127 (2010). https://
doi.org/10.1111/j.1467-9868.2009.00728.x
23. Zhao, Y., Zhang, C., Soong, F.K., Chu, M., Xiao, X.: Measuring attribute dissim-
ilarity with HMM KL-divergence for speech synthesis, 6 p. (2007)
Eye Disease Prediction from Optical Coherence
Tomography Images with Transfer Learning
Abstract. Optical Coherence Tomography (OCT) images of the human eye are
used by optometrists to analyze and detect various age-related eye abnormalities
such as Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME),
and Drusen. Detecting these diseases is quite challenging and requires hours of
analysis by experts, as their symptoms are somewhat similar. We have used
transfer learning with the VGG16 and Inception V3 models, which are state-of-
the-art CNN models. Our solution enables us to predict the disease by analyzing
the image through a convolutional neural network (CNN) trained using transfer
learning. The proposed approach achieves a commendable accuracy of 94% on
the testing data and 99.94% on the training dataset with just 4000 units of data,
whereas, to the best of our knowledge, other researchers have achieved similar
accuracies using a substantially larger (almost 10 times) dataset.
1 Introduction
OCT is used to obtain structural images of the human eye that are otherwise not
visible to the naked eye. It has revolutionized the evaluation and treatment of many
eye diseases, some of which may lead to blindness. OCT, performed while the patient
looks into the machine and keeps still, captures high-resolution images of the back of
the eye which can be used for retina scanning (a retinal scan uses the retinal blood
vessel patterns of a person’s eye). Figures 1 and 2 show the parts of a normal eye and
its retina, respectively.
DRUSEN: Drusen [2] (Fig. 3) predominantly occur due to aging, when yellow
extracellular particles accumulate between the Bruch’s membrane and the retinal
pigment epithelium of the eye. Drusen can clog the transport system, which may
prevent cone cells (responsible for color vision) from receiving enough oxygen and
gradually lead to their death. Drusen lead to less vivid colors and make one’s central
vision blurry.
CNV: Choroidal Neovascularization [3] (Fig. 4) occurs due to the formation of new
blood vessels near the choroid. Defects in the Bruch’s membrane located in the
innermost part of the choroid, extreme myopia, and excessive vascular endothelial
growth are causes of CNV. CNV causes distortion in central vision, and pressure can
be felt at the back of the eye.
DME: Diabetic Macular Edema [4] (Fig. 5) occurs mainly in diabetic patients,
resulting in blurred vision as the macula begins to fill with fluid. The cone cells are
impaired in their ability to sense light, which results in blurred vision. DME occurs
when the blood vessels at the back of the eye start widening.
Using a subset of 4000 images from a large dataset of 85000 OCT scans, we trained
two deep CNN models, using the Keras library in Python, to obtain a maximum
average validation accuracy of 94%.
1.1.1 VGG16
VGG16 is a state-of-the-art CNN model introduced by Karen Simonyan and Andrew
Zisserman [6] for the ImageNet 2014 competition.
106 A. Bhowmik et al.
It comprises 23 layers and is available for public use along with the pretrained
weights (trained on ImageNet, a source of over a million images) to classify objects
belonging to 1000 classes. The model architecture thus has the outermost layer
generating 1000 values for the 1000 classes of objects, for a given input image of
dimension no less than (32,32,3). It has 138,357,544 trainable parameters and its size
is approximately 528 MB.
1.1.2 Inception V3
Similar to VGG16, Inception V3 [7] is also a state-of-the-art, very deep (159-layer)
CNN model for image classification, with 23,851,784 trainable parameters. It supports
images of dimension no less than (75,75,3). The size of InceptionV3 is 92 MB.
2 Related Work
Medical image analysis needs a strong dataset of diverse images. The significance of
creating a dataset from patients of diverse ages and races is to avoid bias in the
prediction process. Also, evaluating the learning procedure of a model being trained
gives us a clear understanding of the volume and impact of the medical data being fed
into the training model. Many such key points of consideration have been pointed out
in Marleen’s article [9]. A recent work [10] studies the formation of Myocardial
Infarction by processing cardiac magnetic resonance images with a region-based CNN
to locate the left ventricle, and then uses a stacked auto-encoder-decoder model to
assess the local motion information of the sequences. The model achieved 87.6%
accuracy in the prediction of infarct locations. Shie et al. [11] have used a transfer
learning approach for Otitis Media images. They extracted encoded features from the
Otitis Media images via transfer learning on a deep CNN (AlexNet) and then, with
those extracted features, trained an SVM classifier on top of AlexNet to get an average
accuracy of 88.5%. Treder et al. [12] describe utilizing a deep CNN in TensorFlow on
an SD-OCT dataset of 1112 exudative AMD and normal eye scans to achieve an
accuracy of 99.7%. In [13], Prahs describes the use of transfer learning on a dataset of
180,000 OCT images to evaluate anti-vascular endothelial growth factor treatment
indication with 95.5% accuracy. Compared to [12] and [13], our model classifies over
a larger variety of eye diseases. Kermany et al. [14] have performed research similar
to ours, using transfer learning on OCT images as well as chest X-ray images with the
VGG16 network, reaching high accuracies. They achieved a maximum accuracy of
96.6% training over 85000 OCT images collected over a period of 4 years. We have
used the same dataset, but demonstrated that our model achieves similarly high
accuracies using a significantly smaller part of it, hence significantly reducing training
time and increasing availability and performance.
The dataset consists of exactly 84495 grayscale images taken by the OCT device and
labelled into the 4 categories of NORMAL, CNV, DME and DRUSEN by experienced
optometrists. The dataset has been downloaded from Kaggle and was created orig-
inally by [15] the “Shiley Eye Institute of the University of California San Diego, the
California Retinal Research Foundation, Medical Center Ophthalmology Associates,
the Shanghai First People’s Hospital, and Beijing Tongren Eye Center” over a period of
4 years, starting from July 2013. Every sample has been assessed by professionals at
multiple levels of expertise. Figure 6 shows the pixel intensities of a random sample
(here CNV) from the dataset. Figure 7, from [15], shows a comparative view of the
OCT scans for our classes.
We have taken a subset of the dataset (exactly 4000 samples, 1000 per class) so as to
keep the training time low. The frequencies of NORMAL, CNV, DME, and DRUSEN
in a random sample batch of size 728 used for training are shown in Fig. 8.
4 Implementation
For our scenario we have used the VGG16 and InceptionV3 models for transfer
learning using the pretrained ImageNet weights. The models are available through the
Keras library along with the weights. The outermost fully connected layer was
removed, leaving a convolutional layer with an output dimension of [5,5,256]. We add
to this a pooling layer of size 2 × 2 to scale down the feature size, followed by a
flattening layer and a Dense (perceptron) layer. The Dense layer, with a SoftMax
activation function, takes in a 1024-size vector from the flattening module and outputs
a vector of size 4.
Initially, in the pre-processing stage, the images were resized to 224 × 224
resolution and reshaped into NumPy arrays (resulting dimension [224,224,3]). The data
labels were categorized into 4 labels (1, 2, 3, 4) and coded into their one-hot vector
representations. We froze the layers of the original model so that they were not updated
during training; only our layers were trained. We used both Adam and RMSProp to
optimize the training process, with a learning rate of 0.00005; Adam [16], as expected,
performed slightly better. The model’s loss was based on categorical cross-entropy; a
loss function measures how badly the model predicted in that epoch. We ran the
training for 50 epochs on a dataset of size 4000 with equally balanced data from all
classes and a train/test split of 0.2. The system was tested on both the VGG16 and
InceptionV3 models. For about the first 12 epochs, the Inception V3 model learned
faster than VGG16 on the small dataset (without the dropout layer), although VGG16
performed better overall. Although our scenario was of type 4 as mentioned in
Sect. 1.2, fortunately we did not need to train the whole network from scratch to get
good results; training the uppermost layer was sufficient. Figure 9 describes the whole
training process.
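The shape arithmetic of the classification head described above can be checked with a minimal NumPy sketch. The [5,5,256] feature map, 2 × 2 pooling, 1024-element flattened vector, and 4-way softmax come from the text; the zero-initialized weights are placeholders, not trained values.

```python
import numpy as np

# Feature map emitted by the truncated base CNN, per the text: [5, 5, 256]
feat = np.arange(5 * 5 * 256, dtype=float).reshape(5, 5, 256)

# 2x2 max pooling with stride 2 (the odd spatial size floor-divides to 2)
pooled = feat[:4, :4, :].reshape(2, 2, 2, 2, 256).max(axis=(1, 3))

flat = pooled.reshape(-1)              # 2 * 2 * 256 = 1024 features
W, b = np.zeros((1024, 4)), np.zeros(4)
logits = flat @ W + b                  # dense layer to the 4 disease classes
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probabilities
```

This confirms why the Dense layer receives a 1024-size vector: pooling the 5 × 5 map down to 2 × 2 over 256 channels yields exactly 2 · 2 · 256 = 1024 values.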
Adding a dropout [17] layer with a dropout rate of 0.5 before the final layer actually
yielded better results than the previous model.
Dropout layers are used for regularization by “dropping out” the contribution of a
few randomly selected neurons during training. This pushes the other active neurons
to contribute more to features that would otherwise be captured by the deactivated
neurons, helping the network generalize better; it also builds multiple independent
representations of the subject being learned. Using a dropout rate of 0.5, we achieved
1 to 2 percent improvements in accuracy over our base model. Figure 10 describes the
layers of the VGG16 CNN model we altered for this research work; the CNN in the
figure corresponds to a standard VGG16 with the top layer removed.
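The dropout mechanism described above can be illustrated with a few lines of NumPy. This is a generic "inverted dropout" sketch, not the Keras layer used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    # Inverted dropout: randomly zero a fraction `rate` of activations during
    # training and rescale the survivors so that the expected activation is
    # unchanged, letting inference use the layer untouched.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

At inference time (`training=False`) the input passes through unchanged, which is exactly why the rescaling is applied during training.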
5 Results
The proposed network was trained on a device with Nvidia k80 12 GB GPU and
32 GB RAM, running on Intel Xeon processor. Testing with our network with a
training set of size 3200 and validation of 800 samples produces 94% accuracy. It is
noteworthy that such high accuracy is achieved using a transfer learning model on a
dataset which is comparatively so small. Studying other approaches for the same
problem, it is seen that, increasing the data obviously enhances the accuracy only by a
small percentage. But it takes considerably larger amount of time to train in such cases.
Our model takes a running time of 41 s per epoch (one epoch running over all 4000
data records). Here is a comparative study of similar models with substantially more
data and our model with lesser data (Table 1).
Brief description of the evaluation metrics used: Precision and Recall [18].
Precision is the ratio of useful (true positive) results to the total retrieved results;
Recall is the ratio of the relevant results we retrieved to the total number of relevant
instances:

Precision P = TP / (TP + FP)    (1)

Recall R = TP / (TP + FN)

where TP = True Positive (predicted value matches the actual value and both are
positive), FP = False Positive or Type 1 error (predicted positive when actually
negative), and FN = False Negative (predicted negative when actually positive).
Finally, the F-score measures how close Precision and Recall are: it is their
harmonic mean, F = 2PR/(P + R).
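The three metrics above can be computed directly from the confusion counts; a minimal sketch (function name is ours):

```python
def precision_recall_f(tp, fp, fn):
    # precision = TP/(TP+FP); recall = TP/(TP+FN); F = their harmonic mean
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```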
VGG16 Results: The average accuracy was 91.6% without the dropout layer over 12
epochs on a 1000-sample dataset (Fig. 10); with the dropout layer and the 4000-sample
dataset, accuracy rose to 94% in about 33 epochs, after which the process converged by
early stopping. The precision, recall, and F-scores for each class are shown in the
figure; the support tag gives the number of validation samples per class. The average
recorded accuracy on the training dataset, however, was 99.4%.
Inception V3 Results: It recorded an average prediction accuracy of 92.6% without the
dropout layer using a 1000-sample dataset, and no further improvement was noticed
when using the larger dataset of 3200 training samples (Fig. 11).
Rows of the confusion matrix denote the actual labels, while columns denote the
predicted labels. The dark tiles show the number of correct label predictions while the
light tiles denote incorrect predictions, giving a clearer overview of the model’s
performance. From the results we notice that differentiating between a Normal eye and
a Drusen eye has been more challenging than the other disease predictions.
Figures 13 and 14 show the learning performance of our VGG16-based model
throughout the training process. Our model provides excellent results from the 10th
epoch itself. After the 10th epoch, the model converges at a slower rate, ultimately
reaching a training accuracy of 99.7% at the 40th epoch, where training stops
automatically by early stopping [19] regularization.
6 Conclusion
References
1. Brezinski, M.E., Fujimoto, J.G.: Optical coherence tomography: high-resolution imaging in
nontransparent tissue. IEEE J. Sel. Top. Quantum Electron. 5(4), 1185–1192 (1999)
2. Hunter, A.A., Chin, E.K., Almeida, D.R., Telander, D.G.: Drusen imaging: a review. J. Clin.
Exp. Ophthalmol. 5(327), 2 (2014)
3. Amaro, M.H., Holler, A.B.: Age-related macular degeneration with choroidal neovascular-
ization in the setting of pre-existing geographic atrophy and ranibizumab treatment. Analysis
of a case series and revision paper. Revista Brasileira de Oftalmologia 71(6), 407–411
(2012)
4. Bressler, N., et al.: Optimizing management of diabetic macular edema in Hong Kong: a
collaborative position paper. Hong Kong J. Ophthalmol. 21(2), 59–64 (2017)
5. Kumar, S., Kumar, M.: A study on the image detection using convolution neural networks
and TensorFlow. In: 2018 International Conference on Inventive Research in Computing
Applications (ICIRCA), pp. 1080–1083. IEEE (2018)
6. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
7. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
8. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10),
1345–1359 (2010)
9. de Bruijne, M.: Machine learning approaches in medical image analysis: from detection to
diagnosis (2016)
10. Chen, M., Fang, L., Zhuang, Q., Liu, H.: Deep learning assessment of myocardial infarction
from MR image sequences. IEEE Access 7, 5438–5446 (2019)
11. Shie, C.-K., Chuang, C.-H., Chou, C.-N., Wu, M.-H., Chang, E.Y.: Transfer representation
learning for medical image analysis. In: IEEE Engineering in Medicine and Biology Society,
Conference, pp. 711–714 (2015)
12. Treder, M., Lauermann, J.L., Eter, N.: Automated detection of exudative age-related macular
degeneration in spectral domain optical coherence tomography using deep learning. Graefe’s
Arch. Clin. Exp. Ophthalmol. 256(2), 259–265 (2018)
13. Prahs, P., et al.: OCT-based deep learning algorithm for the evaluation of treatment
indication with anti-vascular endothelial growth factor medications. Graefe’s Arch. Clin.
Exp. Ophthalmol. 256(1), 91–98 (2018)
14. Kermany, D.S., et al.: Identifying medical diagnoses and treatable diseases by image-based
deep learning. Cell 172(5), 1122–1131 (2018)
15. Cell. http://www.cell.com/cell/fulltext/S0092-8674(18)30154-5
16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:
1412.6980 (2014)
17. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a
simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–
1958 (2014)
18. Towards Datascience. https://towardsdatascience.com
19. Prechelt, L.: Early stopping - but when? In: Orr, Genevieve B., Müller, K.-R. (eds.) Neural
Networks: Tricks of the Trade. LNCS, vol. 1524, pp. 55–69. Springer, Heidelberg (1998).
https://doi.org/10.1007/3-540-49430-8_3
20. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmen-
tation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3431–3440 (2015)
21. Van Engelen, A., et al.: Multi-center MRI carotid plaque component segmentation using
feature normalization and transfer learning. IEEE Trans. Med. Imaging 34(6), 1294–1305
(2015)
Severe Asthma Exacerbations Prediction Using
Neural Networks
Asthma exacerbations and general mishandling of the disease are among the main
sources of hospitalizations, emergency treatment, and worsening of patients’ quality of
life, causing, in some cases, irreversible pulmonary obstruction after a few years of
poor control [2]. Emergency treatment and hospitalizations after primary care can be
assessed from related databases under several perspectives, such as the type of
treatment performed or the medication applied.
To evaluate these cases, databases are managed and analyzed, especially from the
perspective of whether the assistance in the hospital resulted in emergency handling
or not.
Despite its importance, this information is missing in some cases, affecting the
quality of the assessments developed by the specialists who usually have access to it. It
is critical for them, for example, to understand which age group or sex is typically
involved in emergency cases of asthma exacerbation, or even which drugs are being
administered. Hence, this work aims to train a neural network to automate the
prediction of emergency cases and avoid a long manual analysis. Besides, we intend to
evaluate its accuracy and calculate other performance metrics, such as the model’s loss
function, to better understand how precise the neural network is and to compare it with
other machine learning algorithms.
Fig. 1. Hospitalizations caused by asthma, between January 2000 and December 2011
Fig. 2. Asthma mortality and hospitalization landscape in Brazilian public healthcare system
Fig. 3. Total asthma hospitalizations and their costs in Brazilian public healthcare system
To be more precise regarding asthma expenses, in 2006 [5] a study was conducted
with controlled asthma patients (i.e., people with daytime symptom manifestations or
nighttime manifestations at most twice a week) and uncontrolled ones (i.e., people with
daytime symptom manifestations more than twice a week, or with nighttime mani-
festations on two consecutive nights, or even using relief medication more than twice a
week). In this report, the direct costs for an uncontrolled patient were US$ 39,15 for
the use of emergency rooms, US$ 86,30 for hospitalization procedures, and US$ 36,20
for medicine handling, whereas the costs for a controlled patient were US$ 2,70 for
emergency rooms, US$ 12,88 for hospitalization procedures, and US$ 74,50 for drug
application (see Fig. 4). Regarding the indirect costs, the days lost at school or work
were larger in the uncontrolled group (54 vs. 30 school days, 48 vs. 12 work days).
118 A. Silveira et al.
Fig. 4. Direct costs distribution between controlled and uncontrolled asthma patients
Hence, the study showed that patients with uncontrolled asthma have a larger
financial impact than controlled ones. The largest share of the costs relates to the use of
emergency rooms and hospitalizations, which implies that there are opportunities for
cost reduction, disease control (through medication), and fewer hospitalizations.
Analyses that try to understand the exacerbation behaviors that lead to emergency
handling, from a series of characteristics observed during the patient’s primary care,
may support the creation of better strategies to control the disease and diminish its
weight on the public and private healthcare budgets in Brazil. By offering information
and structured treatment to patients, greater control of the condition, a cost reduction,
and better well-being can be achieved.
3 Methodology
It is important to note that each activity listed above breaks down into other
essential ones, such as:
• Clean up and prepare the dataset to perform an EDA:
– Exclude unnecessary variables and not-available (N/A) data;
• Perform the EDA, which will identify patterns, trends, and behaviors that will
support the selection of information for the next step:
– Examine the database balance according to the input and output variables;
– Visualize the database’s general behavior, capturing patterns from it;
– Select the variables that will be used to train the classifier algorithm.
• Train and test a neural network and other machine learning algorithms, using the
information provided by the database:
– Split the database into train and test samples in order to perform these tasks;
– Train and test a neural network, along with other classifiers, to predict the type
of care, using the relevant variables identified in the EDA.
L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]    (1)
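The binary cross-entropy loss of Eq. 1 can be sketched in NumPy as follows (the clipping constant is our own numerical safeguard, not part of the original formulation):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Eq. 1: L = -(1/n) * sum_i [ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    y_pred = np.clip(y_pred, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

The loss is near zero when predictions match the labels and grows as confident predictions land on the wrong side.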
4 Implementation
4.1 Exploratory Data Analysis (EDA), Pre-processing, Training and Test
An EDA is crucial to uncover patterns in the database and can be extremely helpful
for the neural network’s classification performance. For example, other variables can
be created from the assessment of the original ones, in order to make the analysis more
insightful; in this research, two such variables were created and helped throughout the
process. After the EDA, the final variables were selected and passed through
preprocessing and one-hot encoding [11] to enable their use by the model. The
separation between a training sample (with 72.946 entries, 80% of the database) and a
test sample (with 18.237 entries, 20% of the database) occurred at the end.
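The one-hot encoding step above can be sketched as follows. This is a hypothetical NumPy stand-in for the encoder used in [11]; the class names in the example are illustrative only:

```python
import numpy as np

def one_hot(labels, classes):
    # Map categorical labels to one-hot rows,
    # e.g. 'emergency' -> [1, 0] and 'non-emergency' -> [0, 1].
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(labels), len(classes)), dtype=int)
    for row, lab in zip(out, labels):
        row[index[lab]] = 1
    return out
```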
5 Results
After 100 epochs, these are the results of our neural network model (Figs. 7, 8 and 9).
The neural network achieved a maximum accuracy of 89.5% (epoch 73) on the test
sample, ranking as the third-best accuracy among all the classification methods,
behind only the specialists' classification and the decision tree method (see Fig. 10,
accuracy level comparison table).
Severe Asthma Exacerbations Prediction Using Neural Networks 123
Model Accuracy
Specialists’ 91.2%
Decision tree 89.8%
Neural network 89.5%
Random forest 89.2%
Gaussian SVM 83.0%
Linear SVM 76.2%
Naive Bayes 70.3%
Polynomial SVM 58.4%
6 Conclusion
Regarding the confusion matrix, a few points should be noted. First, the algorithm
classified true emergency cases better than non-emergency ones (91% vs. 87%).
This is probably caused by the slightly higher number of emergency cases in the
dataset, which may have led the algorithm to specialize. Secondly, the low level of
false negatives observed, compared to the other methods assessed, is important for
healthcare machine learning applications: saying a patient isn't ill when she indeed
needs treatment can cost lives and limit the application in real life.
It is also noticeable that the specialists' method produced a higher accuracy level.
Yet, given the time it took to complete (roughly six weeks), a trained neural
network may yield an automated and far less time-consuming result.
Finally, new databases that do not contain the emergency classification may
obtain it by applying our trained neural network. By automating this process, new
emergency asthma cases can be prevented through the study of their behavior and
possible severity diagnoses, leading to more assertive medicine campaigns for
pharmaceutical companies, better public healthcare policies, and a decrease in
hospitalization spending in Brazil.
References
1. BMJ Best Practices. https://bestpractice.bmj.com/topics/pt-br/45. Accessed 02 Dec 2018
2. Jornal Brasileiro de Pneumologia. http://www.scielo.br/scielo.php?script=sci_arttext&pid=
S1806-37132006001100002. Accessed 02 Dec 2018
3. Global Initiative for Asthma. www.ginasthma.org. Accessed 02 Dec 2018
4. Jornal de Pediatria. http://www.scielo.br/pdf/jped/v82n5/v82n5a06. Accessed 02 Dec 2018
124 A. Silveira et al.
1 Introduction
DNA microarrays have become popular in the medical field because this technology
is capable of analyzing the expression levels of millions of genes at the same time.
By means of this analysis, it is possible to perform several tasks, such as diagnosing
diseases, identifying different tumors, and selecting the best treatment for a specific
patient to resist an illness, among others. However, the analysis of these expressions
requires robust computational and statistical techniques to obtain the most relevant
information from the DNA microarray. In particular, the Artificial Intelligence
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 125–136, 2019.
https://doi.org/10.1007/978-3-030-20257-6_11
126 B. A. Garro and R. A. Vazquez
community has applied several new techniques for obtaining this valuable
information. A branch of AI called pattern classification focuses on the identification
of different classes or groups by analyzing the information contained in the samples;
these samples can be associated with a particular disease, making possible the
identification of different types of cancer. A DNA microarray has an enormous
quantity of genes to be analyzed while few samples are available, which implies
that the computational intelligence technique must be capable of learning from
few samples in order to be applied to a DNA classification task.
Artificial neural networks (ANNs) are computational models that have been
applied to different tasks, such as pattern recognition, and particularly to the
classification of DNA microarrays. For example, the authors in [1] describe how an
ANN can be used to identify the recurrence of cancer after prostatectomy in terms
of predictive genes. Other authors, such as in [2], apply an ANN to cancer
classification using the singular value decomposition (SVD). In [3], the authors
diagnose disease categories of a kind of cancer, and in [4], the authors focus their
research on mass spectrometry. In [5], ANN ensembles based on a sample-filtering
algorithm are designed to separate wrongly labeled DNA microarrays from the
training set, and an additional ANN is constructed just for the wrongly classified
samples.
The previous papers have in common that the authors first perform a dimensionality
reduction in order to select the most representative genes, because many of them
are irrelevant. Without a careful selection of genes, the consequences can be
reflected in low performance when classifying or predicting a disease. For example,
in [6], a selection of genes is performed in terms of a gene ranking based on the
significance level; after that, the information obtained during the dimensionality
reduction process is used to train an ANN and then classify cancer tumors. Other
examples are described in [7] and [8], where the authors apply different swarm
intelligence techniques for the dimensionality reduction process and then train an
ANN in order to classify the DNA microarrays.
However, in order to get acceptable results with an ANN, it is necessary to adjust
several parameters during its design. The number of synaptic weights, the kind of
transfer functions, and the number of inputs that will be transformed by the system
are important features that directly impact the accuracy of the solution. For this
reason, the authors in [9] optimize the ANN design for classification problems.
The generalized neural network (GNN) is another kind of ANN, developed by [10]
with the aim of designing an artificial neural network without many connections,
with a good performance compared with classic ANNs, and easy to implement in
hardware to solve real-time problems. Some works have applied the GNN to
function approximation [11], density estimation [12,13], prediction and, recently,
DNA classification problems [14]. Basically, the generalized neuron is composed of
three neurons. Although the results obtained with this model are highly acceptable,
if the problem requires building a generalized network, that network will be composed of three times more
A New Generalized Neuron Model Applied to DNA Microarray Classification 127
neurons than a classical feedforward network. In that sense, it is still necessary to
propose a neuron model that preserves the advantages of the GNN and at the same
time reduces the number of neurons used in the model.
In this work, we propose a new generalized neuron model as a generalization of the
model proposed by [10], allowing the automatic selection of different transfer
functions, aggregation functions, and the operators associated with the integration
of the input pattern and the synaptic weights, all within a single neuron. In order
to apply the generalized neuron to a DNA microarray classification problem, the
gene dimensionality is first reduced using the artificial bee colony (ABC) algorithm;
then the data are used to train the GNN by means of a differential evolution (DE)
algorithm in which the main parameters of the model are evolved.
This work comprises five sections: Section 1 is a brief introduction. Section 2
presents the concept of the generalized neuron as well as the new generalized model.
Section 3 details the proposed methodology, followed by the experimental results in
Section 4. Finally, the conclusions of this work are presented in Section 5.
To compute the output of the CGN, the input pattern x ∈ IR^N is mapped with the
corresponding synaptic weights w ∈ IR^N of each neuron (O_Σ and O_Π) using the
corresponding aggregation functions (sum and product); the results are then
evaluated by the transfer functions (sigmoidal for the sum and Gaussian for the
product). Finally, the output neuron integrates the results of the first two neurons.
In addition, another input, called the bias, is considered.
OCGN = W · OΣ + (1 − W ) · OΠ (3)
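The classical generalized neuron's output (Eq. 3) can be sketched numerically as below. This is our own illustration; the exact bias handling of the original model is an assumption here:

```python
import math

def cgn_output(x, w_sum, w_prod, w, b_sum=0.0, b_prod=1.0):
    """Classical generalized neuron, Eq. (3): O = W*O_sigma + (1-W)*O_pi."""
    # Sigmoidal neuron over the sum aggregation
    s = sum(xi * wi for xi, wi in zip(x, w_sum)) + b_sum
    o_sigma = 1.0 / (1.0 + math.exp(-s))
    # Gaussian neuron over the product aggregation
    p = b_prod
    for xi, wi in zip(x, w_prod):
        p *= xi * wi
    o_pi = math.exp(-p * p)
    # Output neuron: weighted combination of the two partial outputs
    return w * o_sigma + (1.0 - w) * o_pi
```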
In order to propose a simpler generalized neuron model, we analyze the main parts
of a perceptron (Eq. 4), a morphological perceptron (Eq. 5), and some terms of a
polynomial neural network (Eq. 6).

y_cp = f( Σ_{i=1}^{n} x_i · w_i + b )    (4)
and the integration operator that allows the interaction between the input pattern
(x) and the synaptic weights (w) can be selected from a set of three main operators
I_o = {⊕, ⊙, ⊗}    (10)
defined as
A New Generalized Neuron Model Applied to DNA Microarray Classification 129
⊕(x, w) = x + w    (11)
⊙(x, w) = x · w    (12)
According to [8], the selection of the best set of genes can be defined in terms of
an optimization problem. Given a set of p DNA microarrays X = {x_1, ..., x_p},
x_i ∈ IR^n, i = 1, ..., p, and the corresponding diseases d = {d_1, ..., d_p}
associated with each DNA microarray, where d_i ∈ {1, ..., K} and K is the number
of diseases, the aim is to find a subset of genes of the DNA microarray data,
G ∈ {0, 1}^n, such that a fitness function defined by min(F(X|G, d)) is minimized.
T_th(x) = 0 if x < th;  1 if x ≥ th.    (14)
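The threshold function labeled (14) turns a bee's real-valued position into a binary gene-selection vector G. A one-line sketch (our own illustration):

```python
def gene_mask(position, th=0.3):
    """Eq. (14): unit step per gene; 1 keeps the gene, 0 discards it."""
    return [1 if v >= th else 0 for v in position]
```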
The artificial bee colony (ABC) algorithm is based on the metaphor of the bees'
foraging behavior [16]. A population of NB bees x_i ∈ IR^n, i = 1, ..., NB,
represented by the positions of the food sources (possible solutions), is distributed
in a search space. Three classes of bees are used to achieve convergence near the
optimal solution, which represents the most relevant genes of the DNA microarray.
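The three-phase ABC loop described above can be sketched generically as below. This is a hedged illustration of the algorithm on a toy objective, not the authors' gene-selection code; the parameter names mirror those the paper uses later (NB bees, limit, MNC cycles):

```python
import random

def abc_minimize(f, dim, lo, hi, nb=20, mnc=200, limit=30, seed=1):
    """Minimal artificial bee colony: employed, onlooker and scout phases.

    Assumes f(x) >= 0 (as with a classification-error objective)."""
    rng = random.Random(seed)
    sn = nb // 2                       # number of food sources
    foods = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(sn)]
    fits = [f(x) for x in foods]
    trials = [0] * sn
    best_fit, best = min(zip(fits, foods))

    def try_neighbor(i):
        nonlocal best_fit, best
        k = rng.choice([j for j in range(sn) if j != i])
        j = rng.randrange(dim)
        cand = foods[i][:]
        cand[j] += rng.uniform(-1, 1) * (foods[i][j] - foods[k][j])
        cand[j] = min(hi, max(lo, cand[j]))
        fc = f(cand)
        if fc < fits[i]:               # greedy replacement
            foods[i], fits[i], trials[i] = cand, fc, 0
            if fc < best_fit:
                best_fit, best = fc, cand[:]
        else:
            trials[i] += 1

    for _ in range(mnc):
        for i in range(sn):            # employed bees
            try_neighbor(i)
        weights = [1.0 / (1.0 + ft) for ft in fits]
        for _ in range(sn):            # onlooker bees (roulette wheel)
            i = rng.choices(range(sn), weights=weights)[0]
            try_neighbor(i)
        for i in range(sn):            # scout bees abandon stale sources
            if trials[i] > limit:
                foods[i] = [rng.uniform(lo, hi) for _ in range(dim)]
                fits[i], trials[i] = f(foods[i]), 0
    return best, best_fit
```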
To evaluate the solutions found by the ABC algorithm and determine which is the
best, it is necessary to define a fitness function. The aptitude of an individual is
calculated with a fitness function that measures how many samples have been
wrongly predicted in terms of the classification error function (CER). This fitness
function is defined in Eq. 15.

F(X|G, d) = (1/p) · Σ_{i=1}^{p} [[ argmin_{k=1,...,K} D(x_i|G, c_k|G) ≠ d_i ]]    (15)

where [[·]] equals 1 when the predicted class differs from d_i and 0 otherwise,
p is the total number of gene expressions to be classified, D is a distance measure,
K is the number of classes, c_k is the center of each category, arg min provides the
class to which the input pattern belongs in terms of the distance classifier, and d_i
is the expected class.
In [8], the authors comment that different distance measures could be applied
to classify the gene expression samples. In this research, we adopt the Euclidean
distance.
Once the set of features that best describes the disease has been selected, the next
step is to train the generalized neural network (GNN), composed of two generalized
neurons, one for each class defined in the problem. To determine the class to which
the input pattern belongs, the generalized neurons enter a competition stage in
which the neuron with the highest output is set to 1 and the remaining neurons are
set to 0; the neuron set to 1 determines the class to which the input pattern belongs.
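The competition stage amounts to a winner-take-all over the neuron outputs; a tiny sketch (our own illustration):

```python
def winner_take_all(outputs):
    """Set the highest-output neuron to 1 and all others to 0."""
    w = outputs.index(max(outputs))
    return [1 if i == w else 0 for i in range(len(outputs))]
```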
The inputs of the GNN are fed with the genes previously selected by the ABC
algorithm. Before training the GNN, the dataset with the best genes was partitioned
into training and testing subsets. After that, the GNN was trained with the
differential evolution (DE) algorithm.
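The DE training step can be sketched generically as below (DE/rand/1/bin on a toy objective; this is our own illustration, where in the paper's setting the decision vector would encode the GNN's parameters and the objective its classification error):

```python
import random

def de_minimize(f, dim, lo, hi, np_=20, cr=0.9, f_w=0.8, gens=300, seed=1):
    """Minimal differential evolution (DE/rand/1/bin)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(np_)]
    fits = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(np_):
            a, b, c = rng.sample([j for j in range(np_) if j != i], 3)
            jr = rng.randrange(dim)          # forced crossover index
            trial = [min(hi, max(lo, pop[a][k] + f_w * (pop[b][k] - pop[c][k])))
                     if (rng.random() < cr or k == jr) else pop[i][k]
                     for k in range(dim)]
            ft = f(trial)
            if ft <= fits[i]:                # greedy one-to-one selection
                pop[i], fits[i] = trial, ft
    i_best = fits.index(min(fits))
    return pop[i_best], fits[i_best]
```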
4 Experimental Results
In this section, we analyze the experimental results obtained with the proposed
methodology to determine its accuracy. For that purpose, the methodology was
applied to classify two types of cancer: acute lymphocytic leukemia (ALL) and
acute myeloid leukemia (AML).
In [18], the authors demonstrate successful classification between ALL and AML
leukemia using DNA microarrays. The ALL-AML Leukemia benchmark dataset
contains measurements corresponding to ALL and AML samples from bone marrow
and peripheral blood. It is composed of 38 samples for training (27 ALL and 11
AML) and 34 samples for testing (20 ALL and 14 AML), where each sample
contains information on 7129 gene expressions.
To select the best set of genes, we adopted the protocol described in [8],
performing 30 experiments for different values of th and using the following
parameters for the ABC algorithm: population size NB = 40, maximum number of
cycles MNC = 2000, limit l = 100, and NB/2 food sources.
After concluding the experiments, we observed that the best results were obtained
with the threshold set to th = 0.3. Using these parameters, the ABC algorithm
selected three genes for performing the classification task, achieving an average
accuracy of 74.6% during the testing stage (see Table 1). According to the
experimental results obtained during the dimensionality reduction stage, the best
genes found by the methodology were SLC17A2 Solute carrier family 17
(L13258 at), MLC gene (M22919 rna2 at), and FBN2 Fibrillin 2 (U03272 at),
achieving a maximum accuracy of 88.2% on the testing samples.
Table 1. Best behavior of the proposed methodology using a Euclidean distance clas-
sifier for ALL-AML dataset.
Once the best set of genes was found, the information was used to train a
generalized neural network (GNN) composed of two generalized neurons. By using
these three genes combined with the methodology for training the GNN, we
expected to improve the results achieved with the Euclidean distance classifier. For
training the GNN, we set the parameters of the differential evolution (DE)
algorithm as follows: crossover CR = 0.9, mutation rate F = 0.8, number of
population members NP = 50, and 1000 generations. In order to statistically
validate and compare the results obtained with the designed ANN, 30 experiments
were performed. In each experiment, the dataset was partitioned following two
different criteria: the original partition (O), where the data are partitioned as in the
original dataset, and a random partition (R), where 80% of the samples were
selected randomly from the original dataset to construct the training dataset and
the remainder was used for the testing dataset.
In addition, the experimental results obtained with the GNN were compared
against the results obtained with a classical generalized neuron (CGN), following
the methodology described in [14], and a feedforward neural network (MLP) with
the following architecture: an input layer with three linear neurons, a hidden layer
with five sigmoidal neurons, and an output layer with two sigmoidal neurons. To
determine the class to which the input pattern belongs, the neurons of the output
layer enter a competition stage in which the neuron with the highest output is set
to 1 and the remaining neurons are set to 0; the neuron set to 1 determines the
class to which the input pattern belongs. The proposed MLP was trained with the
following parameters: the learning rate was set to 0.1, the number of epochs for the
training phase was set to 5000, and the goal error was set to 0. The
Levenberg-Marquardt algorithm was used during the training phase of the MLP
(Figs. 2 and 3).
Table 2 shows the results for the Leukemia problem using the proposed MLP, the
CGN, and the proposed GNN. From these results, we can observe that the MLP,
the CGN, and the GNN all provide better results than those obtained with the
Euclidean distance classifier. On average, the CGN and the GNN provide better
results than the MLP, achieving accuracies of 93.3% and 91.4%, respectively,
during the testing phase with the random partition. Although there is a slight
decrease in accuracy using the GNN, for the case of the original partition the GNN
provides the best results (87.1%). The best average accuracy provided by the
proposed methodology was 100%, using the random partitions. On
the other hand, the best accuracy obtained with the proposed methodology was
94.1%, using the original partition.
Finally, in Table 3, we compare the results against those obtained by other
authors using a support vector machine (SVM). In general, we observed that the
proposed methodology performs better than those revised from the literature.
These experimental results suggest that the GNN could be adopted as an
alternative technique for performing the classification of DNA microarrays.
Compared against the CGN and the MLP, the GNN provides similar results, but it
is important to mention that fewer neurons are required compared with the MLP.
On the other hand, the new generalized neuron model is less complex than the
classical generalized neuron.
Table 2. Best and worst accuracy obtained with the GN and MLP.
5 Conclusions
During the first stage, a dimensionality reduction over the ALL-AML dataset was
successfully applied in order to select the set of genes that best describes leukemia,
using the ABC algorithm. Once the best set of genes was discovered, we evaluated
the accuracy of a simple distance classifier, obtaining a highly acceptable accuracy.
Nonetheless, in the second stage, we tried to improve the results using a generalized
neural network (GNN).
In the second stage, we evaluated the performance of the GNN in the detection of
the two types of leukemia. The GNN was trained using the set of genes discovered
by the proposed methodology during the first stage. The results obtained show
that the differential evolution algorithm is an excellent technique for training a
GNN. The accuracy achieved with the GNN was compared against the accuracy of
a feedforward neural network (FNN) as well as the classical generalized neuron
(CGN). Through several experiments, we observed that the GNN, as well as the
CGN and the FNN, obtained better results than those reached with the distance
classifier. On the other hand, the experimental results showed that the GNN
achieved a better performance than the FNN and similar results to those obtained
with the CGN. Finally, we conclude that the GNN trained with the proposed
methodology is capable of detecting, predicting, and classifying a disease with an
acceptable accuracy.
Acknowledgment. The authors would like to thank Universidad La Salle México for
the economic support under grant number NEC-10/18.
References
1. Peterson, L., et al.: Artificial neural network analysis of DNA microarray-based
prostate cancer recurrence. In: 2005 Proceedings of the 2005 IEEE Symposium on
Computational Intelligence in Bioinformatics and Computational Biology, CIBCB
2005, pp. 1–8, November 2005
2. Huynh, H.T., Kim, J.J., Won, Y.: Classification study on DNA micro array with
feed forward neural network trained by singular value decomposition. Int. J. Bio-
Sci. Bio-Technol. 1, 17–24 (2009)
3. Khan, J., et al.: Classification and diagnostic prediction of cancers using gene
expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)
4. Lancashire, L.J., Lemetre, C., Ball, G.R.: An introduction to artificial neural networks
in bioinformatics: application to complex microarray and mass spectrometry
datasets in cancer studies. Briefings Bioinform. 10(3), 315–329 (2009)
5. Chen, W., Lu, H., Wang, M., Fang, C.: Gene expression data classification using
artificial neural network ensembles based on samples filtering. In: 2009 Interna-
tional Conference on Artificial Intelligence and Computational Intelligence, AICI
2009, vol. 1, pp. 626–628, November 2009
6. Peterson, L.E., Coleman, M.A.: Comparison of gene identification based on artifi-
cial neural network pre-processing with k-means cluster and principal component
analysis. In: Bloch, I., Petrosino, A., Tettamanzi, A.G.B. (eds.) WILF 2005. LNCS
(LNAI), vol. 3849, pp. 267–276. Springer, Heidelberg (2006). https://doi.org/10.1007/11676935_33
7. Garro, B.A., Vazquez, R.A., Rodrı́guez, K.: Classification of DNA microarrays
using artificial bee colony (ABC) algorithm. In: Tan, Y., Shi, Y., Coello, C.A.C.
(eds.) ICSI 2014. LNCS, vol. 8794, pp. 207–214. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11857-4_24
8. Garro, B.A., Rodrı́guez, K., Vázquez, R.A.: Classification of DNA microarrays
using artificial neural networks and ABC algorithm. Appl. Soft Comput. 38, 548–
560 (2016)
9. Garro, B.A., Vázquez, R.A.: Designing artificial neural networks using parti-
cle swarm optimization algorithms. Comp. Int. and Neurosc. 2015, 369298:1–
369298:20 (2015)
10. Kulkarni, R.V., Venayagamoorthy, G.K.: Generalized neuron: feedforward and
recurrent architectures. Neural Netw. 22(7), 1011–1017 (2009)
11. Rizwan, M., Jamil, M., Kothari, D.: Generalized neural network approach for global
solar energy estimation in India. IEEE Trans. Sustain. Energy 3(3), 576–584 (2012)
12. Kiran, R., Venayagamoorthy, G.K., Palaniswami, M.: Density estimation using
a generalized neuron. In: 9th International Conference on Information Fusion,
FUSION 2006, Florence, Italy, 10–13 July 2006, pp. 1–7. IEEE (2006)
13. Kiran, R., Jetti, S.R., Venayagamoorthy, G.K.: Online training of a generalized
neuron with particle swarm optimization. In: Proceedings of the International Joint
Conference on Neural Networks, IJCNN 2006, Part of the IEEE World Congress
on Computational Intelligence, WCCI 2006, Vancouver, BC, Canada, 16–21 July
2006, pp. 5088–5095. IEEE (2006)
14. Garro, B.A., Rodrı́guez, K., Vázquez, R.A.: Generalized neurons and its application
in DNA microarray classification. In: IEEE Congress on Evolutionary Computa-
tion, CEC 2016, Vancouver, BC, Canada, 24–29 July 2016, pp. 3110–3115. IEEE
(2016)
15. George, G., Raimond, K.: A survey on optimization algorithms for optimizing
the numerical functions. Int. J. Comput. Appl. 61(6), 41–46 (2013)
16. Karaboga, D.: An idea based on honey bee swarm for numerical optimization.
Technical report, Computer Engineering Department, Engineering Faculty, Erciyes
University (2005)
17. Storn, R., Price, K.: Differential evolution - a simple and efficient adaptive scheme
for global optimization over continuous spaces. Technical report (1995)
18. Golub, T.R., et al.: Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
19. Sahu, B., Mishra, D.: A novel feature selection algorithm using particle swarm opti-
mization for cancer microarray data. Proc. Eng. 38(0), 27–31 (2012). International
Conference on Modelling Optimization and Computing
20. Wang, A., An, N., Chen, G., Li, L., Alterovitz, G.: Improving PLSRFE based gene
selection for microarray data classification. Comput. Biol. Med. 62, 14–24 (2015)
21. Alshamlan, H.M., Badr, G.H., Alohali, Y.A.: Genetic bee colony (GBC) algorithm:
a new gene selection method for microarray cancer classification. Comput. Biol.
Chem. 56, 49–60 (2015)
Classification - Learning
A Hybrid Approach for the Fighting Game
AI Challenge: Balancing Case Analysis
and Monte Carlo Tree Search for the Ultimate
Performance in Unknown Environment
1 Introduction
Advances in computing have given rise to computer game complexity, to the extent
that human ability is no longer sufficient to handle the vast number of game states.
Human gamers will soon be dominated by computer programs, owing to the emergence of
powerful techniques such as Monte Carlo Tree Search [1] and Machine Learning [2], with
which programs can learn solutions with little human intervention. Aided by
modern computing power, these algorithms have demonstrated a capacity beyond that of
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 139–150, 2019.
https://doi.org/10.1007/978-3-030-20257-6_12
140 L. G. Thuan et al.
human experts, with numerous examples such as Deep Blue [3] or AlphaGo [4]. But
does that imply that human guidance is completely irrelevant in this day and age?
The success of programs like Deep Blue and AlphaGo was partially thanks to the
nature of their games: chess and Go are turn-based, so modern computers are given
sufficient time to process. However, in a real-time environment where programs must
respond reasonably well within a short instant of time, current processing power
proves insufficient, since modern game states can grow exponentially while processors
come increasingly closer to their physical limit [5].
This paper presents our submission to the 2018 Fighting Game AI Challenge [6], a
challenging AI competition that aims at solving the general real-time fighting problem.
Our research focuses on the design of an algorithm that can handle the vast number of
game states in a reasonable way given average computational power. Our algorithm
is a blend of a generic case-analysis approach and the winning MCTS algorithm.
Our results show that by blending some human wisdom with MCTS, the resulting
performance is superior to all existing solutions.
2.1 Overview
The Fighting Game AI Challenge was initiated by the Intelligent Computer Entertainment
Lab of Ritsumeikan University to promote AI research on general fighting
games, a classical class of games in which players compete against each other, using
techniques resembling those of martial arts, until only one remains. Since its start
in 2013, the competition has been well received by scholars around the world and has
achieved a high reputation among the most prestigious academic conferences worldwide,
including the IEEE Conference on Computational Intelligence and Games (CIG). CIG
2018 marked another successful milestone of the challenge with the emergence of
numerous interesting solutions, among which was our submission, the first that could
perform well and stably in the most challenging LUD division in all competition
categories.
MCTS is, unfortunately, not without its limitations. No prior training implies the
need for self-learning during playtime, which is not favorable in a real-time
environment. Although the majority of the submissions were dominated by MCTS, only
those that managed to reduce its time-consuming nature could become victors, the
most successful example being Eita Aoki [7]. By discovering a winning heuristic, he
reduced the use of MCTS to the minimum possible and became the three-time consecutive
winner of this challenge. Nonetheless, even his solution was unable to conquer the LUD
division, since his heuristic could not be formulated without prior information.
Our submission proposes the first stable winning solution to the challenging LUD
division. Although the lack of a detailed case analysis kept us in second place
in the first two divisions, our contribution is still of paramount importance, since
LUD is, among the three, the division closest to the general real-time fighting problem.
3 Previous Work
The first period, between 2013 and 2015, could be regarded as the Dark Ages in the
history of the competition, in that submissions were entirely populated with variants of
the if-else approach, with little to no sign of new directions. Regardless, there were
numerous interesting heuristics, as shown in Table 2, which are still relevant today.
The initial popularity of these if-else variants is understandable, since advanced
techniques require a certain level of design and implementation skill, and algorithm
design is too hard a field even for experienced experts, making such techniques
unpopular among participants, most of whom are students or young professionals.
Case analysis with heuristics is much easier to implement and hence was the only
technique in use in the first year of the competition. Subsequent years saw more
advanced techniques such as Reinforcement Learning or Fuzzy Logic, but their
implementations were still too simplistic to produce satisfactory results. Worse
still, they even lost against the naïve and much-simpler-to-implement if-else
approaches.
Careful analysis of the previous solutions leads us to the conclusion that a successful
solution is one that is based on MCTS but does not abuse it, as illustrated in the
following pseudocode:
Action findBestAction (GameState state) {
    Action action = intelligentSearch(state)
    if (action == null)
        action = MCTS(state)
    return action
}
Among these, we believe that startup is the most important period. The rationale
behind this decision is the concurrent nature of the problem, as illustrated in
Fig. 2.
As illustrated in Fig. 2, players perform their respective actions at the same time,
implying the possibility that one player can be put under attack even before he or she is
ready to act. Thus, the faster the attack can start, the less likely our player will end up in
a disadvantageous situation, which is why this factor should be considered a
priority when determining the potential of an action.
Resulting State of the Player Under Attack. Nonetheless, fast startup alone does not
guarantee an excellent action, since the opponent still has a chance to retaliate with the
next attack. A promising attack should be one that opens more opportunities for future
attacks. More specifically, a good attack should lead the opponent into a
vulnerable state in which it is difficult or even impossible for him or her to block the
next one, forming a series of consecutive moves, namely a combo [9]. In this particular
challenge, the best-known vulnerable state is the DOWN state, a state in which the
opponent is completely knocked out and unable to retaliate. Hence, actions that
can result in the opponent's DOWN state will be our first-class priority.
Preselection. A further improvement is the preselection of a minimal set of actions
based on their potential, for speed improvement during playtime. After that, an
additional reselection to further shrink the selected list is conducted based on
random game simulation, deciding which actions to choose according to the total
damage each one deals to the opponent during the random games. An illustration is
shown as follows.
A Hybrid Approach for the Fighting Game AI Challenge 145
Map<Action,Int> damageByAction =
simulateRandomGamesAndRecord(firstRound)
return secondRound
}
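A hedged Python reconstruction of the two-round preselection described above (the paper's own snippet is Java-like and truncated; all names here, `startup`, `energy_cost`, and the simulation hook, are hypothetical stand-ins):

```python
def preselect(actions, energy, max_startup, simulate_damage):
    """Round 1: keep affordable, fast-startup actions.
    Round 2: rank the survivors by total damage dealt in random games."""
    first_round = [a for a in actions
                   if a['energy_cost'] <= energy
                   and a['startup'] <= max_startup]
    damage = simulate_damage(first_round)     # {action name: total damage}
    return sorted(first_round,
                  key=lambda a: damage[a['name']], reverse=True)
```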
The threshold is set to an upper-bound value, which can be any value high enough
that, if it is met, we are certain to win regardless of what happens afterward. Since
such a value is independent of the action data, it can be chosen generically.
In addition to the above factors, each action will only be considered for selection if
the current amount of energy permits it. Actions that require too much energy are
filtered out beforehand. Last, but not least, a simple minimax algorithm is used to
select among those with the same priority, and each action will only be selected if it
can produce a positive impact on the opponent. The priority for selecting an action
goes first to those chosen in the preselection, followed by those with the largest
potential.
in which W is the average number of wins, Np is the number of visits of the direct
parent node, and Nc is the number of visits of the current node. c = √2 is the
exploration parameter used to balance exploitation (W) against exploration (the
term √(log Np / Nc)).
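The node-scoring rule above is the standard UCB1 formula; as a short sketch (our own illustration):

```python
import math

def ucb(avg_wins, n_parent, n_child, c=math.sqrt(2)):
    """UCB1: exploitation (avg_wins) plus the exploration bonus."""
    if n_child == 0:
        return float('inf')     # unvisited children are tried first
    return avg_wins + c * math.sqrt(math.log(n_parent) / n_child)
```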
In the context of our solution, each node is a game state. The root node is the
current game state and all others are virtual nodes created by game simulation.
Each new node is created from the current node when we try to perform a
different action. A tree branch goes deep until we can decide whether the current
state is a win or a loss, and then we recursively recompute all information, such as
the number of wins and the number of visits, from the current state up to the root.
The entire process is repeated until time runs out (1 frame = 1/60 s). The selected
action is the one that leads to the most visited direct child of the root. The complete
procedure can be summarized as follows.
    return findMostVisitedChild(rootNode).getAction()
}
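The summarized procedure can be sketched as a toy MCTS loop; the game model here (integer states, moves ±1, win at +3, loss at −3) is invented for illustration and stands in for the real fighting-game state:

```python
import math
import random
import time

class Node:
    def __init__(self, state, action=None, parent=None):
        self.state, self.action, self.parent = state, action, parent
        self.children, self.wins, self.visits = [], 0, 0

def mcts(root_state, actions, step, is_win, is_loss, budget_s=1 / 60):
    """Minimal MCTS sketch: select a leaf with UCB, expand one untried
    action, play a random game until a win/loss is decided, then
    back-propagate wins and visits up to the root. The budget mirrors
    the one-frame (1/60 s) time limit mentioned in the text."""
    root = Node(root_state)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        node = root
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(actions):
            node = max(node.children,
                       key=lambda n: n.wins / n.visits +
                       2 * math.sqrt(math.log2(node.visits) / n.visits))
        # Expansion: create a child for one untried action.
        tried = {c.action for c in node.children}
        for a in actions:
            if a not in tried:
                child = Node(step(node.state, a), a, node)
                node.children.append(child)
                node = child
                break
        # Simulation: random playout until the outcome is decided.
        s = node.state
        while not is_win(s) and not is_loss(s):
            s = step(s, random.choice(actions))
        won = is_win(s)
        # Back-propagation from the new node up to the root.
        while node is not None:
            node.visits += 1
            node.wins += won
            node = node.parent
    # Pick the action leading to the most visited direct child.
    return max(root.children, key=lambda n: n.visits).action
```

On the toy game, the loop runs thousands of playouts within a frame and returns the action whose child was visited most often.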
5 Evaluation
5.1 Performance
The first priority in evaluating a solution is naturally its performance. Figure 3 illustrates the improvement of our solution before and after mixing in our case analysis approach. The table on the left of Fig. 3 is our mid-term result against 5 other competitors, in which we had not yet introduced the intelligent search. Although we did not have high expectations, it was still shocking to be at the bottom of the ranking in the LUD
Fig. 3. Performance before and after applying our case analysis approach.
division at the beginning and, even worse, to be knocked out by every other participant in every single game. However painful it was, it served us well as a crucial baseline to highlight the effect of our improvement, shown in the second table: our final result against 8 other participants, in which we climbed straight from the bottom to the top. Being at the top of the ranking in both the STANDARD and SPEEDRUNNING leagues demonstrates our performance.
5.2 Stability
The next key factor in evaluating our solution is stability, which is illustrated in Table 3, showing the instability among other participants.
As can be seen from Table 3, most participants perform either too unstably or too badly. Since SampleMctsAi belongs to the organizers, it is left out of this count, implying that ours is the only stable solution that performed well (first place in all divisions).
other participants all use Java, a much more mature language. In our submission, we chose Kotlin for its expressiveness and elegance, but unexpectedly the Kotlin compiler is still too young to generate bytecode of the same quality as the Java compiler. In our experiment after the contest, we translated the Java programs into equivalent Kotlin code and were surprised to see that the Kotlin version lost every single match despite the algorithm in use being the same. Therefore, even though our MCTS implementation is of higher quality compared to the others, this is invisible in the competition results.
Does our optimization really work? In our experiment, we compared MCTS programs with and without the optimization. With our optimization, the program was still on the losing side but could win approximately 1 out of every 3 matches, instead of losing every single time like the program without optimization. Regardless, this result demonstrates that without our case analysis we would have lost in the LUD division as well, and that our intelligent search is truly effective.
6 Conclusion
In this study, we have contributed a promising solution for the general fighting problem via our submission to the Fighting Game AI Challenge 2018. Our solution is a combination of case analysis and Monte Carlo Tree Search that produces stable and remarkable results in the LUD division, the division closest to a general real-time fighting game. Unfortunately, we were unable to make a significant optimization in the MCTS and still rely on human wisdom to generate the heuristics for the intelligent search, which, at the same time, opens a new direction for our future research.
One possible direction for the future is to automate the generation of generic heuristics; the second is to further optimize MCTS for solving this challenge. One promising possibility for the former is the use of Deep Learning and Neural Networks to learn and formalize the heuristics before the competition, but the challenge of such approaches is to construct a model that can ensure the generality of the resulting heuristics, meaning that they must be independent of actual values. An approach to the latter is to introduce memorization into MCTS as illustrated in the AlphaGo Zero paper [13], which may significantly improve both the quality and performance of the search tree. The downside is that such an approach may be impractical, since the performance of the Google machines running AlphaGo Zero still largely outstrips that of normal computers. Regardless, these are all promising directions in which research towards real-time fighting games can evolve.
References
1. Chaslot, G., Bakkes, S., Szita, I., Spronck, P.: Monte-Carlo tree search: a new framework for game AI. In: AIIDE. The AAAI Press (2008)
2. Nork, B., Lengert, G.D., Litschel, R.U., Ahmad, N., Lam, G.T., Logofătu, D.: Machine
learning with the pong game: a case study. In: Pimenidis, E., Jayne, C. (eds.) EANN 2018.
CCIS, vol. 893, pp. 106–117. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-
98204-5_9
3. Campbell, M., Hoane, A.J., Hsu, F.: Deep Blue. Artif. Intell. 134(1–2), 57–83 (2002)
4. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search.
Nature 529(7587), 484–489 (2016)
5. Are processors pushing up against the limits of physics? https://arstechnica.com/science/2014/08/are-processors-pushing-up-against-the-limits-of-physics. Accessed 21 Feb 2019
6. Fighting Game AI Competition. http://www.ice.ci.ritsumei.ac.jp/~ftgaic/index-1.html. Accessed 21 Feb 2019
7. Fighting Game AI Competition. https://www.slideshare.net/ftgaic/2018-fighting-game-ai-competition?ref=http://www.ice.ci.ritsumei.ac.jp/~ftgaic/index-R.html. Accessed 21 Feb 2019
8. Kim, M.J., Ahn, C.W.: Hybrid fighting game AI using a genetic algorithm and Monte Carlo
tree search. In: Proceedings of the Genetic and Evolutionary Computation Conference
Companion, GECCO 2018, Kyoto, Japan (2018)
9. Zuin, G., Macedo, Y., Chaimowicz, L., Pappa, G.: Discovering combos in fighting games
with evolutionary algorithms. In: GECCO 2016, Denver, CO, USA, pp. 277–284 (2016)
10. James, S., Konidaris, G., Rosman, B.: An analysis of Monte Carlo tree search. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017) (2017)
11. Heineman, G., Pollice, G., Selkow, S.: Algorithms in a Nutshell. O’Reilly Media, Sebastopol
(2016)
12. Jangid, M.: Kotlin – the unrivaled android programming language lineage. Imperial J. Inter-
disc. Res. 3, 256–259 (2017)
13. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550, 354
(2017)
A Probabilistic Graph-Based Method
to Improve Recommender System Accuracy
Abstract. The last two decades have seen a surge of data on the Web, which overwhelms users with huge amounts of information. Recommender systems (RSs) help users to efficiently find desirable items among a pool of items. RSs often rely on collaborative filtering (CF), where the history of transactions is analyzed in order to recommend items. High accuracy and low time and implementation complexity are the most important factors for evaluating the performance of algorithms, and current methods fall short in all or some of them. In this paper, a probabilistic graph-based recommender system (PGB) is proposed based on graph theory and Markov chains, with improved accuracy and low complexity. In the proposed method, the selection of each item for recommendation is conditioned on the items recommended in the previous steps. This approach uses a probabilistic model to consider the items that are likely to be preferred by users in the future. Experimental results on two real-world datasets, Movielens and Jester, demonstrate that the proposed method significantly outperforms several traditional and state-of-the-art recommender systems.
1 Introduction
Today, one of the major problems of online shops is that users are often confused when deciding what to choose among a huge number of items. RSs have been proposed to help users find the most suitable items according to their preferences [1]. Generally, RSs are classified into content-based (CB) methods, collaborative filtering (CF), and hybrid methods. In CB approaches, the system recommends items based on available content information about the users and items [2]. CF is a widely used approach in RSs that focuses on similarity values between users (or items). CF uses an information filtering technique based on a user's previous rating/purchase history to offer items aligned with the taste of the target user [3]. Hybrid RSs combine the CF and CB approaches to obtain improved performance [4]. Existing research papers in the field of RSs have mainly considered the movie recommendation topic [5, 6]. There is also a rich literature on other topics, such as e-commerce [7], books [8], documents [9], music [10], television programs [11], applications in markets [12], e-learning [13], and Web search [14]. There are various metrics in the literature for evaluating the performance of RS algorithms [15].
Accuracy is one of the most important evaluation metrics, and most studies of RS evaluation criteria have focused on it [16]. Evaluation of RSs can be performed in an offline or online manner [17]. In an offline analysis, part of the ratings in the dataset is hidden from the recommender algorithm as a test set, and the RS algorithm uses the rest of the data (training set) to predict new ratings or rankings for unseen items. Offline evaluation methods are fast, but we cannot elicit the real taste of users regarding the recommended items. In contrast, online evaluations are conducted in a live environment by observing users' behavior and tracking their actions [18]. In addition, compared with offline methods, conducting an online evaluation, if possible at all, is more costly and time consuming, which leads most research works to offline evaluation. Over the last decade, many research studies have focused on proposing new approaches to improve the performance of recommenders. Although attention to existing evaluation measures is important for a good RS, to implement an RS in the real world we have to take into account considerations such as simplicity of implementation and reasonable run time.
In this paper we propose an accurate probabilistic recommendation system based on graph theory that generates accurate recommendations with reasonable run time in comparison with several traditional and state-of-the-art algorithms. In its first step, the PGB algorithm transforms ratings into like and dislike, which lets us apply the ratings to a traditional Markov model and simplifies the calculations. PGB also lets us model the history of the ratings in a comprehensive and compact graph (the system-state graph) using the Markov-model idea, which allows us to recommend items without referring to the raw ratings anymore; PGB generates the system-state graph only once at the start of the algorithm, and then uses it to recommend items. In the next step, we make decisions based on the system-state graph and traverse it for each user to find the most probable items that he/she will like in the future. One of the most significant advantages of PGB is its flexibility with respect to dataset updates. When new ratings are added, the system-state graph can be modified simply by updating the respective weights and does not need to be rebuilt from scratch; thus, the proposed method is suitable for systems that change a lot within a short period of time.
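The incremental update can be sketched as follows, assuming a simple co-occurrence count as the edge weight (the paper's exact weight definition may differ):

```python
from collections import defaultdict

# System-state graph as an adjacency map: graph[a][b] = co-rating weight.
graph = defaultdict(lambda: defaultdict(int))

def add_like(liked_so_far, new_item):
    """Incremental update sketch: when a user likes new_item, only the
    edge weights between new_item and that user's previously liked
    items change; the rest of the graph is left untouched, so no
    rebuild from scratch is needed."""
    for item in liked_so_far:
        graph[item][new_item] += 1
        graph[new_item][item] += 1
    liked_so_far.add(new_item)

profile = {"i1", "i2"}
add_like(profile, "i3")
```

Each new rating touches only the edges incident to the rated item, which is what makes the method cheap under frequent updates.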
2 Related Works
The last two decades witnessed much attention to network science, where graph theory and data mining meet [19]. A number of applications have been proposed to model the behaviour of users with graph theory or related theories [20, 21]. Several works have applied Markov models in the context of RSs, where a Markov chain model makes recommendations based on previous actions. Rendle et al. proposed Factorized Personalized Markov Chains (FPMC), which combine Matrix Factorization and Markov Chains to model personalized sequential behavior [22]. Their method was improved by Cheng et al., who factorized the transition matrix into two latent, low-rank sub-matrices [23]. Shani et al. modelled the RS using Markov decision processes [24].
A Probabilistic Graph-Based Method to Improve RS Accuracy 153
In this section, we first explain the Markov model as the base idea for the proposed method, and then discuss the details of the proposed method. The proposed method includes two main steps: (i) transforming the ratings and creating the system-state graph using a Markov chain to model users' rating history, and (ii) applying a probabilistic model on the generated graph to recommend items.
TF(<i1, i2, ..., im>, <i1, i2, ..., im, im+1>) = N(<i1, i2, ..., im, im+1>) / N(<i1, i2, ..., im>),   (1)
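Eq. (1) can be sketched as a counting procedure; treating N(seq) as the number of user histories containing seq as a contiguous subsequence is our reading, not spelled out in this excerpt:

```python
def transition_frequency(histories, prefix, nxt):
    """Sketch of Eq. (1): the transition frequency from <i1..im> to
    <i1..im, im+1> is N(<i1..im, im+1>) / N(<i1..im>), where N(seq)
    counts the user histories containing seq as a contiguous
    subsequence."""
    def count(seq):
        seq = tuple(seq)
        return sum(
            1 for h in histories
            if any(tuple(h[k:k + len(seq)]) == seq
                   for k in range(len(h) - len(seq) + 1))
        )
    denom = count(prefix)
    return count(list(prefix) + [nxt]) / denom if denom else 0.0
```

For example, if two histories contain the prefix and one of them continues with the candidate item, the transition frequency is 0.5.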
In the next step, we extract two subgraphs, Gl and Gd, from G that only contain the relations between like nodes and between dislike nodes, respectively. Both Gl and Gd show the correlation of items from different aspects: Gl shows the similarity of two items based on the number of likes they received together, while Gd shows the similarity of items based on the dislikes they received together. For example, in Fig. 1, user u1 dislikes i1 and i2, and user u2 likes i1 and i2. This shows that i1 and i2 behave similarly, receiving likes or dislikes at the same time in the users' rating history.
Since Gl and Gd capture similar concepts, we merge them into a new graph that shows the correlation between all items based on the users' rating history; this graph is denoted by Gld. To merge these graphs, we unify si,l and si,d into a single node and aggregate their edges along with their weights. The resulting graph is the system-state graph, in which the items in the dataset form the nodes and their correlations form the edge weights. Ultimately, Gld is used for the recommendation purpose: in the next step, we traverse Gld to find items that are likely to be rated as like by the target user in the future. Figure 2 shows the process of generating Gl, Gd and Gld according to
the dataset in Fig. 1. There are some differences between our proposed method and the classic Markov model in creating the system-state graph: (i) we consider ratings in the system-state graph, (ii) we do not consider the sequence of ratings, and (iii) we ignore unimportant relations when extracting Gl and Gd from G, to keep the graph compact.
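The construction of Gld from Gl and Gd can be sketched as a merge of weighted adjacency maps (summing parallel edge weights is an assumption about the aggregation rule):

```python
def merge_graphs(g_like, g_dislike):
    """Sketch of building Gld: the like- and dislike-copies of each
    item are unified into a single node and the weights of parallel
    edges are summed."""
    g_ld = {}
    for g in (g_like, g_dislike):
        for a, neighbours in g.items():
            for b, w in neighbours.items():
                g_ld.setdefault(a, {})
                g_ld[a][b] = g_ld[a].get(b, 0) + w
    return g_ld
```

An edge present in both subgraphs thus ends up with the combined weight, reflecting that the two items co-occur in both like and dislike histories.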
3.3 Recommendation
If Slu represents the set of items that are rated as like by u, the aim is to find the k items with the strongest correlation to Slu in Gld and recommend them to u. The main idea is that if we generate recommendations based on the users' rating history, we can assume that the user is likely to like each recommendation. We then add the recommendation to Slu so that it counts as part of the rating history of the target user. In other words, each time we recommend a new item, we update Slu for the target user u with the items recommended up to that step. Suppose that Rec is a function that generates the recommendation list Ru = <r1, ..., rk>, where k is the number of recommendations; the mth recommendation rm for target user u is obtained as follows:
rm = Rec(S̄lu),   (2)

where S̄lu is the updated Slu, obtained from the union of Slu and the previously recommended items as follows:

S̄lu = Slu ∪ Ru^(m−1),   (3)

where Ru^(m−1) is the state of Ru after adding the (m−1)th recommendation. The pseudo-code of the above function is given in Algorithm 1.
156 N. Joorabloo et al.
To recommend item r1 to target user u, our Rec function finds the Slu items in Gld and then finds the connected nodes with the highest weights to them; we denote these items and their related weights by NIu and NWu, respectively. Since we may have repetitive nodes in NIu, Rec removes the duplicates in NIu and aggregates the related weights in NWu. The items present in NIu have the strongest correlation with the Slu items. In other words, the weights in NWu reflect the probability of the NIu and Slu items occurring together in the ratings history. We select the item with the highest weight in NIu as the first item for the recommendation. In fact, we treat all Slu items as a single node in the graph and select the node in Gld that has the highest correlation with Slu. Figure 3 shows how r1 and r2 are selected. Figure 3(a) shows Gld, where the purple nodes and the red line around them show Slu. In Fig. 3(b), the algorithm finds the connected nodes to the Slu items with the highest weights, depicted in red. In (c), i5, with the highest aggregated weight, is selected as the first recommendation and added to Slu to update it for the next round of recommendation. For recommending the next item, we assume that u likes r1, and then find r2 based on this assumption. This means that we select r2 only if r1 is preferred by the target user. To this end, we add r1 to Slu and repeat the same process with the updated Slu. Figure 3(c, d) shows the process of selecting r2. The proposed approach reveals hidden correlations between items in the system-state graph and helps find more neighbours for users who have few items in Slu, which ultimately leads to better precision on sparse datasets.
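The Rec procedure can be sketched as follows; for brevity this sketch aggregates weights over all neighbours of the liked items, whereas the paper restricts NIu to each item's strongest links, and the weight table below is invented to mimic the shape of the Fig. 3 example:

```python
def recommend(g_ld, liked, k):
    """Sketch of the Rec procedure: in each round, aggregate the edge
    weights from the current liked set (NWu over the neighbour set
    NIu), recommend the unseen item with the largest aggregated
    weight, and add it to the liked set before choosing the next
    item."""
    liked = set(liked)
    recs = []
    for _ in range(k):
        weights = {}
        for item in liked:
            for nbr, w in g_ld.get(item, {}).items():
                if nbr not in liked:
                    weights[nbr] = weights.get(nbr, 0) + w
        if not weights:
            break
        best = max(weights, key=weights.get)
        recs.append(best)
        liked.add(best)   # assume the user will like the recommendation
    return recs
```

With an i1–i3 profile strongly linked to i5, and i5 in turn linked to i7, the sketch recommends i5 first and then i7, mirroring the worked example in the text.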
Fig. 3. Steps to obtain a recommendation from the system-state graph. (a) Initial state of the target user in the ratings dataset. (b) Finding the items with the highest connected weight for each item in Slu. (c) Selecting i5 as the first recommended item and adding it to the initial set. (d) Repeating steps b and c based on the updated Slu and selecting i7 as the second recommendation. (Color figure online)
To make the approach clearer, imagine that Slu = <i1, i2, i3> for target user u. Gld and the process of selecting the first and second recommendation items are depicted in Fig. 3. The strongest connections of i1, i2 and i3 with other nodes in Gld are i5, i7 and i5, respectively. Figure 4 shows the process of creating NIu and NWu based on Gld. After aggregating the weights, the item with the highest weight is selected as the recommendation, which is i5 in this example.
Fig. 4. Creating NIu and NWu and getting the recommendation after aggregating the weights
For the top-k recommendation problem, we simply repeat the above process k times. Markov models often struggle with problems such as determining the size of the state and the sparsity of the dataset, which is often the case for real datasets. A Markov model only considers the previously rated items in the recommendation process and ignores valuable user rating information. Small chain numbers can create another problem for Markov models, as small chains cannot accurately represent the taste of users. The proposed method aims at solving these problems, and our experiments in the next section reveal its effectiveness against state-of-the-art recommendation methods.
4 Experimental Results
In this section we compare the proposed algorithm with a number of classical and state-of-the-art algorithms. Since our algorithm is a ranking method, we use evaluation metrics appropriate for this type of method.
Precision
Pu(N) is the precision of a list of items recommended to user u, defined as the percentage of items in the recommendation list that are relevant to user u. Items relevant to the target user are those rated as like by the user. The precision of a system with N users, P(N), is calculated as:

P(N) = (Σ_{u∈testSet} Pu(N)) / N   (4)
Recall
Recall is among the most frequently used metrics in the information retrieval field. The recall for a target user u, denoted Recallu(N), is the proportion of recommended relevant items to all relevant items, and the recall of a system with N users, Recall(N), is calculated as:

Recall(N) = (Σ_{u∈testSet} Recallu(N)) / N   (5)
where R is the recommendation list and rel_i indicates relevance at position i: it is one if a relevant item is recommended in the ith position of the recommended list, and zero if an irrelevant item is placed there. Normalized DCG (NDCG) is determined by calculating the DCG and dividing it by the ideal DCG (IDCG), in which the recommended items are perfectly ranked:

IDCG = 1 + Σ_{i=2}^{|Si|} 1 / log2(i + 1)   (7)
F1 Score
Since precision and recall are inversely correlated, both need to be considered when evaluating different algorithms. Because precision and recall depend on the number of recommended items, researchers often use the F1 score, a combination of precision and recall. F1 is calculated as follows:

F1 = (2 · P(N) · Recall(N)) / (P(N) + Recall(N))   (8)
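The per-user versions of these metrics can be sketched as follows (Eqs. (4), (5) and (8) then average the per-user values over the test set; the per-position DCG form matching the IDCG of Eq. (7) is assumed, since the DCG equation itself is not visible in this excerpt):

```python
import math

def rank_metrics(recommended, relevant):
    """Per-user sketch of precision, recall, NDCG and F1 for one
    recommendation list."""
    rel = [1 if item in relevant else 0 for item in recommended]
    precision = sum(rel) / len(recommended)
    recall = sum(rel) / len(relevant)
    # DCG with rel_i / log2(i + 1) per position i (position 1 -> rel_1),
    # and the IDCG of Eq. (7) for a perfectly ranked list.
    dcg = sum(r / math.log2(pos + 1) for pos, r in enumerate(rel, start=1))
    ideal = min(len(relevant), len(recommended))
    idcg = 1 + sum(1 / math.log2(i + 1) for i in range(2, ideal + 1))
    ndcg = dcg / idcg
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, ndcg, f1
```

For instance, recommending ["a", "b", "c", "d"] when {"a", "c", "e"} are relevant gives precision 0.5 and recall 2/3, with F1 between the two.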
4.2 Datasets
In this paper, we employ two well-known datasets, Movielens-100K and Jester, to evaluate the performance of our method. The density of the Jester dataset is about 10 times that of the Movielens dataset, which lets us compare the performance of the proposed algorithm on sparse and dense datasets. Movielens-100K is a movie dataset with 943 users, 1682 items and 100,000 ratings. Jester contains users' ratings of a set of jokes; in this work we use a sample of the original dataset with 3000 users, 100 jokes and 165,536 ratings. The ratings in Movielens and Jester are on scales of 1 to 5 and −10 to +10, respectively. For all benchmarks we use the same training and test sets, recommending 10 items. The threshold T used to transform the data into like and dislike is set to 2.5 for the Movielens dataset and 0 for the Jester dataset.
4.3 Results
To generate the results for comparison, we used the Librec library in Java. The proposed method is developed in Matlab and compared with AspectModel [28], BPoissMF [29], EALS [17], ListRankMF [30], RankSGD [31], RankALS [32], WBPR [33], CLIMF [34], UserKNN, BUCM [35], ItemKNN, IMULT [36], and GRAD [37]. Tables 1 and 2 report the performance of the algorithms in terms of different evaluation metrics on the Movielens and Jester datasets, respectively. The proposed algorithm performs better than the other algorithms in terms of the precision, recall, NDCG and F1 evaluation metrics on both datasets. While it is the fastest algorithm on Jester, it has the fourth fastest runtime on Movielens, where AspectModel and ListRankMF are the fastest and second fastest algorithms, respectively. The performance of the other algorithms differs across the datasets. While UserKNN is the second top performer on Movielens (after the proposed algorithm), in the other dataset,
Table 1. Performance of algorithms on Movielens dataset. The best result for each metric is
shown in boldface, while the second best result is shown in underlined boldface.
Precision Recall NDCG F1 Time(ms)
GBP 0.3478 0.1475 0.3949 0.207149 9500
AspectModel 0.228862 0.091245 0.247767 0.130473 3148
BPoissMF 0.019919 0.006309 0.015464 0.009583 16172
EALS 0.174187 0.07951 0.187264 0.109182 64966
ListRankMF 0.10122 0.047671 0.108673 0.064816 4282
RankSGD 0.261179 0.11386 0.292407 0.158586 13282
RankALS 0.175203 0.070117 0.189135 0.100153 572904
WBPR 0.14939 0.06305 0.152679 0.088674 104072
CLIMF 0.004065 0.00849 0.003427 0.001465 3696650
UserKNN 0.305691 0.129918 0.33227 0.182341 21259
BUCM 0.057927 0.020219 0.05347 0.029976 6368
ItemKNN 0.030081 0.012945 0.029488 0.0181 22388
IMULT 0.1842 0.0657 0.2068 0.096854 3402569
GRAD 0.0774 0.0429 0.0914 0.055203 19302
Table 2. Performance of algorithms on Jester dataset. The best result for each metric is shown
in boldface, while the second best result is shown in underlined boldface.
Precision Recall NDCG F1 Time(ms)
GBP 0.7331 0.7443 0.823 0.738658 11880
AspectModel 0.553032 0.527585 0.628423 0.540009 24522
BPoissMF 0.186442 0.174734 0.214315 0.180398 32608
EALS 0.238501 0.217747 0.265346 0.227652 47041
ListRankMF 0.36413 0.341146 0.433668 0.352264 15783
RankSGD 0.381922 0.366335 0.445621 0.373967 13709
RankALS 0.323398 0.309093 0.384412 0.316084 103244
WBPR 0.406522 0.375251 0.43215 0.390261 71296
CLIMF 0.115103 0.103977 0.085998 0.109257 1602445
UserKNN 0.446568 0.41683 0.482625 0.431187 26096
BUCM 0.307723 0.288875 0.343151 0.298002 14850
ItemKNN 0.136041 0.123463 0.127869 0.129447 36769
IMULT 0.562 0.4438 0.6301 0.495955 985245
GRAD 0.5133 0.4081 0.6318 0.454694 24834
5 Conclusion
References
1. Quan, T.K., Fuyuki, I., Shinichi, H.: Improving accuracy of recommender system by
clustering items based on stability of user similarity. In: CIMCA 2006. IEEE (2006)
2. Pazzani, M.J., Billsus, D.: Content-based recommendation systems. In: Brusilovsky, P.,
Kobsa, A., Nejdl, W. (eds.) The Adaptive Web. LNCS, vol. 4321, pp. 325–341. Springer,
Heidelberg (2007). https://doi.org/10.1007/978-3-540-72079-9_10
3. Bobadilla, J., et al.: Recommender systems survey. Knowl.-Based Syst. 46, 109–132 (2013)
4. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adap.
Inter. 12(4), 331–370 (2002)
5. Winoto, P., Tang, T.Y.: The role of user mood in movie recommendations. Expert Syst.
Appl. 37(8), 6086–6092 (2010)
6. Javari, A., Jalili, M.: A probabilistic model to resolve diversity–accuracy challenge of
recommendation systems. Knowl. Inf. Syst. 44(3), 609–627 (2015)
7. Castro-Schez, J.J., et al.: A highly adaptive recommender system based on fuzzy logic for
B2C e-commerce portals. Expert Syst. Appl. 38(3), 2441–2454 (2011)
8. Núñez-Valdéz, E.R., et al.: Implicit feedback techniques on recommender systems applied to
electronic books. Comput. Hum. Behav. 28(4), 1186–1193 (2012)
9. Porcel, C., et al.: A hybrid recommender system for the selective dissemination of research
resources in a technology transfer office. Inf. Sci. 184(1), 1–19 (2012)
10. Tan, S., et al.: Using rich social media information for music recommendation via
hypergraph model. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 7(1), 22
(2011)
11. Barragáns-Martínez, A.B., et al.: A hybrid content-based and item-based collaborative
filtering approach to recommend TV programs enhanced with singular value decomposition.
Inf. Sci. 180(22), 4290–4311 (2010)
12. Costa-Montenegro, E., Barragáns-Martínez, A.B., Rey-López, M.: Which App? A
recommender system of applications in markets: implementation of the service for
monitoring users’ interaction. Expert Syst. Appl. 39(10), 9367–9375 (2012)
13. Bobadilla, J., Serradilla, F., Hernando, A.: Collaborative filtering adapted to recommender
systems of e-learning. Knowl.-Based Syst. 22(4), 261–265 (2009)
14. McNally, K., et al.: A case study of collaboration and reputation in social web search. ACM
Trans. Intell. Syst. Technol. (TIST) 3(1), 4 (2011)
15. Jalili, M., et al.: Evaluating collaborative filtering recommender algorithms: a survey. IEEE
Access 6, 74003–74024 (2018)
16. Li, X., Wang, H., Yan, X.: Accurate recommendation based on opinion mining. In: Sun, H.,
Yang, C.-Y., Lin, C.-W., Pan, J.-S., Snasel, V., Abraham, A. (eds.) Genetic and
Evolutionary Computing. AISC, vol. 329, pp. 399–408. Springer, Cham (2015). https://
doi.org/10.1007/978-3-319-12286-1_41
17. He, X., et al.: Fast matrix factorization for online recommendation with implicit feedback.
In: Proceedings of the 39th International ACM SIGIR conference on Research and
Development in Information Retrieval. ACM (2016)
18. Santos, B.S., et al.: Integrating user studies into computer graphics-related courses. IEEE
Comput. Graph. Appl. 31(5), 14–17 (2011)
19. Deo, N.: Graph Theory with Applications to Engineering and Computer Science. Courier
Dover Publications, Mineola (2017)
20. Augustyniak, P., Ślusarczyk, G.: Graph-based representation of behavior in detection and
prediction of daily living activities. Comput. Biol. Med. 95, 261–270 (2018)
21. Mobasher, B.: Data mining for web personalization. In: Brusilovsky, P., Kobsa, A., Nejdl,
W. (eds.) The Adaptive Web. LNCS, vol. 4321, pp. 90–135. Springer, Heidelberg (2007).
https://doi.org/10.1007/978-3-540-72079-9_3
22. Rendle, S., Freudenthaler, C., Schmidt-Thieme, L.: Factorizing personalized Markov chains
for next-basket recommendation. In: Proceedings of the 19th International Conference on
World Wide Web. ACM (2010)
23. Cheng, C., et al.: Where you like to go next: successive point-of-interest recommendation.
In: Twenty-Third International Joint Conference on Artificial Intelligence (2013)
24. Shani, G., Heckerman, D., Brafman, R.I.: An MDP-based recommender system. J. Mach.
Learn. Res. 6(Sep), 1265–1295 (2005)
25. He, Q., et al.: Web query recommendation via sequential query prediction. In: 2009 IEEE
25th International Conference on Data Engineering. IEEE (2009)
26. Sahoo, N., Singh, P.V., Mukhopadhyay, T.: A hidden Markov model for collaborative
filtering. In: Management Information Systems Quarterly, Forthcoming (2010)
27. Yang, Q., et al.: Personalizing web page recommendation via collaborative filtering and topic-aware Markov model. In: 2010 IEEE International Conference on Data Mining. IEEE (2010)
28. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: IJCAI (1999)
29. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using Markov chain
Monte Carlo. In: Proceedings of the 25th International Conference on Machine Learning.
ACM (2008)
30. Shi, Y., Larson, M., Hanjalic, A.: List-wise learning to rank with matrix factorization for
collaborative filtering. In: Proceedings of the Fourth ACM Conference on Recommender
Systems. ACM (2010)
31. Töscher, A., Jahrer, M.: Collaborative filtering ensemble for ranking. J. Mach. Learn. Res.
W&CP 18, 61–74 (2012)
32. Takács, G., Tikk, D.: Alternating least squares for personalized ranking. In: Proceedings of
the Sixth ACM Conference on Recommender Systems. ACM (2012)
33. Gantner, Z., et al.: Personalized ranking for non-uniformly sampled items. In: Proceedings of
KDD Cup 2011 (2012)
34. Shi, Y., et al.: CLiMF: learning to maximize reciprocal rank with collaborative less-is-more
filtering. In: Proceedings of the Sixth ACM Conference on Recommender Systems. ACM
(2012)
35. Barbieri, N., et al.: Modeling item selection and relevance for accurate recommendations: a
Bayesian approach. In: Proceedings of the Fifth ACM Conference on Recommender
Systems. ACM (2011)
36. Ranjbar, M., et al.: An imputation-based matrix factorization method for improving accuracy
of collaborative filtering systems. Eng. Appl. Artif. Intell. 46, 58–66 (2015)
37. Lin, C.-J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput.
19(10), 2756–2779 (2007)
A Robust Deep Ensemble Classifier
for Figurative Language Detection
1 Introduction
2 Literature Review
Although all forms of FL have been studied independently by the Machine Learning community, none of the proposed systems has been tested on more than one problem. Related work on the impact of FL on sentence sentiment classification problems is usually categorized by subject: irony detection, sarcasm detection, and sentiment analysis of figurative language. Many researchers tend to treat sarcasm and irony as an identical phenomenon in their work, but we will investigate each subject separately.
3 http://www.anc.org/data/anc-second-release/frequency-data/.
3.1 BiLSTM
LSTM cell architectures tend to perform significantly better than regular neural
networks in exploiting sequential data. An ordered input vector
$x = (x_1, x_2, \ldots, x_n)$ is mapped to an output vector
$y = (y_1, y_2, \ldots, y_n)$ by repeatedly computing the hidden state vector
$h = (h_1, h_2, \ldots, h_n)$. Cell $i$ uses information from previous cells in
time, creating a context feature in the neural network architecture. However,
sequential data such as text, speech, or frame series often require knowledge of
both past and future context. Bidirectional LSTMs solve this problem by
summarizing both the past and the future context of each time sample. They
assign two hidden states to the same time sample, calculated in both directions,
in order to feed the output layer. The forward hidden sequence
$\overrightarrow{h}$ is calculated by reading the input from $x_1$ to $x_n$,
while the backward hidden sequence $\overleftarrow{h}$ reads the input from
$x_n$ to $x_1$. The total hidden state for time sample $i$ concatenates the
forward and backward hidden states, i.e.
$h_i = \overrightarrow{h}_i \parallel \overleftarrow{h}_i$. In our work, we
implement a deep two-layered bidirectional LSTM (BiLSTM) stacked with a dense
layer between them, as shown in Fig. 1.
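The bidirectional wiring described above can be sketched in a few lines of NumPy. Note that this is an illustrative sketch only: plain tanh recurrent cells with random, untrained weights stand in for LSTM cells, so it demonstrates the forward/backward concatenation $h_i = \overrightarrow{h}_i \parallel \overleftarrow{h}_i$ rather than the trained model.

```python
import numpy as np

def rnn_pass(x, W_x, W_h, b):
    """Run a simple tanh recurrent cell over a sequence x of shape (T, d_in)."""
    h, states = np.zeros(W_h.shape[0]), []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)                         # (T, d_hid)

def bidirectional(x, params_fwd, params_bwd):
    """Concatenate forward and backward hidden states per time step."""
    fwd = rnn_pass(x, *params_fwd)                  # reads x_1 .. x_n
    bwd = rnn_pass(x[::-1], *params_bwd)[::-1]      # reads x_n .. x_1, re-aligned
    return np.concatenate([fwd, bwd], axis=1)       # (T, 2 * d_hid)

rng = np.random.default_rng(0)
d_in, d_hid, T = 4, 8, 10
make_params = lambda: (0.1 * rng.normal(size=(d_hid, d_in)),
                       0.1 * rng.normal(size=(d_hid, d_hid)),
                       np.zeros(d_hid))
x = rng.normal(size=(T, d_in))
h = bidirectional(x, make_params(), make_params())
print(h.shape)  # (10, 16): each time step carries both directions
```

Stacking a second bidirectional layer, as the paper does, amounts to feeding `h` (one 2·d_hid vector per time step) through a dense layer and then into another bidirectional pass.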
168 R.-A. Potamias et al.
3.2 AttentionLSTM
$$r_t = \tanh(W_h h_t + b_t) \qquad (1)$$

$$a_t = \mathrm{softmax}(r_t) = \frac{e^{r_t}}{\sum_{j=0}^{T} e^{r_j}}, \qquad \sum_{t=1}^{n} a_t = 1 \qquad (2)$$

$$s = \sum_{t=0}^{T} a_t h_t \qquad (3)$$
where $W_h$ and $b_t$ are the LSTM model weights, optimized during training. As
FL detection often demands focus on the sentiment contrast between words, we
implement an architecture based on an attentive LSTM layer (AttentionLSTM)
as shown in Fig. 2. Finally, a dense softmax activation layer is applied to the s
feature vector representation of the tweet for the classification step.
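Equations (1)–(3) can be sketched in NumPy as below. For simplicity the per-step score $r_t$ is taken to be a scalar (one common formulation), and the weights are random rather than learned, so this only illustrates the mechanics of the attention layer:

```python
import numpy as np

def attention(H, w_h, b):
    """Soft attention over LSTM hidden states H of shape (T, d)."""
    r = np.tanh(H @ w_h + b)          # eq. (1): one score per time step
    e = np.exp(r - r.max())           # numerically stable softmax
    a = e / e.sum()                   # eq. (2): weights sum to 1
    s = (a[:, None] * H).sum(axis=0)  # eq. (3): weighted sum of states
    return a, s

rng = np.random.default_rng(1)
T, d = 6, 5
H = rng.normal(size=(T, d))
a, s = attention(H, rng.normal(size=d), 0.0)
print(a.sum(), s.shape)  # weights sum to 1; s is a single d-dim vector
```

A final dense softmax layer applied to `s` then yields the class prediction, as in Fig. 2.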
where $F1_i$ stands for the F1 score of classifier $i$. The final prediction is
made by combining the confidence scores of all three classifiers, each
multiplied by a scaling factor $w_i$:

$$o_j = \arg\max_{c} \sum_{i=1}^{3} w_i \cdot \vec{p}_i^{\,j}, \qquad \vec{p}_i^{\,j} \in \mathbb{R}^c \qquad (5)$$
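Equation (5) is a weighted soft-voting rule: scale each classifier's class-probability vector by its factor $w_i$, sum, and take the argmax. A small sketch with made-up confidence scores (in the paper the $w_i$ are derived from the classifiers' F1 scores):

```python
import numpy as np

def desc_combine(probs, weights):
    """Weighted combination of per-classifier class probabilities, eq. (5)."""
    combined = sum(w * p for w, p in zip(weights, probs))
    return combined.argmax(axis=1)

# toy confidences from three classifiers, 2 samples x 3 classes
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
p3 = np.array([[0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
pred = desc_combine([p1, p2, p3], weights=[0.5, 0.3, 0.2])
print(pred)  # [1 2]
```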
4 Experimental Setup
4.1 Datasets
Detecting FL, beyond its inherent difficulty as a problem, is a particularly
hard task since little benchmark data is available. To examine
DESC's robustness and reliability, we investigate three different Twitter-based
datasets. First, we detect ironic comments on Twitter using SemEval-2018's
'Detecting Irony in English Twitter' balanced dataset [12], consisting of 3834
training and 784 gold-standard test samples. In addition, we utilized an
imbalanced dataset of sarcastic tweets compiled by Riloff et al. [21]. Riloff's dataset
is a high-quality dataset, as indicated by its high Cohen's kappa score, and
consists of 2278 tweets, only 506 of which are sarcastic. Finally, to evaluate
our model's performance on sentiment analysis, we acquired a dataset containing
all forms of figurative language together with their sentiment polarity, as
proposed in SemEval-2015 Task 11 [8]. This dataset is also composed of
figurative tweets and contains 8000 training and 4000 test tweets overall. Each
tweet is ranked on an 11-point scale according to its sentiment polarity,
ranging from −5 (negative sentiment polarity, for tweets with critical and
ironic meanings) to +5 (positive sentiment polarity, for tweets with very
upbeat meanings).
– Readability Features (4): Finally, we claim that all FL forms tend to have
different readability scores than literal ones. Thus, we implement four
metrics measuring text readability using well-known readability scores. First,
we enumerate the words not present on Dale–Chall's list, and afterwards we
calculate three readability scores:

$$\text{Dale--Chall} = 0.1579 \times \left(\frac{\text{difficult words}}{\text{words}} \times 100\right) + 0.0496 \times \frac{\text{words}}{\text{sentences}} \qquad (7)$$

$$\text{Flesch} = 206.835 - 1.015 \times \frac{\text{total words}}{\text{total sentences}} - 84.6 \times \frac{\text{total syllables}}{\text{total words}} \qquad (8)$$

$$\text{Gunning Fog} = 0.4 \times \left(\frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{complex words}}{\text{words}}\right) \qquad (9)$$
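Equations (7)–(9) translate directly into code. The sketch below takes the raw counts as arguments; how words are tokenized, and how difficult-word, complex-word, and syllable counts are obtained, is left out since those details are not specified here:

```python
def dale_chall(difficult_words, words, sentences):
    """Eq. (7): Dale-Chall readability score."""
    return 0.1579 * (difficult_words / words * 100) + 0.0496 * (words / sentences)

def flesch(total_words, total_sentences, total_syllables):
    """Eq. (8): Flesch reading-ease score."""
    return (206.835 - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

def gunning_fog(words, sentences, complex_words):
    """Eq. (9): Gunning Fog index."""
    return 0.4 * (words / sentences + 100 * complex_words / words)

# e.g. a 100-word text with 8 sentences, 140 syllables, 12 hard words
print(round(flesch(100, 8, 140), 2))   # 75.71
print(gunning_fog(100, 8, 12))         # 9.8
```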
5 Experimental Results
We compared the proposed DESC method against a variety of classifiers and
different feature combinations. We experimented with different combinations of
Tf-Idf and the features presented in Sect. 4.2, using a deep neural network
architecture and also an SVM, the latter frequently used by teams participating
in SemEval. As illustrated in Table 1, DESC outperforms all baseline classifiers
with consistent (across all metrics) performance figures, which is indicative of
the robustness of the proposed approach. In addition, we can observe the
stability that DESC demonstrates on all datasets, including imbalanced and
multiclass classification problems. The abbreviations used in the table for the
feature sets are: unigrams and bigrams together with the Feature Vector
(FeatVec) of Sect. 4.2 (2inp), unigrams only (uni), Tf-Idf (Tfidf), the feature
set of Sect. 4.2 alone (FeatVec), and the concatenation of all features (All).
Table 1. Comparison with baseline classifiers (various set-ups of DNN, SVM,
AttLSTM, and BiLSTM) for the tasks of (a) irony and sarcasm detection (binary),
and (b) sentiment polarity detection (eleven ordered values, −5 for "very
negative" to 5 for "very positive") - bold figures indicate superior performance.
System          (a) Irony: SemEval-2018 Task 3.A [12]  (a) Sarcasm: Riloff [21]             (a) Average                          (b) Sentiment: SemEval-2015 Task 11 [8]
                Acc   Pre   Rec   F1    AUC            Acc   Pre   Rec   F1    AUC          Acc   Pre   Rec   F1    AUC          COS     MSE
DNN-2inp        0.63  0.64  0.62  0.63  0.65           0.82  0.78  0.81  0.79  0.84         0.73  0.71  0.72  0.71  0.75         0.602   4.230
DNN-Tfidf       0.65  0.65  0.63  0.64  0.70           0.83  0.81  0.83  0.80  0.72         0.74  0.73  0.73  0.72  0.71         0.710   3.170
DNN-uni         0.65  0.68  0.65  0.65  0.72           0.79  0.78  0.81  0.79  0.74         0.72  0.73  0.73  0.72  0.73         0.690   8.430
DNN-All         0.66  0.69  0.66  0.67  0.75           0.83  0.81  0.83  0.81  0.82         0.75  0.75  0.75  0.74  0.79         0.789   2.790
DNN-FeatVec     0.64  0.65  0.65  0.65  0.71           0.81  0.81  0.83  0.80  0.72         0.73  0.73  0.74  0.73  0.71         0.680   3.230
SVM-Tfidf       0.65  0.68  0.65  0.66  0.70           0.82  0.80  0.82  0.80  0.80         0.74  0.74  0.74  0.73  0.75         0.720   2.890
SVM-FeatVec     0.59  0.59  0.59  0.59  0.60           0.82  0.73  0.81  0.75  0.76         0.71  0.66  0.70  0.67  0.68         0.700   3.390
SVM-All         0.66  0.69  0.66  0.67  0.75           0.83  0.81  0.83  0.81  0.81         0.75  0.75  0.75  0.74  0.78         0.723   2.810
AttentionLSTM   0.71  0.70  0.71  0.70  0.75           0.85  0.83  0.85  0.83  0.84         0.78  0.77  0.78  0.77  0.80         0.749   2.860
BiLSTM          0.71  0.71  0.71  0.70  0.76           0.85  0.85  0.85  0.85  0.85         0.78  0.78  0.78  0.78  0.81         0.704   3.220
DESC            0.74  0.73  0.73  0.73  0.78           0.87  0.87  0.86  0.87  0.86         0.81  0.80  0.80  0.80  0.82         0.820   2.480
We tested DESC against the performance scores of all models submitted and
published in the SemEval-2015 [8] sentiment analysis task. DESC achieves 0.82 in
cosine similarity, whereas the winning team [26] obtained 0.758; it was also
ranked in 4th position regarding the MSE measure. At the same time, the ClaC and
UPF teams are the only ones to obtain a better MSE value than DESC, as
illustrated in Table 4. Further evidence of the robustness of DESC across all
metrics on both irony [12] and sarcasm [21] detection is given in Tables 2 and 3.
Specifically, DESC outperforms all submissions on [12] with respect to the F1
measure, which is indicative of the balance achieved between precision and
recall, and which also satisfies the desired property of few false positives in
automated tweet classification. In addition, DESC's performance is substantially
improved compared with Riloff's initial proposal [21] and is very close to Ghosh
and Veale [9].
Finally, by combining all three classifiers (AttentionLSTM, BiLSTM, and DNN-All)
in an ensemble model, we can detect irony with increased confidence, as
illustrated in Fig. 6.

Fig. 6. ROC-AUC curve for three ensemble classifiers on SemEval-2018's irony
detection task.
References
1. Nielsen, F.A.: A new ANEW: evaluation of a word list for sentiment analysis in
microblogs. arXiv e-prints, March 2011
2. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: An Enhanced Lexical
Resource for Sentiment Analysis and Opinion Mining, vol. 10, January 2010
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. CoRR abs/1409.0473 (2014)
4. Barbieri, F., Saggion, H.: Modelling Irony in Twitter. In: EACL (2014)
5. Buschmeier, K., Cimiano, P., Klinger, R.: An Impact Analysis of Features in a
Classification Approach to Irony Detection in Product Reviews, January 2014.
https://doi.org/10.3115/v1/W14-2608
6. Carvalho, P., Sarmento, L., Silva, M., Oliveira, E.: Clues for detecting irony in
user-generated contents: oh... it's "so easy" ;-). In: Proceedings of the Inter-
national Conference on Information and Knowledge Management (2009)
7. Davidov, D., Tsur, O., Rappoport, A.: Semi-supervised recognition of sarcastic
sentences in Twitter and Amazon. In: Proceedings of the Fourteenth Conference
on Computational Natural Language Learning, CoNLL 2010, Stroudsburg, PA,
USA, pp. 107–116. Association for Computational Linguistics (2010)
8. Ghosh, A., et al.: SemEval-2015 Task 11: sentiment analysis of figurative
language in Twitter (2015)
9. Ghosh, A., Veale, T.: Fracking sarcasm using neural network. In: Proceedings of
the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and
Social Media Analysis, pp. 161–169 (2016)
10. Ghosh, D., Guo, W., Muresan, S.: Sarcastic or not: word embeddings to predict
the literal or sarcastic meaning of words. In: EMNLP (2015)
11. González-Ibáñez, R.I., Muresan, S., Wacholder, N.: Identifying sarcasm in Twitter:
a closer look. In: ACL (2011)
12. Hee, C.V., Lefever, E., Hoste, V.: SemEval-2018 task 3: irony detection in English
Tweets. In: SemEval@NAACL-HLT (2018)
13. Huang, Y.-H., Huang, H.-H., Chen, H.-H.: Irony detection with attentive recurrent
neural networks. In: Jose, J.M., et al. (eds.) ECIR 2017. LNCS, vol. 10193, pp.
534–540. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56608-5 45
14. Hutto, C.J., Gilbert, E.: VADER: a parsimonious rule-based model for sentiment
analysis of social media text. In: ICWSM (2014)
15. Kumar, L., Somani, A., Bhattacharyya, P.: “Having 2 hours to write a paper is
fun!”: Detecting Sarcasm in Numerical Portions of Text. arXiv e-prints, September
2017
16. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word
representations. In: Proceedings of the 2013 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, pp. 746–751 (2013)
17. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word repre-
sentation. In: EMNLP, vol. 14, pp. 1532–1543 (2014)
18. Rajadesingan, A., Zafarani, R., Liu, H.: Sarcasm detection on Twitter: a behavioral
modeling approach. In: WSDM (2015)
19. Reyes, A., Rosso, P., Buscaldi, D.: From humor recognition to irony detection: the
figurative language of social media. Data Knowl. Eng. 74, 1–12 (2012)
20. Reyes, A., Rosso, P., Veale, T.: A multidimensional approach for detecting irony
in Twitter. Lang. Resour. Eval. 47(1), 239–268 (2013)
21. Riloff, E., Qadir, A., Surve, P., De Silva, L., Gilbert, N., Huang, R.: Sarcasm as con-
trast between a positive sentiment and negative situation. In: EMNLP 2013–2013
Conference on Empirical Methods in Natural Language Processing, Proceedings
of the Conference, pp. 704–714. Association for Computational Linguistics (ACL)
(2013)
22. Rosenthal, S., Ritter, A., Nakov, P., Stoyanov, V.: SemEval-2014 Task 9: Sentiment
Analysis in Twitter, January 2014. https://doi.org/10.3115/v1/S14-2009
23. Staiano, J., Guerini, M.: DepecheMood: a Lexicon for Emotion Analysis from
Crowd-Annotated News. arXiv e-prints, May 2014
24. Gibbs, R.W.: On the psycholinguistics of sarcasm. J. Exp. Psychol.: Gen. 115(1),
3–15 (1986). https://doi.org/10.1037/0096-3445.115.1.3
25. Wallace, B.C., Choe, D.K., Charniak, E.: Sparse, contextually informed models
for irony detection: exploiting user communities, entities and sentiment. In: ACL-
IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Lin-
guistics (ACL), Proceedings of the Conference, vol. 1 (2015)
26. Özdemir, C., Bergler, S.: CLaC-SentiPipe: SemEval2015 subtasks 10 B, E, and task
11. In: Proceedings of the 9th International Workshop on Semantic Evaluation
(SemEval 2015), Denver, Colorado, pp. 479–485. Association for Computational
Linguistics, June 2015
Enhanced Feature Selection for Facial
Expression Recognition Systems
with Genetic Algorithms
Kennedy Chengeta(B)
Abstract. Humans use voice and facial expressions to convey their emo-
tional state. Key expressions include being happy, angry, sad, and neu-
tral, and facial expressions account for a third of non-verbal communication.
This study presents an efficient method to identify facial expressions in
images based on artificial neural networks enhanced by genetic algo-
rithms. We use Viola-Jones for facial detection and PCA, a statistical
method, for dimensionality reduction, and extract features with CS-LBP,
a variant of local binary patterns that halves the feature set by compar-
ing centre-symmetric pixels. The features are then optimally selected
using a genetic algorithm before classification with artificial neural net-
works. Since these emotions are natural reactions, feature selection and
edge detection on the images can increase accuracy and reduce the error
rate by removing unimportant information from the facial images.
The genetic algorithm (GA) chooses a subset of image features, yielding
a reduced-dimensional dataset. The study proposes the local binary pattern
variants central symmetric local directional pattern (CS-LDP) and central
symmetric LBP (CS-LBP), together with artificial neural networks aided by
genetic algorithms for feature selection. The study used the Japanese Female
Facial Expression (JAFFE) database. The approach outperformed other
traditional approaches, showing that with added feature selection and
optimization the processing time is reduced and accuracy improved.
1 Introduction
Emotion recognition from facial expressions is used in child therapy, the treat-
ment of patients with autism, marketing product responsiveness, security,
internet gaming, accounting, and educational activities [4,5,11,20,23]. The key
expressions include sadness, joy, neutrality, surprise, fear, and anger [2,18].
FER is also widely used in digital identification, surveillance, and remote
access control systems [5,6,14]. Measuring
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 176–187, 2019.
https://doi.org/10.1007/978-3-030-20257-6_15
Facial Expression Detection with Neural Networks and Genetic Algorithms 177
2 Literature Review
Facial expression recognition has been achieved using local feature extraction
algorithms and holistic algorithms such as LBP, LDP, and Gray Level Co-occurrence
Matrix (GLCM) algorithms [2,4,14,21,24]. Feature selection and reduction have
been applied using PCA as well as genetic algorithms, though fewer studies have
applied the latter to FER [22]. Preprocessing using gray-level transformations,
normalization, and histogram equalization has been widely researched
[17,20,22]. The classification of facial expressions using neural networks and
support vector machines has also been widely proven, though a combination of
the two has seen less application in the field [14,23]. This section reviews the
preprocessing and feature extraction stages of the FER process.
Normalization reduces light in overly bright areas of the facial image, as shown
in Eq. 1, where f(x, y) equals the intensity value of each pixel. Histogram
equalization improves the image's global contrast by adjusting the facial image
intensity via the cumulative distribution function [13,28]. Equalization maps
the image distribution from a source histogram to a new histogram with much
wider and more uniformly distributed intensity values: the gray-level image
pixels are transformed, by intensity value, into a uniformly distributed
histogram. Gray-scale normalization compensates for the effects of color and
light [22,28]. The hue image is given by
$$H = \cos^{-1}\left(\frac{\frac{1}{2}\left[(x-q)+(x-y)\right]}{\sqrt{(x-q)^2+(x-y)(q-y)}}\right) \qquad (2)$$
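The histogram equalization step described above, which maps gray levels through the normalized cumulative distribution function, can be sketched as follows (a minimal NumPy illustration, not the authors' implementation):

```python
import numpy as np

def hist_equalize(gray):
    """Remap 8-bit gray levels through the normalised CDF so the output
    intensities are spread (approximately) uniformly over 0-255."""
    img = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(img.ravel(), minlength=256)   # per-level counts
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())
    return np.round(cdf[img] * 255).astype(np.uint8)

out = hist_equalize([[10, 10], [200, 250]])
print(out)  # low levels are pushed apart; the brightest pixel maps to 255
```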
The Gamma Intensity Correction (GIC) transformer is used to adjust the overall
image brightness. RGB normalization scales every pixel by dividing each of the
three color components by their sum, to eliminate color effects [22,28]:

$$(x_{norm}, q_{norm}, y_{norm}) = \left(\frac{x}{x+q+y}, \frac{q}{x+q+y}, \frac{y}{x+q+y}\right) \qquad (3)$$
$$CS\text{-}LBP_{r,n} = \sum_{i=0}^{(n/2)-1} s\left(z_i - z_{i+(n/2)}\right) 2^i \qquad (5)$$
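Equation (5) compares only the n/2 centre-symmetric neighbour pairs, which is what halves the descriptor length relative to plain LBP. A sketch, assuming $s(\cdot)$ is the usual step function with a small threshold:

```python
import numpy as np

def cs_lbp_code(neighbors, threshold=0.0):
    """CS-LBP code for one pixel: compare opposite neighbours, eq. (5)."""
    z = np.asarray(neighbors, dtype=float)
    half = len(z) // 2
    bits = (z[:half] - z[half:] > threshold).astype(int)   # s(z_i - z_{i+n/2})
    return int((bits * (1 << np.arange(half))).sum())      # sum of bits * 2^i

# 8 neighbours sampled clockwise around a centre pixel
print(cs_lbp_code([52, 10, 43, 20, 30, 60, 17, 25]))  # 5 (bits 1,0,1,0)
```

Per-pixel codes are then pooled into a histogram over the image region to form the feature vector.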
The fitness function optimizes the objective function. Once the selected
population satisfies the fitness criterion, it undergoes mutation or crossover
to be selected for the next iteration [1,22,25]. This string is analogous to the
chromosome. The genetic algorithm ensures faster processing due to a reduced
feature and search space. The algorithm follows the flow below [26]:
180 K. Chengeta
Genetic Algorithm
1. Select the base population of facial expression images.
2. Calculate the fitness of each image in the image set population.
3. Choose the best facial expression images for reproduction.
4. Produce new individuals (offspring) through crossover and mutation.
5. Determine the fitness of the new features.
6. Replace the least fit features in the population with the new individuals.
7. Repeat the process until complete, based on time, fitness, and other factors.
Fig. 3. Genetic algorithm flow [25]
Fig. 4. Mutation and crossover [26]
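The loop above can be sketched as follows. This is an illustrative implementation under assumed hyperparameters (population size, generations, mutation rate) with a toy fitness function; the paper's actual fitness would be classifier performance on the selected features:

```python
import random

def ga_select(n_features, fitness, pop_size=20, gens=30, keep=0.28, seed=0):
    """Evolve feature subsets of fixed size, keeping roughly `keep` of the
    features (0.28 was the best ratio reported later in the paper)."""
    rng = random.Random(seed)
    k = max(1, int(keep * n_features))
    pop = [rng.sample(range(n_features), k) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)                # step 2: evaluate
        parents = pop[: pop_size // 2]                     # step 3: keep best
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            child = rng.sample(list(set(a) | set(b)), k)   # step 4: crossover
            new_f = rng.randrange(n_features)              # step 4: mutation
            if rng.random() < 0.2 and new_f not in child:
                child[rng.randrange(k)] = new_f
            children.append(child)
        pop = parents + children                           # steps 5-6: replace
    return max(pop, key=fitness)

# toy fitness: features 0-4 are informative, the rest are noise
best = ga_select(50, lambda m: sum(1 for f in m if f < 5))
print(sorted(f for f in best if f < 5))
```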
The middle layers use a tanh/sigmoid activation function, and the softmax top
layer is based on a 'LogisticRegression' class. Multilayer perceptrons are based
on adjusting weights and biases during training to reduce errors [1,6].
Backpropagation, on the other hand, links the weights and biases to the error
via the root mean squared error (RMSE). Deep Belief Network (DBN) algorithms
have been used in facial expression recognition through a 2-layer DBN
architecture based on a stack of restricted Boltzmann machines (RBMs) [1,6].
3 Implementation
The facial expression recognition system includes detection of the facial
images, preprocessing, feature extraction, feature selection using a genetic
algorithm, and finally classification of the feature vector histograms to
generate the emotions happy, sad, angry, surprise, and fear. The implementation
pipeline, shown in the next diagram, runs from facial detection, through
preprocessing, feature extraction, and selection or reduction, to classification.
Facial Expression Detection. For our training images, we use the Viola-Jones
Haar cascade method implemented in OpenCV:

    import cv2

    face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    eye_cascade = cv2.CascadeClassifier('haarcascade_eye.xml')
The Viola-Jones mouth detector was used to detect mouth features, using the
mouth position relative to the nose and eyes. Single eye detection uses two
triangles to detect the left and right eyes. The nose detection tool is used
to detect the nose [7,13,17].
3.1 Pre-processing
Preprocessing involved applying histogram equalization to enhance image quality.
The images were also converted to gray scale. The effects of angle, distance,
and lighting were eliminated using gamma intensity correction and a logarithmic
transformation on the given images. The RGB-to-gray function is implemented as a
Python function. The color information is discarded to improve speed, with no
loss of accuracy. The images are transformed from RGB to gray-level space using
the following RGB-to-gray conversion [7,13,17]:

$$Y = 0.3R + 0.59G + 0.11B \qquad (11)$$
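Equation (11) in code, as a minimal NumPy sketch of the luminance-weighted conversion:

```python
import numpy as np

def rgb_to_gray(img):
    """Eq. (11): Y = 0.3 R + 0.59 G + 0.11 B, discarding colour information."""
    return img[..., 0] * 0.3 + img[..., 1] * 0.59 + img[..., 2] * 0.11

img = np.zeros((2, 2, 3))
img[..., 1] = 1.0                # a pure-green image
print(rgb_to_gray(img)[0, 0])    # 0.59
```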
The dataset used is the JAFFE dataset. This dataset has 213 images of 7 facial
expressions (six basic plus the neutral expression) taken from 10 models of
Japanese descent, and the emotions include the following:

    ['angry', 'fear', 'disgust', 'surprise', 'neutral', 'joy', 'sadness']
Feature extraction was done with LBP and central symmetric LBP [5]. These local
extractors were also compared with a holistic algorithm, namely GLCM. Feature
selection used the genetic algorithm to select the fittest feature vectors. The
other variant considered was the local directional pattern, whose histogram is
represented as below [9]:

$$LDP_h(\sigma) = \sum_{r=0}^{M} \sum_{c=0}^{N} f\left(LDP_k(r,c), \sigma\right) \qquad (12)$$

The central symmetric LBP is shown in diagrammatic form below [16,17] (Fig. 7).
Dimensional Reduction and Feature Selection are done using principal
component analysis (PCA) and genetic algorithms. For PCA, with mean m of the
training samples, the covariance matrix X based on the given training samples
is computed as [1,5,22]:

$$X = \sum_{k=1}^{n} (x_k - m)(x_k - m)^T \qquad (13)$$
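Equation (13) and the subsequent projection can be sketched in NumPy: the covariance matrix is a sum of outer products of the centred samples, and the top eigenvectors form the PCA basis:

```python
import numpy as np

def pca(samples, n_components):
    """PCA via the covariance matrix of eq. (13)."""
    m = samples.mean(axis=0)                  # mean m of the training samples
    centered = samples - m
    X = centered.T @ centered                 # sum_k (x_k - m)(x_k - m)^T
    vals, vecs = np.linalg.eigh(X)            # eigenvalues in ascending order
    basis = vecs[:, ::-1][:, :n_components]   # keep the top directions
    return centered @ basis, basis

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 10))
proj, basis = pca(data, 3)
print(proj.shape, basis.shape)  # (100, 3) (10, 3)
```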
The genetic algorithm selected feature vectors at ratios of 15%, 18%, 25%, 28%,
30%, and 35% of the total feature vectors, keeping the fittest ones. The best
accuracy came at a ratio of 0.28; hence, classification was done on the basis of
28% of the total feature vectors. For scenarios with PCA alone, 66.66% of the
top eigenvectors were selected.
3.4 Classification
Various classification methods were used, namely support vector machines, a
weighted voting classifier of support vector machines and neural networks.
Cross-validation was also applied to reduce overfitting [6,7,13,17]. The
classification was implemented using the Python scipy, jupyter, and numpy
packages on a Mac OS operating system.

$$eclf = VotingClassifier(clfs = [clf_a, clf_b],\ weights = [0.6, 0.4]) \qquad (14)$$

$$v^T m_q + k = 0; \qquad v^T m_q + k \ge 1 \text{ for } y_q = +1 \qquad (15)$$

The SVM classifier uses the RBF kernel for classification, and the Multilayer
Perceptron (MLP) was implemented with the following settings, combined in a
weighted voting classifier:

    MLPClassifier(activation='relu', alpha=0.0001,
                  batch_size='auto', beta_1=0.9,
                  beta_2=0.999, early_stopping=False, epsilon=1e-08,
                  hidden_layer_sizes=(30, 30, 30), learning_rate='constant',
                  learning_rate_init=0.001, max_iter=200, momentum=0.9,
                  nesterovs_momentum=True, power_t=0.5, random_state=None,
                  shuffle=True, solver='adam', tol=0.0001,
                  validation_fraction=0.1)

    eclf = VotingClassifier(estimators=[('mlp', clf1), ('svm', clf2)],
                            voting='hard', weights=[40, 60])
Table 1 shows the accuracy levels of the SVM, the MLP, and a weighted classifier
combining the support vector machine and the MLP neural network at a 60:40
ratio. The weighted classifier showed better classification accuracy than the
base algorithms alone. The CS-LBP and CS-LDP algorithms showed greater accuracy
than the base LBP and LDP algorithms. When combined with the genetic algorithm
for feature optimization and selection, the accuracy levels were higher than
with CS-LBP alone or with GLCM, a holistic algorithm. The results for the
genetic algorithm with CS-LBP were based on a 28% reduced feature selection and
recorded higher classification results than CS-LBP alone, due to the reduced
feature set and the removal of noisy features. The CS-LDP algorithm with the
genetic algorithm recorded the highest accuracy, 0.989, owing to its built-in
edge-filtering Kirsch detector. In the scenarios where genetic algorithms were
used, CS-LBP and CS-LDP recorded processing times 15–20% faster than traditional
LBP and CS-LBP alone.
5 Conclusion
The study proposed a hybrid approach of central symmetric local binary patterns
(a variant of local binary patterns), a genetic algorithm to enhance feature
selection, and a weighted classifier of artificial neural networks and support
vector machines. The different emotions measured, namely fear, disgust, anger,
happiness, and sadness, are recognized with better accuracy than with basic
local binary patterns or with support vector machines or neural networks alone
as classifiers. The databases trained on included the Japanese Female Facial
Expression (JAFFE) database. An accuracy of almost 97% was achieved on small
datasets, better than the traditional algorithms, as shown in Table 1. The use
of the genetic algorithm and central symmetric LBP, which reduce the feature set
by selecting the fittest feature vectors and by using symmetric pixel
differences respectively, improves the accuracy of facial expression recognition
and reduces processing time. The central symmetric local directional pattern
combined with the genetic algorithm also adds the advantage of filtering
unwanted edges through its Kirsch edge detector.
References
1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann
Publishers, San Francisco (2001)
2. Aung, M.S., et al.: The automatic detection of chronic pain-related expression:
requirements, challenges and the multimodal EmoPain dataset (2015)
3. Pavithra, P., Ganesh, A.B.: Detection of human facial behavioral expression using
image processing. ICTACT J. Image Video Process. 1, 162–165 (2011)
4. Nurzynska, K., Smolka, B.: Smiling and neutral facial display recognition with the
local binary patterns operator. J. Med. Imaging Health Inf. 5(6), 1374–1382 (2015)
5. Calder, A.J., Burton, A.M., Miller, P., Young, A.W., Akamatsu, S.: A principal
component analysis of facial expressions. Vis. Res. 41(9), 1179–1208 (2001)
6. Padgett, C., Cottrell, G.W.: Representing face images for emotion classification.
In: Advances in Neural Information Processing Systems, pp. 894–900 (1997)
7. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57,
137–154 (2004)
8. Zhao, X., Zhang, S.: Facial expression recognition based on local binary patterns
and kernel discriminant isomap. Sensors 11(10), 9573–9588 (2011)
9. Rivera, A.R., Castillo, R., Chae, O.: Local directional number pattern for face
analysis: face and expression recognition. IEEE Trans. Image Process. 22, 1740–
1752 (2013)
10. Alizadeh, S., Fazel, A.: Convolutional Neural Networks for Facial Expression
Recognition arXiv:1704.06756, June 2017
11. Chen, L., Xi, M.: Local binary pattern network: a deep learning approach for face
recognition. In: 2016 IEEE International Conference on Image Processing (ICIP)
(2016)
12. Uddin, M.Z., Khaksar, W., Torresen, J.: Facial expression recognition using salient
features and convolutional neural network. IEEE Access 5, 26146–26161 (2017).
https://doi.org/10.1109/ACCESS.2017.2777003
13. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57,
137–154 (2004)
14. Valenti, R., Sebe, N., Gevers, T.: Facial expression recognition: a fully integrated
approach. In: 14th International Conference of Image Analysis and Processing,
Modena, pp. 125–130 (2007). https://doi.org/10.1109/ICIAPW.2007.25
Facial Expression Detection with Neural Networks and Genetic Algorithms 187
15. Mattivi, R., Shao, L.: Human action recognition using LBP-TOP as sparse spatio-
temporal feature descriptor. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS,
vol. 5702, pp. 740–747. Springer, Heidelberg (2009). https://doi.org/10.1007/978-
3-642-03767-2 90
16. Ravi Kumar, Y.B., Ravi Kumar, C.N.: Local binary pattern: an improved LBP
to extract nonuniform LBP patterns with Gabor filter to increase the rate of face
similarity. In: ICCCIP, Mysore, pp. 1–5 (2016)
17. Pietikinen, M., Hadid, A., Zhao, G., Ahonen, T.: Computer Vision Using Local
Binary Patterns. Springer, London (2011). https://doi.org/10.1007/978-0-85729-
748-8
18. Ekman, P., Friesen, W.V.: The repertoire of nonverbal behavior: categories, origins,
usage, and coding. Semiotica 1, 49–98 (1969)
19. Rami, H., Hamri, M., Masmoudi, L.: Objects tracking in images sequence using
center-symmetric local binary pattern (CS-LBP). Int. J. Comput. Appl. Technol.
Res. 2(5), 504–508 (2013)
20. Nakashima, Y., Kuroki, Y.: SIFT feature point selection by using image segmen-
tation. In: International Symposium on Intelligent Signal Processing and Commu-
nication Systems, Xiamen, pp. 275–280 (2017)
21. Thakare, V.S., Patil, N.N.: Classification of texture using gray level co-occurrence
matrix and self-organizing map. 2014 International Conference on Electronic Sys-
tems. Signal Processing and Computing Technologies, pp. 350–355. IEEE, Wash-
ington (2014)
22. Theodoridis, S., Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 4th edn
(2008). ISBN 9781597492720, 9780080949123
23. Dikkers, H., Spaans, M., Datcu, D., Novak, M., Rothkrantz, L.: Facial recognition
system for driver vigilance monitoring. In: 2004 IEEE International Conference on
Systems, Man and Cybernetics, vol. 4, pp. 3787–3792 (2004)
24. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, Washington, DC, USA (2005)
25. Boubenna, H., Lee, D.: Feature selection for facial emotion recognition based on
genetic algorithm. In: 2016 12th International Conference on Natural Computation,
Fuzzy Systems and Knowledge Discovery, Changsha, pp. 511–517 (2016)
26. Satone, M., Kharate, G.: Feature selection using genetic algorithm for face recog-
nition based on PCA, wavelet and SVM. Int. J. Electr. Eng. Inf. 6(1), 39 (2014)
27. Nafchi, H.Z., Ayatollahi, S.M.: A set of criteria for face detection preprocessing.
Procedia Comput. Sci. 13, 162–170 (2012)
28. Han, H., Shan, S., Qing, L., Chen, X., Gao, W.: Lighting aware preprocessing
for face recognition across varying illumination. In: Daniilidis, K., Maragos, P.,
Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 308–321. Springer, Heidelberg
(2010). https://doi.org/10.1007/978-3-642-15552-9 23
Imaging Time-Series for NILM
1 Introduction
Energy demands have risen greatly in the past 40 years. More and more electrical
devices are becoming essential in a household: computers, tablets, cell phones,
etc. That, of course, means that more energy is being spent, so a need arises
for efficient monitoring of the energy being consumed. With the progress of
technology, it has become possible to monitor the energy consumption of a house
with the use of smart meters. The idea is to attach a smart meter to each
appliance of the house and have real-time information about its energy
consumption. The application of smart meters at the appliance level is still
quite costly, however, so other ways have to be explored.
A more cost-efficient way of monitoring the power consumption of a house would
be to install only one meter per household, which monitors the total energy
being consumed.

This work has been funded by the EΣΠA (2014–2020) Erevno-Dimiourgo-Kainotomo
2018/EPAnEK Program 'Energy Controlling Voice Enabled Intelligent Smart Home
Ecosystem', General Secretariat for Research and Technology, Ministry of
Education, Research and Religious Affairs.

© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 188–196, 2019.
https://doi.org/10.1007/978-3-030-20257-6_16

The real-world application of this is possible with the
help of algorithms that are able to decompose the total energy signal into the
device-level sub-signals that compose it. This is known as the energy
disaggregation task, and the field that studies it is Non-Intrusive Load
Monitoring (NILM).
The most popular machine learning approaches to NILM produce models whose
training input is high- or low-frequency data from the mains of a house and
whose labels are the true energy consumption values of a chosen appliance.
Usually, both the training data and the labels are chronologically arranged, and
one goal of the chosen model is to find correlations between the aggregated
signal and the appliance signal within a given time frame. In NILM, appliances
within a household often change and get replaced by others as time passes, so
there are many occasions where a model trained to classify an appliance in a
certain time frame of a house cannot predict it accurately on a different time
frame. That being said, it is worthwhile to investigate whether a transfer
learning technique can be used to classify the state (ON/OFF) of electrical
appliances given the aggregated power signal of a house.
Transfer learning [3,15] does not rely on the training and test data being in
the same feature space, or even having the same distribution, in contrast to
more classical machine learning techniques. It can be beneficial in problems
where there is a shortage of data, such that an accurate classifier or regressor
cannot be trained, or where there is a need for a more general model, one that
generalizes across different distributions. There are several ways transfer
learning can be achieved. In this work, we focus on feature representation: the
idea is to encode the existing knowledge into another feature space, e.g.,
encoding a time series as an image representation. To our knowledge, this
approach has not been previously explored in the existing literature concerning
NILM problems. It has, however, been applied to other time series classification
tasks and has been shown to yield good results.
In this study, low-frequency data (1 Hz) from the popular UK-Dale [9] and
REDD [11] datasets are transformed into images using Wang and Oates' [20]
time-series-to-image algorithm, which is based on Gramian Angular Field (GAF)
matrices; the feature space is then changed to vectors with the help of the
pretrained VGG16 Convolutional Neural Network image classifier [19]. The final
step is feeding those vectors, together with their respective device labels,
to a classification algorithm, such as a decision tree, and training it to
recognize whether the appliance is ON or OFF.
In the following sections, the feature transformation and transfer process is
analyzed, and a set of experiments for the appliance "fridge" of the UK-Dale
and REDD datasets is compared directly to previous implementations.
190 L. Kyrkou et al.
2 Related Work
Hart [6] was the first to work on this problem, using combinatorial techniques
to monitor changes in the appliance states of a household given the aggregated
signal. Since then, there have been many approaches to the problem. Factorial
Hidden Markov Model (FHMM) solutions [1,10,16,17] have been the leading
implementations over the past two decades, while deep learning Artificial
Neural Network (ANN) architectures have become popular in the last decade
[2,8,12,13,21]. Nalmpantis and Vrakas [14] in their review describe some of the
most important recent machine learning approaches to NILM, comparing them in
detail with both a qualitative and a quantitative analysis.
De Baets et al. [4] represent the V-I trajectory of appliances as images and
train a Siamese neural network from which a new feature space is derived. This
new representation is the input of the DBSCAN algorithm, which is ultimately
able to recognize unlabeled appliances in a household. They use the
high-frequency data from the PLAID [5] and WHITED [7] datasets. Their approach
appears successful at recognizing unknown appliances in a household.
Wang et al. [20] propose a novel way of transforming time series into images,
in order to test whether this new representation of the data improves
classification results. The images are constructed by transforming a time
series into its polar coordinate representation. For this, the time series is
first rescaled to the interval [−1, 1] or [0, 1]. The rescaled time series X̃
can then be represented in polar coordinates, by encoding the value of the
time series as the angular cosine and the time stamp as the radius. The
equations are defined as:
φ_i = arccos(x̃_i),  −1 ≤ x̃_i ≤ 1,  x̃_i ∈ X̃
r_i = t_i / N,  t_i ∈ ℕ                                                  (1)

GADF = [sin(φ_i − φ_j)] = √(I − X̃²)ᵀ · X̃ − X̃ᵀ · √(I − X̃²)              (3)
In their study, they test the above representation on multiple datasets. They
use a Tiled Convolutional Neural Network to extract features and a Denoising
Auto-Encoder for the classification task. The results are promising, as the
Mean Squared Error metric is reduced by up to 48% compared to approaches that
use raw data.
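The GADF of Eq. (3) can be sketched in a few lines of NumPy. Following Wang and Oates, with φ = arccos(x̃) each entry reduces to sin(φ_i − φ_j) = √(1 − x̃_i²)·x̃_j − x̃_i·√(1 − x̃_j²). This is an illustrative re-implementation, not the authors' code; the function name and the choice of the [−1, 1] rescaling are ours.

```python
import numpy as np

def gadf(series):
    """Gramian Angular Difference Field of a 1-D time series (a sketch
    of Eqs. (1) and (3); names are ours, not the paper's)."""
    x = np.asarray(series, dtype=float)
    # Rescale to [-1, 1] so that arccos is defined (Eq. 1).
    x_tilde = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    # GADF[i, j] = sin(phi_i - phi_j)
    #            = sqrt(1 - x_i^2) * x_j - x_i * sqrt(1 - x_j^2)  (Eq. 3)
    root = np.sqrt(np.clip(1 - x_tilde ** 2, 0.0, 1.0))
    return np.outer(root, x_tilde) - np.outer(x_tilde, root)

field = gadf(np.sin(np.linspace(0, np.pi, 64)))
assert field.shape == (64, 64)
```

The resulting matrix is antisymmetric with a zero diagonal, which is a quick sanity check on any implementation.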
3 Implementation
The process begins with transforming the aggregated mains energy signal of a
house into multiple vectors of length 64, which in turn become the input of
the GAF algorithm. In this study, only the GADF form is used. The vector size
corresponds to 6.4 min of data and is chosen so that the time frame is not
excessively large, but also not so small that producing the image files would
cause substantial delays on most machines. Moreover, the time frame cannot be
larger than 6.4 min because of computational restrictions. The nilmtk framework
for Python 2.7 is used to preprocess the data and construct the input files
for the GAF algorithm. The sampling rate for both the mains data and the
appliance data is set to 6 s. Once the data has been preprocessed, the GAF
algorithm generates the images. The output is a set of .png images of
100 × 100 pixels, smoothed with PAA. The images are also paired with
same-length label vectors for the appliance "fridge", for which the
experiments are conducted. The whole procedure can be seen in Fig. 1.
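The windowing step above can be sketched as follows. The paper preprocesses with nilmtk; here the splitting into length-64 windows (64 samples × 6 s = 6.4 min) and a per-window ON/OFF label are shown with plain NumPy. The mean-power threshold used to derive labels, and all names, are our assumptions for illustration.

```python
import numpy as np

def make_windows(mains, appliance, length=64, threshold=10.0):
    """Split the resampled (6 s) mains series into length-64 windows
    (6.4 min each) and derive an ON/OFF label per window from the
    appliance sub-meter (the watt threshold is our assumption)."""
    n = len(mains) // length
    windows = np.asarray(mains[: n * length], dtype=float).reshape(n, length)
    labels = (np.asarray(appliance[: n * length], dtype=float)
              .reshape(n, length).mean(axis=1) > threshold).astype(int)
    return windows, labels

mains = np.random.default_rng(0).uniform(0, 300, size=1000)
fridge = np.where(mains > 150, 90.0, 0.0)   # toy sub-meter signal
X, y = make_windows(mains, fridge)
assert X.shape == (15, 64) and y.shape == (15,)
```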
The next step transfers the data to another input space by using the VGG16
pretrained model in Keras (Python) to process the images produced by the GAF
algorithm in the previous step. VGG16 is an image classification Convolutional
Neural Network (CNN) trained on the 1000 classes of the ImageNet database [18]
and has proven very successful in image classification tasks. Its input is a
3D tensor representing the image. Because the model has already been trained,
there is no need to retrain it on our images. Instead, each image passes
through the network, and its output is the new data space we wanted: each
image is mapped to a 512-dimensional feature vector. After every image for a
selected house and time period has been processed, an array of dimensions
[number of images × 512] is constructed. This is the input for training the
classification algorithms tested in this study.
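The shape bookkeeping of this step can be sketched without the heavy dependency. In the real pipeline each GAF image goes through Keras's pretrained VGG16; below, a global-average pool over a (h, w, 512) convolutional feature map stands in for the network output (this stand-in, and all names, are our assumptions), stacking the per-image 512-d vectors into the [number of images × 512] array.

```python
import numpy as np

def pooled_features(conv_maps):
    """Stand-in for the VGG16 feature extractor: global-average-pool
    each (h, w, 512) feature map to a 512-d vector and stack the results
    into a [number of images x 512] array. In the real pipeline the maps
    come from the pretrained network; here they are random."""
    return np.stack([m.mean(axis=(0, 1)) for m in conv_maps])

rng = np.random.default_rng(1)
maps = [rng.standard_normal((7, 7, 512)) for _ in range(10)]
features = pooled_features(maps)
assert features.shape == (10, 512)   # input for the downstream classifiers
```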
3.3 Classification
3.5 Specifications
Fig. 2. The bar chart on the left shows the results for all models trained on
data from house 1 of UK-Dale and tested on data from the same house. The bar
chart on the right shows the results for the same models, tested on data from
the unseen houses 2, 4 and 5 of the UK-Dale dataset.
Fig. 3. The bar chart on the left shows the results for all models trained on
data from houses 1, 2 and 4 of UK-Dale and tested on data from house 5 of
UK-Dale. The bar chart on the right shows the results for models trained on
data from house 1 of UK-Dale and tested on the houses of the REDD dataset.
In this section, we discuss the experiments and results of our novel NILM
solution and compare it directly to previous work. The proposed method is
compared with some popular ANN architectures as implemented in previous work
by Krystalakos et al. [12]. The following networks were used for this task:
Gated Recurrent Units Network (GRU) [12], Recurrent Neural Network (RNN) [8],
Windowed GRU (WGRU) [12], Short-Sequence2Point (SS2P) [21], and Denoising
Auto-Encoder (DAE) [8]. The comparison between the aforementioned ANN
architectures and our approach is direct, meaning that the models are trained
and tested on exactly the same time periods and houses.
The ANN architectures used as a basis for the comparison are as follows. GRU
consists of two convolutional layers, followed by two bidirectional GRU layers
and two fully connected layers. RNN consists of one convolutional layer,
followed by two bidirectional LSTM layers and two fully connected layers. WGRU
consists of one convolutional layer and two bidirectional GRU layers, followed
by two fully connected layers; the last four layers have dropout between them.
SS2P consists of a total of seven layers: five convolutional layers followed
by two fully connected layers, with dropout between all layers. DAE consists
of a convolutional layer, three fully connected layers, and one convolutional
output layer, with dropout between all layers.
We train multiple models on the new feature space for the task of predicting
the ON/OFF state of the appliance "fridge". In the following subsections, the
results are presented, categorized by experiment type.
In the first category of experiments, all models are trained on house 1 of
the UK-Dale dataset from 1/4/2013 to 1/4/2014. Testing is also performed on
data from the same house, but over the time frame 1/4/2016 to 1/4/2017. As can
be seen in Fig. 2, the proposed approach gives results comparable to those of
the ANN models. This experiment uses the F1 metric and evaluates how well the
algorithm performs on data from the same house.
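The F1 metric used throughout these experiments is the harmonic mean of precision and recall over the ON class. A plain re-implementation (for illustration only, not the authors' evaluation code):

```python
def f1_score(y_true, y_pred):
    """F1 for binary ON(1)/OFF(0) predictions, as used to evaluate
    the fridge experiments."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

assert f1_score([1, 1, 0, 0], [1, 1, 0, 0]) == 1.0
```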
The second category of experiments uses the same trained models as above, but
testing occurs on different houses of the same dataset. These experiments test
how well the models generalize to unseen houses of the same dataset, again
using the F1 score. Our approach is on par with the comparative models for
houses 2 and 5; on house 5, the AdaBoost classifier with low-depth decision
trees (ABDTC) outdoes all of the ANN models. On house 4, our models are not as
accurate as the ANN models, which could be due to the fact that houses 1 and 4
differ greatly in their number of appliances: house 1 has a total of 54
appliances, while house 4 has only 6.
In the third category of experiments, the models are trained with data from
multiple houses of the UK-Dale dataset, namely houses 1, 2 and 4. From house 1,
only the time frame from 1/4/2013 to 1/4/2014 is used for training, as this
house has the most data of all and using all of it could cause a memory error.
This test evaluates whether a model can learn from multiple houses and predict
accurately on an unseen house. Some ANN models were not able to converge. Our
models have a significantly lower F1 score than the best ANN, which is the
Windowed GRU model.
The last category of experiments tests how well a model trained on UK-Dale
generalizes to the data of the REDD dataset. If a model performs well on the
unseen houses of a different dataset, it is safe to say that it has the
potential to be accurate on data regardless of the country of origin; note
that UK-Dale is a UK-based dataset while REDD is a US-based one. As shown in
Fig. 3, most models, whether ANNs or our different-input-space approaches,
perform well on most houses. Our approach is on par with the ANNs on most
houses, and on house 3 it outperforms them.
References
1. Aiad, M., Lee, P.H.: Non-intrusive load disaggregation with adaptive esti-
mations of devices main power effects and two-way interactions. Energy
Build. 130, 131–139 (2016). https://doi.org/10.1016/j.enbuild.2016.08.050.
http://www.sciencedirect.com/science/article/pii/S0378778816307472
2. Chen, K., Wang, Q., He, Z., Chen, K., Hu, J., He, J.: Convolutional sequence
to sequence non-intrusive load monitoring. J. Eng. 2018(17), 1860–1864 (2018).
https://doi.org/10.1049/joe.2018.8352
3. Dai, W., Chen, Y., Xue, G.R., Yang, Q., Yu, Y.: Translated learning: transfer learn-
ing across different feature spaces. In: Advances in Neural Information Processing
Systems, pp. 353–360 (2009)
4. De Baets, L., Develder, C., Dhaene, T., Deschrijver, D.: Detection of unidentified
appliances in non-intrusive load monitoring using siamese neural networks. Int. J.
Electr. Power Energy Syst. 104, 645–653 (2019)
5. Gao, J., Giri, S., Kara, E.C., Bergés, M.: PLAID: a public dataset of high-resolution
electrical appliance measurements for load identification research: demo abstract.
In: Proceedings of the 1st ACM Conference on Embedded Systems for Energy-
Efficient Buildings, pp. 198–199. ACM (2014)
6. Hart, G.W.: Nonintrusive appliance load monitoring. Proc. IEEE 80(12), 1870–
1891 (1992)
7. Kahl, M., Haq, A.U., Kriechbaumer, T., Jacobsen, H.A.: WHITED - a worldwide
household and industry transient energy data set. In: 3rd International Workshop
on Non-Intrusive Load Monitoring (2016)
8. Kelly, J., Knottenbelt, W.: The UK-DALE dataset, domestic appliance-level elec-
tricity demand and whole-house demand from five UK homes. Sci. Data 2, 150007
(2015). https://doi.org/10.1038/sdata.2015.7
9. Kelly, J., Knottenbelt, W.: The UK-DALE dataset, domestic appliance-level electric-
ity demand and whole-house demand from five UK homes. Sci. Data 2, 150007
(2015)
10. Kolter, J.Z., Jaakkola, T.: Approximate Inference in Additive Factorial HMMs
with Application to Energy Disaggregation, June 2018. https://doi.org/10.1184/
R1/6603563.v1
11. Kolter, J.Z., Johnson, M.J.: REDD: a public data set for energy disaggregation
research. In: Workshop on Data Mining Applications in Sustainability (SIGKDD),
San Diego, CA, vol. 25, pp. 59–62 (2011)
12. Krystalakos, O., Nalmpantis, C., Vrakas, D.: Sliding window approach for online
energy disaggregation using artificial neural networks. In: Proceedings of the 10th
Hellenic Conference on Artificial Intelligence, SETN 2018, pp. 7:1–7:6. ACM, New
York (2018). http://doi.acm.org/10.1145/3200947.3201011
13. Lange, H., Bergés, M.: The neural energy decoder: energy disaggregation by com-
bining binary subcomponents (2016)
14. Nalmpantis, C., Vrakas, D.: Machine learning approaches for non-intrusive load
monitoring: from qualitative to quantitative comparation. Artif. Intell. Rev. 1–27
(2018)
15. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng.
22(10), 1345–1359 (2010)
16. Paradiso, F., Paganelli, F., Giuli, D., Capobianco, S.: Context-based energy disag-
gregation in smart homes. Future Internet 8(1) (2016). https://doi.org/10.3390/
fi8010004. http://www.mdpi.com/1999-5903/8/1/4
17. Parson, O., Ghosh, S., Weal, M., Rogers, A.: Non-intrusive load monitoring using
prior models of general appliance types. In: Twenty-Sixth AAAI Conference on
Artificial Intelligence (2012)
18. Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J.
Comput. Vis. 115(3), 211–252 (2015)
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
20. Wang, Z., Oates, T.: Imaging time-series to improve classification and imputation.
In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
21. Zhang, C., Zhong, M., Wang, Z., Goddard, N., Sutton, C.: Sequence-to-point learn-
ing with neural networks for non-intrusive load monitoring. In: Thirty-Second
AAAI Conference on Artificial Intelligence (2018)
Learning Meaningful Sentence Embedding
Based on Recursive Auto-encoders
1 Introduction
Embedding the meaning of a sentence into a vector space is a challenging and
ongoing area of research in natural language processing, as it is expected to
be very useful for several natural language tasks. Neural word embedding
techniques have recently received a lot of attention and have proved to be a
powerful tool for modeling semantic relations between individual words. Almost
all word embedding models are based on the distributional hypothesis [1],
which states that words that occur in the same contexts tend to have similar
meaning. In most word embedding models, across various architectures, the
context is defined as the words that precede and follow the target word within
some fixed window. Word2vec [2] is a well-known word embedding model that has
achieved substantial success in mapping words with similar syntactic or
semantic meaning to vectors close to each other.
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 197–207, 2019.
https://doi.org/10.1007/978-3-030-20257-6_17
198 A. Bouraoui et al.
2 Related Work
phrases in the syntactic tree of the sentence. To compare two sentences, they
use a similarity matrix whose dimensions are proportional to the lengths of
the two sentences. Since this similarity matrix has varying dimensions due to
different sentence lengths, a dynamic pooling layer is used to map it to a
matrix of fixed dimension. The resulting matrix is used to calculate the
similarity score between the two sentences.
3 Proposed Method
Most previous methods focus on the syntactic aspects of a sentence by means of
a parse dependency tree, or use immediately neighboring sentences to learn
sentence embeddings. We are more interested in constructing representations
that capture the semantics of variable-sized sentences based on the evolution
of word meaning and the meaning of the current sentence. We believe that word
embeddings are not the only carriers of sentence semantics in the vector
space: the global meaning of the sentence in which a word is used can also
help determine the intended meaning of that word. In fact, words can have more
than one meaning, and a change in the global meaning of a sentence where a
word appears is a good indicator of a change in the meaning of that word. So,
letting the global meaning of the containing sentence influence the word
meaning may be relevant for modeling its new semantic meaning. On the other
hand, initializing a word with its old meaning when it reappears can help
detect new meanings of the same word, providing an understanding of what the
word means at that point.
Motivated by the above, we propose a new method to learn and embed sentence
meaning while jointly learning evolved word representations, in an
unsupervised manner based on recursive auto-encoders. Traditional recursive
auto-encoders use the recursive structure of parse trees, starting from the
leaves and proceeding recursively in a bottom-up fashion until the root of the
parse tree is reached. Our model differs from traditional recursive
auto-encoders in three respects. First, we take word order into account and
combine neighboring words recursively and sequentially. Second, we do not rely
on a parse tree structure. Third, we jointly learn sentence and word meanings
using their evolved representations. We construct the meanings of sentences
and words iteratively and recursively, updating the word embeddings at each
new input. We include the meaning representation of a sentence when
constructing the meaning representations of its words, where the sentence
meaning is represented by an embedding computed by a recursive auto-encoder
model updated t times. The meaning of a word is represented by an embedding
computed from the decoding function. The evolved word embedding is fed as
input to our model to construct the sentence meaning. In this way, the word
meaning is influenced not only by the meanings of the previous words but also
by the meaning of the sentence in which it occurs. Our model is illustrated
in Fig. 1.
Given a sentence s = [w1, w2, ..., wi, ..., wj, ..., wn] of length n, our model
regards it as a sequence of ordered words and applies the auto-encoder
recursively (as shown in Fig. 2). Along the sequence, our model combines each
pair of sibling nodes into a potential parent node. It takes the first pair of
neighboring vectors, the sentence embedding vector and the first word vector
of the sentence, defines them as potential children (c1, c2) = (semb, w1),
concatenates them, and gives them as input to the auto-encoder. We then save
the potential parent node p1 and the right reconstructed child, which
represents the reconstructed embedding of the current word of the given
sentence. The network is then shifted by one position, takes as input the
vectors (c1, c2) = (p1, w2), and again computes a potential parent node and
reconstructed child vectors. This process repeats until it reaches the last
pair of vectors in the sentence, (c1, c2) = (pn−1, wn). Each potential parent
pi is calculated using Eq. 1.
p_i = f(W_φ^i [c1; c2] + b_φ^i)                                          (1)

where W_φ^i ∈ ℝ^(d×2d) is the encoding weight matrix (2d is the number of
input units), c1 and c2 are the d-dimensional vector representations of the
two children in the pair, b_φ^i is the encoding bias vector, and f is a
non-linear activation function. The learned parent vector p_i is also
d-dimensional. The resulting representation p_i is then mapped back to
reconstructed vectors of the original children pair, in order to obtain a more
informative and abstract representation of the children's meaning:

[c′1; c′2] = f(W_ϕ^i p_i + b_ϕ^i)                                        (2)

where W_ϕ^i ∈ ℝ^(2d×d) is the decoding weight matrix and b_ϕ^i the decoding
bias vector.
This representation is used as input for the next training pass of our
recursive auto-encoder. The decoder output is used to initialize words when
they occur within a sentence that is trained later. This process is repeated
t times.
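One left-to-right sweep of this recursive auto-encoder (Eqs. 1 and 2) can be sketched as follows. Using tanh for the nonlinearity f, random weights, and these variable names are all our assumptions; the sketch only illustrates the encode/decode/shift mechanics, not the trained model.

```python
import numpy as np

def rae_sweep(s_emb, words, W_enc, b_enc, W_dec, b_dec):
    """One sweep: start from (c1, c2) = (s_emb, w1), encode to a parent
    p_i (Eq. 1), decode back to reconstructed children (Eq. 2), then
    shift by one word. The final p is the sentence embedding; the right
    reconstructed children are the evolved word embeddings."""
    p = s_emb
    recon_words = []
    for w in words:
        c = np.concatenate([p, w])                    # [c1; c2], shape (2d,)
        p = np.tanh(W_enc @ c + b_enc)                # Eq. (1), shape (d,)
        _, c2_rec = np.split(np.tanh(W_dec @ p + b_dec), 2)  # Eq. (2)
        recon_words.append(c2_rec)                    # evolved word embedding
    return p, recon_words

d = 8
rng = np.random.default_rng(0)
W_enc, b_enc = rng.standard_normal((d, 2 * d)) * 0.1, np.zeros(d)
W_dec, b_dec = rng.standard_normal((2 * d, d)) * 0.1, np.zeros(2 * d)
sent, recons = rae_sweep(np.zeros(d),
                         [rng.standard_normal(d) for _ in range(5)],
                         W_enc, b_enc, W_dec, b_dec)
assert sent.shape == (d,) and len(recons) == 5
```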
We explored two strategies to initialize the sentence vector included as a
child in our model input: random real-valued initialization, and a weighted
average of the vectors of the words constituting the sentence. We also propose
a new method to handle out-of-vocabulary words. The main idea consists in
determining, for each out-of-vocabulary word within a test sentence, all
training sentences containing this word. From the constructed set of
sentences, we select the one that has the maximum number of words in common
with the test sentence, and the out-of-vocabulary word is initialized by the
average of these words. In fact, we design three model variants considering
the different methods of sentence embedding and out-of-vocabulary
initialization.
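The out-of-vocabulary heuristic just described can be sketched in plain Python. Reading "the average of these words" as the embeddings of the words shared between the selected training sentence and the test sentence is our interpretation, and all names are ours.

```python
def oov_embedding(oov_word, test_sentence, train_sentences, embeddings):
    """Initialize an out-of-vocabulary word: among training sentences
    containing it, pick the one sharing the most words with the test
    sentence, then average the embeddings of those shared words."""
    candidates = [s for s in train_sentences if oov_word in s]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: len(set(s) & set(test_sentence)))
    common = [w for w in set(best) & set(test_sentence) if w in embeddings]
    if not common:
        return None
    dim = len(next(iter(embeddings.values())))
    # Component-wise mean of the shared words' embeddings.
    return [sum(embeddings[w][k] for w in common) / len(common)
            for k in range(dim)]
```

Sentences are represented here as lists of tokens and `embeddings` as a word-to-vector dict.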
The first variant (model-v1) initializes the sentence representation and
out-of-vocabulary words randomly, with each component sampled from a uniform
distribution. The second variant (model-v2) employs the suggested method of
handling out-of-vocabulary words and random initialization for the sentence
representation input. We furthermore test whether our model does better when
the initial sentence embedding is a weighted average of word vectors obtained
by word2vec (model-v3). We construct this initial sentence embedding by
weighting each word embedding within a sentence by its idf and then averaging
the resulting embeddings. We treat each sentence as a document and calculate
the idf weight for word w as follows:
idf_w = log( N / (1 + n_w) )                                             (3)

where N is the total number of sentences and n_w is the number of sentences
in which the word w occurs.
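Equation (3), with each sentence treated as a document, can be computed as follows (an illustrative sketch; names are ours):

```python
import math

def idf_weights(sentences):
    """idf per word, Eq. (3): idf_w = log(N / (1 + n_w)), where each
    sentence (a list of tokens) counts as one document."""
    n = len(sentences)
    counts = {}
    for s in sentences:
        for w in set(s):                # count each word once per sentence
            counts[w] = counts.get(w, 0) + 1
    return {w: math.log(n / (1 + c)) for w, c in counts.items()}

weights = idf_weights([["a", "dog"], ["a", "cat"], ["a", "dog", "runs"]])
# A ubiquitous word like "a" gets a low (here negative) weight,
# so it contributes little to the weighted-average sentence embedding.
```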
4 Experiments
4.1 Setting
We build our model using the TensorFlow and Keras frameworks for Python. We
adopt the Adam optimizer [17] to tune the network parameters. As objective
function, we apply the mean squared error (MSE), calculated between the
original input (p_{i−1}, w_i) and its reconstruction (p′_{i−1}, w′_i) for each
auto-encoder block of our recursive model using Eq. 4:

MSE(p_i) = ‖ [p_{i−1}; w_i] − [p′_{i−1}; w′_i] ‖²                        (4)
We initialize the word embeddings using the word2vec tool, with an embedding
dimension of 200. To measure the performance of our model, we use two metrics:
the Pearson correlation coefficient, denoted by r, and Spearman's rank
correlation coefficient, denoted by ρ. We perform the semantic similarity
task in an unsupervised framework by calculating the cosine similarity
between two sentence vectors.
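The unsupervised scoring step is simply cosine similarity between the two sentence vectors, with Pearson's r then computed between the predicted scores and the gold ratings. A minimal sketch (names and the toy numbers are ours):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence vectors, used as the
    unsupervised semantic-similarity score."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pearson's r between predicted scores and gold similarity ratings.
preds = [0.9, 0.2, 0.7]
gold = [4.8, 1.1, 3.9]
r = float(np.corrcoef(preds, gold)[0, 1])
assert -1.0 <= r <= 1.0
```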
4.2 Results
In this section, we present the results of the experiments conducted with the
setup explained in the previous section.
In Table 1, we show the results of our model when trained on the SICK and
SMTeuroparl datasets. We also include the results obtained with two baseline
methods for embedding sentences: the idf-avg-w2v method explained above, and
the avg-w2v method, in which we construct representations of variable-length
sentences by averaging the embeddings of all words within a sentence obtained
by the word2vec model.
Overall, our model variants achieve better performance than our baselines,
simple averaging and idf-weighted averaging. Simple averaging is the
worst-performing method, which can be explained by the fact that averaging the
word embeddings in a sentence tends to give too much weight to words that are
semantically irrelevant. Our embeddings capture more semantic information,
especially model-v3, which proves to be the most promising variant. This
indicates that our assumption of considering evolving word meanings for
modeling sentence meaning was beneficial both for learning sentence meaning
and for learning better word meaning.
Figure 3 shows that when two sentences have a lot of word overlap and differ
little in key semantic roles (A black dog is attacking a brown dog on the sand
and A black dog is playing with a brown dog on the sand), our model tends to
make smaller errors and appears to be good at capturing semantic similarity.
When sentence pairs differ in meaning and have few words in common (A woman is
cutting shrimps and A prawn is being cut by a woman), our model can detect the
semantic differences. Thus, we can say that our model gives satisfying
semantic similarity predictions.
Table 2 displays the results of some state-of-the-art methods on the SICK
dataset together with ours. The first four results represent the best
traditional methods released by SemEval 2014; our model outperforms them
according to both the Pearson and Spearman correlations. The last four rows
represent four well-known state-of-the-art models: the recursive auto-encoder
model proposed by [11] and the recurrent models based on parse trees proposed
by [12]. Our model-v3 performs well against other state-of-the-art sentence
embeddings. Thus, our model is promising and achieves competitive results
compared with both traditional methods and state-of-the-art deep networks.
Table 3 shows the Pearson correlation of our model against some
state-of-the-art methods on the SMTeuroparl dataset.

Method          Pearson's r
[22]            0.4203
[24]            0.3612
[23]            0.5280
RNN             0.409
LSTM            0.443
[25]            0.450
Our model-v3    0.4694

Comparing our result to the three best results on the SemEval task and some
state-of-the-art results on the SMTeuroparl dataset further shows that our
model provides competitive results. For example, our model is better than the
model of [24] by 0.1082 and that of [25] by 0.0194.
In contrast to other methods, which are trained using external knowledge such
as WordNet or parse trees, our unsupervised method uses only word embeddings
and a recursive auto-encoder, without external information. For example, the
ECNU model [19] uses four learning methods, WordNet, and additional corpora;
The Meaning Factory model [18] utilizes three different knowledge bases,
including WordNet; and [23] also uses external resources. These additional
corpora contain rich semantic information that is quite helpful for
representing sentence meaning. Furthermore, previous works use pre-training on
much larger corpora, which may introduce prior knowledge of the dataset. We
note that the output of our neural network is a vector that is used to compute
the sentence similarity score, whereas in other works the score is computed
directly by the neural network using linear regression. In addition, our model
yields not only meaningful sentence embeddings but also evolving word meaning
embeddings. Obtaining competitive results with such an unsupervised model is
therefore encouraging, and we can conclude that using evolved word meanings
improves the modeling of sentence meaning, while incorporating the sentence
representation as a global meaning improves the word semantic embeddings.
5 Conclusion
In this work, we address the problem of learning sentence representations that
capture the semantics of variable-length sentences in an unsupervised manner.
To do so, we propose a novel method based on recursive auto-encoders. Our
model combines the sentence embedding with the embeddings of the words it
contains, without using a parse tree structure, and takes word meaning
evolution into consideration. That is, every sentence is embedded as a vector
representing its semantic meaning using the evolved embeddings of the word
meanings within it. We also propose a method to address the problem of unknown
words, which improves results compared to randomly embedding unknown words.
Experiments conducted on a semantic similarity task show that the proposed
method for learning meaningful sentence embeddings gives competitive results.
In the future, we plan to introduce an attention mechanism to further enhance
the semantic representations of sentences.
References
1. Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
2. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. In: ICLR (2013)
3. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word represen-
tation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing, 25–29 October, Doha, Qatar, A meeting of SIGDAT, a Spe-
cial Interest Group of the ACL, pp. 1532–1543 (2014)
4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
5. Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R.S., Torralba, A., Urtasun, R.,
Fidler, S.: Skip-thought vectors. In: NIPS (2015)
6. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network
for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Asso-
ciation for Computational Linguistics, pp. 655–665 (2014)
7. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pp. 1746–1751 (2014)
8. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning
of universal sentence representations from natural language inference data. In:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing (2017)
9. Subramanian, S., Trischler, A., Bengio, Y., Pal, C.J.: Learning general purpose
distributed sentence representations via large scale multi-task learning. In: Inter-
national Conference on Learning Representations (2018)
10. Socher, R., Huang, E.H., Pennin, J., Manning, C., Ng, A.Y.: Dynamic pooling and
unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural
Information Processing Systems, USA, pp. 801–809 (2011)
11. Socher, R., Karpathy, A., Le, Q.V., Manning, C.D., Ng, A.Y.: Grounded compo-
sitional semantics for finding and describing images with sentences. Trans. Assoc.
Comput. Linguist. 2, 207–218 (2014)
12. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-
structured long short-term memory networks. In: Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing, pp. 1556–1566 (2015)
13. Bowman, S.R., Gauthier, J., Rastogi, A., Gupta, R., Manning, C.D., Potts, C.:
Fast unified model for parsing and sentence understanding. ACL (2016)
14. Dyer, C., Kuncoro, A., Ballesteros, M., Smith, N.A.: Recurrent neural network
grammars. In: NAACL, San Diego, California, 12–17 June 2016, pp. 199–209 (2016)
15. Hill, F., Cho, K., Korhonen, A.: Learning distributed representations of sentences
from unlabelled data. In: NAACL-HLT, San Diego, California, 12–17 June 2016,
pp. 1367–1377 (2016)
16. Grover, J., Mitra, P.: Sentence alignment using unfolding recursive autoencoders.
In: Proceedings of the 10th Workshop on Building and Using Comparable Corpora
ACL, Vancouver, Canada, 3 August 2017, pp. 16–20 (2017)
17. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: Interna-
tional Conference on Learning Representations, pp. 1–13 (2015)
18. Bjerva, J., Bos, J., Van der Goot, R., Nissim, M.: The meaning factory: formal
semantics for recognizing textual entailment and determining semantic similarity.
In: SemEval COLING, pp. 642–646 (2014)
19. Zhao, J., Zhu, T., Lan, M.: ECNU: one stone two birds: ensemble of heterogenous
measures for semantic relatedness and textual entailment. In: SemEval COLING,
pp. 271–277 (2014)
20. Lai, A., Hockenmaier, J.: Illinois-LH: a denotational and distributional approach
to semantics. In: International Workshop on Semantic Evaluation, Dublin, Ireland
(2014)
21. Jimenez, S., Dueñas, G., Baquero, J., Gelbukh, A.: UNAL-NLP: combining soft
cardinality features for semantic textual similarity, relatedness and entailment. In:
Proceedings of the 8th International Workshop on Semantic Evaluation (SemeVal),
Dublin, Ireland (2014)
22. Banea, C., Hassan, S., Mohler, M., Mihalcea, R.: UNT: a supervised synergistic
approach to semantic text similarity. In: Proceedings of SemEval 2012, pp. 635–642
(2012)
23. Bar, D., Biemann, C., Gurevych, I., Zesch, T.: UKP: computing semantic textual
similarity by combining multiple content similarity measures. In: Proceedings of
SemEval 2012, pp. 435–440 (2012)
24. Saric, R., Glavas, G., Karan, M., Snajder, J., Dalbelo, B.: TakeLab: systems for
measuring semantic text similarity. In: Proceedings of SemEval 2012, Montreal,
Canada, pp. 441–448 (2012)
25. Kenter, T., Borisov, A., Rijke, M.D.: Siamese CBOW: optimizing word embeddings
for sentence representations. CoRR abs/1606.04640 (2016)
Pruning Extreme Wavelets Learning
Machine by Automatic Relevance
Determination
1 Introduction
The Extreme Learning Machine [10] is a learning algorithm for single hidden layer feedforward networks (SLFNs) with low computational complexity and
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 208–220, 2019.
https://doi.org/10.1007/978-3-030-20257-6_18
Pruning ELM Wavelet by ARD 209
2 Literature Review
2.1 Extreme Learning Machine
The ELM (Extreme Learning Machine) is a learning method developed for single hidden layer feedforward neural networks (SLFNs) in 2006, where its main contri-
210 P. V. de Campos Souza et al.
where wj is the weight vector of the connections between the m inputs and the
hidden j -th neuron, hj is the vector of weights of the connections between the
j -th hidden neuron and the neurons of the network output, and bj is the bias of
the j -hidden neuron. For the ELM, f(·) is the activation function applied to the
scalar product of the input vector with the hidden weights wk that are defined at
random. The function f(·) can take several forms (e.g., sigmoidal) [10,17]. With the model defined in (1), we can write y as Hβ, where β is the vector of weights of the output layer and y is the vector of outputs. H is determined to be
[3,10,17]:

    H = [ f(w1 · x1 + b1)  ...  f(wl · x1 + bl)
          f(w1 · x2 + b1)  ...  f(wl · x2 + bl)
          ...
          f(w1 · xN + b1)  ...  f(wl · xN + bl) ]N×l        (2)
The columns of the matrix H, defined in (2), correspond to the outputs of the hidden neurons of the SLFN with respect to the input X = [x1, x2, ..., xN], where each xi ∈ Rm.
The ELM implements a random initialization of the weights of the hidden layer (within an arbitrary numerical range), wk. Then, the weights of the output layer are obtained through the pseudo-inverse [9] according to the expression [10,17]:
β = H+ y (3)
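As a concrete illustration, the training procedure of Eqs. (2)–(3) can be sketched in a few lines of NumPy. The ReLU activation, layer sizes and toy regression data below are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np

# Minimal ELM sketch: random hidden weights, output weights by pseudo-inverse.
def elm_fit(X, y, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # hidden weights w_k, random
    b = rng.standard_normal(n_hidden)                # hidden biases b_j, random
    H = np.maximum(X @ W + b, 0.0)                   # hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ y                     # beta = H^+ y  (Eq. 3)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.maximum(X @ W + b, 0.0) @ beta         # y = H beta

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 3))
y = X.sum(axis=1)                                    # toy regression target
W, b, beta = elm_fit(X, y, n_hidden=50)
pred = elm_predict(X, W, b, beta)
```

Only the output weights β are learned; the hidden layer is never iteratively trained, which is the source of the ELM's low computational cost.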
2.2 Wavelets
A wavelet is a function capable of decomposing and representing another function described in the time domain, so that this other function can be analyzed at different frequency and time scales. In Fourier analysis we can only identify information in the frequency domain, but we cannot know "when" in time the frequencies under study occur; in wavelet analysis, we can also extract information from the function in the time domain. The detail of the frequency-domain analysis decreases as the time resolution increases, and it is impossible to increase the detail in one domain without decreasing it in the other. Using wavelet analysis, one can choose the best combination of details for an established goal. Adapting this concept to artificial neural networks, the adjustment of the detail provided by the wavelets is an element capable of improving the generalization of network recognition. In this work, the discrete wavelet transform is adopted, a methodology much used in data compression. The discrete wavelet transform is computed by applying a filter bank, where the filter determined by the coefficients h = {hn}n∈Z corresponds to a high-pass filter and the filter g = {gn}n∈Z to a low-pass filter. Each of these coefficients in the discrete wavelet transform is tabulated. The operator (↓2) denotes sub-sampling: applied to a discrete function, it reduces its number of elements by half, allowing the procedure to be faster and more precise. The filters h and g are
linear operators, which can be applied to the input x as a convolution:
    c(n) = Σk g(k) x(n − k) = (g ∗ x)(n)        (4)

    d(n) = Σk h(k) x(n − k) = (h ∗ x)(n)        (5)
where the signal c(n) is known as approximation and the signal d(n) as detail [6].
The decomposition with this filter pair splits the signal into only two frequency bands. We can chain a series of filter banks, using the sub-sampling operation to divide the sampling frequency by 2 at each new filter bank in the chain.
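One level of the filter bank of Eqs. (4)–(5), followed by the (↓2) sub-sampling, can be sketched as follows; the Haar filter pair used here is an illustrative choice, since the text does not fix a specific wavelet.

```python
import numpy as np

# One DWT level as a filter bank: convolve with g (low pass) and h (high pass),
# then sub-sample (↓2). Haar coefficients are used for illustration.
g = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass coefficients g_n
h = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass coefficients h_n

def dwt_level(x):
    c = np.convolve(x, g)[1::2]           # approximation: (g * x) then (↓2)
    d = np.convolve(x, h)[1::2]           # detail:        (h * x) then (↓2)
    return c, d

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
c, d = dwt_level(x)                       # 8 samples -> 4 + 4 coefficients
```

Chaining `dwt_level` on successive approximation outputs gives the multi-level filter bank described above, halving the sampling rate at each level; for this orthonormal pair, energy is preserved (‖c‖² + ‖d‖² = ‖x‖²).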
the ELM-based training method is used. The concepts of wavelets and ELM
are also used in composite functions with differential evolution and finally a
parameter initialization with dual wavelet-based activation functions, in addition
to a combination of Morlet wavelets function and inverse hyperbolic sine function
with initialization of weights and bias through a heuristic procedure [11]. Other
works also use the ELM to train models along with the concepts of wavelets [1,5,7,14]. Several works use the ReLU function and ELM to solve problems,
mainly related to deep learning [13,18,24]. The main advantage of using the
ReLU function over other activation functions is that it does not activate all
neurons at the same time. If the input to a ReLU neuron is negative, it is converted to zero and the neuron is not activated; at any given time only some neurons are active, which makes the network sparse, efficient and computationally cheap to evaluate.
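The sparsity argument is easy to see numerically; this toy fragment illustrates only the activation itself.

```python
import numpy as np

# ReLU maps negative pre-activations to zero, so only some neurons fire:
# the hidden representation becomes sparse.
def relu(z):
    return np.maximum(z, 0.0)

z = np.array([-1.5, 0.0, 2.0, -0.3, 4.0])   # example pre-activations
a = relu(z)                                  # -> [0., 0., 2., 0., 4.]
sparsity = float(np.mean(a == 0.0))          # fraction of inactive neurons -> 0.6
```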
The determination of the number of neurons has also been the target of
several academic works. Some approaches use an incremental methodology (starting with a low number of neurons and gradually increasing it), and others use pruning techniques that allow the network to start with a high number of neurons and then prune the less relevant ones. One of the most prominent pruning works in neural network architectures that use ELM was proposed in [19], which uses statistical techniques to perform the pruning of neurons. Affinity matrix
techniques, data density, probabilistic and other statistical evaluations are used
in [3,15,22]. This work differs from other authors’ proposals due to the use of
wavelet functions to define weights and bias and the technique of pruning based
on Bayesian techniques.
SBL assumes a Gaussian likelihood function p(y|x) = N(y; Φx, λI), consistent with the data-fit term from (6). The basic ARD prior incorporated by SBL is p(x; γ) = N(x; 0, diag[γ]), where γ ∈ Rm+ is a vector of m non-negative hyperparameters governing the prior variance of each unknown coefficient. These hyperparameters are estimated from the data by first marginalizing over the coefficients x and then performing what is commonly referred to as evidence maximization or type-II maximum likelihood, which is equivalent to minimizing [23]:

    L(γ) ≜ − log ∫ p(y|x) p(x; γ) dx = − log p(y; γ) ≡ log |Σy| + yT Σy−1 y        (8)

where Σy = λI + Φ diag[γ] ΦT.
Note that if any γ∗,i = 0, as often occurs during the learning process, then
xSBL,i = 0 and the corresponding dictionary column is effectively pruned from
the model. The resulting xSBL is therefore sparse, with nonzero elements corresponding to the "relevant" basis vectors [23].
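To make the pruning mechanism behind Eq. (8) concrete, the following sketch minimizes L(γ) with standard EM updates for sparse Bayesian learning; this is one common implementation choice, not necessarily the update rule used by the authors, and the dictionary Φ, noise level λ and threshold are illustrative.

```python
import numpy as np

def sbl_em(Phi, y, lam=1e-2, n_iter=200):
    """EM-style evidence maximization for the ARD prior: hyperparameters gamma
    that collapse toward zero prune the corresponding columns of Phi."""
    n, m = Phi.shape
    gamma = np.ones(m)
    for _ in range(n_iter):
        G = np.diag(gamma)
        Sigma_y = lam * np.eye(n) + Phi @ G @ Phi.T   # the Sigma_y of Eq. (8)
        K = G @ Phi.T @ np.linalg.inv(Sigma_y)
        mu = K @ y                                    # posterior mean of x
        Sigma = G - K @ Phi @ G                       # posterior covariance of x
        gamma = mu ** 2 + np.diag(Sigma)              # EM update of gamma
    return mu, gamma

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [3.0, -2.0]                          # only two relevant columns
y = Phi @ x_true + 0.01 * rng.standard_normal(50)
mu, gamma = sbl_em(Phi, y)
relevant = np.where(gamma > 1e-3)[0]                  # surviving "relevant" columns
```

On this toy problem the gammas of the irrelevant columns shrink toward zero, so only the two truly relevant basis vectors survive the threshold; this is exactly the neuron-pruning mechanism exploited in the proposed model.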
The architecture of the proposed model follows the assumptions widely used
in the literature, where a Single Layer Feed Forward Network (SLFN) has a
certain number of hidden neurons, and these neurons have weights and bias values calculated through wavelet functions. The number of neurons in the
hidden layer varies according to the values chosen by the network architecture,
and the training is performed through an extreme learning machine. A single
output neuron carries the binary responses of the network. All neurons involved
in the architecture of this network are of the ReLU type. Fig. 1 shows the architecture described in this section.
For the hidden layer of the SLFN, training is performed with the output of the filters at each level of the Wavelet transform. The weights and bias, which by the original definition of the ELM would be randomly assigned, instead receive the corresponding values of the output of the Wavelet filters. Thus, the training of the neural network differs from the traditional ELM algorithm: the characteristics of the input data allow the values of weights and bias to be defined in a single step. Initially, the wavelet transform is applied to the input X of the model, resulting in a vector ψ1. This vector is then passed to a detail-removal function that eliminates the high frequencies (if any) of ψ1, adjusting the number of neurons l in the hidden layer so that training can be done, and resulting in a vector φ1. In the hidden layer of this architecture, the initial vector has l values. After the application of the Wavelet transform, the resulting vector still has l elements, but part of this vector is responsible for the high frequencies (detail), and the other part for the low frequencies (approximation).
Consider that, for the ELM example, the initial data vector has ten dimensions. After applying the wavelet transform, the resulting vector still has ten elements, but part of that vector is responsible for the high frequencies (detail), and the other part for the low frequencies (approximation). When RemoveDetails(ψ1) is applied, only r vector values are used (for example, six). In this way, we have two vectors: a vector of ten items (first-layer input) and another vector of six characteristics (first-layer output). From this six-element vector, the approximation values are assigned to the bias and the detail values to the weights of the neurons of the first layer (a choice made at random). In this paper, the high-frequency values of the filter are assigned to the weights of the neurons, and the low-frequency values are allocated to the bias. This choice was made without any particular criterion, since there is no evidence that the inverse assignment would not work as well. This procedure ensures that the same number of weights and biases that would otherwise be randomly assigned is provided based on the Wavelet transform, allowing these two parameters to be based on the characteristics of the dataset submitted to the model.
Algorithm 1. ELM training with filter bank Wavelets for weight w and bias b

    ψ1 ← Wavelet(input)
    φ1 ← RemoveDetails(ψ1)
    Train(in = input, out = φ1)
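A hedged end-to-end sketch of Algorithm 1 is given below. The helpers `wavelet` and `init_from_wavelet`, the Haar filters, and the layout that recycles detail coefficients as weights and approximation coefficients as biases are illustrative assumptions about the procedure, not the authors' exact implementation.

```python
import numpy as np

def wavelet(x):
    s = 1.0 / np.sqrt(2.0)                    # one-level Haar filter bank
    approx = s * (x[0::2] + x[1::2])          # low frequencies (approximation)
    detail = s * (x[1::2] - x[0::2])          # high frequencies (detail)
    return approx, detail

def init_from_wavelet(X, n_hidden):
    """Assign wavelet outputs instead of random values: detail -> weights,
    approximation -> bias (the assignment chosen in the paper)."""
    approx, detail = wavelet(X.mean(axis=0))       # summarize inputs as one signal
    W = np.resize(detail, (X.shape[1], n_hidden))  # tile detail coeffs as weights
    b = np.resize(approx, n_hidden)                # tile approx coeffs as biases
    return W, b

def elm_train(X, y, W, b):
    H = np.maximum(X @ W + b, 0.0)            # ReLU hidden layer, as in the paper
    return np.linalg.pinv(H) @ y              # output weights via pseudo-inverse

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
y = (X[:, 0] > 0).astype(float)               # toy binary target (outputs 0/1)
W, b = init_from_wavelet(X, n_hidden=6)
beta = elm_train(X, y, W, b)
```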
The following algorithm presents the steps performed for the pruning of unnec-
essary neurons.
In this section, we will discuss the classification tests for the model proposed in
this paper. In order to perform the tests, real and synthetic bases were chosen,
seeking to verify if the accuracy of the proposed model surpasses the traditional
techniques that work in ELM. The information tables present information about
the tests, presenting factors such as the percentage of samples destined for the
training and testing of the neural networks. All the tests with the algorithms involved were randomized, to avoid tendencies that could interfere with the evaluation of the results. The proposed WR-ELM model was compared with the state of the art in ELM pruning (OP-ELM) [19] and with a recent model that uses the Matthews coefficient to prune less relevant neurons (CM-ELM) [3]. In the compared models, the sigmoid was used as the activation function of the neurons, and the hidden-layer weights were randomly defined.
layer. A total of 30 experiments were done with the three models submitted to
all test bases. In all tests and all models, the initial number of neurons was the same as the number of samples in the dataset. The samples were shuffled for each test to demonstrate the real capacity of the models. The results tables present percentage values for the classification tests, accompanied by the standard deviation found over the 30 repetitions, comparing the expected responses with the responses obtained. Finally, the AUC is also reported for
classification tests. The outputs expected in the test were set to 0 and 1 to meet
ReLU responses. Therefore all bases used had their outputs converted to zero
and one. Accuracy is the primary test result. It is given as a percentage and compares the response obtained by the model with the expected response. When the two are equal, a hit is counted. In the end, the total of hits obtained by the
model is taken and divided by the total number of samples destined for the test
in order to obtain the accuracy of the model.
In the execution on real datasets, the proposed model obtained superior accuracy results in five of the datasets in the test. In the
other bases, the proposed model maintained a difference within the standard
deviation in two of the three bases, showing that the approach is statistically
equivalent to the original ELM training propositions, adding a more direct rela-
tionship with the database for the determination of the weights in the hidden
layer. It is also noticeable that the model proposed in the paper achieves the best answers with the lowest average number of neurons. This demonstrates that the pruning technique acted efficiently in choosing the most significant neurons.
5 Conclusion
We verified that defining weights and bias at random is satisfactory, but the results obtained when they are determined by wavelet functions vary according to the input samples and, in conjunction with ReLU activation functions, have the effect of making networks that use ELM as their training basis more efficient and accurate. The results obtained also demonstrate that the model preserves its processing capacity while neutralizing (pruning) neurons in its structure. This method can be extended to solve complex problems with large databases and a large number of dimensions. Further work can address linear regression problems, using other activation functions derived from ReLU and other real databases commonly shared by the machine learning community. Future work may also investigate whether the wavelet transform can work on models with more than one hidden layer, facilitating the assignment of weights in the following layers based on the responses obtained in the hidden layer.
References
1. Avci, E., Coteli, R.: A new automatic target recognition system based on wavelet
extreme learning machine. Expert Syst. Appl. 39(16), 12340–12348 (2012)
2. Bache, K., Lichman, M.: UCI machine learning repository (2013)
3. de Campos Souza, P.V., Araujo, V.S., Guimaraes, A.J., Araujo, V.J.S., Rezende,
T.S.: Method of pruning the hidden layer of the extreme learning machine based
on correlation coefficient. In: 2018 IEEE Latin American Conference on Compu-
tational Intelligence (LA-CCI), pp. 1–6, November 2018. https://doi.org/10.1109/
LA-CCI.2018.8625247
4. Cao, J., Lin, Z., Huang, G.B.: Composite function wavelet neural networks with
extreme learning machine. Neurocomputing 73(7–9), 1405–1416 (2010)
5. Chacko, B.P., Krishnan, V.V., Raju, G., Anto, P.B.: Handwritten character recog-
nition using wavelet energy and extreme learning machine. Int. J. Mach. Learn.
Cybern. 3(2), 149–161 (2012)
6. Daubechies, I.: The wavelet transform, time-frequency localization and signal anal-
ysis. IEEE Trans. Inf. Theory 36(5), 961–1005 (1990)
7. Deo, R.C., Tiwari, M.K., Adamowski, J.F., Quilty, J.M.: Forecasting effec-
tive drought index using a wavelet extreme learning machine (W-ELM) model.
Stochast. Environ. Res. Risk Assess. 31(5), 1211–1240 (2017)
8. Ding, S., Zhang, J., Xu, X., Zhang, Y.: A wavelet extreme learning machine. Neural
Comput. Appl. 27(4), 1033–1040 (2016)
9. Golub, G., Kahan, W.: Calculating the singular values and pseudo-inverse of a
matrix. J. Soc. Ind. Appl. Math. Ser. B: Numer. Anal. 2(2), 205–224 (1965)
10. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and appli-
cations. Neurocomputing 70(1–3), 489–501 (2006)
11. Javed, K., Gouriveau, R., Zerhouni, N.: SW-ELM: a summation wavelet extreme
learning machine algorithm with a priori parameter initialization. Neurocomputing
123, 299–307 (2014)
12. Karlik, B., Olgac, A.V.: Performance analysis of various activation functions in
generalized MLP architectures of neural networks. Int. J. Artif. Intell. Expert Syst.
1(4), 111–122 (2011)
13. Kuang, Y., Wu, Q., Shao, J., Wu, J., Wu, X.: Extreme learning machine clas-
sification method for lower limb movement recognition. Cluster Comput. 20(4),
3051–3059 (2017)
14. Li, B., Cheng, C.: Monthly discharge forecasting using wavelet neural networks
with extreme learning machine. Sci. China Technol. Sci. 57(12), 2441–2452 (2014)
15. Li, R., Wang, X., Lei, L., Song, Y.: l21-norm based loss function and regularization extreme learning machine. IEEE Access 7, 6575–6586 (2019)
16. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural net-
work acoustic models. In: Proceedings of ICML, vol. 30, p. 3 (2013)
17. Martı́nez-Martı́nez, J.M., Escandell-Montero, P., Soria-Olivas, E., Martı́n-
Guerrero, J.D., Magdalena-Benedito, R., Gómez-Sanchis, J.: Regularized extreme
learning machine for regression problems. Neurocomputing 74(17), 3716–3721
(2011)
18. McDonnell, M.D., Tissera, M.D., Vladusich, T., Van Schaik, A., Tapson, J.: Fast,
simple and accurate handwritten digit classification by training shallow neural net-
work classifiers with the ‘extreme learning machine’ algorithm. PLoS ONE 10(8),
e0134254 (2015)
19. Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., Lendasse, A.: OP-ELM:
optimally pruned extreme learning machine. IEEE Trans. Neural Netw. 21(1),
158–162 (2010)
20. Neal, R.M.: Bayesian Learning for Neural Networks, vol. 118. Springer, Heidelberg
(2012)
21. Peck, C.C., Sheiner, L.B., Nichols, A.I.: The problem of choosing weights in non-
linear regression analysis of pharmacokinetic data. Drug Metab. Rev. 15(1–2),
133–148 (1984)
22. Pinto, D., Lemos, A.P., Braga, A.P., Horizonte, B., Gerais-Brazil, M.: An affinity
matrix approach for structure selection of extreme learning machines. In: Proceed-
ings, p. 343. Presses universitaires de Louvain (2015)
23. Wipf, D.P., Nagarajan, S.S.: A new view of automatic relevance determination. In:
Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Informa-
tion Processing Systems 20, pp. 1625–1632. Curran Associates, Inc. (2008). http://
papers.nips.cc/paper/3372-a-new-view-of-automatic-relevance-determination.pdf
24. Zeng, Y., Xu, X., Fang, Y., Zhao, K.: Traffic sign recognition using deep convo-
lutional networks and extreme learning machine. In: He, X., et al. (eds.) IScIDE
2015. LNCS, vol. 9242, pp. 272–280. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23989-7_28
Students’ Performance Prediction Model Using
Meta-classifier Approach
1 Introduction
Educational data mining is employed to discover unique kinds of information from educational databases, which are then used to understand students.
In achieving its goal, educational data mining exploits multiple levels of hierarchy in educational data. This is done by integrating methods from the machine learning and data mining literature (Baker 2010). Statistical methods, machine learning, and data mining are three components that have been used widely in educational data mining to mine varieties of educational data. As such, educational data mining methods have been employed to develop and discover unique types of knowledge in educational settings and to gain a better understanding of how students learn (Romero and Ventura 2010).
Many techniques for the prediction and analysis of students have been proposed in previous studies. The most frequently used technique is data mining (Shahiri et al. 2015), a process of discovering hidden knowledge from databases or data warehouses, and of extracting useful information from massive numbers of databases. Nowadays, data mining techniques are widely used in many fields, including education, to predict future situations.
However, learning analytics and educational data mining are new fields of study in education. Learning analytics is the process of analyzing massive amounts of educational data; this involves forecasting future performance and recognizing risk. Educational data mining is a process that entails analyzing and understanding students' data using student performance prediction (Kavitha and Raj 2017). Despite many studies about learning analytics in higher education institutions within the last several years, it remains an emerging field of education (Nunn et al. 2016).
This is because new explorations and findings about students' learning behaviour and the factors contributing to students' performance need to be made for public benefit. However, determining hidden knowledge from students' data may require identifying some important elements, such as parameters, data mining techniques and tools, to develop accurate models (Ahmad et al. 2015).
As such, there are big gaps regarding the size and amount of data required in educational data mining research. According to Márquez-Vera et al. (2016), only a few studies on education dropout rates had been conducted, and most of them used statistical methods rather than data mining techniques. Xu et al. (2017) noted that there is still a lack of studies on predicting students' performance in completing undergraduate programs, particularly regarding students' backgrounds and courses. In addition, courses alone are not informative enough to make accurate predictions of evolving progress.
Therefore, this study combines features of students' academic, demographic, economic and e-learning behaviour factors that contribute to higher accuracy, to form a unique Students' Performance Prediction using the Meta-Classifier Model. This study is guided by the following research questions:
1. What are the most important features that influence students’ performance?
2. Is data pre-processing able to manage the data and improve the classifiers used in building the prediction model?
3. What is the most significant combination of classifiers to build students’ perfor-
mance prediction model?
2 Methodology
mining, this step is called data pre-processing and is executed after the data have been integrated. In pre-processing using Python, five steps are taken as follows:
1. Data cleaning, normalization and transformation
2. Data reduction
3. Conversion of categorical data to numerical
4. Data scaling
5. Feature selection
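The five steps above can be sketched with a typical Python/scikit-learn toolchain; the tiny DataFrame and its column names are invented for illustration, and the actual features used in the study are those of Table 1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Invented toy student data: one numeric gap, one categorical column.
df = pd.DataFrame({
    "cgpa":   [3.1, np.nan, 2.5, 3.8, 2.0, 3.3],
    "logins": [40, 12, 55, 3, 70, 22],
    "gender": ["F", "M", "F", "M", "F", "F"],
    "label":  [1, 0, 1, 0, 1, 0],
})

# Steps 1-2: cleaning/reduction -- replace the missing value with the mean
df["cgpa"] = SimpleImputer(strategy="mean").fit_transform(df[["cgpa"]]).ravel()

# Step 3: categorical -> numerical
df["gender"] = df["gender"].map({"F": 0, "M": 1})

# Step 4: scaling to [0, 1]
X = MinMaxScaler().fit_transform(df[["cgpa", "logins", "gender"]])
y = df["label"].to_numpy()

# Step 5: feature selection -- keep the k best-ranked features
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)
```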
Artificial Neural Network and Decision Tree are two data mining methods that are
highly used by researchers for predicting students’ performance (Shahiri et al. 2015).
Therefore, these two base classifiers, together with the Support Vector Machine, are employed in this study. In addition, five ensemble classifiers (RF, Bagging, AdaBoost, Stacking and Ensemble Vote) are employed and compared.
An ensemble meta-classifier is a combination of two or more base classifiers used to develop better machine learning models and solve classification problems (Polikar et al. 2008). It is a type of supervised learning technique that combines multiple weak learners to produce a strong learner. An ensemble learner is also known as a multiple classifier system, based on multiple learning models, which has been used to solve problems such as classification (Gudivada et al. 2016).
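As one concrete (hypothetical) instance of such a meta-classifier, scikit-learn's `VotingClassifier` can combine the three base learners used later in the paper (RF, ANN and SVM) under a majority-vote rule; the synthetic data stand in for the student datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Majority-vote ensemble of three heterogeneous base classifiers.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
        ("svm", SVC(random_state=0)),
    ],
    voting="hard",                # combination rule: majority vote
)
vote.fit(X, y)
train_acc = vote.score(X, y)      # training accuracy of the combined model
```

Replacing `voting="hard"` with a `StackingClassifier` would correspond to the stacked variant compared later in the experiments.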
    Accuracy = (TP + TN) / (TP + TN + FP + FN)        (1)

    Precision = TP / (FP + TP)        (2)

    Recall = TP / (FN + TP)        (3)

    F-measure = 2 × (Precision × Recall) / (Precision + Recall)        (4)
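Eqs. (1)–(4) can be computed directly from confusion-matrix counts; the counts below are invented to mirror the pattern reported later (recall higher than accuracy and precision).

```python
# Evaluation metrics of Eqs. (1)-(4), from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Invented counts: acc ≈ 0.78, prec = 0.80, rec ≈ 0.94
acc, prec, rec, f1 = metrics(tp=80, tn=10, fp=20, fn=5)
```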
3.1 Pre-processing
The first step of pre-processing is to remove data that are obviously not useful for data mining. Using Python pre-processing, some features were dropped from the datasets, such as ID table features, codes, field of study, session and year. From
226 H. Hassan et al.
the 43 features of the student information system data, 17 are ID features and 13 are not relevant; therefore, only the remaining features were considered for the next step. After irrelevant data were removed, the next step was cleaning noise and outliers. This study uses statistical methods to replace noisy data with the mean value. Characteristics of noisy or incomplete data, such as missing values, empty columns and outliers, were identified in order to replace them and improve the quality of the data sets.
From all the selected features, the data were categorized into three groups (demographic and economic features, academic background features, and behaviour features). All these features were employed in the next step, where they were ranked using an algorithm. The feature descriptions are given in Table 1.
problem, where most of the metrics developed are not for multiclass problems. As such, it is suggested that the future development of new metrics, or the choice of a suitable metric, should take this problem into consideration (Hossin and Sulaiman 2015).
3.3 Modeling
After pre-processing, the data were divided into two sets (train and test) with a 70:30 ratio. To model the data, k-fold cross-validation with k = 10 was used for each model. The best way to evaluate a model is on future data; however, at this preliminary stage, cross-validation was used to evaluate the models.
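The setup above amounts to a 70:30 hold-out split plus 10-fold cross-validation on the training portion; this sketch uses scikit-learn with synthetic data in place of the student datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# 70:30 split, then 10-fold CV on the training portion (synthetic stand-in data).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_tr, y_tr, cv=10)
mean_cv_accuracy = scores.mean()
```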
Table 2 shows the experimental results of 8 different base- and meta-classifier models. The accuracy and precision for all models are lower than 80%, but the recall for each model is mostly greater than 90%. The rule of thumb for precision and recall is that the higher the precision and recall, the better the classifier, while the opposite applies to the RMSE and classification error (Adejo et al. 2018).
In Table 2, models 1, 2 and 3 are trained base-classifier models with three categories of data (student information system, e-learning, and the combination of student information system and e-learning). Models 4, 5 and 6 use meta-classifiers with the same three categories of data; the meta-classifiers in these three models employ the same base classifier, the Decision Tree. Model 7 is a Stacked Classification that uses the combination of Random Forest (RF), Artificial Neural Network (ANN) and Support Vector Machine (SVM). Finally, model 8 is the ensemble-vote meta-classifier model, which is similar to model 7 but uses the majority-vote combination rule with hyperparameter optimization.
Among the classifiers used, the Decision Tree model gives the lowest accuracy when trained with the three categories of data, while the voting meta-classifier gives the highest accuracy when trained using the combination of system and e-learning data. The voting meta-classifier therefore shows promise of achieving even better accuracy once its hyperparameters are tuned in future work. However, some issues still need to be taken into consideration, such as the pre-processing of noisy data and the class-imbalance problem, which needs improvement.
Of the three categories of data, the combination of student information system (SIS) and behavioural e-learning (EL) data produced the highest accuracy result of all. The six evaluation metrics used in this study are Accuracy, Confusion Matrix, Precision, Recall, F-score and Classification Error.
Studies by Amrieh et al. (2016), Marbouti et al. (2016), Iam-On and Boongoen (2017), Adejo et al. (2018), Beemer et al. (2018), Kostopoulos et al. (2018), Salini et al. (2018), and Wanjau and Muketha (2018) developed students' performance prediction models using ensemble techniques and compared their performance with a variety of single learning models. The results showed that ensemble techniques produce considerably better prediction accuracy.
Even though many studies have examined pedagogical factors and students' performance issues, only a few concentrated on specific areas (Anoopkumar and Rahman 2018). The study by Fernandes et al. (2018) showed that grades and absences are the two features most relevant for predicting end-of-year academic outcomes. They also stated that, among demographic features, neighbourhood, school and age are three potential indicators of academic success or failure.
The factors influencing students' dropout rates have become a popular subject of study over the past several years. Students' behaviour during course enrolment and institutional characteristics are two factor areas that have been investigated in many studies (Lopez Guarin et al. 2015).
In the education field, academic achievement is the core component that all higher education institutions need to improve. Students' academic information profiles and LMS records are two types of student data that contain tremendously useful information that can be interpreted as knowledge. However, there is still a lack of studies that discover patterns and build students' performance prediction models using ensemble techniques, even though these techniques are proven to give higher accuracy.
In this study, a students' performance prediction model using ensemble meta-classifiers proved to produce high accuracy. Three base classifiers and five meta-classifiers were developed and compared. It was also shown that the combination of student information system data and students' behaviour data from e-learning produces the highest result when trained using the Majority Vote Ensemble Classifier.
The next step of this study will be to investigate the best feature selection methods to handle the massive amount of data, and to optimize the models by fine-tuning hyperparameters for optimal results.
References
Adejo, O.W., Connolly, T.: Predicting student academic performance using multi-model heterogeneous ensemble approach. J. Appl. Res. High. Educ. 10(1), 61–75 (2018)
Ahmad, F., Ismail, N.H., Aziz, A.A.: The prediction of students’ academic performance using
classification data mining techniques. Appl. Math. Sci. 9(129), 6415–6426 (2015)
AL-Malaise, A., Malibari, A., Alkhozae, M.: Students performance prediction system using multi
agent data mining technique. Int. J. Data Min. Knowl. Manag. Process (2014)
Amrieh, E.A., Hamtini, T., Aljarah, I.: Mining educational data to predict student’s academic
performance using ensemble methods. Int. J. Database Theor. Appl. 9(8), 119–136 (2016)
Anoopkumar, M., Zubair Rahman, A.M.J.Md.: Model of tuned J48 classification and analysis of
performance prediction in educational data mining. Int. J. Appl. Eng. Res. 13(20), 14717–
14727 (2018). ISSN 0973-4562
Baker, S.J.R.: Data mining for education. Int. Encycl. Educ. (2010)
Barhamzaid, Z.A.A., Alleyne, A.: Factors affecting student performance in the first accounting
course in diploma program under political conflic. J. Educ. Prac. 9(24), 144–154 (2018)
Beemer, J., Spoon, K., He, L., Fan, J., Levine, R.A.: Ensemble learning for estimating
individualized treatment effects in student success studies. Int. J. Artif. Intell. Educ. 28(3),
315–335 (2018)
Fernandes, E., Holanda, M., Victorino, M., Borges, V., Carvalho, R., Van Erven, G.: Educational
data mining: predictive analysis of academic performance of public school students in the
capital of Brazil. J. Bus. Res. 94, 335–343 (2018)
Gudivada, V.N., Irfan, M.T., Fathi, E., Rao, D.L.: Cognitive Analytics: Going Beyond Big Data
Analytics and Machine Learning. Handbook of Statistics, 1st edn. Elsevier B.V (2016)
Hossin, M., Sulaiman, M.N.: A review on evaluation metrics for data classification evaluations.
Int. J. Data Min. Knowl. Manag. Process (IJDKP) 5(2), 1–11 (2015)
Iam-On, N., Boongoen, T.: Improved student dropout prediction in Thai University using
ensemble of mixed-type data clusterings. Int. J. Mach. Learn. Cyber. 8(2), 497–510 (2017)
Kavitha, G., Raj, L.: Educational data mining and learning analytics educational assistance for
teaching and learning. Int. J. Comput. Organ. Trends 41(1), 21–25 (2017)
Kondo, N., Okubo, M., Hatanaka, T.: Early detection of at-risk students using machine learning
based on LMS log data. In: 2017 6th IIAI International Congress on Advanced Applied
Informatics (IIAI-AAI), pp. 198–201 (2017)
Kostopoulos, G., Livieris, I.E., Kotsiantis, S., Tampakas, V.: CST-voting: a semi-supervised
ensemble method for classification problems. J. Intell. Fuzzy Syst. 35(1), 99–109 (2018)
Lopez Guarin, C.E., Guzman, E.L., Gonzalez, F.A.: A model to predict low academic
performance at a specific enrollment using data mining. Revista Iberoamericana de
Tecnologias del Aprendizaje 10(3), 119–125 (2015)
Marbouti, F., Diefes-Dux, H.A., Madhavan, K.: Models for early prediction of at-risk students in
a course using standards-based grading. Comput. Educ. 103, 1–15 (2016)
Márquez-Vera, C., Cano, A., Romero, C., Noaman, A.Y.M., Mousa Fardoun, H., Ventura, S.:
Early dropout prediction using data mining: a case study with high school students. Expert
Syst. 33(1), 107–124 (2016)
Nam, S.J., Frishkoff, G., Collins-Thompson, K.: predicting students’ disengaged behaviors in an
online meaning-generation task. IEEE Trans. Learn. Technol. 1382, 1–14 (2017)
Natek, S., Zwilling, M.: Student data mining solution-knowledge management system related to
higher education institutions. Expert Syst. Appl. 41(14), 6400–6407 (2014)
Nunn, S., Avella, J.T., Kanai, T., Kebritchi, M.: Learning analytics methods, benefits, and
challenges in higher education: a systematic literature review. Online Learn. 20(2), 13–29
(2016)
Polikar, R., et al.: An ensemble based data fusion approach for early diagnosis of Alzheimer’s
disease. Inf. Fusion 9(1), 83–95 (2008)
Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Trans.
Syst. Man Cybern. Part C (Appl. Rev.) 40(6), 601–618 (2010)
Salini, A., Jeyapriya, U., College, S.M., College, S.M.: A majority vote based ensemble classifier
for predicting students academic performance. Int. J. Pure Appl. Math. 118(24), 1–11 (2018)
Students’ Performance Prediction Model Using Meta-classifier Approach 231
Shahiri, A.M., Husain, W., Rashid, N.A.: A review on predicting student’s performance using
data mining techniques. In: 2015 3rd Information Systems International Conference, pp. 414–
422 (2015)
Tamhane, A., Appleton, J.: Predicting student risks through longitudinal analysis. In: KDD,
pp. 1544–1552 (2014)
Wanjau, S.K., Muketha, G.M.: Improving student enrollment prediction using ensemble
classifiers. Int. J. Comput. Appl. Technol. Res. 7(03), 122–128 (2018)
Xu, J., Moon, K.H., Van Der Schaar, M.: A machine learning approach for tracking and
predicting student performance in degree programs. IEEE J. Sel. Top. Signal Process. 11(5),
742–753 (2017)
Zollanvari, A., Kizilirmak, R.C., Kho, Y.H., Hernandez-Torrano, D.: Predicting students’ GPA
and developing intervention strategies based on self-regulatory learning behaviors. IEEE
Access 5, 23792–23802 (2017)
Deep Learning
A Deep Network System for Simulated
Autonomous Driving Using Behavioral Cloning
1 Introduction
2 Related Work
Convolutional neural networks (CNNs) [1] have revolutionized pattern recognition [2];
prior to the large-scale adoption of CNNs, most pattern recognition tasks were
accomplished using an initial stage of hand-crafted feature extraction followed by a
classifier. The advance of CNNs is that features are learned automatically from training
examples. A CNN is particularly effective in image recognition tasks because the
convolution operation captures the 2D nature of images. Also, by applying the con-
volution kernels to examine an entire image, fewer parameters need to be learned in
comparison with the total number of operations [3]. While CNNs with learned features
have been in commercial use for over twenty years [4], their adoption has exploded in
the last few years because of two recent developments. First, large, labeled data sets
such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [5] have
become accessible for training and validation [3]. Second, CNN learning algorithms
have been implemented on the parallel graphics processing units (GPUs) which
accelerate learning [3].
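The parameter economy of convolution mentioned above can be made concrete with a quick count: a convolutional layer reuses one small kernel across the whole image, whereas a fully connected layer needs one weight per input-output pair. A rough sketch (the layer sizes are illustrative, not taken from the paper):

```python
# Rough parameter-count comparison: convolution vs. fully connected.
# Layer sizes are illustrative, not taken from the paper.

def conv_params(in_ch, out_ch, k):
    """A conv layer learns one k x k kernel per (input, output) channel pair, plus biases."""
    return in_ch * out_ch * k * k + out_ch

def dense_params(in_units, out_units):
    """A fully connected layer learns one weight per input-output pair, plus biases."""
    return in_units * out_units + out_units

# Mapping a 66x200 RGB image to 24 feature maps with 5x5 kernels:
print(conv_params(3, 24, 5))           # 1824 parameters
# A dense layer producing just 24 units from the same flattened input:
print(dense_params(66 * 200 * 3, 24))  # 950424 parameters
```

The kernel is slid over every image position, so its cost is independent of the image size; the dense layer's cost grows with every added pixel.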
DARPA Autonomous Vehicle (DAVE) [6] demonstrated the potential of end-to-
end learning and was used to justify starting the DARPA Learning Applied to Ground
Robots (LAGR) program [7]. However, DAVE's performance was not reliable enough
to support a full alternative to more modular approaches to off-road driving: its mean
distance between crashes was about 20 m in complex environments. More recently, a new
effort was started at NVIDIA, which aims to build on DAVE and create a
strong system for driving on public roads. The basic motivation for this project is to
avoid the need to identify specific human-designated features, such as lane markings,
guardrails, or other cars, and to prevent having to create a collection of “if-then-else”
rules, based on the perception of these features [3].
DAVE-2 [6] was inspired by the pioneering work of Pomerleau [8] who built the
Autonomous Land Vehicle in a Neural Network (ALVINN) system in 1989. It proves
that an end-to-end trained neural network can steer a vehicle on public roads [3].
Many major companies are involved in developing self-driving cars. Among those
who are currently testing such vehicles on public roads, one can mention, among
others, Google, Tesla, Toyota, BMW, Nissan, Ford [9].
A Deep Network System for Simulated Autonomous Driving 237
In this paper, we build a CNN that goes beyond pattern recognition and learns the
behavior of a vehicle.
Apart from producing sparse codes, the main advantage of ReLU is that it does not
suffer from the vanishing gradient problem [4, 10], since the derivative for positive values is
not contractive [2]. On the other hand, ReLU is non-negative and therefore has a
mean activation greater than zero [11].
The hyperbolic tangent activation function (tanh) compresses the
input into the range (−1, 1):

f_\text{tanh}(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \quad (2)

The scaled exponential linear unit (SELU) of Klambauer et al. [12] multiplies an
ELU-type activation by a scaling factor \lambda:

f_\text{SELU}(x) = \begin{cases} \lambda x, & x > 0 \\ \lambda \alpha (e^{x} - 1), & x \le 0 \end{cases} \quad (3)

where λ = 1.0507 and α = 1.67326 [12]. It has self-normalizing properties: activations
that are close to zero mean and unit variance, propagated through many
network layers, also converge towards zero mean and unit variance. This makes
learning highly robust and allows networks with many layers to be trained [13].
Because the learning can be made faster by centering the activations at zero, the
exponential linear unit (ELU) also uses an activation function with near-zero mean
activations, which accelerates learning in deep neural networks and contributes to better
learning performance:

f_\text{ELU}(x) = \begin{cases} \alpha (e^{x} - 1), & x \le 0 \\ x, & x > 0 \end{cases} \quad (4)

The hyperparameter \alpha controls the value to which an ELU saturates for negative net
inputs; in the simplest case, \alpha = 1 [11]. Like ReLU (with its variants such as Leaky
ReLU or Parametrized ReLU), ELU relieves the vanishing gradient problem by using
the identity for positive values.
These activation functions are graphically displayed in Fig. 5.
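For reference, the activation functions discussed in this section can be written directly in NumPy; this is an illustrative sketch, with the SELU constants taken from [12]:

```python
import numpy as np

# The activation functions of this section in plain NumPy.
# LAM and ALPHA are the SELU constants reported in [12].
LAM, ALPHA = 1.0507, 1.67326

def relu(x):
    return np.maximum(0.0, x)

def tanh_act(x):                      # Eq. (2)
    return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))

def selu(x):                          # Eq. (3)
    return np.where(x > 0, LAM * x, LAM * ALPHA * (np.exp(x) - 1))

def elu(x, a=1.0):                    # Eq. (4)
    return np.where(x > 0, x, a * (np.exp(x) - 1))

x = np.linspace(-5, 5, 1001)
assert np.allclose(tanh_act(x), np.tanh(x))   # Eq. (2) is the standard tanh
assert relu(x).min() == 0.0                   # ReLU is non-negative, so mean > 0
assert elu(np.array([-50.0]))[0] > -1.0001    # ELU saturates near -alpha for x -> -inf
```

The assertions spell out the properties used in the text: tanh is bounded, ReLU has a positive mean activation, and ELU saturates for negative inputs.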
242 A.-I. Patachi et al.
4 Experimental Results
The results of behavioral cloning for the simulated driving scenario, implemented with a
CNN using different activation functions are presented in Figs. 6 and 7. They are gen-
erally similar, but on closer inspection, one can notice that the ELU activation function
leads to a smaller mean squared error loss both for the training and for the validation data.
Fig. 6. Training mean squared error loss and data validation mean squared error loss
Fig. 8. The mean squared error loss obtained for the ELU activation function
Fig. 9. The evolution of the mean squared error loss for a greater number of training epochs
When considering only the behavior of the ELU activation function, Fig. 8 shows a
clearer comparison between the results on the training and validation data. As expected,
the performance for the training data is better, but overall the two values are quite close,
and this signifies that the model has good generalization capabilities.
Figure 9 shows the evolution of the mean squared error loss for a greater number of
training epochs, i.e. 100 instead of 10. One can see that there are variations in the
training and validation losses, but the values remain generally stable and the perfor-
mance improvement is not great compared to the scenario with only 10 epochs.
Therefore, we can state that the behavioral cloning model can achieve good perfor-
mance in a very small number of training epochs.
5 Conclusions
In this paper, we have experimentally shown that CNNs are capable of learning the entire
task of lane and road following without manual decomposition into road or lane
marking detection, path planning, and control. A small amount of driving training data
was sufficient to train the vehicle to operate on a road. The CNN can learn meaningful
road features from very sparse training signals, such as steering and speed. The
system learns, for example, to detect the outline of a road without the need for
specific labels during training. The best performance is obtained by using the expo-
nential linear unit (ELU) activation function.
As a future direction of research, the system can be improved such that it could
learn to drive faster on difficult roads. To this end, further effort needs to be dedicated
to the refinement of the network architecture and the proper selection of the training
data.
In the current scenario, the car is a single agent in its environment and the main goal
is to drive as close to the middle of the road as possible. The next endeavors should also
address the situations with more traffic participants and the introduction of corrective
actions before a possible accident.
Acknowledgements. This work was funded in part by Continental Automotive Romania SRL
Iași.
References
1. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In:
Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on
Machine Learning (ICML-10), pp. 807–814 (2010)
2. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Gordon, G.,
Dunson, D., Dudík, M. (eds.) JMLR W&CP: Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics (AISTATS 2011), vol. 15, pp. 315–323
(2011)
3. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B.: End to end learning for self-
driving cars (2016). https://arxiv.org/abs/1604.07316. Accessed 24 Mar 2019
4. Hochreiter, S.: The vanishing gradient problem during learning recurrent neural nets and
problem solutions. Int. J. Uncertainty Fuzziness Knowl.-Based Syst. 6(2), 107–116 (1998)
5. Russakovsky, O., Deng, J., Su, H., Krause, J.: ImageNet large scale visual recognition
challenge (ILSVRC). Int. J. Comput. Vis. 115(3), 211–252 (2014)
6. Net-Scale Technologies, Inc.: Autonomous off-road vehicle control using end-to-end
learning. Final technical report (2004). http://net-scale.com/doc/net-scale-dave-report.pdf.
Accessed 24 Mar 2019
7. LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient backprop. In: Montavon, G.,
Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700,
pp. 9–48. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_3
8. Pomerleau, D.A.: ALVINN: an autonomous land vehicle in a neural network (1989). http://
papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.pdf.
Accessed 24 Mar 2019
9. Matthews, K.: Top Article for 2018 - Here Are All the Companies Testing Autonomous Cars
In 2018 (2018). https://www.roboticstomorrow.com/article/2018/03/top-article-for-2018-
here-are-all-the-companies-testing-autonomous-cars-in-2018/11592. Accessed 24 Mar 2019
10. Hochreiter, S., Schmidhuber, J.: Feature extraction through LOCOCODE. Neural Comput.
11(3), 679–714 (1999)
11. Clevert, D.-A., Unterthiner, T., Mayr, A., Hochreiter, S.: Fast and accurate deep network
learning by exponential linear units (ELUs). In: ICLR2016 (2016)
12. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks.
In: Advances in Neural Information Processing Systems (2017). https://arxiv.org/abs/1706.
02515. Accessed 24 Mar 2019
13. Pedamonti, D.: Comparison of non-linear activation functions for deep neural networks on
MNIST classification task (2018). https://arxiv.org/abs/1804.02763. Accessed 24 Mar 2019
A Machine Hearing Framework for Real-Time
Streaming Analytics Using Lambda
Architecture
1 Introduction
(http://www.cabi.org/isc/) [9], the most valid and comprehensive database on the issue,
world-wide. The Geolocation process is presented below:
Algorithm 1. Geolocation Process
Input:
Recognized_Species;
Country;
Country_Native_Species;
1: Read Recognized_Species, Country, Country_Native_Species;
2: for i = 1 to Country_Native_Species[max] do
3:   if Country_Native_Species[i] = Recognized_Species then
4:     Recognized_Species = Native_Species
5:   else
6:     Recognized_Species = Invasive_Species
7:   end if
8: end for
Output:
Species_Identity;
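Algorithm 1 reduces to a membership test against the country's native-species list. A direct Python rendering (the species list below is a hypothetical example, not the database contents):

```python
# Algorithm 1 as a membership test: a recognized species is native if it
# appears in the country's native-species list, invasive otherwise.

def geolocate(recognized_species, country_native_species):
    if recognized_species in country_native_species:
        return "Native_Species"
    return "Invasive_Species"

# Hypothetical native-species list, for illustration only:
native_species = {"Sparus aurata", "Dicentrarchus labrax"}
assert geolocate("Sparus aurata", native_species) == "Native_Species"
assert geolocate("Lagocephalus sceleratus", native_species) == "Invasive_Species"
```

Using a set makes the lookup O(1) per recognized species, instead of the linear scan written out in the pseudocode.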
250 K. Demertzis et al.
3 Literature Review
Invasive alien species (IAS) are a result of generalized climate change and they constitute a
serious and rapidly worsening threat to natural biodiversity in Europe. The European Union
spends at least 12 billion Euros per year on the control of IAS and the disasters they cause.
The risk to public health should also not be overlooked, as these species may be toxic:
the "Lagocephalus" fish, for example, contains "Tetrodotoxin", a very dangerous substance
capable of causing serious health problems, even death, in the consumer [11, 12]. The
significance of hybrid intelligent approaches (machine learning
algorithms) for identifying IAS and separating them from indigenous species has been
demonstrated by recent research [13, 14]. Soft computing techniques are capable of
modeling and detecting cyber security threats [15, 16], and they also offer optimization
mechanisms in order to produce reliable results.
Hinton et al. [17] proposed methods and applications of DL. Through a series
of new learning architectures and algorithms, domains such as object recognition [18]
and machine translation [19, 20] have been transformed; deep learning methods are
now the state-of-the-art in object, speech and audio recognition. In particular, deep
learning has been the driving force behind large leaps in accuracy and model robust-
ness in audio related domains like audio sensing [21]. Alom et al. [22] applied the
Cellular Simultaneous Recurrent Networks (CSRNs) to generate initial filters of CNNs
for features extraction and Regularized Extreme Learning Machines (RELM) for
classification. Experiments were conducted on three popular datasets for object
recognition (such as face, pedestrian, and car) to evaluate the performance of the
proposed system. Zhang et al. [23] proposed an object recognition algorithm which
did not depend on human experts to design features for fish species classification, but
constructed efficient features automatically. Results from experiments showed that the
proposed method obtained an average of 98.9% classification accuracy with a standard
deviation of 0.96% with a dataset composed of 8 fish species and a total of 1049
images. Also, DL has been the driving force behind large leaps in accuracy and model
robustness in audio related domains like speech recognition. Moreover, Han et al. [24]
proposed to utilize DNNs to extract high level features from raw data and show that
they are effective for speech emotion recognition. Finally, Zhao et al. [25] proposed a
new method for automated field recording analysis with improved automated seg-
mentation and robust bird species classification by a Gaussian Mixture Model.
A Machine Hearing Framework for Real-Time Streaming Analytics 251
4 Algorithms
4.1 Extreme Learning Machines for Batch Data Algorithms
An ELM is a Single-Hidden-Layer Feed-Forward Neural Network (SLFFNN) [26] with
N hidden neurons, randomly selected input weights, and random bias values in the
hidden layer, while the weights at its output are calculated with a single vector-matrix
multiplication [27]. For an ELM using an SLFFNN and a random representation of the
hidden neurons, the input data are mapped to a random L-dimensional space given a
discrete training set of N samples (x_i, t_i), i \in [1, N], with x_i \in \mathbb{R}^d
and t_i \in \mathbb{R}^c. The output of the network is the following:

f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta, \quad i \in [1, N] \quad (1)

The vector \beta = [\beta_1, \ldots, \beta_L]^T is the output weight vector connecting the
hidden and output nodes. On the other hand, h(x) = [g_1(x), \ldots, g_L(x)] is the output of
the hidden nodes for input x, and g_i(x) is the output of the i-th neuron. Based on a
training set \{(x_i, t_i)\}_{i=1}^{N}, an ELM can solve the learning problem H\beta = T, where
T = [t_1, \ldots, t_N]^T are the target labels and the output matrix H of the hidden layer
is the following:

H(\omega_j, b_j, x_i) = \begin{bmatrix} g(\omega_1 \cdot x_1 + b_1) & \cdots & g(\omega_l \cdot x_1 + b_l) \\ \vdots & \ddots & \vdots \\ g(\omega_1 \cdot x_N + b_1) & \cdots & g(\omega_l \cdot x_N + b_l) \end{bmatrix}_{N \times l} \quad (2)

The input weight matrix \omega of the hidden layer (before training) and the bias
vectors b are created randomly in the interval [−1, 1], with

\omega_j = [\omega_{j1}, \omega_{j2}, \ldots, \omega_{jm}]^T \quad \text{and} \quad b_j = [b_{j1}, b_{j2}, \ldots, b_{jm}]^T \quad (3)

The output matrix H of the hidden layer is calculated by applying the
activation function to the training dataset:

H = g(\omega x + b) \quad (4)

where H = [h_1, \ldots, h_N] is the output matrix of the hidden layer and X =
[x_1, \ldots, x_N] is the input matrix of the hidden layer. Indeed, \beta can be calculated by
the following general relation:
\beta = H^{+} T \quad (6)

where H^{+} is the Moore-Penrose generalized inverse of the matrix H. This
approach employs an ELM with the Gaussian radial basis function kernel
K(u, v) = \exp(-\gamma \lVert u - v \rVert^2). The size of the hidden layer is k = 20 neurons.
Subsequently, random input weights w_i and biases b_i, i = 1, \ldots, N, are assigned.
To calculate the hidden-layer output matrix H we have used function (7):

H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix} \quad (7)

where h(x) = [h_1(x), \ldots, h_L(x)] is the output (row) vector of the hidden layer with
respect to the input x. In fact, h(x) maps the data from the D-dimensional input space
to the L-dimensional hidden-layer feature space (ELM feature space) H; thus, h(x) is
indeed a feature mapping. The goal of ELM is to minimize the training error as well as
the norm of the output weights.
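The closed-form training procedure of Eqs. (1)-(7) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it uses a tanh hidden layer instead of the Gaussian RBF kernel mentioned above, and illustrative sizes (L = 20 hidden neurons, a toy regression target).

```python
import numpy as np

# Minimal ELM sketch following Eqs. (1)-(7): random input weights and biases
# drawn from [-1, 1], a fixed nonlinear hidden layer, and output weights
# solved in closed form via the Moore-Penrose pseudoinverse (beta = H^+ T).
# A tanh hidden layer is used here for simplicity instead of the RBF kernel.

rng = np.random.default_rng(0)

def elm_fit(X, T, L=20):
    d = X.shape[1]
    W = rng.uniform(-1, 1, size=(d, L))   # random input weights, Eq. (3)
    b = rng.uniform(-1, 1, size=L)        # random biases, Eq. (3)
    H = np.tanh(X @ W + b)                # hidden-layer output matrix, Eq. (7)
    beta = np.linalg.pinv(H) @ T          # beta = H^+ T, Eq. (6)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta      # f_L(x) = h(x) beta, Eq. (1)

# Toy regression problem with an easy (linear) target:
X = rng.normal(size=(200, 3))
T = X[:, :1] + 0.5 * X[:, 1:2]
W, b, beta = elm_fit(X, T)
mse = np.mean((elm_predict(X, W, b, beta) - T) ** 2)
assert beta.shape == (20, 1)
assert mse < 0.1   # 20 random tanh features fit this target closely
```

Only the output weights are trained, which is what makes ELM training a single linear-algebra step rather than an iterative optimization.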
Considering and combining the features of ELM presented above, we introduce and
propose a new deep architecture by creating an Online Learning Multilayer Graph
Regularized Extreme Learning Machine Auto-Encoder (OSML-GRELMA). This is a
multi-layered neural network model that receives successive online data streams and uses
the unsupervised GRELMA algorithm as a basic building block, in which the output of
each level is used as the input to the next one [28].
An autoencoder is an artificial neural network used for the unsupervised learning of
efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a
set of data, with the output layer having the same number of nodes as the input
layer, and with the purpose of reconstructing its own inputs (instead of predicting a target
value Y given inputs X). Algorithm 2 is described below [28]:
of attributes. This is achieved with a modification of the base tree algorithm, which
effectively reduces the set of features examined for further splitting to random
subsets of size m, where m < M (M corresponds to the total number of attributes
examined in each case). In non-streaming bagging, each of the n base
models is trained on a Z-sized bootstrap sample created by random sampling with
replacement from the original training set. Every bootstrapped sample contains each
original training instance K times, where P(K = k) follows a binomial distribution. For large
values of Z this binomial distribution tends to a Poisson distribution with λ = 1. On the
other hand, according to the ARF approach for streaming data, a Poisson distribution
with λ = 6 is used. This "feedback" has the practical effect of increasing the probability of
assigning higher weights to instances during the training of the base models [29].
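The resampling scheme above can be sketched as follows: each base model processes every incoming instance k times, with k drawn from a Poisson distribution (λ = 1 for classical online bagging, λ = 6 for ARF). A self-contained sketch using Knuth's sampling algorithm:

```python
import random

# Streaming bagging weights: each base model processes every instance k times,
# with k ~ Poisson(lambda); lambda = 1 for classical online bagging and
# lambda = 6 for ARF, which raises the expected per-instance weight.

def poisson(lam, rng):
    # Knuth's sampling algorithm (fine for small lambda).
    threshold, k, p = 2.718281828459045 ** -lam, 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(42)
draws_bag = [poisson(1.0, rng) for _ in range(10000)]
draws_arf = [poisson(6.0, rng) for _ in range(10000)]
assert abs(sum(draws_bag) / 10000 - 1.0) < 0.1   # sample mean ~ lambda = 1
assert abs(sum(draws_arf) / 10000 - 6.0) < 0.3   # sample mean ~ lambda = 6
```

With λ = 6 an instance is almost never skipped (P(k = 0) = e^(-6) ≈ 0.25%), which is exactly the "higher weights" effect the text describes.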
ARF is an adaptation of the original Random Forest algorithm, which has been
successfully applied to a multitude of machine learning tasks. In layman's terms, the
original Random Forest algorithm is an ensemble of decision trees which are trained
using bagging and where the node splits are limited to a random subset of the original
set of features. The "adaptive" part of ARF comes from its mechanisms for adapting to
different kinds of concept drift, given the same hyper-parameters. Specifically, the three
most important aspects of the ARF algorithm are: it adds diversity through resampling
("bagging"); it adds diversity through randomly selecting subsets of features for node
splits; and it has one drift and warning detector per base tree, which causes selective
resets in response to drifts. It also allows training background trees, which start training
if a warning is detected and replace the active tree if the warning escalates to a drift.
ARF was designed to be "embarrassingly" parallel; in other words, there are no
dependencies between the trees. The overall ARF pseudo-code is presented below in
Algorithm 4 [29].
Algorithm 4.
where m: maximum features evaluated per split; n: total number of trees (n = |T|);
dw: warning threshold; dd: drift threshold; c(): change detection method; S: data
stream; B: set of background trees; W(t): weight of tree t; P(): learning performance
estimation function [29].
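The per-tree warning/drift mechanism described above can be sketched with a deliberately simplified detector. The error-rate threshold detector below is a hypothetical stand-in for the ADWIN-style detectors used in actual ARF implementations; only the escalation logic (ok → warning → drift) mirrors the description:

```python
# Simplified per-tree change detection: a warning starts a background tree,
# a drift replaces the active tree. The error-rate threshold detector is a
# hypothetical stand-in for the ADWIN-style detectors of real ARF code.

class SimpleDetector:
    def __init__(self, warn=0.4, drift=0.6, window=50):
        self.warn, self.drift, self.window = warn, drift, window
        self.errors = []

    def update(self, was_error):
        # Keep the error indicators of the last `window` instances.
        self.errors = (self.errors + [1 if was_error else 0])[-self.window:]
        rate = sum(self.errors) / len(self.errors)
        if rate >= self.drift:
            return "drift"     # replace the active tree with its background tree
        if rate >= self.warn:
            return "warning"   # start training a background tree
        return "ok"

det = SimpleDetector()
# Error rate jumps from 0% to 100%: the state should escalate ok -> warning -> drift.
states = [det.update(False) for _ in range(50)] + [det.update(True) for _ in range(50)]
assert states[49] == "ok" and states[69] == "warning" and states[79] == "drift"
```

Because each tree owns its own detector, only the trees actually affected by a concept drift are reset, which is the "selective resets" behavior of ARF.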
5 Datasets
The following four main categories of sounds have been determined in order to create
highly complex scenarios that can potentially include the most likely cases that can be
detected in an underwater space:
• Fishes: Several species of fish produce sounds with various mechanisms such as
the teeth, the pharynx, the fins, and the swim bladder. 1076 sounds belonging to 10 fish
species have been included in this category (e.g. Bidyanus bidyanus, Epinephelus
adscensionis, Cynoscion regalis, Carassius auratus, Cyprinus carpio, Rutilus
rutilus, Salmo trutta, Oreochromis mossambicus, Micropterus salmoides,
Oncorhynchus mykiss).
• Mammals: Marine mammals produce and use sounds to orientate and communicate
with each other. In total, 836 sounds belonging to 8 species of mammals are
included in this category (e.g. Delphinus delphis, Erignathus barbatus, Balaena
mysticetus, Phocoena phocoena, Neophocaena phocaenoides, Trichechus, Tursiops
truncatus, Phoca hispida).
• Anthropogenic Sounds: This category comprises 684 sounds belonging to 9 classes (Ship,
Sonar, Zodiac, Torpedo, Wind Turbine, Scuba Noise, Bubble Curtain, Personal
Water Craft, Airgun).
• Natural Sounds: In total, 477 sounds are included here, classified into six clusters (Earth-
quake, Hydrothermal Vents, Ice Cracking, Rainfall, Lightning, Waves).
The feature extraction process [30] enables capturing the characteristics that pre-
cisely determine the uniqueness of each sound and helps distinguish between acoustic
categories. The distinction between categories is based on 34 features related to statistical
measurements obtained from the frequency information of the signal. In this research effort
we have extracted the short-term feature sequences for each audio signal, using a frame
size of 50 ms and a frame step of 25 ms (50% overlap). All sounds had a sampling rate
of 44.1 kHz and 16-bit stereo resolution, while their average duration was 10.3 s.
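The short-term analysis described above can be sketched as follows; RMS energy stands in for the 34 statistical features actually used, and the frame sizes follow the stated 50 ms / 25 ms setup:

```python
import numpy as np

# Short-term framing as described: 50 ms frames, 25 ms step (~50% overlap),
# 44.1 kHz sampling rate. RMS energy per frame stands in for the 34
# statistical features used in the paper.

SR = 44100
FRAME = int(0.050 * SR)   # 2205 samples per frame
STEP = int(0.025 * SR)    # 1102 samples per step

def short_term_rms(signal):
    n = 1 + max(0, (len(signal) - FRAME) // STEP)
    frames = np.stack([signal[i * STEP: i * STEP + FRAME] for i in range(n)])
    return np.sqrt(np.mean(frames ** 2, axis=1))

# One second of a 440 Hz tone: every frame should have RMS ~ 1/sqrt(2).
sig = np.sin(2 * np.pi * 440 * np.arange(SR) / SR)
feats = short_term_rms(sig)
assert FRAME == 2205 and STEP == 1102
assert len(feats) == 39
assert abs(float(feats[0]) - 2 ** -0.5) < 1e-3
```

A real pipeline would compute the full statistical feature vector per frame; the framing arithmetic (one second of audio yields 39 overlapping frames) is the part fixed by the stated setup.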
6 Results
In data batch cases using multiple classifiers, estimating the real error during
training requires the full probability density of both categories to be known [31, 32]. The
classification performance is estimated by the Total Accuracy (TA), Root Mean
Squared Error (RMSE), Precision (PRE), Recall (REC), F-Score and ROC Area
indices [33, 34]. 10-fold cross validation is employed at this stage in order to
obtain the performance indices. Analytical values of the predictive capacity of the algo-
rithm are presented in the following Tables 1, 2, 3, 4, 5 and 6.
In the case of stream data classification, we need to compare classification per-
formance in terms of Accuracy, the Kappa statistic and the Kappa-Temporal statistic. This is
done using either the traditional immediate setting, where the true label is presented right after
the instance is used for testing, or the delayed setting, where there is a real delay between
the moment an instance is presented and the moment its true label becomes available
[33].
Table 7 below, presents the results of the scenarios applied on streaming data in
this research. Validation of the results was done by employing the Prequential Eval-
uation method [34]. The training window used 1000 instances. It should be clarified
that the following Table 7 uses average values for every evaluation measure.
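Prequential evaluation itself is simple to sketch: each instance is first used to test the current model and only then to train it, with accuracy averaged over a sliding window (here 1000 instances, as above). The toy majority-class learner is a hypothetical stand-in for the actual classifiers:

```python
from collections import deque

# Prequential (test-then-train) accuracy over a sliding window of 1000
# instances: every instance is used for testing before it is used for training.

def prequential_accuracy(model, stream, window=1000):
    recent = deque(maxlen=window)
    for x, y in stream:
        recent.append(1 if model.predict(x) == y else 0)  # test first...
        model.learn(x, y)                                 # ...then train
    return sum(recent) / len(recent)

class MajorityClass:
    """Toy learner that predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

# 900 "a" instances followed by 100 "b" instances:
stream = [(i, "a") for i in range(900)] + [(i, "b") for i in range(100)]
acc = prequential_accuracy(MajorityClass(), stream)
assert abs(acc - 0.899) < 1e-9   # 899 of 1000 test-then-train decisions correct
```

Because every prediction is made before the label is revealed, prequential accuracy never rewards memorization, which is why it is the standard protocol for data streams.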
7 Conclusions
This paper presents an innovative, reliable, low-demand and highly effective system of
MAHE and sound analysis, based on sophisticated computational intelligence. The
development of FLAME_H is based on the optimal combination of two highly efficient
and fast learning algorithms that create a comprehensive intelligent system of active
environmental security using a Lambda Architecture approach. The sophisticated
application described herein, combined with the promising results that have emerged,
constitutes a credible innovative proposal for the standardization and design of
biosecurity and biodiversity protection. This implementation follows a Reactive
strategy for dealing with invasive species, as it combines the training of two diametrically
opposite classifiers to detect incoming invasive species and to discard them.
Training is done by using datasets that respond to specialized, realistic scenarios. In
addition, this framework implements a Big data analysis approach that attempts to
balance latency, throughput, and fault tolerance using integrated and accurate views of
historical data, while at the same time it is making optimum use of new entrant data
flows. The operating scenarios proposed with the combination of batch and streaming
data, create capabilities for a fully-defined configuration of model’s parameters and for
high-precision classification or correlation.
The basic innovation of the proposed FLAME_H is the implementation of an
intelligent ML system based solely on fully automated methods of detecting sound
events using COIN. This innovation provides important solutions and improves the
way environmental problems are handled and, in particular, the way biodiversity and
bio-security mechanisms operate. A further significant innovation is the architecture of the
proposed computational intelligence system, which combines and exploits the Lambda
architecture, that is, the combination of both batch and streaming data analysis, using fast
and extremely accurate ML algorithms to solve a multidimensional and complex real-life
problem. ML delivers intelligence and significantly boosts environmental protec-
tion mechanisms, as it is an important defense tool against asymmetric environmental
threats. FLAME_H simplifies and automates the sound recognition and
invasive species detection procedures, while minimizing human intervention, by
combining the EL_GROSEMMARI and ARF algorithms for the first time in the lit-
erature. Finally, one more innovation is found in the way the data were collected and
selected (which emerged after extensive research), as well as in the development of the
final data set used, which is complex and of high dimensionality but can be used
effectively in training.
Future extensions-improvements should focus on further optimizing the parameters
of the algorithm used by the Lambda architecture, so that an even more efficient,
accurate, and faster classification process is achieved. Also, it would be important to
study the expansion of this particular system by implementing Lambda architecture in a
parallel and distributed data analysis system (Hadoop). Finally, an additional element
that could be studied in the direction of future expansion concerns the operation of
FLAME_H with methods of self-improvement and meta-learning in order to fully
automate the process of locating the species.
References
1. Rahel, F., Olden, J.D.: Assessing the effects of climate change on aquatic invasive species.
Conserv. Biol. 22(3), 521–533 (2008). https://doi.org/10.1111/j.1523-1739.2008.00950.x
2. Abdulla, A., Linden, O.: Maritime Traffic Effects on Biodiversity in the Mediterranean Sea:
Review of Impacts, Priority Areas and Mitigation Measures. IUCN, Centre for Mediter-
ranean Cooperation, vol. 184, pp. 08 (2008)
3. Miller, W.: The structure of species, outcomes of speciation and the species problem: ideas
for paleobiology. Palaeogeogr. Palaeoclimatol. Palaeoecol. 176(1–4), 1–10 (2001).
https://doi.org/10.1016/s0031-0182(01)00346-7
4. Lyon, R.: Human and Machine Hearing: Extracting Meaning from Sound. Cambridge
University Press, Cambridge (2017). https://doi.org/10.1017/9781139051699
5. Deng, L., Yu, D.: Deep learning: methods and applications. Found. Trends Signal Process. 7
(3–4), 197–387 (2014). https://doi.org/10.1561/2000000039
6. Zhang, J., Yin, J., Zhang, Q., Shi, J., Li, Y.: Robust sound event classification with bilinear
multi-column ELM-AE and two-stage ensemble learning. Eurasip J. Audio Speech Music
Process. 2017(1), 11 (2017). https://doi.org/10.1186/s13636-017-0109-1
7. Dedić, N., Stanier, C.: Towards differentiating business intelligence, big data, data analytics
and knowledge discovery. In: Piazolo, F., Geist, V., Brehm, L., Schmidt, R. (eds.) ERP
Future 2016. LNBIP, vol. 285, pp. 114–122. Springer, Cham (2017). https://doi.org/10.
1007/978-3-319-58801-8_10
8. Kiran, M., Murphy, P., Monga, I., Dugan, J., Baveja, S.S.: Lambda architecture for cost-
effective batch and speed big data processing, pp. 2785–2792 (2015). https://doi.org/10.
1109/bigdata.2015.7364082
9. http://www.cabi.org/isc
10. Yamato, Y., Kumazaki, H., Fukumoto, Y.: Proposal of lambda architecture adoption for real
time predictive maintenance. In: 2016 CANDAR, pp. 713–715. IEEE (2016). https://doi.org/
10.1109/candar.2016.0130
11. Demertzis, K., Iliadis, L.: Detecting invasive species with a bio-inspired semisupervised
neurocomputing approach: the case of Lagocephalus sceleratus. Neural Comput. Appl. 28
(6), 1225–1234 (2017). https://doi.org/10.1007/s00521-016-2591-2
12. Demertzis, K., Iliadis, L.: Intelligent bio-inspired detection of food borne pathogen by DNA
barcodes: the case of invasive fish species lagocephalus sceleratus. In: Iliadis, L., Jayne, C.
(eds.) Engineering Applications of Neural Networks EANN 2015 Communications in
Computer and Information Science, vol. 517, pp. 89–99. Springer, Cham (2015). https://doi.
org/10.1007/978-3-319-23983-5_9
13. Demertzis, K., Iliadis, L., Anezakis, V.D.: A deep spiking machine-hearing system for the
case of invasive fish species. In: Proceedings of 2017 IEEE International Conference on
Innovations in Intelligent Systems and Applications, Gdynia, Poland, pp. 23–28 (2017).
https://doi.org/10.1109/inista.2017.8001126
14. Demertzis, K., Iliadis, L.: Adaptive elitist differential evolution extreme learning machines
on big data: intelligent recognition of invasive species. In: Angelov, P., Manolopoulos, Y.,
Iliadis, L., Roy, A., Vellasco, M. (eds.) INNS 2016. AISC, vol. 529, pp. 333–345. Springer,
Cham (2017). https://doi.org/10.1007/978-3-319-47898-2_34
15. Demertzis, K., Iliadis, L.: Evolving smart URL filter in a zone-based policy firewall for
detecting algorithmically generated malicious domains. In: Gammerman, A., Vovk, V.,
Papadopoulos, H. (eds.) SLDS 2015. LNCS (LNAI), vol. 9047, pp. 223–233. Springer,
Cham (2015). https://doi.org/10.1007/978-3-319-17091-6_17
16. Demertzis, K., Iliadis, L.: SAME: an intelligent anti-malware extension for android art
virtual machine. In: Núñez, M., Nguyen, N.T., Camacho, D., Trawiński, B. (eds.) ICCCI
2015. LNCS (LNAI), vol. 9330, pp. 235–245. Springer, Cham (2015). https://doi.org/10.
1007/978-3-319-24306-1_23
17. Hinton, G., Deng, L., Yu, D., et al.: Deep neural networks for acoustic modeling in speech
recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2014). https://doi.org/10.1109/MSP.
2012.2205597
260 K. Demertzis et al.
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional
neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.)
Proceedings of the 25th International Conference on Neural Information Processing
Systems, NIPS 2012, USA, pp. 1097–1105 (2012)
19. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural
networks with multitask learning. In: William, W., Mccallum, C., Mccallum, A., Sam, T.R.
(eds.) Proceedings of the 25th International Conference on Machine Learning, ICML 2008,
pp. 160–167. ACM, New York (2008). https://doi.org/10.1145/1390156.1390177, ISBN
978-1-60558-205-4
20. Deselaers, T., Hasan, S., Bender, O., Ney, H.: A deep learning approach to machine
transliteration. In: Proceedings of the Fourth Workshop on Statistical Machine Translation
StatMT 2009, pp. 233–241. Association for Computational Linguistics, Stroudsburg (2009)
21. Lanez, N.D., Georgievy, P., Qendro, L.: DeepEar: robust smartphone audio sensing in
unconstrained acoustic environments using deep learning. In: Proceedings of the 2015 ACM
International Joint Conference on Pervasive and Ubiquitous Computing, pp. 283–294
(2015). http://dx.doi.org/10.1145/2750858.2804262
22. Alom, M.Z., Alam, M., Taha, T.M., Iftekharuddin, K.M.: Object recognition using cellular
simultaneous recurrent networks and convolutional neural network. In: Proceedings of the
International Joint Conference on Neural Networks (IJCNN 2017), pp. 2873–2880,
Anchorage (2017). https://doi.org/10.1109/ijcnn.2017.7966211
23. Zhang, D., Lee, D.J., Zhang, M., Tippetts, B.J., Lillywhite, K.D.: Object recognition
algorithm for the automatic identification and removal of invasive fish. Biosyst. Eng. 145,
65–75 (2016). https://doi.org/10.1016/j.biosystemseng
24. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and
extreme learning machine. In: Chng, E.S., Li, H., Meng, H., Ma, B., Xie, L. (eds.)
INTERSPEECH 2014, Proceedings of the Annual Conference of the International Speech
Communication Association, pp. 223–227 (2014)
25. Zhao, Z., Zhang, S.H., Xu, Z.Y., et al.: Automated bird acoustic event detection and robust
species classification. Ecol. Inf. 39, 99–108 (2017). https://doi.org/10.1016/j.ecoinf.2017.04.
003
26. Cambria, E., Guang-Bin, H.: Extreme learning machines. In: IEEE InTeLLIGenT
SYSTemS, 541-1672/13 (2013)
27. Huang, G.B.: An insight into extreme learning machines: random neurons, random features
and kernels. Cogn. Comput. 6(3), 376–390 (2014). https://doi.org/10.1007/s12559-014-
9255-2
28. Sun, K., Zhang, J., Zhang, C., Hu, J.: Generalized extreme learning machine autoencoder
and a new deep neural network. Neurocomputing 230, 374–381 (2017)
29. Gomes, H.M., Bifet, A., Read, J., et al.: Adaptive random forests for evolving data stream
classification. Mach. Learn. 106(9–10), 1469–1495 (2017). https://doi.org/10.1007/s10994-
017-5642-8
30. Giannakopoulos, T.: pyAudioAnalysis: an open-source Python library for audio signal
analysis. PLoS ONE 10(12) (2015). https://doi.org/10.1371/journal.pone.0144610
31. Žliobaitė, I., Bifet, A., Read, J., Pfahringer, B., Holmes, G.: Evaluation methods and
decision theory for classification of streaming data with temporal dependence. Mach. Learn.
98(3), 455–482 (2015). https://doi.org/10.1007/s10994-014-5441-4
A Machine Hearing Framework for Real-Time Streaming Analytics 261
32. Vinagre, J., Jorge, A.M., Gama, J.: Evaluation of recommender systems in streaming
environments. In: Workshop on Recommender Systems Evaluation: Dimensions and Design
(REDD 2014), Held in Conjunction with RecSys (2014). https://doi.org/10.13140/2.1.4381.
5367
33. Mao, J., Jain, A.K., Duin, P.W.: Statistical pattern recognition: a review. IEEE Trans. Pattern
Anal. Mach. Intell. 22(1), 4–37 (2000). https://doi.org/10.1109/34.824819
34. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. Sci. 27(8), 861–874
(2006). https://doi.org/10.1016/j.patrec.2005.10.010
Deep Learning and Change Detection
for Fall Recognition
1 Introduction
It is widely known that falls in older people are among the main causes of fatal injury, while nonfatal fall injuries usually require hospitalization. An indicative example is the fall statistics of the U.S. Centers for Disease Control and Prevention (falls and fall-related injuries): around 2.8 million injuries treated in emergency departments are recorded annually, of which 800,000 require hospitalization and more than 27,000 end in death.1 Although falls seem to affect mainly the elderly, there are also other cases of fall-related injuries, such as among
1 https://www.cdc.gov/injury/wisqars/.
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 262–273, 2019.
https://doi.org/10.1007/978-3-030-20257-6_22
2 Related Work
Fall detection methods can be categorized into methods based on (i) vision, (ii) ambient sensors in the environment and (iii) wearable devices. The methods based on wearables often rely on smart sensors with embedded processing capabilities which can be attached to the human body. Because of this, they are more attractive to the elderly, as they are practical and low cost, can easily be used all day and can detect falls that may take place in arbitrary locations.
Nowadays, accelerometer-based approaches have gained ground in fall detection, since technological evolution has improved both their usability and the data parsing process [35]. Accelerometers are now built into smartphones, allowing the design and implementation of accurate and timely fall accident detection in a single device. In this direction, several studies have been proposed with promising results. Kau et al. [17] proposed a fall accident detection model using a smartphone and third generation (3G) networks. Similarly, Aguiar et al. [2] proposed a smartphone-based fall detection system applying a decision tree classification algorithm for data analysis. Maglogiannis et al. paired a Pebble Smart Watch with an Android device in an attempt to recognize the activity type, calculate the energy consumption and detect falls [22]. Shen et al. [29]
264 S. K. Tasoulis et al.
where a_x^2, a_y^2 and a_z^2 are the squared accelerations along the x, y and z axes, respectively. Then we
Fig. 2. An example of the total acceleration a_total (vertical axis, 0–10) over time instants 0–2000, annotated according to the raw sample class (normal or fall). (Color figure online)
maintaining the model efficiency as much as possible. The feature maps of the last convolutional layer are flattened and fed into fully connected layers of neurons, producing feature maps with a dimension of 1 × 1 elements. These layers belong to the class of trainable layers (training is performed by finding suitable values for the connection weights).
5 Experimental Results
In the first part of our experimental analysis we investigate the performance of the CUSUM methodology on the aforementioned dataset. The three-dimensional time series is transformed to a univariate one using the methodology described in Sect. 4. A sample of the resulting time series can be visually inspected in Fig. 2, where the samples belonging to different classes are represented with different colors. As previously discussed, we observe the time-series nature of the dataset, where falls occur after a number of normal activities. Finally, we also observe situations where, although a fall is taking place, the total acceleration is not significantly high, while on the other hand there are normal activities with high values of total acceleration.
In what follows, we also employ the predefined separation into train and test sets and use the train set for parameter estimation, so that θ0 = μ0 and θ1 = μ1, where μ0 and μ1 are the mean values of the samples belonging to the normal activity and fall activity classes, respectively. We then investigate the h parameter with respect to the number of false alarms and missed falls. A detected change is considered a false alarm if the label of the current sample at the time instant of the reported change belongs to the normal activity class. In addition, if the algorithm does not detect a change during the time period of an actual fall, this is considered a missed fall. The results are reported in Table 2. Considering that we need to minimize lost falls while also reducing computations as much as possible, we intuitively choose the value 0.5 as most appropriate for the h parameter, since lower values imply a significantly higher number of false alarms. An example of the CUSUM function along with the reported true and false alarms in different colors is illustrated in Fig. 3 for the same time series example used in Fig. 2.

Fig. 3. An example of the CUSUM function along with the reported changes as vertical lines. Blue lines correspond to false alarms while red lines correspond to correct reports. (Color figure online)

Table 2. The h parameter analysis with respect to reported changes, false alarms and lost falls.

h value       0.1  0.3  0.5  0.7  0.9  1.1  1.3  1.5  1.7  1.9
Changes       884  668  551  476  423  382  352  332  318  295
False alarms  685  482  368  297  247  207  179  161  149  132
Lost falls      0    2    2    6    7    8   10   11   15   20
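The detection scheme evaluated above can be sketched as a one-sided CUSUM for a shift in mean from θ0 = μ0 to θ1 = μ1; this is a minimal sketch, assuming Gaussian samples with unit variance and a restart of the statistic after each alarm (both assumptions, not stated by the authors):

```python
import numpy as np

def cusum_alarms(x, theta0, theta1, h, sigma=1.0):
    """One-sided CUSUM for a mean shift theta0 -> theta1.

    Accumulates the log-likelihood ratio increments, clipped at zero,
    and reports an alarm whenever the decision function exceeds h."""
    inc = (theta1 - theta0) / sigma**2 * (x - (theta0 + theta1) / 2.0)
    g, alarms = 0.0, []
    for t, s in enumerate(inc):
        g = max(0.0, g + s)
        if g > h:
            alarms.append(t)
            g = 0.0  # restart detection after reporting a change
    return alarms

# Toy series: 50 "normal" samples around 0, then 50 "fall" samples around 2
x = np.concatenate([np.zeros(50), np.full(50, 2.0)])
print(cusum_alarms(x, theta0=0.0, theta1=2.0, h=5.0)[:3])  # -> [52, 55, 58]
```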
At this point we can estimate the computational savings of the proposed approach by reporting the ratio between the total number of reported changes for h = 0.5 (551) and the total number of time windows constituting the train set. For the non-overlapping windows the computational savings are at least 61% and up to 87%, while for the overlapping windows case they are up to 94%, which means that there are 94% fewer transmissions of sample batches between the devices and similarly fewer classifications by the CNN on the smartwatch device. Finally, we examine the performance of the method for h = 0.5 on the test set with respect to false alarms and missed falls. The algorithm reports 334 detected changes, of which 232 are false alarms, while there are no missed falls, suggesting the coherent applicability of the method.
Having accomplished the task of minimizing computations at minimal cost, we proceed to the next step of our analysis. To this end, we employ a series of classification methodologies to compare against the described CNN architecture, in an attempt to justify the use of deep learning for the classification task at hand. These are Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), k-Nearest Neighbors (KNN) and Artificial Neural Networks (ANN). A basic linear kernel was chosen for SVM, the k parameter was set to 10 for KNN, while 50 epochs and a batch size equal to the number of samples were used to train the ANN. All methods are applied to the datasets described
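For illustration, the baseline comparison above can be sketched with scikit-learn on synthetic two-class data; the data, features, and use of scikit-learn are assumptions, and the ANN baseline is omitted for brevity:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for feature windows: class 0 = normal, class 1 = fall
X = np.vstack([rng.normal(0.0, 1.0, (200, 30)), rng.normal(1.5, 1.0, (200, 30))])
y = np.array([0] * 200 + [1] * 200)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="linear"),                  # basic linear kernel, as in the text
    "KNN": KNeighborsClassifier(n_neighbors=10),  # k = 10, as in the text
}
for name, m in models.items():
    print(name, round(m.fit(Xtr, ytr).score(Xte, yte), 2))
```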
6 Concluding Remarks
Most of the classical approaches used for activity recognition and fall detection
rely on heuristics or hand-crafted feature extraction methods, which limit gen-
eralization. In addition, existing solutions on device applications are rare and
require extensive resource optimization In this work, we make an attempt to
tackle both of these challenges by proposing a method with wide generalization
by operation upon raw sensor accelerometer signal while preserving relatively
low resource demands. To achieve this we combine an efficient change detec-
tion algorithm along with Deep Learning that as shown in our experimental
analysis significantly enhance classification accuracy. The propose scheme pro-
vide details about the on-line implementation of this methodology employing a
smartwatch/smartband device and a smartphone. In our future work, we intend
to construct and examine real world datasets generated specifically for this con-
cept while providing further algorithmic developments.
Acknowledgments. This project has received funding from the Hellenic Foundation
for Research and Innovation (HFRI) and the General Secretariat for Research and
Technology (GSRT), under grant agreement No. 1901.
References
1. Abujiya, M., Riaz, M., Lee, M.H.: Enhanced cumulative sum charts for monitoring
process dispersion. PLoS ONE 10, e0124520 (2015)
2. Aguiar, B., Rocha, T., Silva, J., Sousa, I.: Accelerometer-based fall detection for
smartphones. In: 2014 IEEE International Symposium on Medical Measurements
and Applications (MeMeA), pp. 1–6. IEEE (2014)
3. Bagalà, F., et al.: Evaluation of accelerometer-based fall detection algorithms on
real-world falls. PLOS ONE 7(5), 1–9 (2012)
4. Bourke, A., ÓLaighin, G.: A threshold-based fall-detection algorithm using a bi-
axial gyroscope sensor. Med. Eng. Phys. 30, 84–90 (2008)
5. Brodersen, K.H., Ong, C.S., Stephan, K.E., Buhmann, J.M.: The balanced accu-
racy and its posterior distribution. In: 2010 20th International Conference on Pat-
tern Recognition, pp. 3121–3124, August 2010
6. Brynolfsson, J., Sandsten, M.: Classification of one-dimensional non-stationary sig-
nals using the Wigner-Ville distribution in convolutional neural networks. In: 2017
25th European Signal Processing Conference (EUSIPCO), pp. 326–330, August
2017
7. Castillo, J.C., Carneiro, D., Serrano-Cuerda, J., Novais, P., Fernández-Caballero,
A., Neves, J.: A multi-modal approach for activity classification and fall detection.
Int. J. Syst. Sci. 45(4), 810–824 (2014)
8. Chen, D., Feng, W., Zhang, Y., Li, X., Wang, T.: A wearable wireless fall detection
system with accelerators. In: 2011 IEEE International Conference on Robotics and
Biomimetics, pp. 2259–2263, December 2011
9. Chen, L., Hoey, J., Nugent, C.D., Cook, D.J., Yu, Z.: Sensor-based activity recog-
nition. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(6), 790–808 (2012)
10. Georgakopoulos, S.V., Tasoulis, S.K., Maglogiannis, I., Plagianakos, V.P.: On-line
fall detection via mobile accelerometer data. In: Chbeir, R., Manolopoulos, Y.,
Maglogiannis, I., Alhajj, R. (eds.) AIAI 2015. IAICT, vol. 458, pp.
103–112. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23868-5_8
11. Georgakopoulos, S.V., Tasoulis, S.K., Plagianakos, V.P.: Efficient change detection
for high dimensional data streams. In: 2015 IEEE International Conference on Big
Data (Big Data), pp. 2219–2222, October 2015
12. Granjon, P.: The CUSUM algorithm a small review (2014)
13. Greene, S., Thapliyal, H., Carpenter, D.: IoT-based fall detection for smart home
environments. In: 2016 IEEE International Symposium on Nanoelectronic and
Information Systems (iNIS), pp. 23–28, December 2016
14. Hsieh, K., Heller, T., Miller, A.B.: Risk factors for injuries and falls among adults
with developmental disabilities. J. Intellect. Disabil. Res. 45(1), 76–82 (2001)
15. Huang, C.L., Chung, C.Y.: A real-time model-based human motion tracking and
analysis for human-computer interface systems. EURASIP J. Adv. Signal Process.
2004(11), 616891 (2004)
16. Igual, R., Medrano, C.T., Plaza, I.: Challenges, issues and trends in fall detection
systems. Biomed. Eng. Online 12, 66 (2013)
17. Kau, L.J., Chen, C.S.: A smart phone-based pocket fall accident detection, posi-
tioning, and rescue system. IEEE J. Biomed. Health Inform. 19(1), 44–56 (2015)
18. Kepski, M., Kwolek, B.: Fall detection using ceiling-mounted 3D depth camera.
In: 2014 International Conference on Computer Vision Theory and Applications
(VISAPP), vol. 2, pp. 640–647. IEEE (2014)
19. Kwolek, B., Kepski, M.: Human fall detection on embedded platform using depth
maps and wireless accelerometer. Comput. Methods Programs Biomed. 117(3),
489–501 (2014)
20. Ma, X., Wang, H., Xue, B., Zhou, M., Ji, B., Li, Y.: Depth-based human fall
detection via shape features and improved extreme learning machine. IEEE J.
Biomed. Health Inform. 18(6), 1915–1922 (2014)
21. Maglogiannis, I., Doukas, C.: Intelligent health monitoring based on pervasive tech-
nologies and cloud computing. Int. J. Artif. Intell. Tools 23(03), 1460001 (2014)
22. Maglogiannis, I., Ioannou, C., Tsanakas, P.: Fall detection and activity identi-
fication using wearable and hand-held devices. Integr. Comput.-Aided Eng. 23,
161–172 (2016)
23. Manganaro, G., de Gyvez, J.P.: One-dimensional discrete-time CNN with multi-
plexed template-hardware. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl.
47(5), 764–769 (2000)
24. Mauldin, T.R., Canby, M.E., Metsis, V., Ngu, A.H.H., Rivera, C.C.: SmartFall: a
smartwatch-based fall detection system using deep learning. Sensors 18(10), 3363
(2018)
25. Gia, T.N., et al.: IoT-based fall detection system with energy efficient sensor nodes,
November 2016
26. Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2), 100–115 (1954)
27. Perry, M., Pignatiello, J.J.: Estimating the time of step change with Poisson
CUSUM and EWMA control charts. Int. J. Prod. Res. 49, 2857–2871 (2011)
28. Pierleoni, P., Belli, A., Palma, L., Pellegrini, M., Pernini, L., Valenti, S.: A high
reliability wearable device for elderly fall detection. IEEE Sens. J. 15(8), 4544–4553
(2015)
29. Shen, R.K., Yang, C.Y., Shen, V.R., Chen, W.C.: A novel fall prediction system
on smartphones. IEEE Sens. J. 17(6), 1865–1871 (2017)
30. Tasoulis, S., Doukas, C., Plagianakos, V., Maglogiannis, I.: Statistical data mining
of streaming motion data for activity and fall recognition in assistive environments.
Neurocomputing 107, 87–96 (2013)
31. Tong, L., Song, Q., Ge, Y., Liu, M.: HMM-based human fall detection and predic-
tion method using tri-axial accelerometer. IEEE Sens. J. 13(5), 1849–1856 (2013)
32. Tran, P.H., Tran, K.P.: The efficiency of CUSUM schemes for monitoring the coef-
ficient of variation. Appl. Stoch. Model. Bus. Ind. 32(6), 870–881 (2016)
33. Wang, D., Zhang, L., Xiong, Q.: A nonparametric CUSUM control chart based on
the Mann-Whitney statistic. Commun. Stat.-Theory Methods 46, 2017 (2017)
34. Wang, J., Zhang, Z., Bin, L., Lee, S., Sherratt, R.: An enhanced fall detection
system for elderly person monitoring using consumer home networks. IEEE Trans.
Consum. Electron. 60, 23–29 (2014)
35. Xu, T., Zhou, Y., Zhu, J.: New advances and challenges of fall detection systems:
a survey. Appl. Sci. 8(3), 418 (2018)
Image Classification Using Deep Neural
Networks: Transfer Learning
and the Handling of Unknown Images
Pattern Recognition (PR) can be considered as a form of machine learning that deals
with the recognition of patterns and regularities in data [1]. In supervised pattern
recognition, a pattern is recognized to fit one of a number of pre-defined classes. This is
known as classification. In general, however, it is difficult to have a manageable
number of classes that can deal with every possible known image. This problem is
made even more difficult with the introduction of unknown images (whose class is not
known) to the system. A simple supervised system will classify the unknown images
into one of the known classes depending on the nearest similar class. There is no
mechanism for the system to reject unknown images. In this work, this problem is
addressed. Three distinct applications (gears, connectors and coins) are examined for
image classification test purposes.
ANN is most commonly applied as a supervised PR algorithm and is unable to deal
with unknown classes. Reference [2] achieved 92% accuracy in extracting numerical
information from partially degraded and aged Indian coins. They used a rotation
invariant character recognition process which employed a multichannel Gabor filter
together with a back-propagation ANN. Reference [3] studied the problem of coin
recognition with ANNs and concluded that the accuracy of the system depended on a
number of factors, including the number of images per class, the number of classes and
the training-testing strategies.
Study of Deep Neural Networks (DNN) is a very active area of research. Several
research papers on deep learning are available for image classification [4–7]. DNN is a
multilayer neural network that uses a parallel computing approach to reduce processing
time. It is claimed that DNNs can approach the performance of the human brain [8].
Reference [9] demonstrated that discriminative DNN models can be easily fooled, that
is, they classify many unrecognizable images with near certainty as a member of a
recognizable class. To counter this criticism, Ref. [10] suggested that the likelihood of
negative images (or fooling images) appearing in a training dataset was low, and thus
such images were rarely or never observed in the testing dataset.
Transfer learning is a concept in machine learning where a model developed
previously for one application is reused for a new but similar application with
some modifications [11]. This is useful in DNNs, as the databases tend to be very large.
More information on transfer learning can be found in [12]. An interesting study of
feature transferability was conducted by [13], which concluded that initialization with
transferred features can improve generalization performance even after fine-tuning on a
new problem. Our objective was to evaluate transfer learning as applied to AlexNet [4].
AlexNet is trained on millions of images across 1000 classes. It is considered
more accurate than LeNet [14] and less complex than GoogleNet [15].
The images in AlexNet's database were also considered to be similar to the images in
our own database.
2 Experimental Setup
The experimental setup was developed for the three applications: (1) small plastic
gears, (2) clear wire connectors and (3) metallic Indian coins. A total of 4 classes were
prepared for training and a 5th class was added for testing (to assess the system's
ability to deal with an unknown class).
Figure 1 shows the experimental setup and Fig. 2 gives sample images of the 5
classes for the 3 applications. The parts were fed to the conveyor using a plastic chute.
The conveyor in the system was from Dorner, Model 2200 with a flush mounting
package. The conveyor did not have sidewalls, a feature that provided for flexible
lighting and camera arrangements. The speed of the conveyor was adjustable from 0.5
to 50 m/min, which provided a way to test the system at different speeds and to check
whether system performance was sensitive to speed. A stretchable black fabric was
used to provide a black background for the image. To detect the presence of a part, a
276 V. Chauhan et al.
proximity sensor with a range of 4 mm was used. Once a part was detected by
the sensor, it signaled the camera through an Arduino Uno R3 microcontroller to
acquire an image.
Fig. 1. Experimental setup for part recognition task (Color figure online)
Fig. 2. Sample images of five prepared classes for the three applications
A smart camera from National Instruments, an NI 1732, was used for image acquisition, with the aim of running the system in real time. It had an inbuilt processor which could run a program developed in the Vision Builder software. For the selected applications, the largest part was 44 mm square with the smallest feature around 1 mm. Therefore, a resolution of 640 × 480 was found to be sufficient. A lens from Kowa, LM6JC, with a 6 mm focal length was used with the camera.
Previous work suggested using dark-field lighting [16], as it can minimize the effect of shadows on the visible pattern of the parts in the image. Generally, white light provides more color detail. However, red light was selected, as color was not a significant differentiator for the selected applications. Moreover, red light can be used for most industrial applications [17] involving a digital camera. A diffused industrial-grade dark-field light from Advanced Illumination, AI 1660, was selected to illuminate the parts. The diffuser was prepared from wax paper to soften the effect of the sharp lighting. The light intensity depended on the supplied voltage, in the range of 16–24 V. An intensity with a 19 V supply was found to be sufficient based on experiments.
A typical inspection speed is on the order of 100 to 300 parts/min [18]. In this work, a speed of 100 parts/min (equivalent to a conveyor speed of 3 m/min) was selected. In the setup, the light was mounted 35 mm above the surface of the part and the working distance between the part and the camera was set to 100 mm. A camera frame rate of 60 fps provided images without any visible blur. Rotational invariance was one of the requirements for our system. For each image acquired from a distinct part, twenty images were generated by digital rotation and by centering the part in the region of interest. These generated images are referred to as conditioned images. A total of 2500 images per application were generated by conditioning. From the database, 2000 images were used for training the 4 classes (500 images each) and 500 images were reserved for testing (100 images per class for the 4 classes, plus 100 images of the unknown category). The images were organized in two folders: training and testing. The database has been made available online and can be downloaded from http://my.me.queensu.ca/People/Surgenor/Laboratory/Database.html. It has two folders: (1) unconditioned (without rotations) and (2) conditioned (with rotations).
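The conditioning step described above (twenty digitally rotated, centered copies per acquired image) can be sketched as follows; the use of Pillow, the equal rotation step, and the grayscale conversion are assumptions for illustration:

```python
from PIL import Image

def condition(img, n_rotations=20, size=(340, 340)):
    """Generate rotated, resized copies of a part image on its black background."""
    img = img.convert("L")
    out = []
    for k in range(n_rotations):
        angle = k * 360.0 / n_rotations
        # fillcolor=0 keeps the black background of the experimental setup
        rotated = img.rotate(angle, resample=Image.BILINEAR, fillcolor=0)
        out.append(rotated.resize(size))
    return out

part = Image.new("L", (340, 340), 0)  # placeholder for an acquired part image
conditioned = condition(part)
print(len(conditioned))  # -> 20
```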
3 Methodology
required number of layers and the exact order of layers, various combinations need to be checked. Each designed network needs to be trained using the available data. Moreover, to learn the mapping between input and output data, many images are required if the network is trained from scratch. To overcome these limitations, the concept of transfer learning was adopted for this research work.
Transfer learning is the process of taking a pre-trained DNN and fine-tuning it to
learn a new task. Using transfer learning is usually much faster and easier than training
a network from scratch because it can quickly transfer learned features to a new task
using a smaller number of training images. There are numerous ways to implement
transfer learning. For this research, a pre-trained model approach was selected. The
training images used for pre-trained networks were a subset of the ImageNet database
used in ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). The popular
networks for transfer learning include AlexNet, LeNet, GoogleNet, VGG19, VGG16, SqueezeNet, ResNet18, ResNet50, ResNet101 and Inception-ResNet-v2. These networks vary in terms of depth, size, operations per prediction, number of parameters and input image size. Networks with greater depth and more parameters require a larger database and are computationally more expensive.
Considering the above factors, AlexNet was selected as a pre-trained network.
AlexNet is a convolutional neural network that is trained on more than a million images
from the ImageNet database. The network is 8 layers deep and can classify images into
1000 object categories, such as keyboard, mouse, pencil, and many animals. As a
result, the network has learned rich feature representations for a wide range of images.
The network has an input image size of 227 × 227 pixels. The images of the objects for the selected 3 applications are similar to the images used to train AlexNet.
AlexNet has a size of 245 MB, 0.72 billion operations per prediction, 61 million parameters and an input image size of 227 × 227 pixels. These numbers are small compared to other, computationally more expensive networks. AlexNet was subsequently downloaded and imported into the tool.
Replace the Final Layers with New Layers Adapted to the New Data Set
AlexNet comprises 25 layers. There are 8 layers with learnable weights: 5 convolutional layers and 3 fully connected layers. By selecting a layer in the tool, its properties can be seen in the property pane. The network classifies input images using the last learnable layer and the final classification layer. To retrain a pre-trained network to classify new images, these final layers were replaced with new layers adapted to the new data set. To use a pre-trained network for transfer learning, the number of classes must be updated to match the new data set. The default AlexNet has 1000 output classes, whereas the number of output classes for the applications of this research work was 4. For AlexNet, the last learnable layer is a fully connected layer. This layer was deleted and a new fully connected layer with an output size of 4 was inserted to replace it. Similarly, the final output classification layer also needed to be replaced to reflect the number of output classes in the given database; the number of outputs is adjusted to the correct size during training. Both these layers are highlighted in Fig. 3. The network parameters of the fully connected layer were adjusted to improve the learning procedure: the weight learn rate factor and bias learn rate factor were set to 10. Once modified, the network was analyzed to ensure that no errors or warnings were present.
Export the Network and Develop the Training and Testing Algorithm
The modified network was renamed and exported to the workspace. An algorithm was written to feed training and test images to the network. Images in the database were 340 × 340 pixels and were resized to 227 × 227 for input to the network. The training database for each application contained 2000 images in 4 classes. The images were divided into 90% for training (1800 images) and 10% for validation (200 images). A separate database of 500 images was used for testing the prediction accuracy of the network.
Before training, the network parameters were adjusted to improve the training
process. For transfer learning, Initial Learn Rate was set to a small value to slow down
learning in the transferred layers. Also, the learning rate factors for the fully connected
layer were increased to speed up learning in the new final layers. This combination of
learning rate settings resulted in fast learning only in the new layers and slower learning
in the other layers. The number of epochs was set to 6. An epoch is a full training cycle
on the entire training data set. For transfer learning, training for many epochs is not
required. The data were shuffled every epoch. The mini-batch size, i.e., the number
of images used in each iteration, was set to 10. The validation frequency was set to 3
and the training plot to monitor progress was turned on. The network was trained for
each application separately. The output of the training plot is shown in Fig. 4. The
plot shows how the accuracy improved and the loss decreased over the epochs. It took
50 min to train the network on a single Intel i7 CPU (1.8 GHz, 16 GB RAM).
The network outputs a final class with an associated probability. The developed
algorithm calculated the classification accuracy of the network both with and without
considering the probability. For accuracy with the probability, a probability threshold
of 99% was used to filter the results. Having a high probability threshold produced a
few false negative results (i.e., it rejected some of the correct classifications); however,
the retained outputs were correct with high confidence. To test the reliability of the network,
it was also tested with unknown class images. These were images that did not fit any
of the trained classes and were not part of the training set. The confusion matrix and
accuracy were determined for testing with unknown classes, both with and without
considering the probabilities. The analysis of the results is given in the next section.
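The probability-threshold filtering described above reduces to a simple rule; a minimal Python sketch (the function name and the string label for the unknown class are illustrative):

```python
def classify_with_threshold(predicted_class, probability, threshold=0.99):
    """Retain the network's prediction only if its probability clears the
    threshold; otherwise report the input as an unknown class."""
    return predicted_class if probability >= threshold else "unknown"

print(classify_with_threshold("Class 2", 0.999))  # confident -> Class 2
print(classify_with_threshold("Class 2", 0.80))   # low confidence -> unknown
```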
The pre-trained AlexNet was modified and trained using transfer learning for all 3
applications (gears, connectors and coins). Due to a strong similarity between the images
AlexNet was pre-trained on and the new database, transfer learning resulted in
fast training (average training time 50 min), with very high accuracy achieved within a
minimum of 1080 iterations. As a result, the training and validation accuracy was
100% for the new database. The separate database of test images was used for addi-
tional testing. The test images contained 100 images of each of the 4 classes (i.e.
400 images) and an additional 100 images of unknown objects. Hence, a total of 500 images
were used for testing. The unknown object images were never seen during training and
did not fit any of the 4 training classes. The network was trained and tested for each
application separately. For simplicity, the intermediate results are presented only for the
connector application, while the overall results are presented for all 3 applications.
For each test image, the network outputs a class and a probability associated with
the classification. The classification accuracy was calculated based on the number of
correct classifications by the network for the test database. The accuracy was deter-
mined under 4 distinct test conditions: (1) testing with known classes without con-
sidering the probabilities; (2) testing with an added unknown class without considering the
probabilities; (3) testing with known classes and considering the probabilities;
and (4) testing with an added unknown class and considering the probabilities. The results
are presented as confusion matrices and the accuracies are reported. Since the
network was originally trained with only 4 classes, for batch testing it could be tested
with only 4 classes. Thus, unknown class images were added to Class 4 for test
conditions (2) and (4).
The confusion matrix for the classification accuracy for test condition (1) is given in
Table 1. Since only the 4 known classes were used for testing, the network achieved a
perfect classification accuracy of 100%. For each test class, all 100 images were predicted as the
correct class by the network. To probe the robustness of the network, 100 images of
unknown objects were added to Class 4 for test condition (2). The confusion matrix
is presented in Table 2. For this test condition, the output of the network was con-
sidered without its probability. It can be seen that all 200 images of Class 4 were
predicted as Class 4 by the network; i.e., the 100 images of the unknown class
were also predicted as Class 4 images. The resulting classification accuracy was 80%.
Table 1. Connector test results with known classes and without probabilities

Actual class   Predicted class
               Class 1  Class 2  Class 3  Class 4  Unknown
Class 1            100        0        0        0        0
Class 2              0      100        0        0        0
Class 3              0        0      100        0        0
Class 4              0        0        0      100        0
Unknown              0        0        0        0        0
Classification accuracy: 100%
282 V. Chauhan et al.
Table 2. Connector test results with an added unknown class and without probabilities

Actual class   Predicted class
               Class 1  Class 2  Class 3  Class 4  Unknown
Class 1            100        0        0        0        0
Class 2              0      100        0        0        0
Class 3              0        0      100        0        0
Class 4              0        0        0      100        0
Unknown              0        0        0      100        0
Classification accuracy: 80.00%
The confusion matrix for the classification accuracy for test condition (3) is given in
Table 3. For this condition, the classification results were reinforced by considering the
probability of each classification. For each test image, the output class was first pre-
dicted using the network. In the second stage, the probability associated with that
classification was determined. If the probability score was higher than 99%, the
output class was retained; otherwise the output was labeled as an unknown class.
The high probability threshold can produce some false negatives, i.e., a correctly
predicted class treated as the unknown class. For example, 5 images of Class 2, 2
images of Class 3, and 5 images of Class 4 were labeled as unknown due to their
low probability scores. The classification accuracy was 97.60% for this test condition.
For test condition (4), the unknown class images were added to Class 4 and the
results were strengthened by considering the probabilities. The confusion matrix is
provided in Table 4. Using the probability, out of 100 images of the unknown class, 27
were predicted as Class 4 and 73 were predicted as the unknown class. Compared with
condition (2), where all 100 unknown class images were misclassified as
Class 4, test condition (4) resulted in only 27 misclassifications. As a result, the
accuracy improved from 80% to 93.20%.
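The reported accuracies follow directly from dividing the confusion matrix diagonal (correct predictions) by the total number of test images. A short Python sketch, using the counts from Table 4 (rows are actual classes, columns are predicted classes, with the unknown class last):

```python
def accuracy_from_confusion(matrix):
    """Classification accuracy = sum of the diagonal / sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Table 4 counts: Class 1-4, then Unknown (actual x predicted)
table4 = [
    [100,  0,  0,   0,  0],
    [  0, 95,  0,   0,  5],
    [  0,  0, 98,   0,  2],
    [  0,  0,  0, 100,  0],
    [  0,  0,  0,  27, 73],
]
print(round(accuracy_from_confusion(table4) * 100, 2))  # 93.2
```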
For the overall classification accuracy, the results of all three applications are
presented in Fig. 5 (left). These accuracy results do not consider the
probabilities. A large drop in accuracy can be seen between testing
with known classes only and testing with the added unknown class.
Table 3. Connector test results with known classes and with probabilities

Actual class   Predicted class
               Class 1  Class 2  Class 3  Class 4  Unknown
Class 1            100        0        0        0        0
Class 2              0       95        0        0        5
Class 3              0        0       98        0        2
Class 4              0        0        0       95        5
Unknown              0        0        0        0        0
Classification accuracy: 97.60%
Image Classification Using Deep Neural Networks 283
Table 4. Connector test results with an added unknown class and with probabilities

Actual class   Predicted class
               Class 1  Class 2  Class 3  Class 4  Unknown
Class 1            100        0        0        0        0
Class 2              0       95        0        0        5
Class 3              0        0       98        0        2
Class 4              0        0        0      100        0
Unknown              0        0        0       27       73
Classification accuracy: 93.20%
Fig. 5. Classification accuracy without probability (left) and with probability (right)
The classification accuracy results for all 3 applications when considering the
probabilities are given in Fig. 5 (right). For these results, the output of AlexNet was
corrected based on the probability of the classification and the probability threshold. As
a result, some false negatives were reported when testing with only known classes. On the
upside, the classification accuracy improved significantly for testing with the added
unknown class. The accuracy difference between testing with known classes and with the added
unknown class was greatly reduced compared to the results without considering the
probabilities.
Deep neural networks (DNNs) are an excellent tool for image classification. Building and
training a DNN from scratch requires time and copious amounts of data, and is computa-
tionally expensive. Moreover, a DNN consists of many layers, and the performance of the
network varies with the number and order of these layers. In addition, during
training, millions of internal parameters must be tuned. Transfer learning is an
alternative solution. Researchers in the field of machine learning have contributed to
the community by sharing their pre-trained networks. For a given application, a similar
pre-trained network can be accessed, modified and retrained with a new database.
Using transfer learning is usually much faster and easier than training a network from
scratch, as learned features can be quickly transferred to a new task using a smaller
number of training images. In this research, AlexNet, trained on the ImageNet database, was
selected, modified and retrained. The training with 4 classes took only 50 min and
resulted in 100% validation accuracy. During testing, the network was tested with both
known classes only and an added unknown class. The accuracy of the network dropped
from 100% to 80% with the addition of an unknown class. However, reinforcing the
classification results using the probabilities and setting a high probability threshold
significantly improved the accuracy, from 80% to 97%. The remaining errors were
false negative results, where the output class was considered
unknown due to a low probability score. For the coin and gear applications, the
accuracy improved by 20% for testing with added unknown images. For the connector
application, the accuracy improved by 17% for testing with unknown images.
To test the reliability and robustness of transfer learning, further tests should
be conducted with different types of parts. Future work includes testing with new
databases and with more images in each database. There are more than a dozen pre-trained
networks available for transfer learning, and examining the effect of transfer learning
applied to these newer networks is a goal for future work. Each pre-trained network
has many tuning options, so guidelines need to be developed for the modification
and tuning of pre-trained networks using transfer learning. The effect of batch size,
learning rate and other parameters can be the subject of future work. To improve the
reliability and robustness of the network in handling unknown class images, the fusion
of ANN or fuzzy-based networks with DNNs should be studied. Also, the effect of image
quality on the performance of the network will be examined. An image quality index
will be defined, similar to one provided in the literature, for example [25].
References
1. Dougherty, G.: Introduction. In: Dougherty, G. (ed.) Pattern Recognition and Classification,
pp. 1–7. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-5323-9_1
2. Bremananth, R., Balaji, B., Sarkari, M., Chitra, A.: A new approach to coin recognition
using neural pattern analysis. In: IEEE Indicon Conference, Chennai, India, pp. 366–370
(2005)
3. Chauhan, V., Joshi, K.D., Surgenor, B.: Machine vision for coin recognition with ANNs:
effect of training and testing parameters. In: Boracchi, G., Iliadis, L., Jayne, C., Likas, A.
(eds.) EANN 2017. CCIS, vol. 744, pp. 523–534. Springer, Cham (2017). https://doi.org/10.
1007/978-3-319-65172-9_44
4. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
5. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
6. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification
using binary convolutional neural networks, pp. 1–17. arXiv:1603.05279v4 (2016)
7. Fadaeddini, A., Eshghi, M., Majidi, B.: A deep residual neural network for low altitude
remote sensing image classification. In: 6th Iranian Joint Congress on Fuzzy and Intelligent
Systems (CFIS), Kerman, Iran, pp. 43–46 (2018)
8. Ciresan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image
classification. In: Computer Vision and Pattern Recognition, Providence, USA, pp. 3642–
3649 (2012)
9. Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: high confidence
predictions for unrecognizable images. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Boston, USA, pp. 427–436 (2015)
10. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199
(2014)
11. Torrey, L., Shavlik, J.: Transfer learning. In: Handbook of Research on Machine Learning
Applications (2009)
12. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10),
1345–1359 (2010)
13. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural
networks? In: Advances in Neural Information Processing Systems 27, NIPS 2014, pp. 1–14
(2014)
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document
recognition. Proc. IEEE 86(11), 2278–2324 (1998)
15. Szegedy, C., et al.: Going deeper with convolutions. In: Conference on Computer Vision and
Pattern Recognition (CVPR), Boston, USA, pp. 1–9 (2015)
16. Joshi, K., Chauhan, V., Surgenor, B.: Real-time recognition and counting of Indian currency
coins using machine vision: a preliminary analysis. In: Proceedings of the Canadian Society
for Mechanical Engineering (CSME) International Congress, Kelowna, Canada (2016)
17. Cognex: Color Light Selection Guide. https://www.cognex.com/products/machine-vision/
2d-machine-vision-systems/in-sight-7000-series/colored-light-selection-guide. Accessed 31
Mar 2018
18. Chauhan, V.: Fault detection and classification in automated assembly machines using
machine vision. Doctoral thesis, Department of Mechanical and Materials Engineering,
Queen’s University, Canada (2016)
19. Shah, S., Bennamoun, M., Boussaid, F.: Iterative deep learning for image set based face and
object recognition. Neurocomputing 174, 866–874 (2015)
20. Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H., Ogata, T.: Audio-visual speech
recognition using deep learning. Appl. Intell. 42, 722–737 (2015)
21. Zhou, S., Chen, Q., Wang, X.: Active deep learning method for semi-supervised sentiment
classification. Neurocomputing 120, 536–546 (2013)
22. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput.
Vis. 115(3), 211–252 (2015)
23. Joshi, K.D.: A flexible machine vision system for small part inspection based on a hybrid
SVM/ANN approach. Doctoral thesis, Department of Mechanical and Materials Engineer-
ing, Queen’s University, Canada (2018)
24. Machine Learning and Deep Learning Toolbox, MATLAB R2018b (2018)
25. Mittal, A., Soundararajan, R., Bovik, A.: Making a completely blind image quality analyzer.
IEEE Signal Process. Lett. 20(3), 209–212 (2013)
LEARNAE: Distributed and Resilient
Deep Neural Network Training
for Heterogeneous Peer to Peer Topologies
1 Introduction
In recent years, the scientific community has made significant advances regarding
parallel DNN training. There are many aspects that can be dealt with in a decentralized
way, and each implementation may have a different level of decentralization based on
the intended use case. In most cases the proposed works rely on a centralized approach,
such as a parameter server [1, 2]. These approaches require high performance infrastruc-
ture, since all nodes have to communicate with the server and exchange models after
each optimization step. Decentralized attempts, such as [3], use local optimization and
asynchronous model merging to reduce the communication demand, but the parameter
server is still a bandwidth bottleneck that limits scalability. Other distributed deep
learning systems [4, 5] are able to overcome this bottleneck, but to do so they need
low-latency networking such as InfiniBand, which results in very high setup costs and
narrows down the use cases. Proposals like [6] focus on medium performance hardware,
but they still utilize synchronous frameworks like [7], which are not the optimal choice
for loosely connected peers.
In this paper we propose a framework that stretches decentralization and fault tolerance
to the maximum. Based on purely distributed peer-to-peer technology, it has no need for
servers or any kind of strict synchronization. Its intended use cases are environments
with commodity-hardware nodes and networking infrastructure with moderate latency
and connectivity. Our approach supports versatile data acquisition from different types of
sources, including lightweight IoT devices, and uses novel Distributed Ledger Tech-
nologies as the data diffusion mechanism.
The rest of the paper is structured as follows: Sect. 2 presents technological
background knowledge, whereas Sect. 3 presents the LEARNAE architecture. Section 4
presents experimental results to evaluate the whole approach and, finally, Sect. 5
concludes the paper and identifies future research challenges.
2 Technological Background
This section presents the related technological background that forms the basis for the
LEARNAE framework.
2.1 IPFS
IPFS is a distributed filesystem for peer-to-peer networks, where no node has special
privileges compared to others. It is composed of previously established and successful
technologies, plus a novel block exchange protocol. Its main goal is to achieve efficient storage
and availability of big data, with no need for a centralized authority, supporting features
such as immutability, deduplication, versioning and content-based addressing [8].
IPFS may be seen as an intuitive combination of Git [9], the distributed source code
version control system, and BitSwap, a novel data replication incentivizing algorithm
inspired by BitTorrent [10], the decentralized block exchange protocol.
In IPFS every node is identified by a unique ID, which is the cryptographic hash of a
public key. Although a node owner has the freedom to change this ID, it is not advised,
since by doing so they would lose all the benefits the node has gained through its participation
in the network. Unlike most filesystems, IPFS does not use an addressing method based
on the location where the data are stored; instead, any file can be acquired using just the
cryptographic hash of its contents. Files larger than a specific size are sliced into blocks,
each of which is assigned its corresponding hash. In order to route a request, each
IPFS node utilizes a Distributed Hash Table (DHT), a ledger (based on [11–13]) that
contains pairs of block/peer information. For a small block, the DHT contains the actual
block data, while for a larger one it contains the IDs of peers that can serve the specific
block.
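Content-based addressing and block slicing can be sketched as follows (a simplified Python illustration: real IPFS uses multihash-encoded digests and Merkle-DAG links, and the block size here is arbitrary):

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # illustrative block size, not IPFS's actual default

def content_address(data: bytes) -> str:
    """Address a block by the cryptographic hash of its contents."""
    return hashlib.sha256(data).hexdigest()

def slice_into_blocks(data: bytes, block_size=BLOCK_SIZE):
    """Slice a large file into blocks, each keyed by its own content hash."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return {content_address(b): b for b in blocks}

# A file spanning three blocks, two of which are identical
data = b"a" * BLOCK_SIZE + b"b" * BLOCK_SIZE + b"a" * BLOCK_SIZE
store = slice_into_blocks(data)
print(len(store))  # 2: the repeated block is stored once (deduplication)
```

Because the address is derived purely from content, identical blocks collapse to one entry, which is how deduplication falls out of content addressing for free.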
Regarding block exchange strategy, every IPFS node maintains a list of blocks it
needs and a list of blocks it already has. A major difference compared to BitTorrent’s
method [14] is that these lists are not limited to a specific torrent or a group of torrents, but
288 S. Nikolaidis and I. Refanidis
may include any block that has appeared on the network, no matter which file it is
part of. Since there will be cases where two peers do not need any block stored by each
other, or cases where a node needs nothing at all, some kind of credit has to be applied. Each
node is incentivized to increase its credit with its peers, because by doing so it also
increases the probability that others will help it acquire the blocks it will need in the future.
For a pair of peers, this credit is in fact the balance of verified bytes exchanged between
the two.
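The per-peer credit described above is just a byte balance; a minimal sketch (the class and method names are illustrative, not BitSwap's actual data structures):

```python
class PeerLedger:
    """Credit between a pair of peers: the balance of verified bytes
    exchanged between the two (simplified sketch)."""

    def __init__(self):
        self.bytes_sent = 0
        self.bytes_received = 0

    def record_sent(self, n):      # we served n bytes to the peer
        self.bytes_sent += n

    def record_received(self, n):  # the peer served n bytes to us
        self.bytes_received += n

    @property
    def credit(self):
        # positive credit -> we have served more than we consumed, so the
        # peer is incentivized to help us acquire blocks in the future
        return self.bytes_sent - self.bytes_received

ledger = PeerLedger()
ledger.record_sent(4096)
ledger.record_received(1024)
print(ledger.credit)  # 3072
```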
Since the main concept of IPFS is decentralization, no data servers can be used to
store the files. All nodes participating in the network must provide some local storage
(similar to [15]). Every requested block is fetched from another peer and then saved to
the local storage. If a peer owner wishes to permanently retain a file locally, they can
“pin” it to avoid garbage collection. Considering all the above, IPFS can be used as
a resilient big-dataset distribution method, ensuring load balancing between
the nodes and configurable data replication, in order to tolerate node downtime on
loosely connected commodity hardware.
PubSub. IPFS supports a publish-subscribe scheme which allows instant messaging.
If a peer needs access to a specific “topic”, it subscribes to it and can then listen
and broadcast data on that channel. As expected, PubSub has no need for centralized
authorities, and all its messages propagate using a gossip protocol. LEARNAE utilizes the
IPFS PubSub feature to exchange training metadata among the peers.
2.2 IOTA
The IOTA network [16] aims to create a decentralized infrastructure to support all
forms of data exchange in the upcoming Internet of Things (IoT) era. Although it
supports a cryptocurrency token, automated micropayments between devices are just a
portion of its use cases. It is based on a structure known as a Directed Acyclic Graph
(DAG), which in the IOTA community is referred to as the “Tangle” [17]. In general,
IOTA attempts to achieve consensus via permissionless procedures, using the Tangle
as a Distributed Ledger Technology (DLT) and a gossip protocol to propagate trans-
actions throughout the network. In order to attach a new transaction, a node has to
confirm two previous transactions, so each addition contributes to the overall stability
and performance of the network [18].
Masked Authenticated Messages. The IOTA feature of interest for this paper is
known as “Masked Authenticated Messages” (MAM). This feature allows lightweight
IoT devices to attach zero-valued transactions to the Tangle by performing a small
amount of “Proof of Work” (PoW), i.e., solving a cryptographic puzzle. PoW is
necessary in order to prevent spamming by bad actors. Each of these transactions
includes a data portion, making MAM an IoT-oriented way to broadcast data streams,
which can support both encryption and authentication.
MAM messaging is based on a publish-subscribe logic. A data stream is con-
structed as a singly linked list, where each transaction points to the next one. If
someone knows the address (called the “root”) of a specific transaction, they can read all of
the following stream, but will have no access to previous data. Knowing the first root of
a MAM chain grants access to all the messages it contains.
A MAM stream may have one of three accessibility modes: Public,
Restricted and Private. In Public mode, everyone can view the messages. In Restricted
mode, only those who know a “sidekey” can read the data. In Private mode, only the
stream owner can access the data, since this requires knowing the seed that created the
MAM root sequence.
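The singly linked stream described above can be sketched as follows (a simplified illustration: real MAM roots are derived from the owner's seed via IOTA's cryptography, while here plain SHA-256 stands in for that derivation):

```python
import hashlib

def make_stream(payloads, seed=b"seed"):
    """Build a MAM-like singly linked stream: each message is stored under
    its root address and carries the root of the next message."""
    roots = [hashlib.sha256(seed + str(i).encode()).hexdigest()
             for i in range(len(payloads) + 1)]
    return {roots[i]: {"payload": payloads[i], "next_root": roots[i + 1]}
            for i in range(len(payloads))}

def read_from(stream, root):
    """Follow next_root pointers: a reader who learns a root can read all
    subsequent messages, but nothing published before that point."""
    out = []
    while root in stream:
        msg = stream[root]
        out.append(msg["payload"])
        root = msg["next_root"]
    return out

stream = make_stream(["m0", "m1", "m2"])
first_root = hashlib.sha256(b"seed0").hexdigest()
mid_root = hashlib.sha256(b"seed1").hexdigest()
print(read_from(stream, first_root))  # ['m0', 'm1', 'm2']
print(read_from(stream, mid_root))    # ['m1', 'm2'] -- no access to earlier data
```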
3 Implementation
Parallelism Type. The first decision that has to be made is between model [19] and
data [20, 21] parallelism, although there are works that propose hybrid systems.
LEARNAE adopts data parallelism. According to this approach, each worker keeps the
entire model locally and processes it using a subset of the training data.
Propagation. After processing on the workers, the produced models have to be combined.
The two major methods for doing so are weight averaging and update averaging, each
having its own advantages and drawbacks. This proposal uses weight averaging:
after training, the actual value (not just the update) of each model parameter is, under
specific conditions, averaged with the actual value of the corresponding parameter of a
selected worker’s model [22].
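Weight averaging as described above can be sketched in a few lines (models are represented as flat lists of floats for simplicity; the 0.5 mixing factor is an illustrative choice, not LEARNAE's tuned value):

```python
def average_weights(local, remote, local_weight=0.5):
    """Average each parameter's actual value (not its update) with the
    corresponding parameter of a selected peer's model."""
    w = local_weight
    return [w * a + (1.0 - w) * b for a, b in zip(local, remote)]

local_model = [0.2, -1.0, 3.0]
remote_model = [0.4, 1.0, 1.0]
print(average_weights(local_model, remote_model))  # approximately [0.3, 0.0, 2.0]
```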
Coordination. The method for merging the above models can also have different
levels of decentralization, as summarized in Table 1.
A parameter server has the task of receiving, combining and redistributing the
averaged data. The use of a server in most cases increases training speed, but
it also creates a single point of failure and, in large-scale networks, a bandwidth
bottleneck. This drawback can be reduced by using more servers that cooperate with each
other. For cases where the presence of a server is not feasible or desired, some of the
participating peers are assigned special coordinating tasks, while also functioning as
regular training workers. At the edge of this spectrum are the implementations where no
node has additional duties regarding coordination, creating a fully distributed envi-
ronment, which is the design followed by the current proposal.
Synchronicity. The training procedure may be synchronous or asynchronous. In
synchronous designs the coordinating entity ensures that all results are only combined
with others produced at the same training phase. In asynchronous designs there is no
such need, and the results of a worker can be embedded into the global model under
looser time-based rules. Each approach has pros and cons. Synchronous training
may converge faster, since it prevents the merging of irrelevant data, but slow peers
can introduce locks that delay the whole process. Asynchronous training achieves high
worker utilization, but suffers from high gradient staleness, meaning that by the time a
worker submits its results they are already out of date compared to the global model.
Many strategies have been proposed [23] to mitigate these downsides, resulting in a
large number of variants, especially for Asynchronous Stochastic Gradient Descent.
LEARNAE adopts an asynchronous design, although it contains features which, if used in
future implementations, may inject a configurable amount of synchronicity.
3.2 Architecture
LEARNAE is based on a versatile working scheme and can adapt to different environ-
ments. Regarding data, it supports use cases where all training data are poured into the
network during the initial phase, before any training. It also supports use cases where
data feeding is a continuous task and the training of the DNN model is an always-
progressing procedure. Thus, there is no limitation on when training data may be
injected into the network, which is a critical feature when it comes to sensor data
streaming from IoT devices.
For enhanced versatility there are 4 different types of node roles (Table 2). What
role a node has depends on whether it has access to training data and on its compu-
tational capability.
The first three roles in Table 2 require enough computational power to support
participation in the IPFS swarm. The fourth role is intended for the IoT domain, since data
streaming through MAM messages can be broadcast even by lightweight sensor
devices. Figure 1 demonstrates the data flow between nodes of different roles.
Fig. 1. Data flow between nodes of different roles (Full Node and Trainer Node as IPFS peers with input/output, plus an output-only IPFS peer and an IOTA client feeding training data)
The ID of the used publish-subscribe channel (IPFS and/or IOTA) is the connecting
link between peers. Knowing this ID allows a peer to participate in the network by
listening to and broadcasting messages. Figure 2 shows the workflow of a node’s
“Listening thread”. There are 3 different types of messages:
Slice Hashlist. This message contains the hash of a file that a node made available on
IPFS. This file contains a list of hashes of training data slices that were also made
available on IPFS by the same peer.
Remote Model. This message contains the hash of a file that a node made available on
IPFS. This file contains the model of that peer in HDF5 format. Other metadata of the
model are also included, such as the achieved accuracy and the maturity of the model (the
number of training cycles elapsed up to its creation).
Slice Use Stats. This message informs all participating peers that a specific data slice
has been used for training by a peer. This information enables the “overuse
threshold” feature, which optionally sets a limit on how many times a data slice may
be used for training across different nodes.
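The overuse-threshold feature can be sketched as a counter over broadcast "Slice Use Stats" messages (a hypothetical sketch; the class and method names are not LEARNAE's actual API, and the slice hash is a placeholder string):

```python
from collections import Counter

class SliceUsageTracker:
    """Track Slice Use Stats messages and enforce an optional overuse
    threshold: a cap on how many times a data slice may be used for
    training across different nodes."""

    def __init__(self, overuse_threshold=None):
        self.threshold = overuse_threshold  # None disables the cap
        self.uses = Counter()

    def on_slice_used(self, slice_hash):
        """Called whenever any peer broadcasts that it trained on a slice."""
        self.uses[slice_hash] += 1

    def may_train_on(self, slice_hash):
        if self.threshold is None:
            return True
        return self.uses[slice_hash] < self.threshold

tracker = SliceUsageTracker(overuse_threshold=2)
tracker.on_slice_used("slice-hash-1")
print(tracker.may_train_on("slice-hash-1"))  # True: used once, cap is 2
tracker.on_slice_used("slice-hash-1")
print(tracker.may_train_on("slice-hash-1"))  # False: cap reached
```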
A node may perform up to four main tasks, as described in Table 3:
In order to maximize node utilization, LEARNAE uses a different execution thread for
each task. All of them can work simultaneously, with the exception of the
Training/Averaging pair, since both need read-write access to the local model. Figures 3
and 4 demonstrate the work cycle of a full node.
4 Evaluation
4.1 Current Scope
This first paper is a preliminary study on the viability of using modern DLTs as the data
diffusion mechanism of an asynchronous Stochastic Gradient Descent algorithm. Since
this is the initial approach, the following scope limits apply:
• The performance of IPFS under the described use case will be reviewed. All tests
will be run on a single machine, simulating different peers using virtualization. This
paper focuses on proof-of-concept metrics, and all tests will be short runs with a one-
hour timespan. Although the training dataset contains approximately 10 million
instances, only a limited fraction of those was used.
• Full deployment on an actual network of commodity hardware, comparison to tradi-
tional methods, extended time runs, quantity-based metrics and full training data
usage will be studied in a subsequent work.
4.2 Simulation
The simulation runs of this first approach were executed on a virtual network of 10
workstations. The network was implemented on a single commodity computer with
Docker containers running the IPFS daemons. The coordinating application was
developed in C# and it contains both the distributed training management algorithm
and the logging mechanism. The dataset used was HEPMASS [24] (exotic particle
detection).
As seen in the following figures, the simulation aims to study the effect of some key
characteristics, such as data slice size and overuse threshold, along with resilience-
oriented features such as duplication level and overhead.
The simulation shows that an increase in slice size has a positive impact on the average
accuracy of the produced models (Fig. 5). A lower overuse threshold results in slightly
better accuracy, since it reduces repetitive training with the same data (Fig. 6).
As expected, the total bytes sent are proportional to the selected slice size (Fig. 7). The
same applies to the average resilience (Fig. 8), which is defined as the number of nodes
owning a requested slice/model. Figure 9 demonstrates the percentage of duplicate data
sent, an unavoidable overhead of the gossip-based protocol. This percentage,
although enormously high at the beginning, declines quickly over time and is minimized
for larger slice sizes.
Fig. 5. Average model accuracy (%) per slice size (3125–100000), one-hour session
Fig. 6. Average model accuracy (%) per overuse threshold (none, 2–8), one-hour session
Fig. 7. Total bytes sent per slice size (overuse threshold: none)
Fig. 8. Average resilience (nodes owning a slice) per slice size, one-hour session
Fig. 9. Duplicate data sent per slice size (overuse threshold: none)
This paper is a first approach to using Distributed Ledger Technology as the data
diffusion mechanism for decentralized DNN training, in order to achieve a fully dis-
tributed, peer-to-peer architecture. Thus, this study should be seen as a proof of
concept for the specific recipe. Although the dataset used contains approximately 10
million instances, only a small fraction of those (100,000) was used in this phase.
Many interesting questions remain to be addressed, especially those concerning
quantity-based metrics: What is the performance on real-life commodity hardware with
realistic network latencies and downtimes? How well can such an ecosystem scale?
What is the achieved accuracy when using more data? What is the maturity level of
new concepts like the IOTA Tangle? What are the specifics of the performance/resilience
tradeoff? All these questions will be the subject of future work.
References
1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of
the 12th USENIX Symposium on Operating Systems Design and Implementation, pp. 265–
283 (2016)
2. Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information
Processing Systems, pp. 1223–1231 (2012)
3. Zhang, S., Choromanska, A., LeCun, Y.: Deep learning with elastic averaging SGD. In:
Advances in Neural Information Processing Systems, pp. 685–693 (2015)
4. Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous
distributed systems. In: Proceedings of LearningSys (2015)
5. Iandola, F.N., Ashraf, K., Moskewicz, M.W., Keutzer, K.: FireCaffe: near-linear acceleration
of deep neural network training on compute clusters (2015)
6. Langer, M., Hall, A., He, Z., Rahayu, W.: MPCA SGD—a method for distributed training of
deep learning models on spark. IEEE Trans. Parallel Distrib. Syst. (2018)
7. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory
cluster computing. In: Proceedings of the 9th USENIX Symposium on Networked Systems
Design and Implementation, pp. 15–28 (2012)
8. Benet, J.: IPFS - Content Addressed, Versioned, P2P File System
9. Mashtizadeh, A.J., Bittau, A., Huang, Y.F., Mazieres, D.: Replication, history, and grafting
in the Ori file system. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles, pp. 151–166. ACM (2013)
10. Cohen, B.: Incentives build robustness in BitTorrent. In: Workshop on Economics of Peer-
to-Peer Systems, vol. 6, pp. 68–72 (2003)
11. Baumgart, I., Mies, S.: S/Kademlia: a practicable approach towards secure key based
routing. In: Parallel and Distributed Systems International Conference (2007)
12. Freedman, M.J., Freudenthal, E., Mazieres, D.: Democratizing content publication with
coral. In: NSDI, vol. 4, p. 18 (2004)
13. Wang, L., Kangasharju, J.: Measuring large-scale distributed systems: case of BitTorrent
mainline DHT. In: 2013 IEEE Thirteenth International Conference on Peer-to-Peer
Computing (P2P), pp. 1–10. IEEE (2013)
14. Levin, D., LaCurts, K., Spring, N., Bhattacharjee, B.: BitTorrent is an auction: analyzing and
improving BitTorrent’s incentives. In: ACM SIGCOMM Computer Communication
Review, vol. 38, pp. 243–254. ACM (2008)
15. Dean, J., Ghemawat, S.: LevelDB–a fast and lightweight key/value database library by
Google (2011)
16. IOTA Foundation. https://www.iota.org. Accessed 14 Feb 2019
17. Popov, S.: The Tangle, 30 April 2018
18. Popov, S., Saa, O., Finardi, P.: Equilibria in the Tangle, 3 March 2018
19. Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., Andrew, N.: Deep learning with
COTS HPC systems. In: Proceedings of the 30th International Conference on Machine
Learning, pp. 1337–1345 (2013)
20. Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic
models using generalized maxout networks. In: 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE (2014)
21. Miao, Y., Zhang, H., Metze, F.: Distributed learning of multilingual DNN feature extractors
using GPUs (2014)
22. Povey, D., Zhang, X., Khudanpur, S.: Parallel training of deep neural networks with natural
gradient and parameter averaging (2014)
23. Seide, F., Fu, H., Droppo, J., Li, G., Yu, D.: 1-bit stochastic gradient descent and its
application to data-parallel distributed training of speech DNNs. In: Fifteenth Annual
Conference of the International Speech Communication Association (2014)
24. HEPMASS Dataset. http://archive.ics.uci.edu/ml/datasets/hepmass. Accessed 14 Feb 2019
Predicting Customer Churn Using Artificial
Neural Network
1 Introduction
Customer churn is the switching of a customer from one service provider to another,
particularly in subscription-based services [5]. It is one of the most influential
factors for service-based industries, mainly telecommunications and other companies
that provide subscription-based services such as online entertainment (e.g., Netflix and
Amazon Prime). Customer churn analysis is a major factor in the overall growth of a
company. Left unchecked, churn can heavily degrade a company's business, as
customers may start preferring the services of other companies and leave the current one.
Recent studies state that the expense of acquiring a new customer exceeds that of
retaining an existing one. Jahromi et al. [12] highlighted that retaining existing
customers often leads to a major increase in sales and reduced marketing costs [5, 6].
Over the years, several factors have been proposed as crucial in determining the
reasons behind customer churn and in estimating the probability that a customer will
churn [5, 6]. The importance of this alarming issue for service providers competing in
the marketplace has led to intense work on the development of predictive tools and
methods supporting the modelling and classification process [8, 9, 13].
In recent years there has been a huge increase in the amount of data collected
and processed to extract meaningful and valuable information across wide areas of
business. The information obtained is then used by companies
for customer relationship management (CRM) [1, 14]. Although many
techniques have been successfully applied to predicting customer churn, such as
SVM [1, 13, 15], rule induction [2], logistic regression, decision trees [7], Naive Bayes
and neural networks [4] in domains such as airlines and banking, deep
learning approaches for this scenario still have much to be explored. Older techniques [2,
10, 11] like logistic regression are less accurate, as illustrated by Vafeiadis et al. [3], than
newer techniques like deep learning. Therefore, this study presents findings and
results on data from the telecommunication domain [16], obtained using deep
learning techniques with the Keras library.
In this paper, a multi-layer perceptron is modelled using the Keras package in the R
language to predict customer churn and the factors that dominate it. The
overall process is described in Fig. 1.
We provide a brief overview of the data set used in this paper for analysis and findings.
The data set, from the telecommunication industry, consists of 21 variables
and 7043 observations. Its contents are service information (Internet
service, Multiple lines, Phone service, Online backup, Online security, Device
protection), customer account information (Customer ID, Paperless billing, Payment
method, Monthly charges, Total charges) and demographic variables (Gender,
Partner, Dependents, Senior citizen). For each record we also have its corresponding
churn label, true or false.
Firstly, we split the dataset in the ratio 4:1, where the former part is used for training
and the latter for testing. Irrelevant and incorrect records are removed from the training
set by dropping the undesired variables (the customer ID is dropped in the
experiments of this paper), finally leaving 20 variables and 7034 observations.
Then the data is pruned and transformed by one-hot encoding.
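As a minimal illustration of this preparation step (the paper does it in R with Keras; this is a hedged Python sketch, and the field names such as `customerID` and `Contract` are stand-ins based on the variables described above, not the actual schema):

```python
import random

random.seed(0)

# Toy records mimicking the telecom dataset's structure; field names and
# values are illustrative assumptions, not the real data.
records = [
    {"customerID": f"id{i}",
     "Contract": random.choice(["Month-to-month", "One year", "Two year"]),
     "MonthlyCharges": round(random.uniform(20, 120), 2),
     "Churn": random.choice([0, 1])}
    for i in range(100)
]

def preprocess(rows, drop=("customerID",), categorical=("Contract",)):
    """Drop undesired variables and one-hot encode categorical ones."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    out = []
    for r in rows:
        row = {k: v for k, v in r.items() if k not in drop and k not in categorical}
        for c in categorical:
            for lv in levels[c]:
                row[f"{c}_{lv}"] = 1 if r[c] == lv else 0
        out.append(row)
    return out

data = preprocess(records)
split = int(len(data) * 0.8)           # 4:1 train/test ratio
train, test = data[:split], data[split:]
```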
In Fig. 2 we illustrate how the accuracy and loss change during training. The final
accuracy obtained on the training and validation data is 85.53% and 76.5%, respectively.
Changing the number of neurons in the first hidden layer to 75 and in the second hidden
layer to 35, setting the kernel initializer to 'uniform', the activation function to sigmoid,
the dropout rate to 0.1 and the optimizer to Adam, we get 93.14% accuracy on the
training data, but the validation data scores 75% accuracy.
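For illustration only, the forward pass of such a network (20 inputs, hidden layers of 75 and 35 units, sigmoid activations) can be sketched in pure Python. The paper builds this model with Keras in R; the weights here are random stand-ins under a 'uniform'-style initializer, and dropout (active only during training) is omitted:

```python
import math
import random

random.seed(42)

def layer(n_in, n_out):
    # Rough analogue of a 'uniform' kernel initializer: small random
    # uniform weights and zero biases.
    return ([[random.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    # Apply each dense layer followed by the sigmoid activation.
    for w, b in layers:
        x = [sigmoid(sum(wi * xi for wi, xi in zip(row, x)) + bi)
             for row, bi in zip(w, b)]
    return x

# 20 input features -> 75 -> 35 -> 1 (churn probability).
net = [layer(20, 75), layer(75, 35), layer(35, 1)]
p = forward([0.5] * 20, net)[0]
```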
4 Evaluation Measures
To evaluate the classifier performance for the different schemes and parameter
settings, the measures used are precision, recall, accuracy and F-measure, all
calculated from the confusion matrix. True positive and false negative cases are
denoted TP and FN, and false positive and true negative cases are denoted FP and
TN, respectively.
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Accuracy = (TP + TN) / (TP + FP + TN + FN)
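These measures can be computed directly from the confusion-matrix counts; a minimal Python sketch follows (the counts used below are hypothetical, not the paper's):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_measure

# Hypothetical counts for illustration: 80 TP, 10 FP, 20 FN, 90 TN.
p, r, a, f = classification_metrics(80, 10, 20, 90)
```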
Since we are dealing with binary classification, the class predictions, produced as a
matrix of 0s and 1s, are converted to a vector. The estimates produced by the Keras
model are summarized in Table 2. A confusion matrix (Table 3) is then created from
the model results, indicating actual versus predicted classes.
From the confusion matrix, precision and recall are calculated as 0.869
and 0.826, respectively. Higher values indicate better model performance, while
lower values indicate inaccuracy. We can conclude that our model performs on par
with industry standards and requirements.
To highlight which features are important, we use a heat map
in which more important features are shown in a darker shade.
The heat map in Fig. 3 demonstrates that features such as Tech support and
Streaming movies carry the heaviest weights among the factors responsible for
customer churn. The last step in the churn analysis is to study the churn correlation
of all the features.
Figure 4 illustrates that a positive contribution increases churn
while a negative contribution prevents it. The features shown in red increase the
likelihood of churn, with the feature "tenure" dominating, followed by features like
"internet service fibre optic", while features in blue, such as "Contract" and "Total
charges", decrease the likelihood of customer churn.
6 Conclusion
We have designed a multi-layer perceptron that can predict customer churn with a
high accuracy rate of over 80%. Simulations were performed using state-of-the-art
classification methods for customer churn prediction on a publicly available dataset.
Evaluation was carried out with different batch sizes, epochs, numbers of neurons
per layer, initializers, activation functions and optimizers, and in the end the
best result was found, with good prediction accuracy and the factors
responsible for churn clearly highlighted in the analysis. The work presented in this
paper sheds some light on the accuracy and performance obtained with different
settings, ranging over epochs, batch size, number of neurons per layer, activation
function and optimizer. Our model can be used by business organisations to
decide an optimal balance between targeting cost and customer retention. In future
work, additional techniques and parameters will be explored, together with more
detailed and larger data sets from industry, so that the statistical significance of the
results can be maximized and prove useful to industries seeking significant growth
in the competitive market.
References
1. Zhao, J., Dang, X.H.: Bank customer churn prediction based on support vector machine:
taking a commercial bank’s VIP customer churn as the example. In: 2008 4th International
Conference on Wireless Communications, Networking and Mobile Computing, pp. 1–4.
IEEE (2008)
2. Verbeke, W., Martens, D., Mues, C., Baesens, B.: Building comprehensible customer churn
prediction models with advanced rule induction techniques. Expert Syst. Appl. 38(3), 2354–
2364 (2011)
3. Vafeiadis, T., Diamantaras, K.I., Sarigiannidis, G., Chatzisavvas, K.C.: A comparison of
machine learning techniques for customer churn prediction. Simul. Model. Pract. Theory 55,
1–9 (2015)
4. Sharma, A., Panigrahi, D., Kumar, P.: A neural network based approach for predicting
customer churn in cellular network services. arXiv preprint arXiv:1309.3945 (2013)
5. Mozer, M.C., Wolniewicz, R.H., Grimes, D.B., Johnson, E., Kaushansky, H.: Churn
reduction in the wireless industry. In: Advances in Neural Information Processing Systems,
pp. 935–941 (2000)
6. Ngai, E.W., Xiu, L., Chau, D.C.: Application of data mining techniques in customer
relationship management: a literature review and classification. Expert Syst. Appl. 36(2),
2592–2602 (2009)
7. Lemmens, A., Croux, C.: Bagging and boosting classification trees to predict churn. J. Mark.
Res. 43(2), 276–286 (2006)
8. Farquad, M.A.H., Ravi, V., Raju, S.B.: Churn prediction using comprehensible support
vector machine: an analytical CRM application. Appl. Soft Comput. 19, 31–40 (2014)
9. Kianmehr, K., Alhajj, R.: Calling communities analysis and identification using machine
learning techniques. Expert Syst. Appl. 36(3), 6218–6226 (2009)
10. Neslin, S.A., Gupta, S., Kamakura, W., Lu, J., Mason, C.H.: Defection detection: measuring
and understanding the predictive accuracy of customer churn models. J. Mark. Res. 43(2),
204–211 (2006)
11. De Bock, K.W., Van den Poel, D.: Reconciling performance and interpretability in customer
churn prediction using ensemble learning based on generalized additive models. Expert Syst.
Appl. 39(8), 6816–6826 (2012)
12. Jahromi, A.T., Stakhovych, S., Ewing, M.: Managing B2B customer churn, retention and
profitability. Ind. Mark. Manag. 43(7), 1258–1268 (2014)
13. Xia, G.E., Jin, W.D.: Model of customer churn prediction on support vector machine. Syst.
Eng. Theory Pract. 28(1), 71–77 (2008)
14. Zhao, Yu., Li, B., Li, X., Liu, W., Ren, S.: Customer churn prediction using improved one-
class support vector machine. In: Li, X., Wang, S., Dong, Z.Y. (eds.) ADMA 2005. LNCS
(LNAI), vol. 3584, pp. 300–306. Springer, Heidelberg (2005). https://doi.org/10.1007/
11527503_36
306 S. Kumar and M. Kumar
15. Chen, Z.Y., Fan, Z.P., Sun, M.: A hierarchical multiple kernel support vector machine for
customer churn prediction using longitudinal behavioral data. Eur. J. Oper. Res. 223(2),
461–472 (2012)
16. Chouiekh, A.: Machine learning techniques applied to prepaid subscribers: case study on the
telecom industry of Morocco. In: 2017 Intelligent Systems and Computer Vision (ISCV),
pp. 1–8. IEEE (2017)
Virtual Sensor Based on a Deep Learning
Approach for Estimating Efficiency
in Chillers
This work was supported in part by the Spanish Ministerio de Ciencia e Innovacion
(MICINN) and the European FEDER funds under project CICYT DPI2015-69891-C2-
1-R/2-R.
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 307–319, 2019.
https://doi.org/10.1007/978-3-030-20257-6_26
308 S. Alonso et al.
1 Introduction
Energy consumption in large buildings represents more than 20% of the global
energy consumption in developed countries. The proliferation of heating, venti-
lation and air conditioning (HVAC) systems is one of the main reasons behind
such a high consumption [16]. In central air conditioning systems, chillers are the
main energy consumers, consuming more than 40% of the total energy in com-
mercial and industrial buildings [17]. Thus, their efficiencies have a significant
effect on the overall energy performance of these buildings.
Several Energy Efficiency Indicators (EEI) can be used to determine the
chiller efficiency [1]. Most of the EEI require measuring the cooling power deliv-
ered to chilled water (the chiller output). However, manufacturers do not usually
include energy meters in their chiller designs due to high installation, mainte-
nance and recalibration costs [14]. A physical cooling meter can be replaced by
a virtual sensor [13].
Virtual sensors are software, usually based on mathematical models, that measure
process variables or quantities from indirect measurements of related variables. They
are useful where physical sensors are unavailable, expensive, slow or imprecise. They
are a low-cost, non-invasive choice for obtaining observations from a real
system [15] and have been widely applied for flow and efficiency metering in cooling
plants [18,19]. Data-based models for virtual sensors using radial basis functions,
multilayer perceptrons and other machine learning methods can provide accurate
estimations of cooling power [3,11], provided input-output training data are available,
which in cooling systems can be acquired with portable energy meters.
Recent deep learning methods have been used for time-ahead cooling prediction
[4,6]. However, those methods have barely been applied to estimating
the current output of a virtual sensor. Therefore, we suggest that deep learning
methods can achieve more accurate models for virtual sensors of cooling power.
We propose here a virtual sensor based on a deep learning approach for esti-
mating the cooling production and efficiency in chillers. The proposed model,
based on 2D Convolutional Neural Network (2D CNN) [12], is compared to other
state-of-the-art methods and tested on a real air-cooled chiller of the plant at
the Hospital of León achieving better results.
This paper is organized as follows: Sect. 2 states the problem. In Sect. 3, the
adopted methodology is exposed. Here, the proposed deep approach is explained
in detail. In Sect. 4, the experiment is presented and results are discussed. Finally,
conclusions and future work are drawn in Sect. 5.
2 Problem Statement
2.1 Chiller Efficiency
A chiller is an HVAC system in charge of providing cooling energy to building
facilities. Normally, a chiller is formed by a set of refrigeration circuits with
similar or even different capacity in order to achieve a better adaptation to
variable cooling loads (see Fig. 1). Each individual circuit provides cooling power
according to central chiller control. The total cooling production of the chiller is
the sum of the cooling power from each circuit.
Fig. 1. Chiller unit with n refrigeration circuits. Each circuit comprises a compressor
(electric power, KWelec), a condenser (condensing temperature Tc and pressure Pc),
an expansion valve and an evaporator side (evaporating pressure Pe), and delivers
cooling power (KWcooling) between cycle points A and D.
Manufacturers incorporate electric power meters in the chillers since they use
the compressor current for controlling cooling capacity. However, they seldom
include cooling power meters, because it is not essential for capacity control and
increases the cost of the chiller. Moreover, cooling power meters are based on a
flow meter, whose measuring principle is usually invasive, thereby becoming a
new drawback for their installation. Furthermore, chiller units are usually placed
outside, so low temperatures could damage the flow meter. Therefore, virtual
sensor implementation becomes crucial in order to measure the cooling power
and to compute COP value in a chiller. The virtual sensor should be able to
estimate the cooling production and the COP value based on internal variables
of the refrigeration circuits.
Applying the energy conservation equation Q − W = ΔH to the theoretical
refrigeration cycle [10], we have Q_evaporator = ΔH = H_D − H_A. So, cooling
production can be obtained using Eq. 2, and the COP is computed as

COP = Q_evaporator / W_compressor = (H_D − H_A) / (H_A − H_B)    (3)
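Assuming the sign convention of Eq. (3), the COP computation can be sketched as a small helper. The enthalpy values in the example are hypothetical (in kJ/kg), not measurements from the paper:

```python
def cop(h_a, h_b, h_d):
    """COP = Q_evaporator / W_compressor = (H_D - H_A) / (H_A - H_B),
    following the convention of Eq. (3)."""
    return (h_d - h_a) / (h_a - h_b)

# Hypothetical cycle enthalpies in kJ/kg (illustrative only):
example_cop = cop(h_a=280.0, h_b=250.0, h_d=410.0)   # (410-280)/(280-250)
```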
The chiller plant at the Hospital of León has been used in the experimental
setup. Basically, that plant consists of 5 air-cooled and 2 water-cooled chillers.
An air-cooled chiller is used for the experiments.
Table 1. Main internal variables for each refrigeration circuit of the chiller.
The air-cooled chiller (model Petra APSa 400-3) has a maximum cooling
capacity of 400 tons (approximately 1407 kW) and includes 3 refrigeration cir-
cuits (see Fig. 2). Each one is composed of a screw compressor, an electronic
expansion valve (EEV) and a condenser in V form. A common evaporator is
used for the 3 circuits. The compressor, driven by a three-phase induction motor
(400 V; 109 kW), has a maximum displacement of 791 m3 /h of R134a refrigera-
tion gas. Its capacity can be regulated between 50–100% of maximum value by
means of two auxiliary load and unload valves. The condensers have 16 fans of
1.5 kW, driven by variable speed drives. Note that the condensing control signal
is common to the 3 circuits.
The control board regulates the operation of 3 refrigeration circuits. It com-
municates with a BMS (Building Management System) which collects and stores
data from main internal variables (listed in Table 1) using Modbus protocol.
It should be remarked that it is impossible to measure the individual cooling
production of each refrigeration circuit, since the evaporator is common to all the
circuits and only the total cooling production is accessible for measuring. More-
over, the condensing control signal is common for the 3 refrigeration circuits,
so overpressures in one circuit can affect the other two circuits (assuming all
of them are running). Thus, some interactions among the refrigeration circuits
are expected in this chiller. On the other hand, dependencies among variables
are expected since the refrigeration cycle is closed, i.e., suction and compressor
variables will determine the evolution of discharge variables.
3 Methodology
Based on the considerations exposed in Sect. 2, the proposed approach should
take into account the following aspects in order to address the regression
problem:
– A chiller unit can comprise several refrigeration circuits, which provide cooling
energy.
– Cooling energy and efficiency depend on the state of the refrigeration circuits,
which is defined by internal variables (pressures, temperatures and compressor
work).
According to this, a virtual sensor for the overall chiller production (KWcooling)
can be designed using the following regressor:

[Te1, Pe1, Tc1, Pc1, KWelec1, Te2, Pe2, Tc2, Pc2, KWelec2, ..., Ten, Pen, Tcn, Pcn, KWelecn]    (5)
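Assuming the 2D CNN arranges this regressor as an n × 5 "image" with one row per refrigeration circuit and one column per variable (Te, Pe, Tc, Pc, KWelec), the rearrangement can be sketched as follows (the layout is inferred from the input shape (3, 5, 1) described later, not taken from the authors' code):

```python
def regressor_to_image(flat, n_circuits=3, n_vars=5):
    """Arrange [Te1, Pe1, Tc1, Pc1, KWelec1, Te2, ...] as an
    (n_circuits x n_vars) grid: one row per circuit, one column per variable."""
    assert len(flat) == n_circuits * n_vars
    return [flat[i * n_vars:(i + 1) * n_vars] for i in range(n_circuits)]

sample = list(range(15))            # stand-in for the 15 regressor values
image = regressor_to_image(sample)  # 3 rows (circuits) x 5 columns (variables)
```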
[Figure: data preparation and virtual sensor pipeline. BMS logs and ultrasonic
portable meter data are merged into records [Tei, Pei, Tci, Pci, KWeleci | KWcooling]
for each refrigeration circuit i; the virtual sensor (2D CNN, filter kernel (2, 2))
estimates KWcooling, and COP is computed from the estimated KWcooling and the
measured KWelec1 + KWelec2 + KWelec3.]
A downsampling layer is not required since the size of the images is expected
to be small. A nonlinear activation function is used for all units. Then, a fully
connected layer is applied to the resulting units. Finally, an output layer of
dimension 1 provides the cooling estimation as the virtual sensor measurement.
The proposed deep approach is compared with other linear and non-linear
methods used widely in the literature: (a) two linear methods, including Mul-
tiple Linear Regression (MLR) [8] and Random Sample Consensus (RANSAC),
based on selecting uniformly at random a subset of data samples to estimate
model parameters [5]; and (b) several nonlinear methods, including a kernel
method Support Vector Regression (SVR) [2], a shallow Multilayer Perceptron
(Shallow MLP) with just one hidden layer [9], trained using the backpropagation
algorithm, and a deep Multilayer Perceptron (Deep MLP) with many hidden layers
using a special initialization strategy to avoid the vanishing/exploding gradient
problem [7].
4 Results
4.1 Collecting Data
Data have been collected from two sources. First, we have gathered data from
BMS (Building Management System) logs (plant manager subsystem). BMS
stores these data only when they change, in order to optimize storage capacity. The
second data source is an ultrasonic portable meter (Fluxus F601 by Flexim).
It makes up for the lack of a cooling power meter in the chiller. Cooling power is
acquired and stored every minute from the flow and the leaving and return chilled
water temperatures.
Both data sources (CSV format) were preprocessed. For that, data from the BMS
logs were resampled to a 1-min period and then synchronized and merged with the
data from the ultrasonic portable meter.
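One plausible sketch of this synchronization step, under the assumption that the change-driven BMS logs are forward-filled onto the meter's 1-min grid (the paper does not detail the resampling logic, and all values below are toy data):

```python
# Toy timestamped series: BMS logs (stored only on change) and meter
# readings (one per minute). Keys are minutes, values are readings.
bms = {0: 10.0, 3: 12.5, 7: 11.0}          # minute -> last logged BMS value
meter = {t: 100.0 + t for t in range(10)}  # minute -> cooling power reading

def resample_ffill(changes, n):
    """Expand change-driven logs to a 1-min series by forward-filling."""
    out, last = [], None
    for t in range(n):
        last = changes.get(t, last)
        out.append(last)
    return out

def merge(changes, readings, n):
    """Synchronize both sources on the 1-min grid as (bms, meter) pairs."""
    b = resample_ffill(changes, n)
    return [(b[t], readings[t]) for t in range(n)]

merged = merge(bms, meter, 10)
```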
4.2 Experiments
An experiment has been performed to test our approach. Data from air-cooled
chiller no. 5 were selected over 2 months (December 2018 and January 2019),
with a sampling time of 1 min, giving 38169 samples. Note
that this chiller was not running continuously, since its operation
alternated with other chillers in the plant. Five regressor variables are used,
Te , Pe , Tc , Pc , KW elec (see Table 1) for each of the 3 refrigeration circuits, result-
ing in 15 regressor variables. The estimated variable was the cooling power
KW cooling. Total data were split into 2 datasets: The training and test
model dataset (70% of total data) is used to train and test all models. The
proposed approach, 2D CNN, and the remaining linear and non-linear models
are trained using this dataset. A 10-fold cross validation has been applied to
test models and select the best one. The virtual sensing test dataset (30% of
total data) is used to test virtual sensor estimation of cooling power and COP.
In this case, we suppose that the F601 portable meter is disconnected and the
virtual sensor based on the proposed 2D CNN model is used to measure cooling
production and efficiency.
The hyperparameters for each model were tuned after several preliminary
experiments, choosing the best ones in each scenario.
The proposed 2D CNN model consists of 4 layers: an input layer (3, 5,
1), a 2D CNN layer (2, 4, 32), a flatten layer (256) and a dense output layer
(1). A dropout regularization (0.001) was applied to avoid overfitting. The relu
activation function was selected. The number of filters was 32, with a filter kernel of
(2, 2) in order to detect pair-wise patterns among circuits and variables. The
padding was defined as valid, achieving feature maps with a lower dimension.
No downsampling is required due to small size of input images.
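The stated layer sizes follow from standard convolution arithmetic; a small sketch checking them (a generic formula, not the authors' code):

```python
def conv2d_output_shape(h, w, kh, kw, filters, padding="valid", stride=1):
    """Spatial output size of a 2D convolution (no dilation)."""
    if padding == "valid":
        oh, ow = (h - kh) // stride + 1, (w - kw) // stride + 1
    else:  # 'same' padding keeps the spatial size (for stride 1)
        oh, ow = -(-h // stride), -(-w // stride)
    return oh, ow, filters

# Input image (3, 5, 1): 3 circuits x 5 variables; Conv2D, 32 filters, kernel (2, 2).
shape = conv2d_output_shape(3, 5, 2, 2, 32)   # valid padding -> (2, 4, 32)
flat = shape[0] * shape[1] * shape[2]          # flatten layer size -> 256
```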
The SVR model uses a radial basis function kernel with a gamma coefficient of
0.01. The penalty parameter C of the error term was set to 0.001, and the epsilon-tube
within which no penalty is associated in the training loss function was 0.01.
The Shallow MLP model consists of 3 layers: an input layer with a dimension
of 15, a hidden layer (64 units) and an output layer (1 unit). It is trained with the
backpropagation algorithm.
The Deep MLP model consists of 12 layers: an input layer with a dimension
of 15, 10 identical hidden layers (64 units) and an output layer (1 unit). The
dropout regularization was 0.001 to avoid overfitting and relu was selected as
activation function (also for Shallow MLP). For all methods, the training epochs
were 1000.
MAER = 1 − MAE(method_m) / MAE(MLR)
MAPER = 1 − MAPE(method_m) / MAPE(MLR)
These scores show how much a given method improves (positive values) or worsens
(negative values) with respect to the MLR scores (the reference method).
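The relative scores can be sketched as a small helper (the error values in the example are hypothetical, not taken from Table 2):

```python
def relative_improvement(err_method, err_reference):
    """MAER/MAPER-style score: 1 - err(method) / err(reference).
    Positive means improvement over the reference, negative means worse."""
    return 1.0 - err_method / err_reference

# Hypothetical errors: a method with 2.0% MAPE vs an MLR reference of 2.8%.
maper = relative_improvement(2.0, 2.8)   # ~0.286, i.e. ~29% improvement
```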
According to the training errors (see Table 2), Deep MLP is the best method,
improving on MLR by around 67% and on the second-best method (Shallow MLP) by
nearly 13%. Our approach is the third best, also providing very low errors (a MAPE
of 1.85% and an MAE of 8.96 KW).
According to the test errors (see Table 2), and focusing on the MAPE errors, our
approach is the best method, improving on MLR by around 29% and on the second-best
method (SVR) by nearly 13%. The difference with the next methods (Deep MLP and
Shallow MLP) is 14% and 18%, respectively. Checking the MAE errors, our approach
is also the winner. We could thus estimate cooling production and efficiency with a
relative error of 3.04% or an absolute error of 13.24 KW (see Fig. 4). Considering the
standard deviation (±1.23), relative errors can range between 1.81% and 4.27% (best
and worst scenarios). Therefore, we can conclude that the 2D CNN is the best method
to build the virtual sensor.
Fig. 4. Cooling production estimation using the proposed deep approach (2D CNN)
for training, test and virtual sensing datasets.
Once the 2D CNN model was validated, we performed a new test using the virtual
sensing dataset (30% of the total data). Now, we suppose the portable F601
meter is disconnected from the chiller and we verify the virtual sensor
measurement, estimating the cooling production and efficiency. In this case, we
have chosen the MAPE error to assess the estimations.
The results can be observed in Fig. 5. The virtual sensor provides measurements
of cooling power with a relative error of 3.41% and an absolute error of
16.43 KW. It is very accurate, except when a compressor starts or stops.
The final aim of the virtual sensor is to monitor chiller efficiency. For that, the COP
value is computed using the estimated cooling power and the measured electric power
(the sum of the 3 compressor electric powers) for the whole virtual sensing dataset.
The result can be seen in Fig. 6. The chiller efficiency can be estimated with a relative
error of 3.41% and an absolute error of 0.16. Note that the maximum errors occur
when a compressor starts or stops.
Fig. 5. Cooling production estimation using the proposed deep approach (2D CNN)
for virtual sensing dataset.
5 Conclusions
In this paper, we have proposed a virtual sensor for cooling power estimation
based on available internal chiller variables (temperatures, pressures and com-
pressor power) and using a deep convolutional neural network (2D CNN). The
proposed architecture uses a convolutional layer that takes advantage of coher-
ence between the three refrigeration circuits of the chiller, and was systematically
compared to a set of state-of-the-art methods, using several performance metrics
with a 10-fold validation methodology, where our proposed method achieved the
best results.
On the application side, the developed virtual sensor is very valuable for
several reasons. First, its estimations can replace the
measurements from the expensive portable measuring system used for training,
which may only be available provisionally. This methodology can be extended
("copy-pasted") to any chiller, especially to the 5 identical chillers in this plant,
resulting in a highly cost-effective way to track and monitor the overall cooling
power of the plant. Second, the availability of electric power consumption and an
accurate enough estimation of cooling power also yields an estimation of the chiller
efficiency, which is highly valuable for energy optimization of the overall plant.
References
1. Alves, O., Monteiro, E., Brito, P., Romano, P.: Measurement and classification of
energy efficiency in HVAC systems. Energy Build. 130, 408–419 (2016). https://
doi.org/10.1016/j.enbuild.2016.08.070
2. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)
3. Escobedo-Trujillo, B., Colorado, D., Rivera, W., Alaffita-Hernández, F.: Neural
network and polynomial model to improve the coefficient of performance prediction
for solar intermittent refrigeration system. Sol. Energy 129, 28–37 (2016). https://
doi.org/10.1016/j.solener.2016.01.041
4. Fan, C., Xiao, F., Zhao, Y.: A short-term building cooling load prediction method
using deep learning algorithms. Appl. Energy 195, 222–233 (2017). https://doi.
org/10.1016/j.apenergy.2017.03.064
5. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model
fitting with applications to image analysis and automated cartography. Commun.
ACM 24(6), 381–395 (1981)
6. Fu, G.: Deep belief network based ensemble approach for cooling load forecasting
of air-conditioning system. Energy 148, 269–282 (2018). https://doi.org/10.1016/
j.energy.2018.01.180
7. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016). http://www.deeplearningbook.org
8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd
edn. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
9. Haykin, S.S.: Neural Networks and Learning Machines, 3rd edn. Pearson Educa-
tion, London (2009)
10. Klein, S.A.: Design considerations for refrigeration cycles. In: International Refrig-
eration and Air Conditioning Conference, vol. 190, pp. 511–519. Purdue e-Pubs
(1992). http://docs.lib.purdue.edu/iracc/190
11. Kusiak, A., Li, M., Zheng, H.: Virtual models of indoor-air-quality sensors. Appl.
Energy 87(6), 2087–2094 (2010). https://doi.org/10.1016/j.apenergy.2009.12.008
12. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series.
In: The Handbook of Brain Theory and Neural Networks, pp. 255–258. MIT Press,
Cambridge, MA, USA (1998)
13. Li, H., Yu, D., Braun, J.E.: A review of virtual sensing technology and application
in building systems. HVAC&R Res. 17(5), 619–645 (2011). https://doi.org/10.
1080/10789669.2011.573051
14. Mcdonald, E., Zmeureanu, R.: Virtual flow meter to estimate the water flow rates
in chillers. ASHRAE Trans. 120, 200–208 (2014)
15. McDonald, E., Zmeureanu, R.: Development and testing of a virtual flow meter
tool to monitor the performance of cooling plants. Energy Procedia 78, 1129–1134
(2015). https://doi.org/10.1016/j.egypro.2015.11.071
16. Pérez-Lombard, L., Ortiz, J., Pout, C.: A review on buildings energy consump-
tion information. Energy Build. 40(3), 394–398 (2008). https://doi.org/10.1016/j.
enbuild.2007.03.007
17. Saidur, R., Hasanuzzaman, M., Mahlia, T., Rahim, N., Mohammed, H.: Chillers
energy consumption, energy savings and emission analysis in an institutional build-
ings. Energy 36(8), 5233–5238 (2011). https://doi.org/10.1016/j.energy.2011.06.
027. PRES 2010
18. Wang, H.: Water flow rate models based on the pipe resistance and pressure differ-
ence in multiple parallel chiller systems. Energy Build. 75, 181–188 (2014). https://
doi.org/10.1016/j.enbuild.2014.02.017
19. Zhao, X., Yang, M., Li, H.: Development, evaluation and validation of a robust
virtual sensing method for determining water flow rate in chillers. HVAC&R Res.
18(5), 874–889 (2012). https://doi.org/10.1080/10789669.2012.667036
Deep Learning - Convolutional ANN
Canonical Correlation Analysis
Framework for the Reduction of Test
Time in Industrial Manufacturing
Quality Tests
1 Introduction
The industrial manufacturing paradigm shift known as Industry 4.0 and the asso-
ciated modernization in terms of intelligent asset monitoring and data and sensor
fusion are significantly increasing quality control capabilities [1]. As a result, the
prediction of abnormal behaviour in products or manufacturing machines, a cen-
tral pillar of quality control, is profiting from a previously unknown amount of
available data. Combined with vast computational resources, this data enables
the usage of intelligent, data-driven methods such as machine learning for diverse
descriptive, predictive or prescriptive tasks [2,3]. The modernization induced by
the implementation of Industry 4.0 standards has also affected product perfor-
mance tests, which are conducted in order to measure product characteristics
that facilitate quality control.
In a contribution closely related to our work, Fuzzy-Bayesian networks were
used to predict the performance of refrigeration compressors [4]. Due to their
superior predictive power, the information needed for the performance predic-
tion could be inferred from a short excerpt of the original measurement, thus
reducing the necessary test time. Although the used approach is applicable in
a broader context than the one resulting from the experimental dataset con-
sidered by the authors, Bayesian networks feature a series of disadvantages: on
the one hand, learning their structure from data is known to be NP-hard under
specific conditions [5], which for large datasets may be computationally pro-
hibitive. On the other hand, the specification of a prior can be challenging in
domains that still rely heavily on expert knowledge, such as engineering.
Sequential probability ratio tests [6,7] represent another technique which
can be employed for time reduction in quality tests. Starting with a null and an
alternative hypothesis, after each additional observation the decision must be
made whether to accept one of them and thus stop the test, or to proceed with
further monitoring. In its original form, this technique only offers a stopping
rule from which a new necessary, potentially shorter, test time can be derived,
without further investigating the behaviour of the information and noise parts
in the features extracted and selected from the quality test signals.
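As a concrete illustration of such a stopping rule, the sketch below implements Wald's sequential probability ratio test for the mean of i.i.d. Gaussian observations; the error rates and parameter values are illustrative assumptions.

```python
import numpy as np

def sprt(samples, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test, H0: mu = mu0 vs H1: mu = mu1,
    for i.i.d. Gaussian observations with known sigma (illustrative sketch)."""
    lower = np.log(beta / (1 - alpha))    # accept H0 at or below this bound
    upper = np.log((1 - beta) / alpha)    # accept H1 at or above this bound
    llr = 0.0
    for n, x in enumerate(samples, start=1):
        # log-likelihood ratio increment contributed by one observation
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)
        if llr <= lower:
            return "accept H0", n
        if llr >= upper:
            return "accept H1", n
    return "undecided", len(samples)

# Deterministic demo: every observation sits exactly at the H0 mean,
# so the accumulated log-likelihood ratio drifts down by 0.5 per step
decision, n_used = sprt(np.zeros(100), mu0=0.0, mu1=1.0, sigma=1.0)
```

In this demo the lower bound log(0.05/0.95) ≈ −2.94 is crossed after six observations, i.e. the test stops long before all 100 observations are seen — exactly the mechanism by which a shorter necessary test time can be derived.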
In this paper, we propose the combination of Convolutional Neural Networks
(CNNs) and the Canonical Correlation Analysis (CCA) technique [8] as an alter-
native for the potential reduction of the time required for quality tests. The CCA
framework has been used in a wide spectrum of scenarios, such as unsupervised
learning with multiple available views [9], feature learning from a view when a
second view is available during training, but not at test time [10,11] or multi-
view regression [12]. Other recent contributions [13,14] have also applied CCA
in order to increase the interpretability of the representations learned by the
hidden layers of neural networks. In [14], the authors use the CCA framework to
compute new, maximally correlated representations between the encodings of a
hidden layer at a specific time during the training process and the encodings of
the same layer at the final training timestep. Comparing the sorted CCA coeffi-
cients for different timesteps of the training, they observe that by the timestep
the neural network has already reached the final performance, a significant number
of CCA coefficients have not yet converged, concluding that they must belong
to noise. In our work, after using a CNN to identify the earliest possible stopping
time for the quality test, we apply the same approach as the authors in [14] to
investigate the presence of noise in the features extracted from excerpts of the
quality test recording.
The main contributions of this paper help tackle multiple engineering chal-
lenges and can be summarized as follows:
(a) Combination of a CNN and the CCA framework for a potential reduction
of the time needed for industrial manufacturing quality tests.
(b) Application of the technique proposed in [14] to investigate the relationship
between classification-relevant features learned via a CNN from an excerpt of
the quality test recording and the features resulting from the full recording.
(c) Practical contribution by means of a test time reduction of 77.78% for the
vibroacoustical quality test represented by a real-world engineering dataset.
2 Problem Description
Let D denote a dataset consisting of a total of l sensor recordings of common
integer length T and corresponding binary labels y representing the quality of
the tested products. The goal is twofold: first, we analyze whether an excerpt
of the full recording exists which, starting at time 0, contains already enough
information in order to reach the same classification performance as with the
usage of the full recording. Second, we relate the classification-relevant features
from the increasing excerpts and the features from the full recording, which
the test was originally designed to measure. At the same time, we make two
important assumptions: on the one hand, we assume that the dataset D contains
a sufficient number of samples from all known quality issues. On the other hand,
we require the suitability of the hand-crafted features defined by domain experts
for the full recordings. For cases where this assumption cannot be guaranteed, we
propose the usage of a CNN for the feature extraction and selection in Sect. 3.2.
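The increasing excerpts analyzed in the first goal can be constructed as prefixes of each full recording, all starting at time 0. A minimal sketch (the number of excerpts and the toy data are assumptions for the example):

```python
import numpy as np

def increasing_excerpts(recordings, n_excerpts):
    """Split each full recording of length T into n increasing prefixes
    psi_1 ⊂ psi_2 ⊂ ... ⊂ psi_n, each starting at time 0 (sketch)."""
    T = recordings.shape[1]
    bounds = [(i * T) // n_excerpts for i in range(1, n_excerpts + 1)]
    return [recordings[:, :b] for b in bounds]

X = np.arange(2 * 18).reshape(2, 18)   # 2 toy recordings of common length T = 18
excerpts = increasing_excerpts(X, 18)
```

With 18 excerpts, `excerpts[3]` corresponds to ψ4 in the paper's notation and `excerpts[-1]` to the full recording ψ18.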
3 Methodology
The CCA framework can be employed for the identification of mutual informa-
tion in two observation matrices originating from an underlying process by infer-
ring information from the cross-covariance matrices. In the scenario described
326 P. A. Bucur and P. Hungerländer
One of the multiple representations of the solution was offered in [16], where by
defining:
T := Σ_{P1,P1}^{−1/2} Σ_{P1,P2} Σ_{P2,P2}^{−1/2},    (1)

identify the optimal objective value as the sum of the top k singular values of T,
encountered at (Z1∗, Z2∗) = (Σ_{P1,P1}^{−1/2} Uk, Σ_{P2,P2}^{−1/2} Vk), where Uk, Vk are the matrices
of the first k left- and right-singular vectors of T.
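Equation (1) can be implemented directly: a sketch of linear CCA via the singular value decomposition of the whitened cross-covariance matrix (toy data; the eigenvalue floor `eps` is an assumption added for numerical safety):

```python
import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Inverse matrix square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def linear_cca(P1, P2, k):
    """Singular values of T = S11^{-1/2} S12 S22^{-1/2} are the canonical
    correlations; the weights are Z1* = S11^{-1/2} U_k, Z2* = S22^{-1/2} V_k."""
    P1 = P1 - P1.mean(axis=0)
    P2 = P2 - P2.mean(axis=0)
    n = P1.shape[0]
    S11, S22, S12 = P1.T @ P1 / n, P2.T @ P2 / n, P1.T @ P2 / n
    A, B = inv_sqrt(S11), inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(A @ S12 @ B)
    return A @ U[:, :k], B @ Vt.T[:, :k], s[:k]

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 3))                              # shared latent signal
P1 = Z @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(300, 5))
P2 = Z @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(300, 6))
W1, W2, rho = linear_cca(P1, P2, k=3)
```

Because both views share a low-dimensional latent signal, the top canonical correlation is close to one, and the empirical correlation of the projected views reproduces the singular values of T.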
where ρi denotes the correlation coefficient corresponding to the i-th CCA com-
ponent between φq and φg . The development of the distance values between
the distinct φi ∈ Φ encodings and the vectors C and Q offers a very intuitive
visualization of the behaviour of the stable and unstable parts of the feature
encodings. From an engineering viewpoint, it is important to know that the fea-
tures selected for classification from each excerpt contain a common part, which
can be expected to generalize well to previously unseen data samples.
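Equation (2) is not reproduced in this excerpt; the sketch below only illustrates the idea under the assumption that the distance is one minus the mean canonical correlation over the selected components, with made-up coefficient values.

```python
import numpy as np

def mean_cca_distance(rho, component_idx):
    """Distance between two representations restricted to a subset of CCA
    components, taken here as one minus the mean canonical correlation
    (an assumed reading of Eq. (2), which is not reproduced in this excerpt)."""
    return 1.0 - float(np.mean(rho[component_idx]))

rho = np.array([0.99, 0.97, 0.95, 0.40, 0.10])   # illustrative CCA coefficients
stable = [0, 1, 2]     # components whose coefficients converged early (C vector)
unstable = [3, 4]      # late-converging components (Q vector)
d_stable = mean_cca_distance(rho, stable)        # small: stable part stays correlated
d_unstable = mean_cca_distance(rho, unstable)    # large: unstable part loses correlation
```

The gap between the two distances is what makes the stable/unstable split visually intuitive.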
4 Computational Experiments
The experimental dataset employed in this work originates from the daily opera-
tions of ThyssenKrupp Presta AG, a supplier of steering gears. Undesired vibra-
tions in a steering gear and the resulting noise can often be traced back to one
of its subcomponents, the ball nut assembly (BNA). For this reason, the produced
BNA are subject to a vibroacoustical test which consists of steering the
product first left, then right, at two different mean rotational speeds: 300°/s and
500°/s. During each test, the rotational velocity is linearly increased from an
initial 97.5% of the mean speed to a final value of 102.5%. Two accelerometers,
positioned perpendicularly on the product, record the vibrations emanating from
the steering movement with a sampling rate of 25.6 kHz. For the quality assess-
ment, the BNA vibrational signals are transformed to obtain their encoding as
order spectra, a frequency domain representation.
Acoustic domain experts then define a set of non-overlapping zones for the
spectra, each consisting of a left and right border and an upper threshold.
Between the zone borders, the order spectra may not violate the upper thresh-
olds; if the threshold of any zone is violated, the corresponding BNA fails the
quality test. The BNA which receive a positive quality assessment are assigned a
label of 0, while the faulty ones are labeled as 1. The dataset consists of a total of
8424 BNA and is split into a training set consisting of 5616 BNA (172 of which
failed the quality test) and a test set featuring 2808 BNA (86 faulty BNA). In
our setting, the significance of the production order of the BNA hinders a cross-
validation approach: the feature extraction and selection mechanisms shall also
account for possible trends or changes over time in the BNA production. We
thus construct the validation dataset with a stratified selection of the last 20%
of the samples belonging to the training set.
we separately average over the spectrograms with label 1 and those with label
0. Figure 1 depicts the result of the subtraction of the mean spectrogram with
label 0 from the mean spectrogram with label 1.
Fig. 1. Overview of the averaged spectrograms with label 0 (resulting from the full
quality test recording ψn ) subtracted from the averaged spectrograms with label 1.
Interestingly, frequencies can be identified for which the qualitatively superior BNA
feature higher amplitudes than their faulty counterparts. These frequencies are however
either irrelevant for the steering gear vibroacoustic behaviour or diverse mechanisms
ensure their damping.
Furthermore, due to the high data imbalance in which Class 0 has a preva-
lence of 96.94%, we additionally use the Synthetic Minority Oversampling Tech-
nique (SMOTE) [20] to create new, synthetic samples with label 1 for the training
data. We avoid rebalancing the validation dataset, ensuring that the classifier
learns and selects features belonging to the true data distribution. In a last
preprocessing step, we standardize the spectrograms to feature zero mean and
unit variance.
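The oversampling step can be sketched as follows. This is a simplified SMOTE-style interpolation in plain numpy, not the reference implementation (which is provided, e.g., by the imbalanced-learn library); neighbour count and data are illustrative.

```python
import numpy as np

def smote_like(minority, n_new, k=5, rng=None):
    """Simplified SMOTE-style oversampling (sketch): each synthetic sample is
    interpolated between a random minority sample and one of its k nearest
    minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # exclude the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation factor in [0, 1)
        out.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(out)

rng = np.random.default_rng(0)
minority = rng.normal(size=(30, 4))           # e.g. 30 faulty-sample feature vectors
synthetic = smote_like(minority, n_new=100, rng=rng)
```

By construction every synthetic sample lies on a segment between two real minority samples, so it stays within the per-dimension bounds of the minority class.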
The encoding of the vibrational signals as spectrograms paves the way for the
usage of CNNs as a feature extraction, selection and classification mechanism
[21–23], a technique which we also employ. In a first step, the network adds
Gaussian noise with a standard deviation of 0.2 to
the inputs in each batch, aiming to ensure regularization and avoid overfitting.
After a batch normalization step, two identical convolutional blocks follow. Each
block begins with a convolutional layer featuring 50 neurons, a filter spanning
5 units on the frequency- and 3 units on the time axis and ReLU activation.
The convolution is followed by a max-pooling operation, where the max pooling
window spans 2 units on the frequency- and 2 units on the time axis, and by
batch normalization. After a dropout layer which sets 50% of the input units to
0 at each training update, the data is flattened and run through 3 identical fully
connected blocks. Each such block consists of a fully connected layer with 500
neurons and ReLU activation, followed by batch normalization and dropout with
a fraction of 70% of the units set to 0 at each training iteration. In a penultimate
step another fully connected layer is applied, featuring 200 neurons and ReLU
activation. The encoding of this layer is the result of the feature extraction and
selection step and thus represents the features φi ∈ Φ, the random vector used
as input to the linear CCA as described in Sect. 3.2. In this work, we chose not
to consider the features extracted and selected by domain experts as φn , but use
the same approach as for the previous feature sets φi ∈ Φ. Choosing n = 18, we
compute a total of 150 CCA components between each φi , where i ∈ [1, 17], and
the final φ18 . A final fully connected layer featuring softmax activation maps the
encoding of the previous layer to 2 units which represent the two quality classes.
The training process uses the Adam optimizer with a learning rate of 0.0001
and the binary crossentropy as loss, a batch size of 50 and 10 training epochs.
Apart from direct architectural choices such as the number of convolutional and
fully connected blocks and their layer structure or the choice of the optimizer and
the loss, the rest of the described quantities were treated as hyperparameters
and optimized using random search. A schematic overview of the employed CNN
is offered in Fig. 2.
Due to the high data imbalance and the resulting no-information rate, together with
acoustic domain experts we chose Cohen’s Kappa [24] as a single-value
metric to assess the classification performance. The classification results of the
φi ∈ Φ features corresponding to the different ψi ∈ Ψ excerpts are depicted for
the training and test data in Fig. 3. The number of true positives, false positives,
true negatives and false negatives associated with the Cohen Kappa value of
0.44 obtained for φ18 on the test data is 2722, 61, 25 and 0, respectively. Despite
multiple regularization techniques, such as dropout or stopping the training early
as soon as the loss on the validation data begins to increase again, we still
observe a significant gap between the performance on the training
data, on the one hand, and validation and test data, on the other hand. Since
the performance on the validation data is almost identical to the test data, we
trace the performance gap back to three origins. First, the noisiness of the quality
labels is a known problem, since the binary classification is a simplification of the
reality, where different quality nuances exist. Second, the industrial production
processes change with time due to a plethora of reasons, leading to shifts in
the data distribution. Third, it is also possible that the created synthetic data
samples do not bear enough resemblance to the validation and test data.
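The reported Kappa of 0.44 can be reproduced from the stated confusion counts (a quick check; here the non-faulty class is treated as the positive class, consistent with the counts above):

```python
# Confusion counts reported for phi_18 on the test data (non-faulty class
# taken as the positive class): TP, FP, TN, FN
tp, fp, tn, fn = 2722, 61, 25, 0
total = tp + fp + tn + fn                 # 2808 test samples

p_observed = (tp + tn) / total            # observed agreement
# chance agreement computed from the prediction/label marginals
p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
kappa = (p_observed - p_expected) / (1 - p_expected)
```

Despite an observed agreement of about 0.978, the high no-information rate pushes the chance agreement to about 0.961, which is why Kappa lands near 0.44 rather than near 1.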
Fig. 3. The classification performance of the CNN with respect to the Cohen Kappa
metric for the training and test data. By ψ4 , the CNN achieves the same classification
performance as with ψ18 , the full recording.
Despite the performance gap, we observe that for ψ4 , the classification per-
formance of the features φ4 extracted and selected by the CNN already matches
the final performance obtained with the usage of φ18 . We thus identify an early
stopping time of the test of ι = 4, which implies that the test duration could
be reduced by 77.78%. In Fig. 4, we observe that by this time, many of the
CCA coefficients have not yet converged to their final values, indicating that the
corresponding components potentially represent noise in the encodings φi ∈ Φ.
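The reported test-time reduction follows directly from the stopping index among the equally long excerpt steps:

```python
# Early stopping at excerpt psi_4 out of n = 18 equally long excerpt steps
iota, n = 4, 18
reduction_percent = (1 - iota / n) * 100   # fraction of the test time saved
```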
The development of the distance measure introduced in Eq. 2 for the increas-
ing excerpts ψi ∈ Ψ is depicted in Fig. 5 for ξ = 150 CCA components computed
between φ2 and φ10 and a length of η = 10 of the C and Q vectors.
We observe that, as expected, the CCA components which stabilize early
remain stable, with the unstable ones rapidly losing correlation. This implies
that the features φi ∈ Φ contain a common part which is fully learned from each
excerpt: most probably, this part of the encoding will also generalize well. Since
Fig. 4. Development of each CCA component’s correlation coefficient between the fea-
tures extracted and selected from the increasing excerpts ψi ∈ Ψ and from the full
recording, ψn . By ψ4 , the classification performance of the CNN has already converged;
yet, many CCA coefficients have not converged and still continue to change.
Fig. 5. Development of the CCA distance between the stable (respectively unstable)
parts of the φi ∈ Φ features and the complete representation φi ∈ Φ, computed using
the mean CCA coefficient as described in Eq. 2. While the stable part maintains a high
correlation to the encoding φi ∈ Φ in larger excerpts, the unstable part rapidly loses
correlation and remains that way.
the rest of the encoding is not stable across the increasing excerpts, it likely does
not generalize to unseen data samples.
6 Conclusion
References
1. Dalenogare, L.S., Benitez, G.B., Ayala, N.F., Frank, A.G.: The expected contri-
bution of Industry 4.0 technologies for industrial performance. Int. J. Prod. Econ.
204, 383–394 (2018)
2. Diez-Olivan, A., Del Ser, J., Galar, D., Sierra, B.: Data fusion and machine learning
for industrial prognosis: trends and perspectives towards Industry 4.0. Inf. Fusion
50, 92–111 (2019)
3. Tao, F., Qi, Q., Liu, A., Kusiak, A.: Data-driven smart manufacturing. J. Manuf.
Syst. 48, 157–169 (2018)
4. Penz, C.A., Flesch, C.A., Nassar, S.M., Flesch, R.C., De Oliveira, M.A.: Fuzzy
Bayesian network for refrigeration compressor performance prediction and test
time reduction. Expert. Syst. Appl. 39(4), 4268–4273 (2012)
5. Chickering, D.M., Heckerman, D., Meek, C.: Large-sample learning of Bayesian
networks is NP-hard. J. Mach. Learn. Res. 5, 1287–1330 (2004)
6. Wald, A.: Sequential tests of statistical hypotheses. Ann. Math. Stat. 16(2), 117–
186 (1945)
7. Wald, A., Wolfowitz, J.: Optimum character of the sequential probability ratio
test. Ann. Math. Stat. 19(3), 326–339 (1948)
8. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377
(1936)
9. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: an
overview with application to learning methods. Neural Comput. 16(12), 2639–2664
(2004)
10. Arora, R., Livescu, K.: Kernel CCA for multi-view learning of acoustic features
using articulatory measurements. In: Symposium on Machine Learning in Speech
and Language Processing (2012)
11. Bucur, P.A., Frick, K., Hungerländer, P.: Quality classification methods for ball
nut assemblies in a multi-view setting. Optimization online e-prints, eprint 2018-
09-6796 (2018)
12. Kakade, S.M., Foster, D.P.: Multi-view regression via canonical correlation analy-
sis. In: Bshouty, N.H., Gentile, C. (eds.) COLT 2007. LNCS (LNAI), vol. 4539, pp.
82–96. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72927-3 8
13. Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: SVCCA: singular vector
canonical correlation analysis for deep learning dynamics and interpretability. In:
Advances in Neural Information Processing Systems, pp. 6076–6085 (2017)
14. Morcos, A.S., Raghu, M., Bengio, S.: Insights on representational similarity in
neural networks with canonical correlation. arXiv e-prints, arXiv:1806.05759 (2018)
15. Galen, A., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis.
In: International Conference on Machine Learning, pp. 1247–1255 (2013)
16. Bibby, J.M., Kent, J.T., Mardia, K.V.: Multivariate Analysis. Academic Press,
London (1979)
17. Janssens, O., et al.: Convolutional neural network based fault detection for rotating
machinery. J. Sound Vib. 377(Suppl. C), 331–345 (2016)
18. Yildirim, Ö., Plawiak, P., Tan, R.-S., Acharya, U.R.: Arrhythmia detection using
deep convolutional neural network with long duration ECG signals. Comput. Biol.
Med. 102, 411–420 (2018)
19. Jing, L., Zhao, M., Li, P., Xu, X.: A convolutional neural network based feature
learning and fault diagnosis method for the condition monitoring of gearbox. Mea-
surement 111, 1–10 (2017)
20. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
21. Costa, Y.M., Oliveira, L.S., Silla Jr., C.N.: An evaluation of convolutional neural
networks for music classification using spectrograms. Appl. Soft Comput. 52, 28–38
(2017)
22. Choi, K., Fazekas, G., Sandler, M.: Explaining deep convolutional neural networks
on music classification. arXiv preprint arXiv:1607.02444 (2016)
23. Wu, Y., Mao, H., Yi, Z.: Audio classification using attention-augmented convolu-
tional neural network. Knowl.-Based Syst. 161, 90–100 (2018)
24. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas.
20(1), 37–46 (1960)
Convolutional Neural Network for Detection
of Building Contours Using Multisource
Spatial Data
1 Introduction
progress has been gradually achieved over the last decades. Early methods were
severely constrained due to their reliance on a generic model which assumed that
buildings follow a certain pattern and thus failed to provide reliable results when
applied to varied urban environments [1]. Other early attempts used shadow data
combined with 2D building blobs derived from digital elevation data [2]. Unfortu-
nately, such models were hampered due to low-resolution ground sampling data,
occlusions and shadows. Some researchers have used photogrammetric techniques
based on stereoscopic images, with several of these methods using optical images
while others used elevation data. Examples of the former category are Lang and
Förstner [3] and Fraser et al. [4], who reconstructed 3D buildings from high-resolution
IKONOS stereo imagery. Airborne laser scanning equipment became more reliable and
refined during the late 1990s and early 2000s, thus becoming an important source of
obtaining digital surface models (DSM). Maas and Vosselman developed various
approaches to detecting building contours using DSMs [5]. Fusing optical and eleva-
tion data was the next logical step; thus, Haala [6] combined DSMs with optical images
in order to extract buildings and trees in an urban environment. There also exist
probabilistic methods that accept DEMs and follow an object-based approach. Lafarge et al.
[7] used marked point processes to roughly approximate building contours via rect-
angular structures. These rectangular footprints were then regularized by taking into
account the local context of each rectangle and detecting roof height discontinuities.
Descombes and Zerubia [8] altered the previous method by introducing an energy
function which takes into account the height of the building as well as prior knowledge
about the general layout of buildings in urban settings. Simulated annealing was then
employed in order to minimize the energy function. Still other researchers used
observed point clouds from LIDAR data. Rottensteiner et al. [9] separated points being
on the ground from those belonging to buildings and other objects. This was accom-
plished by an analysis of the height differences of a digital surface model passing
through the original LIDAR points and a digital terrain model. In recent years,
there has been an increasing number of contributions that apply deep convolutional
neural networks to applications that return a whole image instead of the category of the
presented data. For instance, Dong et al. [10] used a three-layer convolutional network,
named SRCNN, to learn a direct mapping between low and high resolution images.
This mapping was represented by a deep convolutional neural network that took the
low-resolution image as input and returned a high-resolution version of the image. The
results were comparable or even better than well-established sparse coding dictionary
methods. In our application we propose a deep convolutional neural network that can
directly detect building contours. Due to the nature of the work in [10] which exhibits
several features that were considered akin to our application, we applied a modified
version of this network as the basis of our own network. In our case the modified
SRCNN, which we have named BCDCNN (Building Contour Detector Convolutional
Neural Network), accepts a tuple of available data in the form <[optical, DEM],
GT>, which comprises an optical and a DEM input pair along with the corresponding
ground truth.
Fig. 1. (a) Optical (grayscale) channel input, (b) DEM channel input, and (c) Ground truth data.
The rest of the paper is organized as follows. Section 2 briefly discusses the
architecture of BCDCNN. Section 3 describes the experiments that were conducted by
tweaking the various parameters of the proposed network architecture, in order to
investigate the system’s performance under various setups. Section 4 presents the
results of our experiments and discusses the efficiency of the proposed method while
Sect. 5 draws the conclusions.
2 Methodology
¹ These data constitute the first variation of the training set, as explained in the Methodology section.
338 G. Papadopoulos et al.
F1(X) = max(0, W1 ∗ X + B1)    (1)
Figure 2 illustrates the proposed network architecture in which the input to the
network, the optimal output and the size of the convolution kernels applied at each
layer are shown.
The goal is to get a building contour map FðYÞ which is as close as possible to the
ground truth. However, unlike classification type of applications in which the training
procedure associates input images to, usually, a few class labels, the proposed system is
presented with a far more difficult and challenging problem, that is, learning a
heteroassociative mapping from a quite limited training set of <input, output> pairs
and then expecting to generalize on new pairs of building top-view images. Further
elaborating on the complexity of our data sources there are four different types of edges
that the network must learn to differentiate.
• Elevation edges that are simultaneously optical edges, which is mostly the case.
• Optical edges that are not elevation edges: For instance, rooftops of neighboring
buildings of different colors but same heights.
• Elevation edges that are not optical: For example, a rooftop of the same color as an
adjacent street and at different heights.
• Implied edges: For instance, rooftops with the same color and same height. This is
the most difficult case.
Just to make the problem even more difficult, the available elevation data – carrying
most of the building contours information – are at a five times lower spatial resolution
than the optical images and the associated building contours. Hence, the proposed CNN
architecture is actually performing a combination of elevation data super-resolution
assisted by available high-resolution optical images and a heteroassociative mapping to
building contours.
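Because the elevation channel has a five times lower resolution, it must first be brought onto the optical grid. The paper uses mean shift upsampled DEM data; the sketch below shows only the simpler nearest-neighbour upsampling as an illustrative stand-in.

```python
import numpy as np

def upsample_nearest(dem, factor=5):
    """Nearest-neighbour upsampling: replicate each DEM cell into a
    factor x factor block (illustrative stand-in for mean shift upsampling)."""
    return np.kron(dem, np.ones((factor, factor), dtype=dem.dtype))

dem = np.arange(12, dtype=float).reshape(3, 4)  # toy low-resolution elevation grid
hi = upsample_nearest(dem)                      # 15 x 20 grid on the optical raster
```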
Fig. 3. (a) Optical, (b) DEM, and (c) Ground truth training data from BLOCK2.
Fig. 4. (a) Optical, (b) DEM, and (c) Ground truth test data.
Furthermore, we used three variations of the training data. More specifically, the
first variation is comprised of the original optical data and the mean shift upsampled
DEM data (Figs. 1, 3 and 4). Moving on to the second variation, the optical channel
has also been filtered with the mean shift edge preserving smoothing algorithm [12]
with the so filtered BLOCK1 been shown in Fig. 5a. Finally, in the third variation the
mean shift optical & DEM data have been filtered by a Laplacian of Gaussian
(LoG) operator (see Figs. 5b and c). The last two variations have been considered as an
attempt to reduce the effective dimensionality of the input data and improve the gen-
eralization ability of the proposed system.
Fig. 5. Filtered BLOCK1 data: (a) Mean shift filtered optical channel, (b) LoG-filtered optical
image, and (c) LoG-filtered DEM.
Although test data for the last two variations are not shown, the same mean shift
and LoG processing has also been applied to the test dataset in all
experiments performed on variations 2 or 3.
reconstructed image that correspond to background and have close to zero or negative
values, could be set aside from the derivative computations of the back-propagation
phase, e.g. by setting them to zero. On the other hand, neuron outputs wrongly close to
1 should play a role in the back-propagation phase in order to be pushed down to lower
values. A second point we can make regarding weight adaptation in this application is
that all output neurons share the same weights and that these weights should be given a
chance to adapt in such a way as to satisfy conflicting demands: to push some output
neurons to 1 and other neurons to 0. Since the proportion of 1-pixels is much smaller
than that of 0-pixels, it is expected that the shared weights will prioritize minimizing
the error of the “many” background pixels instead of the “few” contour pixels. This
comment highlights network training difficulties in heteroassociative mappings that
arise due to unequal pixel-class probabilities and resembles the necessity for class-
balanced datasets in classification problems. In order to balance the weight adaptation
process to serve equally well the contour and non-contour pixels, we propose to
substitute the typical RMSE cost criterion that involves all neuron outputs of the
reconstruction layer by a novel custom cost layer which we have named Top-N. Under
this scheme the RMSE between the reconstructed image and the corresponding GT is
calculated only for those pixels that belong to the 2N pixels with highest values, where
N is the number of contour pixels in GT. Assuming that most of the N contour pixels of
the ground truth image are also in the top 2N pixels of the reconstruction, this scheme
satisfies the imposed balancing criterion.
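The Top-N criterion can be sketched as follows. This is a minimal NumPy illustration of the described idea, not the authors' implementation; it assumes contour pixels can be counted as the non-zero entries of the GT image:

```python
import numpy as np

def top_n_rmse(recon, gt):
    """RMSE restricted to the 2N highest-valued pixels of the reconstruction,
    where N is the number of contour (non-zero) pixels in the ground truth."""
    n = int((gt > 0).sum())            # N: contour pixels in GT
    flat_r, flat_g = recon.ravel(), gt.ravel()
    idx = np.argsort(flat_r)[-2 * n:]  # indices of the top-2N reconstruction pixels
    diff = flat_r[idx] - flat_g[idx]
    return float(np.sqrt(np.mean(diff ** 2)))
```

A perfect reconstruction yields zero cost, while errors concentrated on the brightest (contour-candidate) pixels are penalised without being swamped by the many background pixels.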
Fig. 6. (a) A low-level reconstruction of the test image, (b) the corresponding GT, (c) pdf and
cdf of intensity levels and Top-N threshold, and (d) Top-N version of the reconstruction.
In practical terms, the threshold used to specify the top 2N pixel values is calculated
as follows: we compute the probability distribution function and cumulative distri-
bution function of the intensity levels for each image used during training and retain
only the pixels whose intensity lies above that threshold². This is depicted in Figs. 6
(a)–(d), which show a low-quality reconstruction of the test data, the corresponding
ground truth, the Top-N threshold calculated as the percentage of pixels above the Top-
N intensity, and the Top-N version of the reconstruction, respectively.
² Actually, we use the average value of the intensity level for a whole batch in order to accelerate the
computation procedure.
342 G. Papadopoulos et al.
3 Experiments
3.1 Training Set Preparation
To satisfy the requirement of large numbers of training data to properly train deep
neural networks we performed data augmentation [13]. Firstly, we extracted 33×33
patches of the input data (optical + DEM) along with the corresponding 21×21
patches of the GT data (GT patches are smaller due to "valid" convolutions with 9×9,
1×1 and 5×5 kernels). The data were then augmented with rotations at multiples of
90° and with their vertical flips. In this manner we constructed tuples of input data and
GT in the form <[optical_section, DEM_section], GT_section>. The procedure
described in the following sections was followed for each of the three variations of our
training data set (original, Mean-Shift processed, LoG processed).
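The augmentation step (rotations at multiples of 90° plus vertical flips, giving eight variants per patch) can be sketched as follows; the random arrays stand in for the actual optical/DEM and GT patches:

```python
import numpy as np

def augment(patch):
    """All rotations at multiples of 90 degrees plus their vertical flips (8 variants)."""
    out = []
    for k in range(4):
        r = np.rot90(patch, k)
        out.append(r)
        out.append(np.flipud(r))
    return out

# Hypothetical 33x33 input patch and the corresponding 21x21 ground-truth patch.
opt_patch = np.random.rand(33, 33)
gt_patch = np.random.rand(21, 21)

# Tuples in the form <input_section, GT_section>, augmented consistently.
samples = list(zip(augment(opt_patch), augment(gt_patch)))
```

Because the same transform is applied to input and GT, each augmented pair remains spatially consistent.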
4 Experimental Results
All presented results pertain to the 9-1-5 or 9-3-5 convolution kernel choices and to the
64-32-1 feature-map configuration, i.e. the number of feature maps at the output of each
convolutional layer. We also conducted several tests to assess how the number of
feature maps affects the performance of the network, experimenting with
networks of 128-64-1 and 256-128-1 feature-map configurations. However, even
though performance increased (the RMSE for the Original data set at epoch 60
decreases from 3.4048 to 3.3384 and then to 3.2077 for the larger configurations), the
heavy computational cost prohibited their use in the sequel.
³ 9×9, 1×1, 5×5 for the first, second and third layer respectively.
⁴ 9×9, 3×3, 5×5 for the first, second and third layer respectively.
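The stated patch sizes are consistent with "valid" convolution arithmetic, which a few lines verify for the 9-1-5 kernel choice (the 64-32-1 feature-map counts do not affect spatial size):

```python
def valid_out(size, kernels):
    """Output spatial size after a chain of 'valid' (no-padding, stride-1) convolutions."""
    for k in kernels:
        size = size - k + 1
    return size

# A 33x33 input through 9x9, 1x1 and 5x5 kernels yields 21x21,
# matching the ground-truth patch size quoted in the text.
assert valid_out(33, (9, 1, 5)) == 21
```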
CNN for Detection of Building Contours Using Multisource Spatial Data 343
Table 2. RMSE and PSNR for test data trained on the Mean Shift processed dataset.
Loss layer      | Dropout 50%         | Dropout 50%–50%     | No Dropout
                | RMSE min  PSNR max  | RMSE min  PSNR max  | RMSE min  PSNR max
Top-N (9-1-5)   | 0.10423   15.257    | 0.10549   15.079    | 0.10486   15.067
Top-N (9-3-5)   | 0.10263   15.263    | 0.10343   15.263    | 0.10865   15.283
MSE (9-1-5)     | 0.10833   14.903    | 0.10977   14.807    | 0.10926   14.832
MSE (9-3-5)     | 0.10817   14.912    | 0.10912   14.841    | 0.10923   14.842
Table 3. RMSE and PSNR for test data trained on the LoG processed dataset.
Loss layer      | Dropout 50%         | Dropout 50%–50%     | No Dropout
                | RMSE min  PSNR max  | RMSE min  PSNR max  | RMSE min  PSNR max
Top-N (9-1-5)   | 0.10948   14.831    | 0.10811   14.929    | 0.10779   15.127
Top-N (9-3-5)   | 0.10608   15.031    | 0.10637   15.106    | 0.10763   15.148
MSE (9-1-5)     | 0.11005   14.647    | 0.11314   14.489    | 0.11093   14.803
MSE (9-3-5)     | 0.10955   14.005    | 0.10988   14.746    | 0.10839   14.824
Fig. 7. (a) Reconstruction of train data for Original data set and Top-N Cost Layer,
(b) Reconstruction of train data for Original data set and MSE Cost Layer
Our network can also generalize, as shown by the reconstructions of the test data for
networks trained on the three variations with the proposed Top-N and the MSE loss layers.
Fig. 8. Top row: PSNR for test data on network trained for (a) Original data and Top-N cost
layer, (b) Mean Shift data and Top-N cost layer, (c) LoG data and Top-N cost layer. Bottom row:
Reconstruction of test data at PSNR peak for (d) Original data and Top-N cost layer, (e) Mean
Shift data and Top-N cost layer, (f) LoG data and Top-N cost layer.
According to Fig. 8a, the highest PSNR for the test data set occurred at epoch 55 for the
50% dropout case. We thus located the peak of each PSNR curve and reconstructed
at that specific epoch. This process was repeated for all training-data variations, and
the resulting reconstructions are shown in Figs. 8d through f. The corresponding
experiments for the MSE cost layer are shown in Fig. 9. It should be noted that we did
not employ any post-processing stage to improve the obtained building contours, as this
will be handled by a relaxation system currently under development. From Figs. 8 and
9 we readily observe that deciding how to improve the generalization ability of
the network is not straightforward. Perhaps one can say that when the effective input
dimensionality is high (i.e. when the variance of the input pixel values is large), as is the
case for the Original data sets, the network exhibits poor generalization behaviour (see
the blue curves of Figs. 8a and 9a). As the effective dimensionality is progressively
reduced through the smoothing imposed by the Mean Shift and LoG prepro-
cessing, the generalization ability of the network improves and, in the case of LoG,
even surpasses the cases that use dropout in one or two layers.
Fig. 9. Top row: PSNR for test data on network trained for (a) Original data and MSE cost
layer, (b) Mean Shift data and MSE cost layer, (c) LoG data and MSE cost layer. Bottom row:
Reconstruction of test data at PSNR peak for (d) Original data and MSE cost layer, (e) Mean
Shift data and MSE cost layer, (f) LoG data and MSE cost layer.
A second remark is that with 50% dropout on one or two layers
the network is more resistant to overfitting. Specifically, for training data sets with
relatively high effective dimensionality, as is the case with the Original and Mean Shift
processed data sets, dropout (in either one or two layers) proves to be the necessary
choice for network generalization. Finally, in accordance with the comparative results of
Tables 1, 2 and 3, a comparison of Figs. 8 and 9 shows a slight improvement
in PSNR under every training-data variation when using the Top-N cost layer.
We have demonstrated a deep neural network configuration that, given low-resolution
elevation data of an urban area and corresponding high-resolution optical data of the
same area, can perform a heteroassociative mapping to a new image containing the
building contours. Our proposed Top-N custom layer appears to offer performance
benefits, with both RMSE and PSNR improving for the Top-N layer
as opposed to the MSE cost layer. We also examined the effect of adding more feature
maps, and we have shown that dropout is in most cases necessary for the model to
generalize. It is very interesting that training with the LoG data set was the only
case in which the network managed to generalize without dropout (Figs. 8c and
9c), presumably as a result of the reduced dimensionality. The problem we tried to solve
using deep neural networks is extremely complex due to the varying context around true
building contours in an urban environment. We conjecture that, given more training data,
the performance of the network would increase, but hand-crafting such ground-truth data is
a very tedious and time-consuming procedure. In the near future we intend to build a pixel-
based contour detector and a super-resolution system for digital elevation models
(DEM) capable of increasing the resolution of such data by a factor of at least five.
References
1. Mason, S., Baltsavias, E.: Image-based reconstruction of informal settlements. In: Gruen, A.,
Baltsavias, E.P., Henricsson, O. (eds) Automatic Extraction of Man-Made Objects from
Aerial and Space Images (II). Monte Verità (Proceedings of the Centro Stefano Franscini
Ascona), pp. 97–108. Birkhäuser, Basel (1997). https://doi.org/10.1007/978-3-0348-8906-
3_10
2. Li, J., Ruther, S.H.: IS-Modeller: a low-cost image-based tool for informal settlements
planning. In: Geoinformatics 1999 Conference, Ann Arbor (1999)
3. Lang, F., Förstner, W.: 3D-city modeling with a digital one-eye stereo system. In:
Proceedings of the XVIII ISPRS Congress (1996)
4. Fraser, C.S., Baltsavias, E., Gruen, A.: Processing of Ikonos imagery for submetre 3D
positioning and building extraction. ISPRS J. Photogramm. Remote Sens. 56(3), 177–194
(2002)
5. Maas, H.-G., Vosselman, G.: Two algorithms for extracting building models from raw laser
altimetry data. ISPRS J. Photogramm. Remote Sens. 54(2-3), 153–163 (1999)
6. Haala, N., Brenner, C.: Extraction of buildings and trees in urban environments.
ISPRS J. Photogramm. Remote Sens. 54(2-3), 130–137 (1999)
7. Lafarge, F., et al.: Automatic building extraction from DEMs using an object approach and
application to the 3D-city modeling. ISPRS J. Photogramm. Remote Sens. 63(3), 365–381
(2008)
8. Ortner, M., Descombes, X., Zerubia, J.: Building outline extraction from digital elevation
models using marked point processes. Int. J. Comput. Vis. 72(2), 107–132 (2007)
9. Rottensteiner, F., Briese, C.: A new method for building extraction in urban areas from high-
resolution LIDAR data. In: International Archives of Photogrammetry Remote Sensing and
Spatial Information Sciences, vol. 34, No. 3/A, pp. 295–301 (2002)
10. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional
networks. IEEE Trans. Pattern Anal. Mach. Intell. 38, 295–307 (2016)
11. Vassilas, N., Tsenoglou, T., Ghazanfarpour, D.: Mean shift-based preprocessing method-
ology for improved 3D buildings reconstruction. WASET Int. J. Civ. Environ. Struct.
Constr. Architectural Eng. 9(5), 575–580 (2015). (also in Proc. ICCVISP 2015, Berlin,
Germany)
12. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE
Trans. Pattern Anal. Mach. Intell. 24(5), 603–619 (2002)
13. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
Fuzzy - Vulnerability - Navigation
Modeling
A Meta-multicriteria Approach to Estimate
Drought Vulnerability Based on Fuzzy
Pattern Recognition
1 Introduction
where J is the number of adaptive capacity criteria, and μ_j and w_j are the mem-
bership function and the weight of each adaptive capacity criterion j, respectively.
sensitivity and low adaptive capacity); and (4) high sensitivity and high adaptive
capacity simultaneously (problem and action).
By applying the fuzzy-pattern-recognition-based method, each
alternative (country) is evaluated jointly across all four categories. Following the fuzzy
pattern recognition process, for each country the sum of the four evaluations (one for
each category) is equal to one. Another interesting point is that the categories
are non-ordered, since a combination of a positive (adaptive capacity) and a
negative (sensitivity) criterion is used.
In general, consider a classification problem with N alternatives (here, the number of
examined countries) and M criteria (in the examined application only two criteria are
used, M = 2). The classification problem may be concisely expressed as follows:
354 M. Spiliotis et al.
D = \begin{bmatrix} x_{11} & \cdots & x_{1M} \\ \cdots & x_{im} & \cdots \\ x_{N1} & \cdots & x_{NM} \end{bmatrix}    (3)
where D is the matrix containing the score of each criterion with respect to each
alternative (here, the countries); that is, x_{im} is the score of alternative i for
criterion m.
Fig. 1. (a) Synthesis between the sensitivity and the adaptive capacity criteria and definition
of the corresponding four (non-ordered) categories; (b) the usual approximation with ordered
categories.
As the elements of the matrix of Eq. 3 normally contain entries of different orders,
scales and importance, it is common to convert them to a common base, expressed in
the form of a relative membership degree matrix R, by adopting one of the standard
normalisation procedures, depending on the benefit objectives (Xuesen et al. 2009).
Very often the ideal and anti-ideal patterns (here, fictitious countries with ideal and
anti-ideal scores) are selected to be (e.g. Zhou et al. 1999) G = (1, …, 1) and
B = (0, …, 0), respectively. In our case, instead of only two points (the ideal and anti-ideal
points), four categories, and hence four points, are used.
The proposed method has a significant difference compared with the other applied
fuzzy multicriteria methods based on pattern recognition (e.g. Zhou et al. 1999; Xuesen
et al. 2009), since in the proposed method the synthesis of criteria with different
monotonicity takes place and hence the evaluation leads to non-ordered categories.
Let us consider the k-th pattern. The evaluation of criterion m for pattern k is
written as v_{k,m}. The most widely used measure of distance is the following:

d_{k,i}\left(P^{k}, r_i\right) = \left[ \sum_{m=1}^{M} \left( w_m \left( r_{i,m} - v_{k,m} \right) \right)^{2} \right]^{1/2}    (6)
The next critical concept is the membership degree μ_{k,i}, which indicates the relative
membership degree of alternative i belonging to pattern k. The membership degree μ_{k,i}
takes into account the score of each country (alternative i) compared with all categories,
not only the examined one. The membership degree matrix of each country belonging to
each category is U = [μ_{k,i}]_{K×N}, where μ_{k,i} is the membership degree of country i
belonging to pattern k. A basic property that the matrix U must satisfy is that the sum of
the membership values for each alternative under all patterns (here k = 1, …, 4) is equal
to one (e.g. Shouyu and Guangtao 2003):
\sum_{k=1}^{K} \mu_{k,i} = 1    (7)
In general, the methodology of fuzzy sets comprises a mapping from a general set X
to the closed interval [0,1] which is described by its membership function (Chrysafis
and Papadopoulos 2009; Bardossy et al. 1990). Therefore the above constraint can be
easily achieved since the membership function takes values between zero and one.
\mu_{k,i} = \frac{1}{\sum_{l=1}^{K} \left( \dfrac{d_{k,i}\left(P^{k}, r_i\right)}{d_{l,i}\left(P^{l}, r_i\right)} \right)^{2}}    (9)
Only two criteria are initially established in this article: the adaptive capacity criterion
and the sensitivity criterion. Furthermore, apart from the ideal and anti-ideal patterns
used in the TOPSIS method, two additional categories are proposed to better describe
vulnerability to drought. From this point of view, moderate states are not
essential in the multicriteria evaluation since they cannot lead to a roadmap. Each category
is described by the corresponding pattern as follows (Fig. 1a):
(a) Pattern of first category: ideal point (0, 1)
(b) Pattern of second category: anti-ideal point (1, 0)
(c) Pattern of third category: no problem-no action (0,0)
(d) Pattern of fourth category: high sensitivity and high adaptive capacity simulta-
neously (1,1).
Here the first score expresses the degree of sensitivity and the second the
degree of adaptive capacity. Therefore, according to the proposed methodology, M = 2
(number of criteria) and K = 4 (number of categories).
An important property of the proposed method is that the sum of the membership
values for each country under all the four patterns of the corresponding categories is
equal to one.
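A minimal sketch of this classification step (Eqs. 6, 7 and 9) with the four patterns above; the country scores and the equal criterion weights are illustrative assumptions, not data from the study:

```python
import numpy as np

# Hypothetical scores: rows = countries, columns = (sensitivity, adaptive capacity),
# both already normalised to [0, 1].
scores = np.array([
    [0.8, 0.3],   # e.g. a high-sensitivity, low-adaptive-capacity country
    [0.2, 0.9],   # e.g. a low-sensitivity, high-adaptive-capacity country
])

# The four non-ordered category patterns: ideal (0, 1), anti-ideal (1, 0),
# no problem-no action (0, 0), problem-and-action (1, 1).
patterns = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5])  # equal criterion weights (assumption)

def memberships(r, patterns, w):
    """Membership of alternative r in each pattern, per Eqs. (6) and (9)."""
    d = np.sqrt(((w * (r - patterns)) ** 2).sum(axis=1))  # Eq. (6)
    if np.any(d == 0):                 # alternative coincides with a pattern
        return (d == 0).astype(float)
    inv = 1.0 / d ** 2
    return inv / inv.sum()             # algebraically equivalent to Eq. (9)

U = np.array([memberships(r, patterns, weights) for r in scores])
assert np.allclose(U.sum(axis=1), 1.0)  # Eq. (7): memberships sum to one
```

The normalised inverse-square-distance form makes the sum-to-one property of Eq. (7) hold by construction.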
At this point, the characterization of the proposed method as a meta-multicriteria
method should be justified, even though weights are used to evaluate the distance
between the patterns and the alternatives (countries).
Like other multicriteria methods, the proposed method deals with extreme
points, but unlike the TOPSIS method (Fig. 1b) it covers a
larger part of the decision space. As in multicriteria methods, the reference points
are given a priori, whilst fuzzy pattern recognition is used only to classify the
countries into the pre-defined categories.
Next, instead of four categories, a larger number of categories can be established
based on the same key idea. This point is discussed in the case-study section.
The analysis focuses on the Mediterranean region. The countries examined are:
Albania, Algeria, Cyprus, Egypt, Greece, France, Israel, Italy, Morocco, Portugal,
Spain, Tunisia, Turkey and Malta. The data of each variable considered are normalised
based on the minimum and maximum evaluation scores of each index.
The methodology of fuzzy pattern recognition is implemented to distinguish to
which degree each country belongs to the four selected categories. As mentioned
before, each category is described by a multicriteria score of the corresponding pattern
(it can be seen as a fictitious country). Initially, four selected patterns describe all the
extreme combinations between the evaluation of the adaptive capacity and the sensi-
tivity criteria. A basic property that must be verified is that the sum of the membership
values for each country under all patterns is equal to one (Eq. 7). The results are shown
in Fig. 2.
[Fig. 2. Membership degrees in each of the four categories for the examined countries, including Malta, Turkey, Tunisia, Spain, Portugal, Morocco, Greece, Egypt, Cyprus, Algeria and Albania.]
In this real case study, all Mediterranean countries have a certain degree of vul-
nerability in each category.
For this reason, the methodology of fuzzy pattern recognition is extended by
adopting a finer partitioning of the categories, with intermediate categories (Fig. 3).
Hence, even though the methodology still relies on extreme points, it recognizes some
additional absolute states. Consequently, with this finer partitioning, in most cases
the model produces membership values significantly greater than 0.5 for
at least one category per country, and hence the categorization of each country
becomes clearer.
This new categorization seems more compatible with reality. For instance,
there are indeed countries, such as Israel and Cyprus, whose vulnerability can be
characterized within the category of "almost problem and (adaptive) actions (0.75,
0.75)", whilst there are countries, such as those in the south of the Mediterranean Sea,
whose vulnerability to drought belongs to the category of "almost worst (almost problem
and no (adaptive) actions) (0.75, 0.25)".
A disadvantage of this type of analysis is that it cannot describe the
differences between regions inside each country. For instance, Greece is a typical
example that presents significant differences in both hydrological regime and
population concentration. In fact, Greece has, among other things, plenty of surface
water potential, in most cases far from the water demand centers. These inequalities
cannot be described by indicators that cover the whole country.
An object for further investigation would be the normalization of the indices, since
usually the greatest value of the sample is used. The influence of this practice can be
reduced by using outranking methods, which are based on binary comparisons (e.g.
Spiliotis et al. 2015), to evaluate the adaptive capacity and the sensitivity criteria under
many indices.

Fig. 3. (a) Categorization based on the fuzzy pattern recognition approach with high partitioning;
(b) interpretation of the fuzzy pattern recognition classes with high partitioning.
4 Conclusions
The main idea of this work is the assessment of vulnerability to drought based on
the synthesis of the sensitivity and the adaptive capacity criteria. Taking into account
the extreme values of these two criteria, four non-ordered categories are initially
established to characterize the vulnerability to drought; a finer partitioning, however,
leads to a more discriminating classification. Finally, the countries are classified based on
the synthesis between a positive (adaptive capacity) and a negative (sensitivity) cri-
terion. Since both negative and positive criteria are used, non-ordered categories are
produced.
The categorization into the pre-selected non-ordered categories is based on fuzzy
pattern recognition with multiple patterns (not only the ideal and anti-ideal patterns,
as in the usual multicriteria methods). The patterns are selected based on the physical
problem itself. The method of fuzzy pattern recognition enables us, without significant
complexity, to establish more categories and make the picture clearer.
The proposed method can be characterized as a meta-multicriteria method since it
uses not just a single multicriteria method but a more sophisticated scheme, whilst the
final evaluation is over non-ordered categories based on the nature of the problem itself.
Further work could include means to modulate individually the adaptive capacity and
the sensitivity criteria within the multicriteria approach.
References
Iglesias, A., Garrote, L., Cancelliere, A., Cubillo, F., Wilhite, D.: Coping with Drought Risk in
Agriculture and Water Supply Systems. Drought Management and Policy Development in the
Mediterranean, vol. 26, p. 320. Springer, Netherlands (2009). https://doi.org/10.1007/978-1-
4020-9045-5
Iglesias, A., Garrote, L., Martín-Carrasco, F.: Drought risk management in Mediterranean river
basins. Integr. Environ. Assess. Manag. 5(1), 11–16 (2015)
Brack, W., Posthuma, L., Hein, M., von der Ohe, P.: European river basins at risk. Integr.
Environ. Assess. Manag. 5(1), 2–4 (2009)
Iglesias, A., Garrote, L., Flores, F., Moneo, M.: Challenges to manage the risk of water scarcity
and climate change in the mediterranean. Water Resour. Manage 21(5), 775–788 (2007)
Naumann, G., Barbosa, P., Garrote, L., Iglesias, A., Vogt, J.: Exploring drought vulnerability in
Africa: an indicator based analysis to inform early warning systems. Hydrol. Earth Syst. Sci.
Discuss. 10(10), 12217–12254 (2014)
Nardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman, A., Giovannini, E.: Handbook on
Constructing Composite Indicators: Methodology and User Guide, No. 2005/3. OECD
publishing (2005)
Tsakiris, G., Spiliotis, M., Vangelis, H., Tsakiris, P.: Evaluation of measures for combating water
shortage based on beneficial and constraining criteria. Water Resour. Manage 29(2), 505–520
(2015)
Kazakis, N., Spiliotis, M., Voudouris, K., Pliakas, F.K., Papadopoulos, B.: A fuzzy multicriteria
categorization of the GALDIT method to assess seawater intrusion vulnerability of coastal
aquifers. Sci. Total Environ. 593–594, 552–566 (2018)
Spiliotis, M., Martín-Carrasco, F., Garrote, L.: A fuzzy multicriteria categorization of water
scarcity in complex water resources systems. Water Resour. Manage 29(2), 521–539 (2015)
Belacel, N., Vincke, P., Scheiff, J.M., Boulassel, M.R.: Acute leukemia diagnosis aid using
multicriteria fuzzy assignment methodology. Comput. Methods Prog. Biomed. 64, 145–151
(2001)
Wu, D., Yan, D.-H., Yang, G.-Y., Wang, X.-G., Xiao, W.-H., Zhang, H.-T.: Assessment on
agricultural drought vulnerability in the Yellow River basin based on a fuzzy clustering
iterative model. Nat. Hazards 67(2), 919–936 (2013)
Xuesen, L., Bende, W., Mehrotra, R., Sharma, A., Guoli, W.: Consideration of trends in
evaluating inter-basin water transfer alternatives within a fuzzy decision making framework.
Water Resour. Manage 23(15), 3207–3220 (2009)
Zhou, H.C., Wang, G.L., Yang, Q.: A multi-objective fuzzy recognition model for assessing
groundwater vulnerability based on the DRASTIC system. Hydrol. Sci. J. 44, 611–618 (1999)
Shouyu, C., Guangtao, F.: A DRASTIC-based fuzzy pattern recognition methodology for
groundwater vulnerability evaluation. Hydrol. Sci. J. 48(2), 211–220 (2003)
Chrysafis, K.A., Papadopoulos, B.K.: Cost-volume-profit analysis under uncertainty: a model
with fuzzy estimators based on confidence intervals. Int. J. Prod. Res. 47(21), 5977–5999
(2009)
Bardossy, A., Bogardi, I., Duckstein, L.: Fuzzy regression in hydrology. Water Resour. Res. 26(7), 1497–1508 (1990)
Bioinspired Early Prediction of Earthquakes
Inferred by an Evolving Fuzzy Neural
Network Paradigm
1 Introduction
The belief that animals can predict earthquakes has been around for centuries [1]. The
most emblematic historical case was recorded in 373 B.C., when historians reported
that animals (rats, snakes and weasels) abandoned the Greek town of Helice
just days before an earthquake devastated it. This capability of
animals is well known and related to their fine senses, which can detect and
match very small vibrations, infrasound and ultrasound, gases, and other physical
signs that anticipate an earthquake event.
Geologists [2] dispute that a correlation between earthquakes and animal behavior
exists, mainly because the physical signs that anticipate an earthquake are not known.
Nevertheless, dogs have demonstrated high sensitivity to some signs anticipating
earthquake events, perhaps vibrational or electrical. An effective application of dogs
as early earthquake detectors (much like drug- and explosive-sniffing dogs)
was successfully tested in China. In 1975, Chinese authorities, days before a 7.3
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 361–367, 2019.
https://doi.org/10.1007/978-3-030-20257-6_30
362 M. Malcangi and M. Malcangi
magnitude earthquake, evacuated the people of the town of Haicheng, saving an
estimated 150,000 people from injuries and fatalities. The evacuation was success-
fully based on widespread accounts of unusual animal (dog) behavior.
Investigating animal behavior related to earthquake events can help
identify the physical signs that can be detected by an electronic sensor and matched
by an Artificial Neural Network (ANN). After motion-triggered cameras were
placed in the Yanachaga National Park (Peru), researchers observed significant
behavior changes in animals up to three weeks before an earthquake struck the whole region.
Animals showed what is known as "serotonin syndrome", due to unbalanced positive
ions in the environment. Serotonin regulates mood in animals and humans.
Positive ion concentration in the environment increases when rocks in the ground
crust are stressed during the build-up of an earthquake. Higher positive ion con-
centration thus produces more serotonin in animals and more excitation, which induces
them to move toward areas of low positive ion concentration (down from the hills to the
valleys). If ionization sensors were available as a commodity, the increase of the envi-
ronment's ionization could be monitored and used to trigger an earthquake early
warning system.
An environmental sign that supports the ionization theory is the observation of
lights preceding earthquake events. Such lights have been observed since
ancient times when strong earthquakes happened. The most recent case was reported in
2009, when a strong earthquake devastated L'Aquila (a town in central Italy): just a
few seconds before the earthquake hit, people saw ten-centimeter flames of light
flickering above a stone street.
On November 12th, 1988, a bright purple-pink light was reported along the St.
Lawrence River in Quebec, 11 days before a powerful quake happened.
People reported similar observations of lights before the great 1906 quake in San
Francisco.
Only 0.5% of earthquakes create the conditions for lightning, and only in limited geo-
graphical areas, but Freund [3] and other scientists are working on an earthquake
forecasting system that includes earthquake lights among the indicators.
Lightning can be observed and ionization can be sensed, but unfortunately such
sensors do not exist as electronic commodities, so no application of the ionization theory
exists, apart from using animals as sensors by observing their mood. No lightning-based
prediction system has been deployed either.
An alternative hypothesis is that most animals are mainly sensitive to small
vibrations and are able to detect and interpret such signs as an incoming
danger.
There are two types of seismic waves: body waves (Fig. 1) and surface waves.
Body waves travel deep in the ground, while surface waves travel at the surface level of
the ground. Body waves travel faster than surface waves and reach us earlier. Body
waves are composed of two different waves, the p-wave and the s-wave. The p-wave
(primary wave) is compressional because it causes vibrations parallel to its direction.
The s-wave is a shear wave because it causes vibrations orthogonal to its direction. The
p-wave is faster than the other waves (2–5 km/s), does not include the hit (maximum-
intensity) vibration, and arrives seconds to minutes in advance. Animals sense small
vibrations with their ears, feet and whiskers, then match them and learn the associated
risk, evolving their learning over time.
A reasonable bio-inspired solution for early prediction of earthquakes could be
based on electronic vibration sensors as sensing and detecting devices, and on applying
the Evolving Fuzzy Neural Network (EFuNN) paradigm [4–6, 12] to learn online and
to match the small vibrations that precede the disruptive wave.
Several technological motivations validate the bio-inspired approach
to developing a system for early detection of earthquakes. One is that vibration
sensors are Commercial Off-The-Shelf (COTS) devices available in volume (MEMS
accelerometers) and are smart (they embed a MicroController Unit (MCU) capable of
running advanced pattern matching paradigms). Because the Internet is evolving toward
the IoE (Internet of Everything) networking paradigm, massive deployment and networking
of smart sensors will be available to implement an effective early warning service.
Fig. 1. Body waves are seismic waves composed of a P-wave and an S-wave: the P-wave is
compressional; the S-wave is a shear wave and includes the hit.
2 What Is EFuNN?
The peculiarity of the EFuNN is that its five-layer fuzzy architecture corresponds
to a five-layer artificial neural network (ANN) architecture. The ANN's learning
capabilities can be applied to set up the fuzzy logic engine's knowledge as nodes of the
ANN. Nodes evolve by learning; rules are nodes of the ANN.
Two important capabilities are embedded in the EFuNN paradigm: fuzzy logic's
capability to infer by rules and the ANN's capability to learn from data. This paradigm is
a sound strategy for addressing the key difficulty of fuzzy logic (setting up the knowledge)
according to a bio-inspired approach.
Fig. 2. EFuNN is a five-layer artificial neural network in which each layer corresponds to a layer
of a fuzzy logic engine.
A set of p-wave patterns was extracted from earthquake recordings made with a
seismograph located at the Centro Geofisico Prealpino, Varese, Italy [7].
To build the dataset used to train and test the EFuNN, sampled data of the p-wave
patterns were extracted and labeled using the SeisGram2K visualization and
analysis software [8] and a Matlab script that assembles the data according to the formats
required by the Neucom simulation and modeling environment [9] running the
EFuNN paradigm.
The Matlab script receives as input the full seismogram sampled data stream and
windows three types of patterns:
• Background noise
• P-wave
• Hit-wave
After the pattern windowing, the script executes the labeling and formatting as
follows:
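The windowing-and-labeling step described above can be sketched in Python as follows (the original work used a Matlab script; the window length, hop size and label codes here are illustrative assumptions, not the authors' actual values):

```python
# Illustrative label codes; the paper does not give the actual encoding
LABELS = {"background": 0, "p_wave": 1, "hit_wave": 2}

def window_and_label(stream, segments, win=256, hop=128):
    """Cut a sampled seismogram into fixed-size windows and label each.

    stream   : sequence of seismogram samples
    segments : (start, end, kind) index ranges marking the background-
               noise, P-wave and hit-wave portions of the recording
    Returns a list of (window, label) pairs ready for formatting.
    """
    examples = []
    for start, end, kind in segments:
        # Slide a window of `win` samples with stride `hop` over the segment
        for i in range(start, end - win + 1, hop):
            examples.append((stream[i:i + win], LABELS[kind]))
    return examples
```

Each labeled window would then be written out in the format expected by the modeling environment.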
Bioinspired Early Prediction of Earthquakes 365
To train the EFuNN, the P-wave-based dataset was applied to it. After training, a
test was executed to validate the EFuNN's capability to match the P-wave pattern
preceding an earthquake event.
The dataset was split randomly into two parts: 80% of the data was used to train
the EFuNN and the remaining 20% to test it.
Train and test setup was as follows:
• Sensitivity threshold: 0.9
• Error Threshold: 0.1
• Number of membership functions: 3
• Learning rate for W1: 0.1
• Learning rate for W2: 0.1
• Node age: 60
• Maximum field: 0.5
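A Python sketch of the random split and the parameter set above; the key names in `EFUNN_PARAMS` are our own shorthand, since the authors ran the actual training in the Neucom environment [9], whose option names may differ.

```python
import random

def split_train_test(examples, train_frac=0.8, seed=0):
    """Random 80/20 split of the dataset, as described in the text."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    cut = int(train_frac * len(examples))
    train = [examples[i] for i in idx[:cut]]
    test = [examples[i] for i in idx[cut:]]
    return train, test

# Train/test parameters taken from the text; key names are assumptions
EFUNN_PARAMS = {
    "sensitivity_threshold": 0.9,
    "error_threshold": 0.1,
    "n_membership_functions": 3,
    "learning_rate_w1": 0.1,
    "learning_rate_w2": 0.1,
    "node_age": 60,
    "maximum_field": 0.5,
}
```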
Then, for a new test, a sequence data stream (background noise – P-wave – hit-wave)
was built so that the EFuNN could be tested on a full earthquake wave sequence. This
test (Fig. 3) demonstrated that the EFuNN can infer at run time on an incoming
earthquake event, triggering the alert system well before the arrival of the hit-wave
(the disruptive one).
Fig. 3. After training, the EFuNN demonstrated at test time that it is able to recognize the
incoming P-wave before the hit-wave arrival.
366 M. Malcangi and M. Malcangi
According to the test results, after a background-noise sequence (BN), the P-wave
pattern (PW) was detected and matched before the hit-wave (HitW) arrived (the
hit-wave itself was also detected and matched).
Acknowledgements. Thanks are due to Dr. Paolo Valisa (Centro Geofisico Prealpino – Varese
– Italy), who enabled us to collect seismographic data at the Centro Geofisico Prealpino.
A special acknowledgment is due to Prof. Nikola Kasabov, Auckland University of Tech-
nology, Director KEDRI – Knowledge Engineering and Discovery Research Institute, for his
invaluable suggestions on how to get the most from the EFuNN’s evolving capabilities.
References
1. Mott, M.: Can Animals Sense Earthquakes? National Geographic, 11 November 2003
2. Gupta, R.P.: Remote Sensing Geology. Springer, Heidelberg (2018). https://doi.org/10.1007/
978-3-662-55876-8
3. Freund, F.T., Derr, J.S.: Prevalence of earthquake lights associated with rift environments.
Seismol. Res. Lett. 85(1), 159–178 (2014)
4. Alves, E.I.: Earthquake forecasting using neural networks: results and future work.
Nonlinear Dyn. 44, 341–349 (2006)
5. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer
Academic Publishers, New York (1981)
6. Kasabov, N., Kim, J.S., Watts, M., Gray, A.: FuNN/2-a fuzzy neural network architecture
for adaptive learning and knowledge acquisition. Inf. Sci. Appl. 101(3–4), 155–175 (1997)
7. https://www.astrogeo.va.it/sismologia/sismi.php
8. http://alomax.free.fr/seisgram/SeisGram2K.html
9. http://www.kedri.aut.ac.nz/areas-of-expertise/data-mining-and-decision-support-systems/
neucom
10. Kasabov, N.: Evolving Connectionist Systems: The Knowledge Engineering Approach.
Springer, Heidelberg (2007). https://doi.org/10.1007/978-1-84628-347-5
11. Kasabov, N.: EFuNN. IEEE Trans. Syst. Man Cybern. (2001)
12. Kasabov, N.: Evolving fuzzy neural networks – algorithms, applications and biological
motivation. In: Yamakawa, T., Matsumoto, G. (eds.) Methodologies for the Conception,
Design and Application of Soft Computing, pp. 271–274 (1998)
13. Kasabov, N.K.: Time-Space, Spiking Neural Networks and Brain-Inspired Artificial
Intelligence. SSBN, vol. 7. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-
662-57715-8
Enhancing Disaster Response for Hazardous
Materials Using Emerging Technologies:
The Role of AI and a Research Agenda
Abstract. Despite all efforts, like the introduction of new training methods and
personal protective equipment, the need to reduce the number of First Responder
(FR) fatalities and injuries remains. Reports show that advances in technology
have not yet resulted in effectively protecting FRs from injuries, health impacts,
and odorless toxic gases. Currently, there are emerging technologies that can be
exploited and applied in emergency management settings to improve FRs'
protection. The aim of this paper is threefold: first, to conduct a scenario analysis
of situations that currently threaten first responders; second, to conduct a gap
analysis concerning the new technology needs in relation to the proposed
scenarios; third, to propose a research agenda and to discuss the role of Artificial
Intelligence within it.
1 Introduction
First Responders (FRs) have one of the deadliest jobs in the world, since they operate
very close to unpredictable dangers such as sudden gas explosions, releases of deadly
chemicals, building collapses and heart attacks. As a result, several FRs are injured or
lose their lives in action due to a lack of awareness of unpredicted risks, hazardous
materials and material exposure. In some cases FR fatalities make up more than half of
the total fatalities of an incident. For example, the 2007 forest fires in Artemida,
Greece resulted in 26 fatalities, 3 of which were firefighters (11%); in the 2013 West
Fertilizer Company incident in Texas, USA, 15 people died, with FR victims
accounting for 53% of the fatalities; and in the 2015 chemical blast in Tianjin, China,
there were 173 fatalities, of which 104 (60%) were FRs [1]. Likewise, in a high-rise
building fire in Tehran in 2017, 16 out of 22 fatalities (72%) were FRs. This reveals
how far FRs are exposed to deadly risks in response operations.
Situational Awareness (SA) also has a significant impact on incident management
and coordination, where it is critical as "all knowledge that is accessible and can be
integrated into a coherent picture, when required, to assess and cope with a situation"
[2]. Today, the integration of Information and Communication Technology
(ICT) and mobile technologies in emergency management is reshaping communica-
tions and information exchange between the command and control centres (C2C) and
FRs on the incident site. ICT has also provided several opportunities for advanced
sensing, computing and communicating through smartphones, wearable and portable
devices, robotics and unmanned aerial vehicles (UAVs).
However, reports show that advances in technology have not yet resulted in pro-
tecting FRs from injuries, health impacts, and odorless toxic gases effectively [3]. For
example, a recent study reveals that the mean number of firefighters’ fatalities in
Sweden increased from 2000 to 2016 [4]. Indeed, the International Forum to
Advance First Responder Innovation [5] has published a list of four FR capability
gaps, namely the ability to: (a) know the location of responders and their proximity to
risks and hazards in real time; (b) detect, monitor and analyze passive and active
threats and hazards at incident scenes in real time; (c) rapidly identify hazardous
agents and contaminants; and (d) incorporate information from multiple and
nontraditional sources (e.g., crowdsourcing and social media) into incident command.
Briefly, we see some capability gaps concerning the identification of hazardous
agents and the detection, monitoring and analysis of passive and active hazards. For
example, FR personnel often need to be close to the source before realizing the
presence of a passive or active hazard. In short, there is a strong need to innovate with
new technologies for first responders to address their capability gaps. Sometimes the
issue is not only about emerging technologies but also about methods for using
existing technologies efficiently so that FRs are protected. Here, a common
operational picture and situational awareness often play a role, and technologies can
enhance both.
The aim of this paper is threefold: first, to conduct a scenario analysis of situations
that may expose FRs to different types of threats; second, to conduct a gap analysis
concerning the new technology needs and specifications in relation to the proposed
scenarios; third, to propose a research agenda and to discuss the role of Artificial
Intelligence (AI).
This paper is organized into six sections. Section 2 describes the theoretical
background on the importance of the common operational picture and situational
awareness. Section 3 describes three scenarios in which hazardous materials can come
into the picture as unexpected extra disasters. Section 4 elaborates on examples of
potentially relevant emerging technologies that can be scrutinized further for
protecting FRs. Section 5 proposes a research agenda, and Section 6 concludes.
370 J. Radianti et al.
2 Theoretical Background
To understand the risks and threats to which first responders are exposed, the
emergency management literature has emphasized the importance of the following
three perspectives: first, the Common Operational Picture (COP); second, Shared
Situation Awareness (SA); and third, Collective Sensemaking and Advanced Decision
Making.
The COP is a way to ‘achieve a sufficient level of shared information among the
different organizations and jurisdictions participating in disaster operations at different
locations, so all actors readily understand the constraints on each and the possible
combinations of collaboration and support among them under a given set of conditions’
[6]. SA is a precondition of any COP. It is the perception of environmental elements
and events concerning time or space, the comprehension of their meaning, and the
projection of their status after some variable has changed, such as time, or some other
variable, such as a predetermined event. SA involves being aware of what is happening
at times of uncertainty to understand how information, events, and one’s actions will
impact goals and objectives, both immediately and in the near future [7]. One with an
adept sense of situation awareness generally has a high degree of knowledge with
respect to inputs and outputs of a system, an innate “feel” for situations, people, and
events that play out because of variables the subject can control. Lacking or inadequate
situation awareness among responders with different backgrounds (i.e., shared situation
awareness) has been identified as one of the primary factors in accidents attributed to
human error. SA is related to Advanced Decision Making, which is the process fol-
lowed in processing information to assess situations for making collective decisions
and taking collaborative actions. Decision making is part of all management tasks, and
it is particularly important for emergency managers, who often need to take decisions
quickly on the basis of very inadequate information [8, 9]. Collaborative group decision
making plays an important part in first response operations. We see collaborative
interactions through the construct of heedful interrelating, which refers to interacting
with sensitivity to the task at hand while at the same time paying attention to how one’s
actions affect overall group functioning [10–12]. Advanced decision making is related
to the process of collaborative sensemaking.
Collective sensemaking is about building a sufficient level of shared understanding
in a sensemaking process in which members of different professional organizations
(de/re)construct information influenced by their institutional background to find out
what is going on in times of uncertainty [13]. This sensemaking process is based upon
the knowledge responders have gained through (1) education, including training/
exercises; (2) storytelling; and (3) past experiences [14, 15]. FRs have to constantly
make sense of the situation and of the actions of other FRs (collectively) because of the
rapidly changing environment [16]. Sensemaking can be understood as a steady pro-
cess [17] of gaining knowledge through the transformation and integration of new
information into cognitive schemata [18], which is particularly essential in crises to
understand such unstable situations and to make adequate decisions. Therefore, it aims
to reduce and bridge knowledge gaps [19–21] and can be impacted by several social
factors like opinions [19, 20] as well as interactions, discussions or information from
other individuals [13] to find common ground for decision making.
For first responders to get an adequate overview of the crisis situation and the
actions of (other) responding organizations, they need to create a common operational
picture (COP). Information and data are needed to arrive at a coherent but dynamic
COP. In practice (in line with the literature), this happens only if they have sufficient
situational awareness (SA). SA is based on the collection and sharing of information.
SA is not static (as the crisis situation continuously evolves and more responding
organizations become active), nor is it univocal or unambiguous. On the contrary, SA
is ambiguous and multivocal, because responding organizations will give (different)
meanings to the crisis situation, to the information and to what is needed to respond to
the crisis (i.e., decision making). A collective sensemaking approach helps to
understand and unravel this multivocality and ambiguity, to understand how various
responders with their specific professional norms and routines interpret the situation,
and finally how they enact the responding practices in such a way (that is, collectively)
that the joint effort and operations become more efficient.
In the next section, we provide examples of the most frequent man-made and
natural disasters involving hazardous substances. The aim is to illustrate that people
can easily lose their situational awareness in such events and develop different
understandings of the situation. Moreover, people often do not know that hazardous
material has leaked into, e.g., a burning building, or even into water systems, where
the danger involves not only the first responders but also society in general.
3 Scenario Analysis
Disastrous events involving hazardous materials can occur in different settings. The
three scenarios presented below can serve as an illustration of why new technologies
are required. We select three examples of scenarios, i.e., industrial accidents, natural
hazards, and terrorist attacks.
of buildings and the environment [22]. It can cause fatalities and injuries among
workers, damage to property and infrastructure on site and in the surrounding area,
critical service disruptions, and contamination.
In these three example scenarios, the first responders should react quickly,
minimizing the loss of lives among the affected people and themselves. If hazardous
materials are involved in such disasters, FRs may be severely exposed to, e.g.,
poisonous gases, explosives and other dangerous substances. We have high hopes that
existing and emerging technologies can help first responders detect all potential
hazards early and achieve better preparedness in a disaster. The next section describes
examples of emerging technologies that are potentially useful for emergency
management.
4 Emerging Technologies
Currently, there are emerging technologies that can be exploited and applied to
improve the way FRs are protected, such as:
• Unmanned Aerial Vehicles (UAVs) with various mounted sensors and cameras:
researchers have examined different levels of UAV autonomy, ranging from fully
operator-controlled, through the computer suggesting action alternatives, narrowing
down the choices, or executing an action upon the operator's approval, to fully
computer-controlled operation that ignores the operators [25].
• Wearable devices and wearable sensors (wrist-worn, head-mounted and others) that
come as existing products or research prototypes. For example:
– Wrist-worn: smartwatches and wrist bands with or without a touchscreen display,
such as the Apple Watch, Samsung Gear, Pebble Time and Fitbit Flex (existing
products) and the Smart-watch Life-Saver (prototype).
– Head-mounted devices: smart glasses such as Funiki Ambient glasses, Recon
Jet and Microsoft HoloLens (existing products) and Google Glass (prototype).
– Smart jewelry designed for health monitoring, such as a smart ring (existing
product), or other jewelry such as a typing ring and a gesture-detection ring
(prototypes).
– Electronic garments, i.e., clothing items that also serve as wearables, such as
Athos and Spinovo (existing products) or Dooplesleep (prototype).
– E-patches, i.e., sensor patches that can adhere to the skin for fitness tracking or
haptic applications, sensing and data transmission; for example, the health patch,
Motorola e-tattoo or stamp platform (existing products) or DuoSkin and the smart
tooth patch (prototypes) [26].
• AR/VR technologies: currently, different technologies are available to support an
immersive experience with virtual reality and augmented reality, such as the
head-mounted displays Oculus Rift and HTC Vive. Some AR/VR experiences can
also be obtained by using smartphones, such as with Google Cardboard and the
Samsung Gear VR headset. In addition, immersive video and 360-degree video add
further VR/AR experience possibilities [27].
• Robotics: nowadays, robotic technology and its application areas are developing
rapidly for industrial purposes, rehabilitation and surgery, search and rescue,
self-driving vehicles, assistive technology, home care, manufacturing and so on.
Examples of care robots include lifting, exoskeleton, assistive, companion, talking,
emotional and service robots [28]. Research on swarm robotics has focused on
several topics such as aerial manipulation, counter-swarm pursuit, target search and
tracking, and surveillance monitoring and mapping [29]. There are more examples
indicating that robotics is increasingly becoming an attractive research area with
useful new applications.
• Real-time systems: today's robust networks have allowed researchers to put
"real-time" operation forward as a feature or selling point of newly created systems
in any area. Countless technologies offer real-time systems as part of the delivery.
• Smartphones, whose advanced computing power, connectivity, battery and storage
have turned them into handy multi-purpose devices that can be used for collecting
images, videos, audio, location and other sensor data [30, 31].
• Surveillance cameras with automatic detection (behavioral and facial recognition,
object detection, object tracking and so on). The surveillance camera itself is not
new, but how people use and process the videos and images has improved
significantly, especially after the advancement of image processing techniques
developed in the artificial intelligence domain.
• Supercomputers, i.e., computers with a high level of performance that allow
processing very large databases and conducting large amounts of computation, such
as iDataPlex, Shaheen II, Hazel Hen and Trinity, which have 301,056, 185,088,
196,608 and 65,320 cores respectively [32]. Advances such as multi-core processors
and GPGPUs (General-Purpose Graphics Processing Units) have enabled powerful
machines for personal use. Supercomputers will continue to support the
advancement of activities dealing with UAVs, AR/VR, robotics and so on, as more
data are collected and huge computing power is needed to deliver results in nearly
real time.
These technologies generate new types of massive data. Some technologies have
also been exploited by the public to generate citizen information, especially social
media and various mobile and web apps that allow researchers to collect data through
crowd-sourcing techniques. Likewise, various research efforts have benefited from
the current fast development of data analysis techniques, in particular artificial
intelligence: machine learning, deep learning, and computer vision.
5 Research Agenda
Despite all efforts, like the introduction of new training methods and personal
protective equipment, the need to reduce the number of FR fatalities and injuries to a
level as low as reasonably possible remains. New emerging technologies can be used
to mitigate this problem. However, these technologies must be analyzed, (re)designed,
studied, tested in emergency management settings and evaluated, considering FRs'
protection against multiple and unexpected dangers and the extent to which they can
truly enhance SA and the COP.
Therefore, we plan to conduct research in the following areas:
(1) On-Site Threat Detection (real-time portable platform for hazmat detection and
monitoring)
(2) Risk Monitoring and Safe Management of Threats
6 Concluding Remarks
There are some gaps when it comes to technologies that can protect FRs from
hazardous materials, especially the identification of hazardous agents and the
detection, monitoring and analysis of passive and active hazards. We found various
new, promising research directions exploiting new technologies to improve the safety
of FRs. In addition, AI technologies can be used in conjunction with other emerging
technologies to provide an enhanced COP, SA and threat detection, and to protect FRs
by minimizing their casualties. Our future direction is to pursue and operationalize our
research agenda into a set of concrete studies that can contribute to the area of
emergency management, safety and risk management.
References
1. McGarry, S.L., et al.: Preventing the preventable: the 2015 Tianjin explosions. Harvard T.
H. Chan School of Public Health, Hong Kong Jockey Club Disaster Preparedness and
Response Institute, Hong Kong (2017)
2. Sarter, N.B., Woods, D.D.: Situation awareness: a critical but ill-defined phenomenon. Int.
J. Aviat. Psychol. 1(1), 45–57 (1991)
3. Hall, A.H.: A Survey of Firefighter Cancers and Other Chronic Diseases: Preliminary
Results (2016)
4. Svensson, S.: Firefighter fatalities in Sweden, 1937–2016. Brandteknik Lunds tekniska
högskola, Lund (2017)
5. IFAFRI: Capability Gap 2 “Deep Dive” Analysis Synopsis (2017)
6. Comfort, L.K.: Crisis management in hindsight: cognition, communication, coordination,
and control. Public Adm. Rev. 67, 189–197 (2007)
7. Endsley, M.R.: Designing for Situation Awareness: An Approach to User-Centered Design.
CRC Press (2016)
8. Klein, G.: The recognition-primed decision (RPD) model: looking back, looking forward. In:
Naturalistic Decision Making, pp. 285–292 (1997)
9. Zsambok, C.E., Klein, G.: Naturalistic Decision Making. Psychology Press (2014)
10. Weick, K.E., Roberts, K.H.: Collective mind in organizations: heedful interrelating on flight
decks. In: Administrative Science Quarterly, pp. 357–381 (1993)
11. Druskat, V.U., Pescosolido, A.T.: The content of effective teamwork mental models in self-
managing teams: ownership, learning and heedful interrelating. Hum. Relat. 55(3), 283–314
(2002)
12. Weber, K., Glynn, M.A.: Making sense with institutions: context, thought and action in Karl
Weick’s theory. Organ. Stud. 27(11), 1639–1660 (2006)
13. Weick, K.E., Sutcliffe, K.M., Obstfeld, D.: Organizing and the process of sensemaking.
Organ. Sci. 16(4), 409–421 (2005)
14. Endsley, M.R.: Toward a theory of situation awareness in dynamic systems. Hum. Factors
37(1), 32–64 (1995)
15. Taber, N., Plumb, D., Jolemore, S.: “Grey” areas and “organized chaos” in emergency
response. J. Workplace Learn. 20(4), 272–285 (2008)
16. Wolbers, J., Boersma, K.: The common operational picture as collective sensemaking.
J. Conting. Crisis Manag. 21(4), 186–199 (2013)
17. Mirbabaie, M., Zapatka, E.: Sensemaking in Social Media Crisis Communication–A Case
Study on the Brussels Bombings in 2016 (2017)
18. Pentina, I., Tarafdar, M.: From “information” to “knowing”: exploring the role of social
media in contemporary news consumption. Comput. Hum. Behav. 35, 211–223 (2014)
19. Dervin, B.: Sense-making theory and practice: an overview of user interests in knowledge
seeking and use. J. Knowl. Manag. 2(2), 36–46 (1998)
20. Savolainen, R.: The sense-making theory: reviewing the interests of a user-centered
approach to information seeking and use. Inf. Process. Manag. 29(1), 13–28 (1993)
21. Stieglitz, S., et al.: Sensemaking and communication roles in social media crisis
communication (2017)
22. Rabjohn, A.: The human cost of being a ‘first responder’. J. Bus. Cont. Emerg. Plann. 6(3),
268–271 (2013)
23. SWD: Overview of Natural and Man-made Disaster Risks the European Union may face.
Commission Staff Working Document 176, Brussels (2017)
24. Europol: EU Terrorism Situation and Trend Report (TE-SAT) (2017)
25. Atyabi, A., MahmoudZadeh, S., Nefti-Meziani, S.: Current advancements on autonomous
mission planning and management systems: an AUV and UAV perspective. Ann. Rev.
Control 46, 196–215 (2018)
26. Seneviratne, S., et al.: A survey of wearable devices and challenges. IEEE Commun. Surv.
Tutor. 19(4), 2573–2620 (2017)
27. Garcia-Luna-Aceves, J.J.H., Westphal, C.D.: Network support for AR/VR and immersive
video application: a survey. In: Proceedings of the 15th International Joint Conference on e-
Business and Telecommunications, ICETE 2018, vol. 1 (2018)
28. Rantanen, T., et al.: The adoption of care robots in home care—a survey on the attitudes of
Finnish home care personnel. J. Clin. Nurs. 27(9–10), 1846–1859 (2018)
29. Chung, S., et al.: A survey on aerial swarm robotics. IEEE Trans. Rob. 34(4), 837–855 (2018)
30. Radianti, J., Gonzalez, J.J., Granmo, O.-C.: Publish-subscribe smartphone sensing platform
for the acute phase of a disaster: a framework for emergency management support. In: 2014
IEEE International Conference on Pervasive Computing and Communication Workshops
(PERCOM WORKSHOPS). IEEE (2014)
31. Radianti, J., Lazreg, M.B., Granmo, O.-C.: Fire simulation-based adaptation of SmartRescue
App for serious game: design, setup and user experience. Eng. Appl. Artif. Intell. 46, 312–
325 (2015)
32. Ryabko, B., Rakitskiy, A.: Theoretical approach to performance evaluation of supercom-
puters. J. Circ. Syst. Comput. 27(04), 1850062 (2018)
Machine Learning Modeling -
Optimization
Evolutionary Optimization on Artificial
Neural Networks for Predicting the
User’s Future Semantic Location
Antonios Karatzoglou
1 Introduction
The growing use of GPS technology in mainstream devices such as mobile
phones, smartwatches and cars, to name but a few, lays the basis for a number
of services that utilize location information to provide their users with more
accurate and timely solutions. These services are referred to as Location-Based
Services (LBS). Recently, LBS providers have invested increasingly in location
prediction techniques to further improve the user experience of their services.
So far, a large variety of different models has been investigated in both the
academic and the private sector. These include probabilistic approaches like
Hidden Markov Models (HMM) and Bayes Networks, as well as other methods
like Support Vector Machines (SVM) and Artificial Neural Networks (ANN). In general,
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 379–390, 2019.
https://doi.org/10.1007/978-3-030-20257-6_32
2 Related Work
Modelling human movement patterns and predicting upon them is a well-researched
and well-evaluated topic in the scientific community. As already mentioned in the
introductory section, most work relies on numerical data like GPS or cell tower data.
In the last decade, though, a new group of researchers has utilized semantic
information to leverage their models and the respective predictive performance. The
first part of this section discusses the most relevant research in the field of
semantically enhanced location prediction. The second part describes a group of
works in which genetic algorithms are used in a location prediction scenario. In
contrast to our work, none of them concerns semantic outdoor trajectories.
In [32], Ying et al. deployed their own Geographic Semantic Information
Database (GSID) to semantically enrich GPS and cell tower ID trajectories. The
GSID is a POI1 database containing semantic (geo-)information on landmarks and
the associated location types. Ying et al. use the returned semantic trajectories to
populate a prefix tree model that serves as the basis for their improved next-location
prediction algorithm. In [31], they extend their model by adding temporal information
explicitly to its input patterns. Taking these temporal patterns additionally into
account helps improve their former model. Karatzoglou
et al. explore in their research a large variety of semantic trajectory modelling
1 Point of Interest.
and prediction methods. In [7–9], they explore the use of context-specific multi-
dimensional Markov Chains as well as the degree of semantic enrichment, that is,
the type and the amount of additional semantic information that can be fed into
the model, to achieve better accuracy scores. Furthermore, they propose a
context-driven, semantic-similarity-based approach for the dynamic clustering of
locations, which provides their model with the necessary flexibility to overcome,
among others and to a certain degree, sparse and inconsistent training
datasets. In [10,11], they evaluate the performance of various neural network
architecture types including the Feed-Forward (FFNN), the Recurrent (RNN),
the Long Short-term Memory (LSTM) and the Convolutional Neural Network
(CNN) with the LSTM and the CNN performing overall best. Finally, their
work in [6] extends the LSTM-based model and investigates the application of
Sequence to Sequence Learning (Seq2Seq) and its impact on both the short- and
the long-term prediction. Samaan et al. model human trajectories using concep-
tual maps, which are very close to the notion of semantic trajectories [19–21].
In addition, a set of user profile information encoded in XML, such as the user’s
preferences and her schedule, are used to support their probabilistic inference
process. Ridhawi et al. present in their work a similar approach for the indoor
tracking of users [17,18]. In contrast to Samaan et al., Ridhawi et al. use ontologies
for modelling and storing the users’ information. In [13], Long et al. parse the
(textual) input of Location-Based Social Network (LBSN) users to analyze their
spatial and temporal movement patterns at a higher semantic level. Krishna-
murthy et al. exploit the same type of data to predict the next semantic location
of mobile LBSN users [12]. For this purpose, they analyze the semantic similarity
between locations and the users’ input (checkins) based on several semantic sim-
ilarity measures, like the Jaccard [5] and the Tversky [24] Index. Ye et al. in [30]
use LBSN data as well and use the within included semantic labels of locations as
input for their Mixed Hidden Markov Model (MHMM). Finally, Wannous et al.
and Malki et al. propose a multi-ontology based approach in combination with a
set of rules created by a group of experts to model and reason about semantically
annotated movement and activity patterns of marine mammals [15,26–29].
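As an illustration of the prefix-tree idea used by Ying et al. [31,32], the following Python sketch counts semantic-location prefixes and predicts the most frequent continuation; it is a generic reconstruction for illustration, not their exact algorithm.

```python
from collections import defaultdict

class PrefixTreePredictor:
    """Count semantic-location n-gram prefixes and predict the most
    frequent next location for a given recent history."""

    def __init__(self, depth=3):
        self.depth = depth
        # prefix tuple -> {next location -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, trajectories):
        for traj in trajectories:
            for i in range(1, len(traj)):
                # Register every prefix of length 1..depth ending at i
                for k in range(1, self.depth + 1):
                    if i - k < 0:
                        break
                    self.counts[tuple(traj[i - k:i])][traj[i]] += 1

    def predict(self, history):
        # Back off from the longest matching prefix to shorter ones
        for k in range(min(self.depth, len(history)), 0, -1):
            prefix = tuple(history[-k:])
            if prefix in self.counts:
                nxt = self.counts[prefix]
                return max(nxt, key=nxt.get)
        return None
```

Fitting the tree on semantic trajectories such as `["home", "work", "restaurant", "home"]` lets it return the most frequently observed next location for a recent history.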
There exists a rather limited number of papers that use evolutionary algo-
rithms, either in a direct or in an indirect manner, in a location prediction
scenario. Mantoro et al. propose a direct approach, in which a genetic algorithm
is used to represent the visits of a certain user at a certain location at a cer-
tain time [16]. The evolution process simply returns the most likely places to
be occupied by each user. Mala et al. apply a genetic algorithm for capturing
and evolving the moving speed and the current traffic situation over time [14].
This information is then used to support the location transition probability func-
tion and consequently the overall prediction performance. In [23], Tiwari et al.
make use of the genetic algorithm to encode and evolve the sparse User-Location
matrix that contains the preference scores between each location and user. Their
model is able to outperform the classic Matrix Factorization approach. Finally,
Vlahogianni et al. present in [25] a genetic optimization approach for optimizing
neural networks with respect to a short-term traffic flow prediction use case.
382 A. Karatzoglou
3 Semantic Trajectories
A trajectory is a spatio-temporal sequence that describes the movement of
objects in space within a certain temporal interval. The most common trajec-
tories nowadays are GPS trajectories. These are usually defined as a sequence
of GPS points, that is, triples containing the object's latitude (lat_i), longitude
(long_i) and a corresponding point in time (t_i), as displayed in the following
equation:

Traj_GPS = ((lat_1, long_1, t_1), (lat_2, long_2, t_2), ...)    (1)
Figure 1 shows a daily GPS trajectory as a sequence of GPS points.
In 2007, Alvares et al. and Spaccapietra et al. highlighted in their work the
benefits of adding a conceptual, semantic layer upon the numeric trajectories
when analyzing moving objects [1,22]. Taking semantic information additionally
into account provides analysts with a deeper insight into the objects' movement
patterns and thus with a greater understanding of their moving behaviour.
A semantic trajectory refers to such a semantically enriched trajectory and
with respect to human trajectories, it consists of a sequence of a few signifi-
cant locations. Significant locations represent locations at which users stay long
enough to perform a certain activity (e.g., mall, restaurant, office or gym) [2].
Semantic trajectories conform in a certain way with Hägerstrand’s notion of a
space-time prism that describes the trade-off between time and space depending
on the respective activity [4]. Figure 2 illustrates the reduction of thousands
of recorded GPS points to a sequence of a few significant semantic locations.
This work concentrates on modeling and predicting exactly this type of
semantic trajectory.
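The reduction from a raw GPS trajectory (Eq. (1)) to a semantic trajectory can be sketched with a naive stay-point heuristic. The thresholds and the labelling function below are illustrative assumptions, not the procedure of [2]:

```python
from math import hypot

def extract_semantic_trajectory(gps_points, labeller,
                                min_stay_s=900, max_dist=0.001):
    """Naive stay-point sketch (thresholds are illustrative): a run of
    points staying within max_dist degrees for at least min_stay_s
    seconds becomes one significant location, labelled by a
    user-supplied function (e.g. a reverse geocoder)."""
    semantic, i = [], 0
    while i < len(gps_points):
        lat0, lon0, t0 = gps_points[i]
        j = i + 1
        while (j < len(gps_points) and
               hypot(gps_points[j][0] - lat0,
                     gps_points[j][1] - lon0) <= max_dist):
            j += 1
        if gps_points[j - 1][2] - t0 >= min_stay_s:
            label = labeller(lat0, lon0)
            if not semantic or semantic[-1] != label:
                semantic.append(label)   # drop consecutive duplicates
            i = j
        else:
            i += 1
    return semantic

points = [(49.0, 8.40, 0), (49.0, 8.4001, 1000),
          (49.01, 8.41, 1100), (49.01, 8.4101, 2200)]
labels = {(49.0, 8.40): "home", (49.01, 8.41): "office"}
print(extract_semantic_trajectory(
    points, lambda la, lo: labels[(round(la, 2), round(lo, 2))]))
# -> ['home', 'office']
```

In practice, the labeller would map coordinates to venue categories such as those of the Foursquare taxonomy used later in this paper.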
As already mentioned before in this work, the above process has been suc-
cessfully used for optimizing the hyperparameters of neural networks, such as
the architecture (number of hidden layers, neurons and their interconnections),
the learning rate and the training batch size, to name but a few. As a matter
of fact, evolutionary algorithms represent one of the most promising approaches
when it comes to derivative-free, non-backpropagation optimization. In this
work, we evaluate the
use of a genetic algorithm for optimizing the structure of the neural network with
respect to the optimization process’ temporal expenditure during the training
and the overall test accuracy in the semantic location prediction scenario. In
particular, we concentrate on finding the best possible values for the following
hyperparameters:
5 Evaluation
Semantic trajectories can be represented at various levels, depending on the
applied semantic granularity, e.g., food location vs. restaurant vs. fast food restau-
rant vs. burger joint. We tested our approach on modeling human trajectories
Evolutionary Optimization on Artificial Neural Networks 385
at two different semantic representation levels, e.g., restaurant vs. burger joint,
Chinese restaurant, pizzeria, etc., or night life location vs. bar, night club, disco,
cinema, etc. Our semantic location taxonomy is based on the Foursquare venue
categorization². Figure 4 shows a small part of our Reality Mining semantic loca-
tion taxonomy.
We evaluated our approach using the MIT Reality Mining dataset [3]. The
Reality Mining dataset contains semantically annotated trajectory data from
100 mobile users over a period of 9 months. Before using it for evaluation, we
cleansed and preprocessed the data, among other steps, by removing nonsensical
data (e.g., huge location jumps in extremely short time intervals) and by filtering
out extremely sparse annotators. Figure 5 shows the distribution of the resulting
locations for the high level case. It is apparent that the Reality Mining dataset is
extremely unbalanced. In order to compensate for this imbalance in our evaluation
outcome, we used a weighted macro accuracy metric that takes it explicitly
into account. A Markov Chain model, as well as the Convolutional Neural Network
(CNN) of [10] and an LSTM-based Neural Network optimized through a typical
grid search, served as our baselines. The parameters can be found in Table 1.
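A class-balanced accuracy of the kind mentioned above can be sketched as per-class recall averaged with equal weight per class; this is one plausible reading of the paper's weighted macro accuracy, not necessarily its exact formula:

```python
from collections import defaultdict

def macro_accuracy(y_true, y_pred):
    """Per-class recall averaged with equal weight per class, so rare
    location classes count as much as frequent ones (one plausible
    reading of a class-weighted macro accuracy)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# 9 "home" samples vs 1 "gym" sample: plain accuracy rewards the
# all-"home" predictor with 0.9, macro accuracy only with 0.5.
y_true = ["home"] * 9 + ["gym"]
print(macro_accuracy(y_true, ["home"] * 10))   # -> 0.5
```

On an unbalanced dataset such as Reality Mining, this kind of metric prevents a trivial majority-class predictor from looking deceptively good.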
To optimize our LSTM-based model, we applied a simple genetic algorithm.
Table 2 contains the respective hyperparameter values. These were obtained
through a short testing process. We ran the evolution process twice: once for
the low and once for the high representation level trajectories. Thus, at the end,
we received two optimal hyperparameter configuration sets for our predictive
model (Table 3). It can be seen that the values for both the low and the high
level case are quite similar. The number of past locations to be considered (his-
tory window) as well as the number of hidden layers are identical. Interestingly,
the largest difference occurs with respect to the number of LSTM units per
hidden layer. It seems that our LSTM model needs a wider topology when it
comes to modeling high level human movement patterns, and this despite the
fact that the overall number of high level location types, and thus the corresponding
input vector, is smaller compared to the number of unique locations at the
lower level.

² https://developer.foursquare.com/docs/resources/categories.

Fig. 5. Distribution of the semantic locations in the (filtered) MIT dataset at the
higher semantic representation level.

Table 2. Hyperparameters of the genetic algorithm.
Hyperparameter         Value
Population size        4
Number of generations  4
Gene distribution      Bernoulli distribution
Crossover probability  0.4
Mutation probability   0.1
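A genetic hyperparameter search of this kind can be sketched as follows, using the GA settings from Table 2 (population 4, 4 generations, Bernoulli-initialized genes, crossover 0.4, mutation 0.1). The bit encoding and the surrogate fitness function are illustrative assumptions; in the paper, the fitness would be the validation accuracy of an LSTM trained with the decoded hyperparameters:

```python
import random

random.seed(0)
# GA settings taken from Table 2; encoding and fitness are stand-ins.
POP, GENS, P_CROSS, P_MUT, BITS = 4, 4, 0.4, 0.1, 7

def decode(g):
    layers = 1 + (2 * g[0] + g[1])                 # 1..4 hidden layers
    units = 32 * (1 + 4 * g[2] + 2 * g[3] + g[4])  # 32..256 LSTM units
    window = 1 + (2 * g[5] + g[6])                 # history window 1..4
    return layers, units, window

def fitness(g):                  # toy surrogate, NOT a trained model
    layers, units, window = decode(g)
    return -abs(layers - 2) - abs(units - 128) / 32 - abs(window - 3)

# Bernoulli-distributed initial genes (Table 2)
pop = [[random.getrandbits(1) for _ in range(BITS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    nxt = pop[:2]                                  # keep the two fittest
    while len(nxt) < POP:
        a, b = random.sample(pop[:3], 2)           # parents from the top
        child = list(a)
        if random.random() < P_CROSS:              # one-point crossover
            cut = random.randrange(1, BITS)
            child = a[:cut] + b[cut:]
        child = [bit ^ (random.random() < P_MUT) for bit in child]
        nxt.append(child)
    pop = nxt

best = max(pop, key=fitness)
print(decode(best))
```

With only 4 individuals over 4 generations, each generation requires just a handful of model trainings, which is where the time advantage over an exhaustive grid search comes from.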
Figure 6 shows the results for the low representation level case. We can see that
the GA-optimized network is able to outperform both the LSTM model that was
optimized through an exhaustive grid search and the third-order Markov Chain
model. Only the CNN approach performs better, as expected based on former
investigations in [10]. Thus, the genetic algorithm based optimization technique
seems to enhance the standard LSTM; however, it does not lead to giant leaps.
At the higher level, all models perform generally better. The GA-based LSTM
is again able to outperform the Markov Chain model, but this time, in contrast
to the low level case, it cannot reach the scores of either the standard LSTM or the
CNN. Once again, the results here confirm the overall best performance of the
LSTM and the CNN in [10]. The reduced performance of the GA-based LSTM
could be mainly attributed to a certain overfitting effect during the training
process due to the smaller number of location classes in combination with the
large number of hidden units while keeping the population number the same.
In terms of the time needed to reach an optimal hyperparameter set, the
search took considerably less time than the exhaustive grid search, by a factor
of 0.8. A random grid search would probably be equally fast, but its final results
are usually not as good as those of the population-based genetic approach, due
to the "naive" high variance of selecting points randomly in the hyperparameter
space.
6 Conclusion
Semantically enriched Location Based Services have recently gained increasing
importance. For this reason, there exists a growing variety of semantic location
prediction approaches in the current literature. Among the most common and
promising methods are those that rely on Artificial Neural Networks. However,
finding the appropriate network topology and the respective hyperparameters is
a very time consuming process. Evolutionary algorithms help shorten this
process while keeping the models' performance high. This work explores the use
of Genetic Algorithms in optimizing a Long Short-Term Memory neural network
in a semantic trajectory modeling and prediction scenario. We showed that
genetic algorithms are capable of optimizing the LSTM, helping it reach
high accuracy scores and outperform baseline systems such as a Markov Chain
model and a grid-search-optimized LSTM model.
References
1. Alvares, L.O., Bogorny, V., Kuijpers, B., de Macedo, J.A.F., Moelans, B., Vaisman,
A.: A model for enriching trajectories with semantic geographical information. In:
Proceedings of the 15th Annual ACM International Symposium on Advances in
Geographic Information Systems, GIS 2007, pp. 22:1–22:8. ACM, New York (2007).
https://doi.org/10.1145/1341012.1341041
2. Ashbrook, D., Starner, T.: Using GPS to learn significant locations and predict
movement across multiple users. Pers. Ubiquitous Comput. 7(5), 275–286 (2003)
3. Eagle, N., Pentland, A.S.: Reality mining: sensing complex social systems. Pers.
Ubiquitous Comput. 10(4), 255–268 (2006)
4. Hägerstraand, T.: What about people in regional science? Pap. Reg. Sci. 24(1),
7–24 (1970)
5. Jaccard, P.: The distribution of the flora in the alpine zone. New Phytol. 11(2),
37–50 (1912)
6. Karatzoglou, A., Jablonski, A., Beigl, M.: A Seq2Seq learning approach for mod-
eling semantic trajectories and predicting the next location. In: Proceedings of
the 26th ACM SIGSPATIAL International Conference on Advances in Geographic
Information Systems. ACM (2018)
7. Karatzoglou, A., Köhler, D., Beigl, M.: Semantic-enhanced multi-dimensional
Markov chains on semantic trajectories for predicting future locations. Sensors
18(10) (2018). https://doi.org/10.3390/s18103582. http://www.mdpi.com/1424-
8220/18/10/3582
8. Karatzoglou, A., Köhler, D., Beigl, M.: Purpose-of-visit-driven semantic similar-
ity analysis on semantic trajectories for enhancing the future location prediction.
In: International Conference on Pervasive Computing and Communications Work-
shops (PerCom). IEEE (2018)
9. Karatzoglou, A., Lamp, S.C., Beigl, M.: Matrix factorization on semantic trajecto-
ries for predicting future semantic locations. In: Wireless and Mobile Computing,
Networking and Communications (WiMob), pp. 1–7. IEEE (2017)
10. Karatzoglou, A., Schnell, N., Beigl, M.: A convolutional neural network approach
for modeling semantic trajectories and predicting future locations. In: Kůrková, V.,
Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018.
LNCS, vol. 11139, pp. 61–72. Springer, Cham (2018). https://doi.org/10.1007/978-
3-030-01418-6 7
11. Karatzoglou, A., Sentürk, H., Jablonski, A., Beigl, M.: Applying artificial neural
networks on two-layer semantic trajectories for predicting the next semantic loca-
tion. In: Lintas, A., Rovetta, S., Verschure, P.F.M.J., Villa, A.E.P. (eds.) ICANN
2017. LNCS, vol. 10614, pp. 233–241. Springer, Cham (2017). https://doi.org/10.
1007/978-3-319-68612-7 27
12. Krishnamurthy, R., Kapanipathi, P., Sheth, A.P., Thirunarayan, K.: Knowledge
enabled approach to predict the location of Twitter users. In: Gandon, F., Sabou,
M., Sack, H., d’Amato, C., Cudré-Mauroux, P., Zimmermann, A. (eds.) ESWC
2015. LNCS, vol. 9088, pp. 187–201. Springer, Cham (2015). https://doi.org/10.
1007/978-3-319-18818-8 12
13. Long, X., Jin, L., Joshi, J.: Exploring trajectory-driven local geographic topics in
foursquare. In: Proceedings of the 2012 ACM Conference on Ubiquitous Comput-
ing, UbiComp 2012, pp. 927–934. ACM, New York (2012)
14. Mala, C., Loganathan, M., Gopalan, N.P., SivaSelvan, B.: A novel genetic algorithm
approach to mobility prediction in wireless networks. In: Ranka, S., et al. (eds.)
IC3 2009. CCIS, vol. 40, pp. 49–57. Springer, Heidelberg (2009). https://doi.org/
10.1007/978-3-642-03547-0 6
15. Malki, J., Wannous, R., Bouju, A., Vincent, C.: Temporal reasoning in trajectories
using an ontological modelling approach. Control Cybern. 41 (2012)
16. Mantoro, T., Muataz, Z., Ayu, M.A.: Mobile user location prediction: genetic
algorithm-based approach. In: 2010 IEEE Symposium on Industrial Electronics
and Applications (ISIEA), pp. 345–349. IEEE (2010)
17. Ridhawi, I.A., Aloqaily, M., Karmouch, A., Agoulmine, N.: A location-aware user
tracking and prediction system. In: 2009 Global Information Infrastructure Sym-
posium, pp. 1–8, June 2009
18. Ridhawi, Y.A., Ridhawi, I.A., Karmouch, A., Nayak, A.: A context-aware and
location prediction framework for dynamic environments. In: 2011 IEEE 7th Inter-
national Conference on Wireless and Mobile Computing, Networking and Commu-
nications (WiMob), pp. 172–179, October 2011
19. Samaan, N., Benmammar, B., Krief, F., Karmouch, A.: Prediction-based advanced
resource reservation in mobile environments. In: Canadian Conference on Electrical
and Computer Engineering 2005, pp. 1411–1414 (2005)
20. Samaan, N., Karmouch, A., Kheddouci, H.: Mobility prediction based service loca-
tion and delivery. In: Canadian Conference on Electrical and Computer Engineering
2004 (IEEE Cat. No. 04CH37513), vol. 4, pp. 2307–2310, May 2004
21. Samaan, N., Karmouch, A.: A mobility prediction architecture based on contextual
knowledge and spatial conceptual maps. IEEE Trans. Mob. Comput. 4(6), 537–551
(2005)
22. Spaccapietra, S., Parent, C., Damiani, M.L., de Macedo, J.A., Porto, F., Vangenot,
C.: A conceptual view on trajectories. Data Knowl. Eng. 65(1), 126–146 (2008).
https://doi.org/10.1016/j.datak.2007.10.008
23. Tiwari, S., Kaushik, S.: Modeling personalized recommendations of unvisited
tourist places using genetic algorithms. In: Chu, W., Kikuchi, S., Bhalla, S. (eds.)
DNIS 2015. LNCS, vol. 8999, pp. 264–276. Springer, Cham (2015). https://doi.
org/10.1007/978-3-319-16313-0 20
24. Tversky, A.: Features of similarity. Psychol. Rev. 84(4), 327 (1977)
25. Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C.: Optimized and meta-optimized
neural networks for short-term traffic flow prediction: a genetic approach. Transp.
Res. Part C: Emerg. Technol. 13(3), 211–234 (2005)
26. Wannous, R., Malki, J., Bouju, A., Vincent, C.: Modelling mobile object activities
based on trajectory ontology rules considering spatial relationship rules. In: Amine,
A., Otmane, A., Bellatreche, L. (eds.) Modeling Approaches and Algorithms for
Advanced Computer Applications, pp. 249–258. Springer, Cham (2013). https://
doi.org/10.1007/978-3-319-00560-7 29
27. Wannous, R., Malki, J., Bouju, A., Vincent, C.: Time integration in semantic
trajectories using an ontological modelling approach. In: Pechenizkiy, M., Woj-
ciechowski, M. (eds.) New Trends in Databases and Information Systems, pp. 187–
198. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-32518-2 18
28. Wannous, R., Malki, J., Bouju, A., Vincent, C.: Trajectory ontology inference con-
sidering domain and temporal dimensions–application to marine mammals. Future
Gener. Comput. Syst. 68, 491–499 (2016)
29. Wannous, R., Vincent, C., Malki, J., Bouju, A.: An ontology-based approach for
handling explicit and implicit knowledge over trajectories. In: Morzy, T., Valduriez,
P., Bellatreche, L. (eds.) ADBIS 2015. CCIS, vol. 539, pp. 403–413. Springer, Cham
(2015). https://doi.org/10.1007/978-3-319-23201-0 41
30. Ye, J., Zhu, Z., Cheng, H.: What’s your next move: user activity prediction in
location-based social networks. In: Proceedings of the 2013 SIAM International
Conference on Data Mining, pp. 171–179. SIAM (2013)
31. Ying, J.J.C., Lee, W.C., Tseng, V.S.: Mining geographic-temporal-semantic pat-
terns in trajectories for location prediction. ACM Trans. Intell. Syst. Technol. 5(1),
2:1–2:33 (2014)
32. Ying, J.J.C., Lee, W.C., Weng, T.C., Tseng, V.S.: Semantic trajectory mining for
location prediction. In: Proceedings of the 19th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems, GIS 2011, pp. 34–43.
ACM, New York (2011)
Global Minimum Depth in Edwards-Anderson
Model
Abstract. In the literature the most frequently cited data are quite contradic-
tory, and there is no consensus on the global minimum value of the 2D Edwards-
Anderson (2D EA) Ising model. By means of computer simulations, with the
help of the exact polynomial Schraudolph-Kamenetsky algorithm, we examined
the global minimum depth in 2D EA-type models. We found a dependence of the
global minimum depth on the dimension of the problem N and obtained its
asymptotic value in the limit N → ∞. We believe these evaluations can be
further used for examining the behavior of 2D Bayesian models often used in
machine learning and image processing.
1 Introduction
In many fields of science, it is necessary to know the global energy minimum for
different systems. Namely, in informatics we use it when solving problems of quadratic
optimization [1, 2], developing search algorithms for the global minimum and solving
max-cut problems [3–7]. In neuroinformatics, we have to know the global minimum
when developing associative memory systems and constructing neural networks and
neural network minimization algorithms [7]. In physics, the knowledge of the global
energy minimum is most frequently necessary when studying the behavior of spin glass
systems and even when describing four-photon mixing in nonlinear media [8–13].
The question of the calculation of the global minimum depth has been discussed
over the years. However, since it has no decisive answer, it remains a highly topical
problem. Indeed, in the literature the most frequently cited data are quite contra-
dictory, and there is no consensus on the global minimum value (see references in
[10]). To illustrate this statement, we present the values of the global minimum depth
obtained by different methods in Eq. (1).
Such a spread of values exists because, until recently, there were no exact calculation
algorithms for the determination of E_0. This was the reason why different authors used
different minimization methods, and consequently the obtained estimates were rather
far from the true value of E_0. New algorithms have appeared recently; they allow us
to calculate E_0 exactly when examining spin systems on planar graphs with arbitrary
boundary conditions [14]. Implementing these algorithms, we were able to refine our
results [15] for the Edwards–Anderson model (the EA model).
In the present paper, we present an experimental analysis of the global minimum
depth in the EA model, which is a spin system on an N = L × L square lattice where
only the interactions with the four nearest neighbors are nonzero. Formally, we have in
mind a system whose behavior is described by the Hamiltonian

H = \frac{1}{2} \sum_{i,j=1}^{N} J_{ij} s_i s_j    (2)

and whose normalized energy, with \sigma_J the standard deviation of the matrix
elements J_{ij}, is

E = \frac{1}{2N\sigma_J} \sum_{i=1}^{N} \sum_{j=1}^{N} J_{ij} s_i s_j    (3)
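Because the coupling matrix is symmetric, the double sum of Eq. (3) with its 1/2 factor equals a single sum over nearest-neighbour bonds, which can be evaluated directly. The sketch below assumes open boundary conditions and standard normal couplings (σ_J = 1); these assumptions need not match the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8                                   # lattice side, N = L*L spins
Jh = rng.normal(size=(L, L))            # coupling to the right neighbour
Jv = rng.normal(size=(L, L))            # coupling to the lower neighbour
s = rng.choice([-1, 1], size=(L, L))    # an arbitrary spin configuration

def energy(s, Jh, Jv):
    """Normalized energy of Eq. (3), counting each nearest-neighbour
    bond once; open boundaries, sigma_J = 1 (sketch assumptions)."""
    N = s.size
    e = np.sum(Jh[:, :-1] * s[:, :-1] * s[:, 1:])    # horizontal bonds
    e += np.sum(Jv[:-1, :] * s[:-1, :] * s[1:, :])   # vertical bonds
    return e / N

print(energy(s, Jh, Jv))
```

Determining E_0 then means minimizing this quantity over all 2^N spin configurations, which is what the exact planar-graph algorithm of [14] does in polynomial time.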
The structure of the paper is as follows. In Sect. 2, we describe our experiment and
analyze the obtained data. In Sect. 3, we discuss the results and the tables showing our
experimental data.
2 Experiment
To define the value of E_0, we used an algorithm described in [14]. In the course of our
experiment, we examined the classical EA-model (with the normal distribution of J_{ij})
and the EA*-model (with the uniform distribution of J_{ij}). For the chosen model of the
given size N = L × L, we generated M matrices J_{ij} and determined M values E_0^m,
m = 1, ..., M. We used these data to calculate the mean value and the variance of the
obtained values:

\bar{E}_0 = M^{-1} \sum_{m=1}^{M} E_0^m, \qquad \sigma_0^2 = M^{-1} \sum_{m=1}^{M} (E_0^m)^2 - \bar{E}_0^2    (4)

where x_{exp} are the experimental values, \bar{x}_{exp} are the means of the experimental values,
and x_{app} are the values obtained using the approximation formulas.
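Eq. (4) in code, applied to hypothetical E_0 values (the numbers below are illustrative, not from Table 1):

```python
import numpy as np

def ground_state_stats(e0_samples):
    """Mean and variance of the per-realization ground-state energies
    E0^m, m = 1..M, following Eq. (4)."""
    e = np.asarray(e0_samples, dtype=float)
    mean = e.mean()
    var = (e ** 2).mean() - mean ** 2   # E[x^2] - (E[x])^2
    return mean, var

# hypothetical E0 values from M = 4 disorder realizations
mean, var = ground_state_stats([-1.31, -1.33, -1.32, -1.30])
print(mean, var)
```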
2.1 EA-model
This is the Edwards–Anderson model for a two-dimensional lattice where spins interact
with their four nearest neighbors only and nonzero matrix elements are normally
distributed.
We found that the function E_0(N) is almost constant for large L. Analysis of our
experimental data, based on a standard maximization of the value of R² (5), shows
that the most accurate approximation functions have the form of Eq. (6). The first
expression of Eq. (6) matches the data of Table 1 perfectly: the value of the relative
error Err = |1 − E_0^{(exp)} / E_0^{(approx)}| is less than 2·10⁻³.
The function σ_0 = σ_0(N) (the second expression of Eq. (6)) shown in Fig. 2 also
describes the data of Table 1 very well. The value of the relative error is less than 0.4%.
[Fig. 1: E_0 as a function of the lattice size L (0–1000) for the EA-model.]
[Fig. 2: σ_0 as a function of the lattice size L (0–1000) for the EA-model.]
2.2 EA*-model
This is the same Edwards–Anderson model for a two-dimensional lattice, but here the
uniform distribution is used in place of the normal distribution.
In this case, the approximation functions obtained from our analysis of the
experimental data have the form of Eq. (7).
[Fig. 3: E_0 as a function of the lattice size L (0–1000) for the EA*-model.]
Comparing the expressions of Eq. (7) with the experiment, we see that they
describe it very well. In Fig. 3, we present the dependence E_0 = E_0(N) (the first
expression of Eq. (7)), which matches the data from Table 1 perfectly. When L > 50,
the relative error is less than 2·10⁻⁴.
The dependence σ_0 = σ_0(N) (the second expression of Eq. (7)) shown in Fig. 4
also describes the data from Table 1 very well. Here the relative error is less than 0.5%.
[Fig. 4: σ_0 as a function of the lattice size L (0–1000) for the EA*-model.]
3 Discussion
Our analysis of the two models allowed us to derive the empirical relations of
Eqs. (6) and (7) for the most important characteristics of the global minima. Our
goal was to obtain expressions which describe with high certainty the dependences
of these characteristics on N in the whole range of the problem dimensions that we
were able to examine. Based on these results, we had to determine the asymptotic
behavior of these characteristics when N → ∞. Evidently, there are different
approaches to approximating the experimental data in Table 1. Consequently, it
is possible to obtain a list of different expressions, and some of them may be even
more accurate than the expressions of Eqs. (6) and (7). However, this fact does not
change the goal of our study: independent of the form of the obtained approximation
functions, they have to describe correctly the behavior of the characteristics inside the
test range of N and provide trustworthy asymptotic values when N → ∞ (see Table 2).
As we see, the data of Table 2 differ significantly from the values presented in
Eq. (1). The point is that when minimizing the functional of Eq. (2) with a view to
calculating E_0, different authors used different minimization algorithms. To do that,
they defined E_0 as the energy corresponding to the deepest minimum found, which
under a reasonable number of tests was frequently far from the true value of E_0. As an
example, let us discuss the results of the numerical experiments [15], in which the
energy of the deepest found minimum E* was determined. Then, for the relative distance

\delta E = 100\% \cdot (E_0 - E^*) / E_0    (8)
From our point of view, this is a possible reason why the estimates of E_0 obtained
by different authors differ so significantly. Namely, when the size of the system is
sufficiently large (L ≳ 30), such an approach is not applicable, since the probability of
finding the global minimum in the course of a random search is exponentially small: it
is of the order of exp(−0.04N).
Acknowledgements. The work was supported by Russian Foundation for Basic Research
(RFBR Project 18-07-00750).
References
1. Hartmann, A.: Calculation of ground states of four-dimensional ±J Ising spin glasses. Phys.
Rev. B 60, 5135–5138 (1999)
2. Kryzhanovsky, B., Kryzhanovsky, V.: Binary optimization: on the probability of a local
minimum detection in random search. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A.,
Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 89–100. Springer,
Heidelberg (2008). https://doi.org/10.1007/978-3-540-69731-2_10
3. Houdayer, J., Martin, O.C.: Hierarchical approach for computing spin glass ground states.
Phys. Rev. E 64, 056704 (2001)
4. Litinskii, L.B., Magomedov, B.M.: Global minimization of a quadratic functional: Neural
networks approach. Pattern Recog. Image Anal. 15(1), 80–82 (2005)
5. Karandashev, Y.M., Kryzhanovsky, B.V.: Transformation of energy landscape in the
problem of binary minimization. Dokl. Math. 80(3), 927–931 (2009)
6. Liers, F., Junger, M., Reinelt, G., Rinaldi, G.: Computing exact ground states of hard Ising
spin glass problems by branch-and-cut. In: New Optimization Algorithms in Physics,
pp. 47–68. Wiley (2004)
7. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational
abilities. Proc. Nat. Acad. Sci. USA 79, 2554–2558 (1982)
8. Thouless, D.J., Anderson, P.W., Palmer, R.G.: Solution of solvable model of a spin glass.
Philos. Mag. 35, 593–601 (1977)
398 I. Karandashev and B. Kryzhanovsky
1 Introduction
The problem of detecting rare patterns through data-driven techniques
is common to different fields. Many practical tasks envisage the identification of
uncommon samples by means of a classifier trained on a dataset that, due
to the nature of the observed phenomenon, is imbalanced [22]. Examples of such
situations can be found in the industrial field in machine fault or defect identifi-
cation [2,18], in medical applications for the diagnosis of particular pathologies,
and in finance for the recognition of fraudulent transactions. All these applications
focus on the correct identification of the rare situation that, in the specific con-
texts, is the one of interest: defects have to be avoided in order to preserve
product quality, machine faults have to be spotted to limit their consequences,
diseases must be correctly diagnosed. These problems also share the fact that
the misclassification of frequent patterns (the so-called false alarms) is strongly
preferable to the missed detection of rare ones.
Multiple interacting factors make the satisfactory classification of imbal-
anced datasets with standard classifiers hard. First of all, there is the basic assump-
tion of an even distribution among classes, which is implicitly made by classifiers
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 399–411, 2019.
https://doi.org/10.1007/978-3-030-20257-6_34
400 M. Vannucci and V. Colla
such as Artificial Neural Networks (ANN) [3], Decision Trees (DT) and Sup-
port Vector Machines (SVM). The goal of most machine learning algorithms is
the achievement of an optimal overall performance; this goal is met in the case
of balanced classes but, in the case of imbalanced datasets, standard classifiers
turn out to be biased toward the majority class and, as a consequence, the minority
class is neglected [10]. An additional element that contributes to the difficulty of
the task is the complexity of the classification problem: in fact, in the presence of
complex decision boundaries and highly overlapping classes the task of the
learner becomes harder, and standard classifiers tend to solve conflicts in favor of
the majority class [8]. The difficulty of classifying an imbalanced dataset was
shown to grow with the imbalance degree [8] and with the low quality (i.e.,
presence of noise and outliers) of the data, which prevents the classifier from
correctly representing the different classes, either due to the low number of available
data or to their limited reliability.
In this paper, a method based on the combined use of Self-Organizing
Maps (SOM) and Genetic Algorithms (GA) for the pre-processing of unbalanced
datasets is presented. This method tries to efficiently merge the characteristics of
the main approaches into an optimization context for the maximization of classifier
performance. The organization of the paper is the following: Sect. 2 intro-
duces the main families of methods designed for dealing with class imbalance.
In Sect. 3 the main characteristics of the proposed algorithm are described, while
the performance it achieves on a set of tests involving imbalanced datasets is
reported in Sect. 4. Finally, in Sect. 5 conclusions are drawn and future perspec-
tives of the proposed approach are outlined.
The method proposed in this work belongs to the external approaches. This cat-
egory of approaches tries to improve classifier performance by reducing the
imbalance rate of the training dataset through so-called resampling. Resam-
pling increases the rate α of rare samples with respect to the total number of
observations. One of the main advantages of this category of methods is that they
are extremely portable: since they do not require modifications on the algorithm
side, they can be used for the pre-processing of a training dataset to be fed to
any kind of classifier. An issue is the determination of the best value of α, which
mostly depends on the problem characteristics; no rationale for its determina-
tion exists, so it must often be set empirically. Resampling can take place
either by removing frequent samples (so-called under-sampling) or by adding
rare ones (so-called over-sampling). Neither of these two approaches, which can
also be combined, is prevalent for all problems [1], and both of them have some
criticality.
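The basic over-sampling variant can be sketched as follows; the function and parameter names are illustrative, not from the paper:

```python
import math, random

def random_oversample(majority, minority, alpha, seed=0):
    """Replicate randomly chosen minority samples until they make up a
    share alpha of the rebalanced dataset (naive over-sampling
    baseline; names are illustrative)."""
    rng = random.Random(seed)
    # minority count needed so that minority / total == alpha
    target = math.ceil(alpha * len(majority) / (1 - alpha))
    extra = [rng.choice(minority)
             for _ in range(max(0, target - len(minority)))]
    return majority + minority + extra

# 90 frequent (0) vs 10 rare (1) samples, rebalanced to alpha = 0.3
data = random_oversample(majority=[0] * 90, minority=[1] * 10, alpha=0.3)
print(sum(data), len(data))   # 39 129
```

Random under-sampling is symmetric: instead of replicating rare samples, randomly chosen frequent samples are discarded until the same rate α is reached.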
A wide variety of strategies can be applied for over- and under-sampling.
The most basic one consists in the random selection of the samples to
remove or replicate. Although this simple method can lead to interesting results,
it carries some risks. On the one hand, the removal can affect samples with high
informative content, harming classifier performance. On the
other hand, random oversampling can lead to the formation of compact clusters
of minority samples that reduce, instead of expanding, the area of the input
space associated with the rare class, giving rise to over-fitting problems [1,8].
In order to avoid these detrimental effects, more sophisticated resampling
strategies have been developed, aiming at the selection of the samples whose
removal or addition maximizes the benefit for the classifier. An example of
focused oversampling is found in [11], where only the infrequent samples in prox-
imity of the boundary regions between the minority and majority classes are repli-
cated: this selection broadens the input domain that the classifier assigns to the
minority class and reduces eventual conflicts that standard classifiers would solve
in favor of the majority class. Another oversampling approach that goes beyond the
pure replication of existing samples is proposed in [5] with the SMOTE (Syn-
thetic Minority Oversampling TEchnique) algorithm. This method creates syn-
thetic infrequent samples, locating them where they could likely lie (i.e., along
the lines connecting two existing minority samples). The main advantage of this
widely used method is the exploitation of original (although synthetic) infor-
mation, which avoids the overfitting that can be caused by the pure replication
of samples. On the other hand, the risk that comes with SMOTE is the possi-
ble introduction of misleading information due to the wrong positioning of the
created samples, for instance in the neighborhood of clusters of frequent sam-
ples.
ples. An evolution of the SMOTE algorithm is the SUNDO method [4] that tries
to overcome this issue by calculating the placing of the newly created samples
according to the samples of the other class. Focused undersampling is inves-
tigated as well in literature works. In [12] and [13] frequent samples from the
regions where inter–class conflicts are more frequent are removed. This selec-
tion aims at reducing majority class redundancy without decreasing the dataset
informative content. In [21] undersampling is performed by using a two SOMs
that determine the centroids of the clusters of frequent and rare samples. This
information is used to guide the selection of the frequent samples to be removed
in order to maximize the impact of this operation selecting, for instance, samples
belonging to dense majority clusters or samples in conflicting regions. Several
attempts of combining Smart–Undersampling and Smart–Oversampling have
been done. In particular the OGAR algorithm (Optimal GA–based Resampling)
[20] combines these two approaches and optimally determines the resampling
rates by using a GA. The resampling method presented in this work extends
the investigation pursued in [21] by exploiting the SOM–based clusterization for
both over– and under– sampling, moreover it goes beyond it by using a GA for
optimizing the two resampling rates.
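The core interpolation step of SMOTE can be sketched in a few lines (a minimal illustration of the idea described in [5], not the reference implementation; the toy `minority` array and the function name are invented for the example):

```python
import numpy as np

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic sample on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(minority))
    x = minority[i]
    # distances from x to every minority sample (including itself)
    d = np.linalg.norm(minority - x, axis=1)
    neighbours = np.argsort(d)[1:k + 1]   # skip x itself
    j = rng.choice(neighbours)
    gap = rng.random()                    # random position along the segment
    return x + gap * (minority[j] - x)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
s = smote_sample(minority)
print(s)  # lies on a segment joining two of the three minority points
```

Because the synthetic point is a convex combination of two existing minority samples, it always falls inside the minority region spanned by them, which is exactly why SMOTE can misplace samples near clusters of frequent data.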
In this work, an approach based on the optimal combination of over- and under-
sampling is proposed in order to simultaneously obtain the advantages and avoid
the drawbacks of both approaches. The two rebalancing methods are applied
on the basis of the spatial distribution of the original samples, according to optimal
intervention rates determined through a GA. In brief, this approach exploits
two different SOMs that determine two distinct clusterings of frequent and
infrequent samples respectively. The outcomes of the SOMs are used to
determine the regions of the domain where the two classes of samples are
more dense. Such information is used to calculate a ranking among the frequent
samples that determines their suitability for removal, and another ranking,
applied to a set of synthetically created infrequent samples, that estimates the
impact of their inclusion within a training dataset.
Given an original dataset D, characterized by an α unbalance rate, the whole
process can be summarized through the following points that will be discussed
more in detail in the subsequent parts of this section:
Imbalanced Datasets Resampling Through SOM and GAs 403
1. the frequent samples D− and infrequent samples D+ of D are used to train two
distinct SOMs. The two SOMs determine two sets of centroids (one for each neu-
ron/cluster, whose number is automatically determined): C_F and C_U respec-
tively
2. a set of M synthetic infrequent samples S+ is created by means of the
SMOTE algorithm. The quantity M of these samples is determined so as
to completely rebalance the training dataset D
3. four rankings are calculated, two for the D− samples and two for the S+
samples. At the top of the rankings applied to D− samples, the frequent
samples whose elimination from the training dataset is most useful will be
located whilst, at the top of the rankings of synthetic data, those samples
whose inclusion in the dataset is most beneficial will be found
4. a GA-based optimization simultaneously determines the rates of samples
– to remove according to the undersampling rankings
– to add to the dataset according to the oversampling rankings
5. the so-determined rates are applied to the D− and S+ datasets to form the
final resampled training dataset D∗. Undersampling is performed by elimi-
nating from D− the frequent samples at the top of the undersampling rankings
according to the two optimal undersampling rates. Oversampling takes place
by adding to this dataset the synthetic samples at the top of the S+ samples
rankings according to the two optimal oversampling rates.
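The data flow of the five steps above can be sketched with deliberately simplified stand-ins (a plain per-split mean instead of SOM training, only the MDRC ranking out of the four, and fixed rates instead of the GA-optimized ones; every array below is a toy example, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 40 frequent (class 0) vs 8 rare (class 1) samples.
D_minus = rng.normal(0.0, 1.0, size=(40, 2))   # frequent samples
D_plus = rng.normal(3.0, 0.5, size=(8, 2))     # infrequent samples

# Step 1 (stand-in): centroids C_F and C_U. The paper trains two SOMs;
# here a plain mean over arbitrary splits only illustrates the idea.
C_F = D_minus.reshape(4, 10, 2).mean(axis=1)
C_U = D_plus.reshape(2, 4, 2).mean(axis=1)

# Step 2: M synthetic rare samples via SMOTE-style interpolation,
# with M chosen to fully rebalance the dataset.
M = len(D_minus) - len(D_plus)
idx = rng.integers(len(D_plus), size=(M, 2))
gaps = rng.random((M, 1))
S_plus = D_plus[idx[:, 0]] + gaps * (D_plus[idx[:, 1]] - D_plus[idx[:, 0]])

# Step 3 (one ranking of the four): frequent samples closest to a rare
# centroid are the best undersampling candidates (cf. the MDRC criterion).
mdrc = np.min(np.linalg.norm(D_minus[:, None] - C_U[None], axis=2), axis=1)
under_rank = np.argsort(mdrc)            # ascending: closest first

# Steps 4-5 (stand-in): apply fixed rates instead of GA-optimized ones.
r_under, r_over = 0.2, 0.5
keep = under_rank[int(r_under * len(D_minus)):]
D_star = np.vstack([D_minus[keep], D_plus, S_plus[: int(r_over * M)]])
print(D_star.shape)
```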
For the clustering of the frequent and rare samples, SOMs are used in order
to exploit their capability of preserving the distribution and topology of the
original data. At the end of the SOM training, more clusters will be located in
the regions of the space where more original data are placed, so as to represent
such regions in more detail. Further, the natural spatial relationship among
samples will be kept at the level of clusters: samples that are close to each other
will be associated with the same cluster or with a nearby one. The number of
neurons of the two SOMs, which determines the number of clusters, is chosen
by means of the Silhouette criterion [6], which evaluates the goodness of different
clusterings on the basis of the similarity between each cluster and the data
samples associated with it. In this case, different hexagonal-topology SOMs are
evaluated through this criterion, varying the number of neurons and the shape of
the map, in order to select the best one. The four criteria adopted for the
resampling are based on the distances of the data samples with respect to the
centroids C_F and C_U. The number of these centroids is set by means of the
previously described Silhouette method.
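Model selection by the Silhouette criterion can be illustrated with scikit-learn; KMeans stands in for the SOM here, since the criterion itself only needs a clustering and the data (the blob dataset is invented for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated blobs: the silhouette score should peak at k = 3.
data = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = silhouette_score(data, labels)   # mean silhouette over samples

best_k = max(scores, key=scores.get)
print(best_k)
```

In the paper the candidate clusterings are SOMs of varying size and shape rather than KMeans runs, but the selection loop is the same: train each candidate, score it, keep the best.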
The criteria designed for the ranking of frequent samples D− aim at putting
at the top of the ranking those samples whose presence in the training dataset
would lead to a degradation of classifier performance. The two criteria for the
selection of undersampled data, calculated for each frequent sample, are:
2. Minimum Distance from Rare Centroid (MDRC), shown in Eq. 2 and calcu-
lated for each sample d ∈ D−. This metric is higher for frequent samples
closer to infrequent ones. The elimination of samples for which this measure
is high reduces the conflicts between frequent and rare samples, favoring the
expansion of the rare class in the domain region from which the sample is removed
The criteria that drive the ranking of the synthetically created infrequent
samples grant higher evaluations to the samples in S+ whose presence within
the training dataset would broaden the domain region associated with this class,
possibly avoiding any conflict with the frequent class. The two criteria are imple-
mented as follows:
1. Minimum Distance from Frequent Centroid (MDFC), depicted in Eq. 3,
applies to all samples d ∈ S+ and is higher for those samples farther from the
regions where frequent samples are more dense. The use of this criterion
avoids one of the main limitations of SMOTE which, when creating an arbi-
trary synthetic sample, does not check for the possible proximity of frequent
samples, increasing the risk of conflicts
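Both distance-based criteria boil down to a nearest-centroid distance followed by a sort; a toy sketch (the centroids and samples below are invented, and the real criteria of Eqs. 2–3 may differ in detail):

```python
import numpy as np

def min_dist_to_centroids(samples, centroids):
    """For each sample, the distance to its nearest centroid."""
    diff = samples[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=2).min(axis=1)

C_U = np.array([[5.0, 5.0]])                  # rare-class centroid(s)
C_F = np.array([[0.0, 0.0]])                  # frequent-class centroid(s)
D_minus = np.array([[0.5, 0.5], [4.5, 4.8]])  # frequent samples
S_plus = np.array([[4.0, 4.0], [1.0, 1.0]])   # synthetic rare samples

# MDRC: frequent samples *closest* to a rare centroid rank highest for removal.
mdrc_rank = np.argsort(min_dist_to_centroids(D_minus, C_U))
# MDFC: synthetic samples *farthest* from frequent centroids rank highest.
mdfc_rank = np.argsort(-min_dist_to_centroids(S_plus, C_F))
print(mdrc_rank, mdfc_rank)
```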
These four interacting criteria are at the basis of the simultaneous under- and
over-sampling process of the original data. More in detail, if the training dataset
D∗ is initially set as D∗ = D, two rates R_ADCF and R_MDRC of samples will be
selected according to their respective rankings (from the top) and removed from
D∗. At the same time, two rates R_MDCF and R_LCR of synthetically created
samples will be taken from their associated rankings and added to the training
dataset D∗.
The optimal resampling rates associated with each ranking are determined
through a GA-based optimization in order to maximize the benefits in terms of
classification performance. The GA candidate solutions are coded as 4 elements
where the |E_TR − E_VD| term is added to take into account the effect of
overfitting.
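The exact fitness expression is lost to the page break, but one plausible reading consistent with the quoted |E_TR − E_VD| term is sketched below, together with a minimal GA loop using the stated population of 100 and 50 generations (the selection, crossover and mutation operators here are assumptions for illustration, not the paper's operators, and `toy_evaluate` replaces the real resample-train-evaluate step):

```python
import random

# A candidate solution is a vector of four resampling rates, one per ranking:
# (R_ADCF, R_MDRC) for undersampling and (R_MDCF, R_LCR) for oversampling.
def fitness(rates, evaluate):
    """Hypothetical fitness: validation error plus an |E_TR - E_VD|
    penalty discouraging overfitted resampling configurations."""
    e_tr, e_vd = evaluate(rates)            # training / validation errors
    return e_vd + abs(e_tr - e_vd)

def toy_evaluate(rates):
    # Stand-in for "resample the dataset, train a classifier, measure errors";
    # the toy optimum sits at all rates equal to 0.3.
    e_vd = sum((r - 0.3) ** 2 for r in rates)
    e_tr = 0.5 * e_vd
    return e_tr, e_vd

random.seed(0)
pop = [[random.random() for _ in range(4)] for _ in range(100)]
for _ in range(50):                          # terminal condition: 50 generations
    pop.sort(key=lambda ind: fitness(ind, toy_evaluate))
    parents = pop[:20]                       # elitist truncation selection
    children = []
    while len(children) < 80:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, 4)         # one-point crossover
        child = a[:cut] + b[cut:]
        i = random.randrange(4)              # Gaussian mutation of one gene
        child[i] = min(1.0, max(0.0, child[i] + random.gauss(0, 0.05)))
        children.append(child)
    pop = parents + children
best = min(pop, key=lambda ind: fitness(ind, toy_evaluate))
print(best)
```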
4 Experimental Tests
The SOM-based resampling method proposed in this paper has been tested
on several tasks that involve imbalanced datasets. In all these datasets, which
come both from the UCI repository [14] and from real industrial applications, the
number of samples belonging to the class whose identification is more important
is far lower than in the others. The main features of the test datasets are
summarized in Table 1, where the origin of each dataset is shown together with
its number of samples and variables and the original unbalance rate.
A description and contextualization of the data coming from the UCI repos-
itory can be found on-line; the others, all deriving from the steel industry, are
briefly described below. The Nozzle Clogging (NC) dataset was formed to gain a
better understanding of the occlusion phenomenon that affects ladle nozzles dur-
ing the continuous casting of steel. This dataset collects, for each cast, various
measurements from sensors and machine operating parameters to be associated
with the possible occurrence of clogging. The correct detection of such patterns
can avoid machine faults and improve steel quality; on the other hand, the genera-
tion of some false alarms is tolerable. The two Metal Sheet Quality (MSQ1,
MSQ2) datasets collect data for the automatic grading of steel coils. They include
the information obtained by the coil surface inspection system, which is used for
assessing the compliance of each product with quality standards. Since it is fun-
damental not to put defective products on the market, the correct identification of
non-complying sheets (the rare patterns) is of utmost importance.
For the sake of the validity of the performed test campaign, all the original
datasets used in this work have been divided, prior to any processing step, into a
training and a test set (70% and 30% of the samples respectively), each one with
the same unbalance ratio as the original dataset. The results obtained by the SOM-
Based Resampling (SBR) on these classification tasks are compared to those
achieved by other approaches used for imbalanced dataset resampling: random
oversampling and undersampling, SMOTE oversampling, SOM-Undersampling
and OGAR. Further, the results obtained by using the original dataset are
reported for measuring the impact of each resampling approach.
The resampled dataset obtained by the use of each of the listed techniques is
exploited for the training of a DT tuned by means of the C4.5 algorithm. The GA
engine settings for the methods (OGAR and SBR) that exploit it for reaching
the optimal unbalance ratio set the population cardinality to 100 and a terminal
condition that stops the search when 50 generations are completed. The other
operators (i.e. crossover, selection) are the same for these approaches (already
described in Sect. 3) in order to grant the comparability of the results. For the
other methods, which do not determine the optimal under- and/or over-sampling
rates in an automatic manner, several runs of the algorithms were performed,
varying the resampling rates from 5% up to 50%, and the best rates have been
considered.
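The evaluation protocol described above can be reproduced with scikit-learn's stratified split (the toy dataset is invented, and `DecisionTreeClassifier` implements CART rather than C4.5, so it is only a stand-in for the paper's DT):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)            # 10% rare class

# 70/30 split preserving the unbalance ratio, as in the test campaign.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tpr = clf.score(X_te[y_te == 1], y_te[y_te == 1])   # rare-class recall (TPR)
fpr = 1.0 - clf.score(X_te[y_te == 0], y_te[y_te == 0])
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # both keep the 10% ratio
```

Resampling (any of the compared techniques) would be applied to `X_tr`/`y_tr` only, never to the test set, so that TPR/FPR are measured on the natural class distribution.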
The results achieved by all the tested approaches are summarized in the fol-
lowing tables. For each method, the following information is provided (Tables 2,
3, 4, 5, 6 and 7):
The results achieved on the datasets coming from the UCI repository put
into evidence the good performance of the proposed method compared to
standard and advanced resampling techniques. On the CARDATA and NURS-
ERY datasets all the approaches are able to achieve a satisfactory accuracy,
even the no-resampling strategy. Nevertheless, the methods that perform data
resampling are able to appreciably increase the TPR and keep the FPR low. On
the CARDATA dataset the performance of the advanced methods is equivalent
The good performance of the SBR method is confirmed also on the indus-
trial datasets, where it is able to grant a high TPR, always much higher than
that of the standard resampling methods and comparable to the advanced ones,
but with a markedly lower FPR: on the NOZZLE CLOGGING dataset
it is 9% lower than that of SOM-Undersampling (whose TPR is just 3%
higher); on the MSQ-2 dataset the FPR is 3% lower than OGAR's (same TPR) and
23% lower than SOM-Undersampling's (whose TPR is 10% higher).
In general, the resampling operated by SBR on the original dataset is less
aggressive than that of the other approaches achieving a similar TPR: it results
in an unbalance ratio more similar to the natural one and which is likely due
References
1. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several
methods for balancing machine learning training data. SIGKDD Explor. Newsl.
6(1), 20–29 (2004)
2. Borselli, A., Colla, V., Vannucci, M., Veroli, M.: A fuzzy inference system applied
to defect detection in flat steel production. In: 2010 IEEE World Congress on
Computational Intelligence, WCCI 2010 (2010)
3. Cateni, S., Colla, V., Vannucci, M.: A genetic algorithm-based approach for select-
ing input variables and setting relevant network parameters of a SOM-based classi-
fier. Int. J. Simul.: Syst. Sci. Technol. 12(2), 30–37 (2011)
4. Cateni, S., Colla, V., Vannucci, M.: A method for resampling imbalanced datasets
in binary classification tasks for real-world problems. Neurocomputing 135, 32–41
(2014)
5. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)
6. De Amorim, R.C., Hennig, C.: Recovering the number of clusters in data sets with
noise features using feature rescaling factors. Inf. Sci. 324, 126–145 (2015)
7. Elkan, C.: The foundations of cost-sensitive learning. In: International Joint Con-
ference on Artificial Intelligence, vol. 17, pp. 973–978. Lawrence Erlbaum Asso-
ciates Ltd (2001)
8. Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning
from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
9. Fan, W., Stolfo, S.J., Zhang, J., Chan, P.K.: AdaCost: misclassification cost-
sensitive boosting. In: Proceedings of the Sixteenth International Conference on
Machine Learning, ICML 1999, pp. 97–105. Morgan Kaufmann Publishers Inc.,
San Francisco (1999)
10. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data
Eng. 21(9), 1263–1284 (2009)
11. Japkowicz, N.: The class imbalance problem: significance and strategies. In: Pro-
ceedings of the 2000 International Conference on Artificial Intelligence (ICAI), pp.
111–117 (2000)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided
selection. In: Proceedings of the Fourteenth International Conference on Machine
Learning, pp. 179–186. Morgan Kaufmann (1997)
13. Laurikkala, J.: Improving identification of difficult small classes by balancing class
distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) AIME 2001.
LNCS (LNAI), vol. 2101, pp. 63–66. Springer, Heidelberg (2001). https://doi.org/
10.1007/3-540-48229-6 9
14. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/
ml
15. Ling, C.X., Yang, Q., Wang, J., Zhang, S.: Decision trees with minimal costs. In:
Proceedings of the Twenty-first International Conference on Machine Learning,
ICML 2004, p. 69. ACM, New York (2004)
16. Soler, V., Prim, M.: Rectangular basis functions applied to imbalanced datasets.
In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS,
vol. 4668, pp. 511–519. Springer, Heidelberg (2007). https://doi.org/10.1007/978-
3-540-74690-4 52
17. Vannucci, M., Colla, V.: Novel classification method for sensitive problems and
uneven datasets based on neural networks and fuzzy logic. Appl. Soft Comput. J.
11(2), 2383–2390 (2011)
18. Vannucci, M., Colla, V., Nastasi, G., Matarese, N.: Detection of rare events within
industrial datasets by means of data resampling and specific algorithms. Int. J.
Simul.: Syst. Sci. Technol. 11(3), 1–11 (2010)
19. Vannucci, M., Colla, V., Sgarbi, M., Toscanelli, O.: Thresholded neural networks
for sensitive industrial classification tasks. In: Cabestany, J., Sandoval, F., Prieto,
A., Corchado, J.M. (eds.) IWANN 2009. LNCS, vol. 5517, pp. 1320–1327. Springer,
Heidelberg (2009). https://doi.org/10.1007/978-3-642-02478-8 165
20. Vannucci, M., Colla, V.: Genetic algorithms based resampling for the classification
of unbalanced datasets. Smart Innov. Syst. Technol. 73, 23–32 (2018)
21. Vannucci, M., Colla, V.: Self organizing maps based undersampling for the classi-
fication of unbalanced datasets. In: 2018 International Joint Conference on Neural
Networks (IJCNN), pp. 1–6, July 2018
22. Vannucci, M., Colla, V.: Classification of unbalanced datasets and detection of rare
events in industry: issues and solutions. In: Jayne, C., Iliadis, L. (eds.) EANN 2016.
CCIS, vol. 629, pp. 337–351. Springer, Cham (2016). https://doi.org/10.1007/978-
3-319-44188-7 26
23. Wu, Y., Shen, L., Zhang, S.: Fuzzy multiclass support vector machines for unbal-
anced data. In: 2017 29th Chinese Control And Decision Conference (CCDC), pp.
2227–2231, May 2017
24. Yuan, Z., Bao, D., Chen, Z., Liu, M.: Integrated transfer learning algorithm using
multi-source tradaboost for unbalanced samples classification. In: 2017 Interna-
tional Conference on Computing Intelligence and Information System (CIIS), pp.
188–195, April 2017
Improvement of Routing in Opportunistic
Communication Networks of Vehicles
by Unsupervised Machine Learning
1 Introduction
contain any fixed base stations or fixed routers. The routing is provided by the
nodes of the network. Each node provides several functions simultaneously: it
serves as (i) source node, (ii) destination node, (iii) transmission node, (iv) trans-
portation node. As the source node, the node generates a message for the des-
tination node. As the transmission node, the node transmits messages received
from other nodes. As the transportation node, the node moves and transports
messages received from other nodes.
2 Previous Work
In recent years, many different algorithms for routing in OPN/DTN have been
proposed. Unfortunately, there is no unique taxonomy of OPN routing proto-
cols; instead, several taxonomies have been proposed. Pelusi et al. [18]
have adopted a hierarchical taxonomy of OPN routing protocols proposed by
Zhang [17]. At the highest level of this taxonomy, the OPN routing algorithms
are classified into two classes: routing without infrastructure and routing with
infrastructure. The first class contains methods designed for completely
flat ad hoc networks, while the second class contains algorithms in which
some form of infrastructure is used in order to opportunistically forward mes-
sages. The research presented in this paper is considered to be applicable only to
OPNs that support routing without infrastructure.
Hong et al. [28] have proposed a hierarchical taxonomy of OPN routing pro-
tocols based on routing protocol reactivity or proactivity. At the highest level,
the OPN routing protocols are classified into two categories: proactive rout-
ing and reactive routing protocols. The proactive routing class contains meth-
ods which use centralized or offline knowledge about the mobile network to
make the routing decision. The reactive routing class contains methods in which
the nodes compute forwarding strategies from the contact history, without
global or predetermined knowledge. Examples of proactive routing protocols
are knowledge-based routing schemes [8], RAPID [3], routing in cyclic mobile
space [13], capacity-aware routing using throw-boxes [5], MobySpace [10]
and ML-SOR [22]. Examples of reactive routing protocols are First Contact,
Epidemic [27], PROPHET [12], Spray and Wait [24], Seek and Focus [26], Spray
and Focus [25], Bubble Rap [7], social network-based multicasting [4], and Island
Hopping [20]. Context-based algorithms do not use flooding techniques; instead,
they try to select the nodes to which the message should be forwarded.
Another approach to OPN routing protocol classification has been adopted
by Moreira et al. [14–16], who have proposed a hierarchical taxonomy which
takes into account both the way of message transmission and OPN social and
topological features, such as contact frequency and age, resource utilization, com-
munity formation, common interests or node popularity. At the highest level, the
OPN routing protocols are classified into three categories: (i) forwarding-based
routing protocols, (ii) flooding-based routing protocols, and (iii) replication-
based routing protocols.
Xia et al. [29] proposed a hierarchical taxonomy of OPN routing protocols,
primarily in the context of socially aware routing. At the highest level, the OPN
414 L. Smı́tková Janků and K. Hyniová
routing protocols are classified into two categories: (i) unicast routing and (ii)
multicast routing. The unicast routing protocols are further divided into two
groups: (i) community-based routing and (ii) community-independent routing.
The community-based routing class contains OPN routing protocols which use
knowledge obtained from community detection and formation in order to
improve routing performance. As examples of community-based routing
protocols, Xia et al. [29] have discussed BUBBLE RAP [7], LocalCom, Gently
and Diverse Routing. The approach presented in this paper is community-based
routing.
Ahmad et al. [1] have proposed a taxonomy which divides OPN routing
protocols into six main classes: Geographic, Link State-aware, Context-aware,
Probabilistic, Optimization-Based and Cross-Layer routing protocols.
The simulations are conducted with the Opportunistic Network Environ-
ment (ONE) simulator [9], which has been used previously as a simulation
environment in the scientific literature on OPN routing protocols. Using the ONE
simulator, Li et al. [11] have studied how the selfish behaviors of nodes affect the
performance of DTN multicast. They used a standard mobility model available
in the ONE simulator. Socievole et al. compared six different routing protocols
using a simulation scenario with a random way-point mobility model in the ONE
simulator [21]. Spaho et al. [23] conducted simulations with the ONE simulator in
order to evaluate and compare the performance of four different routing proto-
cols in a many-to-one opportunistic communication network. In [6] the ONE
simulator has been used to evaluate the performance of the SRAMSW routing
algorithm.
correspond to the nodes of the OPN. If the finite opportunistic network graph
exists, the existence of OPN communication paths for all pairs of vertices
can in theory be computed analytically; however, this task is considered to be an
NP-hard problem, so it is computationally intractable.
The main idea of community-based routing is that the community has a
strong impact on human mobility patterns. First, the mobile nodes are grouped
into communities by a certain community detection algorithm. Second, a rout-
ing scheme is proposed and the messages are forwarded in accordance with this
routing scheme. In our research, we try to find communities by applying
unsupervised machine learning rather than by contact graph partitioning using
clique computation algorithms or modularity.
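As an illustration of the unsupervised-learning route (the method's actual procedure is described later in the paper and may differ in detail), nodes can be grouped by clustering their coordinates rather than by partitioning a contact graph; the node positions below are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical node positions concentrated in two geographic clusters
# of a city map, as in a cluster opportunistic network.
positions = np.vstack([rng.normal((0, 0), 50, (60, 2)),
                       rng.normal((1500, 1500), 50, (60, 2))])

# Unsupervised grouping of nodes into communities/sectors: each node gets a
# label purely from its spatial distribution, with no contact graph involved.
communities = KMeans(n_clusters=2, n_init=10,
                     random_state=0).fit_predict(positions)
print(np.bincount(communities))
```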
Centrality is a network property which characterizes the importance of a node
in the network. The most recognized centralities are closeness centrality, degree
centrality and betweenness centrality.
The closeness centrality of a vertex x, for a given graph G := (V, E) with N
vertices and |E| edges, is defined as

C(x) = 1 / Σ_y d(y, x)    (1)

The betweenness centrality of a vertex v is defined as

C_B(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st    (2)

where σ_st is the total number of shortest paths from node s to node t and σ_st(v)
is the number of those paths that pass through v.
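Both centralities are available off the shelf, e.g. in networkx (the small graph below is invented), which makes it easy to see that hub-like nodes dominate the betweenness ranking:

```python
import networkx as nx

# A small graph where node 2 lies on most shortest paths: 0-2, 1-2, 2-3, 3-4.
G = nx.Graph([(0, 2), (1, 2), (2, 3), (3, 4)])

# Closeness: (normalised) reciprocal of the sum of distances to all nodes.
closeness = nx.closeness_centrality(G)
# Betweenness: number of shortest-path pairs routed through each node.
betweenness = nx.betweenness_centrality(G, normalized=False)

hub = max(betweenness, key=betweenness.get)
print(hub)  # node 2 is the hub
```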
Computing the betweenness and closeness centralities of all vertices in a graph
involves calculating the shortest paths between all pairs of vertices. The nodes
which have high values of betweenness centrality become the most important
nodes for routing in OPNs. The equations above are valid for static graphs; for
temporal graphs it is necessary to compute the centralities through the time devel-
opment of the graph, which is computationally expensive. Using centralities as
routing metrics gives very good results, and they are involved in several routing
schemes, such as ML-SOR. The disadvantage of applying betweenness in particular
4 Routing Method
5 Performance Evaluation
The proposed routing algorithm was experimentally evaluated on data in the
ONE simulator [9]. Xia et al. [29] summarize research on socially aware routing
and forwarding protocols and have demonstrated that the proposed methods are
mainly compared with the Epidemic and PROPHET routing protocols. We decided
to compare our method with the well-known Epidemic routing protocol. Epidemic
routing is a flooding protocol based on general broadcasting of messages. It
determines an upper bound for the message delivery ratio and a lower bound for
the message delivery delay.
The area of the opportunistic network was generated from the road map of
the city of Venice. The size of the area of the opportunistic network was
2224 m × 2225 m. We generated a scenario with 129 vehicles (nodes). Each vehi-
cle is assigned a set of destinations S. Each node moves from one destination
in the set to another, and then to another. The traces along which the nodes
move between destinations are generated automatically by the simulator using
the Dijkstra algorithm. The sets of destinations were selected in order to achieve
a distribution of traces typical for cluster opportunistic networks. The number
of clusters was set to 6: 119 nodes have several local destinations in a part of
the city, while 10 vehicles have trace destinations all over the simulation area.
We tested communication for different sizes of the message buffer (the "buffer"
parameter): 5, 50, 500 messages. If the buffer overflows, the messages that
cannot be stored in the buffer are discarded (lost). The node positions were col-
lected every 0.1 simulation units. The training data were collected over 43200
simulation units. Messages are generated by the nodes during the whole sim-
ulation. We tested the method performance for different periods of message
generation by the nodes.
The messages are injected into the system as follows: each node generates a new
message periodically. The number of geographical sectors was estimated by the
method as 14. The number of local communication communities estimated in
each time slot oscillates between 20 and 34. We conducted experiments on the
simulation scenario for different time periods of message generation in order to
observe the influence of the number of messages injected into the system on the
message delivery ratio. The message delivery ratio as a function of the message
generation period is shown in Fig. 1. Neither the proposed method nor Epidemic
works well when the message generation rate is too high (i.e., a high number of
generated messages in the system). As the message generation period decreases,
we can observe a rapid improvement in the number of delivered messages for the
proposed method in comparison to epidemic routing. Figure 1 also shows the
relation between the message delivery ratio and the TTL parameter. The proposed
method outperformed Epidemic routing in overhead cost ratio: the overhead cost
ratio of the proposed method reaches only approximately 20% of that of Epidemic
routing, but this result was expected due to the flooding character of Epidemic
routing. The proposed method also outperforms Epidemic routing in the message
delivery delay performance metric (Fig. 2).
Improvement of Routing in Opportunistic Communication Networks 421
6 Conclusion
This paper deals with the application of unsupervised machine learning to
improve routing for communication in a special class of opportunistic networks
of vehicles called cluster opportunistic networks. With respect to existing work
in this area, we have proposed a hierarchical routing algorithm which combines
several routing strategies in order to improve routing in OPNs: a metric based
on the node affiliation with a detected OPN geographic sector, a metric based
on the node affiliation with the communication community constructed in the
spatio-temporal domain with time constraints, a metric based on the node local
encounter measure, and a metric for a special type of node called geoconnect.
The proposed routing scheme combines these metrics to make decisions on
message forwarding. In comparison to existing approaches, the communities are
constructed only locally for predefined time slots.
References
1. Ahmad, K., Udzir, N.I., Deka, G.C.: Opportunistic Networks: Mobility Models,
Protocols, Security, and Privacy. CRC Press, Boca Raton (2018)
2. Ashbrook, D., Starner, T.: Using GPS to learn significant locations and predict
movement across multiple users. Pers. Ubiquit. Comput. 7(5), 275–286 (2003)
3. Balasubramanian, A., Levine, B., Venkataramani, A.: DTN routing as a resource
allocation problem. In: ACM SIGCOMM Computer Communication Review, vol.
37, pp. 373–384. ACM (2007)
4. Gao, W., Li, Q., Zhao, B., Cao, G.: Multicasting in delay tolerant networks: a social
network perspective. In: Proceedings of the Tenth ACM International Symposium
on Mobile Ad Hoc Networking and Computing, pp. 299–308. ACM (2009)
5. Gu, B., Hong, X.: Capacity-aware routing using throw-boxes. In: 2011 IEEE Global
Telecommunications Conference, GLOBECOM 2011, pp. 1–5. IEEE (2011)
6. Guan, J., Chu, Q., You, I.: The social relationship based adaptive multi-spray-
and-wait routing algorithm for disruption tolerant network. Mob. Inf. Syst. 2017
(2017)
7. Hui, P., Crowcroft, J., Yoneki, E.: BUBBLE RAP: social-based forwarding in delay-
tolerant networks. IEEE Trans. Mob. Comput. 10(11), 1576–1589 (2011)
8. Jain, S., Fall, K., Patra, R.: Routing in a delay tolerant network, vol. 34. ACM
(2004)
9. Keranen, A., Ott, J., Kerkkainen, T.: The one simulator for DTN protocol eval-
uation (2009). http://www.scipress.org/e-library/sof/pdf/0441.PDF. Accessed 22
Jan 2018
10. Leguay, J., Friedman, T., Conan, V.: DTN routing in a mobility pattern space. In:
Proceedings of the 2005 ACM SIGCOMM Workshop on Delay-Tolerant Network-
ing, pp. 276–283. ACM (2005)
11. Li, Y., Su, G., Wu, D.O., Jin, D., Li, S., Zeng, L.: The impact of node selfishness on
multicasting in delay tolerant networks. IEEE Trans. Veh. Technol. 60(5), 2224–
2238 (2011)
12. Lindgren, A., Doria, A., Schelén, O.: Probabilistic routing in intermittently con-
nected networks. In: ACM International Symposium on Mobile Ad Hoc Network-
ing and Computing, MobiHoc 2003, 01–03 June 2003 (2003)
13. Liu, C., Wu, J.: Routing in a cyclic mobispace. In: Proceedings of the 9th ACM
International Symposium on Mobile ad Hoc Networking and Computing, pp. 351–
360. ACM (2008)
14. Moreira, W., Mendes, P.: Social-aware opportunistic routing: the new trend. In:
Woungang, I., Dhurandher, S., Anpalagan, A., Vasilakos, A. (eds.) Routing in
Opportunistic Networks, pp. 27–68. Springer, New York (2013). https://doi.org/
10.1007/978-1-4614-3514-3 2
15. Moreira, W., Mendes, P., Sargento, S.: Assessment model for opportunistic routing.
In: 2011 IEEE Third Latin-American Conference on Communications, pp. 1–6.
IEEE (2011)
16. Moreira, W., Mendes, P., Sargento, S.: Opportunistic routing based on daily rou-
tines. In: 2012 IEEE International Symposium on a World of Wireless, Mobile and
Multimedia Networks (WoWMoM), pp. 1–6. IEEE (2012)
17. Pelusi, L., Passarella, A., Conti, M.: Beyond MANETs: dissertation on opportunis-
tic networking. IITCNR Technical report (2006)
18. Pelusi, L., Passarella, A., Conti, M.: Opportunistic networking: data forwarding
in disconnected mobile ad hoc networks. IEEE Commun. Mag. 44(11), 134–141
(2006)
19. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect
community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)
20. Sarafijanovic-Djukic, N., Pidrkowski, M., Grossglauser, M.: Island hopping: effi-
cient mobility-assisted forwarding in partitioned networks. In: 2006 3rd Annual
IEEE Communications Society on Sensor and Ad Hoc Communications and Net-
works, vol. 1, pp. 226–235. IEEE (2006)
21. Socievole, A., De Rango, F., Coscarella, C.: Routing approaches and performance
evaluation in delay tolerant networks. In: 2011 Wireless Telecommunications Sym-
posium (WTS), pp. 1–6. IEEE (2011)
22. Socievole, A., Yoneki, E., De Rango, F., Crowcroft, J.: ML-SOR: message routing
using multi-layer social networks in opportunistic communications. Comput. Netw.
81, 201–219 (2015)
23. Spaho, E., Bylykbashi, K., Barolli, L., Kolici, V., Lala, A.: Evaluation of differ-
ent DTN routing protocols in an opportunistic network considering many-to-one
communication scenario. In: 2016 19th International Conference on Network-Based
Information Systems (NBiS), pp. 64–69. IEEE (2016)
24. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Spray and wait: an efficient rout-
ing scheme for intermittently connected mobile networks. In: Proceedings of the
2005 ACM SIGCOMM Workshop on Delay-Tolerant Networking, pp. 252–259.
ACM (2005)
25. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Spray and focus: efficient
mobility-assisted routing for heterogeneous and correlated mobility. In: Fifth
Annual IEEE International Conference on Pervasive Computing and Communi-
cations Workshops (PerComW 2007), pp. 79–85. IEEE (2007)
26. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Efficient routing in intermit-
tently connected mobile networks: the multiple-copy case. IEEE/ACM Trans.
Netw. (ToN) 16(1), 77–90 (2008)
27. Vahdat, A., Becker, D., et al.: Epidemic routing for partially connected ad hoc
networks (2000)
28. Woungang, I., Dhurandher, S.K., Anpalagan, A., Vasilakos, A.V.: Routing in
Opportunistic Networks. Springer, New York (2013). https://doi.org/10.1007/978-
1-4614-3514-3
29. Xia, F., Liu, L., Li, J., Ma, J., Vasilakos, A.V.: Socially aware networking: a survey.
IEEE Syst. J. 9(3), 904–921 (2015)
Machine Learning Approach for Drone
Perception and Control
1 Introduction
Drones have significant potential in many practical applications, such as aerial
delivery systems, search and rescue operations, and monitoring. Utilizing machine
learning methods in the design and development of drones can make various
drone operations better and more efficient. Recent developments in computational
devices and the availability of data are enabling advances in the field of machine
learning, especially deep learning. Deep learning has been applied to visual
perception in previous work on self-driving cars [8] and on a drone navigating
a forest trail [1], while the dynamics and low-level commands of those systems
were modeled with classical approaches. Both studies focused on vision and
require a path to navigate. This study attempts to consider the dynamics,
low-level control, and perception in a forest environment without any path.
Since an artificial neural network (ANN) works as a universal function approxima-
tor [2–4], the mapping between the current state and the next state can be learned,
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 424–431, 2019.
https://doi.org/10.1007/978-3-030-20257-6_36
which is the dynamic model of the drone. Similarly, for autopilot behaviors, the
sensor data of different states are mapped to appropriate actions to achieve a
particular goal. In the presented work, the take-off flight of the drone in a simulated
environment is considered. To add visual perception for an obstacle detection
task, such as tree detection in a forest environment, the same approach was used
with a special type of artificial neural network, the deep convolutional neural
network (CNN). All models are examples of supervised learning, in which the
correct output for each input is provided to the model during training.
2 Method
Before the learning process, the correct (actual) data were given for each task,
and the different tasks, such as learning the dynamics or detecting trees, were
optimized independently.
Instead of writing the equations of drone motion, we let the NN model determine
the dynamic behavior of the drone. A 2D nonlinear regression model was developed
to predict the next state from the given current state and the action taken in
that state. The deterministic time-invariant dynamic system can be defined as
follows:
ẋ = f (x, u) (1)
where x is a state vector, u is the input to the system and f is a function of
states and inputs.
The NN model (Fig. 1) consists of linear units at the input layer l_in and the
output layer l_out, with a nonlinearity in the hidden layer l_hid, and parameters
θ = [θ0, θ1].
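The model just described (linear units at input and output, one nonlinear hidden layer, parameters θ = [θ0, θ1]) can be sketched as a plain NumPy forward pass. The tanh nonlinearity, the layer widths, and the random initialization below are illustrative assumptions, not the authors' exact choices:

```python
import numpy as np

def forward(theta, x, u):
    """Predict the next state from the current state x and action u.

    theta = [theta0, theta1]: hidden- and output-layer weight matrices
    (biases omitted for brevity).
    """
    theta0, theta1 = theta
    z = np.concatenate([x, u])       # current state and action form the input l_in
    h = np.tanh(theta0 @ z)          # nonlinear hidden layer l_hid
    return theta1 @ h                # linear output layer l_out: predicted next state

rng = np.random.default_rng(0)
state_dim, action_dim, hidden = 4, 2, 16
theta = [rng.normal(size=(hidden, state_dim + action_dim)),
         rng.normal(size=(state_dim, hidden))]
s_next = forward(theta, np.zeros(state_dim), np.ones(action_dim))
print(s_next.shape)  # one predicted next-state vector
```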
426 Y. S. Mandloi and Y. Inada
2.4 Learning
Here, learning is essentially a parameter optimization process: it finds the
weights of the NN that minimize the objective function. The objective function
for the model predictive network, J_mp, is defined as the mean squared error
between the predicted and actual states over a training dataset with T time
steps in total. Equations (9)–(10) define the optimization problem for the
dynamic model. For the parameter update, the gradient descent algorithm,
Eq. (11), with backpropagation [3] and batch normalization [6] was used, and
the optimal parameter θ* was found.
J_mp = (1/T) Σ_{t=1}^{T} ‖s_{t+1} − l_out‖²   (9)

θ* = argmin_θ J_mp   (10)

θ ← θ − α ∇_θ J_mp   (11)

J_p = (1/T) Σ_{t=1}^{T} ‖π_φ(a_t | s_t) − u_t‖²   (12)

φ* = argmin_φ J_p   (13)

φ ← φ − β ∇_φ J_p   (14)
Here α and β are learning-rate hyperparameters with values less than one.
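The optimization in Eqs. (9)–(11) is ordinary gradient descent on a mean-squared prediction error. A minimal sketch, with a simple linear model standing in for the full network and an assumed learning rate α = 0.05:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # current states and actions
W_true = rng.normal(size=(3, 2))
Y = X @ W_true                           # "actual" next states s_{t+1}

W = np.zeros((3, 2))                     # parameters theta
alpha = 0.05                             # learning rate (< 1), Eq. (11)
for _ in range(500):
    pred = X @ W                         # l_out, the predicted next state
    grad = 2.0 / len(X) * X.T @ (pred - Y)   # gradient of J_mp from Eq. (9)
    W -= alpha * grad                    # parameter update, Eq. (11)

# Final value of the objective J_mp; it should be driven close to zero.
J_mp = np.mean(np.sum((X @ W - Y) ** 2, axis=1))
print(J_mp)
```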
3 Results
Experiments were conducted for the take-off and hover flight of a drone in a
simulated environment. The NN models described in the previous sections were
built using Google's open-source deep learning framework TensorFlow [13]. All
calculations were done offline on a ground computer; however, the trained model
can be used on board. Take-off and hover flight were simulated in the ROS-Gazebo
environment [9,10]. The open-source autopilot firmware PX4 [11] was used to
generate thrust commands for taking off and hovering. State and control command
data were collected from the ground control station application QGroundControl [12].
The prediction for height is very close to the actual height except for the
initial time of flight. Predictions when testing on new state and action input
data showed similar behavior with some errors, as shown in Fig. 5.
Fig. 7. Neural network and autopilot thrust compared with desired thrust during testing
Although the pilot NN produced smoother thrust commands than the teacher
autopilot in the test case, the NN performed better on the training data than
on the test data, so further improvement is needed to obtain a more generalized
model.
4 Future Work
In the presented study, numerical simulation results were discussed for one-step
prediction of states and actions, with multiple-step prediction in scope for
future application of NNs. After the training process, all parameters of the NN
model are fixed, so only inference with the learned model needs to run on the
actual drone, and the control decision can be made on board. Experiments with
actual drone flights still need to be done. Further research will implement the
built vision model on the drone for experimentation while improving the current
models for better generalization. Extending the NN perception model to a larger
state space and applying machine learning to motion planning will be considered
in further study.
References
1. Zhilenkov, A.A., Epifantsev, I.R.: System of autonomous navigation of the drone
in difficult conditions of the forest trails. In: IEEE Conference of Russian Young
Researchers in Electrical and Electronic Engineering (2018). https://doi.org/10.
1109/EIConRus.2018.8317266
2. Gallant, S.I.: Perceptron-based learning algorithms. IEEE Trans. Neural Netw.
1(2), 179–191 (1990)
3. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323(6088), 533–536 (1986)
4. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural
Netw. 4(2), 251–257 (1991). https://doi.org/10.1016/0893-6080(91)90009-T
5. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Con-
trol Sig. Syst. 2(4), 303–314 (1989)
6. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016)
7. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile
vision applications. arXiv preprint arXiv:1704.04861 (2017)
8. Bojarski, M., et al.: End to end learning for self-driving cars. arXiv preprint
arXiv:1604.07316v1 (2016)
9. Gazebo. http://gazebosim.org/
10. ROS Documentation. http://wiki.ros.org/Documentation
11. PX4 Development. https://dev.px4.io/en
12. QGroundControl. http://qgroundcontrol.com/
13. Google Tensorflow API. https://www.tensorflow.org/
ML - DL Financial Modeling
A Deep Dense Neural Network
for Bankruptcy Prediction
1 Introduction
Bankruptcy is a vitally important problem with a variety of aspects that are
particularly relevant to an economic system. There are different types of
bankruptcy, and their consequences for business and society are many and
varied [2]. Banks are very interested in this problem: before approving a loan,
a creditor must take into account a number of parameters, such as the applicant's
age, consistency in payments, and proximity to other loans, and thereby predict
the probability of bankruptcy. Such decisions are very important for the
development of a company and, consequently, for the economy of a country. If
such decisions are taken very carefully, the economic system can be strengthened
and business growth can blossom.
Considering the financial crisis of recent years that has hit many economies
around the world, it is easy to understand the importance of timely and credible
bankruptcy forecasting. In addition, there is an urgent need for well-structured
risk management models, as well as for correcting the economic inconsistencies
of a bank's customers. Economic and financial stability, as well as the healthy
development of enterprises, appears directly related to the prevention of credit
risk and bankruptcies in a market, as noted in [11].
The two basic approaches for predicting loan default or bankruptcy are the
following: (a) structural approaches, in which interest rates and firm attributes
are examined to obtain the default probability, and (b) statistical approaches,
which derive the desired probability by mining the data. This paper aims to
accurately predict the possibility of bankruptcy using a deep dense neural
network.
The reader may find informative reviews of the methods used to predict
bankruptcy in [5] and [17]. In the first work, various standard statistical
methodologies applied to business failure are studied, while in the second both
statistical and intelligent methods are presented. Furthermore, in [4] and [24]
the reader can find more recent surveys. Despite the fact that a variety of
methods have been developed over the last ten years, there are aspects that need
further study. For this reason, there is ample room for new approaches that can
handle bankruptcy prediction more accurately. The aim of researchers is to
provide schemes that improve existing models and ensure security and stability
in business markets. The problem of bankruptcy prediction is being studied
intensely [26], and new techniques have been developed to tackle it [22].
This study provides a deep dense artificial neural network for bankruptcy
prediction. Based on the inherent ability of artificial neural networks to handle
difficult problems such as image recognition and speech recognition, we test the
performance of a Deep Dense Multilayer Perceptron on the bankruptcy prediction
task.
The rest of the paper is organized as follows. In the next section, a brief
presentation of related work is provided. In Sect. 3, we describe the datasets
used in our work. In Sect. 4, the proposed method is presented, and experimental
results and comparisons of our method with well-known algorithms are exhibited.
Finally, the paper ends with a short discussion and some remarks on future
research in Sect. 5.
2 Related Work
The problem of timely and valid bankruptcy prediction has attracted interest not
only from financial analysts and researchers in economic science but also from
researchers in the scientific area of Machine Learning. As already mentioned,
the methods developed to solve this problem over the last decade are many and
varied. Indicatively, we refer the reader to Artificial Neural Networks [9],
instance-based learners [1], Decision Trees [8], and Support Vector Machines
[28], among others.
An ensemble classifier scheme that combines well-known learners such as Decision
Trees, Back-Propagation Neural Networks, and Support Vector Machines was
proposed in [13] to predict bankruptcy by exploiting only the advantages of the
individual classifiers. This approach adopts the decision-making strategy of
financial institutions, where many experts are consulted before the final
decision is taken. Thus, everyone's opinion counts, and a more complete decision
is formed about whether credit or a loan should be granted, or whether a company
is at risk of bankruptcy. In particular, the proposed approach selectively
combines the expected probabilities given by each classifier, and the
experimental results showed better performance than a stacking ensemble using a
weighting or voting strategy.
A comparison between several prediction models, such as Artificial Neural
Networks, Decision Trees, Support Vector Machines, and Logistic Regression, for
tackling bankruptcy prediction was made in [21]. Taking into account the
experimental results obtained and the simplicity of Decision Trees, the authors
recommended these models, with the minimum support required for a rule, for the
bankruptcy prediction problem.
In [25], a meta-learning scheme inspired by the stacking methodology was
proposed. This approach combines two levels of classifiers to make a bankruptcy
prediction. In the first level, data preprocessing takes place in order to
filter noisy or unrepresentative training data. Thus, the classifiers in the
second level receive more representative training data, and the prediction is
more accurate. The experiments conducted by the authors showed that the proposed
method exceeds the stacked generalization method and also obtains better
prediction accuracy than neural networks, decision trees, and logistic
regression.
Another study based on ensemble methods as reliable predictive models for the
bankruptcy prediction problem was carried out in [27]. In particular, the
authors combined information-gain-based feature selection with the standard
Boosting procedure in order to reinforce the performance of the base learners.
The proposed FS-Boosting approach, compared with the well-known Bagging and
Boosting approaches, achieved promising results and the best average accuracy
on two of the considered bankruptcy datasets in every tested condition.
In [14], a recent study on the performance of semi-supervised methods for the
bankruptcy prediction task was conducted. The authors include in their study
well-known algorithms such as C4.5, k-Nearest Neighbors, and the Sequential
Minimal Optimization algorithm in semi-supervised settings. The experimental
results showed that the semi-supervised algorithms are really competitive with
the corresponding supervised algorithms.
Although the problem of accurate bankruptcy prediction is particularly important
for various financial institutions, studies based on probabilistic models are
not numerous. One such study [3] was conducted to address this issue.
Specifically, in this work, a Gaussian process classifier was applied and
compared with Support Vector Machines and the Logistic Regression approach.
Furthermore, an informative visualization of the conducted experiments
438 S.-A. N. Alexandropoulos et al.
was presented, so that the reader can easily understand the content of the
study. The experiments conducted showed that Gaussian processes can improve
classification performance and successfully deal with bankruptcy prediction.
Extensive research based on real-world datasets from American firms was
conducted in [6]. The authors tested well-known Machine Learning models, such
as Support Vector Machines, Bagging, Boosting, and Random Forests, against
Logistic Regression and Artificial Neural Networks. The key point of their
study was the use of six additional complementary financial indicators,
including the original Altman Z-score. This led to superior performance by the
Bagging, Boosting, and Random Forest models. Moreover, these models achieved
the highest accuracy relative to all the other methods.
An algorithm named TACD, based on the ant colony strategy, was proposed for
classifying bankrupt and non-bankrupt firms in [18]. The proposed model is
simple and easy to use. Moreover, this method handles continuous data, so data
discretization can be avoided. Experimental tests over three real-world
datasets, in comparison with several strategies, showed that the presented
method provides effective results.
Recently, the inherent difficulty that automated decision systems face in
producing accurate outcomes from natural language has been treated with deep
learning techniques [16]. Specifically, the authors tested the effectiveness of
deep neural networks on a very difficult problem: financial decision support.
In this research, traditional Machine Learning approaches were included, such
as Ridge regression, Random Forests, AdaBoost, and Gradient Boosting. In
addition, transfer learning techniques were tested, such as an RNN with
pre-training and an LSTM with both pre-training and word embeddings. The
results obtained showed that deep models give reliable and accurate outcomes
and in many cases are better than traditional bag-of-words models.
In [19], the importance of the feature selection process in building strong
prediction models is presented. In particular, the authors studied the
appropriate combination of feature selection technique and classification
method. Both filter and wrapper-based feature selection methods were studied,
while statistical and machine learning models were considered for the
classification process. Furthermore, two well-known ensemble techniques,
Bagging and Boosting, were used for comparisons. This work concluded that a
genetic algorithm, as a wrapper-based feature selection method, performs better
than filter-based methods. Moreover, the combination of a genetic algorithm
with naive Bayes and Support Vector Machine classifiers, without bagging and
boosting, achieves the best prediction error rates.
In the Greek context, Active Learning approaches for the bankruptcy prediction
problem were recently studied in [15]. For a more informative study of the
bankruptcy prediction problem, as well as bankruptcy prediction models, the
reader is referred to [7].
3 Data Description
The data we use come from the National Bank of Greece and from ICAP, a business
database containing companies' financial information. In particular, the
bankruptcy filings included in our study are related to the years 2003 and
2004. In addition, the financial statements for the years before bankruptcy
were taken from the ICAP database. The financial data cover a period of three
years, which we denote as follows: (a) the bankruptcy year is marked as year 0,
(b) the year before the failure is noted as year −1, while (c) three years
before is considered year −3. In order to build a good bankruptcy sample, we
include 50 bankruptcies in the final dataset. For each bankrupt firm, we
sampled two healthy firms with roughly the same characteristics. Thus, our
sample consists of 150 individual firms and 450 firm-year observations. Due to
missing financial values and ratio overlaps, the final number of input
variables was 21. Table 1 gives a brief description of the financial variables
included in the present research, and the characteristics of the bankrupt
firms are exhibited in Table 2.
In [20], deep learning models were applied to bankruptcy prediction, with the
predictions based on textual disclosures. The experimental results showed that
deep learning models provide a promising framework for predicting financial
outcomes.
Another deep learning technique was tested on the bankruptcy prediction task in
[12]. In particular, convolutional neural networks were applied, with a set of
financial ratios represented as a grayscale image on which the network was
trained and tested. The experimental results showed that the convolutional
neural network has higher performance than other traditional methods such as
Decision Trees or AdaBoost.
In our work, a Deep Dense Multilayer Perceptron (DDMP) is applied to the
bankruptcy prediction task. Neural networks with two hidden layers can
represent functions of any shape and, in general, there is no theoretical
reason to use more than two hidden layers for simple data sets.
Specifically, we use an artificial neural network with two hidden layers.
Deciding the number of neurons in the hidden layers is a very important issue
in neural network architecture. Despite the fact that these layers do not
directly interact with the external environment, they have a tremendous
influence on the final output, so the number of neurons in each hidden layer
must be carefully considered. Using too many neurons in the hidden layers can
cause various problems. Firstly, it may result in overfitting: overfitting
occurs when the neural network has so much information processing capacity
that the limited amount of information contained in the training set is not
enough to train all the neurons in the hidden layers. A second problem may
occur even when the training data are sufficient: an inordinately large number
of neurons in the hidden layers can increase the training time of the network.
To preserve the network's ability to generalize, the number of neurons must be
kept as low as possible. With a large excess of neurons, the network becomes a
memory bank that can recall the training set to perfection but does not perform
well on samples that were not in the training set.
In the first hidden layer, we used [2/3] of the number of input attributes as
the number of neurons, with the ReLU activation function. In the second hidden
layer, [1/3] of the number of input attributes was used as the number of
neurons, again with the ReLU activation function. Moreover, the dropout
technique (10%) was applied, and the LOSSBinaryXENT (binary cross-entropy)
loss function was used.
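As a hedged illustration, this architecture (21 inputs, ReLU hidden layers of [2/3] of 21 = 14 and [1/3] of 21 = 7 neurons, 10% dropout, and a sigmoid output to which a binary cross-entropy loss such as LOSSBinaryXENT would be applied) can be sketched as a NumPy forward pass. The weight scale and the inverted-dropout formulation are our assumptions, not details stated in the paper:

```python
import numpy as np

def ddmp_forward(x, params, train=False, drop=0.10, rng=None):
    """Forward pass of the described deep dense MLP:
    two ReLU hidden layers and a sigmoid output for the binary decision."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.maximum(0.0, x @ W1 + b1)            # first hidden layer, ReLU
    h2 = np.maximum(0.0, h1 @ W2 + b2)           # second hidden layer, ReLU
    if train:                                    # inverted dropout during training
        h2 *= (rng.random(h2.shape) >= drop) / (1 - drop)
    logit = h2 @ W3 + b3
    return 1.0 / (1.0 + np.exp(-logit))          # bankruptcy probability

n_in = 21                                        # 21 financial input variables
n1, n2 = (2 * n_in) // 3, n_in // 3              # 14 and 7 hidden neurons
rng = np.random.default_rng(0)
params = [(rng.normal(scale=0.1, size=(a, b)), np.zeros(b))
          for a, b in [(n_in, n1), (n1, n2), (n2, 1)]]
p = ddmp_forward(rng.normal(size=(5, n_in)), params)
print(p.shape)  # one probability per firm in the batch
```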
Dropout is a regularization approach in neural networks that helps reduce
interdependent learning among the neurons. In the training phase, for each
hidden layer, each training sample, and each iteration, the dropout procedure
ignores a random fraction of the nodes (and the corresponding activations).
Dropout thus forces the neural network to learn more robust features that are
useful in conjunction with many different random subsets of the other neurons.
Accuracy of the compared models (CART, NB, LR, MP) and the proposed DDMP:

                 CART   NB     LR     MP     DDMP
2 years before   0.532  0.579  0.586  0.584  0.627
1 year before    0.539  0.588  0.643  0.605  0.664
Last year        0.671  0.647  0.646  0.648  0.732
References
1. Ahn, H., Kim, K.: Bankruptcy prediction modeling with hybrid case-based reason-
ing and genetic algorithms approach. Appl. Soft Comput. 9(2), 599–607 (2009)
2. Altman, E.I., Hotchkiss, E.: Corporate Financial Distress and Bankruptcy: Predict
and Avoid Bankruptcy, Analyze and Invest in Distressed Debt, vol. 289. Wiley,
Hoboken (2010)
3. Antunes, F., Ribeiro, B., Pereira, F.: Probabilistic modeling and visualization for
bankruptcy prediction. Appl. Soft Comput. 60, 831–843 (2017)
4. Appiah, K.O., Chizema, A., Arthur, J.: Predicting corporate failure: a system-
atic literature review of methodological issues. Int. J. Law Manag. 57(5), 461–485
(2015)
5. Balcaen, S., Ooghe, H.: 35 years of studies on business failure: an overview of
the classic statistical methodologies and their related problems. Br. Account. Rev.
38(1), 63–93 (2006)
6. Barboza, F., Kimura, H., Altman, E.: Machine learning models and bankruptcy
prediction. Expert Syst. Appl. 83, 405–417 (2017)
7. Chaudhuri, A., Ghosh, S.K.: Bankruptcy Prediction Through Soft Computing
Based Deep Learning Technique. Springer, Heidelberg (2017). https://doi.org/10.
1007/978-981-10-6683-2
8. Cho, S., Hong, H., Ha, B.C.: A hybrid approach based on the combination of vari-
able selection using decision trees and case-based reasoning using the Mahalanobis
distance for bankruptcy prediction. Expert Syst. Appl. 37(4), 3482–3488 (2010)
9. Cho, S., Kim, J., Bae, J.K.: An integrative model with subject weight based on
neural network learning for bankruptcy prediction. Expert Syst. Appl. 36(1), 403–
410 (2009)
10. Chollet, F., et al.: Keras (2015). https://keras.io
11. Erdogan, B.E.: Long-term examination of bank crashes using panel logistic regres-
sion: Turkish banks failure case. Int. J. Stat. Probab. 5(3), 42 (2016)
12. Hosaka, T.: Bankruptcy prediction using imaged financial ratios and convolutional
neural networks. Expert Syst. Appl. 117, 287–299 (2019)
13. Hung, C., Chen, J.H.: A selective ensemble based on expected probabilities for
bankruptcy prediction. Expert Syst. Appl. 36(3), 5297–5303 (2009)
14. Karlos, S., Kotsiantis, S., Fazakis, N., Sgarbas, K.: Effectiveness of semi-supervised
learning in bankruptcy prediction. In: 2016 7th International Conference on Infor-
mation, Intelligence, Systems and Applications (IISA), pp. 1–6. IEEE (2016)
15. Kostopoulos, G., Karlos, S., Kotsiantis, S., Tampakas, V.: Evaluating active learn-
ing methods for bankruptcy prediction. In: Frasson, C., Kostopoulos, G. (eds.)
Brain Function Assessment in Learning. LNCS (LNAI), vol. 10512, pp. 57–66.
Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67615-9_5
16. Kraus, M., Feuerriegel, S.: Decision support from financial disclosures with deep
neural networks and transfer learning. Decis. Support Syst. 104, 38–48 (2017)
17. Kumar, P.R., Ravi, V.: Bankruptcy prediction in banks and firms via statistical
and intelligent techniques–a review. Eur. J. Oper. Res. 180(1), 1–28 (2007)
18. Lalbakhsh, P., Chen, Y.P.P.: TACD: a transportable ant colony discrimination
model for corporate bankruptcy prediction. Enterp. Inf. Syst. 11(5), 758–785
(2017)
19. Lin, W.C., Lu, Y.H., Tsai, C.F.: Feature selection in single and ensemble learning-
based bankruptcy prediction models. Expert Syst. 36, e12335 (2018)
20. Mai, F., Tian, S., Lee, C., Ma, L.: Deep learning models for bankruptcy prediction
using textual disclosures. Eur. J. Oper. Res. 274(2), 743–758 (2019)
21. Olson, D.L., Delen, D., Meng, Y.: Comparative analysis of data mining methods
for bankruptcy prediction. Decis. Support Syst. 52(2), 464–473 (2012)
22. Onan, A., et al.: A clustering based classifier ensemble approach to corporate
bankruptcy prediction. Alphanumeric J. 6(2), 365–376 (2018)
23. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
24. Pereira, V.S., Martins, V.F.: Estudos de previsão de falências-uma revisão das
publicações internacionais e brasileiras de 1930 a 2015. Revista Contemporânea de
Contabilidade 12(26), 163–196 (2015)
25. Tsai, C.F., Hsu, Y.F.: A meta-learning framework for bankruptcy prediction. J.
Forecast. 32(2), 167–179 (2013)
26. Tseng, F.M., Hu, Y.C.: Comparing four bankruptcy prediction models: logit,
quadratic interval logit, neural and fuzzy neural networks. Expert Syst. Appl.
37(3), 1846–1853 (2010)
27. Wang, G., Ma, J., Yang, S.: An improved boosting based on feature selection for
corporate bankruptcy prediction. Expert Syst. Appl. 41(5), 2353–2361 (2014)
28. Yang, Z., You, W., Ji, G.: Using partial least squares and support vector machines
for bankruptcy prediction. Expert Syst. Appl. 38(7), 8336–8342 (2011)
Stock Price Movements Classification
Using Machine and Deep Learning
Techniques-The Case Study of Indian
Stock Market
1 Introduction
Generally, predicting financial time series movements is a difficult task
because stock data are unstable, noisy, and nonlinear. Variations in economic
policy, macroeconomic data, political uncertainty, and government policy affect
the direction of the stock market; these factors are reflected in stock prices,
making the market fluctuating and volatile. Classification, regression, and
pattern recognition problems have been solved using Artificial Neural Networks
(ANN) over the years. Stock market data is
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 445–452, 2019.
https://doi.org/10.1007/978-3-030-20257-6_38
446 N. Naik and B. R. Mohan
time series data that is highly volatile during day trading and contains
tremendous noise, and its structure is complex due to high dimensionality.
Therefore, to make accurate decisions in stock markets, professional traders
have used fundamental analysis, technical analysis, and artificial intelligence
methods. Artificial intelligence techniques are widely used for predicting
nonlinear, noisy, and chaotic data. In the past, most studies considered data
mining methods and Neural Networks (NN). Most of the existing NN work was
limited in learning large amounts of nonlinear, complex stock data, and
extracting features from such large amounts of data is a difficult task.
The contribution of this study can be summarized as follows. The first is
technical indicator feature selection: identifying the relevant technical
indicators using the Boruta feature selection technique. The second is an
accurate prediction model for stock prices.
2 Related Work
Zhong et al. [18] studied data mining methods for forecasting stock prices on a
daily basis. The study considered various financial and economic features, and
the feature dimensionality was reduced using fuzzy robust principal component
analysis and KPCA. Stock data are noisy and nonlinear, so reducing the noise
can be effective when constructing a forecasting model. To accomplish this
task, an integration of PCA and SVR has been proposed: in the first step, a set
of technical indicators is calculated from the daily transaction data of the
target stock, and PCA is then applied to these values to extract the principal
components. After filtering the principal components, a model is finally
constructed to forecast the future price of the target stocks [7]. Three
feature selection techniques have been discussed for forecasting stock prices,
namely PCA, genetic algorithms, and decision trees [15]. In most of the
literature, PCA is applied for data representation and transformation. However,
PCA linearly transforms high-dimensional data into new low-dimensional data;
therefore, the KPCA method has been proposed to handle nonlinear data using
appropriate kernel parameters [5]. Nahil et al. [12] introduced Kernel
Principal Component Analysis (KPCA) to reduce the dimensionality of the
technical indicator features.
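As a small illustration of why a kernel is needed, scikit-learn's KernelPCA can project nonlinearly structured data that linear PCA cannot untangle. The dataset and the gamma value here are arbitrary demonstration choices, not the setup of [5] or [12]:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a nonlinear structure that linear PCA cannot separate.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel; gamma is the "appropriate kernel parameter" that must be tuned.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # the data projected onto two kernel principal components
```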
Moving average convergence divergence and the exponential moving average are
stock technical indicators that have been studied to identify short-term
movements of stock prices [1]. Chourmouziadis et al. [6] addressed the problem
of bull and bear market trends using fuzzy logic. Lin et al. [10] proposed PCA
to reduce and filter the noise in the data; however, in most studies the
improvement in prediction accuracy from PCA is very small. Deep learning is
extensively used in medical image classification, big data analysis, electronic
health record analysis, Parkinson's disease diagnosis, and so on [14].
3 Data Specification
In this paper, stock data are collected from http://www.nseindia.com. The data contain information about each stock such as the day open, day low, day high and day close prices. We have considered banking-sector stocks, namely ICICI Bank, Yes Bank, Kotak Bank and SBI. The dataset covers the years 2009 to 2018. For each stock, on a closing basis, we have assigned a class tag of up or down by comparing the stock's current price with its previous price.
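The up/down tagging described above can be sketched as follows (a minimal Python illustration with hypothetical closing prices; the function name is ours, not from the paper):

```python
import numpy as np

def label_movements(close_prices):
    """Tag each day 'up' or 'down' by comparing its close with the
    previous day's close; the first day has no previous price."""
    close = np.asarray(close_prices, dtype=float)
    return ["up" if close[t] > close[t - 1] else "down"
            for t in range(1, len(close))]

# Hypothetical closing prices for one stock
labels = label_movements([100.0, 101.5, 99.8, 102.3])
print(labels)  # ['up', 'down', 'up']
```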
4 Proposed Work
The flow of the proposed model is described in Fig. 1. The data are retrieved from the NSE. The study considered 33 different technical indicators, computed using the formulas in [9], which are described in Table 1.
448 N. Naik and B. R. Mohan
The proposed task is carried out using two approaches. The first is the Boruta feature selection method, which is used to select the important technical indicator features. This method creates duplicate (shadow) copies of the input features to randomize the dataset. Random shuffling of the data removes their correlations with the outcome variable. A random forest algorithm is then applied to find the important technical indicator features based on higher mean importance values (Z). In this algorithm, we have considered a Z-score threshold of 0.80: if a technical indicator feature has a Z score greater than 0.80, it is retained for classification. The step-by-step Boruta feature selection procedure is stated in Algorithm 1. We have carried out this task using the Boruta package in R. The second task is an accurate prediction model: feature selection is performed on the technical indicators using the Boruta algorithm, and the selected technical indicator features are given as input to the prediction model.
Algorithm 1
1: Input the 33 technical indicator features F.
2: Create duplicate (shadow) copies D of the technical indicator features.
3: Randomly shuffle the original technical indicators F and the duplicate copies D to remove their correlations with the outcome variable.
4: Apply a random forest algorithm to find the important technical indicator features based on their mean importance values.
5: Calculate the Z score as mean/standard deviation.
6: Find the maximum Z score among the duplicate (shadow) technical indicator features.
7: Remove a technical indicator feature if its Z score is less than this maximum.
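Algorithm 1 can be sketched in Python as a single-pass illustration of the shadow-feature idea (the paper uses the Boruta package in R; the synthetic data, the scikit-learn random forest and the single-iteration simplification of the Z-score test are our assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the technical indicator matrix F (step 1)
n, p = 300, 6
F = rng.normal(size=(n, p))
y = (F[:, 0] + 0.5 * F[:, 1] > 0).astype(int)  # only features 0 and 1 matter

# Steps 2-3: shadow copies D, each column independently shuffled
D = rng.permuted(F, axis=0)

# Step 4: random forest importances over real + shadow features
X = np.hstack([F, D])
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = forest.feature_importances_

# Steps 5-7 (simplified): keep a real feature only if its importance
# beats the best shadow feature
shadow_max = imp[p:].max()
selected = [j for j in range(p) if imp[j] > shadow_max]
print(selected)  # the informative features should survive
```

The real Boruta procedure repeats this comparison over many iterations and applies a statistical test to the hit counts; this sketch shows one pass only.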
suggested that these methods achieve the highest accuracy compared with other machine learning algorithms. Kara et al. [9] proposed a framework for stock prediction that used a three-layer artificial neural network. In our proposed work, deep learning in H2O is implemented. Feature selection is performed on the technical indicators using the Boruta algorithm, and the selected technical indicator features are given as input to the deep learning model. The deep learning model is used to classify up and down stock price movements and is described in Fig. 2. It has five layers of interconnected neuron units through which the data are transformed. The input layer neurons represent the technical indicator features, denoted by ti, and Wi denotes the weights of the neurons. Stochastic gradient descent with back-propagation has been used to adjust the weights. A bias input is given to each layer except the output layer of the model. The objective function L(W, Bias|j) aims to reduce the classification error on the data.
The weighted combination of the inputs is given in Eq. 1:

\alpha = \sum_{i=1}^{n} W_i t_i + \mathrm{Bias} \qquad (1)
The activation functions used are tanh and rectified linear units (ReLU). The model supports a regularization function to avoid overfitting, as shown in Eq. 2.
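Equation 1 followed by the tanh or ReLU activation can be sketched as follows (a minimal numpy illustration; the weights, bias and indicator values are hypothetical):

```python
import numpy as np

def layer_forward(t, W, bias, activation="relu"):
    """Weighted combination of the inputs (Eq. 1) followed by one of
    the two activations used in the paper: tanh or ReLU."""
    alpha = np.dot(W, t) + bias          # alpha = sum_i W_i t_i + Bias
    if activation == "tanh":
        return np.tanh(alpha)
    return np.maximum(alpha, 0.0)        # rectified linear unit

# Hypothetical indicator values t_i and a 2-unit layer
t = np.array([0.2, -0.5, 1.0])
W = np.array([[0.4, 0.1, -0.2], [0.3, 0.3, 0.3]])
out = layer_forward(t, W, bias=0.1, activation="tanh")
print(out.shape)  # (2,)
```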
6 Conclusion
Stock market prediction is a difficult task for stock fund managers and financial analysts because stock data are unstable, noisy and nonlinear. This paper focused on classifying stock price movements on a daily basis. We conclude that Boruta feature selection is a useful method for identifying relevant technical indicators. The study also demonstrated that the deep learning model performs better than the machine learning techniques. The contributions of this study can be summarized as follows. The first is technical indicator feature selection and the identification of the relevant technical indicators using the Boruta feature selection technique. The second is an accurate prediction model for stocks. The stock data were collected from the National Stock Exchange (NSE), India.
References
1. Anbalagan, T., Maheswari, S.U.: Classification and prediction of stock market
index based on fuzzy metagraph. Procedia Comput. Sci. 47, 214–221 (2015)
2. Anish, C.M., Majhi, B.: Hybrid nonlinear adaptive scheme for stock market prediction using feedback FLANN and factor analysis. J. Korean Stat. Soc. 45(1), 64–76 (2016)
3. Barak, S., Modarres, M.: Developing an approach to evaluate stocks by forecasting
effective features with data mining methods. Expert Syst. Appl. 42(3), 1325–1339
(2015)
4. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal mar-
gin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational
Learning Theory, pp. 144–152. ACM (1992)
5. Cao, L.J., Chua, K.S., Chong, W.K., Lee, H.P., Gu, Q.M.: A comparison of PCA,
KPCA and ICA for dimensionality reduction in support vector machine. Neuro-
computing 55(1–2), 321–336 (2003)
1 Introduction
Stock market prediction is a difficult task for stock fund managers and financial analysts because stock data are unstable, noisy and nonlinear. Variations in economic conditions, macroeconomic data, political uncertainty and government policy affect the direction of the stock market. This impact may be reflected in stock prices, and the market may be volatile. For a day trader to gain more profit, it is important to know how to identify quality stocks for intraday trading. Most traders are unable to profit because they fail to select appropriate stocks to trade during the day. Hence there is a need for a short-term daily trading framework to predict stock prices. This will help investors and traders to profit from day trades. In this paper, we propose a recurrent neural network (RNN) with long short-term memory (LSTM) to forecast future stock returns.
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 453–459, 2019.
https://doi.org/10.1007/978-3-030-20257-6_39
2 Related Work
Existing trading rules are not profitable in future periods when market conditions change dynamically. Chourmouziadis and Chatzoglou [2] proposed a short-term technical trading strategy based on the daily price of the stock using fuzzy systems.
An automatic way of buying and selling financial securities without the help of portfolio managers has been discussed. A combination of technical trading indicators such as the moving average, alpha, beta and the volatility of the stock over a period of time has been proposed [1]. Nakano et al. [7] proposed a method in which non-linear financial time-series data are considered and machine learning techniques are used for predicting stock prices. Mousavi et al. [6] proposed a generalized exponential moving average technical indicator model to predict stock prices. The future performance of stock indices has been studied using fuzzy time-series modeling [11]. Return and risk are important objectives in managing a portfolio. Macedo et al. [5] proposed a model to enhance technical trading rule indicators based on moving average convergence/divergence, the relative strength index, Bollinger Bands and contrarian Bollinger Bands. Artificial neural networks (ANN) have been widely used in stock prediction for financial markets. Zhang and Wu [16] proposed a back-propagation ANN to predict stock prices and indices.
Study of Stock Return Predictions using RNN with LSTM 455
Technical indicators such as the moving average, moving average convergence and divergence, relative strength index and commodity channel index have been used to predict stock prices [14]. Performing feature extraction can help to reduce redundant features, which reduces the measurement and storage requirements and the running time of classifiers. It also avoids the curse of dimensionality and improves prediction performance, as well as facilitating data visualization and understanding [13]. Ticknor [12] proposed an artificial neural network approach to improve prediction performance. Preis et al. [9] hypothesized that investors may use the ratio of Google hits for certain pages to inform decisions when predicting stock prices. Macroeconomic factors are believed to influence stock market movements. Machine learning methods, which are data-driven and assumption-free, have become more popular in stock market prediction [15].
3 Data Specification
In this paper, stock data are collected from http://www.nseindia.com. The data contain information about each stock such as the day open, day low, day high and day close prices. We have considered CIPLA, ITC, TCS and ONGC stocks and the Nifty index for the experiment. The dataset covers the years 2009 to 2018.
4 Proposed Work
RNN with LSTM. The proposed model takes stock returns from the recent past as input and predicts the stock returns for the next 24 hours. The existing literature suggests that a plain RNN is not able to hold long-term dependencies in stock returns [4,8]. Therefore, LSTM is proposed to capture long-term dependencies in stock returns. The LSTM is organized as cells; each cell has an internal state variable that passes information from one cell to the next. The sigmoid layer of the forget gate takes the previous output at time t − 1 and the present input at time t and performs a concatenation operation. The output of this layer lies between 0 and 1, as shown in Fig. 2. If f_t = 0, the internal state variable is completely forgotten; if f_t = 1, it is passed from one cell to the next.
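The forget-gate behaviour described above can be sketched as follows (a minimal numpy illustration; the weight shapes and the zero initialization are assumptions chosen to make the output predictable):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Concatenate the previous output h_{t-1} with the present input
    x_t, then squash through a sigmoid so that f_t in (0, 1) decides
    how much of the previous cell state is kept."""
    z = np.concatenate([h_prev, x_t])
    return sigmoid(W_f @ z + b_f)

# Hypothetical sizes: hidden state of 2 units, scalar return input
h_prev = np.array([0.1, -0.3])
x_t = np.array([0.05])
W_f = np.zeros((2, 3))   # rows: hidden units; cols: [h_prev, x_t]
b_f = np.zeros(2)
f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t)  # with zero weights, sigmoid(0) = 0.5: the gate half-forgets
```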
Fig. 2. RNN with LSTM framework for stock return forecasting [4, 8].
The forget gate, input gate and output are given in Eqs. 2, 3, 4, 5, 6 and 7:

h_t = O_t \cdot \tanh(C_t) \qquad (7)
The NSE datasets consist of 2400 rows; we have split the data into 70% for training and 30% for validation. The RNN model is built with three layers, and we have used the rectifier unit as the activation function. The experiment was carried out on the RStudio platform. A decreasing training loss together with an increasing validation loss suggests that the model is overfitting, as described in Fig. 3. Therefore, we added a dropout function to the RNN layers, which avoids the overfitting problem, as described in Fig. 4.
Fig. 3. Training and validation loss per epoch, showing overfitting.
MAE and RMSE are used to evaluate the performance of the prediction model and are described in Eqs. 8 and 9. The proposed model outperforms a feed-forward artificial neural network (ANN), as shown in Table 2.
\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |e_t| \qquad (8)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2} \qquad (9)
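Equations 8 and 9 can be computed as follows (a small Python sketch with hypothetical residuals e_t):

```python
import math

def mae(errors):
    # Eq. 8: mean absolute error over the n residuals e_t
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # Eq. 9: root mean squared error over the same residuals
    return math.sqrt(sum(e * e for e in errors) / len(errors))

errors = [1.0, -2.0, 2.0]          # hypothetical residuals e_t
print(mae(errors), rmse(errors))   # 1.666..., 1.732...
```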
Fig. 4. Training and validation loss per epoch after adding dropout.
6 Conclusion
Forecasting stock price movements is a challenging task for day traders who wish to earn higher returns. The recurrent neural network with LSTM is a state-of-the-art method for sequence learning, yet it is less commonly applied to stock return prediction. First, a recurrent neural network with LSTM was studied to forecast future stock returns. Second, recurrent dropout was applied in the RNN layers to avoid overfitting the model. Future work includes time-series forecasting of stock prices by combining technical and fundamental analysis of stocks.
References
1. Berutich, J.M., López, F., Luna, F., Quintana, D.: Robust technical trading strate-
gies using GP for algorithmic portfolio selection. Expert Syst. Appl. 46, 307–315
(2016)
2. Chourmouziadis, K., Chatzoglou, P.D.: An intelligent short term stock trading
fuzzy system for assisting investors in portfolio management. Expert Syst. Appl.
43, 298–311 (2016)
3. Enke, D., Mehdiyev, N.: Stock market prediction using a combination of step-
wise regression analysis, differential evolution-based fuzzy clustering, and a fuzzy
inference neural network. Intell. Autom. Soft Comput. 19(4), 636–648 (2013)
4. Graves, A.: Generating sequences with recurrent neural networks. arxiv preprint
arxiv: 1308.0850 (2013)
5. Macedo, L.L., Godinho, P., Alves, M.J.: Mean-semivariance portfolio optimization
with multiobjective evolutionary algorithms and technical analysis rules. Expert
Syst. Appl. 79, 33–43 (2017)
6. Mousavi, S., Esfahanipour, A., Zarandi, M.H.F.: A novel approach to dynamic
portfolio trading system using multitree genetic programming. Knowl.-Based Syst.
66, 68–81 (2014)
7. Nakano, M., Takahashi, A., Takahashi, S.: Generalized exponential moving average
(EMA) model with particle filtering and anomaly detection. Expert Syst. Appl.
73, 187–200 (2017)
8. Olah, C.: Understanding LSTM networks (2015)
9. Preis, T., Moat, H.S., Stanley, H.E.: Quantifying trading behavior in financial markets using Google Trends. Sci. Rep. 3, 1684 (2013)
10. Qiu, M., Song, Y., Akagi, F.: Application of artificial neural network for the pre-
diction of stock market returns: the case of the Japanese stock market. Chaos,
Solitons Fractals 85, 1–7 (2016)
11. Rubio, A., Bermúdez, J.D., Vercher, E.: Improving stock index forecasts by using a
new weighted fuzzy-trend time series method. Expert Syst. Appl. 76, 12–20 (2017)
12. Ticknor, J.L.: A Bayesian regularized artificial neural network for stock market forecasting. Expert Syst. Appl. 40(14), 5501–5506 (2013)
13. Tsai, C.-F., Hsiao, Y.-C.: Combining multiple feature selection methods for stock
prediction: union, intersection, and multi-intersection approaches. Decis. Support
Syst. 50(1), 258–269 (2010)
14. Tsai, C.-F., Lin, Y.-C., Yen, D.C., Chen, Y.-M.: Predicting stock returns by clas-
sifier ensembles. Appl. Soft Comput. 11(2), 2452–2459 (2011)
15. Vaisla, K.S., Bhatt, A.K.: An analysis of the performance of artificial neural net-
work technique for stock market forecasting. Int. J. Comput. Sci. Eng. 2(6), 2104–
2109 (2010)
16. Zhang, Y., Wu, L.: Stock market prediction of S&P 500 via combination of improved BCO approach and BP neural network. Expert Syst. Appl. 36(5), 8849–8854 (2009)
17. Zhong, X., Enke, D.: Forecasting daily stock market return using dimensionality
reduction. Expert Syst. Appl. 67, 126–139 (2017)
Security - Anomaly Detection
Comparison of Network Intrusion
Detection Performance Using Feature
Representation
1 Introduction
This research was supported by the Regional Government of Castilla y León and the
European Regional Development Fund under project LE045P17.
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 463–475, 2019.
https://doi.org/10.1007/978-3-030-20257-6_40
464 D. Pérez et al.
two types: external, where unauthorized users try to gain access to the system, and internal, which are more frequent, where users with different permission roles may access resources of the system. Moreover, different situations can be labelled as intrusions, ranging from worms that try to propagate through the network without authorization to denial-of-service (DoS) attacks, which focus on disrupting the resources of a system on a network. Intrusion detection systems (IDS) are devices that monitor a network in order to find any malicious activity. They are commonly classified into different types: host-based IDS (HIDS), which analyse the internals of an individual system, and network-based IDS (NIDS), which monitor traffic between the devices of a network looking for suspicious patterns [3]. The techniques for intrusion detection include misuse-based approaches, which look for known malicious activity mostly using signatures, and anomaly-based approaches, which consider any intrusive action an anomaly and can potentially detect unknown intrusions. Although both techniques have been extensively studied [5], misuse detectors are much more commonly deployed in real systems.
Anomaly detection methods attempt to estimate a model of the normal behaviour of the data according to a specific criterion and to find patterns that deviate from the resulting model [6]. These methods have been extensively applied to network intrusion detection [1,3,5]. However, their application in real scenarios has traditionally been unusual because it involves dealing with several issues [25]. For instance, network traffic presents large variability, so anomalous behaviour can sometimes be related to performance, and high false-positive rates involve evaluating potential alarms that are actually normal situations.
Besides, there are other aspects, such as labelling or scaling the data, that improve the success of these techniques. In addition, feature selection helps to reduce complexity and aids data interpretation. Although different strategies exist, representation learning and deep learning have provided enormous advances in several areas [2] such as computer vision and natural language processing.
In this work, a comparison of anomaly detection tasks is made using a feature representation of the data for network intrusion detection. For that purpose, different anomaly detection methods are compared in order to evaluate how the feature transformation affects them, using four recent network datasets that reflect real situations. The paper is organized as follows: in Sect. 2, different anomaly detection approaches are described and some examples of their application to intrusion detection are mentioned; in Sect. 3, the method used is illustrated; in Sect. 4, the datasets, the configuration of the experiments and the results are discussed; and, finally, the conclusions are summarized in Sect. 5.
2 Related Work
Comparison of Network Intrusion Detection Performance 465
There have been numerous efforts to survey the available techniques for implementing anomaly detection tasks [6], specifically applied to network intrusion detection [1,3,5]. Common approaches can be grouped into categories depending on how the method detects outliers. Next, they are briefly reviewed and some related works on network intrusion detection are mentioned.
Generally, statistical-based approaches assume that normal data points are generated from a Gaussian distribution. The estimation of its parameters can be sensitive to outliers, so robust estimators were proposed, like the minimum covariance determinant [22]. Examples that use statistical approaches for intrusion detection systems include HIDE [29], which uses statistical modelling along with neural network classifiers, and PAYL [28], which computes statistical parameters of the application payload, estimated from normal behaviour using a 1-gram model and then evaluated in terms of Mahalanobis distances. Another strategy to detect anomalies is based on distance-based approaches, considering a data instance with N features as an N-dimensional vector. For instance, the One-Class Support Vector Machine (OC-SVM) constructs a hyperplane that aims to separate the normal instances from the anomalous ones with a maximum margin. Moreover, clustering methods [10] like K-means can also be used, where the anomaly score is evaluated using the distance between new data points and the computed centroids. Related examples for intrusion detection include Khan et al. [12], who proposed a combination of hierarchical clustering with SVM classification, and Muda et al. [20], who initially compute K-means cluster centroids and then apply Naive Bayes classification in a final stage to distinguish between five different classes. Other proposals use ensemble-based methods, such as bootstrap aggregation (bagging) or boosting, which combine the individual results of multiple classifiers to reach a final decision. Similarly, Isolation Forest [15] creates an ensemble of decision trees that isolates anomalous instances.
Since the performance of machine learning methods is generally affected by the number of data dimensions, there are algorithms to select and transform data features, providing another representation of the data. On the one hand, irrelevant features can be eliminated to remove information redundancy and improve accuracy. Several feature selection methods have been proposed in the intrusion detection domain [7]. Some algorithms use an optimization criterion (wrapper), others score features independently of any model (filter), and hybrid methods try to combine both approaches for better performance. On the other hand, feature transformation algorithms estimate a latent space that provides a new representation of the data. Dimensionality reduction techniques can be used for that purpose, like Principal Component Analysis (PCA), which linearly computes the principal components with the largest variance. The use of autoencoders as a dimensionality reduction tool was proposed in [9]; their low-dimensional representation can improve the performance of different tasks. Although there are other dimensionality reduction techniques, for instance those based on neighbour embeddings or spectral methods [14], the active recent research in deep learning has generated increasing interest in approaches related to representation learning [2].
There are similar works that propose approaches related to neural networks, deep learning and anomaly detection. Previous examples include the combination of a deep belief network with a linear one-class SVM [8] for unsupervised
3 Proposed Method
In this work, a feature learning stage is combined with well-known anomaly detection methods to detect network intrusions. Taking into account the complexity of the application area, real traffic data should be considered in order to provide more realistic scenarios for the analysis. The variables included in the data are usually essential features such as the protocol, service, flags, bytes between source and destination, or their IP addresses and, in some cases, additional ones including statistical or aggregation measures like the sum of connections or mean values. In this case, an intrusion is considered an individually labelled point in the data, which is a simplification of the consequences provoked by a network attack.
A preparation stage for preprocessing the data should be performed. In that stage, categorical attributes like the protocol type are transformed into numeric values, and normalization scales the features so that they lie within similar ranges of values. Besides, variables with a few unique values can be transformed into binary values using one-hot encoding. Finally, the data are split into train, validation and test sets.
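The preparation stage might look as follows (a scikit-learn sketch; the toy records and column layout are our assumptions, not the actual datasets):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy connection records: a categorical protocol column plus
# two numeric columns (bytes, connection count)
protocols = np.array([["tcp"], ["udp"], ["tcp"], ["icmp"]])
numeric = np.array([[100.0, 2.0], [50.0, 1.0], [400.0, 8.0], [10.0, 1.0]])

# Categorical attribute -> one-hot binary columns
enc = OneHotEncoder().fit(protocols)
proto_bin = enc.transform(protocols).toarray()

# Numeric features scaled into similar ranges via min-max scaling
scaler = MinMaxScaler().fit(numeric)
numeric_01 = scaler.transform(numeric)

X = np.hstack([proto_bin, numeric_01])
print(X.shape)  # 4 rows, 3 one-hot columns + 2 scaled columns
```

In practice the encoder and scaler would be fitted on the training set only and then applied unchanged to the validation and test splits.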
The feature representation stage estimates a reduced latent space of the data by means of unsupervised learning, that is, without using the labels describing the status of the network. As a baseline, the widely used PCA method reduces the dimensionality of the input data by computing a feature representation. In addition, a deep auto-encoder is trained on the training data to compute the latent representation at its bottleneck, so that the encoder provides the representation of new data. The dimension of the low-dimensional space is chosen as a trade-off between a significant reduction of the dimensionality of the data and an excessive loss of information. The same dimension is used for both feature transformations so that the methods can be compared. Once the feature transformation is done, the anomaly detection methods are trained using the normal instances, identified from the labels, of the train and validation sets. Then prediction on the test data is performed after transforming it into the resulting latent space. A flowchart of the architecture of the method is shown in Fig. 1.
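The PCA baseline of this stage can be sketched as follows (synthetic data; the input and latent dimensions are arbitrary choices for illustration, not the values used in the paper):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 20))   # stand-in for preprocessed traffic
X_test = rng.normal(size=(100, 20))

# Unsupervised latent representation: fit on the training data only,
# then project new data into the same reduced space
pca = PCA(n_components=5).fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
print(Z_train.shape, Z_test.shape)  # (500, 5) (100, 5)
```

The auto-encoder variant would replace `pca.transform` with the encoder half of a trained network, giving a non-linear mapping into the same number of dimensions.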
4 Experimental Methodology
Several experiments were performed using publicly available datasets based on real traffic for network intrusion detection. A preliminary feature learning stage was applied, and various evaluation metrics were used to compare the performance.
4.1 Datasets
In the majority of the previous works reviewed (see Sect. 2) on network intrusion detection, only a couple of datasets are widely used for the assessment of detection systems [1,3], i.e. DARPA 98 and 99 from MIT's Lincoln Laboratory and KDDCup'99. However, these datasets have several shortcomings which have already been identified in the literature [16,17,27]. This suggests that other datasets might be better suited to evaluate the detection of contemporary network attacks. For that reason, the following recent datasets have been used in the experiments in order to consider more realistic situations:
– UNSW-NB15 [19] contains a hybrid of real modern normal traffic and synthesized attack activities. It was generated using an automatic attack generation tool called IXIA PerfectStorm.
– NSL-KDD [27] was created in order to improve the KDDCup'99 dataset. Although the dataset still suffers from some problems and cannot be considered fully representative of modern networks, it can be used as a reference for comparison purposes because of its wide use.
– CIC-IDS-2017 [24] covers updated attacks, with more than 80 features and labels for benign and intrusive flows. Concretely, the data used here correspond exactly to the working hours of Wednesday.
– Kyoto [26] was built on 3 years of real traffic data (Nov. 2006–Aug. 2009) obtained from different kinds of honeypots.
All datasets include labels indicating normal traffic and the different types of attacks that occurred in the network, which are used for training and evaluating the anomaly detection tasks. A description of the datasets, such as the number of instances, the attributes or dimensions obtained after one-hot encoding some of the features, and the percentage of anomalies, is given in Table 1.
a min-max scaling was performed so that each feature is scaled to the range [0, 1] on the training set, and the same transformation is then applied to the validation and test data. Despite the variety of the intrusions labelled in the data, they are all grouped into one category; that is, only two classes (normal and anomaly) are considered in the analysis.
4.2 Experiments
First, four methods are applied for anomaly detection, where only normal instances are used for training. The methods used are:
– Local Outlier Factor (LOF) [4]: assigns to each object a degree of how isolated it is with respect to a specific neighbourhood. The number k of nearest neighbours, selected after several tests, is set to 60 for all datasets used in the experiments.
– One-Class Support Vector Machine (OC-SVM): uses only one class for estimating a model and detects new data that differ from that class as outliers [23]. The kernel used in this work is a radial basis function (RBF) with γ = 0.1, fixed experimentally.
– Isolation Forest (IF): creates an ensemble of trees that isolate anomalies instead of fitting the normal instances, which is a different approach to outlier detection [15].
– Robust Covariance (RC): implements the minimum covariance determinant, a highly robust algorithm for estimating the covariance matrix of multivariate data [22].
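With the hyperparameters stated above, the four detectors might be instantiated as follows (a scikit-learn sketch on synthetic data; the paper does not specify its implementation, so the library choice and the synthetic normal/anomalous split are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(2)
X_normal = rng.normal(size=(400, 4))                 # normal training instances
X_test = np.vstack([rng.normal(size=(10, 4)),        # 10 normal points
                    rng.normal(loc=8.0, size=(10, 4))])  # 10 far-away anomalies

detectors = {
    "LOF": LocalOutlierFactor(n_neighbors=60, novelty=True),
    "OC-SVM": OneClassSVM(kernel="rbf", gamma=0.1),
    "IF": IsolationForest(random_state=0),
    "RC": EllipticEnvelope(random_state=0),  # minimum covariance determinant
}

for name, det in detectors.items():
    det.fit(X_normal)            # trained on normal traffic only
    pred = det.predict(X_test)   # +1 = normal, -1 = anomaly
    print(name, int((pred == -1).sum()), "anomalies flagged")
```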
detailed in Table 3. In this table, the accuracy, precision, recall and F1 score indicate the performance of the corresponding method for each dataset, including the previous feature learning stage using PCA or the encoder network. The best F1 score for each dataset is highlighted among all the methods used. Furthermore, the area under the curve (AUC) and ROC curves are shown in Fig. 2 in matrix form, where the rows correspond to the datasets and the columns to the methods applied. In case of equal F1 scores, the AUC value is considered for selecting the best one.
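The evaluation metrics reported in Table 3 can be computed as follows (a scikit-learn sketch with hypothetical labels and anomaly scores):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Hypothetical ground truth (1 = anomaly) and detector output
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.7]  # anomaly scores

print("accuracy ", accuracy_score(y_true, y_pred))   # 0.75
print("precision", precision_score(y_true, y_pred))  # 0.75
print("recall   ", recall_score(y_true, y_pred))     # 0.75
print("F1       ", f1_score(y_true, y_pred))         # 0.75
print("AUC      ", roc_auc_score(y_true, scores))    # 0.9375
```

Note that the AUC is computed from the continuous anomaly scores, not from the thresholded predictions, which is why a method can have a low F1 score but a reasonable ROC curve.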
Several changes can be observed in the performance of the anomaly detection tasks as a result of the feature learning stage. The most significant improvement occurs with the One-Class SVM, where using the auto-encoder to compute a feature representation yields better evaluation metrics for all datasets used in the experiments. In addition, the auto-encoder representation also produces small improvements in the results of the Local Outlier Factor, as shown in Fig. 2.
However, the feature representation barely affects the effectiveness of anomaly detection using the Isolation Forest and Robust Covariance methods. There are improvements for both methods only on the CIC-IDS-2017 dataset, as shown by the F1 scores (see Table 3). Moreover, in some cases it is preferable to apply these two methods to the original data without any feature learning.
On the other hand, the PCA transformation generally produces results similar to the original data and, in some cases, even worse than the original features. There are only a few cases where the representation computed by PCA outperforms the rest. In these cases, the method used is Robust Covariance, which seems to be the best suited to a preceding PCA feature learning stage. This may reflect that linear techniques only work in specific scenarios and might be insufficient for a general analysis. Finally, it is remarkable that the results of the experiments show poor performance in some cases, for example on the Kyoto data using the Local Outlier Factor.
Table 3. Performance of the proposed methods for intrusion detection.
Fig. 2. ROC curves obtained for each method and dataset using the direct method and the feature transformations computed by PCA and the encoder network.
5 Conclusions
Network intrusion detection is an active research area in continuous development. Although there have been numerous efforts to address its challenges, anomaly-based approaches are sometimes difficult to apply in real intrusion detection systems.
In this work, feature learning is applied to network intrusion detection as a stage preceding four different anomaly detection techniques, applied to recent datasets. The methods used for computing the latent representation of the data are PCA and the encoder part of an auto-encoder, which introduces non-linearity. The main improvement across the datasets is obtained for the One-Class SVM using the latent space computed by the auto-encoder. In contrast, the PCA transformation does not show a relevant enhancement when applied as a preceding feature learning stage.
Future work includes the study of other types of auto-encoders and techniques, including different feature selection methods combined with more anomaly detection algorithms, which can help to improve the identification of intrusions.
Cyber Security Incident Handling, Warning
and Response System for the European Critical
Information Infrastructures (CyberSANE)
Abstract. This paper aims to enhance the security and resilience of Critical
Information Infrastructures (CIIs) by providing a dynamic, collaborative warning
and response system (the CyberSANE system) supporting and guiding security
officers and operators (e.g. incident response professionals) to recognize,
identify, dynamically analyse, forecast, treat and respond to threats and
risks, and to handle their daily cyber incidents. The proposed solution provides a
first-of-its-kind approach for handling cyber security incidents in digital
environments of a highly interconnected, complex and diverse nature.
Keywords: Incident handling · Web mining · Data fusion and risk assessment
1 Introduction
In the digital era, Critical Infrastructures (CIs) are operating under the premise of robust
and reliable ICT components, complex ICT infrastructures and emerging technologies
(e.g. IoT, Cloud Computing) and are transforming into Critical Information Infras-
tructures (CIIs) that can offer a high degree of flexibility and efficiency in the com-
munication and coordination of advanced services. The increased usage of information
technology in modern CIIs means that they are becoming more vulnerable to the
activities of hackers and other perpetrators of cyber-related crime.
Over the last few years, daily headlines describing major cyber-attacks, new
strains of malware, or insidious social engineering techniques used to attack
ICT infrastructures have become a common phenomenon. In particular, CIIs have
lately become targets for cyber-attacks, attracting the attention of security researchers,
cyber-criminals, hacktivists (e.g. Anonymous, LulzSec) and other such role-players
(e.g. cyber-spies). These cyber actors have significantly evolved their tactics, techniques
and procedures to include next-generation malware toolkits available in various
locations on the internet (e.g. deep web, dark web) and new data exfiltration methods
that give them an asymmetric quantum leap in capability. In the past years, there have
478 S. Papastergiou et al.
The main goal of the security incident handling and response process is to define the
main aspects and principles for coordinating the effort that should be applied in
managing a security breach/incident/event [1, 2]. In principle, choosing the right
approach for incident handling proves to be complicated. In recent years, a number of
security incident response approaches and frameworks [3–14] have been introduced by
the research and industrial communities, as well as various standardization bodies.
Although many of these approaches provide specific technical guidelines aiming to
enhance the security incident response capabilities of organizations, they present
significant limitations. In particular, Grimes (2007) argues that most of the existing
incident response approaches follow a linear process that is outdated and does not
support the highly efficient capability required to handle and manage today's
incidents. Therefore, a progression flaw exists in these processes: if one phase in
the linear process is not completed, the entire process cycle may stop midstream.
The authors of [15] note that current incident response processes are too focused on
containment-, eradication- and recovery-related activities and usually ignore, skip or
de-emphasize other important steps of incident management, such as investigative actions.
The proposal in [16] places emphasis on proactive preparation and reactive learning to
encourage security incident learning. The authors of [17–19] argue that the existing
incident handling approaches do not provide adequate guidance on how to conduct
effective forensic investigations. Hence, the limited ability of current methods to
assist and guide investigators in forensic evidence analysis undermines the value of
the evidence and fails to promote incident resolution.
In addition, the available security information and event management solutions lack
significant reactive and post-incident capabilities for managing incidents and events in
the scope of ICT-based CIIs, providing inadequate technical guidance to incident
response professionals on how to detect, investigate and reproduce attacks. As such,
and despite the socioeconomic importance of tools and techniques for handling
incidents, there is still no easy, structured, standardized and trusted way to manage and
forecast interrelated cybersecurity incidents in a way that takes into account the
heterogeneity and complexity of the CIIs and the increasingly sophisticated types of
attacks. Therefore, there is a pressing need for devising novel systems for efficient CII
incident handling that support a thorough and common understanding of cyber-attack
situations in a timely manner.
In a nutshell, the main limitations [20–22] of the existing approaches are the
following: (i) the traditional linear incident response models are too slow and
ineffective, and do not support the highly efficient capability required to handle and
manage today's incidents; (ii) they focus mostly on the proactive element of incident
management (i.e. providing assistance and information to help prepare, protect, and
secure); (iii) they do not provide enough insight into the underlying causes of an
incident; (iv) they make poor provisions for incident planning; (v) they undermine the
value of forensic evidence possibly required for subsequent legal action; and (vi) they
do not take into account the risk-related results produced by existing risk assessment
methodologies.
The proposed incident handling approach aims to address the aforementioned limitations
of the existing methodologies and tools, providing step-by-step guidance for
managing incidents and breaches of CIIs caused by cyber-attacks. On this account,
CyberSANE seeks to combine active approaches, used to detect and analyse
anomalous activities and attacks in real time, with reactive approaches, which deal
with the analysis of the underlying infrastructure to assess an incident, in order to
provide a more holistic and integrated approach to incident handling. In this vein,
CyberSANE aims to enhance the incident detection capabilities of the existing
methods described in the previous section with a more efficient, elastic and scalable
reasoning approach. The main characteristics of the proposed approach are the
following: (i) learning from unstructured data without the need to understand the
content; (ii) identification of unusual activities that match the structural patterns of
possible intrusions (instead of predefined rules); and (iii) automatic identification of,
and adaptation to, changes of the underlying infrastructure.
approach takes into consideration and addresses both technical and cognitive chal-
lenges (Fig. 1).
• The Privacy & Data Protection (PrivacyNet) Orchestrator, which provides a set of
privacy capabilities (anonymization, pseudonymization, obfuscation), together with
data protection orchestration and consistency capabilities.
It should be noted that the proposed solution and the incorporated techniques are
able to operate in heterogeneous, large-scale, cross-border CIIs that are characterized
by the following features: (i) complex, highly distributed, and large-scale cyber systems
(including IoT and cyber-physical systems) with respect to the number of entities
involved; (ii) heterogeneity of the underlying networks interconnecting the
physical-cyber systems; and (iii) different levels of exposure to attacks. The following
sections provide a detailed description of each component.
4.2 Deep and Dark Web Mining and Intelligence (DarkNet) Component
The Deep and Dark Web Mining and Intelligence (DarkNet) component provides the
appropriate social information mining capabilities that allow the exploitation and
analysis of security-, risk- and threat-related information embedded in user-generated
content (UGC). This is achieved via the analysis of both the textual and the meta-data
content available from such streams. Textual information is processed to extract data
from otherwise disparate and distributed sources that may offer unique insights into
possible cyber threats. Examples include the identification of situations that can
become a threat to the CIIs, with significant legal, regulatory and technical
considerations. Such situations are: the organization of hacktivist activities in
underground forums or IRC channels; external situations that can become a potential
threat to the CIIs (e.g. relevant geopolitical changes); the disclosure of zero-day
vulnerabilities; sockpuppets impersonating real profiles in social networks, etc.
Entities (e.g., events, places) and security-related information are uniquely extracted
from textual content using advanced Natural Language Processing (NLP) techniques,
such as sentiment analysis.
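As a toy illustration of the textual mining stage described above (the keyword list, weights and alert threshold below are hypothetical placeholders; the actual component relies on advanced NLP such as sentiment analysis and entity extraction):

```python
# Illustrative sketch, not the CyberSANE implementation: score user-generated
# content for CII threat indicators with a simple weighted keyword match.
import re
from dataclasses import dataclass

# hypothetical indicator terms and weights
THREAT_TERMS = {"zero day": 3, "0day": 3, "exploit": 2, "ddos": 2, "dump": 1}

@dataclass
class Finding:
    source: str
    score: int
    hits: list

def scan_post(source: str, text: str) -> Finding:
    lowered = text.lower()
    hits = [t for t in THREAT_TERMS
            if re.search(r"\b" + re.escape(t) + r"\b", lowered)]
    return Finding(source, sum(THREAT_TERMS[t] for t in hits), hits)

posts = [
    ("irc:#ops", "planning a ddos against the port authority scada"),
    ("forum/x", "selling fresh 0day exploit for plc firmware"),
    ("social", "lovely weather at the harbour today"),
]
# raise an alert when the accumulated score crosses a (hypothetical) threshold
alerts = [f for f in (scan_post(s, t) for s, t in posts) if f.score >= 2]
```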
(ii) the identification of the attacker’s goals and strategies and prediction of their next
actions; and (iii) the accurate assessment of the impact of an incident on the CII and the
damage caused so far.
The Security Incident/Attack Simulation Environment of the CyberSANE system
comprises a set of novel mathematical instruments, including mathematical models for
simulating, analyzing, validating and monitoring simulation data, and for optimizing
the security incident handling process. Specifically, these instruments include: (i) a
bundle of novel process/attack analysis and simulation techniques for designing,
executing, analyzing and optimizing threat and attack simulation experiments,
producing appropriate evidence and information that facilitate the identification,
assessment and mitigation of CII-related risks; (ii) graph theory, to implement attack
graph generation, perform security incident analysis and strengthen the prognosis of
future malefactor steps; (iii) pioneering mathematical techniques for analyzing,
compiling and combining information and evidence about security incidents and
attack/threat patterns and paths, in order to find relationships between the recovered
forensic artefacts and piece the evidential data together into useful chains of evidence
(linked evidence) associated with a specific incident; (iv) innovative simulation
techniques which optimize the automatic analysis of diverse data; and (v) innovative
techniques to link optimization and simulation. In this context,
this simulation environment is fed with information about an incident and proceeds to
calculate and generate a number of possible attack graphs (routes of possible attacks)
and graphs of linked evidence (chains of evidence) and also compute probabilities for a
sequence of events on top of these graphs. The resulting probabilistic estimate for the
compromised CIIs’ assets will be used to identify, model and represent the course of an
attack as it propagates across the CIIs. It should be noted that the HybridNet
component continuously updates the simulation engine with the collected pieces of
information, thereby enabling both an understanding of which assets might have been
compromised and more accurate estimates of the likelihood that other assets might be
compromised in the future.
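The probability computation on attack graphs mentioned above can be sketched as follows; the graph, the edge probabilities and the independence assumption are illustrative, not CyberSANE's actual models:

```python
# Hedged sketch of attack-graph reasoning: given per-edge probabilities of a
# lateral-movement step succeeding between CII assets, estimate the chance
# that a whole attack path succeeds. Assets and numbers are invented.
from math import prod

# edges: (from_asset, to_asset) -> P(step succeeds | previous step succeeded)
edges = {
    ("firewall", "webserver"): 0.6,
    ("webserver", "database"): 0.5,
    ("webserver", "scada_gw"): 0.2,
    ("scada_gw", "plc"): 0.7,
}

def path_probability(path):
    """P(the whole chain of steps succeeds), assuming independent steps."""
    return prod(edges[(a, b)] for a, b in zip(path, path[1:]))

p = path_probability(["firewall", "webserver", "database"])
```

Ranking candidate paths by such probabilities is one simple way to prioritize which compromised-asset hypotheses the simulation engine explores first.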
privacy-aware information sharing by the CIIs' operators with relevant parties (e.g.
industry cooperation groups, CSIRTs), in order to exchange risk- and incident-related
information through specific standards and/or formats (e.g. STIX), improving overall
cyber risk understanding and reduction. Privacy preservation is another important
issue, considered at every phase of sharing by applying methods such as anonymization
or pseudonymization and encryption techniques, incorporated in and made available
from the PrivacyNet Orchestrator. This brings forward a mixture of several
cryptographic techniques that hold certain security guarantees.
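A minimal sketch of this privacy-aware sharing step, assuming a STIX-2.1-style indicator object and keyed-hash pseudonymization of the reporting operator (the `x_reporter_pseudonym` property and the salt handling are illustrative assumptions, not the PrivacyNet implementation):

```python
# Sketch: pseudonymize the reporting operator's identity, then wrap the
# observable in a STIX-2.1-style indicator object ready for sharing.
import hashlib
import json
import uuid
from datetime import datetime, timezone

def pseudonymize(operator_id: str, salt: str = "site-secret") -> str:
    # keyed hash: stable pseudonym, not reversible without knowing the salt
    return hashlib.sha256((salt + operator_id).encode()).hexdigest()[:16]

def make_indicator(malicious_ip: str, operator_id: str) -> dict:
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return {
        "type": "indicator",
        "spec_version": "2.1",
        "id": f"indicator--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        "pattern": f"[ipv4-addr:value = '{malicious_ip}']",
        "pattern_type": "stix",
        "valid_from": now,
        # custom property carrying the pseudonymized reporter identity
        "x_reporter_pseudonym": pseudonymize(operator_id),
    }

shared = json.dumps(make_indicator("203.0.113.7", "operator-42"), indent=2)
```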
The operational data flows supported by the CyberSANE system are the following:
(1) detected cyber-attacks are visualized in the LiveNet component; (2) operators
determine whether their organization has fallen victim to a cyber-attack; (3) raw data
are discovered, extracted and collected from various sources (e.g. servers, logs) and in
different formats; (4) data from distributed sources (Dark Web, social media) that offer
unique insights into cyber threats and provide information about the latest mechanisms
of cyber-attacks are extracted, harmonized and processed; (5) collected data are
normalized, cleansed to remove redundant information, and transformed into a common
representation format; (6) all relevant extracted information is analyzed and correlated
to provide a more comprehensive and detailed view of the incident; (7) dependency
evidence chains are generated; (8) on-going attacks are identified (identification of the
attacker's behavior, prediction of next actions); (9) risks are evaluated, the impact and
cascading effects are assessed, and a mitigation plan is formulated; (10) possible
scenarios of future attacks are predicted; (11) incident-related information is visualized,
enabling a deep understanding of the situation and decision making (notifications);
(12) incident-related information is managed and stored in a secure and privacy-aware
manner; and (13) useful incident-related information is shared and disseminated.
6 Conclusions
This paper aims to leverage collected security information to find new ways of
protecting technology assets, enabling the entity at risk to evaluate the risk and invest
to limit it in an optimal way. Providing a way to securely collect both structured data
(e.g. logs and network traffic) and unstructured data (e.g. data coming from social
networks and the dark web) and making them available for analysis fosters
innovations that will only unfold once such data become accessible, harnessing their
full potential. CyberSANE has a twofold aim: to minimize the exposure to security
risks/threats and to help CIIs' operators respond successfully to relevant incidents.
The ground-breaking nature of the proposed incident handling approach is based
on: (i) the identification of attacks and incidents using innovative approaches and
Acknowledgements. The authors would like to thank the University of Piraeus Research Centre
for its continuous support.
References
1. West-Brown, M.J., Stikvoort, D., Kossakowski, K.P., Killcrece, G., Ruefle, R.: Handbook
for computer security incident response teams (CSIRTs). Technical report
CMU/SEI-2003-HB-002, Carnegie Mellon University, Software Engineering Institute,
Pittsburgh, PA (2003)
2. Wiik, J., Kossakowski, K.P.: Dynamics of incident response. In: 17th Annual FIRST
Conference on Computer Security Incident Handling, Singapore (2005)
3. British Standards Institution. BS ISO/IEC 27035:2011 - Information Technology. Security
Techniques. Information Security Incident Management (2011)
4. Cichonski, P., Scarfone, K.: Computer Security Incident Handling Guide:
Recommendations. National Institute of Standards and Technology (NIST),
Gaithersburg (2012)
5. ENISA CSIRTs by Country-Interactive Map. https://www.enisa.europa.eu/topics/csirts-in-
europe/csirt-inventory/certs-by-country-interactive-map
6. Northcutt, S.: Computer Security Incident Handling Version 2.3.1 (2003)
7. Vangelos, M.: Incident response: managing. In: Encyclopedia of Information Assurance,
pp. 1442–1449. Taylor & Francis (2011)
8. Werlinger, R., Muldner, K., Hawkey, K., Beznosov, K.: Preparation, detection, and analysis:
the diagnostic work of it security incident response. Inf. Manag. Comput. Secur. 18(1),
26–42 (2010)
9. Khurana, H., Basney, J., Bakht, M., Freemon, M., Welch, V., Butler, R.: Palantir: a
framework for collaborative incident response and investigation. In: Proceedings of the 8th
Symposium on Identity and Trust on the Internet, pp. 38–51 (2009)
10. Grobauer, B., Schreck, T.: Towards incident handling in the cloud. In: Proceedings of the
2010 ACM Workshop on Cloud Computing Security Workshop (CCSW 10), pp. 77–85
(2010)
11. Monfared, A., Jaatun, M.G.: Handling compromised components in an IaaS cloud
installation. J. Cloud Comput. Adv. Syst. Appl. 1, 16 (2012)
12. Line, M.B.: A case study: preparing for the smart grids - identifying current practice for
information security incident management in the power industry. In: 2013 Seventh
International Conference on IT Security Incident Management and IT Forensics (IMF),
pp. 26–32. IEEE (2013)
13. Cusick, J.J., Ma, G.: Creating an ITIL inspired incident management approach: roots,
response, and results. In: Network Operations and Management Symposium Workshops
(NOMS Wksps) 2010 IEEE/IFIP, pp. 142–148. IEEE (2010)
14. Connell, A., Palko, T., Yasar, H.: Cerebro: a platform for collaborative incident response and
investigation. In: 2013 IEEE International Conference on Technologies for Homeland
Security (HST) (2013)
15. Ahmad, A., Hadgkiss, J., Ruighaver, A.B.: Incident response teams-challenges in supporting
the organisational security function. Comput. Secur. 31(5), 643–652 (2012)
16. Shedden, P., Ahmad, A., Ruighaver, A.B.: Informal learning in security incident response
teams. In: 2011 Australasian Conference on Information Systems (2011)
17. Casey, E.: Investigating sophisticated security breaches. Commun. ACM 49(2), 48–55
(2006)
18. Nnoli, H., Lindskog, D., Zavarsky, P., Aghili, S., Ruhl, R.: The governance of corporate
forensics using COBIT, NIST and increased automated forensic approaches. In: 2012
International Conference on Privacy, Security, Risk and Trust. IEEE (2012)
19. Tan, T., Ruighaver, T., Ahmad, A.: Incident handling: where the need for planning is often
not recognised. In: 1st Australian Computer, Network & Information Forensics Conference
(2003)
20. FireEye. The Need for Speed: 2013 Incident Response Survey (2013)
21. Grispos, G., Glisson, W.B., Storer, T.: Rethinking security incident response: the integration
of agile principles. arXiv preprint arXiv:1408.2431 (2014)
22. Ab Rahman, N.H., Choo, K.K.R.: A survey of information security incident handling in the
cloud. Comput. Secur. 49, 45–69 (2015)
23. Papastergiou, S., Polemi, D.: Securing maritime logistics and supply chain: the Medusa and
MITIGATE approaches. Maritime Interdiction Operations Journal 14(1), 42–48 (2017).
Proceedings of 2nd NMIOTIC Conference on Cyber Security. ISSN 2242-441X
24. Papastergiou, S., Polemi, N.: MITIGATE: a dynamic supply chain cyber risk assessment
methodology. In: Yang, X.S., Nagar, A., Joshi, A. (eds.) Smart Trends in Systems, Security
and Sustainability. LNNS, vol. 18, pp. 1–9. Springer, Heidelberg (2018). https://doi.org/10.
1007/978-981-10-6916-1_1
25. Kalogeraki, E.-M., Papastergiou, S., Polemi N.: SAURON real-life use cases: terrorists
attack a cruise ship berthed at a port facility. In: The 9th NMIOTC Annual Conference
“Fostering Projection of Stability through Maritime Security: Achieving Enhanced
Capabilities and Operational Effectiveness” 5–7 June 2018 (2018)
Fault Diagnosis in Direct Current Electric
Motors via an Artificial Neural Network
1 Introduction
Electric motors are ubiquitous components of modern infrastructure, consuming
approximately 60% of all the electric power produced. Consequently, their
appropriate and reliable operation is essential.
Design criteria, such as margins on the nominal parameter values of a
system, can ensure acceptable performance under slight operating disturbances.
Nevertheless, if the system dynamics change significantly due to a
component failure, the result may be suboptimal or, in the worst case,
catastrophic. In order to ensure safety and proper maintenance, fault detection and, if
possible, fault classification (also referred to as fault diagnosis) in operating
systems are imperative. Fault diagnosis should ideally be performed using as few
sensors as possible. In cases where the faults are “small” and incipient, the
problem becomes highly challenging.
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 488–498, 2019.
https://doi.org/10.1007/978-3-030-20257-6_42
Multiple approaches have been proposed for the problem of fault detection
and classification in electric motors; the interested reader may refer to [7] for a
survey confined to Alternating Current (AC) induction (asynchronous) motors.
In this article, a simple, yet powerful, technique for fault detection and
classification of Direct Current (DC) electric motors is proposed, at the core of
which lies an Artificial Neural Network (ANN). Loosely speaking, an ANN is
a computing system, vaguely inspired by the biological neural networks that
constitute living organisms' brains [6]. Such a system “learns” to perform tasks
by considering examples, generally without being explicitly programmed with
any task-specific rules and without the need for a rigorous mathematical model
of the specific system.1 It is worth noting that the work, although limited to DC
motors, extends naturally to many other types of electric motors.
Other studies may, in their majority, be characterized as threshold-oriented,
in the sense that the techniques involved are based on the intuition that, when
a (signal or parameter) value exceeds a preset threshold, a fault is detected [4].
However, this is often problematic: the classification of the fault is not easily
addressed, and such practices may lead to many false alarms.
In fact, under varying operating conditions, the effects of the considered faults
on the dynamics are almost fully “masked” by those of the operating conditions.
In addition, for these methods to be applied, an accurate physics-based motor
model is necessary.
Another large portion of fault detection and classification approaches requires
pre-processing of the measured signals (such as Fourier transform analysis or
statistical analysis of a residual [5,8]), demanding significant resources and thus
making them unsuitable for quick on-line tests.
The goal of the present study is to take advantage of the powerful ability
of an ANN to identify patterns with high fidelity—which in turn addresses the
majority of the aforementioned disadvantages of other techniques—and test its
performance in the problem of fault diagnosis in DC motors. As far as the pro-
posed method is concerned, the ANN consists of an input layer, two hidden
layers, and an output layer. The input layer of the ANN receives (without any
pre-processing) the measured signals of the armature current and angular veloc-
ity of a DC motor under load condition, and the output layer identifies and
classifies the (potential) fault. The ANN, essentially, recognizes the patterns of
the corresponding signals; hence, it recognizes the unique “profile” of each state
condition of the system. This, in turn, departs from the rigid threshold-oriented
approach; therefore, false alarms are significantly reduced or even eliminated.
Furthermore, the proposed technique requires no pre-processing of the measured
signals, making it fast and flexible for on-line testing.
The rest of this paper is organized as follows: Sect. 2 presents the model of the
system, Sect. 3 analyses the fault detection and classification problem, whereas
1
Artificial Neural Networks have been successfully used in a variety of applications,
including computer vision, speech recognition, machine translation, and many more.
490 T. I. Aravanis et al.
in Sect. 4 the Artificial Neural Network training and testing results are discussed.
The last section is devoted to some concluding remarks.
2 System Modelling
The system layout of a DC motor is shown in Fig. 1, where it is assumed that
the DC voltage, U , and load torque, Tl , are the inputs, and armature current,
ia (t), and angular velocity of the shaft, ω(t), are the outputs of the system.
[Fig. 1: armature circuit of the DC motor, with supply voltage U, resistance Ra, inductance La, armature current ia(t), back-EMF e(t), and rotor inertia J.]
\[
L_a \frac{di_a(t)}{dt} + R_a\, i_a(t) + e(t) = U \quad (1)
\]
\[
J \frac{d\omega(t)}{dt} = T_m(t) - T_l - T_f(t) \quad (2)
\]
Defining the state variables \(z_1 = i_a(t)\) and \(z_2 = \omega(t)\), and using the standard relations \(e(t) = K_e\,\omega(t)\), \(T_m(t) = K_t\, i_a(t)\) and \(T_f(t) = b\,\omega(t)\), the equations take the form
\[
\dot{z}_1 = \frac{1}{L_a}\left(U - R_a z_1 - K_e z_2\right) \quad (3)
\]
\[
\dot{z}_2 = \frac{1}{J}\left(-T_l + K_t z_1 - b z_2\right) \quad (4)
\]
The state-space representation of the system is given by the following two
equations:
\[
\dot{z} = A z + B u \quad (5)
\]
\[
y = C z + D u \quad (6)
\]
Then, the following matrices are derived for the examined model (note that
all states are outputs, i.e., z = y):
\[
A = \begin{bmatrix} -\dfrac{R_a}{L_a} & -\dfrac{K_e}{L_a} \\[4pt] \dfrac{K_t}{J} & -\dfrac{b}{J} \end{bmatrix}
  = \begin{bmatrix} -1910 & -20 \\ 746 & -1243 \end{bmatrix}, \qquad
B = \begin{bmatrix} \dfrac{1}{L_a} & 0 \\[4pt] 0 & -\dfrac{1}{J} \end{bmatrix}
  = \begin{bmatrix} 333 & 0 \\ 0 & -12500 \end{bmatrix}
\]
\[
C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
D = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \qquad
u = \begin{bmatrix} U \\ T_l \end{bmatrix}, \qquad
z = \begin{bmatrix} i_a(t) \\ \omega(t) \end{bmatrix}, \qquad
y = \begin{bmatrix} i_a(t) \\ \omega(t) \end{bmatrix}
\]
3 Fault Detection and Classification Problem
This section is devoted to the definition of the fault diagnosis problem. In
particular, the fault scenarios, along with the simulations of the system in
healthy and faulty conditions, are presented.
Three fault scenarios are considered and simulated by modifying the nominal
values of the physical parameters of the system, as described in the following
[1–4]:2 an increase of the armature resistance Ra, an increase of the armature
inductance La, and a decrease of the constants Ke and Kt (cf. the legends of
Figs. 2 and 3).3
The output (time) response of the system is derived using MATLAB's lsim
function, at a sampling rate fs = 1 kHz. The time vector contains 1001 points
(from t = 0 s to t = 1 s). Each output signal is normalized to the interval [0, 1],
in order to “feed” the input layer of the ANN with all inputs in an unbiased,
comparable range.
The DC input voltage U and the load torque Tl are set equal to 200 V and
0.1 N m, respectively. Their fluctuations, ±2% for the voltage and ±10% for the
load torque, are approximated by zero-mean uniform white noise, and represent
the varying operating conditions.
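The simulation set-up just described can be approximated in Python; as an assumption, a forward-Euler loop at fs = 1 kHz stands in for MATLAB's lsim, using the numeric A and B matrices of Sect. 2:

```python
# Sketch of the data generation described above: state-space model (5)-(6)
# driven by U = 200 V and Tl = 0.1 N·m with ±2% / ±10% uniform fluctuations,
# integrated by forward Euler (an assumption replacing MATLAB's lsim).
import numpy as np

A = np.array([[-1910.0, -20.0], [746.0, -1243.0]])
B = np.array([[333.0, 0.0], [0.0, -12500.0]])

fs = 1000.0                                  # 1 kHz sampling rate
t = np.linspace(0.0, 1.0, 1001)              # 1001 points, t = 0..1 s
dt = 1.0 / fs
rng = np.random.default_rng(0)

z = np.zeros((t.size, 2))                    # z = [i_a(t), omega(t)]
for k in range(t.size - 1):
    U = 200.0 * (1 + rng.uniform(-0.02, 0.02))    # ±2% supply fluctuation
    Tl = 0.1 * (1 + rng.uniform(-0.10, 0.10))     # ±10% load fluctuation
    u = np.array([U, Tl])
    z[k + 1] = z[k] + dt * (A @ z[k] + B @ u)     # forward Euler step

# normalize each output to [0, 1] before feeding the ANN input layer
y = (z - z.min(axis=0)) / (z.max(axis=0) - z.min(axis=0))
```

Fault scenarios would be generated the same way after perturbing the corresponding entries of A and B.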
Indicative armature current and angular velocity responses for all four oper-
ating states of the system (one healthy and three faulty) are depicted in Fig. 2.
A Fast Fourier Transform (FFT) is employed in order to study the frequency
content of the two output signals, for all four states of the system. The results
are illustrated in Fig. 3.
2
The model of the DC motor is exclusively used for the data generation of the healthy
and faulty system.
3
Brush faults are quite common failures in DC motors; they are typically a
consequence of unsmooth contact of the brushes, a dirty collector or brushes,
unadjusted brush pressure springs, an oval-shaped collector, or consumed brush
life [1].
[Figure: two panels versus time (0–1 s): (a) armature current (A), (b) angular velocity (rad/s); each shows four curves: healthy state, increased Ra, increased La, decreased Ke and Kt.]
Fig. 2. Indicative waveforms of the armature current (a) and angular velocity (b), for
all four operating states of the system.
[Figure: two FFT panels versus frequency (0–30 Hz): (a) amplitude (A), (b) amplitude (rad/s); each shows four curves: healthy state, increased Ra, increased La, decreased Ke and Kt.]
Fig. 3. Spectrum of the armature current (a) and angular velocity (b), for all four
operating states of the system.
As is evident from Figs. 2 and 3, for all four scenarios, the waveforms of
the healthy and faulty system, as well as their frequency content, are almost
identical, indicating a highly challenging fault detection problem.4
– Firstly, the data (armature current and angular velocity signals) are obtained from the DC motor and normalized in the interval [0, 1], in order to appropriately drive the ANN.
– Subsequently, the normalized signals are imported into the (input layer of the) ANN, without any analysis or pre-processing.
– Finally, the motor condition is decided as an output of the ANN.
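The normalization step above can be sketched as a minimal min-max scaler (the function name and the assumption that each signal is scaled independently are illustrative):

```python
import numpy as np

# Hedged sketch of the first step: min-max normalization of each raw
# signal (armature current, angular velocity) to the interval [0, 1].
def normalize(signal):
    s = np.asarray(signal, dtype=float)
    return (s - s.min()) / (s.max() - s.min())
```

For example, `normalize([2.0, 4.0, 6.0])` returns `[0.0, 0.5, 1.0]`.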
Data Acquisition: DC Motor → ia (t), ω(t) → ANN → Motor Condition
Fig. 4. Block diagram of the proposed technique for fault diagnosis in DC motors.
For training and evaluating the ANN, the contemporary and powerful
machine learning framework TensorFlow is utilized, along with the high-level
neural networks API, Keras.5
The ANN is a four-layer (one input layer, two hidden layers and one output
layer), feed-forward, fully-connected neural network, as depicted in Fig. 5. The
input layer of the network is a Flatten layer, which merely transforms the
format of the waveforms from a 2D-array (of 1001 by 2 values) to a 1D-array of
1001 · 2 = 2002 values. Recall that the signals of the input layer are the armature
current and the angular velocity of the DC motor, and the length of each signal
is 1001 points. The Flatten layer has no parameters to learn; it only reformats
the data. After this layer, the network consists of a sequence of three Dense
layers. The first two Dense (hidden) layers have 200 nodes (or neurons) each,
with a rectifier activation function (relu). The third (output) layer is a 4-node
softmax layer; each node contains a score that indicates the probability that the
current waveform belongs to one of the four fault classes (see Table 2 below).
The network weights are initialized to a small random number generated from
a uniform distribution, in this case between 0 and 0.05.
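Under the assumption of a standard Keras setup (the optimizer and loss below are illustrative choices, not stated at this point in the text), the described architecture can be sketched as:

```python
from tensorflow import keras

# Sketch of the described four-layer network: Flatten (1001 x 2 -> 2002),
# two Dense hidden layers of 200 relu units, and a 4-node softmax output.
# Weights are initialized uniformly in [0, 0.05], as in the text.
init = keras.initializers.RandomUniform(minval=0.0, maxval=0.05)
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(1001, 2)),
    keras.layers.Dense(200, activation="relu", kernel_initializer=init),
    keras.layers.Dense(200, activation="relu", kernel_initializer=init),
    keras.layers.Dense(4, activation="softmax", kernel_initializer=init),
])
# The compile settings below are assumptions for completeness.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The Flatten layer contributes no trainable parameters; the three Dense layers contribute 2002·200 + 200, 200·200 + 200, and 200·4 + 4 weights and biases, respectively.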
4 This is most evident for the healthy case and the faulty scenario where La is increased by 200%.
5 TensorFlow was developed by the Google Brain team, and can be found at https://www.tensorflow.org.
Fig. 5. The four-layer, feed-forward, fully-connected ANN, with 2002 input nodes (In #1 to In #2002) and four output nodes (Out #1 to Out #4).
As mentioned above, the input data (armature current and angular velocity) for training the ANN consist of 12 samples, representing the four operating states of the DC motor: healthy state (3 samples), increased armature resistance (3 samples), increased armature inductance (3 samples), and decreased electromotive force and motor torque constants (3 samples).
Each motor state is mapped to a single target of the output layer of the ANN, as shown in Table 2. Each target is represented by the binary states of the output neurons of the network.
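The target encoding of Table 2 can be sketched as one-hot vectors (the class ordering below is an illustrative assumption; the exact neuron ordering of Table 2 is not reproduced here):

```python
import numpy as np

# Hedged sketch of the one-hot targets: each motor state activates exactly
# one output neuron. The class ordering is an illustrative assumption.
classes = ["healthy", "increased Ra", "increased La", "decreased Ke and Kt"]
targets = np.eye(len(classes), dtype=int)   # row i is the target for class i
```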
The accuracy of the model on the training data reached 100%, with a total
training time of about 5 s (500 epochs of about 10 ms each).
Table 2. Targets of the output layer of the ANN and fault classes.
The accuracy of the model in the test phase reached 99.7%, which indicates successful training (without over-fitting).6
The confusion matrix (error matrix), presented in Table 3, allows visualiza-
tion of the model’s performance. A confusion matrix C is such that Ci,j is equal
to the number of instances known to be in class i, but predicted to be in class j.
The accuracy of the model in the test phase can also be calculated/verified from Table 3 as follows:

Accuracy = Correct / Total Samples = (247 + 250 + 250 + 250) / 1000 = 0.997, or 99.7%.
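The same computation can be sketched from the Table 3 counts (the placement of the three misclassified Class No. 1 instances in the Class No. 3 column follows the discussion in the text):

```python
# Hedged sketch: accuracy recomputed from the confusion matrix of Table 3
# (rows: true class, columns: predicted class). The placement of the three
# misclassified instances follows the discussion in the text.
C = [[247, 0, 3, 0],
     [0, 250, 0, 0],
     [0, 0, 250, 0],
     [0, 0, 0, 250]]
correct = sum(C[i][i] for i in range(4))     # diagonal: correct predictions
total = sum(sum(row) for row in C)           # all test samples
accuracy = correct / total
print(accuracy)  # 0.997
```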
Fig. 6. Accuracy and confidence of predictions on four indicative samples of the test
dataset (armature current in blue, angular velocity in orange). Green label indicates
correct prediction. (Color figure online)
confidence level is 87% and 84%, respectively. In this case, the model classifies the input signals of Class No. 1, with a confidence level of 13% (gray label in Fig. 6), as if they were extracted from Class No. 3, and those of Class No. 3, with a confidence level of 16%, as if they were extracted from Class No. 1. This is not surprising, since the healthy state and the faulty scenario of increased armature inductance result in almost identical signals (in both the time and frequency domains), as highlighted in Figs. 2 and 3. Table 3 also indicates the challenge of discriminating between input signals belonging to Class No. 1 and Class No. 3, as 3 instances known to be in the former were predicted to be in the latter.
The trained model can straightforwardly be used for instant evaluations of
several motor states.
Clearly, the obtained results demonstrate that the proposed method achieves high accuracy in the combined problem of fault detection and classification, pinpointing a promising way to diagnose DC motor faults in real-world applications.
5 Conclusions
electromotive force constant and the motor torque constant, with respect to the nominal values, are detectable at near-perfect (above 99.7%) detection rates.
– The achieved results were obtained with challenging output responses (waveforms) of the DC motor, in the sense that they are almost identical in both the time and frequency domains, and with no pre-processing of the measured signals.
– The ANN was trained with a small training dataset, and tested with a suffi-
ciently large test dataset.
– Disturbances and noise (for certain parameters of the system) were considered
in simulations, in order to take into account the varying operating conditions.
– Key to the high performance of the proposed method is the powerful ability
of an ANN to identify patterns with high fidelity.
– The ANN does not need a rigorous mathematical model of the underlying
system for fault diagnosis.
– The flexibility and speed of the presented method indicate that it can easily
be applied to on-line fault diagnosis.
– The training and evaluation of the ANN were carried out in TensorFlow, a contemporary and powerful tool constituting a representative implementation of state-of-the-art ANN modelling methodologies.
References
1. Bay, O.F., Bayir, R.: A fault diagnosis of engine starting system via starter motors
using fuzzy logic algorithm. Gazi Univ. J. Sci. 24, 437–449 (2011)
2. Filbert, D., Schneider, C., Spannhake, S.: Model equation and fault detection of
electric motors. Technical report, IFAC Fault Detection, Supervision and Safety for
Technical Processes, Baden-Baden, Germany (1991)
3. Isermann, R.: Fault diagnosis of electrical drives. In: Isermann, R. (ed.) Fault-
Diagnosis Applications, pp. 49–80. Springer, Heidelberg (2011). https://doi.org/
10.1007/978-3-642-12767-0 3
4. Liu, X.Q., Zhang, H.Y., Liu, J., Yang, J.: Fault detection and diagnosis of
permanent-magnet DC motor based on parameter estimation and neural network.
IEEE Trans. Ind. Electron. 89, 1021–1030 (2000)
5. Patan, K., Korbicz, J., Glowacki, G.: DC motor fault diagnosis by means of artificial
neural network. In: Proceedings of the International Conference on Informatics in
Control, Automation and Robotics, ICINCO 2007, pp. 11–18 (2007)
6. Samarasinghe, S.: Neural Networks for Applied Sciences and Engineering: From
Fundamentals to Complex Pattern Recognition. Auerbach Publications, Boca Raton
(2006)
7. Trigeassou, J.C.: Electrical Machines Diagnosis. Wiley, Hoboken (2013)
8. Yu, K., Yang, F., Guo, H., Xu, J.: Fault diagnosis and location of brushless DC motor
system based on wavelet transform and artificial neural network. In: Proceedings
of the 2010 International Conference on Electrical Machines and Systems. IEEE
(2010)
1st PEINT Workshop
On Predicting Bottlenecks in Wavefront
Parallel Video Coding Using
Deep Neural Networks
1 Introduction
The ever-increasing demand for higher video resolution has led to the proliferation of 4K cameras and TV sets. Unfortunately, the advent of the UHD era resulted in higher bandwidth demands, which the popular H.264/AVC standard [1] proved insufficient to handle. As an example, YouTube suggests a roughly 50 Mbps transmission rate for 4K videos coded with H.264/AVC [2]. Recognizing the necessity for a more efficient video coding standard, the MPEG group launched the High Efficiency Video Coding standard [3] (also referred to as H.265). Similarly, AOMedia (a consortium of blue-chip companies in the hi-tech industry) recently launched the AV1 codec [4] as a royalty-free
© Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 501–510, 2019.
https://doi.org/10.1007/978-3-030-20257-6_43
502 N. Panagou et al.
The rest of the paper is organized as follows. Section 2 presents related work on
block based parallelism. The proposed model is discussed in Sect. 3 and experimen-
tally evaluated in Sect. 4. Finally, Sect. 5 concludes the paper.
2 Related Work
Block based parallelism in video coding has been exploited in the context of slices, tiles
and wavefront. Slice partitioning was used in H.264/AVC and was inherited in HEVC.
Slices essentially form sub-frames that can be coded and transmitted independently
(prediction and entropy coding dependencies are broken on slice boundaries).
In H.264/AVC a slice could be defined by taking macroblocks in consecutive raster order, while in HEVC slices can also be defined as groups of tiles, again taken in raster order. Since slices aim at facilitating network transmission, they carry header information that is unnecessary in the other two block-parallel methods. Therefore, from a bitrate perspective, slices are expensive compared to tiles and wavefront; thus, their use as a parallelization granule, although important in H.264/AVC [15], is rather limited in HEVC.
Tiles are defined by introducing an M × N grid that splits a frame into M vertical and N horizontal zones, each containing a number of CTU rows and columns. At the boundaries of a tile's rectangle, dependencies are broken, enabling independent processing. A number of studies demonstrated the high parallelization potential achievable by tile parallelization. In [16] the authors proposed to use a large number of tiles in order to smooth the load imbalance experienced at CPU cores. Since using a large number of tiles hinders coding efficiency, many works advocated the use of a number of tiles equaling the available processing cores. In [17] a tile resizing scheme was proposed with the aim of reducing load imbalances. The scheme was based on estimating
CTU compression time from the past average. Prediction methods and partitioning
algorithms specifically tuned for LowDelay (LD) video coding were proposed in [18]
and [19]. In [20] tile resizing was considered with the aim of reducing coding losses.
Finally, in [21] a tile resizing and CPU scheduling algorithm is illustrated that aims at
improving speedup when the number of processors is fewer than the number of tiles.
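As a minimal sketch of such an M × N tile partition (even splitting of CTU rows and columns is an illustrative assumption; schemes such as [17] resize tiles adaptively):

```python
# Hedged sketch: split a frame of CTUs into an M x N tile grid
# (M vertical and N horizontal zones), breaking dependencies at tile
# boundaries. Even splitting is an illustrative assumption.
def split(total, parts):
    """Split `total` CTUs into `parts` contiguous, near-even ranges."""
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def tile_grid(ctu_rows, ctu_cols, M, N):
    """Return per-tile (row_range, col_range) pairs for an M x N grid."""
    rows, cols = split(ctu_rows, N), split(ctu_cols, M)
    return [(r, c) for r in rows for c in cols]
```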
Wavefront Parallel Processing (WPP) was proposed in [22] for H.264. In WPP, threads are assigned complete rows; thus, whenever precedence constraints are not fulfilled, a thread pauses execution, which can be wasteful, particularly under limited processing resources. Overcoming this deficiency was the aim of [23], which studied WPP in HEVC. The authors advocated the use of thread pooling and thread assignments consisting of one CTU each time. In case precedence constraints do not allow a thread to proceed, the thread returns to the pool. In this way, waiting overheads are avoided, but other overheads are introduced, because a finishing thread must return to the pool before being assigned another CTU. In [11] Overlapping Wavefront
(OWF) was proposed and evaluated in the decoder side. In WPP a thread finishing its
assigned row returns in case no other row can be assigned. On the other hand, in OWF
it proceeds by processing the first row of the next frame. Expanding the parallelization
potential of wavefront was the aim of [24] and [25] where precedence constraints were
modeled not only within a single frame (as per WPP), but also between a frame and its
reference frames. Thus, the scope of wavefront was expanded to account for paral-
lelism over multiple frames.
We view the work of this paper as complementary to the aforementioned papers. Wavefront can in principle be applied not only on a per-frame basis, but also on a per-tile or per-slice basis. Regardless of the application basis or the parallelization scope (one or more frames), bottlenecks can occur from the enforcement of precedence constraints. Thus, predicting them in advance can contribute to their alleviation through adequate resource allocation and/or task assignment strategies. In this paper we focus on WPP as the baseline wavefront mechanism; nevertheless, our approach is applicable to the other schemes of the literature with only small changes.
We can calculate the delay values using (1) and (2) by iterating on the CTUs of a
frame in raster order, for a time complexity of O(WH).
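Since equations (1) and (2) are not reproduced here, the following is only a hedged sketch of that raster-order pass: under WPP, CTU (r, c) may start once its left neighbor (r, c-1) and top-right neighbor (r-1, c+1) have finished, so finish times (and hence delays) can be accumulated in a single O(WH) sweep over the frame:

```python
# Hedged sketch of the O(WH) raster-order pass: compute per-CTU finish
# times under WPP precedence constraints, given per-CTU compression times
# t[r][c]. CTU (r, c) waits for (r, c-1) and (r-1, c+1).
def wavefront_finish_times(t):
    H, W = len(t), len(t[0])
    finish = [[0.0] * W for _ in range(H)]
    for r in range(H):              # raster order: row by row
        for c in range(W):
            start = 0.0
            if c > 0:
                start = max(start, finish[r][c - 1])
            if r > 0 and c + 1 < W:
                start = max(start, finish[r - 1][c + 1])
            finish[r][c] = start + t[r][c]
    return finish
```

With uniform unit times on a 2 × 2 grid, for instance, the last CTU finishes at time 4, reflecting the serialized precedence chain.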
4 Experiments
4.1 Setup
We performed simulation-based experiments using datasets obtained from real video encodings. The encodings were run on a Linux server with two 12-core Intel Xeon E5-2650 processors running at 2.20 GHz. Class A and B test sequences were used for our evaluation, with characteristics summarized in Table 1. The HM 16.15 reference software for HEVC was used with the following settings: LowDelay (LD) setup [28] with one slice, first
frame I followed by P frames, GOP size was 4, bit depth was 8, CTU size was 64 × 64, max partitioning depth was 4, search mode was TZ and QP was set to 32.
(Charts: actual vs. predicted processing time per row (msec) and actual vs. predicted delay per row (msec), together with their differences, for the PeopleOnStreet, Traffic and ParkScene sequences.)
5 Conclusions
In this paper we tackled the problem of identifying the bottlenecks in wavefront parallel video encoding. The proposed method involves the use of a deep neural network for predicting the compression time of the blocks a frame is split into, and the subsequent calculation of the expected delays due to precedence constraints. Experiments using real datasets obtained from HEVC coding demonstrated that the resulting model can predict wavefront delays with relative accuracy.
Acknowledgments. This research has been co-financed by the European Union and Greek
national funds through the Operational Program Competitiveness, Entrepreneurship and Inno-
vation, under the call RESEARCH-CREATE-INNOVATE (project code: T1EDK-02070).
References
1. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H. 264/AVC video
coding standard. IEEE Trans. Circ. Syst. Video Technol. 13, 560–576 (2003)
2. YouTube, Recommended upload encoding settings. https://support.google.com/youtube/
answer/1722171
3. Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video
coding (HEVC) standard. IEEE Trans. Circ. Syst. Video Technol. 22, 1649–1668 (2012)
4. AV1: Bitstream & Decoding Process Specification (2018). https://aomedia.org/av1-
bitstream-and-decoding-process-specification/
5. Topiwala, P., Krishnan, M., Dai, W.: Performance comparison of VVC, AV1 and HEVC on
8-bit and 10-bit content. In: Applications of Digital Image Processing XLI International
Society for Optics and Photonics, vol. 10752, p. 107520 (2018)
6. VVC: Versatile Video Coding (2018). https://jvet.hhi.fraunhofer.de/
7. Franche, J.F., Coulombe, S.: A multi-frame and multi-slice H. 264 parallel video encoding
approach with simultaneous encoding of prediction frames. In: 2012 2nd International
Conference on Consumer Electronics, Communications and Networks (CECNet), pp. 3034–
3038. IEEE (2012)
8. Lemmetti, A., Koivula, A., Viitanen, M., Vanne, J., Hämäläinen, T.D.: AVX2-optimized
Kvazaar HEVC intra encoder. In: IEEE International Conference on Image Processing
(ICIP), pp. 549–553 (2016)
9. Koziri, M.G., Papadopoulos, P., Tziritas, N., Dadaliaris, A.N., Loukopoulos, T., Khan, S.U.:
Slice-based parallelization in HEVC encoding: realizing the potential through efficient load
balancing. In: 18th IEEE International Workshop on Multimedia Signal Processing
(MMSP), pp. 1–6 (2016)
10. Misra, K.M., Segall, C.A., Horowitz, M., Xu, S., Fuldseth, A., Zhou, M.: An overview of
tiles in HEVC. IEEE J. Sel. Top. Signal Process. 7(6), 969–977 (2013)
11. Chi, C.C., et al.: Parallel scalability and efficiency of HEVC parallelization approaches.
IEEE Trans. Circ. Syst. Video Technol. 22, 1827–1838 (2012)
12. x265 HEVC encoder (2018). http://x265.org
13. Qiu, X., Zhang, L., Ren, Y., Suganthan, P.N., Amaratunga, G.: Ensemble deep learning for
regression and time series forecasting. In: 2014 IEEE Symposium on Computational
Intelligence in Ensemble Learning, pp. 1–6 (2014)
14. HM reference software. http://hevc.hhi.fraunhofer.de
15. Zhao, L., Xu, J., Zhou, Y., Ai, M.: A dynamic slice control scheme for slice-parallel video
encoding. In: ICIP 2012, pp. 713–716 (2012)
16. Shafique, M., Khan, M.U.K., Henkel, J.: Power efficient and workload balanced tiling for
parallelized high efficiency video coding. In: ICIP 2014, pp. 1253–1257 (2014)
17. Storch, I., Palomino, D., Zatt, B., Agostini, L.: Speedup-aware history-based tiling algorithm
for the HEVC standard. In: ICIP 2016, pp. 824–828 (2016)
18. Koziri, M., et al.: Adaptive tile parallelization for fast video encoding in HEVC. In:
Proceedings of the 12th International Conference on Green Computing and Communications
(GreenCom), pp. 738–743 (2016)
19. Koziri, M., et al.: Heuristics for tile parallelism in HEVC. In: 25th European Signal
Processing Conference (EUSIPCO), pp. 1514–1518 (2017)
20. Blumenberg, C., Palomino, D., Bampi, S., Zatt, B.: Adaptive content-based tile partitioning
algorithm for the HEVC standard. In: PCS 2013, pp. 185–188 (2013)
21. Papadopoulos, P.K., Koziri, M.G., Loukopoulos, T.: A fast heuristic for tile partitioning and
processor assignment in HEVC. In: Proceedings of the IEEE International Conference on
Image Processing (2018)
22. Zhao, Z., Liang, P.: Data partition for wavefront parallelization of H. 264 video encoder. In:
Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), p. 4
(2006)
23. Zhao, Y., Song, L., Wang, X., Chen, M., Wang, J.: Efficient realization of parallel HEVC
intra encoding. In: Proceedings of the IEEE International Conference Multimedia and Expo
Workshops (ICMEW), pp. 1–6 (2013)
24. Wang, Z.Y., Dong, S.F., Wang, R.G., Wang, W.M., Gao, W.: Dynamic macroblock
wavefront parallelism for parallel video coding. J. Vis. Commun. Image Represent. 28, 36–
43 (2015)
25. Wen, Z., Guo, B., Liu, J., Li, J., Lu, Y., Wen, J.: Novel 3D-WPP algorithms for parallel
HEVC encoding. In: Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 1471–1475 (2016)
26. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
27. Lv, Y., Duan, Y., Kang, W., Li, Z., Wang, F.Y.: Traffic flow prediction with big data: a deep
learning approach. IEEE Trans. Intell. Transp. Syst. 16(2), 865–873 (2015)
28. Bossen, F.: Common test conditions and software reference configurations. In: Joint
Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, 5th meeting (2011)
29. Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", 4th edn. Morgan Kaufmann, Burlington (2016)
Recognizing Human Actions Using 3D
Skeletal Information and CNNs
1 Introduction
Understanding human actions from video has attracted increasing interest during the last few years. This research area lies in the broader field of human-centered activity recognition and combines ideas and techniques mainly from the fields of computer vision and pattern recognition. There exist several human action understanding tasks. Wang et al. [23] proposed a categorization into the following sub-problems: gesture, action, interaction and group activity recognition. The performance of a gesture requires a relatively small amount of time, while the performance of an action requires a significant amount of time and, contrary to a gesture, typically involves more than one body part. Moreover,
c Springer Nature Switzerland AG 2019
J. Macintyre et al. (Eds.): EANN 2019, CCIS 1000, pp. 511–521, 2019.
https://doi.org/10.1007/978-3-030-20257-6_44
512 A. Papadakis et al.
2 Related Work
As already mentioned in Sect. 1, when working with deep learning approaches, a large-scale multi-class dataset may be the key to effectiveness and robustness. The first publicly available datasets, such as the KTH [19], were limited to a small number of simple actions, e.g., walking, running, hand clapping, etc. Later, datasets such as the Hollywood dataset [11] targeted more realistic human actions, e.g., answer phone, get out of car, hand shake, etc., while still being limited to a small number of classes. In less than a decade, more challenging datasets such as the UCF101 [21] and the HMDB [10] emerged, containing large numbers of more complex actions, including interactions with objects, such as playing cello, horse riding, swing baseball bat, fencing, etc. Recent large-scale datasets such as the PKU-MMD [15] or the NTU [20] are comprised of large numbers of training video and depth sequences.
According to Wang et al. [23] human action recognition tasks may be divided
into two major categories:
– segmented recognition: the given input video sequence contains only the
action to be recognized. This means that any frame before/after the action,
i.e., not depicting a part of the action, has been removed. In this case, Recur-
rent Neural Networks (RNNs) [5] or CNNs [13] are typically used.
– continuous recognition: the goal is to recognize actions within a given
video; the video may or may not depict a single action. In that case, also
known as “online” recognition, RNNs are typically used.
Note that when a CNN is used and the only available motion features are skeletal data, an intermediate visual representation of skeletal sequences is required. This representation should capture both spatial and temporal information regarding the motion of the joints, i.e., in the 3D space over time; this information should be reflected in its color and/or texture properties. In this section our goal is to present research works that are based on visual representations of 3D skeletal data of human actions and on training deep networks, i.e., an intermediate hand-crafted feature extraction step is not included in the process. Skeletal data typically consist of a set of skeletal joints moving in 3D space over time, i.e., three 1D signals are generated per joint, per action. The extraction of joints from video requires depth information.
In the work of Du et al. [4], in order to preserve the spatial information, the set of joints is split into five subsets corresponding to the arms, the legs and the trunk. Pseudo-colored images are generated by mapping the x, y and z spatial coordinates to the R, G and B components, respectively. To preserve temporal information, spatial representations are chronologically arranged. Wang et al. [24] proposed a representation of "joint trajectory maps," wherein the motion direction is encoded by hue. Maps were constructed by appropriately setting saturation and brightness so that texture would correspond to motion magnitude; each was based on the projected trajectory of the skeleton onto a Cartesian plane. Similarly, Hou et al. [6] transformed the extracted skeleton joints into a representation called "skeleton optical spectra," so that hue changes would reflect the temporal variation of skeletal motion. Li et al. [14] proposed the representation of "joint distance maps," and opted for encoding the pair-wise joint distances in the 3 orthogonal 2D planes, using a fourth map to encode distances in the 3D space, while hue was used to encode distance variations. Each map was separately classified and a late fusion scheme was adopted. In an effort for invariance to the initial position and orientation of the skeleton, Liu et al. [16] applied
Fig. 1. (a) A signal image; activity image resulting upon (b) DFT; (c) FFT; (d) DCT; (e) DST. Action is playing with phone/tablet. DFT and FFT images have been processed with log transformation for visualization purposes. Figure best viewed in color. (Color figure online)
respectively. Their discrimination power lies in the fact that their convolutional layers are designed to learn a set of convolutional filters; during training, their parameters are learnt. Neurons are grouped into rectangular grids; each grid performs a convolution on a part of the input image. A pooling layer typically succeeds a single convolutional layer or a set of them, and sub-samples its input to produce a single value from each small rectangular block. Finally, dense layers are those ultimately responsible for classification, based on the features that have been extracted by the convolutional layers and sub-sampled by the pooling ones.
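The pooling operation described above can be sketched as follows (2 × 2 max pooling; the block size and the choice of the maximum are illustrative assumptions):

```python
import numpy as np

# Hedged sketch of pooling: each non-overlapping 2x2 block of the input
# is reduced to a single value (here, its maximum).
def max_pool_2x2(x):
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]                # drop odd remainder rows/cols
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```

For a 4 × 4 input holding the values 0 to 15 in raster order, the result is the 2 × 2 array [[5, 7], [13, 15]].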
4 Experiments
4.1 Dataset
For the experimental evaluation of our approach we used the PKU-MMD dataset
[15]. As it has already been mentioned, PKU-MMD is a large-scale benchmark
1 http://www.numpy.org/.
2 https://www.scipy.org/.
3 https://opencv.org/.
Fig. 3. Examples of activity images from 11 classes and for the 4 transforms used. 1st
row: DFT; 2nd row: FFT; 3rd row: DCT; 4th row: DST. (a) eat meal/snack; (b) falling;
(c) handshaking; (d) hugging other person; (e) make a phone call/answer phone; (f)
playing with phone/tablet; (g) reading; (h) sitting down; (i) standing up; (j) typing
on a keyboard; (k) wear jacket. DFT and FFT images have been processed with log
transformation for visualization purposes. Figure best viewed in color. (Color figure
online)
images from these 11 classes and for all types of transforms. In the second part, we performed experiments with the whole dataset, i.e., with all 51 classes. In both parts, we relied only on the skeletal data, discarding the RGB, depth and infrared information.
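As a hedged sketch of how an activity image such as those in Figs. 1 and 3 may be produced (the signal-image layout and the random stand-in data are illustrative assumptions), a 2D DFT with the log transformation used for visualization looks like:

```python
import numpy as np

# Hedged sketch: build an "activity image" by applying a 2D transform
# (here the DFT) to a signal image of skeletal joint signals, followed by
# the log transformation the figure captions mention for visualization.
# The layout (frames x joint-coordinate signals) is an assumption.
rng = np.random.default_rng(0)
signal_image = rng.random((64, 75))            # e.g., 64 frames x (25 joints * 3 coords)
spectrum = np.abs(np.fft.fft2(signal_image))   # DFT magnitude
activity_image = np.log1p(spectrum)            # log transform for display
```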
4.2 Results
The evaluation protocol we followed is as follows: we first performed experiments
per camera position; in this case both training and testing sets derived from the
same viewpoint. Then, we performed cross-view experiments, where different
viewpoints were used for training and for testing. The goal of these experiments
was to test the robustness of the proposed approach in terms of transformation
(e.g., a translation and a rotation), which could correspond to abrupt viewpoint
changes which typically occur in real-life situations. Finally, we performed cross-subject experiments, where the subjects were split into training and testing groups, i.e., any actor "participated" in only one of the groups. The goal of these experiments was to test the robustness of our approach to intra-class variations. In real-life situations this is expected to happen when a system is trained, e.g., within a laboratory environment, and is deployed into a real ambient-assisted living environment. Note that in all cases we measured classification accuracy. Detailed results are depicted in Tables 1 and 2 for 11 and 51 classes, respectively. As may be observed, in the first case, DST showed the best accuracy for the majority of single- and cross-view experiments, followed by DCT, while in the cross-subject case, DFT showed the best accuracy, once again followed by DCT. In the second case, DCT and DST showed the best accuracy in the majority of cases, apart from the extreme cross-view cases where the left angle was used for training and the right for testing, or vice versa, where DFT showed the best accuracy, followed by FFT.
References
1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2016)
2. Berretti, S., Daoudi, M., Turaga, P., Basu, A.: Representation, analysis, and recog-
nition of 3D humans: a survey. ACM Trans. Multimed. Comput. Commun. Appl.
(TOMM) 14(1S), 16 (2018)
3. Chollet, F.: Keras (2015). https://github.com/fchollet/keras
4. Du, Y., Fu, Y., Wang, L.: Skeleton based action recognition with convolutional neu-
ral network. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR),
pp. 579–583. IEEE (2015)
5. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent
neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing, pp. 6645–6649. IEEE (2013)
6. Hou, Y., Li, Z., Wang, P., Li, W.: Skeleton optical spectra-based action recognition
using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol.
28(3), 807–811 (2018)
7. Jiang, W., Yin, Z.: Human activity recognition using wearable sensors by deep
convolutional neural networks. In: Proceedings of the 23rd ACM International
Conference on Multimedia, pp. 1307–1310 (2015)
8. Ke, Q., An, S., Bennamoun, M., Sohel, F., Boussaid, F.: SkeletonNet: mining deep
part features for 3-D action recognition. IEEE Signal Process. Lett. 24(6), 731–735
(2017)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
10. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video
database for human motion recognition. In: 2011 International Conference on Com-
puter Vision, pp. 2556–2563. IEEE (2011)
11. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human
actions from movies. In: 2008 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1–8. IEEE (2008)
12. Lawton, M.P., Brody, E.M.: Assessment of older people: self-maintaining and
instrumental activities of daily living. Gerontol. 9(3 Part 1), 179–186 (1969)
13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
14. Li, C., Hou, Y., Wang, P., Li, W.: Joint distance maps based action recognition
with convolutional neural networks. IEEE Signal Process. Lett. 24(5), 624–628
(2017)
15. Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: PKU-MMD: a large scale bench-
mark for continuous multi-modal human action understanding. arXiv preprint
arXiv:1703.07475 (2017)
16. Liu, M., Liu, H., Chen, C.: Enhanced skeleton visualization for view invariant
human action recognition. Pattern Recognit. 68, 346–362 (2017)
17. Mathe, E., Mitsou, A., Spyrou, E., Mylonas, Ph.: Arm gesture recognition using a
convolutional neural network. In: Proceedings of International Workshop Semantic
and Social Media Adaptation and Personalization (SMAP) (2018)
18. Mathe, E., Maniatis, A., Spyrou, E., Mylonas, Ph.: A deep learning approach
for human action recognition using skeletal information. In: Proceedings of
World Congress “Genetics, Geriatrics and Neurodegenerative Diseases Research”
(GeNeDiS) (2018)
19. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM app-
roach. In: Proceedings of the 17th International Conference on Pattern Recognition
(ICPR 2004), vol. 03, pp. 32–36. IEEE Computer Society (2004)
20. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for
3D human activity analysis. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1010–1019 (2016)
21. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions
classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn.
Res. 15(1), 1929–1958 (2014)
23. Wang, P., Li, W., Ogunbona, P., Wan, J., Escalera, S.: RGB-D-based human
motion recognition with deep learning: a survey. Comput. Vis. Image Underst.
171, 118–139 (2018)
24. Wang, P., Li, W., Li, C., Hou, Y.: Action recognition based on joint trajectory
maps with convolutional neural networks. Knowl.-Based Syst. 158, 43–53 (2018)
Staircase Detection Using a Lightweight
Look-Behind Fully Convolutional
Neural Network
1 Introduction
Staircases can be found almost everywhere, in different colors, shapes, and sizes, in both
indoor and outdoor environments. Staircases are useful in everyday life; however, they
can also be an obstacle for the navigation of humans with disabilities, as well as for
artificial, robotic agents. Staircase detection can be even more difficult in unknown
environments, where there is no prior knowledge of the surroundings; for the visually
impaired in particular, such environments can become hazardous. Therefore, staircase
detection can be considered an important component of
any system aiming to provide navigational assistance in either indoor or outdoor
environments. In controlled, indoor environments, markers, such as augmented reality
markers can be used to provide high success rate of staircase detection [1]. The
detection problem usually becomes much harder in outdoor, uncontrolled environ-
ments, where different types of staircases of various sizes can be found under various
illumination conditions, and can be observed from different viewpoints.
In this paper we address image-based staircase detection as a pattern recognition
problem in the context of embedded and mobile devices. The main challenge is to be
able to provide sufficient detection accuracy by utilizing the limited computational
resources of such devices, especially in outdoor environments with low latency and
limited network accessibility. To address this challenge, we propose a novel light-
weight Fully Convolutional neural Network (FCN) architecture as a modification of our
recent Look-Behind FCN (LB-FCN) architecture [2]. This novel architecture, named
LB-FCN light, has significantly fewer free parameters and requires fewer Floating
Point Operations (FLOPs) compared to the previous LB-FCN and state-of-the-art
architectures for mobile devices. This was achieved by implementing depthwise separable
convolutions throughout the convolutional layers of the network. The architecture also
enables multi-scale feature extraction and residual learning, making it suitable for
multi-scale staircase detection in both indoor and outdoor environments. To evaluate
the performance of LB-FCN light, we created a weakly labeled image dataset of
staircases found in natural images collected from publicly available datasets, i.e., a
dataset in which images are semantically labeled as containing or not containing staircases.
The rest of the paper consists of four sections. In Sect. 2 the related work focusing
on staircase detection is presented. In Sect. 3 we describe the proposed architecture and
its advantages. In Sect. 4 we describe our weakly annotated staircase dataset, and the
results of the experiments performed. The last section summarizes the conclusions that
can be derived from this study along with our plans for future work.
2 Related Work
Staircase detection has been an active research topic in computer vision and robotics,
attracting increasing interest in the current era of ubiquitous computing and pervasive
intelligence. One of the first relevant works [3] was based on Gabor filters and
concurrent line grouping for distant and close staircase detection,
respectively. In the context of autonomous vehicle navigation, an outdoor descending
staircase detection algorithm was presented by [4], based on texture energy, optical
flow, and scene geometry features. In the context of computer-aided navigation of the
visually impaired in outdoor environments using a wearable stereo camera, [5] utilized
Haar features and AdaBoost learning, providing real-time detection performance.
A similar approach that utilizes Haar-like features and an improved staircase specific
Viola-Jones detector was proposed in [6].
Frequency-domain features obtained by ultrasonic sensors were investigated in [7]
to detect and recognize floors and staircases with an electronic white cane. A wearable
RGB-D camera mounted on the chest of a visually impaired individual was used in [8],
where an approach for indoor staircase detection and modeling was proposed. Their
approach is capable of providing information about the presence and location of
staircases, along with their number of steps. Recently, an indoor staircase detection framework was
proposed in [9], utilizing depth images, capable of running on mobile devices. The
approach is based on the detection and clustering of image patches that have the surface
vectors pointing to the top direction. In addition, information from the Inertial Mea-
surement Unit (IMU) sensor of the device is used to calibrate the surface vectors with
the camera orientation. Most of the current staircase detection approaches are super-
vised, requiring fully annotated training images from controlled environments, i.e.,
images indicating the location of the staircases within the images. Furthermore, to the
524 D. E. Diamantis et al.
best of our knowledge, staircase detection has not been previously investigated to a
sufficiently generic extent.
Although deep learning, and more specifically Convolutional Neural Networks
(CNNs) [10], has demonstrated impressive performance in computer vision applications,
especially in natural image classification [11], CNN-based staircase detection approaches
have not been previously reported. While effective, conventional deep CNNs
such as [12] suffer from high computational complexity, mainly due to their large
number of free parameters. As a result, high-end computational equipment such as
Graphical Processing Units (GPUs) is needed at both training and testing time,
limiting their use to indoor workstations. Recent studies such as [13–15] focus on
reducing the computational complexity of CNN architectures, aiming to enable
their usage in mobile and embedded devices. In this context, the tradeoff between
computational efficiency and detection performance has been investigated, resulting in
a state-of-the-art architecture called MobileNet-v2 [16], extending the original
MobileNet-v1 proposed in [14]. More specifically, this architecture keeps the basic
principles of depthwise convolutions from the original design and enhances it by adding
linear bottleneck layers and shortcut connections between the bottlenecks. Linear
bottleneck layers were adopted because experimental evidence indicated that non-linear
ones were damaging the features extracted between the bottlenecks. As a result of these
changes, the architecture contains 30% fewer parameters than MobileNet-v1 while
providing higher accuracy. Recently, we presented the LB-FCN [2] architecture in the context of
abnormality detection in medical images. The architecture featured multi-scale feature
extraction modules composed of conventional convolutional layers, to better represent
the different scales of abnormalities. In addition, look-behind connections were used,
which connect the input features to the output of each multi-scale feature extraction
module. These connections allow high-level features to propagate throughout the
network, enabling it to converge faster and increasing the overall detection
accuracy.
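The effect of a look-behind connection can be sketched in isolation as follows; plain matrix products stand in for the convolutional branches, and all names and sizes are illustrative, not the actual LB-FCN configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def multiscale_block(x, w3, w5):
    """Toy multi-scale feature extraction with a look-behind (residual) connection.

    Two parallel branches (plain matrix products standing in for 3x3 and 5x5
    convolutions) are fused, and the block input is added back to the output,
    so high-level features propagate unchanged through the network.
    """
    branch_a = relu(x @ w3)      # stand-in for a 3x3 convolution branch
    branch_b = relu(x @ w5)      # stand-in for a 5x5 convolution branch
    fused = branch_a + branch_b  # multi-scale feature fusion
    return fused + x             # look-behind connection (identity shortcut)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = multiscale_block(x, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
```

With the branch weights set to zero, the block reduces to the identity, which is precisely why such shortcuts let features propagate and ease convergence.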
The core of the LB-FCN light architecture is inspired by LB-FCN [2] and includes
modifications that enable efficient computation on mobile and embedded devices, while
providing sufficient staircase detection accuracy. More specifically, LB-FCN light
extends the original LB-FCN design by replacing the multi-scale conventional
convolutional layers with depthwise convolutional layers [17]. Key features of this
architecture include multi-scale depthwise separable convolution layers [17] and
residual learning [18] connections, which help maintain a relatively low number of
free parameters without sacrificing detection accuracy.
3 Architecture
The design of the LB-FCN light architecture follows the FCN [19] network design,
where only convolutional layers are utilized throughout the network. By replacing the
fully connected layers, usually found in the classification layer of conventional CNN
architectures such as [11, 12], a significant reduction of the number of free parameters of
the architecture can be achieved. Inspired by the MobileNet architecture, proposed in
[14], depthwise separable convolutions [17] are implemented throughout the network
Fig. 2. The complete LB-FCN light architecture composed of four multi-scale blocks and three
residual connections.
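The parameter savings that motivate the use of depthwise separable convolutions in Sect. 3 can be illustrated with a back-of-the-envelope count; the layer sizes below are illustrative, not the actual LB-FCN light configuration:

```python
def standard_conv_params(k, c_in, c_out):
    # A standard convolution learns one k x k x c_in kernel per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k kernel per input channel,
    # followed by a 1 x 1 pointwise convolution mixing the channels.
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3 x 3 kernels, 128 input and 128 output channels.
std = standard_conv_params(3, 128, 128)        # 147,456 parameters
sep = depthwise_separable_params(3, 128, 128)  # 17,536 parameters
print(f"reduction factor: {std / sep:.1f}x")   # roughly 8x for this layer
```

The reduction factor grows with the kernel size and the number of output channels, which is why replacing every convolutional layer compounds into the order-of-magnitude savings reported in Table 2.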
4.1 Dataset
To evaluate the performance of the proposed architecture in the context of natural
image staircase detection we have considered two publicly available datasets. The first
dataset, named LM+Sun [21], is a fully annotated natural image dataset obtained from
the combination of LabelMe Database [22] and SUN dataset [23]. The dataset consists
of 45,676 images from 232 categories, captured in indoor and outdoor environments
under various conditions and at various sizes. For the purposes of our experiment, we
utilized a subset of the
LM+Sun dataset which includes natural images found in urban and street areas. While
the full LM+Sun dataset contains 314 staircase labeled images, most of them are found
in indoor environments. Images containing staircases were also found in the urban and
street subsets of this dataset, e.g., staircases of buildings that can be directly recognized
by a human observer, considering: (a) staircases that have at least two steps, and
(b) staircases covering >15% of the image (in staircases of smaller coverage the steps
are not distinguishable; therefore, they cannot be perceived directly as such, without
contextual information). To minimize the possibility of human error in the annotation
process, two reviewers separately reviewed and annotated the dataset, finding in
total 245 images that include outdoor staircases. To further increase the number of
outdoor staircase images, we have created a second dataset named “StairFlickr” which
extends LM+Sun staircases with a total of 524 outdoor staircase images. StairFlickr
dataset images were obtained from the popular photo management and sharing web
application Flickr [24].
Fig. 3. Top: staircases found in StairFlickr dataset. Middle: staircases found in LM+Sun dataset.
Bottom: non-staircases images from LM+Sun dataset.
For the purposes of our research, we omitted the fully annotated metadata provided
about the staircases in the original LM+Sun dataset. This was done because our
architecture targets staircase detection using solely weakly labeled natural images. In
total, the described dataset includes 5,539 images, of which 1,083 contain
staircases1. Indicative images from this dataset are illustrated in Fig. 3. As can be
observed, the dataset includes various types of staircases in various positions and
sizes, captured from different viewpoints.
ACC = (TP + TN) / (TP + TN + FP + FN) (1)

SPC = TN / (TN + FP) (2)

TPR = TP / (TP + FN) (3)
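The metrics of Eqs. (1)–(3) can be computed directly from the confusion-matrix counts; a minimal sketch with illustrative counts, not the paper's results:

```python
def accuracy(tp, tn, fp, fn):
    # ACC: proportion of all images classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def specificity(tn, fp):
    # SPC: proportion of negative (non-staircase) images correctly rejected.
    return tn / (tn + fp)

def sensitivity(tp, fn):
    # TPR: proportion of positive (staircase) images correctly detected.
    return tp / (tp + fn)

# Illustrative counts only.
tp, tn, fp, fn = 90, 85, 15, 10
print(accuracy(tp, tn, fp, fn), specificity(tn, fp), sensitivity(tp, fn))
```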
4.3 Results
We trained the LB-FCN light architecture using the images from both Flickr and
LM+Sun datasets. As the images differ in both size and aspect ratio, we rescaled them
to the standardized input size of the network, which is 224 × 224
pixels. To maintain the original aspect ratio of the images, they were padded with zeros
to match the network’s input dimensions. It is worth mentioning that no further pre-
processing step was applied to the images. As the proposed architecture focuses on
weakly labeled images, the detailed annotations for the staircases provided by LM+Sun
[21] dataset were ignored. We utilized only the semantic annotations of the images
which indicate the presence or absence of staircases.
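The zero-padding and rescaling step described above can be sketched as follows; this is a simplified NumPy version, with a nearest-neighbour resizer standing in for whatever library function the authors actually used:

```python
import numpy as np

def pad_to_square(image):
    """Zero-pad an H x W x C image to a square, preserving the aspect ratio."""
    h, w, c = image.shape
    side = max(h, w)
    canvas = np.zeros((side, side, c), dtype=image.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas

def resize_nearest(image, size):
    """Nearest-neighbour resize to size x size (a stand-in for a library resizer)."""
    h, w, _ = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

# Toy 300 x 480 RGB image; padding makes it 480 x 480, then it is resized to 224 x 224.
img = np.random.randint(0, 256, (300, 480, 3), dtype=np.uint8)
net_input = resize_nearest(pad_to_square(img), 224)
print(net_input.shape)  # (224, 224, 3)
```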
For the training of the network we utilized the Adam [26] optimizer with initial
learning rate alpha = 0.001 and exponential decay rates beta1 = 0.9 and beta2 = 0.999
for the first- and second-moment estimates, respectively.

1 A link to the dataset will be provided in the final manuscript.

For the implementation of the architecture we utilized the Python Keras [27] library
and the TensorFlow [28] tensor graph
framework. The network was trained with a mini-batch size of 32 samples on an
NVIDIA TITAN X GPU, equipped with 3584 CUDA [29] cores, 12 GB of RAM, and a
base clock speed of 1417 MHz. On each fold we utilized the early-stopping technique,
where a small subset of the training fold was utilized as a validation dataset.
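For illustration, a single Adam update with the stated hyperparameters can be sketched in NumPy; the paper uses the Keras implementation, and the parameter and gradient values here are toy:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update step (Kingma & Ba) with the paper's hyperparameters."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.5, -0.5])
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)  # on the first step each parameter moves ~alpha against its gradient sign
```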
To evaluate the effectiveness of the LB-FCN light architecture, in terms of both
detection accuracy and computational complexity reduction, we used MobileNet-v2 [16]
as a state-of-the-art architecture for comparison. The results obtained by the two
architectures are illustrated in Table 1. The confusion matrix of LB-FCN light classification
performance is illustrated in Table 3.
While the detection performance is slightly higher in the case of LB-FCN light, the
most noticeable difference between the two architectures lies in their computational
complexity requirements. Table 2 includes a comparison between the architectures in
terms of both the number of trainable free parameters and the total number of required
FLOPs. The improvements made on the original LB-FCN design resulted in a
significant reduction of the overall number of FLOPs, from 1.3 × 10^7 down to
0.6 × 10^6, and of the free parameters of the network, from 8.2 × 10^6 down to
0.3 × 10^6, respectively.
5 Conclusions
We proposed a novel lightweight multi-scale FCN architecture that copes with the
problem of staircase detection in natural images. To evaluate the performance of the
architecture we extended the LM+Sun [21] natural image dataset with staircase images
obtained from the Flickr [24] photo sharing service. To the best of our knowledge, no
existing work in this field utilizes solely weakly labeled images to detect staircases in
natural images. The key features of the proposed LB-FCN light
architecture can be summarized as follows:
• It has a relatively low number of free parameters and requires a correspondingly
low number of FLOPs, which makes it suitable for mobile and embedded devices;
• Its multi-scale feature extraction design allows the architecture to detect staircases
of various sizes and under the difficult conditions found in natural images;
• Following the FCN [19] architecture approach, it offers a lightweight and logically
unified design;
• Compared to the MobileNet-v2 [16] network, the proposed architecture requires
fewer FLOPs and free parameters while offering slightly higher detection
performance. This makes it attractive for lower-end mobile and embedded devices.
In our future work, we plan to evaluate the performance of the proposed
architecture on larger weakly labeled staircase natural image datasets, to further explore
its potential. Furthermore, we plan to extend LB-FCN light to include the localization
of staircases within the images, by following a weakly supervised approach.
Acknowledgments. This research has been co-financed by the European Union and Greek
national funds through the Operational Program Competitiveness, Entrepreneurship and Inno-
vation, under the call RESEARCH – CREATE – INNOVATE (project code:T1EDK-02070). It
was also supported by the Onassis Foundation - Scholarship ID: G ZO 004-1/2018-2019. The
Titan X used for this research was donated by the NVIDIA Corporation.
References
1. Yu, X., Yang, G., Jones, S., Saniie, J.: AR marker aided obstacle localization system for
assisting the visually impaired. In: IEEE International Conference on Electro/Information
Technology (EIT), pp. 271–276 (2018)
2. Diamantis, D.E., Iakovidis, D.K., Koulaouzidis, A.: Look-behind fully convolutional neural
network for computer-aided endoscopy. Biomed. Signal Process. Control 49, 192–201
(2019)
3. Se, S., Brady, M.: Vision-based detection of staircases. In: Fourth Asian Conference on
Computer Vision ACCV, vol. 1, pp. 535–540 (2000)
4. Hesch, J.A., Mariottini, G.L., Roumeliotis, S.I.: Descending-stair detection, approach, and
traversal with an autonomous tracked vehicle. In: 2010 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pp. 5525–5531 (2010)
5. Lee, Y.H., Leung, T.-S., Medioni, G.: Real-time staircase detection from a wearable stereo
system. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp. 3770–
3773 (2012)
6. Maohai, L., Han, W., Lining, S., Zesu, C.: A robust vision-based method for staircase
detection and localization. Cogn. Process. 15(2), 173–194 (2014)
7. Bouhamed, S.A., Kallel, I.K., Masmoudi, D.S.: Stair case detection and recognition using
ultrasonic signal. In: 2013 36th International Conference on Telecommunications and Signal
Processing (TSP), pp. 672–676 (2013)
8. Pérez-Yus, A., López-Nicolás, G., Guerrero, J.J.: Detection and modelling of staircases
using a wearable depth sensor. In: Agapito, L., Bronstein, Michael M., Rother, C. (eds.)
ECCV 2014. LNCS, vol. 8927, pp. 449–463. Springer, Cham (2015). https://doi.org/10.
1007/978-3-319-16199-0_32
9. Ciobanu, A., Morar, A., Moldoveanu, F., Petrescu, L., Ferche, O., Moldoveanu, A.: Real-
time indoor staircase detection on mobile devices. In: 2017 21st International Conference on
Control Systems and Computer Science (CSCS), pp. 287–293 (2017)
10. LeCun, Y., et al.: LeNet-5, convolutional neural networks, p. 20 (2015). http://yann.lecun.
com/exdb/lenet
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105
(2012)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition, arXiv preprint arXiv:1409.1556 (2014)
13. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and < 0.5 mb model size, arXiv preprint
arXiv:1602.07360 (2016)
14. Howard, A.G., et al.: Mobilenets: efficient convolutional neural networks for mobile vision
applications, arXiv preprint arXiv:1704.04861 (2017)
15. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional
neural network for mobile devices. ArXiv e-prints, July 2017
16. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: Mobilenetv2: Inverted
residuals and linear bottlenecks. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 4510–4520 (2018)
17. Chollet, F.: Xception: deep learning with depthwise separable convolutions. arXiv
preprint arXiv:1610.02357 (2017)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–
778 (2016)
19. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all
convolutional net. arXiv preprint arXiv:1412.6806 (2014)
20. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a
simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–
1958 (2014)
21. Tighe, J., Lazebnik, S.: SuperParsing: scalable nonparametric image parsing with
superpixels. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol.
6315, pp. 352–365. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15555-
0_26
22. Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: a database and web-
based tool for image annotation. Int. J. Comput. Vision 77(1–3), 157–173 (2008)
23. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene
recognition from abbey to zoo. In: IEEE Conference on 2010 Computer Vision and Pattern
Recognition (CVPR), pp. 3485–3492 (2010)
24. Flickr Inc.: Find your inspiration | Flickr (2019)
25. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:
1412.6980 (2014)
1 Introduction
employing hard bounds [15], e.g., an obstacle is considered a high-risk threat within
a specified distance. The human eye fixation map produced by the GAN, combined
with the fuzzy interpretation of the depth values, produces perceptually meaningful
and interpretable information for the user. Moreover, compared with the machine learning-
based obstacle detection methods mentioned above, our method does not require any
training for the obstacle detection part; the only training involved is that of the
saliency prediction, based on human eye fixation data.
The rest of this paper consists of three sections. Section 2 describes the proposed
methodology and Sect. 3 describes the experiments and the obtained results. The last
section summarizes the conclusions of our study.
2 Methodology
The proposed method consists of two components: saliency map generation using a
GAN trained on human eye fixations, called SalGAN [16], and a fuzzy set-based
approach mapping the 3D spatial information acquired by an RGB-D sensor to risk
values for a possible obstacle threat.
was replaced with a sigmoid activation function. The weights of the decoder were
randomly initialized. The discriminator CNN architecture consists of six 3 × 3
convolution filters with three pooling layers, followed by three fully connected (FC)
layers. The activation
functions for the convolutional and FC layers are the ReLU and hyperbolic tangent
function (tanh), respectively. The exception is the final layer, which uses the sigmoid
activation function. Figure 1 illustrates the architecture of the saliency map generator.
The convolutional layers of the encoder are depicted with a blue colour and the max-
pooling layers with red; the convolutional layers of the decoder are depicted with green
colour, the up-sampling layers with orange and the final convolutional layer with the
sigmoid activation with grey colour. An example of a saliency map produced by the
generative model is illustrated in Fig. 2.
Fig. 1. The SalGAN architecture for saliency map generation. (Color figure online)
Fig. 2. Illustration of the generated saliency map from an input image. (a) Input image. (b) The
generated saliency map.
The universe of discourse for these fuzzy sets is the range of values of the depth maps
produced by the RGB-D sensor. The fuzzy sets D1 and D2, as well as D2 and D3,
overlap with each other, reflecting the uncertainty in assessing an obstacle threat as
high or medium, and as medium or low, respectively, depending on its distance from
the user. The respective membership functions di(z), i = 1, 2, 3, are illustrated in
Fig. 3(a), where z is a value of the estimated depth map.
For the spatial localization of an obstacle in the image plane, we constructed six
additional fuzzy sets: H1, H2 and H3 for the horizontal axis, corresponding to the
left, central and right parts of the image, and V1, V2 and V3 for the vertical axis,
corresponding to the top, central and bottom parts of the image. The respective
membership functions hi(x) and vi(y) are illustrated in Fig. 3(b), (c), where x ∈ [0, 1]
and y ∈ [0, 1] are the horizontal and vertical coordinates within an image, normalized by
image width and height, respectively.
Fig. 3. Membership functions of fuzzy sets used for the localization of objects in the 3D space
using linguistic variables. (a) Membership functions for high (d1), medium (d2), and low risk (d3)
upon the distance of the user from an obstacle. (b) Membership functions for left (h1), central (h2)
and right (h3) positions on the horizontal axis. (c) Membership functions for up (v1), central (v2)
and bottom (v3) positions on the vertical axis.
Once we establish the membership functions for each fuzzy set, we create three
different risk maps, DiM based on the depth values and the membership functions of
each fuzzy set Di. Each risk map consists of the responses of a membership function
given a depth value z, and can be formally expressed as

DiM(x, y) = di(Mz(x, y)), i = 1, 2, 3 (1)

where Mz is the depth map of an RGB image IRGB. From Eq. (1) a total of three risk
maps is derived, each representing regions with a degree of risk that an object may
pose a threat to a person navigating within its range. A visual representation of the risk maps
can be seen in Fig. 4. Figure 4(a) illustrates the depth map corresponding to Fig. 2(a),
where the dark pixel values represent lower depth values (nearest distances) and the
brighter pixel values represent higher depth values (further distances). Figures 4(b)-(d)
are illustrations of the different risk maps produced by Eq. (1) using membership
functions di(z), i = 1, 2, 3. The brighter pixel values of the risk maps indicate higher
membership in the D1, D2 and D3 fuzzy sets, respectively.
Fig. 4. A graphical representation of the risk maps. (a) Depth map Mz of the image in Fig. 2(a).
(b) High-risk map D1M . (c) Medium-risk map D2M . (d) Low-risk map D3M .
where F1, F2 are two 2D fuzzy maps, with values within [0, 1], and x, y represent the
coordinates of each value within the 2D map. A risk map DiM and a (normalized)
saliency map SM can be considered as such fuzzy maps, since the risk maps are
generated by the responses of fuzzy membership functions, and saliency maps indicate
Obstacle Detection Based on Generative Adversarial Networks and Fuzzy Sets 539
the degree to which a pixel belongs to a salient region. Thus, Eq. (3) gives a new
image:
Fig. 5. A visual representation of the Oiobstacle. (a) The original IRGB image. (b) Image O1obstacle.
(c) Image O2obstacle. (d) Image O3obstacle.
540 G. Dimas et al.
For the validation of our method, we captured 10,170 frames in total. The videos were
captured using an RGB-D sensor, namely the state-of-the-art Intel® RealSense™
D435. The size of this sensor is conveniently small (90 × 25 × 25 mm), so it can be
attached to a wearable system for the assistive navigation of visually impaired
individuals (VII). It enables 3D depth sensing with a maximum range of 10 m. The
D435 sensor has two infrared (IR) cameras that enable stereoscopic vision, an IR
projector, and a high-resolution RGB camera. The IR projector improves the depth
estimation of the stereoscopic system by projecting a static IR pattern on the scene,
enriching the texture of low-texture scenes.
Fig. 7. Example detection of high risk obstacles (nearest crowd and tree branches). (a) Original
RGB image. (b) The corresponding image O1obstacle obtained by using membership
function d1.
Fig. 8. A qualitative comparison of the fuzzy approach. (a) IRGB input image. (b) The hard-
thresholded saliency map overlaid on the input image. (c) Image O1obstacle obtained by the proposed
methodology. (d) Obstacle mask after hard thresholding of the depth map corresponding to the
region of interest defined in (b).
The application of the proposed methodology for obstacle detection on the available
dataset resulted in an accuracy of 88.1%, with a specificity of 85.9% and a sensitivity
of 90.1%. The classification results are summarized in Table 1. We compared our
approach with a state-of-the-art methodology that employs hard thresholding. For a
threshold at 3 m, the hard thresholding method produced an accuracy of 81.1%, with
a specificity of 79.1% and a sensitivity of 82.9%, whereas further reducing the
threshold degraded the detection performance.
A qualitative comparison of the two approaches is illustrated in Fig. 8. Figure 8(b)
illustrates the hard-thresholded saliency map overlaid to the input image. It can be
noticed that the rock at the bottom right is not included in the region of interest defined
by the saliency map. This region of interest is used to isolate a respective region of the
depth map, which is subsequently hard-thresholded to obtain possibly threatening
obstacle regions. The result of this process is illustrated in Fig. 8(d), where the rock is
falsely not considered a threat. In contrast, the image O1obstacle in Fig. 8(c), obtained
by the proposed methodology, includes the rock. This is due to the fuzzy fusion
applied between the saliency map and the risk map: the fusion can assign a certain
degree of risk to a region that SalGAN does not consider sufficiently salient, thanks to
the uncertainty-aware evaluation of the depth map using fuzzy sets. Figure 9
illustrates a qualitative comparison of O1obstacle obtained using various T-norms and
S-norms, including the fuzzy AND (Fig. 9(a)), OR (Fig. 9(b)) and SUM (Fig. 9(c))
operators. It can be observed that the fuzzy AND operation produces the most
reliable results.
Fig. 9. An illustration of O1obstacle using different fuzzy operations between the saliency map SM
and the risk map D1M. (a) Fuzzy AND. (b) Fuzzy OR. (c) Fuzzy SUM.
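The three fusion operators compared in Fig. 9 correspond to standard fuzzy T-norm/S-norm choices; a minimal sketch with toy maps, not the paper's data:

```python
import numpy as np

def fuzzy_and(f1, f2):
    return np.minimum(f1, f2)   # minimum T-norm

def fuzzy_or(f1, f2):
    return np.maximum(f1, f2)   # maximum S-norm

def fuzzy_sum(f1, f2):
    return f1 + f2 - f1 * f2    # probabilistic (algebraic) sum S-norm

# Toy risk map D1M and normalized saliency map SM, values in [0, 1].
d1m = np.array([[0.9, 0.2], [0.6, 0.0]])
sm = np.array([[0.8, 0.7], [0.1, 0.5]])
o1 = fuzzy_and(d1m, sm)  # high response only where both risk and saliency are high
```

The fuzzy AND, being a conjunction, suppresses regions that are salient but distant or near but non-salient, which is consistent with it producing the most reliable results in Fig. 9.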
4 Conclusions
In this work we presented a novel methodology for obstacle detection in the context of
the safe navigation of VII. It is based on a GAN that produces saliency maps from
human eye fixations. These maps are co-evaluated with the depth information obtained
by an RGB-D camera to assess the risk posed by an obstacle. Fuzzy sets were used to
translate the values of the depth maps into risk levels, and a fuzzy fusion of depth and
saliency information was applied to enable enhanced detection of obstacles.
The proposed fuzzy set-based approach demonstrated more robust performance
than the current approach based on hard thresholding. In addition, it enables the
translation of 3D spatial information into linguistic values easily comprehensible
by VII.
As future work, we intend to extend this methodology by creating multidimensional
membership functions for risk-level interpretation based on the speed and location of
an obstacle. We also intend to investigate various fuzzy set-based
approaches to image fusion, and to create an extended dataset that will be made
publicly available, to foster research in this domain.
Acknowledgments. This research has been co-financed by the European Union and Greek
national funds through the Operational Program Competitiveness, Entrepreneurship and Inno-
vation, under the call RESEARCH – CREATE – INNOVATE (project code:T1EDK-02070).
The Titan X used for this research was donated by the NVIDIA Corporation.
References
1. Rodríguez, A., Bergasa, L.M., Alcantarilla, P.F., Yebes, J., Cela, A.: Obstacle avoidance
system for assisting visually impaired people. In: Proceedings of the IEEE Intelligent
Vehicles Symposium Workshops, Madrid, Spain, p. 16 (2012)
2. Iakovidis, D.K., Diamantis, D., Dimas, G., Ntakolia, C., Spyrou, E.: Digital enhancement of
cultural experience and accessibility for the visually impaired. In: Paiva, S. (ed.) Improved
Mobility for the Visually Impaired. Springer (2019, to appear)
3. Brassai, S.T., Iantovics, B., Enachescu, C.: Optimization of robotic mobile agent navigation.
Stud. Inform. Control 21, 403–412 (2012)
4. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with
region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–
99 (2015)
5. Kaur, B., Bhattacharya, J.: A scene perception system for visually impaired based on object
detection and classification using multi-modal DCNN. arXiv preprint arXiv:1805.08798
(2018)
6. Tapu, R., Mocanu, B., Zaharia, T.: DEEP-SEE: joint object detection, tracking and
recognition with application to visually impaired navigational assistance. Sensors 17, 2473
(2017)
7. Suresh, A., Arora, C., Laha, D., Gaba, D., Bhambri, S.: Intelligent smart glass for visually
impaired using deep learning machine vision techniques and robot operating system (ROS).
In: Kim, J.-H., Myung, H., Kim, J., Xu, W., Matson, E.T., Jung, J.-W., Choi, H.-L. (eds.)
RiTA 2017. AISC, vol. 751, pp. 99–112. Springer, Cham (2019). https://doi.org/10.1007/
978-3-319-78452-6_10
8. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time
object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 779–788 (2016)
9. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N.,
Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-46448-0_2
10. Poggi, M., Mattoccia, S.: A wearable mobility aid for the visually impaired based on
embedded 3D vision and deep learning. In: 2016 IEEE Symposium on Computers and
Communication (ISCC), pp. 208–213 (2016)
11. Lee, C.-H., Su, Y.-C., Chen, L.-G.: An intelligent depth-based obstacle detection system for
visually-impaired aid applications. In: 2012 13th International Workshop on Image Analysis
for Multimedia Interactive Services, pp. 1–4. IEEE (2012)
12. Song, H., Liu, Z., Du, H., Sun, G.: Depth-aware saliency detection using discriminative
saliency fusion. In: 2016 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 1626–1630. IEEE (2016)
13. Mancini, M., Costante, G., Valigi, P., Ciarfuglia, T.A.: J-MOD 2: joint monocular obstacle
detection and depth estimation. IEEE Robot. Autom. Lett. 3, 1490–1497 (2018)
14. Heinrich, S.: Fast obstacle detection using flow/depth constraint. In: 2002 Intelligent Vehicle
Symposium, pp. 658–665. IEEE (2002)
15. Chen, L., Guo, B., Sun, W.: Obstacle detection system for visually impaired people based on
stereo vision. In: 2010 Fourth International Conference on Genetic and Evolutionary
Computing, pp. 723–726. IEEE (2010)
16. Pan, J., et al.: SalGAN: visual saliency prediction with generative adversarial networks.
arXiv preprint arXiv:1701.01081 (2017)
17. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information
Processing Systems, pp. 2672–2680 (2014)
18. Bylinskii, Z., Recasens, A., Borji, A., Oliva, A., Torralba, A., Durand, F.: Where should
saliency models look next? In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV
2016. LNCS, vol. 9909, pp. 809–824. Springer, Cham (2016). https://doi.org/10.1007/978-
3-319-46454-1_49
19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
20. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale
hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 248–255. IEEE (2009)
21. Nguyen, H.T., Walker, C.L., Walker, E.A.: A First Course in Fuzzy Logic. CRC Press, Boca
Raton (2018)
Author Index