
Climate tipping points: Detection and analysis of patterns using an ordinal regression approach

Final Report

Authors: P.A. Gutiérrez1, C. Hervás-Martínez1, F. Fernández-Navarro2, I. Dicaire2, A. Nikolaou2, M. Pérez-Ortiz1, J. Sánchez-Monedero1
Affiliation: 1University of Córdoba, 2ESA ACT

Date: 21/04/2014

Contacts:

Pedro Antonio Gutiérrez


Tel: (+34) 957218153
Fax: (+34) 957218630
e-mail: otri@uco.es, pagutierrez@uco.es

Leopold Summerer (Technical Officer)


Tel: +31(0)715654192
Fax: +31(0)715658018
e-mail: act@esa.int

Ariadna ID: 13-9202


Ariadna study type: Standard
Contract Number: 4000108222/13/NL/MV
Available on the ACT website: http://www.esa.int/act
Ariadna AO/1-7415/13/NL/KML: Climate tipping points: Detection and analysis of patterns using an ordinal regression approach [13-9202]

Gutiérrez, Pedro Antonio1, Hervás-Martínez, César1, Fernández-Navarro, Francisco2, Dicaire, Isabelle2, Nikolaou, Athanasia2, Pérez-Ortiz, María1 and Sánchez-Monedero, Javier1

1: Department of Computer Science and Numerical Analysis, University of Córdoba, Spain. pagutierrez@uco.es, chervas@uco.es, i82perom@uco.es, jsanchezm@uco.es
2: Advanced Concepts Team, European Space Research and Technology Centre (ESTEC), European Space Agency (ESA), Noordwijk, Netherlands. isabelle.dicaire@esa.int, athanasia.nikolaou@esa.int, francisco.fernandez.navarro@esa.int, i22fenaf@uco.es
Contents

1 Introduction

2 Detection of early warning signals in paleoclimate data using a genetic time-series segmentation algorithm
  2.1 Segmentation Algorithm
    2.1.1 Mathematical description of the segmentation problem
    2.1.2 General overview of the segmentation algorithm
    2.1.3 Chromosome representation
    2.1.4 Initial population
    2.1.5 Segment characteristics
    2.1.6 Clustering: k-means Algorithm
    2.1.7 Fitness
    2.1.8 Selection and replacement processes
    2.1.9 Mutation Operator
    2.1.10 Crossover Operator
  2.2 Experiments
    2.2.1 Climate datasets
    2.2.2 Algorithm parameters
  2.3 Results
  2.4 Discussion
  2.5 Additional details about the Segmentation Algorithm
    2.5.1 Generation of each individual for the initial population
    2.5.2 Application of the k-means algorithm
  2.6 Additional Examples of Segmentation for the GISP2 and NGRIP datasets

3 Alternative fitness functions and a new evaluation method
  3.1 Measuring the quality of the clustering process
  3.2 Automatic evaluation method and experimental setting
    3.2.1 Experimental setting
    3.2.2 Automatic evaluation method
  3.3 Results

4 Time Series Forecasting by Evolutionary Recurrent Product Unit Neural Networks
  4.1 Models
    4.1.1 Short memory model: Autoregressive Product Unit Neural Network (ARPUNN)
    4.1.2 Long memory model: Recurrent Product Unit Neural Network (RPUNN)
  4.2 Parameter Estimation
  4.3 Experiments
    4.3.1 Dataset Selected
    4.3.2 Metrics Considered for Evaluation
    4.3.3 Algorithms Selected for Comparison Purpose
  4.4 Results

5 Conclusions
Chapter 1

Introduction

Climate variability as it was regionally manifested during the past millennia is documented in paleoclimatic proxy data extracted from various spots around the globe. This information is, however, far from straightforward to extract: the derived time series are obscured by noise, by error margins in time resolution that translate into uncertain timing of events, and by an unknown number of internal and external modes that interact cumulatively to form the observed behaviour (Kanner et al., 2012; Crucifix, 2012). This elusive internal variability is consequently under-represented in current climate models. Predicting future climate transitions from model output therefore still has a long way to go, and that way passes through the challenging task of thoroughly analysing past climate transitions. In response to this challenge, the development of suitable tools for the statistical analysis of paleoclimate time series has seen a boost in recent years.
Certain events stand out in a paleoclimate proxy time series because of their abrupt character and discontinuity. Some of them are encountered in more than one proxy. The combined study of proxies from different spots on Earth helps infer the global or regional character of events encountered in the time series, or phase differences of the same event across the globe, which makes their timing precision important. The comparison between the Greenland ice cores (GISP: Dansgaard et al., 1969; Johnsen et al., 1972) and Antarctic ice cores (Tabacco et al., 1998; Schwander et al., 2001) leads to the hypothesis of the bipolar see-saw effect, which improved our understanding of the inter-hemispheric effects of the Atlantic thermohaline circulation and identified it as a global propagator of freshwater flux anomalies (Broecker, 1998; Stocker and Johnsen, 2003). The abrupt transitions that were detected throughout the data sets are related to the Dansgaard-Oeschger (DO) events, described in Dansgaard et al. (1993) and Alley et al. (2003) as abrupt warming in Greenland followed by gradual cooling. These events are regarded as critical transition points and indicate that the climate can demonstrate threshold behaviour (Alley et al., 2003). Therefore, analysing the evolution of the climatic variables in the time period preceding the transition helps in identifying possible warning signals.
The statistical tools used to extract knowledge from time series analysis have undergone considerable development during the past decade (see Livina and Lenton, 2007; Livina et al., 2011; Lenton et al., 2012; Scheffer et al., 2009; Dakos et al., 2008; Held and Kleinen, 2004; Cimatoribus et al., 2013). Driven by the ultimate aim of predicting future transitions, the effort is concentrated on overcoming the limitations of the finite length of the time series while simultaneously revealing statistical parameters that unfold the system's global properties. Focusing on a single transition event, the identification of slowing down as an early warning signal (EWS) before the Younger Dryas period drew a lot of attention in 2008 in the work of Dakos et al. (2008). A discussion was raised on what the precursors of a bifurcation point are, naming an increase in autocorrelation as a prominent indicator. Ditlevsen and Johnsen (2010) brought forward that respecting the fluctuation-dissipation theorem (Kubo, 1966) imposes both increasing autocorrelation and increasing variance before crossing a bifurcation. The fluctuation-dissipation theorem is at the centre of the theoretical analysis of the statistical findings, as it relates the response of the system to external perturbations to its fluctuations when in thermal equilibrium (Palmer and Weisheimer, 2011). Based on the same theorem, Cimatoribus et al. (2013) suggested a shortcoming of the previous statement based on the fact that the climate system is not in thermal equilibrium. Preprocessing or filtering the raw proxy data in order to achieve stationarity and infer more straightforward diagnostics is suggested, such as the Detrended Fluctuation Analysis (DFA) coefficient and the modified DFA coefficient. Livina and Lenton (2007) measured the proximity of a system to a tipping point not in units of critical parameter difference, but by monitoring the evolution of the DFA propagator. Held and Kleinen (2004) proposed the method of degenerate fingerprinting and modelled the system as an auto-regressive process with lag-1 autocorrelation, including only exponentially decaying modes and white noise in the dynamics of the system. Livina et al. (2011) applied tools of potential analysis to describe the location and stability of distinct global states of a system from climate time series data.
The above studies involve an implicit association of the performed analysis with a selected transition point. In terms of comprehensive analysis, it makes sense to focus on certain abrupt events which are visible in the data series with the naked eye, but certain algorithms can also be used for a quantitative description of the time series. Rahmstorf (2003) tried to automatise the characterisation of the DO events by introducing an event detection algorithm based on the data slope. The author also proposed estimating systematic and dating errors and suggested that the DO events are very likely synchronised to an external forcing of Bond cycle periodicity (Bond et al., 1992). Cimatoribus et al. (2013) went a step further in the use of algorithms, employing an advanced bootstrapped ensemble method to estimate the error probability in detecting DO events within the Pleistocene ensemble of DO events.
While the debate on the statistical parameters suitable for detecting a transition is still vivid, attributing DO events to one among the three types of transitions defined in Ashwin et al. (2012) is a requirement of direct relevance to the search for EWSs. The hypotheses on the causality of the DO events include noise-induced transitions, relaxation oscillations and excitability of the system, of which stochastic resonance is a subcase (Crucifix, 2012). Each case stems from a different simple mathematical model which demonstrates complex behaviour. The patterns preceding a transition can include EWSs which can be diagnosed and studied in the controlled environment of simulated data. Simulated time series are used to complement the findings from the finite time series analysis, since, in the absence of long and regular observational time series of climate variables, the study of their statistical properties is hindered. Concepts drawn from various dynamical systems are merged into EMICs (Earth system Models of Intermediate Complexity) (Ganopolski and Rahmstorf, 2001; Stocker and Johnsen, 2003; Claussen et al., 2003; Arzel et al., 2012) in order to produce the evolution of variables over thousands of model years and test hypotheses for the forcings and processes that possibly shaped the proxy time series. Since the simplified forcings are in direct correspondence with an idealised mathematical model, hypotheses of underlying mechanisms can be verified or rejected by additional comparison of the simulated data behaviour to the proxy data observable. In this reverse engineering approach, more than a single model can reproduce the variability encountered in the ice core or any other paleoclimatic record, so the interpretation of dynamical systems should be used with caution and in combination with scientific insight (Crucifix, 2012). Returning to the DO and Younger Dryas abrupt changes encountered in the ice core records, grouping them according to the behaviour preceding them can only be advantageous for reinforcing or weakening the existing hypotheses on their incidence.
This Ariadna study proposes a different approach to address the problem of EWS detection in time series. We supply no prior knowledge of tipping points to the algorithm and employ a segmentation method for classifying the different kinds of segments present in the time series. The goal of time-series segmentation is to provide a more compact representation of time series data by dividing the data into segments and using a high-level representation to approximate each segment. The time-series segmentation problem has been widely studied within various disciplines. For example, time-series segmentation algorithms have been successfully applied in phoneme recognition (Xiong et al., 1994; Prandom et al., 1997), paleoecological problems (Bennett, 1996), telecommunication applications (Himberg et al., 2001) or financial problems (Tseng et al., 2009). For an excellent review of time-series segmentation see Keogh et al. (2001).
On the other hand, climate time series can also be modelled using predictive models. The term Time Series (TS) refers to a chronologically ordered succession of values of some magnitude or phenomenon that has been sampled at a certain rate. Examples of time series include the evolution of the maximum daily temperature, the unemployment rate of a country or the amplitude of the seismic waves of an earthquake. Time series are present in most scientific fields, such as econometrics (Gonzalo and Ng, 2001), weather forecasting (Arroyo and Maté, 2009) or control engineering (Lee and Davier, 2013). Nowadays, TS research is concerned with TS Analysis (TSA) and TS Forecasting (TSF). The goal of TSA is to extract the main features and characteristics that describe the underlying phenomena, while the objective of TSF is to find a function to predict the next value of the time series using its p lagged values.
Artificial Neural Networks (ANNs) are a very popular Machine Learning (ML) tool used for TSF (Hansen and Nelson, 1997). In the field of TSF, there are different ANN architectures. Feedforward Neural Networks (FFNNs) are the most common and simplest type of ANN, where the information moves in a forward direction (Johansson et al., 1991; Ahmed and Rauf, 1991). For example, the Time Delay Neural Network (TDNN) consists of a FFNN whose inputs are the delayed values of the TS (Sitte and Sitte, 2000). Recurrent Neural Networks (RNNs), instead, are based on a different architecture, where the information moving through the system forms a directed cycle (Connor et al., 1994). This cycle can store information from previous data in the internal memory of the network, which can be useful for certain kinds of applications. One example of an RNN is the Long Short Term Memory Neural Network (LSTMNN) (Hochreiter and Schmidhuber, 1997), whose main characteristic is the capability of its nodes to remember a time series value for an arbitrary length of time. Robot control and real-time recognition (Baccouche et al., 2011) are examples of real applications of LSTMNNs. Echo State Networks (ESNs) are RNNs whose architecture includes a random number of neurons whose interconnections are also randomly decided. This provides the network with a long-term memory and a competitive generalisation performance (Jaeger, 2002; Gallicchio and Micheli, 2011; Rodan and Tiňo, 2011). From the previous introduction to ANNs in TSF, it can be seen that one of the main differences between FFNNs and RNNs lies in their storage capacity. RNNs have a long-term memory because of the architecture of the model, whereas the memory of FFNNs is provided by the lagged terms at the input of the network.
The parameter estimation algorithm is also very important when analysing the different approaches. The more complex the structure of a neural network is, the more challenging the estimation of its weight matrix becomes. Traditional Backpropagation (BP) algorithms can incur a very high computational cost, especially when dealing with complex nonlinear error surfaces. The Extreme Learning Machine (ELM) is an example of an algorithm that can estimate the parameters of a FFNN model efficiently (Pan et al., 2009). It is a widely used algorithm that determines the hidden layer parameters randomly and the output layer ones by using the Moore-Penrose (MP) generalised inverse (Huang et al., 2012), providing a better generalisation performance than traditional gradient-based learning algorithms for some problems. As will be analysed later in this document, this Ariadna study also contributes a new algorithm for TSF.
The results of this Ariadna study have been included in two journal publications and one international conference contribution:
• Isabelle Dicaire, Pedro Antonio Gutiérrez, Antonio Durán, Athanasia Nikolaou, Francisco Fernández-Navarro, César Hervás-Martínez. Detection of early warning signals in paleoclimate data using a genetic time-series segmentation algorithm. Submitted to Climate Dynamics. 2014.
• María Pérez Ortiz, Pedro Antonio Gutiérrez, Javier Sánchez Monedero, César Hervás-Martínez, Athanasia Nikolaou, Isabelle Dicaire, Francisco Fernández-Navarro. Time series segmentation of paleoclimate tipping points by an evolutionary algorithm. Accepted in the 9th International Conference on Hybrid Artificial Intelligence Systems (HAIS 2014). Salamanca, Spain. June, 2014.
• María de los Angeles de la Cruz, Francisco Fernández-Navarro, Pedro Antonio Gutiérrez, Adiel Castaño, and César Hervás-Martínez. Time Series Forecasting by Evolutionary Recurrent Product Unit Neural Networks. Submitted to IEEE Transactions on Neural Networks and Learning Systems. 2014.
This study report is organised in five chapters: after this Introduction, a new algorithm for time series segmentation is presented in Chapter 2, analysing its performance for EWS detection. Chapter 3 is devoted to the study of alternative fitness functions and evaluation methods for the considered algorithm. Then, Chapter 4 presents a new modelling technique for time series. Finally, Chapter 5 summarises the contributions of the study.

Chapter 2

Detection of early warning signals in paleoclimate data using a genetic time-series segmentation algorithm

This chapter presents the main characteristics of the segmentation algorithm developed in this study. The algorithm is presented and its results are analysed for the different datasets considered.

As previously discussed, prior to EWS detection, this study introduces a segmentation method as a first step to better understand the time series. This segmentation provides a more compact representation of the time series by splitting it into segments with similar behaviour (Keogh et al., 2001). A segmentation analysis avoids the necessity of specifying predefined sliding windows for the different tipping points (TPs), which is one of the main difficulties of previous TP detection methods (Dakos et al., 2012). Moreover, the segmentation algorithm is able to detect differences between the TPs. We address the segmentation problem as a heuristic search problem with the proposal of a Genetic Algorithm (GA) to overcome the limitations of traditional statistical methods. The GA segments the data trying to obtain diverse clusters of segments based on six statistical properties.
The segmentation problem is usually converted into an optimisation problem that can be addressed using local or global algorithms (like GAs). For example, several GA-based approaches to segment time series were proposed in Chung et al. (2004). In a similar way, Tseng et al. (2009) also proposed a GA to address the segmentation of the time series. The main novelty of this last approach is the inclusion of a clustering technique within the optimisation procedure to assign a class label to each segment.

In this study, a time-series segmentation algorithm is proposed by combining a clustering technique and a GA to automatically find the proper segmentation points and segment classes of a climate time series with abrupt changes. Interest in GAs applied to climate tipping points is rising; e.g. Lenton et al. (2009) used a GA to tune 12 physical parameters of an Earth System Model to study the tipping of the Atlantic thermohaline circulation following a multi-objective optimisation method. The time-series segmentation algorithm presented in this chapter is a significant extension of the one in Tseng et al. (2009). The final goal of the proposed GA is to minimise the distance of each segment to its centroid in a six-dimensional space where the six dimensions are statistical properties of each segment. The proposed approach first groups the segments into k clusters according to their statistical characteristics by using the k-means clustering technique (MacQueen et al., 1967). The Euclidean distance is used to calculate the distance of each segment with respect to its centroid. Because two segments may have different lengths and characteristics, six statistical metrics are measured for each segment such that the distance can be calculated and the clustering technique can be applied in this six-dimensional space. The algorithm is specifically adapted to time series with abrupt changes.
The proposed approach features the following characteristics:
• It assigns a class label to the different segments via the combination of the GA with the clustering technique; traditional approaches would only provide the segmentation points (Sclove, 1983; Himberg et al., 2001; Keogh et al., 2001). This is especially useful for finding common patterns in climate data.
• As mentioned in the previous point, the focus in the present study is not just on the determination of the cut points ($t_i$, $i = 1, \dots, m-1$). Rather, the idea underlying the development here is that of transitions between classes. As stated previously, the labels $C_j$, $j = 1, \dots, k$, are categorised using six statistical measures. The analysis of these transitions is crucial to the detection of EWSs.
• The algorithm considers all segments of the paleoclimatic record and attempts to find common characteristics within the different segments. This approach is not unlike that of Cimatoribus et al. (2013), who considered the average behaviour of the ensemble of DO events, with the added bonus that our approach can also analyse each transition and its underlying statistical properties by applying a label to each climate segment.
• Each segment is represented by a six-dimensional vector whose dimensions are several statistical metrics, some of which have been previously considered in detecting EWSs of critical transitions by various authors (Cimatoribus et al., 2013; Dakos et al., 2008, 2012).
• Instead of representing the time series evolution by plotting one of its metrics, the approach proposed in this chapter makes it possible to visualise several metrics simultaneously and to compare several sections of the time series to find common patterns.
This chapter is organised as follows. Section 2.1 introduces the segmentation algorithm with a detailed description of the embedded genetic algorithm, the six statistical metrics and the clustering process. Section 2.2 presents the paleoclimate datasets used in this study and the algorithm parameters. Section 2.3 presents the main results of the segmentation algorithm, including a detailed analysis of the statistical metrics preceding DO events. Finally, Section 2.4 discusses the results from the point of view of the stochastic resonance model and possible limitations of the algorithm.

2.1 Segmentation Algorithm

2.1.1 Mathematical description of the segmentation problem

The problem of time-series segmentation considered here is the following: given a time series $Y = \{y_n\}_{n=1}^{N}$, partition the set of values of $y_n$ into $m$ segments within which the behaviour of $y_n$ is homogeneous. The segmentation algorithm should provide a partition of the time index set ($n = 1, \dots, N$) into subsets $S_1 = \{y_1, \dots, y_{t_1}\}$, $S_2 = \{y_{t_1}, \dots, y_{t_2}\}$, ..., $S_m = \{y_{t_{m-1}}, \dots, y_N\}$, where the $t$'s are the cut points, subscripted in ascending order ($t_1 < t_2 < \dots < t_{m-1}$). Each subset $S_l$, $l = 1, \dots, m$, is a segment. The integer $m$ and the cut points $t_i$, $i = 1, \dots, m-1$, have to be determined automatically by the algorithm. Formally, the segmentation problem is a special case of the general clustering problem.

Furthermore, the segments considered in this study are grouped into $k$ different classes ($k < m$), where $k$ is a parameter defined by the user. Therefore, each segment $S_l$ has an associated class label: $(S_1, C_1), (S_2, C_2), \dots, (S_m, C_m)$, where $C_l$, $l = 1, \dots, m$, is the class label of the $l$-th segment. The class label of each segment, $C_l$, has $k$ possible values.
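To make the notation concrete, here is a minimal Python sketch (our own; the toy data and variable names are hypothetical) of a segmentation matching the example later shown in Figure 2.2:

```python
import numpy as np

# Toy series of length N = 22, as in Figure 2.2 (1-based values
# y_1, ..., y_22 become 0-based indices 0, ..., 21).
y = np.random.randn(22)            # stand-in for a proxy record
cut_points = [3, 7, 12, 17]        # t_1, ..., t_4 of Figure 2.2, 0-based

# Segment S_l spans [t_{l-1}, t_l]; consecutive segments share their
# boundary point, exactly as in the definition above.
bounds = [0] + cut_points + [len(y) - 1]
segments = [y[bounds[l]:bounds[l + 1] + 1] for l in range(len(bounds) - 1)]

# After clustering, each segment S_l receives one of k class labels C_l.
k = 5
labels = np.zeros(len(segments), dtype=int)   # filled in by k-means later
```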

2.1.2 General overview of the segmentation algorithm

This chapter proposes a novel Genetic Algorithm (GA) for the field of time series segmentation (see Sclove, 1983; Himberg et al., 2001; Keogh et al., 2001; Chung et al., 2004). The general objective of the GA is to identify segments with common characteristics by applying a label to these segments. In practice this means finding the cut points of the time series defining the different segments to be discovered, together with the class labelling of these segments. As in traditional GAs, the proposed approach considers a population of candidate solutions (representing different possible segmentations) which are evolved towards better segmentation solutions. Each possible segmentation is represented as an array of integer values (chromosome representation) which can be mutated and recombined. The evolution starts from a population of randomly generated segmentations. After that, every segment in every chromosome is categorised using six statistical metrics. It is important to point out that most of these six statistical metrics have previously been considered in the climate community (variance, autocorrelation, skewness, etc.). The clustering technique is applied over this six-dimensional space for every chromosome, and a fitness value is assigned to every chromosome according to the degree of homogeneity of the segments with respect to their centroids. The class label is assigned during the clustering process. After that, different mutation and crossover operators are applied to explore and exploit the search space. This procedure is repeated for n generations. The main steps of the proposed algorithm are summarised in Figure 2.1.

The different characteristics of the GA are defined in the following subsections. Further information on the algorithmic flow of the proposed GA is included in Appendix 2.5.

Time series segmentation:
Input: Time series.
Output: Best segmentation of the time series.
1: Generate a random population of t time series segmentations.
2: Evaluate all segmentations of the initial population by using the fitness function.
3: while not Stop Condition do
4:   Store a copy of the best segmentation.
5:   Select parent segmentations from the current population.
6:   Generate offspring: apply crossover and mutation to construct new candidate segmentations.
7:   Evaluate the fitness of the offspring segmentations.
8:   Replace current population with offspring.
9: end while
10: return Best segmentation from final population

Figure 2.1: Main steps of the algorithm
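To make the flow concrete, the following Python skeleton mirrors Figure 2.1; it is our own sketch under simplifying assumptions (the initial population uses a placeholder generator, and the `evaluate`, `mutate` and `crossover` operators, sketched in the subsections below, are supplied by the caller):

```python
import random

def genetic_segmentation(series, evaluate, mutate, crossover,
                         pop_size=80, generations=2000):
    """Skeleton of Figure 2.1. A chromosome is a sorted list of cut-point
    indices; the operator functions are supplied by the caller."""
    n = len(series)
    # Step 1: random initial population (the actual generation scheme,
    # with m drawn from [10k, 15k], is detailed in Section 2.5.1).
    population = [sorted(random.sample(range(1, n - 1), 60))
                  for _ in range(pop_size)]
    for _ in range(generations):                               # step 3
        fitness = [evaluate(series, ind) for ind in population]
        elite = population[fitness.index(max(fitness))]        # step 4
        # Steps 5-6: all individuals act as parents (Section 2.1.8);
        # each is crossed with a randomly chosen mate, then mutated.
        offspring = [mutate(crossover(ind, random.choice(population), n), n)
                     for ind in population]
        weights = [evaluate(series, ind) for ind in offspring]  # step 7
        # Step 8: roulette-wheel replacement plus elitism.
        population = random.choices(offspring, weights=weights,
                                    k=pop_size - 1) + [elite]
    return max(population, key=lambda ind: evaluate(series, ind))
```

The stop condition here is simply a fixed number of generations, matching the 2000-generation budget quoted later in Section 2.2.2.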

2.1.3 Chromosome representation

A direct encoding of the final segmentation solution is adopted, where each individual chromosome consists of an array of integer values (Michalewicz, 1996). Each position stores a cut point of the time series. A chromosome of m segments in the time series is represented by $\{t_1, \dots, t_{m-1}\}$, where the value $t_i$ is the index of the $i$-th cut point of the time series. In this way, the first segment is delimited by the cut points 1 and $t_1$, the second by the cut points $t_1$ and $t_2$, and so on. An example of this chromosome representation is given in Figure 2.2.

2.1.4 Initial population


A GA requires a population of feasible solutions to be initialised and updated
during the evolutionary process. As mentioned above, each individual within
a population is a possible segmentation result for the time-series considered.
An initial set of chromosomes is thus generated with some constraints to form
feasible segments. This initial population of t individuals is randomly generated.
The number of individuals is kept constant during the evolution. Further information on the creation of each initial individual can be found in Appendix 2.5.1.

2.1.5 Segment characteristics

As a result of the genetic operators, the segments in a chromosome may have different lengths. Thus, an approach has to be designed to transform all the segments to the same dimensional space. In our case, six statistical metrics are measured for all the segments included in a chromosome, allowing the GA to calculate similarities between segments using the same dimensional space. For the sake of simplicity, all the following characteristics refer to the segment $S_s$, which is the part of the time series delimited as $S_s = \{y_{t_{s-1}}, \dots, y_{t_s}\}$:

a) Example chromosome; each position stores an index of a time series value: {4, 8, 13, 18}.

b) Segments of the time series resulting from the chromosome: Segment 1 = $\{y_1, \dots, y_4\}$, Segment 2 = $\{y_4, \dots, y_8\}$, Segment 3 = $\{y_8, \dots, y_{13}\}$, Segment 4 = $\{y_{13}, \dots, y_{18}\}$, Segment 5 = $\{y_{18}, \dots, y_{22}\}$.

c) Corresponding segmentation of the time series, with cut points $t_1 = 4$, $t_2 = 8$, $t_3 = 13$, $t_4 = 18$. The characteristics of each segment are obtained from the corresponding part of the time series.

Figure 2.2: Chromosome representation (Online version in colour)

1. Variance ($S_s^2$): It is a measure of variability that indicates the degree of homogeneity of a group of observations. The mathematical expression of this metric is:
$$S_s^2 = \frac{1}{t_s - t_{s-1}} \sum_{i=t_{s-1}}^{t_s} (y_i - \bar{y}_s)^2 \quad (2.1)$$
where $(t_s - t_{s-1})$ is the number of points of the segment, $t_{s-1}$ is the index of the first point in the $s$-th segment, $t_s$ is the index of the last point in the segment, $y_i$ are the time series values of the segment, and $\bar{y}_s$ is the average value of the segment.
2. Skewness ($\gamma_{1s}$): The skewness represents the asymmetry of the distribution of the series values in the segment. Segments can be skewed either up or down with respect to the arithmetic mean. The skewness is defined as:
$$\gamma_{1s} = \frac{\frac{1}{t_s - t_{s-1}} \sum_{i=t_{s-1}}^{t_s} (y_i - \bar{y}_s)^3}{S_s^3}, \quad (2.2)$$
where $S_s$ is the standard deviation of the $s$-th segment.
3. Kurtosis ($\gamma_{2s}$): It measures the degree of concentration of the values around the mean of the distribution. Positive kurtosis (i.e. long tails) indicates large excursions away from the arithmetic mean. Kurtosis is defined as:
$$\gamma_{2s} = \frac{\frac{1}{t_s - t_{s-1}} \sum_{i=t_{s-1}}^{t_s} (y_i - \bar{y}_s)^4}{S_s^4} - 3. \quad (2.3)$$
4. Slope of a linear regression over the points of the segment ($a_s$): A linear model is constructed for every segment, trying to achieve the best linear approximation of the points of the time series in the evaluated segment. The slope of the linear model is a measure of the general tendency of the segment. The slope parameter is obtained as:
$$a_s = \frac{S_s^{yt}}{(S_s^t)^2}, \quad (2.4)$$
where $S_s^{yt}$ is the covariance of the time indexes, $t$, and the time series values, $y$, for the $s$-th segment; and where $S_s^t$ is the standard deviation of the time values. The mathematical expression for the covariance is:
$$S_s^{yt} = \frac{1}{t_s - t_{s-1}} \sum_{i=t_{s-1}}^{t_s} (i - \bar{t}_s) \cdot (y_i - \bar{y}_s). \quad (2.5)$$

5. Mean Squared Error ($MSE_s$): This statistic measures the degree of nonlinearity of the segment. As done for the slope, a linear model is fitted for the segment and then the $MSE_s$ of this linear fitting is measured. This is defined by:
$$MSE_s = S_s^2 \cdot (1 - r_s^2), \quad (2.6)$$
where:
$$r_s^2 = \frac{(S_s^{yt})^2}{S_s^2 \cdot (S_s^t)^2}. \quad (2.7)$$

6. Autocorrelation coefficient ($AC_s$): It measures the degree of correlation between the current values of the time series and the values of the time series at the previous time stamp. The $AC_s$ is defined as:
$$AC_s = \frac{\sum_{i=t_{s-1}}^{t_s - 1} (y_i - \bar{y}_s) \cdot (y_{i+1} - \bar{y}_s)}{S_s^2}. \quad (2.8)$$

Once the six statistical metrics have been calculated for each segment in a chromosome, a clustering technique is applied over this six-dimensional space.
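As a concrete reference, the following Python sketch (our own transcription, assuming 0-based indexing and that the $1/(t_s - t_{s-1})$ averaging of Eqs. (2.1)-(2.5) also applies to the autocorrelation numerator) computes the six-dimensional vector of one segment:

```python
import numpy as np

def segment_metrics(y, t_prev, t):
    """Six metrics (Eqs. 2.1-2.8) of the segment delimited by the 0-based
    indices t_prev and t (inclusive)."""
    seg = np.asarray(y[t_prev:t + 1], dtype=float)
    n = t - t_prev                                     # t_s - t_{s-1}
    dev = seg - seg.mean()
    var = np.sum(dev ** 2) / n                         # Eq. (2.1)
    sd = np.sqrt(var)
    skew = np.sum(dev ** 3) / n / sd ** 3              # Eq. (2.2)
    kurt = np.sum(dev ** 4) / n / sd ** 4 - 3.0        # Eq. (2.3)
    tdev = np.arange(len(seg)) - (len(seg) - 1) / 2.0  # i - mean(t)
    cov = np.sum(tdev * dev) / n                       # Eq. (2.5)
    var_t = np.sum(tdev ** 2) / n
    slope = cov / var_t                                # Eq. (2.4)
    r2 = cov ** 2 / (var * var_t)                      # Eq. (2.7)
    mse = var * (1.0 - r2)                             # Eq. (2.6)
    ac = (np.sum(dev[:-1] * dev[1:]) / n) / var        # Eq. (2.8)
    return np.array([var, skew, kurt, slope, mse, ac])

# Example: the six-dimensional vector of the segment y_4, ..., y_8.
y = np.sin(np.linspace(0.0, 6.0, 22))
print(segment_metrics(y, 3, 7))
```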

2.1.6 Clustering: k-means Algorithm

A clustering process has to be applied in order to obtain the value of the fitness function for each candidate segmentation. The algorithm chosen, k-means, is applied to the time-series segments. Further information on the application of k-means and the initialisation procedure can be found in Appendix 2.5.2.

Before applying the clustering algorithm, one should normalise the values of the segment metrics, as the distance of each segment to its centroid strongly depends on the range of values of each metric (e.g. variance can have a much broader range of variation than skewness). Thus, metrics with larger ranges would dominate those with smaller ranges. Scaling is used to avoid this problem. For a given segmentation, the segment metrics are normalised to the range [0, 1] using the min-max normalisation:
$$v^* = \frac{v - v_{\min}}{v_{\max} - v_{\min}}, \quad (2.9)$$
where $v$ is the value of the metric for a given segment, $v^*$ is the normalised value, $v_{\min}$ is the minimum value of this metric over all the segments and $v_{\max}$ is the maximum one. A constant value of $v^* = 0.5$ is assigned whenever the metric is constant across all segments.
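A small sketch of this scaling step, applied column-wise to the matrix of segment metrics (names are ours):

```python
import numpy as np

def minmax_normalise(metrics):
    """Column-wise min-max scaling (Eq. 2.9) of the (m x 6) matrix of
    segment metrics; constant columns are mapped to 0.5 as stated above."""
    metrics = np.asarray(metrics, dtype=float)
    lo, hi = metrics.min(axis=0), metrics.max(axis=0)
    out = np.full_like(metrics, 0.5)
    nz = hi > lo                      # columns that actually vary
    out[:, nz] = (metrics[:, nz] - lo[nz]) / (hi[nz] - lo[nz])
    return out
```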

2.1.7 Fitness

All GAs need a measure that assigns a quality index to each individual of the population. Since we are dealing with a clustering process, a way to evaluate the obtained groups is to consider the Sum of Squared Errors (SSE), which consists of the sum of squared distances between each segment and its cluster centroid:
$$SSE = \sum_{i=1}^{m} d_i^2, \quad (2.10)$$
where $i$ is the segment being evaluated, $m$ is the total number of segments, and $d_i$ is the Euclidean distance between segment $i$ and its closest centroid.

Our goal is to minimise this SSE in order to obtain more compact clusters (where each point is as close as possible to its centroid, and the centroids are as far as possible from each other). However, when the GA tries to minimise the SSE, it tends to minimise the number of segments as much as possible, in the extreme case producing a partition where each cluster is a single segment. For instance, assuming that the number of clusters considered is five and that a chromosome includes only five segments, the SSE would be minimum in this case, $SSE = 0$, because each segment would constitute a cluster. Taking into account that this is not an acceptable solution, the fitness function is redefined considering also the number of segments:
$$fitness = \frac{m}{SSE}. \quad (2.11)$$
In this way, the algorithm tries to find partitions of the time series where the number of segments is sufficiently high to ensure the acquisition of valuable information from the clustering process.
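Putting Eqs. (2.10) and (2.11) together, the evaluation of one chromosome reduces to a few lines; the handling of the degenerate $SSE = 0$ case below is our own choice, not prescribed by the text:

```python
import numpy as np

def chromosome_fitness(metrics, labels, centroids):
    """Fitness of one chromosome (Eq. 2.11): the number of segments m
    divided by the SSE of Eq. (2.10). `metrics` is the (m x 6) matrix of
    normalised segment vectors; `labels` and `centroids` (numpy arrays)
    come from the k-means step."""
    d2 = np.sum((metrics - centroids[labels]) ** 2, axis=1)
    sse = float(d2.sum())
    # SSE = 0 is the degenerate one-segment-per-cluster case discussed
    # above; we penalise it here rather than letting it win.
    return len(metrics) / sse if sse > 0.0 else 0.0
```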

2.1.8 Selection and replacement processes

In this study, a direct selection scheme in which all individuals are selected is adopted. That is, in each generation, all individuals within the population are selected for reproduction and generation of offspring. Thus, a greater diversity is promoted, because the parents are not selected based on their fitness.

The replacement process is performed by roulette wheel selection, i.e. a selection probability for each individual chromosome is calculated from its fitness value. The number of individuals selected is the population size minus one, and the vacant place is occupied by the best segmentation (the one with the highest fitness) of the previous generation, thus making it an elitist algorithm.

As can be seen, the selection process promotes diversity, while the replacement process promotes elitism.

2.1.9 Mutation Operator

The algorithm is endowed with four mutation operators whose principal function is to perform a better random exploration of the search space, with the aim of reducing the dependency on the initial population and escaping from local optima. The probability $p_m$ of performing any mutation is decided by the user. Once it is decided that a mutation will be performed, the kind of perturbation applied to the chromosome is randomly selected from the following list: 1) add a cut point, 2) remove a cut point, 3) move half of the cut points to the left, and 4) move half of the cut points to the right.

When adding or removing cut points, the number of cut points to be added or removed is also determined randomly. When moving cut points to the right or the left, the number of points to move is approximately m/2 (half of the available points); they are randomly selected, and each cut point is randomly pushed towards the previous or the following cut point (with the constraint that it never reaches the previous or the next point). An example of the four mutation operations is included in Figure 2.3, where two cut points are removed, one cut point is added, and half of the cut points are moved to the left and to the right.

2.1.10 Crossover Operator

The algorithm includes a crossover operator, whose main function is to perform an exploitation of the existing solutions. For each parent individual, the crossover operator is applied with a given probability $p_c$. The operator randomly selects the other parent and a random index of the time series, and it interchanges the left and right parts of the chromosomes with respect to this point. An illustration of the crossover operator can be seen in Figure 2.4.

a) Remove two cut points (18 and 52): {10, 15, 18, 26, 33, 36, 47, 52, 59} → {10, 15, 26, 33, 36, 47, 59}.
b) Add a cut point (39): {10, 15, 18, 26, 33, 36, 47, 52, 59} → {10, 15, 18, 26, 33, 36, 39, 47, 52, 59}.
c) Randomly move cut points to the left: {10, 15, 18, 26, 33, 36, 47, 52, 59} → {9, 15, 16, 26, 31, 32, 47, 52, 53}.
d) Randomly move cut points to the right: {10, 15, 18, 26, 33, 36, 47, 52, 59} → {10, 17, 20, 29, 33, 36, 47, 54, 63}.

Figure 2.3: Mutation operator (Online version in colour)
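A sketch of these four perturbations in Python (simplified: 'add' and 'remove' touch a single cut point here, and boundary handling is minimal):

```python
import random

def mutate(cuts, n, p_m=0.075):
    """Sketch of the four mutation operators of Section 2.1.9 applied to
    a sorted list of cut points in [1, n-2], with probability p_m."""
    if random.random() >= p_m or len(cuts) < 2:
        return list(cuts)
    cuts = list(cuts)
    op = random.choice(['add', 'remove', 'left', 'right'])
    if op == 'add':
        free = [i for i in range(1, n - 1) if i not in cuts]
        if free:
            cuts.append(random.choice(free))
    elif op == 'remove':
        cuts.remove(random.choice(cuts))
    else:
        # Move roughly half of the cut points towards the neighbouring
        # cut point, never actually reaching it.
        for i in random.sample(range(len(cuts)), len(cuts) // 2):
            lo = cuts[i - 1] + 1 if i > 0 else 1
            hi = cuts[i + 1] - 1 if i < len(cuts) - 1 else n - 2
            cuts[i] = (random.randint(lo, cuts[i]) if op == 'left'
                       else random.randint(cuts[i], hi))
    return sorted(set(cuts))
```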

Parent 1: {10, 15, 18, 26, 33, 36, 47, 52, 59, 62, 68, 75, 80, 84, 88, 92, 95, 99}
Parent 2: {15, 20, 23, 27, 32, 36, 45, 48, 55, 65, 71, 77, 81, 86, 91, 96, 98, 99}
After crossover (the crossover point was randomly decided to be 60):
Offspring 1: {10, 15, 18, 26, 33, 36, 47, 52, 59, 65, 71, 77, 81, 86, 91, 96, 98, 99}
Offspring 2: {15, 20, 23, 27, 32, 36, 45, 48, 55, 62, 68, 75, 80, 84, 88, 92, 95, 99}

Figure 2.4: Crossover operator (Online version in colour)
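A sketch of this operator, returning one of the two offspring (the symmetric child is obtained by exchanging the parents):

```python
import random

def crossover(cuts_a, cuts_b, n, p_c=0.9):
    """Single-point crossover of Section 2.1.10: a random time index is
    drawn and the cut points on either side of it are swapped."""
    if random.random() >= p_c:
        return list(cuts_a)
    point = random.randint(1, n - 2)
    child = [t for t in cuts_a if t < point] + \
            [t for t in cuts_b if t >= point]
    return sorted(set(child))
```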

2.2 Experiments

2.2.1 Climate datasets

The datasets chosen for this study are the Greenland Ice Sheet Project Two (GISP2) and the North Greenland Ice Core Project (NGRIP) δ18O ice core data (Grootes and Stuiver, 1997; Stuiver and Grootes, 2000; Andersen et al., 2004; Svensson et al., 2008). The δ18O water isotope record is a proxy for past atmospheric temperature but also reflects changes in water temperature and seasonal snow accumulation. In this study we focus on the 20-yr resolution δ18O isotope records from both drilling sites.

Pre-processing the datasets in the form of a 5-point average was found to help reduce short-term fluctuations within the datasets and improve the analysis of time series segmentations. If $\{y_n\}_{n=1}^{N}$ is the original time series, then the time series we have considered is $\{y_n^*\}_{n=1}^{N/5}$, with
$$y_i^* = \frac{1}{5} \sum_{j=5i}^{5i+4} y_j.$$
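This block average amounts to one reshape in numpy (a sketch; any trailing remainder of the series is simply dropped):

```python
import numpy as np

def five_point_average(y):
    """Non-overlapping 5-point block average used to pre-process the
    delta-18-O records."""
    y = np.asarray(y, dtype=float)
    n_blocks = len(y) // 5
    return y[:5 * n_blocks].reshape(n_blocks, 5).mean(axis=1)
```

Applied to the 20-yr resolution records, this yields an effective sampling of 100 years per point.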

2.2.2 Algorithm parameters

GAs usually involve adjusting a notable set of parameters. However, their search dynamics, which adapt to the problem evaluated, result in a performance that is only negligibly affected by minor changes in these parameters. In our case, all the parameters were initially set by trial and error, and we then used the same values for all the problems analysed.

The number of individuals (segmentations) of the population is t = 80. The crossover probability is $p_c = 0.9$ and the mutation probability $p_m = 0.075$. The number of clusters to be discovered from each candidate segmentation is k = 5. This number is possibly the most important parameter, but we experimentally found that k = 5 clusters is high enough to discover new information among the derived clusters but not so high that the interpretation and reproducibility of the results could be threatened. The maximum number of generations is set to 2000, and the k-means clustering process is allowed a maximum of 20 iterations. It is important to point out that the algorithm estimates the type of segments and the cut points without any additional climate knowledge or supervision by climate experts. The data time series obtained from the system undergoing a transition is the only information available to the algorithm.

Finally, a GA is a stochastic optimisation algorithm with an embedded random number generator. Given that the results can differ depending on the seed value, the algorithm should be run several times with different seeds. For each dataset, the GA was run 10 times, with seeds in the set {10, 20, . . . , 100}, to evaluate and remove the dependence of the results on the seed value. It is also a means to evaluate the accuracy of the algorithm.

2.3 Results

This section presents the main results of the segmentation algorithm for the two paleoclimate datasets under study. The segmentation returned by the GA in the last generation was analysed, and the following approach has been considered for its analysis. First, it was verified whether the DO events belonged to different classes or whether they were grouped according to some common characteristics. Second, the behaviour of each metric in the six-dimensional parameter space was observed at the onset of DO events to find common patterns that would be indicative of EWSs, e.g. increasing variance and autocorrelation coefficient. This was done for the two independent datasets and for the ten seed values. The detection accuracy when considering the results of the 10 seeds is included in Table 2.1. Following this approach, five main results have been obtained. They are listed below:

Table 2.1: Detection accuracy for early warning signals of Dansgaard-Oeschger events when considering the results of the GA for the 10 seeds.

DO event                Detectability success (%)
                        GISP2    NGRIP
End of Younger Dryas       80      100
1                          90       90
2                          70       50
3                          30       20
4                          90       90
5                          30       50
6                          60       20
7                          70       50
8                          90      100
9                           0        0
10                         50       70
11                         70       70
12                         80       80
1. The DO events are grouped into two main classes, sometimes three, because the values of autocorrelation, variance, and MSE may differ significantly from one DO event to another. The high number of classes considered here (5 classes in total) allows for flexibility within the algorithm, as warning signals may have different strengths in agreement with the stochastic resonance model (Ganopolski and Rahmstorf, 2001, 2002).

2. EWSs of DO events can be found by the segmentation algorithm in the form of an increase in autocorrelation, variance, and mean square error (MSE). These EWSs are robustly (70%+) found in the GISP2 δ18O dataset for DO 1, 2, 4, 7, 8, 11, 12, and the end of the Younger Dryas, and in the NGRIP δ18O dataset for DO 1, 4, 8, 10, 11, 12, and the end of the Younger Dryas (see Table 2.1 for more details).

3. The increase in mean square error (MSE) is suggested here as another indicator of abrupt climate change. The increase in MSE, which suggests nonlinear behaviour, has been found to correspond with an increase in variance prior to DO events for ∼90% of the seed runs for the GISP2 δ18O dataset (e.g. see Figure 2.5) and for ∼100% of the seed runs for the NGRIP δ18O dataset.

4. The increase in the autocorrelation coefficient cannot be used on its own as an indicator of climate change. The algorithm sometimes found an increase in MSE and variance but a decrease in the autocorrelation coefficient at the onset of DO events. This signature was minor in the GISP2 δ18O dataset (e.g. DO 2, 10) but much more present in the NGRIP δ18O dataset (e.g. DO 0, 1, 5, 7, 8, 10). Hints of this behaviour could already be found for DO 1 by Lenton et al. (2012). We stress that the increase in variance and MSE is a much more robust EWS, especially for NGRIP.

5. Analysis of the GISP2 and NGRIP paleoclimate records did not find any consistent change in skewness or kurtosis at the onset of DO events.

[Plot omitted: "GISP2 metrics evolution", showing the normalised variance, MSE and autocorrelation curves against time before present (kyrs), with DO events 0-12 marked along the top.]

Figure 2.5: Time series metrics after the clustering process (i.e. the segments found by the algorithm are replaced by their cluster centroids). The increase in MSE is associated with an increase in variance and autocorrelation at the onset of DO events. Several DO events are represented for reference (GISP2 δ18O ice core, seed = 10) (Online version in colour).

Figure 2.6 presents the detailed segmentation results for GISP2 and NGRIP δ18O ice core data for a given seed value, including the segmentation, the DO events and the centroids for each cluster. The Dansgaard-Oeschger events are found grouped into two or three main classes with high autocorrelation, MSE, and variance, corresponding to classes C1 and C2 for GISP2 and classes C1 and C5 for NGRIP for that run. Class C5 (cyan curve in Fig. 2.6b) is considered the main DO class in NGRIP data for that particular run, with a highly linear relationship (ratio of 1:1) between variance and MSE within that class and a constant high autocorrelation coefficient. This is illustrated in Figure 2.7b.

Class C3 for the GISP2 dataset was the third main class, grouping segments with the lowest MSE, variance, and autocorrelation for that seed run; it was found at the onset of several DO events (e.g. 1, 4, 8, 12), collocated with the Heinrich events H1, H3, H4, H5, as well as during the Holocene period (for an introduction to Heinrich events see Hemming, 2004). Classes C4 and C5 have been found outside the plotted area (in the -50 ka to -60 ka range) and therefore do not appear in the graph. As for the NGRIP dataset, classes C2 and C4 with the lowest MSE, variance, and autocorrelation have been found at the onset of several DO events as well (e.g. 4, 7, 8, 10 and 12), with a strange behaviour in the autocorrelation coefficient for DO 1. A detailed analysis of their six-dimensional vectors revealed that classes C2 and C4 differ only from the point of view of kurtosis in that run. This is further discussed in the discussion section about the limitations of the algorithm. Considering algorithm runs with different seed values revealed minor differences, such as DO events belonging to other classes, but the main characteristics described here and in the five main points remained robust throughout the results. The reader is referred to Appendix 2.6 for the detailed segmentation results of GISP2 and NGRIP δ18O ice core data for another seed value.

[Plot omitted: GISP2 (left) and NGRIP (right) δ18O isotope data against time before present (kyrs), with the coloured segmentation and numbered DO events overlaid.]

(a) Segmentation results on GISP2 dataset. (b) Segmentation results on NGRIP dataset.

Figure 2.6: Results of the segmentation algorithm on δ18O ice core data (seed = 10). The Dansgaard-Oeschger events are found grouped into two or three main classes with high autocorrelation, MSE, and variance, corresponding to classes C1, C2 and C5 for GISP2 and C1 and C5 for NGRIP. Several Dansgaard-Oeschger events are numbered for reference. (Online version in colour)

2.4 Discussion

As expected, the two ice cores studied showed a few differences with respect to the detectability success of DO events (see Table 2.1), but overall the main characteristics could be captured within the two datasets. For instance, changes in statistical parameters were detected in the GISP2 ice core for DO 3, 5, 6 and 10 with medium success (30%–60%) and with medium to low success (20%–60%) for DO 2, 3, 5, 6, 7 and 10 in the NGRIP ice core, suggesting that these particular DO events possess weak EWSs. Furthermore, the segmentation algorithm could not find any EWS for DO event 9. We suggest here that this particular event does not possess any EWS, i.e. that the transition to a warm ocean circulation mode close to a bifurcation point takes place because of internal noise. The detection of EWSs at the onset of some DO events and their absence in other events is a strong argument in favour of the stochastic resonance model as proposed in Ganopolski and Rahmstorf (2001, 2002). It is worth mentioning that DO 9 is also different from the point of view of its amplitude within a given time period.
[Plot omitted: two 3D scatter plots with normalised variance, mean squared error and autocorrelation on the axes.]

(a) GISP2 (b) NGRIP

Figure 2.7: 3D representation of the clustering results for variance, autocorrelation and MSE (normalised values), where each point is a segment within its own cluster. The centroids are represented by black circles (Online version in colour).

Using a simple event detection algorithm based on the data slope, Rahmstorf (2003) could not detect DO 9, as the associated warming was slower than in the other events documented in Dansgaard et al. (1993).

When analysing the results of segmentation algorithms, one must also consider the segment lengths to obtain meaningful information. It can happen that the algorithm is not able to assign a proper class to a segment and prefers to divide the segments into smaller sections to reduce e.g. MSE and kurtosis values. The new smaller sections are likely to be grouped together in this parameter space, allowing the algorithm to perform the clustering process. Moreover, analysing Eq. (2.11), fitness is directly proportional to the number of segments, so segmentations with a high number of segments will be preferred. One signature of this effect is seen in the fact that all small segments are found in a single class with very low kurtosis ($\gamma_{2s} \in [-1.9, -1.6]$), constant skewness (equal to 0), and a large range of slope coefficients. They are represented by a straight line in Fig. 2.8. Special care was taken to discard those small segments (e.g. containing 2 or 3 points) in the analysis of EWSs. The difficulty of evaluating the clustering quality is the origin of these problems. The sum of squared errors (SSE) directly depends on the number of segments of the segmentation in such a way that, if the number of segments is not taken into account in the fitness function, the segmentations tend to be too simple (few segments per cluster), resulting in too coarse-grained information. As discussed, when introducing the number of segments, the algorithm tends to discover small segments. However, as consecutive small segments are usually labelled with the same class, the final result is still useful. More advanced quality metrics for clustering could avoid this kind of problem.

On the other hand, when considering the EWSs discovered by the algorithm in future events, the length of the segments should be defined. This could be done by analysing the historical data obtained for the time series (for one or even several different seeds).

[Plot omitted: two 3D scatter plots with normalised slope, skewness and kurtosis on the axes.]

(a) GISP2 (b) NGRIP

Figure 2.8: 3D representation of the clustering results for slope, skewness and kurtosis (normalised values), where each point is a segment within its own cluster. The centroids are represented by black circles (Online version in colour).

2.5 Additional details about the Segmentation Algorithm

2.5.1 Generation of each individual for the initial population
We apply the following steps to generate each individual of the initial population (a sketch in code follows the list):

1. We randomly determine the number of cut points, m, as a uniform value in the interval [10k, 15k], where k is the number of clusters to be discovered among the different segments (a user parameter). In this way, we guarantee the existence of at least ten segments on average for each cluster, so a minimum number of patterns is assured for discovering proper clusters.

2. The procedure randomly generates the indexes of these m cut points. These random indexes are generated in such a way that $t_1 < t_2 < \dots < t_{m-1}$, not allowing the repetition of indexes in the chromosomes, which would result in zero-length segments.
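A sketch of this generation step (assuming the series length n is large enough to host m distinct cut points):

```python
import random

def random_individual(n, k=5):
    """One chromosome of the initial population (Section 2.5.1): m is
    drawn uniformly from [10k, 15k] and the m cut-point indices are
    distinct and sorted."""
    m = random.randint(10 * k, 15 * k)
    return sorted(random.sample(range(1, n - 1), m))
```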

2.5.2 Application of the k-means algorithm

The algorithm is applied in the following way:

1. Initialisation of the centroids: we have modified the classic algorithm in the sense that, instead of randomly choosing the initial centroids from the list of segments, we consider a deterministic selection. This deterministic process ensures that a chromosome will always have the same fitness value. First, we choose the segment characteristic for which the difference between the maximum and minimum value is the highest, i.e. the characteristic with the most extreme values. The first centroid is the segment with the highest value in that characteristic. The second centroid is the segment with the highest Euclidean distance from the first centroid previously selected. The third centroid will be the one farthest from both, and so on. This assures a deterministic initialisation process while, at the same time, the initial centroids are as far as possible from each other.

2. Then the usual k-means algorithm is applied, i.e.:
(a) The distance between each segment and all the centroids is calculated.
(b) Each segment is assigned to the cluster of the closest centroid.
(c) The centroids are recalculated as the average value of all the segments belonging to the corresponding cluster.
(d) If the stop condition is not fulfilled, return to (a). The algorithm stops when the centroids are not modified for two consecutive iterations.
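A compact sketch of this deterministic seeding, assuming "farthest from both" means maximising the minimum distance to the centroids chosen so far (one natural reading of the text):

```python
import numpy as np

def deterministic_centroids(X, k):
    """Deterministic k-means seeding of Section 2.5.2: start from the
    extreme point of the widest-ranging characteristic, then repeatedly
    add the point with the largest minimum distance to the centroids
    already chosen."""
    X = np.asarray(X, dtype=float)
    widest = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
    chosen = [int(np.argmax(X[:, widest]))]
    while len(chosen) < k:
        dists = np.min([np.linalg.norm(X - X[c], axis=1) for c in chosen],
                       axis=0)
        chosen.append(int(np.argmax(dists)))
    return X[chosen].copy()
```

Only the seeding differs from standard k-means; steps (a)-(d) above are the usual Lloyd iterations.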

2.6 Additional Examples of Segmentation for the GISP2 and NGRIP datasets

Figure 2.9 presents the detailed segmentation results for GISP2 and NGRIP δ18O ice core data for a seed value of 100. The Dansgaard-Oeschger events are found grouped into two main classes with high autocorrelation, MSE, and variance, corresponding to classes C1 and C4 for GISP2 and classes C1 and C3 for NGRIP for that run.

[Plot omitted: GISP2 (left) and NGRIP (right) δ18O isotope data against time before present (kyrs), with the coloured segmentation and numbered DO events overlaid.]

(a) GISP2 (b) NGRIP

Figure 2.9: Results of the segmentation algorithm on δ18O ice core data (seed = 100). The Dansgaard-Oeschger events are found grouped into two main classes with high autocorrelation, MSE, and variance, corresponding to classes C1 and C4 for GISP2 and C1 and C3 for NGRIP. Several Dansgaard-Oeschger events are numbered for reference (Online version in colour).

Chapter 3

Alternative fitness functions and a new evaluation method

The experiments in Chapter 2 showed some important limitations of the fitness function considered. In this chapter, we consider alternative fitness functions for the segmentation algorithm. Moreover, we propose a method for automatic evaluation of the clustering results. Measuring the quality of a segmentation can otherwise only be achieved by expert evaluation of the solutions given by the algorithm. We present a quantitative method to perform comparisons with respect to an expected ideal segmentation of the series, to assess the robustness and stability of the method. This method allows evaluating a segmentation algorithm with minimal effort from the expert, who only has to provide the ideal segmentation.

The rest of the chapter is organised as follows. Section 3.1 presents the new alternative fitness functions, while Section 3.2 presents a proposal for segmentation comparison and discusses the experimental setting. The results are included in Section 3.3.

3.1 Measuring the quality of the clustering process
As described in Section 2.1.7, the last step of the evaluation of the chromosome is to measure how well the segments are grouped (the compactness of the clustering). It is clear that different clustering algorithms usually lead to different clusters or reveal different clustering structures. In this sense, the problem of objectively and quantitatively evaluating clustering results is particularly important, and it is known in the literature as cluster validation. There are two different testing criteria for this purpose (Xu and Wunsch, 2008): external criteria and internal criteria. When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. In external evaluation, clustering results are evaluated using, for example, known class labels. Based on these concepts, the internal criteria evaluation metrics are the suitable option for the evolution, because the GA is not given any a priori information on the segments to be found. Note that the segment metrics are normalised at this step as well. We have considered four different metrics:
1. Sum of squared errors (SSE): The simplest error measure is the sum of squared errors (considering errors as the distance from each point to its centroid), i.e.:

$SSE = \frac{1}{N}\sum_{i=1}^{k}\sum_{x \in C_i} d(x, c_i)^2$,   (3.1)

where k is the number of clusters, $c_i$ is the centroid of cluster $C_i$ and $d(x, c_i)$ is the Euclidean distance between pattern x and centroid $c_i$. This function does not prevent clusters from falling very close together in the clustering space. As this index has to be minimised, the fitness will be defined as $f = \frac{1}{1+SSE}$.
2. Caliński and Harabasz index (CH): This index has been found to be one of the best performing ones for adjusting the value of k. It is defined as:

$CH = \frac{\mathrm{Tr}(S_B)\cdot(N-k)}{\mathrm{Tr}(S_W)\cdot(k-1)}$,   (3.2)

where N is the number of patterns, and $\mathrm{Tr}(S_B)$ and $\mathrm{Tr}(S_W)$ are the traces of the between-class and within-class scatter matrices, respectively. Note that the value of k will be fixed in our algorithm. As this index has to be maximised, the fitness will be defined as $f = CH$.
3. Davies-Bouldin index (DB): This index attempts to maximise the distance between cluster centroids while minimising the average distance from the points of each cluster to their own centroid. It is calculated as follows:

$DB = \frac{1}{k}\sum_{i=1}^{k} \max_{i \neq j}\left(\frac{\alpha_i + \alpha_j}{d(c_i, c_j)}\right)$,   (3.3)

where $\alpha_i$ is the average distance of all elements in cluster $C_i$ to centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$. As this index has to be minimised, the fitness will be defined as $f = \frac{1}{1+DB}$.
4. Dunn index (DU): The Dunn index attempts to identify clusters that are compact and well-separated. In this case, the distance between two clusters is defined as $d(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$, that is, the minimum distance between a pair of points x and y belonging to $C_i$ and $C_j$. Furthermore, we define the diameter $\mathrm{diam}(C_i)$ of cluster $C_i$ as the maximum distance between two of its members: $\mathrm{diam}(C_i) = \max_{x,y \in C_i} d(x, y)$. Then, the Dunn index is constructed as:

$DU = \min_{i=1,\ldots,k}\left\{\min_{j=i+1,\ldots,k}\left\{\frac{d(C_i, C_j)}{\max_{l=1,\ldots,k} \mathrm{diam}(C_l)}\right\}\right\}$.   (3.4)

The Dunn index has been found to be very sensitive to noise, but this disadvantage can be avoided by considering different definitions of cluster distance or cluster diameter. For example, as suggested in (Xu and Wunsch, 2008), the cluster diameter can be computed as:

$\mathrm{diam}(C_i) = \frac{1}{N_{C_i}(N_{C_i}-1)}\sum_{x,y \in C_i} d(x, y)$,   (3.5)

where $N_{C_i}$ is the number of patterns belonging to cluster $C_i$. This cluster diameter estimation has been found to be more robust in the presence of noise. As this index has to be maximised, the fitness will be $f = DU$.
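As an illustration of the four indices above, the following hedged Python sketch computes them for a set of segment vectors X and an integer cluster labelling (assuming non-empty clusters and k ≥ 2; names are illustrative, not the study's implementation):

    import numpy as np
    from itertools import combinations

    def validity_indices(X, labels, k):
        """SSE, CH, DB and DU for points X (n, d) and labels in {0, ..., k-1}."""
        n = len(X)
        cents = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # SSE (Eq. 3.1): mean squared distance of each point to its centroid
        sse = sum(((X[labels == i] - cents[i]) ** 2).sum() for i in range(k)) / n
        # CH (Eq. 3.2): ratio of between- to within-cluster scatter
        g = X.mean(axis=0)
        tr_b = sum((labels == i).sum() * ((cents[i] - g) ** 2).sum() for i in range(k))
        tr_w = sum(((X[labels == i] - cents[i]) ** 2).sum() for i in range(k))
        ch = (tr_b * (n - k)) / (tr_w * (k - 1))
        # DB (Eq. 3.3): worst ratio of within-cluster spread to centroid separation
        alpha = [np.linalg.norm(X[labels == i] - cents[i], axis=1).mean()
                 for i in range(k)]
        db = np.mean([max((alpha[i] + alpha[j]) / np.linalg.norm(cents[i] - cents[j])
                          for j in range(k) if j != i) for i in range(k)])
        # DU (Eq. 3.4): smallest cluster separation over largest cluster diameter
        diam = max(np.linalg.norm(X[labels == i][:, None] - X[labels == i][None],
                                  axis=2).max() for i in range(k))
        sep = min(np.linalg.norm(X[labels == i][:, None] - X[labels == j][None],
                                 axis=2).min() for i, j in combinations(range(k), 2))
        du = sep / diam
        # fitness transforms: f = 1/(1+SSE), f = CH, f = 1/(1+DB), f = DU
        return sse, ch, db, du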

3.2 Automatic evaluation method and experimental setting
As previously, the dataset chosen for this chapter is the North Greenland Ice Core Project (NGRIP) δ18O ice core data (Andersen et al., 2004; Svensson et al., 2008). The δ18O water isotope record is used as a proxy for past atmospheric temperature. We focus on the 20-yr resolution δ18O isotope records. The dataset is pre-processed by obtaining a 5-point average in order to reduce short-term fluctuations within the data. In this way, the time series we have considered is $\{y^*_n\}_{n=1}^{N/5}$ with $y^*_i = \frac{1}{5}\sum_{j=5i}^{5i+4} y_j$.
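A minimal sketch of this 5-point averaging (Python; any incomplete final block is truncated):

    import numpy as np

    def five_point_average(y):
        """Reduce short-term fluctuations: average consecutive, non-overlapping
        blocks of 5 values, y*_i = (1/5) * sum_{j=5i}^{5i+4} y_j."""
        y = np.asarray(y, dtype=float)
        n = len(y) // 5                      # keep only complete blocks
        return y[:5 * n].reshape(n, 5).mean(axis=1)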

3.2.1 Experimental setting


The experimental design is presented in this subsection. The GA was configured with the following parameters: the number of individuals in the population is t = 100; the crossover probability is $p_c = 0.8$ and the mutation probability $p_m = 0.2$. The number of cut points to be mutated is the integer part of 20% of the total number of cut points. For the initialisation, the number of segments is decided by defining the average segment length, which is set to $s_l = 4$. The maximum number of generations is set to g = 100, and the k-means clustering process is allowed a maximum of 20 iterations. These parameters were optimised by a trial and error procedure, although the algorithm showed a very robust performance with respect to their values. The most important parameters for the final performance of the algorithm were $s_l$ and k.
We performed different experiments considering the 4 different fitness functions presented in Section 3.1 and different values of k for the k-means algorithm (k = 2, ..., 6). It is important to recall that the algorithm estimates the optimal segments and clusters them without any prior information about the DO events. The only information given to the algorithm is the time series and the statistical characteristics to use for the clustering, in order to validate whether the statistics proposed in the literature are useful for characterising paleoclimate TPs in general. Given the stochastic nature of GAs, the algorithm was run 30 times with different seeds to evaluate its stability and robustness.

3.2.2 Automatic evaluation method


In order to evaluate the results of the algorithm, two evaluation metrics were used. These measures analyse both the homogeneity of the cluster assignation with respect to the DO events and the robustness of the results obtained from different seeds. They are not included in the fitness function, serving only as an automatic way of evaluating the quality of the segmentation, avoiding the intervention of the expert. Both are indexes comparing two different clustering partitions (a hedged code sketch of both indexes is given below):

1. Rand index (RI): This metric is particularly useful for data clustering evaluation (Rand, 1971). It is related to accuracy, but is applicable even when class labels are not available for the data, as in our case. A set $Y = \{y_n\}_{n=1}^{N}$ is given (in our case, the time series), and two clustering partitions of Y are to be compared: $X = \{X_1, \ldots, X_r\}$ and $Z = \{Z_1, \ldots, Z_s\}$. For a given segmentation, the partitions are defined in the following way: $X_l$ is the set containing every $y_i \in s_s$ such that $s_s \in C_l$, i.e. the partitions are based on the label assigned to each time series value $y_i$ by the current segmentation. The following two numbers are defined: a (the number of pairs in Y that are in the same set in X and in the same set in Z) and b (the number of pairs in Y that are in different sets in X and in different sets in Z). Then, the Rand index is defined as $RI = (a + b)/\binom{N}{2}$. This metric takes a value between 0 and 1, with 0 indicating that the two partitions do not agree on any pair of points and 1 indicating that they are exactly the same.

2. Adjusted Rand index (ARI): It is a corrected version of the RI (Hubert and Arabie, 1985), trying to fix some known problems of the RI, e.g. that the expected value of the RI for two random partitions does not take a constant value, and that the RI approaches its upper limit of unity as the number of clusters increases. ARI values range from −1 to +1, yielding negative values if the index is less than its expected value. The detailed formulation can be found in Hubert and Arabie (1985).

[Figure 3.1 appears here: the NGRIP series with the onsets of the numbered DO events marked by vertical lines and segments labelled as "DO event" or "Non DO event".]

Figure 3.1: Representation of the ideal segmentation and the different DO events.
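A minimal sketch of both indexes, computing the RI directly from its definition and relying on scikit-learn's adjusted_rand_score for the ARI (an assumed dependency; names are illustrative):

    import numpy as np
    from itertools import combinations
    from sklearn.metrics import adjusted_rand_score  # ARI (Hubert and Arabie, 1985)

    def rand_index(labels_x, labels_z):
        # RI = (a + b) / C(N, 2): fraction of point pairs on which the two
        # partitions agree (same set in both, or different sets in both)
        pairs = list(combinations(range(len(labels_x)), 2))
        agree = sum((labels_x[i] == labels_x[j]) == (labels_z[i] == labels_z[j])
                    for i, j in pairs)
        return agree / len(pairs)

    # each partition is given as one cluster label per time series value y_i
    x = np.array([0, 0, 1, 1, 1, 2])
    z = np.array([0, 0, 0, 1, 1, 1])
    print(rand_index(x, z), adjusted_rand_score(x, z))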
In order to evaluate the segmentation returned by the algorithm, we compare it with an ideal segmentation¹. The ideal segmentation (Figure 3.1) has been designed by examining the literature about Dansgaard-Oeschger (DO) events, which are associated with TPs. In the figure, the onsets of the DO events (in a first approximation, we do not consider the error margin) reported in Svensson et al. (2008) are represented by vertical lines, and the segments covering the period precursor to the DO events (which we hypothesise as TPs) are delimited by the slope close to the corresponding onset. The closer the segmentation returned by the GA is to this ideal segmentation, the better. To perform this comparison, the RI and ARI indexes will be used (ARI Ideal and RI Ideal).
Given that the ideal segmentation is binary (non DO event or DO event) while the segmentation returned by the GA can have k > 2, we need to binarise the segmentation of the GA (i.e. decide which clusters represent the DO events and which do not).
¹ Hypothetically ideal segmentation, based on the available data. The hypothesis is that the onset of the DO events is detected from a combined analysis of benthic sediment data and ice core analysis (Peterson et al., 2000). Those data do not always agree, which accounts for part of the error margin; the dating method contributes the rest of the error.

Table 3.1: NGRIP average segmentation results for different algorithm settings.

Fitness   k   ARI Ideal        RI Ideal         ARI Seeds        RI Seeds
DB        5   0.315 ± 0.060    0.777 ± 0.015    0.346 ± 0.078    0.727 ± 0.040
DU        5   0.308 ± 0.067    0.788 ± 0.018    0.341 ± 0.092    0.727 ± 0.046
CH        5   0.260 ± 0.073    0.772 ± 0.008    0.223 ± 0.105    0.644 ± 0.074
SSE       5   0.279 ± 0.048    0.770 ± 0.018    0.057 ± 0.018    0.638 ± 0.017

Fitness   k   ARI TPs          RI TPs           ARI Seeds        RI Seeds
DB        2   0.171 ± 0.132    0.766 ± 0.001    0.258 ± 0.292    0.821 ± 0.081
DB        3   0.257 ± 0.081    0.758 ± 0.013    0.411 ± 0.152    0.780 ± 0.046
DB        4   0.304 ± 0.045    0.773 ± 0.009    0.412 ± 0.080    0.761 ± 0.037
DB        5   0.315 ± 0.060    0.777 ± 0.015    0.346 ± 0.078    0.727 ± 0.040
DB        6   0.286 ± 0.075    0.779 ± 0.014    0.214 ± 0.109    0.615 ± 0.084

Preliminary experiments revealed that DO events were usually grouped under one or two clusters, so we evaluated ARI Ideal and RI Ideal for all possible combinations of one or two clusters. The final value was the maximum of the ARI Ideal and RI Ideal values over all these combinations. Moreover, the stability of the GA was estimated by comparing the 30 segmentations from the different runs. This was done by averaging the RI and ARI over all possible pairs of segmentations (ARI Seeds and RI Seeds). A hedged sketch of this evaluation procedure is given below.
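A possible implementation of the binarisation and of the seed-stability estimate (Python; function names are hypothetical):

    import numpy as np
    from itertools import combinations
    from sklearn.metrics import adjusted_rand_score

    def best_binarisation_ari(ga_labels, ideal_labels, k):
        # try every combination of one or two clusters as the "DO event" class
        # and keep the best agreement with the ideal segmentation (ARI Ideal)
        candidates = [(c,) for c in range(k)] + list(combinations(range(k), 2))
        return max(adjusted_rand_score(np.isin(ga_labels, c).astype(int),
                                       ideal_labels)
                   for c in candidates)

    def seed_stability(runs):
        # ARI Seeds: average pairwise ARI over the segmentations of the 30 runs
        return np.mean([adjusted_rand_score(a, b)
                        for a, b in combinations(runs, 2)])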

3.3 Results
All these results are included in Table 3.1. The first part of the table compares the different fitness functions for a predefined value of k = 5 (as we initially observed that this value obtained suitable results). As can be seen, both the DB and DU fitness functions obtain very good segmentation quality and stability, although DB performs slightly better. In contrast, CH and SSE perform poorly in both scenarios (the very low stability obtained by the SSE fitness function is noteworthy, and may be due to the fact that it only minimises the intra-cluster distances and ignores the inter-cluster distances). The finding that the algorithm is robust and stable under different initialisations is crucial for the following parts of the study (i.e. developing an early warning system for climate TPs). Concerning the experiment that studies different values of k, it can be seen that k = 5 is indeed the optimal value for the segmentation. This result indicates that the concept and nature of DO events is too complex to consider only a binary approach (TPs versus non TPs). The climate system exhibits a dynamical behaviour with intrinsic variability, hence a binary approach is not able to encompass all the features present within a DO event, making k = 5 a reasonable choice. Moreover, the method can group several DO events together and is still a useful tool to better understand the behaviour of DO events.
The segmentation obtaining the highest ARI Ideal metric (with a value of 0.498) for the DB fitness function, along with a representation of the 18 DO events, can be seen in Figure 3.2. The segments have been coloured according to their cluster assignation. The clusters associated with the DO events are C1 and C5. If we compare this segmentation to the one in Figure 3.1, we can see that almost all DO events are correctly segmented by the algorithm (C1 and C5 segments are always close to the DO onset) and that there are no "false positive" labels (C1 and C5 segments are not found in a non DO event part of the series). However, five events are not detected: 2, 9, 11, 13 and 16 (some of which have been found in the literature to be caused by random fluctuations of the dynamics of the time series, and for which there is no evidence of an increase in the selected statistics). The clustering space of this segmentation can be analysed in Figure 3.3. This figure confirms that there are some differences between the two clusters associated with the DO events (C1 and C5), mainly in the values of the $S_s^2$ metric.

[Figure 3.2 appears here: the NGRIP series with coloured segments and the numbered DO events marked.]

Figure 3.2: Best time series cluster assignation after the evolutionary process.

[Figure 3.3 appears here: three scatter plots of the segments in the clustering space.]

Figure 3.3: Clustering space for the six metrics (each point represents a segment).

Chapter 4

Time Series Forecasting by Evolutionary Recurrent Product Unit Neural Networks

This chapter is focused on Product Unit Neural Networks (PUNNs) and their application to TSF. The basis function of the hidden neurons of PUNNs is the Product Unit (PU) function, where the output of the neuron is the product of its inputs raised to real-valued weights. PUNNs are an alternative to sigmoidal neural networks and are based on multiplicative nodes instead of additive ones (Durbin and Rumelhart, 1989). This model has the ability to express strong interactions between input variables, providing large variations at the output from small variations at the inputs. Consequently, it has an increased information storage capability and promising potential for TSF. However, PUs result in a highly convoluted error function, full of local minima. This handicap makes it convenient to use global search algorithms, such as genetic algorithms (Li et al., 2011), evolutionary algorithms (Luque et al., 2007) or swarm optimisation algorithms (Cai et al., 2004), in order to find the parameters minimising the error function. PUNNs have been widely used in classification (Hervás-Martínez and Martínez-Estudillo, 2007) and regression problems (Martínez-Estudillo et al., 2006), but scarcely applied to TSF, with the exception of some attempts on hydrological TSA (Karunasingha et al., 2011; Piotrowski and Napiorkowski, 2012). It is important to point out that, in TSF, there is an autocorrelation between the lagged values of the series. In this way, theoretically, PUNNs should constitute an appropriate model for TSF, because they can easily model the interactions (correlations) between the lagged values of the time series.
The first goal of this chapter is to evaluate the performance of the Autoregressive Product Unit Neural Network (ARPUNN) on TSF. The ARPUNN model should yield high performance for TSF, as it fulfils the requirements that allow the modelling of TS: the ability to express the interactions between inputs and an increased storage capability. However, as mentioned above, long term memory

ANNs usually obtain better results than FFNNs. For this reason, a second goal of this work is to propose a hybrid ANN combining an ARPUNN with a reservoir network, with the objective of increasing the storage capability of the resulting model. The short-term memory is provided by the different lags of the TS included in the input layer, while the long-term memory is supplied by a reservoir network included as one of the inputs of the system. The final model is called Recurrent Product Unit Neural Network (RPUNN).
From the point of view of the learning algorithm, the complex error surface associated with PUs implies serious difficulties in searching for the best parameters minimising the error function. A novel hybrid algorithm is proposed in this work to alleviate these difficulties. It combines the exploration abilities of global search algorithms with the exploitation abilities of local search methods. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm (Hansen, 2006; Jastrebski and Arnold, 2006) is used to calculate the parameter values of the hidden layer, whereas the weights of the output layer are determined by means of the MP generalised inverse. This combination provides us with a hybrid model and algorithm capable of tackling the difficulties of TSF, obtaining a competitive performance.
This chapter is organised as follows: Section 4.1 describes the hybrid ANN model proposed for TSF. Section 4.2 explains the hybrid search algorithm designed to obtain the parameters which optimise the error function. Sections 4.3 and 4.4 explain the experiments that were carried out and the results obtained.

4.1 Models

In this section, we first introduce the ARPUNN model, which is then extended by considering a reservoir to yield the RPUNN model. The proposed models address the TSF problem, which is mathematically formulated as follows. Let $\{y_n\}_{n=1}^{N+p}$ be a TS to be predicted, where N + p TS values are given for training. In this way, the function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is estimated from a training set of N patterns, $D = (X, Y) = \{(x_n, y_{n+p})\}_{n=1}^{N}$, where $x_n = (y_{n+p-1}, y_{n+p-2}, \ldots, y_n)$ is the vector of input characteristics (the p past values of the TS) taking values in the space $\Omega \subset \mathbb{R}^p$, and the label, $y_{n+p}$, is the value of the TS at instant n + p. A hedged sketch of the pattern construction is shown below; both models are then explained in the following subsections.
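A minimal sketch of this pattern construction (Python; names are illustrative):

    import numpy as np

    def make_patterns(y, p):
        """Build D = {(x_n, y_{n+p})}: x_n = (y_{n+p-1}, ..., y_n) holds the p
        lagged values and the target is the next value y_{n+p}."""
        y = np.asarray(y, dtype=float)
        X = np.array([y[n:n + p][::-1] for n in range(len(y) - p)])
        return X, y[p:]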

4.1.1 Short memory model: Autoregressive Product Unit Neural Network (ARPUNN)
This section presents the first model proposed to address the TSF problem, the so-called ARPUNN. The suggested architecture is based on considering PUs as the basis functions for the hidden layer of the network. PU neural network models have the ability to express strong interactions between the input variables. The model is composed of an input, a hidden and an output layer. The input layer has p input units that correspond to the lagged values of the TS, providing the network with a short memory. The hidden layer of the network is composed of S PUs and the output layer contains only one neuron. A representation of the proposed model has been included in the supplementary material of the paper associated with this chapter¹.
The final model is linear in the basis function space together with the initial variables. A similar architecture (usually referred to as skip-layer connections) was also considered for classification in previous works on PUs (Hervás-Martínez and Martínez-Estudillo, 2007; Gutiérrez et al., 2010) and Radial Basis Functions (RBFs) (Gutiérrez et al., 2011). The TS value is estimated by $\hat{y}_{n+p} = f(x_n, \theta): \mathbb{R}^p \rightarrow \mathbb{R}$, where the final output of the model is defined as:

$f(x_n, \theta) = \beta_0 + \sum_{s=1}^{S} \beta_s B_s(x_n, w_s) + \sum_{k=1}^{p} \alpha_k y_{n+p-k}$,   (4.1)

where $\beta_s \in \mathbb{R}$ denotes the weight of the connection between hidden neuron s and the output neuron (s = 1, 2, ..., S), leading to the structure that provides the nonlinear contribution of the inputs. The $\beta$ vector includes all the parameters connecting the hidden layer with the output layer, together with the bias: $\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_S) \in \mathbb{R}^{S+1}$. The linear contribution of the inputs is controlled by $\alpha_k$, which is the weight of the connection between input k and the output layer (k = 1, 2, ..., p). The vector $\alpha$ contains all the parameters connecting the input and the output layer, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_p) \in \mathbb{R}^p$. A further set of weights, $w_s \in \mathbb{R}^p$, represents the connections between hidden neuron s and the input layer. The $\theta$ vector contains the full set of parameters ($\theta = \{w_1, w_2, \ldots, w_S, \beta, \alpha\}$). Finally, $B_s(x_n, w_s): \mathbb{R}^p \rightarrow \mathbb{R}$ represents the output of the s-th PU basis function, defined as:

$B_s(x_n, w_s) = \prod_{i=1}^{p} (y_{n+p-i})^{w_{is}}, \quad 1 \leq s \leq S$,   (4.2)

where $w_{is} \in \mathbb{R}$ is the weight of the connection between the i-th input node and the s-th basis function, and $y_{n+p-i}$ denotes the i-th lagged past value of the TS.
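As an illustration, Eqs. (4.1)-(4.2) can be evaluated for a single input pattern as in the following hedged sketch (names are illustrative; inputs are assumed strictly positive, cf. the scaling in Section 4.3):

    import numpy as np

    def arpunn_output(x, beta, alpha, W):
        """Eqs. (4.1)-(4.2): x holds the p (positive) lagged values, W is a
        (p, S) matrix with one exponent vector per PU, beta has S+1 entries
        (bias first) and alpha has p entries for the skip-layer linear part."""
        B = np.prod(x[:, None] ** W, axis=0)       # PU basis outputs B_s, shape (S,)
        return beta[0] + beta[1:] @ B + alpha @ x  # bias + nonlinear + linear terms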

4.1.2 Long memory model: Recurrent Product Unit Neural Network (RPUNN)
In this section, the long memory model is presented (the Recurrent Product Unit Neural Network, RPUNN). The RPUNN model builds on the ARPUNN presented previously and reuses its network architecture. One aspect that should be considered in TSF is the memory, i.e. the amount of information that can be stored in the network. Traditionally, ANNs with a longer memory show an enhanced performance in TSF. The main difference between the ARPUNN and the RPUNN lies in the inclusion of a new structure as an input: a reservoir network. The reservoir provides the whole structure with a long-term, dynamic memory. The structure of the RPUNN is depicted in Figure 4.1².
As can be seen, the network inherits the architecture of the ARPUNN, with the linear and nonlinear combination of the inputs described in the previous section.
¹ http://www.esa.int/gsp/ACT/cms/projects/rpunn.html
² For the sake of clarity, the reservoir representation is simplified: there is a link between each reservoir node and each PU, and all reservoir nodes receive the $y_{t-1}$ time series value as input. The interconnections between reservoir nodes are random. Internal connections of the reservoir are given by $\kappa$.

Figure 4.1: Architecture of the Recurrent Product Unit Neural Network (RPUNN).

The output layer contains only one neuron, while the hidden layer of the network is composed of S neurons whose basis function is the PU. The input layer considered has p + m neurons, corresponding to the p lagged values of the TS plus the m outputs of the reservoir network. The p lagged values provide the network with the short-term memory. The reservoir part is formed by a set of m nodes, and the output of each of these nodes is considered as an input to the RPUNN, providing the whole structure with a dynamic memory. The only input considered for the reservoir is the first lagged value of the input of the network. The estimated TS value is defined by the final output of the model, $\hat{y}_{n+p} = f(x_n, \theta): \mathbb{R}^{m+p} \rightarrow \mathbb{R}$, as follows:

$f(x_n, \theta) = \beta_0 + \sum_{s=1}^{S} \beta_s B_s(x_n, \psi^{(n)}, w_s) + \sum_{k=1}^{p} \alpha_k y_{n+p-k}$,   (4.3)

where $\psi^{(n)} \in \mathbb{R}^m$ is the reservoir state vector at time n, and $\theta = \{w_1, w_2, \ldots, w_S, \beta, \alpha, \kappa\}$ represents the set of network weights, composed of the vectors $\beta \in \mathbb{R}^{S+1}$ and $\alpha \in \mathbb{R}^p$ (previously defined), the vectors $w_s \in \mathbb{R}^{m+p}$, which represent the connections between hidden neuron s and the input layer, $s = 1, \ldots, S$, and, finally, the matrix of connections for the reservoir network, $\kappa \in \mathbb{R}^{m \times (m+2)}$. Finally, $B_s(x_n, \psi^{(n)}, w_s): \mathbb{R}^{m+p} \rightarrow \mathbb{R}$ represents the basis function considered in the hidden layer, yielding the following nonlinear output for the model:
$B_s(x_n, \psi^{(n)}, w_s) = \prod_{i=1}^{p} (y_{n+p-i})^{w_{is}} \prod_{j=p+1}^{p+m} \left(\psi_j^{(n)}\right)^{w_{js}}$,   (4.4)

where $1 \leq s \leq S$, $w_s = (w_{1s}, \ldots, w_{ps}, w_{(p+1)s}, \ldots, w_{(p+m)s}) \in \mathbb{R}^{m+p}$ is the hidden layer weight vector, $w_{is} \in \mathbb{R}$ is the weight of the connection between input neuron i and hidden neuron s, $1 \leq i \leq p$, and $w_{(p+j)s}$ is the weight of the connection between the j-th reservoir node and hidden neuron s, $1 \leq j \leq m$. Finally, $\psi_j^{(n)}$ represents the output of the j-th reservoir node at time n, $1 \leq j \leq m$, and the corresponding vector is $\psi^{(n)} = (\psi_1^{(n)}, \ldots, \psi_m^{(n)})$.

The reservoir consists of a sparsely connected group of nodes, where each neuron output is randomly assigned to the input of another neuron. This allows the reservoir to reproduce specific temporal patterns. All the reservoir nodes are sigmoidal nodes, as this model is better suited to keeping a long-term memory of the TS:

$\psi_j^{(n)} = R_j(\psi^{(n-1)}, \kappa_j) = \sigma\left(\kappa_{0j} + \sum_{i=1}^{m} \kappa_{ij}\,\psi_i^{(n-1)} + \kappa_{(m+1)j}\,y_{n-1}\right)$,   (4.5)

where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoidal activation function, and $\kappa_j$ is the vector of parameters corresponding to the j-th reservoir neuron, $\kappa_j = (\kappa_{0j}, \kappa_{1j}, \ldots, \kappa_{mj}, \kappa_{(m+1)j})$, with m + 2 elements. As can be observed, self-connections are allowed. The internal structure of the reservoir is randomly fixed and kept constant during the learning process, in the same vein as ESNs (Gallicchio and Micheli, 2011).
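A hedged sketch of the reservoir construction and of one update of Eq. (4.5) follows (Python; the ∼90% sparsity and the weight distribution are assumptions consistent with Section 4.2):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def random_reservoir(m, density=0.1, rng=np.random.default_rng(0)):
        """Random, fixed reservoir weights kappa of shape (m, m+2): column 0
        holds the biases, columns 1..m the recurrent weights (about 90% zeroed
        for sparsity), and the last column the input weights on y_{n-1}."""
        kappa = rng.standard_normal((m, m + 2))
        kappa[:, 1:-1] *= rng.random((m, m)) < density
        return kappa

    def reservoir_step(psi_prev, y_prev, kappa):
        """One update of Eq. (4.5) for all m reservoir nodes at once."""
        return sigmoid(kappa[:, 0] + kappa[:, 1:-1] @ psi_prev
                       + kappa[:, -1] * y_prev)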

4.2 Parameter Estimation


This section discusses the training algorithm proposed to fit the ARPUNN and RPUNN parameters. As stated above, PUNNs exhibit a highly convoluted error surface, which can easily make the training algorithm get stuck in local minima and, in consequence, prevent the optimum parameters from being obtained. In general, this can be tackled by using global search algorithms, but these can be slow to reach the global optimum. The method considered in this work focuses on obtaining a trade-off between both extremes, which is achieved by a hybrid algorithm. The parameter set to be optimised in the ARPUNN model is $\theta = \{\beta, \alpha, w_1, w_2, \ldots, w_S\}$, which is composed of the set of weights of the hidden layer nodes ($w_1, w_2, \ldots, w_S$) and the set of weights of the output layer, $\beta$ and $\alpha$. In the case of the RPUNN, it is also required to estimate the values of the parameters included in the vector $\kappa$, i.e. the weights of the reservoir interconnections.
The algorithm starts with the CMA-ES method as a global optimisation procedure (Hansen, 2006). CMA-ES is an evolutionary algorithm for difficult nonlinear, non-convex optimisation problems in continuous domains. The evolution strategy defined in this algorithm is based on the use of a covariance matrix that represents the pairwise dependencies between the candidate values of the variables to be optimised. The distribution of the covariance matrix is updated by means of the covariance matrix adaptation method, which attempts to learn a second-order model of the cost function, similar to the optimisation performed in Quasi-Newton methods (Saini and Soni, 2002). CMA-ES has several invariance properties and does not require complex parameter tuning. In this study, uncertainty is handled as proposed in Hansen et al. (2009) and a subtractive update of the covariance matrix is performed as in Jastrebski and Arnold (2006). A further consideration is to adapt only the diagonal of the covariance matrix for a number of initial iterations, as stated in Ros and Hansen (2008), leading to faster learning. The cost function according to which the weights are optimised is the Root Mean Squared Error (RMSE).

For both the ARPUNN and RPUNN models, the target parameters under optimisation by the CMA-ES algorithm are the weights from the input layer to the hidden layer, $\{w_1, w_2, \ldots, w_S\}$. The hybrid algorithm starts by randomly generating the values of these weights. Although the rest of the weights are needed to evaluate the cost function, they can be calculated analytically by using the MP generalised inverse, as done in the ELM (Huang et al., 2012). This process has to be performed in each iteration of the CMA-ES algorithm and for each individual of the population. Let $\phi = (\beta_1, \ldots, \beta_S, \alpha_1, \ldots, \alpha_p)^T$ denote the weights of the links connecting the hidden and output layers. The calculation of $\phi$ can be done by taking into account that the system is linear if the basis function space is considered. In this way, the nonlinear system can be converted into a linear system:

$Y = H\phi$,   (4.6)

where $H = \{h_{ij}\}$ ($i = 1, \ldots, N$ and $j = 1, \ldots, S + p$) represents the output matrix of the hidden and input layers: if $1 \leq j \leq S$, $h_{ij} = B_j(x_i, w_j)$ (for the ARPUNN model) or $h_{ij} = B_j(x_i, \psi^{(i)}, w_j)$ (for the RPUNN model); if $S < j \leq S + p$, $h_{ij} = y_{i+p-(j-S)}$. Finally, $\phi$ can be determined by finding the least-squares solution of the equation:

$\hat{\phi} = H^{\dagger} Y$,   (4.7)

where $H^{\dagger}$ is the MP generalised inverse of the matrix H. The solution provided by this method is unique and has the smallest norm among all least-squares solutions. In addition, it achieves a high generalisation performance while decreasing the time required for learning, as stated in Huang et al. (2004).
The parameters of the reservoir in the RPUNN model are randomly fixed before starting the CMA-ES optimisation and then kept constant for the whole evolution. Sparsity is achieved by randomly setting to 0 a percentage (in our case, ∼90%) of the weights of the connections between reservoir nodes (i.e. $\kappa_{ij} = 0$ for some randomly selected values of i, $1 \leq i \leq m$). A hedged sketch of the hybrid scheme on a toy ARPUNN is given below.
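The study does not specify its implementation; this sketch assumes the third-party cma package for CMA-ES, and solves the output layer with the MP generalised inverse inside the cost function (Eq. 4.7). All data and dimensions are synthetic placeholders:

    import numpy as np
    import cma  # assumed third-party CMA-ES package (pip install cma)

    rng = np.random.default_rng(0)
    p, S = 4, 5
    y = 0.1 + 0.8 * rng.random(200)            # toy series scaled to [0.1, 0.9]
    X = np.array([y[n:n + p][::-1] for n in range(len(y) - p)])
    Y = y[p:]

    def rmse(w_flat):
        """Cost of one CMA-ES candidate: the hidden weights W come from the
        global search; the output weights phi are solved via Eq. (4.7)."""
        W = w_flat.reshape(p, S)
        H = np.hstack([np.prod(X[:, :, None] ** W[None], axis=1), X])  # basis + inputs
        phi = np.linalg.pinv(H) @ Y                                    # phi = H^+ Y
        return float(np.sqrt(np.mean((H @ phi - Y) ** 2)))

    # global search over the hidden-layer weights only
    es = cma.CMAEvolutionStrategy(np.zeros(p * S), 0.5,
                                  {'maxiter': 50, 'verbose': -9})
    while not es.stop():
        candidates = es.ask()
        es.tell(candidates, [rmse(np.asarray(c)) for c in candidates])

After the loop, the best hidden-layer weight vector found is available from the optimiser (es.result).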

4.3 Experiments
In order to analyse the performance of the proposed methods, an experimental study was carried out. The TS data selected, the metrics considered to evaluate the performance of the models and the algorithms used for comparison purposes are described in the following subsections.

4.3.1 Datasets Selected


The TS datasets used for the experimental setup belong to the NNGC1, Acont, B1dat, D1dat and Edat forecasting competitions³. These datasets were also considered in Bergmeir et al. (2012). A total of 29 time series available in the KEEL-dataset repository⁴ (Alcalá-Fdez et al., 2011) have been used. A detailed description of the datasets considered and a table with their characteristics have been included in the supplementary material of the paper.
³ Available at http://www.neural-forecasting-competition.com
⁴ Available at http://sci2s.ugr.es/keel/timeseries.php

The datasets have been preprocessed to adapt the inputs to the mathematical characteristics of the PU-based models: input variables have been scaled to the range [0.1, 0.9]⁵. The experimental design was conducted using 5-fold cross-validation, with 10 repetitions per fold. A minimal sketch of this scaling is shown below.
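A minimal sketch of the scaling (Python; names are illustrative):

    import numpy as np

    def scale_series(y, lo=0.1, hi=0.9):
        """Map a series linearly into [0.1, 0.9]: PU inputs must stay strictly
        positive (to avoid complex outputs) and away from exactly 0 and 1."""
        y = np.asarray(y, dtype=float)
        return lo + (hi - lo) * (y - y.min()) / (y.max() - y.min())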

4.3.2 Metrics Considered for Evaluation


The metrics considered in this brief are the Root Mean Square Error (RMSE) and the Number of Hidden Nodes (NHN). Given that all the models consider fully connected neurons, NHN is a measure of the size of the neural network. Neural networks are very sensitive to this value, and larger networks generally require a longer processing time (Crone and Dhawan, 2007).

4.3.3 Algorithms Selected for Comparison Purposes


In order to evaluate the performance of the RPUNN and ARPUNN models, they have been compared to some of the most promising neural network models for TSF. Aiming to outline different characteristics of the methods, the compared methods have been grouped into two sets. The main objective of the first set of models is to compare the ARPUNN and RPUNN methods to baseline algorithms. This set is composed of the following algorithms:
• A Nonlinear Autoregressive Neural Network (NARNN) (Chow and Leung, 1996), whose parameters have been determined by the Broyden-Fletcher-Goldfarb-Shanno gradient-based algorithm (Nawi et al., 2006).
• The Echo State Network (ESN) (Rodan and Tiňo, 2011).
• The Extreme Learning Machine (ELM) method (Huang et al., 2012).
The second set of models is selected with the purpose of analysing the performance of the PU basis functions for TSF. For this reason, the two proposed models are compared to ANN models trained with the same algorithm, but considering other basis functions. The models employed in this set are:
• The Nonlinear Autoregressive Radial Basis Function Neural Network (NARRBFNN).
• The Nonlinear Autoregressive Sigmoid Neural Network (NARSIGNN).
All the hyperparameters considered in this chapter were estimated by a nested five-fold cross-validation procedure. The metric considered to determine the best configuration of parameters was the RMSE. The most important hyperparameter was the NHN, and the range of possible values considered for model selection depends on the model:
• In the case of the NARNN, NARRBFNN, NARSIGNN, ARPUNN and RPUNN algorithms, the experiment was carried out considering the set S ∈ {5, 10, 15, 20}.
⁵ Scaling the input data to positive values is required to avoid having complex numbers as the output of the basis function. Additionally, the scaling considered also avoids having inputs equal to zero or one.

• The ESN and ELM algorithms require a higher number of hidden neurons, so that the network is supplied with sufficiently informative random projections (Huang et al., 2012). In this case, the set of hidden nodes considered is S ∈ {10, 20, 50, 100, 150, 200, 300}.
Further considerations on the parameter values for the models can be found in the supplementary material of the paper.

4.4 Results

For all of the 29 data series, models were trained, predictions were made on the test set, and the RMSE and NHN were computed for these predictions. The detailed tables of results (for the two metrics considered) can be found in the supplementary material of this brief. Table 4.1 reports the averaged results over all the series for the methods compared (including the averaged value of the metric and the averaged ranking). As can be seen in Table 4.1, the RPUNN model yielded the best mean RMSE (RMSE_G = 0.0583 and R_RMSE_G = 1.793), followed by the ARPUNN model (RMSE_G = 0.0624 and R_RMSE_G = 2.7586). Despite this improvement, there are some datasets where the results were not as accurate as expected, which might be caused by the degrees of freedom of the proposed models. On the other hand, the minimum NHN is obtained by the ESN model, with a mean of 12.06, followed by the ARPUNN model, with a mean of 12.65. In terms of ranking, the best results are obtained by the ARPUNN model, with a 2.46 mean position, followed by the ESN model, with a 2.75 mean position. The RPUNN does not obtain results as good as in the case of the RMSE, with an NHN of 15.10 and an R_NHN of 4.10. These outcomes show that the ARPUNN model is highly competitive regarding the simplicity of the network. However, in order to achieve its best performance, the RPUNN model requires a larger architecture, leading to a higher NHN. Boxplots of the rankings of RMSE_G and NHN can be seen in Figure 4.2, which illustrates the performance discussed above.

Table 4.1: Summary of results for RMSE and NHN as the test variables, including results of the Holm test for rankings.

Method      RMSE_G    R_RMSE_G    NHN       R_NHN
NARNN       0.0652    3.7586 •    14.37     3.41
NARRBFNN    0.0881    5.3448 •    14.82     3.68 •
ESN         0.0718    4.3793 •    12.06     2.75
ELM         1.1155    5.1379 •    163.69    6.70 •
NARSIGNN    0.0689    4.8275 •    16.58     4.86 •
ARPUNN      0.0624    2.7586 •    12.65     2.46 C
RPUNN       0.0583    1.7931 C    15.10     4.10 •

The best result is in bold face and the second one in italics.
C: control method (Holm test)
•: significant differences w.r.t. the control method (Holm test)

[Figure 4.2 appears here: two boxplots, (a) test variable R_RMSE_G and (b) test variable R_NHN, for the seven methods compared.]

Figure 4.2: Boxplots of the average rankings of RMSE_G (RMSE over the generalisation set) and NHN over the 29 datasets.

The results provided in this brief were validated using non-parametric tests. The Friedman test detected significant differences at a significance level of α = 0.10. Based on this fact, the Holm test was applied (at the same confidence level), considering the RPUNN algorithm as the control method for the RMSE_G variable and the ARPUNN method for the NHN variable, because they obtain the best mean rankings for these metrics. The Holm test shows that the RPUNN model performs significantly better than the rest of the models considered. Regarding the NHN metric, the ARPUNN method shows a significantly lower size than the NARRBFNN, RPUNN, NARSIGNN and ELM methods. A full description of the non-parametric tests applied and the results of all of them are included in the supplementary material of this brief.

Chapter 5

Conclusions

This Ariadna study has introduced two different tools for analysing paleoclimate data: 1) a novel genetic algorithm (GA) (covered in Chapters 2 and 3), and 2) two novel time series forecasting (TSF) models.
The GA, taken from the field of time series segmentation, has been applied to paleoclimate data to identify common patterns that could act as early warning signals for abrupt climate change. The segments are represented in a six-dimensional space whose dimensions correspond to statistical metrics that contain information about a system undergoing a critical transition, and they are automatically grouped by a clustering algorithm to uncover common prototypes of segments throughout the time series. These common patterns can be visualised in a straightforward manner by looking at their segment class label. The GA presents differentiating characteristics with respect to previous time series segmentation algorithms, especially in the generation of the initial population, in the mutation operators based on moving cut points and in the fitness function. The clustering process and the GA complement each other, with the final aim of achieving a higher-level representation of the time series information. Despite being a stochastic algorithm, the GA shows a robust behaviour on different datasets, independently of the algorithm seeds, with very low standard deviations of the fitness values.
Experimental results show that early warning signals of Dansgaard-Oeschger events could be robustly found for several of these events, in the form of an increase in autocorrelation, variance and mean squared error, in both the GISP2 and NGRIP δ18O ice core data. The GA applied to the NGRIP δ18O ice core record showed that an increasing autocorrelation coefficient cannot be used on its own as an indicator of climate change. The quantitative results presented strongly support the stochastic resonance model hypothesis brought forward to explain abrupt Dansgaard-Oeschger climate events. Finally, the proposed approach provides a novel visualisation tool in the field of climate time series analysis and detection of critical transitions.
In addition, the GA has been complemented by the evaluation of different fitness functions and a new method for automatically assessing the performance of the algorithm. From the different experiments, the Davies-Bouldin index presented in Section 3.1 is the best performing fitness function. Future work on this first contribution includes extending the method to find early warning signals, and considering other time series datasets, mutation and crossover operators and fitness functions.
This study has also proposed two new artificial neural network (ANN) models based on the use of product units (PUs) as the basis function for TSF. The interest in the use of PUs arises from their ability to express strong interactions between input variables, a truly important feature in TSF, where there is autocorrelation between the lagged values of the time series. Two PU neural network (PUNN) models have been implemented: the autoregressive PUNN (ARPUNN) and the recurrent PUNN (RPUNN), which consists of an enhanced version of the ARPUNN. The architecture of the ARPUNN considers a short-term memory provided by the lagged values of the time series, whereas the RPUNN model includes an additional set of inputs supplied by a reservoir network, which provides a long-term memory to the model. The parameters of the models were determined by a hybrid learning algorithm that combines a global and a local search method (the CMA-ES algorithm and the MP generalised inverse, respectively). The proposed models have been implemented, tested and compared to state-of-the-art ANNs for TSF. The results show that the introduced models present a very good performance in terms of root mean squared error (RMSE).

Bibliography

Ahmed, H. and Rauf, F. (1991). NADINE: a feedforward neural network for arbitrary nonlinear time series. In Neural Networks, 1991. IJCNN-91-Seattle International Joint Conference on, volume 2, pages 721–726.
Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., and Garcı́a, S. (2011).
Keel data-mining software tool: Data set repository, integration of algorithms
and experimental analysis framework. Multiple-Valued Logic and Soft Com-
puting, 17(2-3):255–287.
Alley, R. B., Marotzke, J., Nordhaus, W. D., Overpeck, J. T., Peteet, D. M.,
Pielke, R. A., Pierrehumbert, R. T., Rhines, P. B., Stocker, T. F., Tal-
ley, L. D., and Wallace, J. M. (2003). Abrupt climate change. Science,
299(5615):2005–2010.
Andersen, K. K., Azuma, N., Barnola, J.-M., Bigler, M., Biscaye, P., Caillon,
N., Chappellaz, J., Clausen, H. B., Dahl-Jensen, D., Fischer, H., et al. (2004).
High-resolution record of northern hemisphere climate extending into the last
interglacial period. Nature, 431(7005):147–151.
Arroyo, J. and Maté, C. (2009). Forecasting histogram time series with k-nearest
neighbours methods. International Journal of Forecasting, 25(1):192–207.
Arzel, O., England, M. H., De Verdiere, A. C., and Huck, T. (2012). Abrupt
millennial variability and interdecadal-interstadial oscillations in a global cou-
pled model: sensitivity to the background climate state. Climate dynamics,
39(1-2):259–275.
Ashwin, P., Wieczorek, S., Vitolo, R., and Cox, P. (2012). Tipping points in
open systems: bifurcation, noise-induced and rate-dependent examples in the
climate system. Philosophical Transactions of the Royal Society A: Mathe-
matical, Physical and Engineering Sciences, 370(1962):1166–1184.
Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., and Baskurt, A. (2011). In
Salah, A. and Lepri, B., editors, Human Behavior Understanding, volume
7065 of Lecture Notes in Computer Science, pages 29–39. Springer Berlin
Heidelberg.

Bennett, K. D. (1996). Determination of the number of zones in a biostrati-


graphical sequence. New Phytologist, 132(1):155–170.

Bergmeir, C., Triguero, I., Molina, D., Aznarte, J., and Benitez, J. (2012).
Time series modeling and forecasting using memetic algorithms for regime-
switching models. Neural Networks and Learning Systems, IEEE Transactions
on, 23(11):1841–1847.
Bond, G., Heinrich, H., Broecker, W., Labeyrie, L., McManus, J., Andrews, J.,
Huon, S., Jantschik, R., Clasen, S., Simet, C., Tedesco, K., Klas, M., and
Bonani, G. (1992). Evidence for massive discharges of icebergs into the north
atlantic ocean during the last glacial period. Nature, 360:245 – 249.
Broecker, W. S. (1998). Paleocean circulation during the last deglaciation: A
bipolar seesaw? Paleoceanography, 13(2):119 – 121.
Cai, X., Zhang, N., Venayagamoorthy, G., and Wunsch, D. (2004). Time series
prediction with recurrent neural networks using a hybrid pso-ea algorithm.
In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Con-
ference on, volume 2, pages 1647–1652 vol.2.
Chow, T. W. S. and Leung, C. (1996). Nonlinear autoregressive integrated neu-
ral network model for short-term load forecasting. Generation, Transmission
and Distribution, IEE Proceedings-, 143(5):500–506.
Chung, F.-L., Fu, T.-C., Ng, V., and Luk, R. W. (2004). An evolutionary ap-
proach to pattern-based time series segmentation. Evolutionary Computation,
IEEE Transactions on, 8(5):471–489.
Cimatoribus, A. A., Drijfhout, S. S., Livina, V., and van der Schrier, G. (2013).
Dansgaard–oeschger events: bifurcation points in the climate system. Clim.
Past, 9:323–333.
Claussen, M., Ganopolski, A., Brovkin, V., Gerstengarbe, F.-W., and Werner,
P. (2003). Simulated global-scale response of the climate system to dans-
gaard/oeschger and heinrich events. Climate Dynamics, 21(5-6):361–370.
Connor, J., Martin, R., and Atlas, L. (1994). Recurrent neural networks
and robust time series prediction. Neural Networks, IEEE Transactions on,
5(2):240–254.
Crone, S. and Dhawan, R. (2007). Forecasting seasonal time series with neural
networks: A sensitivity analysis of architecture parameters. In Neural Net-
works, 2007. IJCNN 2007. International Joint Conference on, pages 2099–
2104.
Crucifix, M. (2012). Oscillators and relaxation phenomena in pleistocene climate
theory. Philosophican Transactions of the Royal Society A, 370(1962):1140–
1165.
Dakos, V., Carpenter, S. R., Brock, W. A., Ellison, A. M., Guttal, V., Ives,
A. R., Kefi, S., Livina, V., Seekell, D. A., Van Nes, E. H., et al. (2012).
Methods for detecting early warnings of critical transitions in time series
illustrated using simulated ecological data. PLoS One, 7(7):e41010.
Dakos, V., Scheffer, M., van Nes, E. H., Brovkin, V., Petoukhov, V., and Held,
H. (2008). Slowing down as an early warning signal for abrupt climate change.
Proceedings of the National Academy of Sciences, 105(38):14308–14312.

Dansgaard, W., Johnsen, S., Møller, J., and Langway, C. (1969). One thousand
centuries of climatic record from camp century on the greenland ice sheet.
Science, 166(3903):377–380.

Dansgaard, W., Johnsen, S. J., Clausen, H. B., Dahl-Jensen, D., Gundestrup,


N. S., Hammer, C. U., Hvidberg, C. S., Steffensen, J. P., Sveinbjornsdottir,
A. E., Jouzel, J., and Bond, G. (1993). Evidence for general instability of
past climate from a 250-kyr ice-core record. Nature, 364:218–220.

Ditlevsen, P. D. and Johnsen, S. J. (2010). Tipping points: Early warning and


wishful thinking. Geophysical Research Letters, 37(19):L19703.
Karunasingha, D. S. K., Jayawardena, A. W., and Li, W. K. (2011). Evolutionary product unit based neural networks for hydrological time series analysis. Journal of Hydroinformatics, 13(4):825–841.

Durbin, R. and Rumelhart, D. (1989). Products units: A computationally


powerful and biologically plausible extension to backpropagation networks.
Neural Computation, 1(1):133–142.
Gallicchio, C. and Micheli, A. (2011). Architectural and markovian factors of
echo state networks. Neural Netw., 24(5):440–456.

Ganopolski, A. and Rahmstorf, S. (2001). Rapid changes of glacial climate


simulated in a coupled climate model. Nature, 409.
Ganopolski, A. and Rahmstorf, S. (2002). Abrupt glacial climate changes due
to stochastic resonance. Physical Review Letters, 88(3):038501.

Gonzalo, J. and Ng, S. (2001). A systematic framework for analyzing the dy-
namic effects of permanent and transitory shocks. Journal of Economic Dy-
namics and Control, 25(10):1527 – 1546.
Grootes, P. M. and Stuiver, M. (1997). Oxygen 18/16 variability in greenland
snow and ice with 10-3- to 105-year time resolution. Journal of Geophysical
Research: Oceans, 102(C12):26455–26470.
Gutiérrez, P. A., Hervás-Martı́nez, C., and Martı́nez-Estudillo, F. J. (2011).
Logistic regression by means of evolutionary radial basis function neural net-
works. IEEE Transactions on Neural Networks., 22(2):246–263.

Gutiérrez, P. A., Segovia-Vargas, M. J., Salcedo-Sanz, S., Hervás-Martı́nez,


C., Sanchı́s, A., Portilla-Figueras, J. A., and Fernández-Navarro, F. (2010).
Hybridizing logistic regression with product unit and rbf networks for accurate
detection and prediction of banking crises. Omega, 38(5):333–344.
Hansen, J. and Nelson, R. (1997). Neural networks and traditional time se-
ries methods: a synergistic combination in state economic forecasts. Neural
Networks, IEEE Transactions on, 8(4):863–873.
Hansen, N. (2006). The cma evolution strategy: A comparing review. In To-
wards a New Evolutionary Computation, volume 192 of Studies in Fuzziness
and Soft Computing, pages 75–102. Springer Berlin Heidelberg.

Hansen, N., Niederberger, A. S. P., Guzzella, L., and Koumoutsakos, P. (2009).
A method for handling uncertainty in evolutionary optimization with an appli-
cation to feedback control of combustion. IEEE Transanction on Evolutionary
Computation, 13(1):180–197.
Held, H. and Kleinen, T. (2004). Detection of climate system bifurcations by
degenerate fingerprinting. Geophysical Research Letters, 31(23):L23207.
Hemming, S. R. (2004). Heinrich events: Massive late pleistocene detritus layers of the north atlantic and their global climate imprint. Reviews of Geophysics, 42(1).
Hervás-Martı́nez, C. and Martı́nez-Estudillo, F. (2007). Logistic regression using
covariates obtained by product-unit neural network models. Pattern Recog-
nition, 40(1):52–64.
Himberg, J., Korpiaho, K., Mannila, H., Tikanmaki, J., and Toivonen, H. T.
(2001). Time series segmentation for context recognition in mobile devices. In
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference
on, pages 203–210. IEEE.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural
Comput., 9(8):1735–1780.
Huang, G.-B., Zhou, H., Ding, X., and Zhang, R. (2012). Extreme learning
machine for regression and multiclass classification. IEEE Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics, 42(2):513–529.
Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2004). Extreme learning machine:
a new learning scheme of feedforward neural networks. In Neural Networks,
2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2,
pages 985–990 vol.2.
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of classifica-
tion, 2(1):193–218.
Jaeger, H. (2002). Adaptive nonlinear system identification with echo state
networks. In Advances in neural information processing systems, pages 593–
600.
Jastrebski, G. and Arnold, D. (2006). Improving evolution strategies through
active covariance matrix adaptation. In Evolutionary Computation, 2006.
CEC 2006. IEEE Congress on, pages 2814–2821.
Johansson, E., Dowla, F., and Goodman, D. (1991). Backpropagation learn-
ing for multilayer feed-forward neural networks using the conjugate gradient
method. International Journal of Neural Systems, 02(04):291–301.
Johnsen, S., Dansgaard, W., Clausen, H., and Langway, C. (1972). Oxy-
gen isotope profiles through the antarctic and greenland ice sheets. Nature,
235(5339):429–434.
Kanner, L. C., Burns, S. J., Cheng, H., and Edwards, R. (2012). High-latitude
forcing of the south american summer monsoon during the last glacial. Sci-
ence, 335:570–573.

Keogh, E., Chu, S., Hart, D., and Pazzani, M. (2001). An online algorithm
for segmenting time series. In Data Mining, 2001. ICDM 2001, Proceedings
IEEE International Conference on, pages 289–296. IEEE.

Kubo, R. (1966). The fluctuation-dissipation theorem. Reports on Progress in Physics, 29(1).
Lee, Y.-H. and Davier, A. (2013). Monitoring scale scores over time via quality
control charts, model-based approaches, and time series techniques. Psy-
chometrika, 78(3):557–575.
Lenton, T. M., Livina, V. N., Dakos, V., and Scheffer, M. (2012). Climate
bifurcation during the last deglaciation? Clim. Past, 8:1127–1139.
Lenton, T. M., Myerscough, R. J., Marsh, R., Livina, V. N., Price, A. R., and
Cox, S. J. (2009). Using genie to study a tipping point in the climate system.
Philosophical Transactions of the Royal Society A: Mathematical, Physical
and Engineering Sciences, 367(1890):871–884.
Li, P., Tan, Z., Yan, L., and Deng, K. (2011). Time series prediction of mining
subsidence based on genetic algorithm neural network. In Computer Science
and Society (ISCCS), 2011 International Symposium on, pages 83–86.

Livina, V., Kwasniok, F., Lohmann, G., Kantelhardt, J., and Lenton, T. (2011).
Changing climate states and stability: from pliocene to present. Climate
dynamics, 37(11-12):2437–2453.
Livina, V. and Lenton, T. (2007). A modified method for detecting incipient
bifurcations in a dynamical system. Geophys. Res. Lett., 34:L03712.

Luque, C., Ferran, J., and Vinuela, P. (2007). Time series forecasting by means
of evolutionary algorithms. In Parallel and Distributed Processing Symposium,
2007. IPDPS 2007. IEEE International, pages 1–7.
MacQueen, J. et al. (1967). Some methods for classification and analysis of
multivariate observations. In Proceedings of the fifth Berkeley symposium on
mathematical statistics and probability, volume 1, page 14. California, USA.
Martı́nez-Estudillo, A. C., Martı́nez-Estudillo, F. J., Hervás-Martı́nez, C., and
Garcı́a-Pedrajas, N. (2006). Evolutionary product unit based neural networks
for regression. Neural Networks, 19(4):477–486.

Michalewicz, Z. (1996). Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag.
Nawi, N., Ransing, M., and Ransing, R. (2006). An improved learning algo-
rithm based on the broyden-fletcher-goldfarb-shanno (bfgs) method for back
propagation neural networks. In Intelligent Systems Design and Applications,
2006. ISDA ’06. Sixth International Conference on, volume 1, pages 152–157.
Palmer, T. N. and Weisheimer, A. (2011). Diagnosing the causes of bias in
climate models – why is it so hard? Geophysical & Astrophysical Fluid Dy-
namics, 105(2-3):351–365.

Pan, F., Zhang, H., and Xia, M. (2009). A hybrid time-series forecasting model
using extreme learning machines. In Intelligent Computation Technology and
Automation, 2009. ICICTA ’09. Second International Conference on, vol-
ume 1, pages 933–936.
Peterson, L. C., Haug, G. H., Hughen, K. A., and Röhl, U. (2000). Rapid
changes in the hydrologic cycle of the tropical atlantic during the last glacial.
Science, 290(5498):1947–1951.

Piotrowski, A. P. and Napiorkowski, J. J. (2012). Product-units neural networks


for catchment runoff forecasting. Advances in Water Resources, 49(0):97 –
113.
Prandom, P., Goodwin, M., and Vetterli, M. (1997). Optimal time segmenta-
tion for signal modeling and compression. In Acoustics, Speech, and Signal
Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol-
ume 3, pages 2029–2032. IEEE.
Rahmstorf, S. (2003). Timing of abrupt climate change: A precise clock. Geo-
physical Research Letters, 30(10).
Rand, W. M. (1971). Objective Criteria for the Evaluation of Clustering Meth-
ods. Journal of the American Statistical Association, 66(336):846–850.
Rodan, A. and Tiňo, P. (2011). Minimum complexity echo state network. Neural
Networks, IEEE Transactions on, 22(1):131 –144.
Ros, R. and Hansen, N. (2008). A simple modification in cma-es achieving
linear time and space complexity. In Proceedings of the 10th international
conference on Parallel Problem Solving from Nature: PPSN X, pages 296–
305. Springer-Verlag.
Saini, L. and Soni, M. (2002). Artificial neural network based peak load fore-
casting using levenberg-marquardt and quasi-newton methods. Generation,
Transmission and Distribution, IEE Proceedings-, 149(5):578–584.
Scheffer, M., Bascompte, J., Brock, W. A., Brovkin, V., Carpenter, S. R., Dakos,
V., Held, H., Van Nes, E. H., Rietkerk, M., and Sugihara, G. (2009). Early-
warning signals for critical transitions. Nature, 461(7260):53–59.
Schwander, J., Jouzel, J., Hammer, C. U., Petit, J.-R., Udisti, R., and Wolff,
E. (2001). A tentative chronology for the epica dome concordia ice core.
Geophysical research letters, 28(4):4243–4246.
Sclove, S. L. (1983). Time-series segmentation: A model and a method. Infor-
mation Sciences, 29(1):7–25.

Sitte, R. and Sitte, J. (2000). Analysis of the predictive ability of time delay neural networks applied to the S&P 500 time series. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 30(4):568–572.
Stocker, T. F. and Johnsen, S. J. (2003). A minimum thermodynamic model
for the bipolar seesaw. Paleoceanography, 18(4):1087.

Stuiver, M. and Grootes, P. M. (2000). Gisp2 oxygen isotope ratios. Quaternary
Research, 53(3):277–284.

Svensson, A., Andersen, K. K., Bigler, M., Clausen, H. B., Dahl-Jensen, D.,
Davies, S., Johnsen, S. J., Muscheler, R., Parrenin, F., Rasmussen, S. O.,
et al. (2008). A 60 000 year greenland stratigraphic ice core chronology.
Climate of the Past, 4(1):47–57.
Tabacco, I., Passerini, A., Corbelli, F., and Gorman, M. (1998). Determination
of the surface and bed topography at dome c, east antarctica. Journal of
Glaciology, 44(146):185–191.
Tseng, V. S., Chen, C.-H., Huang, P.-C., and Hong, T.-P. (2009). Cluster-based
genetic segmentation of time series with dwt. Pattern Recognition Letters,
30(13):1190–1197.

Xiong, Z., Herley, C., Ramchandran, K., and Orchard, M. T. (1994). Flexi-
ble time segmentations for time-varying wavelet packets. In Time-Frequency
and Time-Scale Analysis, 1994., Proceedings of the IEEE-SP International
Symposium on, pages 9–12. IEEE.
Xu, R. and Wunsch, D. (2008). Clustering. IEEE Press Series on Computational
Intelligence. Wiley.
