Evolutionary Markov chain Monte Carlo
Mădălina M. Drugan and Dirk Thierens
Institute of Information and Computing Sciences, Utrecht University
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
{madalina,dirk}@cs.uu.nl
Technical Report UU-CS-2003-047, www.cs.uu.nl
Abstract. Markov chain Monte Carlo (MCMC) is a popular class of algorithms for sampling from complex distributions. A key issue in the design of MCMC algorithms is improving the proposal mechanism and the mixing behaviour. This has led some authors to propose the use of a population of MCMC chains, while others go even further by integrating techniques from evolutionary computation (EC) into the MCMC framework. This merging of MCMC and EC leads to a class of algorithms we call Evolutionary Markov Chain Monte Carlo (EMCMC). In this paper we first survey existing EMCMC algorithms and categorise them in two classes: family-competitive EMCMC and population-driven EMCMC. Next, we introduce the Elitist Coupled Acceptance rule and the Fitness Ordered Tempering algorithm.
1 Introduction
Markov Chain Monte Carlo (MCMC) algorithms provide a framework for sampling from complicated target distributions that cannot be sampled with simpler, distribution-specific, methods. MCMC algorithms are applied in many fields, and their use in Machine Learning has recently been advocated in [1]. Usually, MCMC uses a single chain which runs for a long time. However, to improve the convergence rate, there are some MCMC variants that work with a population of MCMCs. The use of a population makes these algorithms somewhat similar to Evolutionary Algorithms (EAs). Indeed, some authors have proposed algorithms that integrate techniques from the EC field into the MCMC framework. Here we survey these EMCMC algorithms and classify them into two basic categories: family-competitive EMCMC algorithms, which operate through an acceptance rule at the family level, and population-driven EMCMC algorithms, which operate through a population-driven, adaptive proposal distribution. One property of EMCMC algorithms is that they are not necessarily a set of parallel MCMC chains at the individual level, but they do form a single MCMC at the population level. Besides surveying existing EMCMC algorithms we also propose two alternative techniques: the Elitist Coupled Acceptance (ECA) rule and the Fitness Ordered Tempering (FOT) algorithm.
The paper is organised as follows. Section 2 discusses basic concepts and algorithms from MCMC.
Section 3 describes parallel MCMC algorithms, while Section 4 surveys EMCMC algorithms. We introduce
the ECA and FOT techniques in Section 5, and report some experimental results in Section 6.
2 The Markov Chain Monte Carlo framework
MCMC is a general framework to generate samples X_t from a probability distribution P(·) while exploring its search space Ω(X) using a Markov chain. MCMC does not sample directly from P(·) but only requires that the density P(X) can be evaluated up to a multiplicative constant, P(X) = P'(X) / Z, where Z is a normalisation constant and P'(·) is the unnormalised target distribution. A Markov chain is a discrete-time stochastic process {X_0, X_1, ...} with the property that the state X_t, given all previous values {X_0, X_1, ..., X_{t-1}}, only depends on X_{t-1}: P(X_t | X_0, X_1, ..., X_{t-1}) = P(X_t | X_{t-1}). We call P(· | ·) the transition matrix of the Markov chain. P(· | ·) is a stationary - that is, independent of time t - transition matrix with the following properties: (i) all the entries are non-negative, and (ii) the entries of each row sum to 1. We assume that P(·) > 0. MCMC converges, in infinite time, to the probability distribution P(·), thus it samples with higher probability from the more important states of P(·). A finite-state MCMC which has an irreducible and aperiodic stationary transition matrix converges to a unique stationary distribution [1]. An MCMC chain is irreducible if, and only if, every state of the chain can be reached from every other state in a finite number of steps. An MCMC is aperiodic if, and only if, there exist no cycles in which it can be trapped. A sufficient, but not necessary, condition to ensure that P(·) is the stationary distribution is that the MCMC satisfies the detailed balance condition [1]: the probability to move from X_t to X_NEW multiplied by the probability to be in X_t is equal to the probability to move from X_NEW to X_t multiplied by the probability to be in X_NEW:

P(X_NEW | X_t) · P(X_t) = P(X_t | X_NEW) · P(X_NEW).
Metropolis-Hastings algorithms. Many MCMC algorithms are Metropolis-Hastings (MH) algorithms [10, 20]. Since we cannot sample directly from P(·), MH algorithms use a simpler distribution S(· | ·), called the proposal distribution, for sampling the next state of an MCMC chain. S(X_NEW | X_t) generates the candidate state X_NEW from the current state X_t, and the new state X_NEW is accepted with probability:

A(X_NEW | X_t) = min(1, (P'(X_NEW) · S(X_t | X_NEW)) / (P'(X_t) · S(X_NEW | X_t)))

If the candidate state is accepted, the next state becomes X_{t+1} = X_NEW. Otherwise, X_{t+1} = X_t. The transition probability for arriving in X_NEW when the current state is X_t is T(X_NEW | X_t) = S(X_NEW | X_t) · A(X_NEW | X_t) if X_NEW ≠ X_t, and T(X_t | X_t) = 1 − Σ_{Y ≠ X_t} S(Y | X_t) · A(Y | X_t) otherwise.
The pseudo-code for the MH algorithm is:

Metropolis-Hastings()
1  Initialise X_0; t ← 0
2  while true
3     do Sample X_NEW from S(· | X_t)
4        if Uniform_sampling(0, 1) ≤ A(X_NEW | X_t)
5           then X_{t+1} ← X_NEW
6           else X_{t+1} ← X_t
7        t ← t + 1
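For concreteness, a minimal Python sketch of this loop on a discrete space, assuming an unnormalised Boltzmann target P'(x) = exp(f(x)) over bit strings and a symmetric single-bit-flip proposal (both are illustrative choices, not prescribed above):

import math
import random

def metropolis_hastings(f, x0, steps):
    # Metropolis sampler: the single-bit-flip proposal is symmetric, so the
    # S-terms cancel and acceptance reduces to min(1, P'(x_new) / P'(x)).
    x = list(x0)
    samples = []
    for _ in range(steps):
        i = random.randrange(len(x))      # propose: flip one random bit
        x_new = list(x)
        x_new[i] ^= 1
        # accept with probability min(1, exp(f(x_new) - f(x)))
        if random.random() <= math.exp(min(0.0, f(x_new) - f(x))):
            x = x_new
        samples.append(tuple(x))
    return samples

# Usage: sample 8-bit strings weighted by exp(number of ones).
chain = metropolis_hastings(sum, [0] * 8, steps=1000)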
A MH algorithm is aperiodic, since the chain can remain in the same state with a probability greater than 0, and by construction it satisfies the detailed balance condition. If, in addition, the chain is irreducible, then it converges to the stationary distribution P(·). The rate of convergence depends on the relationship between the proposal distribution and the target distribution: the closer the proposal distribution is to the stationary distribution, the faster the chain converges. Two popular Metropolis-Hastings algorithms are the Metropolis algorithm and the independence sampler. For the Metropolis algorithm, the proposal distribution is symmetrical, S(X_NEW | X_t) = S(X_t | X_NEW), and the acceptance rule becomes A(X_NEW | X_t) = min(1, P'(X_NEW) / P'(X_t)). Because it accepts candidate states often - and thus the state space is well sampled - the Metropolis rule generally performs well. The proposal distribution of the independence sampler does not depend on the current state: S(X_NEW | X_t) = S(X_NEW). The independence sampler's acceptance probability can be written as A(X_NEW | X_t) = min(1, w(X_NEW) / w(X_t)), where w(·) = P'(·) / S(·). Candidate states with low w(X_NEW) are rarely accepted, while states with high w(X_NEW) are very often accepted, and the process can get stuck for a long time in states with very high w(X). Obviously, the choice of w(·) greatly influences the convergence rate.
Simulated annealing (SA). SA is a minor modification of a single-chain MH algorithm used for optimisation. Instead of sampling from the entire distribution P(·), SA samples at step t from P'_t(·) = P'(·)^{1/Temp[t]}, where Temp[t] decreases to 0 according to a cooling schedule. With Temp[·] close to ∞, the chain accepts almost any candidate state according to the MH acceptance rule A, whereas, when Temp[·] is close to 0, the chain rejects almost all states that have lower fitness than the current one. Note that, for constant temperature Temp[t], SA is an MCMC which converges to the distribution P_t(·). However, every time the SA chain is cooled, the transition matrix changes and detailed balance is no longer satisfied. Yet, in infinite time, SA converges to the optimum and, more generally, if Temp[t] decreases to 1, SA converges to the stationary distribution P(·). SA is thus a non-homogeneous MCMC which converges to a given stationary distribution. The time to convergence depends on the cooling schedule. In practice, a fast cooling schedule is often preferred to a slower one, at the risk of poor performance.
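As an illustration, a sketch of the tempered acceptance step together with an assumed geometric cooling schedule (the constants are placeholders, not values from this paper):

import math
import random

def sa_accept(f_new, f_old, temp):
    # MH acceptance applied to the tempered target P'_t(x) = P'(x)^(1/temp)
    # with P'(x) = exp(f(x)): accept with probability min(1, exp((f_new - f_old)/temp)).
    return random.random() <= math.exp(min(0.0, (f_new - f_old) / temp))

def geometric_cooling(t, temp0=100.0, beta=0.97):
    # Assumed geometric schedule Temp[t] = Temp[0] * beta^t, decreasing towards 0.
    return temp0 * beta ** t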
3 Parallel MCMC algorithms
MCMC algorithms are usually applied in a sequential way. A single chain is run for a long time until it converges to the stationary distribution P(·). The states visited during the initial phase of the run are considered unreliable and are ignored. For reasons of computational efficiency this burn-in phase should be as short as possible. MCMCs with a short burn-in phase are said to mix well (note that this is unrelated to the mixing of building blocks in the EC literature). There exist various variations of the standard MCMC algorithm that speed up this mixing process. In this paper we are particularly interested in techniques that use multiple chains in parallel, as opposed to a single chain.
Multiple independent chains (MIC). The most straightforward technique to make use of multiple chains is simply to run N independent MCMCs in parallel. The chains are started at different initial states and their output is observed at the same time. It is hoped that this way a more reliable sampling of the target distribution P(·) is obtained. It is important to note that no information exchange between the chains takes place. Recommendations in the literature are conflicting regarding the efficiency of parallel independent chains. Yet multiple independent MCMC chains have at least theoretical advantages for establishing convergence to P(·) [7].
Parallel tempering (PT). Parallel tempering [6] is a parallel MCMC with N chains, each having a different stationary distribution P'_i(·) = P'(·)^{1/Temp[i]}, i = 1, ..., N. The temperatures have an increasing magnitude Temp[1] < ... < Temp[N] with Temp[1] = 1. The stationary distribution of the lowest-temperature chain is therefore equal to the target distribution, or P'_1(·) = P'(·). The temperatures Temp[i], (2 ≤ i ≤ N), are given a constant value, typically according to a geometrically increasing series. Note that this is similar to the temperature values proposed by the cooling schedule of simulated annealing, though for SA the different temperatures are generated sequentially in time, while for PT the different temperatures are generated in space - that is, over the population - and remain unchanged during the run.
The candidate states are generated using mutation and accepted with the standard Metropolis-Hastings acceptance rule. Chains in PT exchange information by swapping states. Two chains i and j interact by trying to exchange their current states X_t[i] and X_t[j] using the swapping acceptance rule:

A_S(X_t[i], X_t[j]) = min(1, (P'_i(X_t[j]) · P'_j(X_t[i])) / (P'_i(X_t[i]) · P'_j(X_t[j])) · S(X'_t | X''_t) / S(X''_t | X'_t))

where X'_t = (..., X_t[i], ..., X_t[j], ...) and X''_t = (..., X_t[j], ..., X_t[i], ...). Note that A_S is an MH acceptance rule, satisfying the detailed balance condition. A_S accepts with probability 1 an exchange of states if the more important state is inserted into the chain with the lower temperature. To increase the acceptance rate, the two chains usually have adjacent temperatures (|i − j| = 1). Heuristically, PT improves mixing: better states of a warmer chain can be inserted into a colder chain that is mixing more slowly.
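For illustration, a sketch of the swapping step between chains i and j, assuming Boltzmann-shaped chain distributions P'_k(x) = exp(f(x) / Temp[k]) (a sign convention we pick here for illustration) and a symmetric swap proposal so the S-ratio equals 1:

import math
import random

def pt_swap(states, fitness, temps, i, j):
    # With P'_k(x) = exp(f(x)/Temp[k]), the swap acceptance ratio
    # P'_i(x_j) P'_j(x_i) / (P'_i(x_i) P'_j(x_j)) reduces to
    # exp((f(x_j) - f(x_i)) * (1/Temp[i] - 1/Temp[j])).
    fi, fj = fitness(states[i]), fitness(states[j])
    log_ratio = (fj - fi) * (1.0 / temps[i] - 1.0 / temps[j])
    if random.random() <= math.exp(min(0.0, log_ratio)):
        states[i], states[j] = states[j], states[i]   # exchange states in place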
Parallel sintering (PS). Parallel sintering [12] can be viewed as a generalisation of parallel tempering where each chain in the population samples from a member of some family of distributions {P_i(·) | i = 1, ..., N} defined over spaces of different dimensions (resolutions), which are related to the target distribution through the highest-dimensional distribution (or largest resolution) P_1(·) = P(·). As in PT, parallel sintering exchanges information between adjacent chains by exchanging states through the swapping acceptance rule. Here too the idea is to improve the mixing of the slowly mixing, highest-dimensional chain towards the target distribution by inserting states of the lower-dimensional - and faster mixing - chains in the population. Parallel sintering is a multiple-chain MCMC version of the single-chain MCMC called simulated sintering [16].
4 Evolutionary MCMC algorithms
In the previous section, we have outlined parallel MCMCs. In the following we survey existing EMCMC
algorithms, and distinguish between two categories: family-competitive EMCMC and population-driven
EMCMC.
Table 1. Comparison of the presented EMCMC algorithms

algorithm   perturbation op.     communication      type EMCMC   optim. / sampl. (distrib.)
EMC         mut, recomb          A_C, A_S           fam-comp     ∏_{i=1}^N P'_i(·)
MRGA        mut, recomb          A_C, A_S           fam-comp     ∏_{i=1}^N P'_i(·)
popSA       mut                  Boltzmann trials   fam-comp     maxim.
PRSA        mut, recomb          A_C                fam-comp     maxim.
mparSA      mut, neigh. recomb   neighbours         fam-comp     maxim.
popMCMC     mut                  S(· | ·)           pop-driven   ∏_{i=1}^N P'(·)
eMCMC       mut, recomb          recomb, S(· | ·)   pop-driven   maxim.
ECA         mut, recomb          ECA                fam-comp     ∏_{i=1}^N R'(·)
Family-competitive EMCMC algorithms let two chains from the population exchange information to sample two new states. This interaction can be implemented by the proposal mechanism and/or by the acceptance rule. Recombining the states of the two chains is an example of the former approach, while in the latter two states are selected from the two proposed states and their 'parent' states. We call any MH acceptance rule which selects two states from this family of four states a coupled acceptance rule. Note that in this view parallel tempering (and parallel sintering) belong to the class of family-competitive EMCMC algorithms. Family-competitive EMCMC algorithms correspond to family-competitive EAs, where two individuals create offspring by mutation and recombination, and selection takes place at the family level - examples are the elitist recombination GA [25], the deterministic crowding GA [18], and the genetic invariance GA [5].
Population-driven EMCMC algorithms adapt their proposal distribution according to the entire current population of states. The adaptation is such that the algorithm maintains a single MCMC at the population level - that is, an MCMC whose states are populations. Population-driven EMCMC algorithms correspond to the probabilistic model-building evolutionary algorithms in the sense that new solutions are proposed on the basis of information taken from the entire current population. In the probabilistic model-building evolutionary algorithms, a distribution is learned from the current population, and offspring are generated by sampling from this distribution [22].
Table 1 compares the EMCMC algorithms surveyed here according to their perturbation operators, their method of communication between chains, their EMCMC category, and their use as a sampler or optimiser. Call X_t = (X_t[1], ..., X_t[N]) the population at time t of N states (or individuals) X_t[·]. To each individual X[·] we associate a fitness function f : Ω(X[·]) → IR, where X[·] samples from P'(X[·]) = g(f(X[·])) (g is a monotonic function, g : IR → IR^+ \ {0}). Here, we consider an individual X[·] as a string of l characters X[·][1] X[·][2] ... X[·][l]. We use X[·][·]^a to denote the a-th value in Ω(X[·][·]), the set of all possible values of X[·][·]. We call X[·][j]^a an allele of X[·][j]. The position of X[·][j] in X[·] is called the locus of X[·][j]. X'_t and X''_t are two intermediate populations.
We present below the pseudo-code for a general EMCMC framework which contains both EMCMC categories.

EMCMC()
1  Initialise X_0[i], i = 1, ..., N; t ← 0
2  Calculate S(· | ·)
3  while true
4     do Sample X'_t from S(· | X_t)
5        Recombine states from X'_t obtaining X''_t according to S(· | ·)
6        Accept states from X''_t given X_t according to an acceptance rule, obtaining X_{t+1}
7        Recalculate S(· | ·)
8        t ← t + 1
Evolutionary Monte Carlo (EMC). Liang and Wong [14], [15] propose the evolutionary Monte Carlo algorithm, which incorporates recombination into the parallel tempering (PT) algorithm to speed up the search and preserve good building blocks of the current states of the chains. Like PT, EMC has a population of MCMC chains with constant and (geometrically) increasing temperatures. Chains interact through the swapping acceptance rule A_S that attempts to exchange states between chains with adjacent temperatures. This way, the good individuals sampled in a warmer chain are transferred to a colder chain where they may be preserved longer. The lowest-temperature chain, Temp[1] = 1, converges to the target distribution P(·), where P'_i(·) = exp(−f(X[i]) / Temp[i]).
The candidate states are generated in two different ways and accepted by two different rules. With probability q_m, each chain in the population generates a new state by mutation and accepts it with the standard Metropolis-Hastings acceptance rule. With probability 1 − q_m, new states are generated by recombination of two states from different chains, and the offspring states are accepted with the coupled acceptance rule:

A_C((X_NEW[i], X_NEW[j]) | (X_t[i], X_t[j])) = min(1, (P'_i(X_NEW[i]) · P'_j(X_NEW[j])) / (P'_i(X_t[i]) · P'_j(X_t[j])) · S(X'_t | X'_NEW) / S(X'_NEW | X'_t)),

with X'_t = (..., X_t[i], ..., X_t[j], ...) and X'_NEW = (..., X_NEW[i], ..., X_NEW[j], ...).
Note that when q_m = 1, EMC reduces to PT. Liang and Wong discussed experimental results for a model selection problem, a time-series change-point identification problem, and a protein folding problem. These experiments showed the effectiveness of EMC as compared to PT.
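For illustration, a sketch of this coupled acceptance for one offspring pair, assuming symmetric recombination (so the S-terms cancel) and the chain distributions P'_k(x) = exp(−f(x) / Temp[k]) given above:

import math
import random

def emc_coupled_accept(f_parents, f_children, temp_i, temp_j):
    # A_C ratio with P'_k(x) = exp(-f(x)/Temp[k]) and a symmetric proposal:
    # exp((f(par_i) - f(child_i))/Temp[i] + (f(par_j) - f(child_j))/Temp[j]).
    log_r = (f_parents[0] - f_children[0]) / temp_i \
          + (f_parents[1] - f_children[1]) / temp_j
    return random.random() <= math.exp(min(0.0, log_r))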
Multi-resolution Genetic Algorithm-style MCMC (MRGA). Holloman, Lee and Higdon [12] also proposed to integrate recombination in their multi-resolution parallel sintering algorithm. MRGA runs several chains for each resolution, sampling from the distribution at that resolution. Each individual has: (i) a variable-size part, which represents information specific to its resolution, updated with an MCMC, and (ii) a fixed-dimensional (e.g. lowest-dimension) vector with common interpretability used to move between chains (e.g. for image processing this could be the mean of neighbourhood cells). MRGA samples at the individual level using mutation, while it samples at the population level using recombination between the fixed-dimensionality vectors of two chains of the same or different resolutions. Both new individuals are accepted or rejected using the swapping acceptance rule A_S, which in this case is equivalent to the coupled acceptance rule A_C. As in parallel sintering, to improve mixing, chains of different resolutions periodically fully exchange the fixed-dimensionality vectors using the swapping acceptance rule A_S. MRGA proposes a full exchange with probability p_swap, or a crossover exchange with probability 1 − p_swap.
Population MCMC (popMCMC). Laskey and Myers [13] introduced popMCMC where a population
of MCMC chains exchange information through a population-based, adaptive proposal distribution. As
before, the Boltzmann distribution is used: P'(X_t[·]) = exp(−f(X_t[·])), where f is a fitness function. Each generation, a locus X_t[i][j] is randomly picked to be mutated. An allele X_t[i][j]^a on the j-th locus of the candidate individual X_NEW[i] is generated using the distribution

α̂_{ija} = (N(X_t[i][j]^a) + 1) / (N + |Ω(X[·][·])|),

where N(X_t[i][j]^a) is the number of individuals that have the allele X_t[i][j]^a on the j-th locus and |Ω(X[·][·])| is the total number of alleles. X_NEW[i] is accepted using the MH acceptance rule A(X_NEW[i] | X_t[i]).
Since the proposal distribution depends on the current population, the transition matrix of a single individual is not stationary, yet the transition matrix of the whole population is stationary. Whenever good individuals from the population have a specific allele on a certain locus, new individuals will, with high probability, have the same allele on the same locus. Since popMCMC is a population of independently sampling MCMC chains, the allele remains in the population for a while. It is interesting to observe the similarity between popMCMC and univariate EDAs [22] in generating the candidate individuals from the structure of the current population. PopMCMC converges to the stationary distribution of N points sampled independently from P(·). The authors show experimentally that popMCMC finds highly likely Bayesian network structures faster than multiple independent MCMC chains.
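For illustration, a sketch of this population-driven proposal step for binary strings (so |Ω(X[·][·])| = 2):

import random

def popmcmc_propose(population, i):
    # Resample one random locus of individual i from the add-one smoothed
    # population allele frequencies alpha = (N(allele) + 1) / (N + #alleles).
    n = len(population)
    x_new = list(population[i])
    j = random.randrange(len(x_new))
    count_one = sum(ind[j] for ind in population)
    p_one = (count_one + 1) / (n + 2)     # two alleles: 0 and 1
    x_new[j] = 1 if random.random() < p_one else 0
    return x_new

The MH correction A(X_NEW[i] | X_t[i]) then uses the same smoothed frequencies for the forward and reverse proposal probabilities.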
Evolutionary MCMC (eMCMC). Zhang and Cho [27] proposed the eMCMC algorithm, which is designed to find the optimum of a distribution by generating L samples from a population of N MCMC chains and then selecting the best N of them (L > N). The initial population is sampled according to a prior Gaussian distribution. Each generation, each chain in the current population generates several new candidate individuals using mutation and recombination, which are accepted according to a Metropolis acceptance rule A(· | ·). The best N individuals from the L individuals are then selected for the next generation. Their posterior distribution - estimated with a Gaussian distribution model - becomes the proposal distribution S(· | ·) for the next generation. We observe that eMCMC is also related to EDAs [22], since the candidate individuals are generated from S(· | ·), which is adapted each generation. The eMCMC algorithm is thus an EMCMC algorithm which samples using mutation and recombination, and where S(· | ·) is adapted each generation to sample from promising regions of the target distribution. Experimentally, eMCMC outperformed a single-chain MCMC and the standard GA for a system identification task.
Population-based simulated annealing (popSA). Goldberg [8] proposed a population-based simulated annealing algorithm which creates a Boltzmann distribution over the population. Instead of the MH acceptance rule, it uses the logistic acceptance rule, which accepts a candidate individual with probability 1 / (1 + exp((f(X_t[·]) − f(X_NEW[·])) / Temp[t])), where Temp[t] is the temperature for generation t. When Temp[t] is constant, the unnormalised stationary distribution of this algorithm is the Boltzmann distribution P'_t(X_t[·]) = exp(f(X_t[·]) / Temp[t]), where f is the fitness function. Each individual X_t[i] from the current population chooses two other individuals from the population for coupling: one, X_t[j], has a fitness value that differs from X_t[i] by a threshold θ, and one, X_t[k], has a fitness value that differs from X_t[i], or from both X_t[i] and X_t[j], by a threshold θ. An anti-acceptance competition is held between X_t[j] and X_t[k], where X_t[j] is accepted with probability 1 / (1 + exp((f(X_t[j]) − f(X_t[k])) / Temp[t])). The primary acceptance competition is held between the winner of the first competition, X_t[jk], and X_t[i], where X_t[i] is accepted with probability 1 / (1 + exp((f(X_t[jk]) − f(X_t[i])) / Temp[t])). The competitions are chosen to avoid the danger that copies of an individual take over the population. The anti-acceptance rule prefers the poorer individuals, whereas the primary acceptance rule prefers the better individuals. This way, the population-level equivalent of generating a random neighbour of the current population is created.
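A sketch of the basic logistic (Boltzmann) trial underlying these competitions:

import math
import random

def boltzmann_trial(f_a, f_b, temp):
    # a is accepted over b with probability 1 / (1 + exp((f_b - f_a)/temp));
    # swapping the arguments yields the anti-acceptance variant that
    # favours the poorer individual. The clamp avoids overflow in exp.
    z = max(-700.0, min(700.0, (f_b - f_a) / temp))
    return random.random() <= 1.0 / (1.0 + math.exp(z))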
Parallel recombinative simulated annealing (PRSA). Mahfoud and Goldberg [19] proposed a population-based simulated annealing algorithm which also makes use of recombination. All individuals in the population have the same temperature, which decreases every generation according to a cooling schedule. New individuals are generated using mutation and one-point recombination between two individuals X_t[i] and X_t[j]. PRSA uses the logistic acceptance rule to accept a candidate individual. Two possible competitions are considered: single acceptance/rejection holds two competitions, each between a parent and the child formed from its own right-end and the other parent's left-end, while double acceptance/rejection holds one competition between both parents and both children using the coupled acceptance rule:

A_C((X_NEW[i], X_NEW[j]) | (X_t[i], X_t[j])) = min(1, 1 / (1 + exp((f(X_t[i]) + f(X_t[j]) − f(X_NEW[i]) − f(X_NEW[j])) / Temp[t])))
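A sketch of this double acceptance/rejection competition; the overflow clamp is a practical addition of ours:

import math
import random

def prsa_double_accept(f_parents, f_children, temp):
    # Both children replace both parents with probability
    # 1 / (1 + exp((f(p1) + f(p2) - f(c1) - f(c2)) / temp)).
    z = (sum(f_parents) - sum(f_children)) / temp
    z = max(-700.0, min(700.0, z))
    return random.random() <= 1.0 / (1.0 + math.exp(z))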
Massively parallel simulated annealing (mparSA). Rudolph [23] introduced the massively parallel simulated annealing algorithm. He experimentally achieved fast convergence to the optimum with a population of independent SA chains. Like SA, mparSA uses the Boltzmann function P'_t(X_t[·]) = exp(−f(X_t[·]) / Temp[t]), where f is a fitness function and Temp[t] is the temperature for all individuals of the t-th generation. The algorithm associates the population with a connected graph. Each node holds an individual which communicates with the individuals in the neighbouring nodes of the graph. At each step of each chain, the current state X_t[i] is recombined with a neighbour and the result is mutated several times, producing new candidate individuals. The best candidate individual X_NEW[i] is then selected and is accepted
with the MH acceptance rule A(X_NEW[i] | X_t[i]). The temperature is decreased each step according to an SA cooling schedule. Rudolph pointed out that the conflicting goals of fast convergence and global convergence to the optimum can be satisfied with an adaptive proposal distribution, whereas they cannot with a fixed proposal distribution. As in Evolution Strategies, mparSA uses a non-fixed proposal distribution S_t(· | ·) which is adapted each generation with a lognormally distributed random variable α_t: S_t(· | X_t[i]) = α_t · S_{t−1}(· | X_t[i]).
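A sketch of this step-size adaptation; the value of the lognormal spread sigma is an assumed meta-parameter, not taken from the paper:

import math
import random

def adapt_scale(scale, sigma=0.1):
    # Lognormal adaptation S_t = alpha_t * S_{t-1}, with alpha_t = exp(N(0, sigma^2)),
    # so the proposal scale performs a multiplicative random walk.
    return scale * math.exp(random.gauss(0.0, sigma))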
Related work. Cercueil and François [3] give an overview of the literature in which EAs are viewed as Monte Carlo methods that generate samples from a probability distribution defined on the trajectories of their population. This helps to unify the convergence theories for EAs. Along these lines, Cerf [4] considers a GA with the mutation rate related to a temperature, no crossover, and Boltzmann roulette selection. Each generation, an intermediate population is generated by mutation, and the next population is generated from a Boltzmann distribution constructed from the intermediate population. Lozano and co-authors discuss a hybrid between the genetic algorithm and simulated annealing based on a probabilistic Boltzmann reduction operator [17]. In a very recent publication, Strens proposes an EMCMC algorithm with an 'exclusive-or' proposal operator [24]. This operator takes one parent and two reference states, and generates an offspring state that differs from its parent in a similar way as the two reference states differ from each other.
Sequential Monte Carlo (SMC) is a new branch of Monte Carlo simulation, not necessarily based on a Markov chain, which uses a population of samples to carry out on-line approximations of a target distribution. SMC has sampling procedures which are similar to proportional selection in EAs. Bienvenüe et al. introduce niching in some SMC algorithms to maintain diversity in the population [2]. Del Moral et al. [21] study the convergence of GAs with SMC techniques. Higuchi [11] and Tito, Vellasco and Pacheco [26] propose a GA filter and a Genetic Particle Filter, respectively; they integrate recombination into SMC [11] to speed up convergence.
5 Elitist coupled acceptance rule and fitness ordered tempering
In the previous section we have surveyed a number of EMCMC algorithms, and discussed some techniques
that showed how multiple MCMC chains could exchange information in order to speed up the convergence
to some target distribution. In the following we introduce the elitist coupled acceptance rule (ECA) and
fitness ordered tempering (FOT), whose main purpose is to converge efficiently to a target distribution that
is biased towards the most likely states (or good solutions).
Fitness ordered tempering (FOT). FOT maintains a population of N chains, each having its own temperature Temp[i]. As in parallel tempering, FOT has a - typically geometrically - increasing series of temperatures Temp[1] = 1 < ... < Temp[N] = Temp_max. The population of N solutions (or states) is sorted according to their fitness (or probability), and the solution at rank i gets the temperature Temp[i], where the most fit solution gets Temp[1] = 1 and the worst solution gets Temp[N] = Temp_max. Therefore a solution has a lower temperature than any solution worse than itself, unless they are copies of each other. In case there are multiple copies of the same solution, ties within the sorted population are broken randomly, so each copy gets an adjacent temperature. Copies receive a different temperature to avoid that multiple copies of good solutions remain in the population almost indefinitely.
In case there are multiple solutions with the same fitness that are not copies of each other, the temperature ladder has to be recomputed so that each unique solution with the same fitness gets the same temperature. The number of different temperature values Temp[i] at each generation is therefore equal to the number of solutions with different fitness values, unless they are copies, in which case they still get different temperature values. This scheme is necessary to avoid that a solution might get another temperature in an identically composed population, and it ensures that FOT has a homogeneous Markov chain transition matrix at the population level. Contrary to parallel tempering, here the temperature assignment depends on the fitness of a solution relative to the fitness of the other solutions in the current population. Since we want to remain in the vicinity of good solutions, we assign the lower temperatures to the better solutions.
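A sketch of the FOT assignment with a geometric ladder (population size N ≥ 2); the special case of equal-fitness solutions that are not copies, which share one temperature, is omitted here for brevity:

import random

def fot_temperatures(population, fitness):
    # Sort best-first; rank r gets the r-th rung of a geometric ladder
    # from Temp[1] = 1 up to Temp[N] = TEMP_MAX. Random tie-breaking
    # matches the rule that identical copies get adjacent temperatures.
    TEMP_MAX = 160.0                       # assumed ladder top
    n = len(population)
    beta = TEMP_MAX ** (1.0 / (n - 1))     # ladder ratio: beta^(n-1) = TEMP_MAX
    order = sorted(range(n),
                   key=lambda k: (-fitness(population[k]), random.random()))
    temps = [0.0] * n
    for rank, k in enumerate(order):
        temps[k] = beta ** rank            # rank 0 -> 1.0, rank n-1 -> TEMP_MAX
    return temps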
Elitist coupled acceptance rule (ECA). The ECA algorithm applies a coupled Metropolis-Hastings acceptance rule to two solutions and their two children. ECA accepts the best two solutions from the family of four if at least one of them is a child. However, when both children have a lower fitness than both their parents, the children can still replace the parents with a probability determined by the coupled acceptance rule. This increases the explorative character of the algorithm. Call X'_t = (..., X_t[i], ..., X_t[j], ...), X'_NEW = (..., X_NEW[i], ..., X_NEW[j], ...), P'_{X_t}(X[·]) = exp(f(X[·]) / Temp[·]), and max_2 the function returning the two most fit solutions. The ECA acceptance rule is now given as:
ECA((X_NEW[i], X_NEW[j]) | (X_t[i], X_t[j]))
1  if {X_t[i], X_t[j]} = {X_NEW[i], X_NEW[j]}
2     then return 1
3  (X_MAX[i], X_MAX[j]) ← max_2(X_NEW[i], X_NEW[j], X_t[i], X_t[j])
4  if X_MAX[i] ∈ {X_NEW[i], X_NEW[j]} ∨ X_MAX[j] ∈ {X_NEW[i], X_NEW[j]}
5     then return 1
6     else return (P'_{X_t}(X_NEW[i]) · P'_{X_t}(X_NEW[j])) / (P'_{X_t}(X_t[i]) · P'_{X_t}(X_t[j])) · S(X'_t | X'_NEW) / S(X'_NEW | X'_t)

Fig. 1. The number of different solutions found over the total number of possible solutions per fitness value, plotted as -ln(-ln(#found/#solutions)) against fitness (0-250), for (a) MIC, (b) popMCMC, (c) PT, (d) PT with recombination, (e) PRSA and (f) ECA/FOT.
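For concreteness, a sketch of the rule as a function returning the surviving pair, assuming a symmetric proposal so the S-ratio drops out (our simplification) and states represented as comparable tuples:

import math
import random

def eca_accept(parents, children, fitness, temps):
    # ECA sketch: keep the two fittest of the family of four whenever at
    # least one of them is a child; otherwise fall back to the coupled
    # acceptance probability under P'(x) = exp(f(x)/Temp[.]).
    if sorted(parents) == sorted(children):
        return tuple(children)
    family = list(parents) + list(children)
    best_two = sorted(family, key=fitness, reverse=True)[:2]
    if any(c in best_two for c in children):
        return tuple(best_two)
    # both children are worse than both parents: coupled MH ratio
    log_r = sum(fitness(c) / t for c, t in zip(children, temps)) \
          - sum(fitness(p) / t for p, t in zip(parents, temps))
    if random.random() <= math.exp(min(0.0, log_r)):
        return tuple(children)
    return tuple(parents)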
Note that the temperatures of the current and accepted solutions remain the same. ECA can be viewed
as a combination between the family-competitive, elitist replacement rule from regular GAs [25] and the
coupled acceptance rule A_C. To see the difference between ECA and A_C, let us consider two parent states and their offspring such that P'_{X_t}(X_NEW[i]) > P'_{X_t}(X_t[j]) > P'_{X_t}(X_t[i]) > P'_{X_t}(X_NEW[j]). Call F = (P'_{X_t}(X_NEW[i]) / P'_{X_t}(X_t[i])) · (P'_{X_t}(X_NEW[j]) / P'_{X_t}(X_t[j])). If F < 1, A_C may lose X_NEW[i], which is the best solution of this family, while if F > 1, A_C accepts X_NEW[j], which is the worst solution.
Combining the FOT temperature assignment mechanism and the ECA acceptance rule leads to the ECA/FOT MCMC algorithm: each generation, the temperatures of the individuals in the population are recomputed with FOT, new individuals are proposed by mutation and recombination, and these are accepted - or rejected - with the ECA acceptance rule. It is important to note that at the individual level the transition matrix of ECA/FOT is not stationary, since, over time, the same fitness values may be associated with different temperatures. At the population level, however, the transition matrix is stationary, because for the same population of individuals the temperatures are always assigned in the same way. The ECA/FOT MCMC is aperiodic, since, for each state, there is a non-zero probability to remain in the current state, and irreducible, because there is a non-zero probability to get from each state to every other state. If the state space is finite, ECA/FOT converges to a distribution R(·) = ∏_{i=1}^N R'(·), where R'(·) is the unnormalised distribution for a single chain, which depends on Temp_MAX and Temp_MIN, the temperature assignment (e.g. geometric), and the size of the population.
6 Experimental results
To illustrate the use of the ECA/FOT algorithm we have applied six (E)MCMC algorithms to the well-known deceptive trap function [9], and counted the number of different solutions visited as a function of their fitness. The six algorithms are multiple independent MCMC chains (MIC), population MCMC (popMCMC), parallel tempering (PT), parallel tempering with recombination, parallel recombinative simulated annealing (PRSA), and elitist coupled acceptance with fitness ordered tempering (ECA/FOT). The first three algorithms do not apply recombination, while the other three do. For comparison purposes, each algorithm samples from the Boltzmann distribution exp(f(·) / Temp[·]). In the MIC algorithm, each chain has been given a constant temperature calculated with the PT ladder. The PRSA algorithm uses the MH acceptance rule A (instead of the logistic acceptance rule), and two single acceptance/rejection competitions between a parent and one of its children.
The trap function has 10 building blocks of length k = 3 and a linearly scaled fitness function f = Σ_{i=1}^{10} i · f_i, where f_i is the fitness value of the i-th trap function and depends only on the number of one-bits in its building block: f_i(3) = 5, f_i(2) = 0, f_i(1) = 3, f_i(0) = 4. The algorithmic parameters are chosen as follows. We set Temp_MAX = 160 for the MH acceptance rule and Temp_MAX = 320 for the ECA acceptance rule, resulting in an acceptance probability of 0.73 for two individuals with a fitness difference equal to 50. Temp_MIN is equal to 1. We choose a geometrically increasing temperature ladder that changes the temperature each step by the factor β = exp((1/s) · log(Temp_MIN / Temp_MAX)), where s is the total number of steps of the scheduler. At step j, Temp_j = Temp_MAX · β^j. We choose the population size N equal to the number of generations, to obtain the same β for all algorithms. If β = 0.97, we obtain N = 250. The mutation rate is 3/l, which is the optimal rate for a deceptive function with building block length k = 3. The recombination operator is two-point crossover. For each algorithm, we have made 10 independent runs and sampled the whole population every 10 generations, resulting in 25 samples and 6250 solutions in total.
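For reference, the trap fitness used in these experiments, as a Python sketch:

def trap_fitness(x):
    # f = sum_{i=1}^{10} i * f_i, with f_i depending only on the number of
    # one-bits in block i: f_i(3) = 5, f_i(2) = 0, f_i(1) = 3, f_i(0) = 4.
    block_value = {3: 5, 2: 0, 1: 3, 0: 4}
    total = 0
    for i in range(10):                      # 10 blocks of length k = 3
        ones = sum(x[3 * i:3 * i + 3])
        total += (i + 1) * block_value[ones]
    return total

# The global optimum is the all-ones string: f = 5 * (1 + ... + 10) = 275.
assert trap_fitness([1] * 30) == 275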
Figure 1 shows - as expected - the advantage of using recombination as a proposal operator for functions (or distributions) whose structure can be exploited by the crossover operator. One can also notice that ECA/FOT is more strongly biased towards highly likely individuals than PT or PRSA.
7 Conclusion
Evolutionary Markov Chain Monte Carlo algorithms combine techniques from Evolutionary Computation and parallel Markov Chain Monte Carlo to design new algorithms for sampling from, or optimising, complex distributions and functions, respectively. EMCMC algorithms offer new proposal operators and new acceptance rules. Individual states in the population can be single MCMC chains that interact with each other, though it is not necessary that they are indeed single MCMC chains. At the population level, however - that is, when states are entire populations of a given size - EMCMC algorithms need to be MCMC chains.
We have surveyed a number of existing EMCMC algorithms from the literature, and categorised them in two classes: family-competitive EMCMC and population-driven EMCMC algorithms. We have also introduced an EMCMC algorithm, ECA/FOT, which applies a Fitness Ordered Tempering assignment mechanism and an Elitist Coupled Acceptance rule. ECA/FOT is, at the population level, an MCMC which converges to a stationary distribution biased towards the most likely states.
Clearly, the merger of the EC and MCMC paradigms represents a rich source for the future development of powerful sampling and optimisation algorithms.
References
1. C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for Machine Learning. Machine
Learning, pages 5–43, 2003.
2. A. Bienvenüe, M. Joannides, J. Bérard, E. Fontenas, and O. François. Niching in Monte Carlo filtering algorithms. In P. Collet, J.-K. Hao, E. Lutton, and M. Schoenauer, editors, Artificial Evolution: Selected Papers, 5th International Conference, EA 2001, pages 19–30. Springer, 2002.
3. A. Cercueil and O. François. Monte Carlo simulation and population-based optimization. In Congress on Evolutionary Computation 2001, pages 191–198, 2001.
4. R. Cerf. A new genetic algorithm. Annals of Applied Probability, 6:778–817, 1996.
5. J. Culberson. Genetic invariance: A new paradigm for genetic algorithm design. Technical Report TR 92-02, University of Alberta, Edmonton, Canada, 1992.
6. C. J. Geyer. Markov Chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings
of the 23rd Symposium on the Interface, pages 156–163, 1991.
7. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, editors. Markov Chain Monte Carlo in practice. Chapman &
Hall, 1996.
8. D. E. Goldberg. A note on Boltzmann tournament selection for Genetic Algorithms and Population-oriented
Simulated Annealing. Complex Systems, 4:445–460, 1990.
9. D. E. Goldberg, K. Deb, and J. Horn. Massive multimodality, deception, and Genetic Algorithms. In Proceedings
of Parallel Problem Solving from Nature, pages 37–46, 1992.
10. W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–
109, 1970.
11. T. Higuchi. Self-organizing time series model. In Sequential Monte Carlo Methods in Practice, pages 429–444. Springer, 2001.
12. C. H. Holloman, H. K. H. Lee, and D. M. Higdon. Multi-resolution Genetic Algorithms and Markov Chain Monte
Carlo. Technical report, Duke University, 2002.
13. K. B. Laskey and J. W. Myers. Population Markov Chain Monte Carlo. Machine Learning, pages 175–196, 2003.
14. F. Liang and W. H. Wong. Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Statistica Sinica, 10:317–342, 2000.
15. F. Liang and W. H. Wong. Evolutionary Monte Carlo for protein folding simulations. Journal of Chemical Physics, 115:3374–3380, 2001.
16. J.S. Liu and C. Sabatti. Simulated Sintering: Markov Chain Monte Carlo with spaces of varying dimensions. In
Bayesian Statistics 6, pages 389–413. Oxford University Press, 1999.
17. J. A. Lozano, P. Larrañaga, M. Graña, and F. X. Albizuri. Genetic algorithms: bridging the convergence gap.
Theoretical Computer Science, 229(1–2):11–22, 1999.
18. S. W. Mahfoud. Crowding and preselection revisited. In Reinhard Männer and Bernard Manderick, editors,
Parallel problem solving from nature 2, pages 27–36, Amsterdam, 1992. North-Holland.
19. S. W. Mahfoud and D. E. Goldberg. Parallel Recombinative Simulated Annealing: a Genetic Algorithm. Parallel
Computing, pages 1–28, 1995.
20. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
21. P. Del Moral, L. Kallel, and J. Rowe. Modelling genetic algorithms with interacting particle systems. In L. Kallel, B. Naudts, and A. Rogers, editors, Theoretical Aspects of Evolutionary Computing, pages 10–67. Springer, 2001.
22. M. Pelikan, D. E. Goldberg, and F. Lobo. A survey of optimization by building and using probabilistic models.
Computational Optimization and Applications, 21:5–20, 2002.
23. G. Rudolph. Massively Parallel Simulated Annealing and its Relation to Evolutionary Algorithms. Evolutionary
Computation, pages 361–382, 1994.
24. M. J. A. Strens. Evolutionary MCMC sampling and optimization in discrete spaces. In Proceedings of the
Twentieth International Conference on Machine Learning ICML-2003, 2003.
25. D. Thierens and D. E. Goldberg. Elitist recombination: an integrated selection-recombination GA. In Proceedings of the First IEEE World Congress on Computational Intelligence, pages 508–512, 1994.
26. E. A. H. Tito, M. M. B. R. Vellasco, and M. A. C. Pacheco. Genetic Particle Filter: an evolutionary perspective of SMC methods. Technical report, Catholic University of Rio de Janeiro, 2002.
27. B. T. Zhang and D. Y. Cho. System identification using evolutionary Markov Chain Monte Carlo. Journal of Systems Architecture, 47:587–599, 2001.