Furthermore, if i and j are neighboring locations, then the correlation of observations at those points conditional on all other observations is −Qij/√(Qii Qjj), and the conditional mean and precision of any single observation can also be expressed as simple functions of nonzero elements of Q. Since GMRF's can be specified through Normal conditionals defined by sparse matrices, this allows certain computations to run much more quickly than they would in the case of methods based on the dense covariance matrix. Throughout the book, Normal distributions are specified by precisions, rather than by variances.
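To make the identity concrete, here is a minimal NumPy sketch (my own illustration, not code from the book; the matrix and the indices i, j are arbitrary) that checks the conditional-correlation formula against a brute-force computation from the covariance matrix.

```python
import numpy as np

# Sketch: for a zero-mean Gaussian vector with precision matrix Q, the
# correlation of components i and j given all others is -Q_ij / sqrt(Q_ii Q_jj).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Q = A @ A.T + 5 * np.eye(5)          # a random symmetric positive-definite precision
Sigma = np.linalg.inv(Q)             # the corresponding covariance

i, j = 1, 3
rest = [k for k in range(5) if k not in (i, j)]

# Brute force from the covariance side: Schur complement gives the
# conditional covariance of (x_i, x_j) given the remaining components.
S_aa = Sigma[np.ix_([i, j], [i, j])]
S_ab = Sigma[np.ix_([i, j], rest)]
S_bb = Sigma[np.ix_(rest, rest)]
C = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
rho_bruteforce = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

rho_precision = -Q[i, j] / np.sqrt(Q[i, i] * Q[j, j])
print(rho_bruteforce, rho_precision)  # agree to rounding error
```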
The authors assume that the reader has a good working knowledge of conditional probability and of the multivariate Normal distribution at the level of Hogg, McKean, and Craig (2005). Beyond this, they outline or discuss in detail the technical aspects of fitting GMRF's, with a strong emphasis on only those details that are needed to make practical use of the methods. More formal and mathematical results are referenced in the endnotes. Their style is terse, and if some detail seems to be missing, it can often be found upon rereading. If not, then the references and endnotes will point to where those details can be found. Theory is usually presented first, followed by a detailed investigation of one or more complex examples.

After a brief and very useful introduction, Chapter 2 contains most of the general results about GMRF's. It begins with a presentation of notation necessary to describe conditional distributions relative to graphical dependence networks. The use of the precision matrix in model specification, inference, and simulation is then discussed, and a section is dedicated to a general review of solvers for sparse systems of linear equations. Since the speed of precision-matrix-based methods depends on the speed of these solvers, this discussion is central to the practical use of these methods, and any text on inference that requires these methods should include a section of this kind. The use of toroidal boundary conditions on the data is discussed, showing how they can lead to further speed increases through use of cyclic precision matrices. Finally, several methods for ensuring that Q is positive definite are compared.
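As a rough illustration of why cyclic precision matrices help, the following sketch (mine, with an arbitrary second-difference stencil, not an example from the book) solves a circulant system Qx = b with the FFT and checks the answer against a dense solve.

```python
import numpy as np

# With toroidal (cyclic) boundary conditions the precision matrix Q is
# circulant, so Q x = b can be solved with the FFT instead of a factorization.
n = 256
base = np.zeros(n)
base[0], base[1], base[-1] = 2.5, -1.0, -1.0   # cyclic second-difference stencil + nugget

# Dense circulant Q built from the symmetric stencil (for checking only).
Q = np.array([np.roll(base, k) for k in range(n)])

rng = np.random.default_rng(1)
b = rng.normal(size=n)

# The eigenvalues of a circulant matrix are the DFT of its first column.
eig = np.fft.fft(base)
x_fft = np.real(np.fft.ifft(np.fft.fft(b) / eig))

x_dense = np.linalg.solve(Q, b)
print(np.max(np.abs(x_fft - x_dense)))   # ~1e-12
```

The FFT solve costs O(n log n) against the O(n³) of the dense factorization, which is the speed increase the reviewer refers to.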
The three remaining chapters apply the basic techniques to increasingly complex models. Whereas the third chapter is accessible and self-contained, the authors admit that the last two chapters cover areas where research is ongoing and suggest that they are best used by those with prior knowledge of those areas.

Chapter 3 presents GMRF's in which Q is not of full rank. This can result from a linear constraint on conditional means, but also arises in cases where the data are first or second differences between observations at neighboring sites. For example, GMRF methods can be used to estimate parameters for random walks, based upon their paths. After fitting a random walk on the line, the authors go on to random walks on lattices, to random walks in continuous time observed at discrete random times, and to second-order random walks. Very nice graphics are used to describe sums with a simple geometrical structure that would be hard to express with formal notation.
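A quick sketch (my own hypothetical construction, not the book's) of the rank deficiency that such difference-based models produce: a precision built from second differences penalizes neither constants nor linear trends, so its rank is n − 2.

```python
import numpy as np

# D maps a length-n vector to its n-2 second differences; Q = D'D is the
# (improper) second-order random-walk precision, rank-deficient by 2.
n = 20
D = np.zeros((n - 2, n))
for i in range(n - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]   # second-difference rows

Q = D.T @ D
print(np.linalg.matrix_rank(Q))        # n - 2: constants and linear trends cost nothing
```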
The fourth chapter discusses hierarchical models, in which GMRF's are used to model dependencies between parameters. As an example, one could observe a sequence Yi of Bernoulli(pi) random variables that are not independent. To model the dependence, one could assume that Xi = logit(pi) arises from an autoregressive model and further assume that the variance of the random terms in that autoregressive process is random with some suitable distribution. Fitting a model of this kind requires use of Markov chain Monte Carlo (MCMC) techniques, which are briefly discussed. The focus of these discussions is not on convergence (which is referenced elsewhere), but on the blocking of parameters to increase the efficiency of the MCMC algorithms. Other examples include the fitting of a model in which monthly numbers of driver deaths on British roads are modeled as

Yi ∼ N(si + ti, 1/κy),

where s and t are models for seasonal variation based on random walks, and κy is the precision.
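For concreteness, a small simulation sketch of a model of this general shape; all settings here are my own choices, not the book's.

```python
import numpy as np

# Simulate a random-walk trend t, a centered seasonal effect s, and
# observations Y_i ~ N(s_i + t_i, 1/kappa_y).
rng = np.random.default_rng(2)
n, period, kappa_y = 120, 12, 4.0

t = np.cumsum(rng.normal(scale=0.1, size=n))             # RW1 trend
s_pattern = rng.normal(scale=1.0, size=period)
s = np.tile(s_pattern - s_pattern.mean(), n // period)   # centered seasonal effect
y = s + t + rng.normal(scale=1.0 / np.sqrt(kappa_y), size=n)
```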
After considering Normal responses, various extensions to non-Normal responses are given. Rents in Munich are fitted to a mixture Normal distribution, in which the precision parameter is Gamma-distributed so as to produce a response with a t-distribution. A logistic model is fitted to rainfall data via a model that samples from a Normal and a Kolmogorov–Smirnov distribution. A model with Poisson response is fitted by replacing a complicated Poisson-based conditional distribution with a Normal distribution via a Taylor series approximation. In the second half of the last chapter, new methods are presented for improving this approximation, based on incorporating aspects of the Taylor series remainder. All of these examples, although mathematically complex, are presented with an emphasis on how certain fitting strategies can lead to increases in computing speed.

The final new topic is the use of GMRF's to approximate the Gaussian random fields used in geostatistics. Unlike GMRF's, these models are continuous and have covariance functions defined on the entire real line. Since these covariance functions concentrate most of the dependence within a finite region, they can be approximated by a GMRF when the extremely weak long-distance dependence is ignored. Great savings in speed are obtained by using a toroidal GMRF, in which the underlying graph has a toroidal structure. Since this implies that the data in a square window tile the plane with replicates of themselves, this model should only be used with great caution.

Overall, the book is an excellent, clearly written introduction to a class of advanced models for use by those with a solid background in probability and statistics. The book has one major flaw, which is related to its production. Although the page layout, in general, is nicely done, major mistakes were made in reproducing some figures. For example, there are many maps of Germany with multiple regions, in which each region has a gray-scale value related to a response. The printing technology used for this book so muddies these grays together that it is almost impossible to compare the observed responses to fitted ones, except at the most extreme divergences. This could have been improved by using different printing technologies or by printing the maps in color.

Jeffrey D. PICKA
University of New Brunswick

REFERENCE

Hogg, R. V., McKean, J., and Craig, A. T. (2005), Introduction to Mathematical Statistics (6th ed.), Upper Saddle River, NJ: Prentice–Hall.

The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, by Reuven Y. RUBINSTEIN and Dirk P. KROESE, New York: Springer-Verlag, 2004, ISBN 0-387-21240-X, xx + 300 pp., $84.95.

As mentioned in its Preface, this book is a comprehensive and accessible introduction to the cross-entropy (CE) method, which was originally proposed by the first author as an adaptive algorithm for rare-event simulation. Using a direct Monte Carlo technique to estimate rare-event probabilities requires a huge number of trials. Traditionally, it is common to use importance sampling (IS) techniques to make the occurrence of rare events more frequent by simulating variates from a different probability distribution and estimating the rare-event probability via a likelihood ratio estimator. Around 1997, the first author proposed the CE method as an adaptive IS algorithm for rare-event simulation based on variance minimization. Later, the CE method was modified into a randomized optimization technique, with the original variance minimization problem replaced by a Kullback–Leibler cross-entropy minimization problem. The key advantage of the CE method is that the parameter updates have analytical solutions for a wide class of distributions, including the natural exponential family. The CE method was later applied to optimization problems by first translating the underlying problem into an associated estimation problem.
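A minimal sketch of the IS idea just described (my own toy example, not one from the book): estimate P(X > γ) for X ∼ Exp(1) by sampling from a tilted exponential under which the event is common, and reweight each draw by the likelihood ratio.

```python
import numpy as np

# Naive Monte Carlo would need ~1/p trials to see a single hit; IS does not.
rng = np.random.default_rng(3)
gamma, n = 20.0, 100_000             # true probability exp(-20) ~ 2.06e-9
v = 1.0 / gamma                      # tilted rate; a CE-style choice near 1/gamma

x = rng.exponential(scale=1.0 / v, size=n)     # samples from g(x) = v exp(-v x)
w = np.exp(-x) / (v * np.exp(-v * x))          # likelihood ratio f(x)/g(x)
est = np.mean(w * (x > gamma))
print(est, np.exp(-gamma))           # close; under g the event has probability ~0.37
```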
This book is organized into eight chapters and one appendix. Chapter 1 describes brief and necessary background on the cross-entropy method. In addition to a basic introduction to probability and statistics, various measures of information in a random experiment are discussed. In particular, Kullback–Leibler cross-entropy is introduced as a "semidistance" measure between two density functions. Chapter 2 is a tutorial introduction to the CE method. Two simple examples are given to demonstrate the applications of the CE method to a rare-event simulation and a combinatorial optimization problem. Although the first example, of rare-event simulation, is somewhat interesting, the second example, of combinatorial optimization, is not very convincing, because one can find an easy and straightforward solution. Chapter 3 expands the brief introduction of the CE method to efficient simulation of rare events. A variance minimization algorithm for the IS technique and the CE algorithm are described and compared in depth. Chapter 4 demonstrates how to transform the CE method into an efficient randomized algorithm for solving combinatorial optimization problems. Many combinatorial optimization problems are formulated as optimization problems concerning a weighted graph. Most of the techniques and applications described relate to nodes and edges in a graph, which may not be of great concern for researchers and practitioners in the field of statistics. Chapters 5 and 6 discuss how to extend the CE method to various optimization problems. Chapter 5 introduces a fully automated version of the CE algorithm (FACE), in which all the parameters of the CE algorithm are automatically tuned. Chapter 6 considers the problem of optimization with noisy objective functions. Chapter 7 discusses applications of the CE method to some combinatorial optimization problems, such as pairwise sequence alignment problems in bioinformatics, scheduling problems in operations research, and the clique problem in graph theory. Finally, in Chapter 8, applications of the CE method to some problems in machine learning are presented. Various Matlab implementations of CE algorithms are given in the Appendix.
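To fix ideas, here is a minimal sketch of the basic CE optimization loop in the continuous case (my own illustration; the book's implementations are in Matlab): sample from a parametric family, keep the elite fraction of samples, and refit the parameters analytically.

```python
import numpy as np

# CE minimization of f(x) = (x - 3)^2 with a Gaussian sampling distribution.
rng = np.random.default_rng(4)
f = lambda x: (x - 3.0) ** 2

mu, sigma, n, rho = 0.0, 10.0, 200, 0.1       # initial params, sample size, elite fraction
for _ in range(30):
    x = rng.normal(mu, sigma, size=n)
    elite = x[np.argsort(f(x))[: int(rho * n)]]    # best rho*n samples
    mu, sigma = elite.mean(), elite.std() + 1e-12  # analytic CE update for a Gaussian
print(mu)    # ~3.0
```

For the normal family, this elite mean/standard-deviation refit is exactly the kind of analytical parameter update mentioned above.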
Since the CE method is a young and developing field, there is no other book available in this area, in which the two authors are the pioneers. Therefore, it is quite a unique book, and it may become a classic reference in the CE method literature.

Lih-Yuan DENG
University of Memphis
The EM Algorithm and Related Statistical Models, edited by Michiko WATANABE and Kazunori YAMAGUCHI, New York: Marcel Dekker, 2004, ISBN 0-8247-4701-1, ix + 197 pp., $165.00.

This book is a compilation of chapters on statistical methods for dealing with incomplete data; the first five chapters are authored by the editors, and the last four chapters are coauthored by four other researchers. Common statistical methods do not work well on samples with incomplete data. In the area of environmental statistics, incomplete data occur in the form of nondetects. The recent book by Helsel (2005), reviewed for Technometrics by Gatliffe (2006), discusses statistical methods for dealing with censored environmental data. The book by Watanabe and Yamaguchi deals with the problem of incomplete data in a more general setup.

The first chapter of the book gives a general description of the problems one faces when dealing with incomplete data; a discussion of models that generate missing data is also included in this chapter.

Problems dealing with multivariate data with missing values are discussed in the second chapter. Methods of dealing with missing values in a multivariate dataset [casewise deletion, pairwise deletion, missing-value replacement, the expectation–maximization (EM) algorithm] are discussed. The EM algorithm is an iterative method that involves first finding a conditional expected value and then solving the likelihood equation. The method does not involve matrix inversion, but generally needs a large number of iterations to reach convergence.
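A compact sketch of the E-step/M-step alternation just described, on the textbook incomplete-data problem of a two-component normal mixture with known unit variances (my own toy example, not taken from the book):

```python
import numpy as np

# The component labels are the "missing data": the E-step takes their
# conditional expectations, the M-step solves the likelihood equations.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

pi_, mu1, mu2 = 0.5, -1.0, 1.0               # initial guesses
for _ in range(100):
    # E-step: posterior probability that each point came from component 1
    d1 = pi_ * np.exp(-0.5 * (x - mu1) ** 2)
    d2 = (1 - pi_) * np.exp(-0.5 * (x - mu2) ** 2)
    r = d1 / (d1 + d2)
    # M-step: weighted maximum likelihood updates in closed form
    pi_ = r.mean()
    mu1 = (r * x).sum() / r.sum()
    mu2 = ((1 - r) * x).sum() / (1 - r).sum()

print(round(pi_, 2), round(mu1, 2), round(mu2, 2))   # ~0.3, -2.0, 2.0
```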
Chapter 3 describes in detail the EM algorithm, developed by Dempster, Laird, and Rubin (1977). An example using the multivariate normal distribution is included. This chapter also describes the properties of the EM algorithm. Two additional examples (estimation of the survival function and analysis of a repeated-measures model in the presence of missing values in the sample) are also given.

Linear models with heavy-tailed error terms (such as the t distribution) are discussed in detail in Chapter 4. The scale mixture of normal distributions is introduced, and the t distribution, the double-exponential distribution, and the contaminated normal distribution are shown to be special cases of the scale mixture of normal distributions. The estimation of the linear model with heavy-tailed error terms is discussed in detail. The EM algorithm is used to compute the maximum likelihood estimates of the parameters of the model. A method for handling missing values is included. The use of the EM algorithm to perform robust factor analysis is also discussed.

Chapter 5 is a very short chapter on the usefulness of the EM algorithm in solving a class of problems called latent structure models. Chapter 6 consists of an extension of the EM algorithm, called the partial imputation EM (PIEM) algorithm. It is shown that the PIEM algorithm converges faster than the EM algorithm. The chapter ends with a couple of examples.

In Chapter 7, the speed of convergence of the EM algorithm is investigated, and methods for accelerating the EM algorithm are discussed. Chapter 8 is concerned with the geometry of the EM algorithm. Examples of neural network models in which the EM algorithm implicitly appears in the learning step are included.

Chapter 9 discusses the applicability of the Markov chain Monte Carlo method for performing Bayesian analysis with missing data. The Gibbs sampler and the Metropolis–Hastings algorithm are discussed in detail in this chapter. Appendix A gives a detailed description of the software package SOLAS 3.0 for single/multiple data imputation.

This is an excellent book to have for practitioners in the field of data analysis who face samples with incomplete data.

Ashok K. SINGH
University of Nevada–Las Vegas

REFERENCES

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1–38.
Gatliffe, T. (2006), Review of Nondetects and Data Analysis—Statistics for Censored Environmental Data, by D. Helsel, Technometrics, 48, 145.
Helsel, D. (2005), Nondetects and Data Analysis—Statistics for Censored Environmental Data, Hoboken, NJ: Wiley.

Nonlinear Signal Processing: A Statistical Approach, by Gonzalo R. ARCE, Hoboken, NJ: Wiley, 2005, ISBN 0-471-67621-1, xx + 459 pp. + CD, $79.95.

Linear filters, which are widely used in signal processing, are based on the assumption of Gaussian noise, and this Gaussian noise assumption rests on the central limit theorem, which is widely credited with giving the linear approach to smoothing noisy data its universal applicability. This book intends to broaden signal processing methodology by using nonlinear estimators, the reason being that real-world noise, as exemplified in the book, is not limited to well-behaved Gaussian distributions; it is precisely outside the Gaussian setting that the nice properties of linear estimators break down.

The book starts with a quick overview of non-Gaussian random processes and their nonlinear estimators, followed by a detailed discussion in Chapter 2 of the generalized Gaussian distribution and the class of stable distributions. Like the Gaussian, a stable random variable belongs to a class of random variables that is closed under linear operations. In other words, a random variable X is stable if, whenever X1 and X2 are independent copies of X and a and b are arbitrary positive constants, there are constants c and d such that aX1 + bX2 has the same distribution as cX + d. The normal distribution is a member of the stable family, but stable distributions include members with infinite second moment, whereas the classical central limit theorem requires the existence of a finite second moment.

Introduction of the class of stable distributions paves the way for the generalized central limit theorem: let X1, X2, . . . be an independent, identically distributed sequence of random variables; then there exist constants an and bn such that, as n → ∞, the normalized sum an(X1 + X2 + · · · + Xn) − bn converges in distribution to Z if and only if Z is a stable random variable with index parameter 0 < α ≤ 2. Again, in the generalized case a stable random variable Z plays the role that the normal random variable plays in the classical case.
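A small simulation sketch of this statement (mine, not the book's; it uses symmetrized Pareto-tailed draws as a stand-in for a distribution in the domain of attraction of an α-stable law, so no centering constants are needed):

```python
import numpy as np

# For iid variables with infinite variance (tail index alpha < 2), sums
# normalized by n**(1/alpha) stabilize; the classical sqrt(n) scaling would not.
rng = np.random.default_rng(6)
alpha = 1.5                                   # tail index in (0, 2)

def norm_sum(n, samples=1000):
    # each row is one normalized sum a_n * (X1 + ... + Xn), a_n = n**(-1/alpha)
    x = rng.pareto(alpha, size=(samples, n)) * rng.choice([-1, 1], size=(samples, n))
    return x.sum(axis=1) / n ** (1 / alpha)

for n in (100, 1000, 10000):
    s = norm_sum(n)
    print(n, np.percentile(s, [25, 75]))      # interquartile range stabilizes with n
```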
Order statistics play a prominent role in nonparametric methods. The properties of order statistics are examined in Chapter 3. Filtering as a statistical method of estimating a signal is discussed in Chapter 4. Part II, "Signal Processing With Order Statistics," mainly deals with various median smoothers. Part III, "Signal Processing With Stable Models," introduces myriad smoothers. The myriad estimator of a sample sequence of stable random variables X1, X2, . . . , XN with a given positive constant K is

βK = Myriad(K; X1, X2, . . . , XN) = arg min_β ∏_{i=1}^N [K² + (Xi − β)²].

Chapters 8 and 9 of this part discuss the properties of this estimator and its applications.
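A minimal sketch of computing the sample myriad by brute force (my own illustration; the book's companion CD supplies MATLAB implementations):

```python
import numpy as np

def myriad(x, K, grid_size=10001):
    # Minimize prod_i [K^2 + (x_i - beta)^2] over beta by grid search;
    # taking logs turns the product into a sum for numerical stability.
    grid = np.linspace(x.min(), x.max(), grid_size)
    cost = np.sum(np.log(K**2 + (x[:, None] - grid[None, :])**2), axis=0)
    return grid[np.argmin(cost)]

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0.0, 1.0, 50), np.array([80.0, -120.0])])  # impulses
print(myriad(x, K=1.0))    # near 0: the impulses barely move the estimate
```

As K grows, the myriad tends to the sample mean, its linear counterpart, which is the limiting behavior the closing paragraph below alludes to.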
Although this book addresses the issues encountered in signal processing engineering, the reader can also treat it as an advanced statistical methods text. Of especial interest are the various discussions dealing with heavy-tailed distributions and impulsive signals. The geometrical interpretation of myriad estimation and K's role in determining the extent of the impulse signal are especially interesting. For practical reasons, the author also gives an iterative algorithm to obtain best estimates of K and β from the data.

The book comes with a companion CD that contains more than 60 algorithms as MATLAB m-files. These functions provide almost all the major algorithms discussed in the book. Even though they require MATLAB to run, you can open the m-files as text using Microsoft Word or any text file reader.

The proposed nonlinear estimators/filters usually have a corresponding linear counterpart as a special case, which leads to the construction of more universal filters that expand the capability and the robustness of data processing.