Information-Theoretic Methods in Data Science
Learn about the state-of-the-art at the interface between information theory and data
science with this first unified treatment of the subject. Written by leading experts
in a clear, tutorial style, and using consistent notation and definitions throughout, it
shows how information-theoretic methods are being used in data acquisition, data
representation, data analysis, and statistics and machine learning.
Coverage is broad, with chapters on signal data acquisition, data compression,
compressive sensing, data communication, representation learning, emerging topics in
statistics, and much more. Each chapter includes a topic overview, definition of the key
problems, emerging and open problems, and an extensive reference list, allowing readers
to develop in-depth knowledge and understanding.
Providing a thorough survey of the current research area and cutting-edge trends, this
is essential reading for graduate students and researchers working in information theory,
signal processing, machine learning, and statistics.
YONINA C. ELDAR
Weizmann Institute of Science
www.cambridge.org
Information on this title: www.cambridge.org/9781108427135
DOI: 10.1017/9781108616799
To my wife Eduarda and children Isabel and Diana for their unconditional love,
encouragement, and support
MR
Contents
2.1 Introduction 44
2.2 Lossy Compression of Finite-Dimensional Signals 48
2.3 ADX for Continuous-Time Analog Signals 49
2.4 The Fundamental Distortion Limit 55
2.5 ADX under Uniform Sampling 63
2.6 Conclusion 69
References 70
3.3 Definitions 76
3.4 Compressible Signal Pursuit 77
3.5 Compression-Based Gradient Descent (C-GD) 81
3.6 Stylized Applications 85
3.7 Extensions 92
3.8 Conclusions and Discussion 100
References 100
Index 529
Preface
Since its introduction in 1948, the field of information theory has proved instrumen-
tal in the analysis of problems pertaining to compressing, storing, and transmitting
data. For example, information theory has allowed analysis of the fundamental limits
of data communication and compression, and has shed light on practical communica-
tion system design for decades. Recent years have witnessed a renaissance in the use
of information-theoretic methods to address problems beyond data compression, data
communications, and networking, such as compressive sensing, data acquisition, data
analysis, machine learning, graph mining, community detection, privacy, and fairness.
In this book, we explore a broad set of problems on the interface of signal processing,
machine learning, learning theory, and statistics where tools and methodologies orig-
inating from information theory can provide similar benefits. The role of information
theory at this interface has indeed been recognized for decades. A prominent example is
the use of information-theoretic quantities such as mutual information, metric entropy
and capacity in establishing minimax rates of estimation back in the 1980s. Here we
intend to explore modern applications at this interface that are shaping data science in
the twenty-first century.
There are of course some notable differences between standard information-theoretic
tools and signal-processing or data analysis methods. Globally speaking, information
theory tends to focus on asymptotic limits, using large blocklengths, and assumes the
data is represented by a finite number of bits and viewed through a noisy channel.
The standard results are not concerned with complexity but focus more on fundamen-
tal limits characterized via achievability and converse results. On the other hand, some
signal-processing techniques, such as sampling theory, are focused on discrete-time rep-
resentations but do not necessarily assume the data is quantized or that there is noise in
the system. Signal processing is often concerned with concrete methods that are opti-
mal, namely, achieve the developed limits, and have bounded complexity. It is natural
therefore to combine these tools to address a broader set of problems and analysis which
allows for quantization, noise, finite samples, and complexity analysis.
This book is aimed at providing a survey of recent applications of information-
theoretic methods to emerging data-science problems. The potential reader of this book
could be a researcher in the areas of information theory, signal processing, machine
learning, statistics, applied mathematics, computer science or a related research area, or
a graduate student seeking to learn about information theory and data science and to
scope out open problems at this interface. The particular design of this volume ensures
that it can serve as both a state-of-the-art reference for researchers and a textbook for
students.
The book contains 16 diverse chapters written by recognized leading experts world-
wide, covering a large variety of topics that lie on the interface of signal processing, data
science, and information theory. The book begins with an introduction to information
theory which serves as a background for the remaining chapters, and also sets the nota-
tion to be used throughout the book. The following chapters are then organized into four
categories: data acquisition (Chapters 2–4), data representation and analysis (Chapters
5–9), information theory and machine learning (Chapters 10 and 11), and information
theory, statistics, and compression (Chapters 12–15). The last chapter, Chapter 16, con-
nects several of the book’s themes via a survey of Fano’s inequality in a diverse range
of data-science problems. The chapters are self-contained, covering the most recent
research results in the respective topics, and can all be treated independently of each
other. A brief summary of each chapter is given next.
Chapter 1 by Rodrigues, Draper, Bajwa, and Eldar provides an introduction to infor-
mation theory concepts and serves two purposes: It provides background on classical
information theory, and presents a taster of modern information theory applied to
emerging data-science problems.
Chapter 2 by Kipnis, Eldar, and Goldsmith extends the notion of rate-distortion the-
ory to continuous-time inputs deriving bounds that characterize the minimal distortion
that can be achieved in representing a continuous-time signal by a series of bits when
the sampler is constrained to a given sampling rate. For an arbitrary stochastic input and
given a total bitrate budget, the authors consider the lowest sampling rate required to
sample the signal such that reconstruction of the signal from a bit-constrained represen-
tation of its samples results in minimal distortion. It turns out that often the signal can
be sampled at sub-Nyquist rates without increasing the distortion.
Chapter 3 by Jalali and Poor discusses the interplay between compressed sensing and
compression codes. In particular, the authors consider the use of compression codes to
design compressed sensing recovery algorithms. This allows the expansion of the class
of structures used by compressed sensing algorithms to those used by data compression
codes, which is a much richer class of inputs and relies on decades of developments in
the field of compression.
Chapter 4 by Pilanci develops information-theoretical lower bounds on sketching for
solving large statistical estimation and optimization problems. The term sketching is
used for randomized methods that aim to reduce data dimensionality in computationally
intensive tasks for gains in space, time, and communication complexity. These bounds
allow one to obtain interesting trade-offs between computation and accuracy and shed
light on a variety of existing methods.
Chapter 5 by Shakeri, Sarwate, and Bajwa treats the problem of dictionary learning,
which is a powerful signal-processing approach for data-driven extraction of features
from data. The chapter summarizes theoretical aspects of dictionary learning for vector-
and tensor-valued data and explores lower and upper bounds on the sample complexity
of dictionary learning which are derived using information-theoretic tools. The depen-
dence of sample complexity on various parameters of the dictionary learning problem
is highlighted along with the potential advantages of taking the structure of tensor data
into consideration in representation learning.
Chapter 6 by Riegler and Bölcskei presents an overview of uncertainty relations for
sparse signal recovery starting from the work of Donoho and Stark. These relations are
then extended to richer data structures and bases, which leads to the recently discovered
set-theoretic uncertainty relations in terms of Minkowski dimension. The chapter also
explores the connection between uncertainty relations and the “large sieve,” a family
of inequalities developed in analytic number theory. It is finally shown how uncertainty
relations allow one to establish fundamental limits of practical signal recovery problems
such as inpainting, declipping, super-resolution, and denoising of signals.
Chapter 7 by Reeves and Pfister examines high-dimensional inference problems
through the lens of information theory. The chapter focuses on the standard linear model
for which the performance of optimal inference is studied using the replica method from
statistical physics. The chapter presents a tutorial of these techniques and presents a new
proof demonstrating their optimality in certain settings.
Chapter 8 by Shah discusses the question of learning distributions over permutations
of a given set of choices based on partial observations. This is central to capturing
choice in a variety of contexts such as understanding preferences of consumers over
a collection of products based on purchasing and browsing data in the setting of retail
and e-commerce. The chapter focuses on the learning task from marginal distributions
of two types, namely, first-order marginals and pair-wise comparisons, and provides a
comprehensive review of results in this area.
Chapter 9 by Raman and Varshney studies universal clustering, namely, clustering
without prior access to the statistical properties of the data. The chapter formalizes the
problem in information theory terms, focusing on two main subclasses of clustering that
are based on distance and dependence. A review of well-established clustering algo-
rithms, their statistical consistency, and their computational and sample complexities is
provided using fundamental information-theoretic principles.
Chapter 10 by Raginsky, Rakhlin, and Xu introduces information-theoretic measures
of algorithmic stability and uses them to upper-bound the generalization bias of learn-
ing algorithms. The notion of stability implies that its output does not depend too
much on any individual training example and therefore these results shed light on the
generalization ability of modern learning techniques.
Chapter 11 by Piantanida and Vega introduces the information bottleneck principle
and explores its use in representation learning, namely, in the development of com-
putational algorithms that learn the different explanatory factors of variation behind
high-dimensional data. Using these tools, the authors obtain an upper bound on the
generalization gap corresponding to the cross-entropy risk. This result provides an inter-
esting connection between mutual information and generalization, and helps to explain
why noise injection during training can improve the generalization ability of encoder
models.
H(·) entropy
H(·|·) conditional entropy
h(·) differential entropy
h(·|·) conditional differential entropy
D(·‖·) relative entropy
I(·; ·) mutual information
I(·; ·|·) conditional mutual information
N(μ, σ²) scalar Gaussian distribution with mean μ and variance σ²
N(μ, Σ) multivariate Gaussian distribution with mean μ and covariance matrix Σ
Contributors

Volkan Cevher
Laboratory for Information and Inference Systems
Institute of Electrical Engineering
School of Engineering, EPFL

Jie Ding
School of Statistics
University of Minnesota

Stark C. Draper
Department of Electrical and Computer Engineering
University of Toronto

Yonina C. Eldar
Faculty of Mathematics and Computer Science
Weizmann Institute of Science

Soheil Feizi
Department of Computer Science
University of Maryland, College Park

Lifeng Lai
Department of Electrical and Computer Engineering
University of California, Davis

Muriel Médard
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

Henry D. Pfister
Department of Electrical and Computer Engineering
Duke University

Pablo Piantanida
Laboratoire des Signaux et Systèmes
Université Paris Saclay
CNRS-CentraleSupélec
Montreal Institute for Learning Algorithms (Mila), Université de Montréal
Summary
The field of information theory – dating back to 1948 – is one of the landmark intel-
lectual achievements of the twentieth century. It provides the philosophical and math-
ematical underpinnings of the technologies that allow accurate representation, efficient
compression, and reliable communication of sources of data. A wide range of storage
and transmission infrastructure technologies, including optical and wireless commu-
nication networks, the internet, and audio and video compression, have been enabled
by principles illuminated by information theory. Technological breakthroughs based on
information-theoretic concepts have driven the “information revolution” characterized
by the anywhere and anytime availability of massive amounts of data and fueled by the
ubiquitous presence of devices that can capture, store, and communicate data.
The existence and accessibility of such massive amounts of data promise immense
opportunities, but also pose new challenges in terms of how to extract useful and action-
able knowledge from such data streams. Emerging data-science problems are different
from classical ones associated with the transmission or compression of information in
which the semantics of the data was unimportant. That said, we are starting to see
that information-theoretic methods and perspectives can, in a new guise, play impor-
tant roles in understanding emerging data-science problems. The goal of this book is
to explore such new roles for information theory and to understand better the modern
interaction of information theory with other data-oriented fields such as statistics and
machine learning.
The purpose of this chapter is to set the stage for the book and for the upcoming
chapters. We first overview classical information-theoretic problems and solutions. We
then discuss emerging applications of information-theoretic methods in various data-
science problems and, where applicable, refer the reader to related chapters in the book.
Throughout this chapter, we highlight the perspectives, tools, and methods that play
important roles in classic information-theoretic paradigms and in emerging areas of data
science. Table 1.1 provides a summary of the different topics covered in this chapter and
highlights the different chapters that can be read as a follow-up to these topics.
Table 1.1. Major topics covered in this chapter and their connections to other chapters
Figure 1.1 Reproduction of Shannon’s Figure 1 in [1] with the addition of the source and channel
encoding/decoding blocks. In Shannon’s words, this is a “schematic diagram of a general
communication system.”
The message is then fed into a transmission system. The transmitter itself has two
main sub-components: the source encoder and the channel encoder. The source encoder
converts the message into a sequence of 0s and 1s, i.e., a bit sequence. There are two
classes of source encoders. Lossless source coding removes predictable redundancy
that can later be recreated. In contrast, lossy source coding is an irreversible process
wherein some distortion is incurred in the compression process. Lossless source cod-
ing is often referred to as data compression while lossy coding is often referred to as
rate-distortion coding. Naturally, the higher the distortion the fewer the number of bits
required.
The bit sequence forms the data payload that is fed into a channel encoder. The out-
put of the channel encoder is a signal that is transmitted over a noisy communication
medium. The purpose of the channel code is to convert the bits into a set of possible
signals or codewords that can be reliably recovered from the noisy received signal.
The communication medium itself is referred to as the channel. The channel can
model the physical separation of the transmitter and receiver. It can also, as in data
storage, model separation in time.
The destination observes a signal that is the output of the communication channel.
Similar to the transmitter, the receiver has two main components: a channel decoder and
a source decoder. The former maps the received signal into a bit sequence that is, one
hopes, the same as the bit sequence produced by the transmitter. The latter then maps
the estimated bit sequence to an estimate of the original message.
If lossless compression is used, then an apt measure of performance is the probabil-
ity that the message estimate at the destination is not equal to the original message at
the transmitter. If lossy compression (rate distortion) is used, then other measures of
goodness, such as mean-squared error, are more appropriate.
Interesting questions addressed by information theory include the following.
1. Architectures
• What trade-offs in performance are incurred by the use of the architecture
detailed in Figure 1.1?
• When can this architecture be improved upon; when can it not?
2. Source coding: lossless data compression
• How should the information source be modeled; as stochastic, as arbitrary
but unknown, or in some other way?
• What is the shortest bit sequence into which a given information source can
be compressed?
• What assumptions does the compressor work under?
• What are basic compression techniques?
3. Source coding: rate-distortion theory
• How do you convert an analog source into a digital bitstream?
• How do you reconstruct/estimate the original source from the bitstream?
• What is the trade-off involved between the number of bits used to describe
a source and the distortion incurred in reconstruction of the source?
4. Channel coding
• How should communication channels be modeled?
• What throughput, measured in bits per second, at what reliability, measured
in terms of probability of error, can be achieved?
• Can we quantify fundamental limits on the realizable trade-offs between
throughput and reliability for a given channel model?
• How does one build computationally tractable channel coding systems that
“saturate” the fundamental limits?
5. Multi-user information theory
• How do we design systems that involve multiple transmitters and receivers?
• How do many (perhaps correlated) information sources and transmission
channels interact?
The decades since Shannon’s first paper have seen fundamental advances in each of
these areas. They have also witnessed information-theoretic perspectives and thinking
impacting a number of other fields including security, quantum computing and com-
munications, and cryptography. The basic theory and many of these developments are
documented in a body of excellent texts, including [2–9]. Some recent advances in net-
work information theory, which involves multiple sources and/or multiple destinations,
are also surveyed in Chapter 15. In the next three sections, we illustrate the basics of
information-theoretic thinking by focusing on simple (point-to-point) binary sources and
channels. In Section 1.2, we discuss the compression of binary sources. In Section 1.3,
we discuss channel coding over binary channels. Finally, in Section 1.4, we discuss
computational issues, focusing on linear codes.
To gain a feel for the tools and results of classical information theory consider the fol-
lowing lossless source coding problem. One observes a length-n string of random coin
flips, X1 , X2 , . . . , Xn , each Xi ∈ {heads, tails}. The flips are independent and identically
distributed with P(Xi = heads) = p, where 0 ≤ p ≤ 1 is a known parameter. Suppose we
want to map this string into a bit sequence to store on a computer for later retrieval.
Say we are going to assign a fixed amount of memory to store the sequence. How much
memory must we allocate?
Since there are 2n possible sequences, all of which could occur if p is not equal to 0
or 1, if we use n bits we can be 100% certain we could index any heads/tails sequence
that we might observe. However, certain sequences, while possible, are much less likely
than others. Information theory exploits such non-uniformity to develop systems that
can trade off between efficiency (the storage of fewer bits) and reliability (the greater
certainty that one will later be able to reconstruct the observed sequence). In the follow-
ing, we accept some (arbitrarily) small probability ε > 0 of observing a sequence that we
choose not to be able to store a description of.1 One can think of ε as the probability of
the system failing. Under this assumption we derive bounds on the number of bits that
need to be stored.
1.2.1 Achievability: An Upper Bound on the Rate Required for Reliable Data Storage
To figure out which sequences we may choose not to store, let us think about the statistics. In expectation, we observe np heads. Of the $2^n$ possible heads/tails sequences there are $\binom{n}{np}$ sequences with np heads. (For the moment we ignore non-integer effects and deal with them later.) There will be some variability about this mean but, at a minimum, we must be able to store all these expected realizations since these realizations all have the same probability. While $\binom{n}{np}$ is the cardinality of the set, we prefer to develop a good approximation that is more amenable to manipulation. Further, rather than counting cardinality, we will count the log-cardinality. This is because given k bits we can index $2^k$ heads/tails source sequences. Hence, it is the exponent in which we are interested.
Using Stirling's approximation to the factorial, log2 n! = n log2 n − (log2 e)n + O(log2 n), and ignoring the order term, we have
$$\log_2\binom{n}{np} \approx n\log_2 n - n(1-p)\log_2\big(n(1-p)\big) - np\log_2(np) \qquad (1.1)$$
$$= n\log_2\frac{1}{1-p} + np\log_2\frac{1-p}{p}$$
$$= n\left[(1-p)\log_2\frac{1}{1-p} + p\log_2\frac{1}{p}\right]. \qquad (1.2)$$
In (1.1), the (log2 e)n terms have canceled and the term in square brackets in (1.2) is called the (binary) entropy, which we denote as HB(p), so
$$H_B(p) = (1-p)\log_2\frac{1}{1-p} + p\log_2\frac{1}{p}, \qquad (1.3)$$
where 0 ≤ p ≤ 1 and 0 log 0 = 0. The binary entropy function is plotted in Fig. 1.2 within Section 1.3. One can compute that when p = 0 or p = 1 then HB(0) = HB(1) = 0. The interpretation is that, since there is only one all-tails and one all-heads sequence, and we are quantifying log-cardinality, there is only one sequence to index in each case so log2(1) = 0. In these cases, we a priori know the outcome (respectively, all tails or all heads) and so do not need to store any bits to describe the realization. On the other hand, if the coin is fair then p = 0.5, HB(0.5) = 1, $\binom{n}{n/2} \approx 2^n$, and we must use n bits of storage. In other words, on an exponential scale almost all binary sequences are 50% heads and 50% tails. As an intermediate value, if p = 0.11 then HB(0.11) ≈ 0.5.
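The following short computation is a sketch added here for illustration (it is not part of the original text): it checks numerically that the normalized log-cardinality (1/n) log2 of the binomial coefficient approaches HB(p), in line with (1.1)–(1.2); the small remaining gap is the ignored O(log2 n)/n term.

```python
# Illustrative sketch (not from the text): compare (1/n) log2 C(n, np) with H_B(p).
from math import comb, log2

def binary_entropy(p):
    """H_B(p) = (1-p) log2(1/(1-p)) + p log2(1/p), with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(1 - p) * log2(1 - p) - p * log2(p)

n = 1000
for p in (0.11, 0.25, 0.5):
    k = round(n * p)                      # ignore non-integer effects, as in the text
    exponent = log2(comb(n, k)) / n       # normalized log-cardinality of the set
    print(f"p = {p}: H_B(p) = {binary_entropy(p):.4f}, "
          f"(1/n) log2 C(n, np) = {exponent:.4f}")
```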
1 In source coding, this is termed near-lossless source coding as the arbitrarily small ε bounds the probability
of system failure and thus loss of the original data. In the variable-length source coding paradigm, one
stores a variable amount of bits per sequence, and minimizes the expected number of bits stored. We focus
on the near-lossless paradigm as the concepts involved more closely parallel those in channel coding.
The operational upshot of (1.2) is that if one allocates nHB (p) bits then basically all
expected sequences can be indexed. Of course, there are caveats. First, np need not be
integer. Second, there will be variability about the mean. To deal with both, we allocate
a few more bits, n(HB (p) + δ) in total. We use these bits not just to index the expected
sequences, but also the typical sequences, those sequences with empirical entropy close
to the entropy of the source.2 In the case of coin flips, if a particular sequence consists
of $n_H$ heads (and $n - n_H$ tails) then we say that the sequence is "typical" if
$$H_B(p) - \delta \;\le\; \frac{n_H}{n}\log_2\frac{1}{p} + \frac{n - n_H}{n}\log_2\frac{1}{1-p} \;\le\; H_B(p) + \delta. \qquad (1.4)$$
It can be shown that the cardinality of the set of sequences that satisfies condition (1.4) is
upper-bounded by 2n(HB (p)+δ) . Therefore if, for instance, one lists the typical sequences
lexicographically, then any typical sequence can be described using n(HB (p) + δ) bits.
One can also show that for any δ > 0 the probability of the source not producing a typical
sequence can be upper-bounded by any ε > 0 as n grows large. This follows from the
law of large numbers. As n grows the distribution of the fraction of heads in the realized
source sequence concentrates about its expectation. Therefore, as long as n is sufficiently
large, and as long as δ > 0, any ε > 0 will do. The quantity HB (p)+δ is termed the storage
“rate” R. For this example R = HB (p) + δ. The rate is the amount of memory that must
be made available per source symbol. In this case, there were n symbols (n coin tosses),
so one normalizes n(HB (p) + δ) by n to get the rate HB (p) + δ.
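As an illustration of these definitions, the sketch below (an addition for this exposition, not from the text) draws i.i.d. Bernoulli(p) sequences and checks the typicality condition (1.4); for fixed δ > 0 and large n, the empirical fraction of typical sequences is close to one, which is exactly the concentration argument invoked above.

```python
# Illustrative sketch: empirical check of the typicality condition (1.4).
import random
from math import log2

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(1 - p) * log2(1 - p) - p * log2(p)

def is_typical(seq, p, delta):
    n, n_heads = len(seq), sum(seq)
    empirical = (n_heads / n) * log2(1 / p) + ((n - n_heads) / n) * log2(1 / (1 - p))
    return abs(empirical - binary_entropy(p)) <= delta

random.seed(0)
p, delta, n, trials = 0.11, 0.05, 2000, 1000
typical = sum(is_typical([random.random() < p for _ in range(n)], p, delta)
              for _ in range(trials))
print(f"fraction of typical sequences: {typical / trials:.3f}")
print(f"storage rate used: H_B(p) + delta = {binary_entropy(p) + delta:.3f} bits/symbol")
```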
The above idea can immediately be extended to independent and identically dis-
tributed (i.i.d.) finite-alphabet (and more general) sources as well. The general definition
of the entropy of a finite-alphabet random variable X with probability mass function
(p.m.f.) pX is
$$H(X) = -\sum_{x\in\mathcal{X}} p_X(x)\log_2 p_X(x), \qquad (1.5)$$
2 In the literature, these are termed the “weakly” typical sequences. There are other definitions of typicality
that differ in terms of their mathematical use. The overarching concept is the same.
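A direct evaluation of (1.5) is shown below for a small p.m.f. chosen purely as an illustration (the example alphabet and probabilities are not taken from the text).

```python
# Illustrative sketch: entropy (1.5) of a finite-alphabet p.m.f.
from math import log2

def entropy(pmf):
    """H(X) = -sum_x p_X(x) log2 p_X(x), with the convention 0 log2 0 = 0."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

p_X = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(entropy(p_X))   # 1.75 bits per symbol for this dyadic example
```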
1.2.2 Converse: A Lower Bound on the Rate Required for Reliable Data Storage
A second hallmark of information theory is the emphasis on developing bounds. The
source coding scheme described above is known as an achievability result. Achievability
results involve describing an operational system that can, in principle, be realized in
practice. Such results provide (inner) bounds on what is possible. The performance of
the best system is at least this good. In the above example, we developed a source coding
technique that delivers high-reliability storage and requires a rate of H(X) + δ, where
both the error ε and the slack δ can be arbitrarily small if n is sufficiently large.
An important coupled question is how much (or whether) we can reduce the rate
further, thereby improving the efficiency of the scheme. In information theory, outer
bounds on what is possible – e.g., showing that if the encoding rate is too small one
cannot guarantee a target level of reliability – are termed converse results.
One of the key lemmas used in converse results is Fano’s inequality [7], named for
Robert Fano. The statement of the inequality is as follows: For any pair of random variables (U, V) ∈ U × V jointly distributed according to pU,V(·, ·) and for any estimator G : U → V with probability of error Pe = Pr[G(U) ≠ V],
$$H(V|U) \;\le\; H_B(P_e) + P_e \log_2\big(|\mathcal{V}| - 1\big). \qquad (1.6)$$
On the left-hand side of (1.6) we encounter the conditional entropy H(V|U) of the joint
p.m.f. pU,V (·, ·). We use the notation H(V|U = u) to denote the entropy in V when the
realization of the random variable U is set to U = u. Let us name this the “pointwise”
conditional entropy, the value of which can be computed by applying our formula for
entropy (1.5) to the p.m.f. pV|U (·|u). The conditional entropy is the expected pointwise
conditional entropy:
$$H(V|U) = \sum_{u\in\mathcal{U}} p_U(u)\, H(V|U = u) = \sum_{u\in\mathcal{U}} p_U(u)\left[\sum_{v\in\mathcal{V}} p_{V|U}(v|u)\log_2\frac{1}{p_{V|U}(v|u)}\right]. \qquad (1.7)$$
Fano’s inequality (1.6) can be interpreted as a bound on the ability of any hypothesis
test function G to make a (single) correct guess of the realization of V on the basis of
its observation of U. As the desired error probability Pe → 0, both terms on the right-
hand side go to zero, implying that the conditional entropy must be small. Conversely, if
the left-hand side is not too small, that asserts a non-zero lower bound on Pe . A simple
explicit bound is achieved by upper-bounding HB (Pe ) as HB (Pe ) ≤ 1 and rearranging to
find that Pe ≥ (H(V|U) − 1)/log2 (|V| − 1).
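To make the bound concrete, the sketch below (added for illustration; the joint p.m.f. is a made-up example, not from the text) evaluates the conditional entropy (1.7) and the explicit Fano bound Pe ≥ (H(V|U) − 1)/log2(|V| − 1).

```python
# Illustrative sketch: conditional entropy (1.7) and the explicit Fano lower bound.
from math import log2

def conditional_entropy(p_uv):
    """H(V|U) for a joint p.m.f. given as {(u, v): probability}."""
    p_u = {}
    for (u, _), p in p_uv.items():
        p_u[u] = p_u.get(u, 0.0) + p
    h = 0.0
    for (u, v), p in p_uv.items():
        if p > 0:
            h += p * log2(p_u[u] / p)     # p(u,v) * log2( 1 / p(v|u) )
    return h

# Example: V uniform on {0,1,2,3}; U equals V with probability 0.7 and is one of
# the other three values (each with probability 0.1) otherwise.
vals = range(4)
p_uv = {(u, v): 0.25 * (0.7 if u == v else 0.1) for u in vals for v in vals}

h_v_given_u = conditional_entropy(p_uv)
fano_bound = (h_v_given_u - 1) / log2(len(vals) - 1)
print(f"H(V|U) = {h_v_given_u:.3f} bits, Fano bound: P_e >= {fano_bound:.3f}")
```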
The usefulness of Fano’s inequality stems, in part, from the weak assumptions it
makes. One can apply Fano’s inequality to any joint distribution. Often identification of
an applicable joint distribution is part of the creativity in the use of Fano’s inequality. For
instance in the source coding example above, one takes V to be the stored data sequence,
so |V| = 2n(HB (p)+δ) , and U to be the original source sequence, i.e., U = X n . While we
do not provide the derivation herein, the result is that to achieve an error probability of
at most Pe the storage rate R is lower-bounded by R ≥ H(X) − Pe log2 |X| − HB (Pe )/n,
where |X| is the source alphabet size; for the binary example |X| = 2. As we let Pe → 0
we see that the lower bound on the achievable rate is H(X) which, letting δ → 0, is also
our upper bound. Hence we have developed an operational approach to data compression
where the rate we achieve matches the converse bound.
We now discuss the interaction between achievability and converse results. As long as
the compression rate R > H(X) then, due to concentration in measure, in the achievability
case the failure probability > 0 and rate slack δ > 0 can both be chosen to be arbitrarily
small. Concentration of measure occurs as the blocklength n becomes large. In parallel
with n getting large, the total number of bits stored nR also grows.
The entropy H(X) thus specifies a boundary between two regimes of operation. When
the rate R is larger than H(X), achievability results tell us that arbitrarily reliable storage
is possible. When R is smaller than H(X), converse results imply that reliable storage
is not possible. In particular, rearranging the converse expression and once again noting
that HB (Pe ) ≤ 1, the error probability can be lower-bounded as
$$P_e \;\ge\; \frac{H(X) - R - 1/n}{\log_2|\mathcal{X}|}. \qquad (1.8)$$
If R < H(X), then for n sufficiently large Pe is bounded away from zero.
The entropy H(X) thus characterizes a phase transition between one state, the pos-
sibility of reliable data storage, and another, the impossibility. Such sharp information-
theoretic phase transitions also characterize classical information-theoretic results on
data transmission which we discuss in the next section, and applications of information-
theoretic tools in the data sciences which we turn to later in the chapter.
Shannon applied the same mix of ideas (typicality, entropy, conditional entropy) to solve
the, perhaps at first seemingly quite distinct, problem of reliable and efficient digital
communications. This is typically referred to as Shannon’s “channel coding” problem
in contrast to the “source coding” problem already discussed.
To gain a sense of the problem we return to the simple binary setting. Suppose our
source coding system has yielded a length-k string of “information bits.” For simplicity
we assume these bits are randomly distributed as before, i.i.d. along the sequence, but
are now fair; i.e., each is equally likely to be “0” or a “1.” The objective is to convey this
sequence over a communications channel to a friend. Importantly we note that, since
the bits are uniformly distributed, our result on source coding tells us that no further
compression is possible. Thus, uniformity of message bits is a worst-case assumption.
The channel we consider is the binary symmetric channel (BSC). We can transmit
binary symbols over a BSC. Each input symbol is conveyed to the destination, but not
entirely accurately. The binary symmetric channel “flips” each channel input symbol
(0 → 1 or 1 → 0) with probability p, 0 ≤ p ≤ 1. Flips occur independently. The challenge
is for the destination to deduce, one hopes with high accuracy, the k information bits
Figure 1.2 On the left we present a graphical description of the binary symmetric channel (BSC).
Each transmitted binary symbol is represented as a 0 or 1 input on the left. Each received binary
observation is represented by a 0 or 1 output on the right. The stochastic relationship between
inputs and outputs is represented by the connectivity of the graph where the probability of
transitioning each edge is represented by the edge label p or 1 − p. The channel is “symmetric”
due to the symmetries in these transition probabilities. On the right we plot the binary entropy
function HB (p) as a function of p, 0 ≤ p ≤ 1. The capacity of the BSC is CBSC = 1 − HB (p).
transmitted. Owing to the symbol flipping noise, we get some slack; we transmit n ≥ k
binary channel symbols. For efficiency’s sake, we want n to be as close to k as possible,
while meeting the requirement of high reliability. The ratio k/n is termed the “rate” of
communication. The length-n binary sequence transmitted is termed the “codeword.”
This “bit flipping” channel can be used, e.g., to model data storage errors in a computer
memory. A graphical representation of the BSC is depicted in Fig. 1.2.
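A BSC is straightforward to simulate; the sketch below (an illustrative addition, not from the text) flips each transmitted bit independently with probability p and reports the capacity CBSC = 1 − HB(p) given in the caption of Fig. 1.2 and derived later in this section.

```python
# Illustrative sketch: simulating a BSC(p) and computing C_BSC = 1 - H_B(p).
import random
from math import log2

def bsc(bits, p, rng):
    """Flip each input bit independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def capacity_bsc(p):
    h_b = 0.0 if p in (0.0, 1.0) else -(1 - p) * log2(1 - p) - p * log2(p)
    return 1 - h_b

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(100_000)]
y = bsc(x, 0.11, rng)
flip_rate = sum(a != b for a, b in zip(x, y)) / len(x)
print(f"empirical flip rate = {flip_rate:.3f}, C_BSC = {capacity_bsc(0.11):.3f} bits/use")
```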
receiver will be less likely to make an error when deciding on the transmitted codeword.
Once such a codeword estimate has been made it can then be mapped back to the length-
k information bit sequence. A natural decoding rule, in fact the maximum-likelihood rule,
is for the decoder to pick the codeword closest to Y n in terms of Hamming distance.
The design of the codebook (analogous to the choice of grammatically correct – and
thus allowable – sentences in a spoken language) is a type of probabilistic packing prob-
lem. The question is, how do we select the set of codewords so that the probability of a
decoding error is small? We can develop a simple upper bound on how large the set of
reliably decodable codewords can be. There are 2n possible binary output sequences. For
any codeword selected there are roughly 2nHB (p) typical output sequences, each associ-
ated with a typical noise sequence, that form a noise ball centered on the codeword. If we
were simply able to divide up the output space into disjoint sets of cardinality 2nHB (p) , we
would end up with 2n /2nHB (p) = 2n(1−HB (p)) distinct sets. This sphere-packing argument
tells us that the best we could hope to do would be to transmit this number of distinct
codewords reliably. Thus, the number of information bits k would equal n(1 − HB (p)).
Once we normalize by the number n of channels uses we get a transmission rate of
1 − HB (p).
Perhaps quite surprisingly, as n gets large, 1 − HB (p) is the supremum of achievable
rates at (arbitrarily) high reliability. This is the Shannon capacity CBSC = 1 − HB (p). The
result follows from the law of large numbers, which can be used to show that the typical
noise balls concentrate. Shannon’s proof that one can actually find a configuration of
codewords while keeping the probability of decoding error small was an early use of
the probabilistic method. For any rate R = CBSC − δ, where δ > 0 is arbitrarily small, a
randomized choice of the positioning of each codeword will with high probability, yield
a code with a small probability of decoding error. To see the plausibility of this statement
we revisit the sphere-packing argument. At rate R = CBSC − δ the 2nR codewords are each
associated with a typical noise ball of 2nHB (p) sequences. If the noise balls were all (in the
worst case) disjoint, this would be a total of $2^{nR}\cdot 2^{nH_B(p)} = 2^{n(1-H_B(p)-\delta)+nH_B(p)} = 2^{n(1-\delta)}$
sequences. As there are 2n binary sequences, the fraction of the output space taken up
by the union of typical noise spheres associated with the codewords is 2n(1−δ) /2n = 2−nδ .
So, for any δ > 0 fixed, as the blocklength n → ∞, only an exponentially disappearing
fraction of the output space is taken up by the noise balls. By choosing the codewords
independently at random, each uniformly chosen over all length-n binary sequences, one
can show that the expected (over the choice of codewords and channel noise realization)
average probability of error is small. Hence, at least one codebook exists that performs
at least as well as this expectation.
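The sketch below (an illustrative addition with deliberately small parameters; Shannon's guarantees are asymptotic in n) mimics this random-coding argument: it draws a random codebook, transmits a codeword over a BSC(p), decodes to the closest codeword in Hamming distance, and compares a rate below capacity with one above it.

```python
# Illustrative sketch of random coding over a BSC with minimum-Hamming-distance
# (maximum-likelihood, for p < 0.5) decoding. The blocklength is kept small so the
# example runs quickly; the sharp behavior in the text emerges only as n grows.
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def block_error_rate(n, k, p, trials, rng):
    codebook = [[rng.randint(0, 1) for _ in range(n)] for _ in range(2 ** k)]
    errors = 0
    for _ in range(trials):
        sent = rng.randrange(len(codebook))
        received = [b ^ (rng.random() < p) for b in codebook[sent]]
        decoded = min(range(len(codebook)),
                      key=lambda i: hamming(codebook[i], received))
        errors += (decoded != sent)
    return errors / trials

rng = random.Random(1)
p = 0.11                                  # C_BSC = 1 - H_B(0.11) is roughly 0.5
for k in (4, 12):                         # rates k/16 = 0.25 (below C) and 0.75 (above C)
    print(f"R = {k / 16}: empirical block error rate = "
          f"{block_error_rate(16, k, p, 100, rng):.2f}")
```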
While Shannon showed the existence of such a code (actually a sequence of codes
as n → ∞), it took another half-century for researchers in error-correction coding to
find asymptotically optimal code designs and associated decoding algorithms that were
computationally tractable and therefore implementable in practice. We discuss this
computational problem and some of these recent code designs in Section 1.4.
While the above example is set in the context of a binary-input and binary-output
channel model, the result is a prototype of the result that holds for discrete memoryless
channels. A discrete memoryless channel is described by the conditional distribution
pY|X(·|·) relating the channel output to the channel input. The Shannon capacity of such a channel is
$$C = \max_{p_X}\big[H(Y) - H(Y|X)\big] = \max_{p_X} I(X;Y). \qquad (1.9)$$
In (1.9), H(Y) is the entropy of the output space, induced by the choice of input dis-
tribution pX via $p_Y(y) = \sum_{x\in\mathcal{X}} p_X(x)\,p_{Y|X}(y|x)$, and H(Y|X) is the conditional entropy of
pX (·)pY|X (·|·). For the BSC the optimal choice of pX (·) is uniform. We shortly develop an
operational intuition for this choice by connecting it to hypothesis testing. We note that
this choice induces the uniform distribution on Y. Since |Y| = 2, this means that H(Y) = 1.
Further, plugging the channel law of the BSC into (1.7) yields H(Y|X) = HB (p). Putting
the pieces together recovers the Shannon capacity result for the BSC, CBSC = 1 − HB (p).
In (1.9), we introduce the equality H(Y) − H(Y|X) = I(X; Y), where I(X; Y), denotes
the mutual information of the joint distribution pX (·)pY|X (·|·). The mutual information is
another name for the Kullback–Leibler (KL) divergence between the joint distribution
pX (·)pY|X (·|·) and the product of the joint distribution’s marginals, pX (·)pY (·). The gen-
eral formula for the KL divergence between a pair of distributions pU and pV defined
over a common alphabet A is
$$D(p_U \| p_V) = \sum_{a\in\mathcal{A}} p_U(a)\log_2\frac{p_U(a)}{p_V(a)}. \qquad (1.10)$$
In the definition of mutual information over A = X × Y, pX,Y (·, ·) plays the role of pU (·)
and pX (·)pY (·) plays the role of pV (·).
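The sketch below (added for illustration) specializes these definitions to the BSC with a uniform input: it computes I(X; Y) both as H(Y) − H(Y|X) and as the KL divergence (1.10) between pXY and pX pY, and checks that both equal the capacity 1 − HB(p).

```python
# Illustrative sketch: mutual information of a BSC(p) with a uniform input,
# computed as H(Y) - H(Y|X) and as the KL divergence (1.10) D(p_XY || p_X p_Y).
from math import log2

p = 0.11
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {(x, y): (1 - p) if x == y else p for x in (0, 1) for y in (0, 1)}
p_xy = {(x, y): p_x[x] * p_y_given_x[(x, y)] for (x, y) in p_y_given_x}
p_y = {y: sum(p_xy[(x, y)] for x in (0, 1)) for y in (0, 1)}

kl = sum(q * log2(q / (p_x[x] * p_y[y])) for (x, y), q in p_xy.items())
h_y = -sum(q * log2(q) for q in p_y.values())
h_y_given_x = -sum(q * log2(p_y_given_x[(x, y)]) for (x, y), q in p_xy.items())
h_b = -p * log2(p) - (1 - p) * log2(1 - p)

print(f"I(X;Y) as KL divergence : {kl:.4f}")
print(f"H(Y) - H(Y|X)           : {h_y - h_y_given_x:.4f}")
print(f"capacity 1 - H_B(p)     : {1 - h_b:.4f}")
```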
The KL divergence arises in hypothesis testing, where it is used to quantify the error
exponent of a binary hypothesis test. Conceiving of channel decoding as a hypothesis-
testing problem – which one of the codewords was transmitted? – helps us understand
why (1.9) is the formula for the Shannon capacity. One way the decoder can make its
decision regarding the identity of the true codeword is to test each codeword against
independence. In other words, does the empirical joint distribution of any particular
codeword X n and the received data sequence Y n look jointly distributed according to the
channel law or does it look independent? That is, does (X n , Y n ) look like it is distributed
i.i.d. according to pXY (·, ·) = pX (·)pY|X (·|·) or i.i.d. according to pX (·)pY (·)? The exponent
of the error in this test is −D(pXY ‖ pX pY ) = −I(X; Y). Picking the input distribution pX to
maximize (1.9) maximizes this exponent. Finally, via an application of the union bound
we can assert that, roughly, 2nI(X;Y) codewords can be allowed before more than one
codeword in the codebook appear to be jointly distributed with the observation vector
Y n according to pXY .
single-letterize the bound. To single-letterize means to express the final bound in terms
of only the pX (·)pY|X (·|·) distribution, rather than in terms of the joint distribution of
the length-n input and output sequences. This is an important step because n is allowed
to grow without bound. By single-letterizing we express the bound in terms of a fixed
distribution, thereby making the bound computable.
As at the end of the discussion of source coding, in channel coding we find a boundary
between two regimes of operation: the regime of efficient and reliable data transmission,
and the regime wherein such reliable transmission is impossible. In this instance, it is
the Shannon capacity C that demarks the phase-transition between these two regimes of
operation.
In the previous sections, we discussed the sharp phase transitions in both source and
channel coding discovered by Shannon. These phase transitions delineate fundamental
boundaries between what is possible and what is not. In practice, one desires schemes
that “saturate” these bounds. In the case of source coding, we can saturate the bound if
we can design source coding techniques with rates that can be made arbitrarily close to
H(X) (from above). For channel coding we desire coding methods with rates that can be
made arbitrarily close to C (from below). While Shannon discovered and quantified the
bounds, he did not specify realizable schemes that attained them.
Decades of effort have gone into developing methods of source and channel coding.
For lossless compression of memoryless sources, as in our motivating examples, good
approaches such as Huffman and arithmetic coding were found rather quickly. On the
other hand, finding computationally tractable and therefore implementable schemes of
error-correction coding that got close to capacity took much longer. For a long time it
was not even clear that computationally tractable techniques of error correction that sat-
urated Shannon’s bounds were even possible. For many years researchers thought that
there might be a second phase transition at the cutoff rate, only below which compu-
tationally tractable methods of reliable data transmission existed. (See [10] for a nice
discussion.) Indeed, only with the emergence of modern coding theory in the 1990s and
2000s that studies turbo, low-density parity-check (LDPC), spatially coupled LDPC, and
Polar codes has the research community, even for the BSC, developed computationally
tractable methods of error correction that closed the gap to Shannon’s bound.
In this section, we introduce the reader to linear codes. Almost all codes in use have
linear structure, structure that can be exploited to reduce the complexity of the decoding
process. As in the previous sections we only scratch the surface of the discipline of
error-correction coding. We point the reader to the many excellent texts on the subject,
e.g., [6, 11–15].
$$x = G^T b, \qquad (1.11)$$
s = Hy. (1.12)
We caution the reader not to confuse the parity-check matrix H with the entropy function
H(·). By design, the rows of H are all orthogonal to the rows of G and thus span the
null-space of G.3 When the columns of G are linearly independent, the dimension of the
null-space of G is n − k and the relation m = n − k holds.
Substituting in the definition for x into the expression for y and thence into (1.12), we
compute
$$s = Hy = H\big(G^T b + e\big) = HG^T b + He = He, \qquad (1.13)$$
where the last step follows because the rows of G and H are orthogonal by design so that
$HG^T = 0$, the m × k all-zeros matrix. Inspecting (1.13), we observe that the computation
of the syndrome s yields m linear constraints on the noise vector e.
Since e is of length n and m = n − k, (1.13) specifies an under-determined set of lin-
ear equations in F2 . However, as already discussed, while e could be any vector, when
the blocklength n becomes large, concentration of measure comes into play. With high
probability the realization of e will concentrate around those sequences that contain only
np non-zero elements. We recall that p ∈ [0, 1] is the bit-flip probability and note that
in F2 any non-zero element must be a one. In coding theory, we are therefore faced
with the problem of solving an under-determined set of linear equations subject to a
sparsity constraint: There are only about np non-zero elements in the solution vector.
Consider p ≤ 0.5. Then, as error vectors e containing fewer bit flips are more likely,
the maximum-likelihood solution for the noise vector e is to find the maximally sparse
vector that satisfies the syndrome constraints, i.e.,
$$\hat{e} = \arg\min_{e\in\mathbb{F}_2^n} d_H(e) \quad \text{such that} \quad s = He, \qquad (1.14)$$
where dH (·) is the Hamming weight (or distance from 0n ) of the argument. As mentioned
before, the Hamming weight is the number of non-zero entries of e. It plays a role
3 Note that in finite fields vectors can be self-orthogonal; e.g., in F2 any even-weight vector is orthogonal to
itself.
4 We comment that this same syndrome decoding can also be used to provide a solution to the near-lossless
source coding problem of Section 1.2. One pre-multiplies the source sequence by the parity-check matrix
H, and stores the syndrome of the source sequence. For a biased binary source, one can solve (1.14) to
recover the source sequence with high probability. This approach does not feature prominently in source
coding, with the exception of distributed source coding, where it plays a prominent role. See [7, 9] for
further discussion.
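As a concrete, small-scale illustration of syndrome decoding (1.14), the sketch below (an addition, not from the text) uses the [7,4] Hamming code, chosen here only as an example of a linear code whose matrices satisfy HGᵀ = 0 over F2: it encodes four information bits, flips one code bit, computes the syndrome, and recovers the sparsest consistent error pattern by brute-force search.

```python
# Illustrative sketch of syndrome decoding (1.14) for the [7,4] Hamming code.
# All arithmetic is over F_2; the code is used purely as a small worked example.
import itertools

# Generator matrix G (k x n) and parity-check matrix H (m x n), with m = n - k = 3,
# in a systematic form satisfying H G^T = 0 over F_2.
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

def mat_vec(M, v):                        # matrix-vector product over F_2
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def encode(b):                            # x = G^T b, i.e. x_j = sum_i b_i G[i][j]
    return [sum(b[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def syndrome_decode(s):
    """Return the sparsest error pattern e with He = s, cf. (1.14)."""
    for weight in range(8):               # search in order of increasing Hamming weight
        for support in itertools.combinations(range(7), weight):
            e = [1 if j in support else 0 for j in range(7)]
            if mat_vec(H, e) == s:
                return e
    return None

b = [1, 0, 1, 1]                          # information bits
x = encode(b)                             # transmitted codeword
e = [0, 0, 0, 0, 0, 1, 0]                 # a single bit flip introduced by the channel
y = [(xi + ei) % 2 for xi, ei in zip(x, e)]
s = mat_vec(H, y)                         # s = Hy = He
e_hat = syndrome_decode(s)
x_hat = [(yi + ei) % 2 for yi, ei in zip(y, e_hat)]
print("syndrome:", s, " estimated error:", e_hat, " codeword recovered:", x == x_hat)
```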
Data science – a loosely defined concept meant to bring together various problems stud-
ied in statistics, machine learning, signal processing, harmonic analysis, and computer
science under a unified umbrella – involves numerous other challenges that go beyond
the traditional source coding and channel coding problems arising in communication
or storage systems. These challenges are associated with the need to acquire, represent,
and analyze information buried in data in a reliable and computationally efficient manner
in the presence of a variety of constraints such as security, privacy, fairness, hardware
resources, power, noise, and many more.
Figure 1.3 presents a typical data-science pipeline, encompassing functions such as
data acquisition, data representation, and data analysis, whose overarching purpose is to
turn data captured from the physical world into insights for decision-making. It is also
common to consider various other functions within a data-science “system” such as data
preparation, data exploration, and more. We restrict ourselves to this simplified version
because it serves to illustrate how information theory is helping shape data science. The
goals of the different blocks of the data-science pipeline in Fig. 1.3 are as follows.
• The data-acquisition block is often concerned with the act of turning physical-
world continuous-time analog signals into discrete-time digital signals for further
digital processing.
• The data-representation block concentrates on the extraction of relevant attributes
from the acquired data for further analysis.
• The data-analysis block concentrates on the extraction of meaningful actionable
information from the data features for decision-making.
From the description of these goals, one might think that information theory – a field
that arose out of the need to study communication systems in a principled manner – has
little to offer to the principles of data acquisition, representation, analysis, or processing.
But it turns out that information theory has been advancing our understanding of data
science in three major ways.
• First, information theory has been leading to new system architectures for the
different elements of the data-science pipeline. Representative examples associ-
ated with new architectures for data acquisition are overviewed in Section 1.6.
• Second, information-theoretic methods can be used to unveil fundamental
operational limits in various data-science tasks, including in data acquisition,
representation, analysis, and processing. Examples are overviewed in Sections
1.6–1.8.
• Third, information-theoretic measures can be used as the basis for developing
algorithms for various data-science tasks. We allude to some examples in Sections
1.7 and 1.8.
In fact, the questions one can potentially ask about the data-science pipeline depicted
in Fig. 1.3 exhibit many parallels to the questions one asks about the communications
Figure 1.3 A simplified data-science pipeline encompassing functions such as data acquisition,
data representation, and data analysis and processing.
system architecture shown in Fig. 1.1. Specifically, what are the trade-offs in perfor-
mance incurred by adopting this data-science architecture? In particular, are there other
systems that do not involve the separation of the different data-science elements and
exhibit better performance? Are there fundamental limits on what is possible in data
acquisition, representation, analysis, and processing? Are there computationally feasi-
ble algorithms for data acquisition, representation, analysis, and processing that attain
such limits?
There has been progress in data science in all three of these directions. As a
concrete example that showcases many similarities between the data-compression
and – communication problems and data-science problems, information-theoretic meth-
ods have been providing insight into the various operational regimes associated with
the following different data-science tasks: (1) the regime where there is no algorithm –
regardless of its complexity – that can perform the desired task subject to some accu-
racy; this “regime of impossibility” in data science has the flavor of converse results in
source coding and channel coding in information theory; (2) the regime where there are
algorithms, potentially very complex and computationally infeasible, that can perform
the desired task subject to some accuracy; this “regime of possibility” is akin to the
initial discussion of linear codes and the Elias and Gallager ensembles in channel cod-
ing; and (3) the regime where there are computationally feasible algorithms to perform
the desired task subject to some accuracy; this “regime of computational feasibility” in
data science has many characteristics that parallel those in design of computationally
tractable source and channel coding schemes in information theory.
Interestingly, in the same way that the classical information-theoretic problems of
source coding and channel coding exhibit phase transitions, many data-science prob-
lems have also been shown to exhibit sharp phase transitions in the high-dimensional
setting, where the number of data samples and the dimension of the data approach
infinity. Such phase transitions are typically expressed as a function of various param-
eters associated with the data-science problem. The resulting information-theoretic
limit/threshold/barrier (a.k.a. statistical phase transition) partitions the problem param-
eter space into two regions [23–25]: one defining problem instances that are impossible
to solve and another defining problem instances that can be solved (perhaps only
with a brute-force algorithm). In turn, the computational limit/threshold/barrier (a.k.a.
computational phase transition) partitions the problem parameter space into a region
associated with problem instances that are easy to solve and another region associated
with instances that are hard to solve [26, 27].
There can, however, be differences in how one establishes converse and achievability
results – and therefore phase transitions – in classical information-theoretic problems
and data-science ones. Converse results in data science can often be established using
Fano’s inequality or variations on it (see also Chapter 16). In contrast, achievability
results often cannot rely on classical techniques, such as the probabilistic method,
necessitating instead the direct analysis of the algorithms. Chapter 13 elaborates on
some emerging tools that may be used to establish statistical and computational limits
in data-science problems.
Data acquisition is a critical element of the data-science architecture shown in Fig. 1.3.
It often involves the conversion of a continuous-time analog signal into a discrete-time
digital signal that can be further processed in digital signal-processing pipelines.5
Conversion of a continuous-time analog signal x(t) into a discrete-time digital rep-
resentation typically entails two operations. The first operation – known as sampling –
involves recording the values of the original signal x(t) at particular instants of time. The
simplest form of sampling is direct uniform sampling in which the signal is recorded at
uniform sampling times x(kT s ) = x(k/Fs ), where T s denotes the sampling period (in
seconds), Fs denotes the sampling frequency (in Hertz), and k is an integer. Another
popular form of sampling is generalized shift-invariant sampling in which x(t) is first
filtered by a linear time-invariant (LTI) filter, or a bank of LTI filters, and only then sam-
pled uniformly [28]. Other forms of generalized and non-uniform sampling have also
been studied. Surprisingly, under certain conditions, the sampling process can be shown
to be lossless: for example, the classical sampling theorem for bandlimited processes
asserts that it is possible to perfectly recover the original signal from its uniform sam-
ples provided that the sampling frequency Fs is at least twice the signal bandwidth B.
This minimal sampling frequency FNQ = 2B is referred to as the Nyquist rate [28].
The second operation – known as quantization – involves mapping the continuous-
valued signal samples onto discrete-valued ones. The levels are taken from a finite set
of levels that can be represented using a finite sequence of bits. In (optimal) vector
quantization approaches, a series of signal samples are converted simultaneously to a
bit sequence, whereas in (sub-optimal) scalar quantization, each individual sample is
mapped to bits. The quantization process is inherently lossy since it is impossible to
accurately represent real-valued samples using a finite set of bits. Rate-distortion the-
ory establishes a trade-off between the average number of bits used to encode each
signal sample – referred to as the rate – and the average distortion incurred in the
reconstruction of each signal sample – referred to simply as the distortion – via two func-
tions. The rate-distortion function R(D) specifies the smallest number of bits required
on average per sample when one wishes to represent each sample with average distor-
tion less than D, whereas the distortion-rate function D(R) specifies the lowest average
distortion achieved per sample when one wishes to represent on average each sample
with R bits [7]. A popular measure of distortion in the recovery of the original signal
5 Note that it is also possible that the data are already presented in an inherently digital format; Chapters 3,
4, and 6 deal with such scenarios.
samples from the quantized ones is the mean-squared error (MSE). Note that this class of
problems – known as lossy source coding – is the counterpart of the lossless source
coding problems discussed earlier.
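As a point of reference, consider a standard textbook example not derived in this chapter: for an i.i.d. zero-mean Gaussian source with variance σ² under MSE distortion, the rate-distortion function is R(D) = ½ log2(σ²/D) for 0 < D ≤ σ², equivalently D(R) = σ² 2^(−2R). The sketch below simply evaluates this trade-off.

```python
# Illustrative sketch: the Gaussian rate-distortion trade-off under MSE distortion
# (a standard closed-form example; the source in the text need not be Gaussian).
from math import log2

def rate_distortion_gaussian(D, var):
    """R(D) = 0.5 * log2(var / D) for 0 < D <= var, and 0 otherwise."""
    return 0.0 if D >= var else 0.5 * log2(var / D)

def distortion_rate_gaussian(R, var):
    """D(R) = var * 2^(-2R)."""
    return var * 2 ** (-2 * R)

var = 1.0
for R in (0.5, 1, 2, 4):
    D = distortion_rate_gaussian(R, var)
    print(f"R = {R} bits/sample -> MSE distortion D = {D:.4f}"
          f" (consistency check: R(D) = {rate_distortion_gaussian(D, var):.2f})")
```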
The motivation for this widely used data-acquisition architecture, involving (1) a
sampling operation at or just above the Nyquist rate and (2) scalar or vector quanti-
zation operations, is its simplicity that leads to a practical implementation. However,
it is well known that the separation of the sampling and quantization operations is not
necessarily optimal. Indeed, the optimal strategy that attains Shannon’s distortion-rate
function associated with arbitrary continuous-time random signals with known statis-
tics involves a general mapping from continuous-time signal space to a sequence of
bits that does not consider any practical constraints in its implementation [1, 3, 29].
Therefore, recent years have witnessed various generalizations of this data-acquisition
paradigm informed by the principles of information theory, on the one hand, and guided
by practical implementations, on the other.
One recent extension considers a data-acquisition paradigm that illuminates the
dependence between these two operations [30–32]. In particular, given a total rate bud-
get, Kipnis et al. [30–32] draw on information-theoretic methods to study the lowest
sampling rate required to sample a signal such that the reconstruction of the signal
from the bit-constrained representation of its samples results in minimal distortion. The
sampling operation consists of an LTI filter, or bank of filters, followed by pointwise
sampling of the outputs of the filters. The authors also show that, without assuming
any particular structure on the input analog signal, this sampling rate is often below the
signal’s Nyquist rate. That is, because the quantization operation itself introduces loss, it is in general no longer necessary to sample the signal at the Nyquist rate.
As an example, consider the case where x(t) is a stationary random process bandlimited to B with a triangular power spectral density (PSD) given formally by
S_X(f) = \frac{\sigma_x^2}{B}\,[1 - |f/B|]_+ ,    (1.15)
with [a]_+ = max(a, 0). In this case, the Nyquist sampling rate is 2B. However, when quantization is taken into account, the sampling rate can be lowered without introducing further distortion. Specifically, assuming a bitrate leading to distortion D, the minimal sampling rate is shown to be equal to [32]
f_R = 2B\,\sqrt{1 - D/\sigma_x^2}.    (1.16)
Thus, as the distortion grows, the minimal sampling rate is reduced. When we do not allow any distortion, namely when no quantization takes place, D = 0 and f_R = 2B, so that Nyquist-rate sampling is required.
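The following sketch numerically checks this example: it evaluates the reverse water-filling distortion for the triangular PSD (1.15) at a few water levels and compares the measure of the occupied band with the closed-form expression (1.16). The bandwidth and variance values are illustrative.

```python
import numpy as np

B, sigma2 = 1.0, 1.0                                          # bandwidth and variance (illustrative)
f = np.linspace(-B, B, 200_001)
df = f[1] - f[0]
S = (sigma2 / B) * np.clip(1.0 - np.abs(f) / B, 0.0, None)    # triangular PSD (1.15)

# Reverse water-filling: at water level theta, the distortion is the integral of
# min{S(f), theta}, and the band that must be sampled is {f : S(f) > theta}.
for theta in [0.05, 0.2, 0.5]:
    D = np.sum(np.minimum(S, theta)) * df
    f_R_waterfilling = np.sum(S > theta) * df                  # measure of {f : S(f) > theta}
    f_R_closed_form = 2 * B * np.sqrt(1.0 - D / sigma2)        # expression (1.16)
    print(f"theta = {theta:.2f}: D = {D:.3f}, "
          f"f_R (water-filling) = {f_R_waterfilling:.3f}, f_R (1.16) = {f_R_closed_form:.3f}")
```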
Such results show how information-theoretic methods are leading to new insights
about the interplay between sampling and quantization. In particular, these new results
can be seen as an extension of the classical sampling theorem applicable to bandlimited
random processes in the sense that they describe the minimal amount of excess distortion
in the reconstruction of a signal due to lossy compression of its samples, leading to
the minimal sampling frequency required to achieve this distortion.6 In general, this
sampling frequency is below the Nyquist rate. Chapter 2 surveys some of these recent
results in data acquisition.
Another generalization of the classical data-acquisition paradigm considers scenar-
ios where the end goal is not to reconstruct the original analog signal x(t) but rather to
perform some other operation on it [33]. For example, in the context of parameter esti-
mation, Rodrigues et al. [33] show that the number of bits per sample required to achieve
a certain distortion in such task-oriented data acquisition can be much lower than that
required for task-ignorant data acquisition. More recently, Shlezinger et al. [34, 35]
study task-oriented hardware-efficient data-acquisition systems, where optimal vector
quantizers are replaced by practical ones. Even though the optimal rate-distortion curve
cannot be achieved by replacing optimal vector quantizers by simple serial scalar ones,
it is shown in [34, 35] that one can get close to the minimal distortion in settings where
the information of interest is not the signal itself, but rather a low-dimensional param-
eter vector embedded in the signal. A practical application of this setting is in massive
multiple-input multiple-output (MIMO) systems where there is a strong need to utilize
simple low-resolution quantizers due to power and memory constraints. In this context, it
is possible to design a simple quantization system, consisting of scalar uniform quantiz-
ers and linear pre- and post-processing, leading to minimal channel estimation distortion.
These recent results also showcase how information-theoretic methods can provide
insight into the interplay between data acquisition, representation, and analysis, in
the sense that knowledge of the data-analysis goal can influence the data-acquisition
process. These results therefore also suggest new architectures for the conventional data-
science pipeline that do not involve a strict separation between the data-acquisition,
data-representation, and data-analysis and processing blocks.
Beyond this data-acquisition paradigm involving the conversion of continuous-time
signals to digital ones, recent years have also witnessed the emergence of various
other data-acquisition approaches. Chapters 3, 4, and 6 cover further data-acquisition
strategies that are also benefiting from information-theoretic methods.
The outputs of the data-acquisition block – often known as “raw” data – typically need
to be turned into “meaningful” representations – known as features – for further data
analysis. Note that the act of transforming raw data into features, where the number
of dimensions in the features is lower than that in the raw data, is also referred to as
dimensionality reduction.
Recent years have witnessed a shift from model-based data representations, rely-
ing on predetermined transforms – such as wavelets, curvelets, and shearlets – to
compute the features from raw data, to data-driven representations that leverage a
6 In fact, this theory can be used even when the input signal is not bandlimited.
7 In some representation learning problems, instead of using F(·) to obtain the inverse images of data
samples, G(·) is learned directly from training samples.
Y = AX + W    (1.19)
with A ∈ F ⊂ R^{m×k}, and the feature estimates given by X̂ = BY with B ∈ R^{k×m}. In
this setting, (F,G) = (A, B) and representation learning reduces to estimating the linear
operators A and/or B under various assumptions on F and the generative model.8 In the
case of PCA, for example, it is assumed that F is the Stiefel manifold in Rm and the
feature vector X is a random vector that has zero mean and uncorrelated entries. On
the other hand, ICA assumes X to have zero mean and independent entries. (The zero-
mean assumption in both PCA and ICA is for ease of analysis and can be easily removed
at the expense of extra notation.)
Information-theoretic frameworks have long been used to develop computational
approaches for estimating (A, B) in ICA and its variants; see, e.g., [49, 50, 54, 55].
Recent years have also seen the use of information-theoretic tools such as Fano’s
inequality to derive sharp bounds on the feasibility of linear representation learning.
One such result that pertains to PCA under the so-called spiked covariance model is
described next.
Suppose the training data samples are N independent realizations according to
(1.19), i.e.,
yi = Axi + wi , i = 1, . . . , N, (1.20)
where A^T A = I and both x^i and w^i are independent realizations of X and W that have zero mean and diagonal covariance matrices given by
E[XX^T] = diag(λ_1, . . . , λ_k),  with λ_1 ≥ · · · ≥ λ_k > 0,    (1.21)
and E[WW^T] = σ^2 I, respectively. Note that the ideal B in this PCA example is
given by B = A^T. It is then shown in [43, Theorem 5], using various analytical tools which include Fano’s inequality, that A can be reliably estimated from N training samples only if9
\frac{N\lambda_k^2}{k(m-k)(1+\lambda_k)} \to \infty.    (1.22)
This is the “converse” for the spiked covariance estimation problem.
The “achievability” result for this problem is also provided in [43]. Specifically, when
the condition given in (1.22) is satisfied, a practical algorithm exists that allows reliable
estimation of A [43]. This algorithm involves taking A to be the k eigenvectors corresponding to the k largest eigenvalues of the sample covariance (1/N)\sum_{i=1}^{N} y^i (y^i)^T of the
training data. We therefore have a sharp information-theoretic phase transition in this
problem, which is characterized by (1.22). Notice here, however, that while the converse
makes use of information-theoretic tools, the achievability result does not involve the
use of the probabilistic method; rather, it requires analysis of an explicit (deterministic)
algorithm.
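A minimal simulation of this achievability procedure is sketched below: data are drawn from the spiked model (1.20), A is estimated by the k leading eigenvectors of the sample covariance, and recovery quality is measured by the alignment between the estimated and true subspaces. The particular dimensions and spike strengths are illustrative, and the printed statistic is the quantity appearing in (1.22).

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, N = 100, 3, 2000                 # ambient dimension, spike dimension, samples (illustrative)
lambdas = np.array([5.0, 3.0, 2.0])    # spike strengths lambda_1 >= ... >= lambda_k
sigma2 = 1.0

# Orthonormal A (A^T A = I) via QR of a random matrix.
A, _ = np.linalg.qr(rng.normal(size=(m, k)))

# Training data y^i = A x^i + w^i, with E[XX^T] = diag(lambda) and E[WW^T] = sigma^2 I.
X = rng.normal(size=(N, k)) * np.sqrt(lambdas)
Y = X @ A.T + np.sqrt(sigma2) * rng.normal(size=(N, m))

# Estimate A by the k leading eigenvectors of the sample covariance (1/N) sum_i y^i (y^i)^T.
S = (Y.T @ Y) / N
eigvals, eigvecs = np.linalg.eigh(S)
A_hat = eigvecs[:, -k:]                # columns = top-k eigenvectors

# Subspace alignment in [0, 1]; values near 1 indicate reliable recovery of span(A).
alignment = np.linalg.norm(A_hat.T @ A, ord='fro') ** 2 / k
stat = N * lambdas[-1] ** 2 / (k * (m - k) * (1 + lambdas[-1]))
print(f"phase-transition statistic N*lam_k^2/(k(m-k)(1+lam_k)) = {stat:.2f}")
print(f"subspace alignment = {alignment:.3f}")
```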
The sharp transition highlighted by the aforementioned result can be interpreted in
various ways. One of the implications of this result is that it is impossible to reliably
estimate the PCA features when m > N and m, N → ∞. In such high-dimensional PCA
settings, it is now well understood that sparse PCA, in which the columns of A are
approximately “sparse,” is more appropriate for linear representation learning. We refer
the reader to works such as [43, 56, 57] that provide various information-theoretic limits
for the sparse PCA problem.
We conclude by noting that there has been some recent progress regarding bounds
on the computational feasibility of linear representation learning. For example, the
fact that there is a practical algorithm to learn a linear data representation in some
high-dimensional settings implies that computational barriers can almost coincide with
information-theoretic ones. It is important to emphasize, though, that recent work –
applicable to the detection of a subspace structure within a data matrix [25, 58–62] – has
revealed that classical computationally feasible algorithms such as PCA cannot always
approach the information-theoretic detection threshold [25, 61].
Popular nonlinear (manifold-learning) representation techniques include locally linear embedding [63], Isomap [64], kernel entropy component analysis
(ECA) [65], and nonlinear generalizations of linear techniques using the kernel trick
(e.g., kernel PCA [66], kernel ICA [67], and kernel LDA [68]). The use of information-
theoretic machinery in these methods has mostly been limited to formulations of the
algorithmic problems, as in kernel ECA and kernel ICA. While there exist some results
that characterize the regime in which manifold learning is impossible, such results
leverage the probabilistic method rather than more fundamental information-theoretic
tools [69].
Recent years have seen the data-science community widely embrace another non-
linear representation learning approach that assumes data lie near a union of subspaces
(UoS). This approach tends to have several advantages over manifold learning because of
the linearity of individual components (subspaces) in the representation learning model.
While there exist methods that learn the subspaces explicitly, one of the most popu-
lar classes of representation learning under the UoS model in which the subspaces are
implicitly learned is referred to as dictionary learning [38]. Formally, dictionary learning
assumes the data space to be Y = R^m, the feature space to be
X = {x ∈ R^p : ‖x‖_0 ≤ k},    (1.23)
and the data to be generated according to the linear model
Y = DX + W,    (1.24)
with the training samples given by
y^i = Dx^i + w^i,   i = 1, . . . , N.    (1.26)
Establishing the tightness of known information-theoretic lower bounds for dictionary learning, i.e., providing matching achievability results, remains an open problem. Computational limits are also in general open for dictionary learning.
Recent years have also seen extension of these results to the case of data that have a
multidimensional (tensor) structure [45]. We refer the reader to Chapter 5 in the book
for a more comprehensive review of dictionary-learning results pertaining to both vector
and tensor data.
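As a rough illustration of dictionary learning under the model (1.23)–(1.26), the sketch below alternates between a greedy (OMP-style) sparse-coding step and a least-squares dictionary update on synthetic data. It is a toy alternating-minimization scheme under illustrative problem sizes, not any of the specific algorithms analyzed in the literature cited above.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, k, N = 20, 40, 3, 1000            # illustrative sizes: data dim, atoms, sparsity, samples

# Ground-truth dictionary with unit-norm columns and k-sparse codes, as in (1.23)-(1.26).
D_true = rng.normal(size=(m, p))
D_true /= np.linalg.norm(D_true, axis=0)
X_true = np.zeros((p, N))
for i in range(N):
    supp = rng.choice(p, size=k, replace=False)
    X_true[supp, i] = rng.normal(size=k)
Y = D_true @ X_true + 0.01 * rng.normal(size=(m, N))

def sparse_code(D, y, k):
    """Greedy (OMP-style) k-sparse coding of a single sample y against dictionary D."""
    supp, r = [], y.copy()
    for _ in range(k):
        supp.append(int(np.argmax(np.abs(D.T @ r))))
        coef, *_ = np.linalg.lstsq(D[:, supp], y, rcond=None)
        r = y - D[:, supp] @ coef
    x = np.zeros(D.shape[1])
    x[supp] = coef
    return x

# Alternating minimization: sparse-code with D fixed, then update D by least squares.
D = rng.normal(size=(m, p))
D /= np.linalg.norm(D, axis=0)
for it in range(8):
    X = np.column_stack([sparse_code(D, Y[:, i], k) for i in range(N)])
    D_new, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    D = D_new.T
    D /= np.linalg.norm(D, axis=0) + 1e-12
    print(f"iteration {it}: relative fit error = "
          f"{np.linalg.norm(Y - D @ X) / np.linalg.norm(Y):.3f}")
```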
Linear representation learning, manifold learning, and dictionary learning are all
based on a geometric viewpoint of data. It is also possible to view these representation-
learning techniques from a purely numerical linear algebra perspective. Data represen-
tations in this case are referred to as matrix factorization-based representations. The
matrix factorization perspective of representation learning allows one to expand the
classes of learning techniques by borrowing from the rich literature on linear algebra.
Non-negative matrix factorization [77], for instance, allows one to represent data that
are inherently non-negative in terms of non-negative features that can be assigned phys-
ical meanings. We refer the reader to [78] for a more comprehensive overview of matrix
factorizations in data science; [79] also provides a recent information-theoretic analysis
of non-negative matrix factorization.
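For concreteness, the following is a minimal sketch of the classical multiplicative-update algorithm for non-negative matrix factorization [77], applied to synthetic non-negative data; the problem sizes and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 30, 200, 5                         # data dimension, samples, factorization rank (illustrative)
V = rng.random((m, r)) @ rng.random((r, n))  # non-negative data with (approximately) rank r

# Multiplicative updates of [77] for min ||V - W H||_F^2 subject to W, H >= 0.
W = rng.random((m, r))
H = rng.random((r, n))
eps = 1e-12
for it in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print("relative reconstruction error:",
      np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```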
In deep-neural-network-based representations, the features are obtained by composing a sequence of layers, each of the form z ↦ f_i(W_i z + b_i), where W_i ∈ R^{n_i × n_{i−1}} is a weight matrix, b_i ∈ R^{n_i} is a bias vector, f_i : R^{n_i} → R^{n_i} is a nonlinear operator such as a rectified linear unit (ReLU), and L corresponds to the number of layers in the deep neural network. The challenge then relates to how to learn the set
of weight matrices and bias vectors associated with the deep neural network. For exam-
ple, in classification problems where each data instance x is associated with a discrete label ℓ, one typically relies on a training set (y^i, ℓ^i), i = 1, . . . , N, to define a loss function
that can be used to tune the various parameters of the network using algorithms such as
gradient descent or stochastic gradient descent [81].
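The sketch below builds such a layered representation explicitly for a toy classification problem: a few ReLU layers of the form f_i(W_i z + b_i), followed by a linear layer and a cross-entropy loss over a random training set (y^i, ℓ^i). The layer widths and data are illustrative, and the gradient-based parameter updates mentioned above are omitted (in practice they would be computed by backpropagation).

```python
import numpy as np

rng = np.random.default_rng(4)
n0, n1, n2, C, N = 20, 64, 32, 3, 500        # layer widths, number of classes, samples (illustrative)

def relu(z):
    return np.maximum(z, 0.0)

# Layer parameters W_i in R^{n_i x n_{i-1}} and b_i in R^{n_i}.
params = [(rng.normal(scale=0.1, size=(n1, n0)), np.zeros(n1)),
          (rng.normal(scale=0.1, size=(n2, n1)), np.zeros(n2)),
          (rng.normal(scale=0.1, size=(C, n2)), np.zeros(C))]

def representation(y):
    """Compose the layers: each hidden layer maps z -> relu(W z + b)."""
    z = y
    for W, b in params[:-1]:
        z = relu(W @ z[..., :] + b[:, None] * 0 + (W @ z + b[:, None] - W @ z))  # see note below
    return z

# The broadcasting gymnastics above are unnecessary; a plain forward pass is clearer:
def forward(Y):
    z = Y
    for W, b in params[:-1]:
        z = relu(W @ z + b[:, None])
    W, b = params[-1]
    return W @ z + b[:, None]                # final linear layer produces class scores

# A toy training set (y^i, l^i) and the cross-entropy loss used to tune the parameters.
Y = rng.normal(size=(n0, N))
labels = rng.integers(0, C, size=N)
scores = forward(Y)                          # shape (C, N)
mx = scores.max(axis=0)
logp = scores - mx - np.log(np.sum(np.exp(scores - mx), axis=0))
loss = -np.mean(logp[labels, np.arange(N)])
print(f"cross-entropy loss at random initialization: {loss:.3f} (chance level ~ {np.log(C):.3f})")
```

(The helper `representation` above is redundant and kept only to mirror the prose; `forward` is the actual layered map.)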
This approach to data representation underlies some of the most spectacular advances
in areas such as computer vision, speech recognition, speech translation, natural lan-
guage processing, and many more, but this approach is also not fully understood.
However, information-theoretically oriented studies have also been recently conducted
to gain insight into the performance of deep neural networks by enabling the analysis
of the learning process or the design of new learning algorithms. For example, Tishby
et al. [82] propose an information-theoretic analysis of deep neural networks based on
the information bottleneck principle. They view the neural network learning process as
a trade-off between compression and prediction that leads to the extraction of a set of
minimal sufficient statistics from the data in relation to the target task. Shwartz-Ziv and
Tishby [83] – building upon the work in [82] – also propose an information-bottleneck-
based analysis of deep neural networks. In particular, they study information paths in the
so-called information plane capturing the evolution of a pair of items of mutual informa-
tion over the network during the training process: one relates to the mutual information
between the ith layer output and the target data label, and the other corresponds to the
mutual information between the ith layer output and the data itself. They also demon-
strate empirically that the widely used stochastic gradient descent algorithm undergoes
a “fitting” phase – where the mutual information between the data representations and
the target data label increases – and a “compression” phase – where the mutual infor-
mation between the data representations and the data decreases. See also related works
investigating the flow of information in deep networks [84–87].
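A minimal sketch of the kind of binning-based estimate used in these empirical information-plane studies is given below: the activations T of a random one-layer representation are discretized, and the plug-in mutual information with the label (and with the input, which reduces to H(T) for a deterministic layer) is computed. The toy data, random weights, and bin choices are illustrative assumptions and are not the exact protocol of [83].

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in estimate of I(a; b) in bits for two discrete sequences a, b."""
    joint, pa, pb, n = {}, {}, {}, len(a)
    for ai, bi in zip(a, b):
        joint[(ai, bi)] = joint.get((ai, bi), 0) + 1
    for (ai, bi), c in joint.items():
        pa[ai] = pa.get(ai, 0) + c
        pb[bi] = pb.get(bi, 0) + c
    return sum((c / n) * np.log2((c / n) / ((pa[ai] / n) * (pb[bi] / n)))
               for (ai, bi), c in joint.items())

rng = np.random.default_rng(5)
N, d = 2000, 10
X = rng.normal(size=(N, d))
labels = (X[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(int)   # toy binary target

# A random one-hidden-layer representation T = relu(XW); activations are discretized
# into a small number of bins, as in the empirical information-plane studies.
W = rng.normal(size=(d, 4))
T = np.maximum(X @ W, 0.0)
T_binned = [tuple(row) for row in np.digitize(T, bins=np.linspace(0, 3, 8))]

X_id = list(range(N))   # distinct inputs: I(X; T) reduces to H(T) for a deterministic layer
print(f"I(T; label) ~= {mutual_information(T_binned, labels.tolist()):.3f} bits")
print(f"I(X; T) ~= H(T) ~= {mutual_information(T_binned, X_id):.3f} bits")
```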
Achille and Soatto [88] also use an information-theoretic approach to understand
deep-neural-networks-based data representations. In particular, they show how deep
neural networks can lead to minimal sufficient representations with properties such as
invariance to nuisances, and provide bounds that connect the amount of information in
the weights and the amount of information in the activations to certain properties of
the activations such as invariance. They also show that a new information-bottleneck
Lagrangian involving the information between the weights of a network and the training
data can overcome various overfitting issues.
More recently, information-theoretic metrics have been used as a proxy to learn data
representations. In particular, Hjelm et al. [89] propose unsupervised learning of repre-
sentations by maximizing the mutual information between an input and the output of a
deep neural network.
In summary, this body of work suggests that information-theoretic quantities such as
mutual information can inform the analysis, design, and optimization of state-of-the-art
representation learning approaches. Chapter 11 covers some of these recent trends in
representation learning.
The outputs of the data-representation block – the features – are often the basis for fur-
ther data analysis or processing, encompassing both statistical inference and statistical
learning tasks such as estimation, regression, classification, clustering, and many more.
Statistical inference forms the core of classical statistical signal processing and statis-
tics. Broadly speaking, it involves use of explicit stochastic data models to understand
various aspects of data samples (features). These models can be parametric, defined as
those characterized by a finite number of parameters, or non-parametric, in which the
number of parameters continuously increases with the number of data samples. There is
a large portfolio of statistical inference tasks, but we limit our discussion to the problems
of model selection, hypothesis testing, estimation, and regression.
Briefly, model selection involves the use of data features/samples to select a stochastic data model from a set of candidate models. Hypothesis testing, on the other hand, is the task of determining which of a number of candidate hypotheses (models) is most consistent with the data features/samples.
Model Selection
On the algorithmic front, the problem of model selection has been largely impacted by
information-theoretic tools. Given a data set, which statistical model “best” describes
the data? A huge array of work, dating back to the 1970s, has tackled this question
using various information-theoretic principles. The Akaike information criterion (AIC)
for model selection [91], for instance, uses the KL divergence as the main tool for deriva-
tion of the final criterion. The minimum description length (MDL) principle for model
selection [92], on the other hand, makes a connection between source coding and model
selection and seeks a model that best compresses the data. The AIC and MDL prin-
ciples are just two of a number of information-theoretically inspired model-selection
approaches; we refer the interested reader to Chapter 12 for further discussion.
11 The assumption is that the raw data have been transformed into their features, which correspond to X.
results are for the generalized linear model (GLM), where the realizations of (Y, X, W) are given by
y_i = x_i^T β + w_i  ⟹  y = Xβ + w,    (1.29)
and provides matching minimax lower and upper bounds (i.e., the optimal minimax rate) both for the estimation error, ‖β̂ − β‖_2^2, and for the prediction error, (1/n)‖X(β̂ − β)‖_2^2.
In particular, it is established that, under suitable assumptions on X, it is possible to achieve estimation and prediction errors in GLMs that scale as R_q (log p/N)^{1−q/2}. The corresponding result for exact sparsity can be derived by setting q = 0 and R_q = s. Further, there exist no algorithms, regardless of their computational complexity, that can achieve errors smaller than this rate for every β in an ℓ_q ball. As one might expect,
Fano’s inequality is the central tool used by Raskutti et al. [95] to derive this lower
bound (the “converse”). The achievability result requires direct analysis of algorithms,
as opposed to use of the probabilistic method in classical information theory. Since both
the converse and the achievability bounds coincide in regression and estimation under
the GLM, we end up with a sharp statistical phase transition. Chapters 6, 7, 8, and
16 elaborate further on various other recovery and estimation problems arising in data
science, along with key tools that can be used to gain insight into such problems.
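The following sketch, under illustrative choices (i.i.d. Gaussian design, exact s-sparsity, and a Lasso regularization level of order σ√(log p / N)), compares the empirical Lasso estimation error with the s log p/N scaling discussed above; the comparison is meaningful only up to constants.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
p, s, sigma = 400, 5, 0.5                      # ambient dimension, sparsity, noise level (illustrative)
beta = np.zeros(p)
beta[:s] = 1.0

for N in [100, 200, 400, 800]:
    X = rng.normal(size=(N, p))                # i.i.d. Gaussian design
    y = X @ beta + sigma * rng.normal(size=N)
    # Regularization of order sigma * sqrt(log p / N); the constant 2 is an illustrative choice.
    lam = 2 * sigma * np.sqrt(np.log(p) / N)
    beta_hat = Lasso(alpha=lam, max_iter=10_000).fit(X, y).coef_
    err = np.sum((beta_hat - beta) ** 2)
    rate = s * np.log(p) / N
    print(f"N = {N:4d}: ||beta_hat - beta||^2 = {err:.4f},  s*log(p)/N = {rate:.4f}")
```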
Additional information-theoretic results are known for the standard linear model –
where Y = \sqrt{s}\,Xβ + W, with Y ∈ R^n, X ∈ R^{n×p}, β ∈ R^p, W ∈ R^n ∼ N(0, I), and s a scaling
factor representing a signal-to-noise ratio. In particular, subject to mild conditions on the
distribution of the parameter vector, it has been established that the mutual information
and the minimum mean-squared error obey the so-called I-MMSE relationship given
by [96]:
\frac{d\, I\big(β; \sqrt{s}\,Xβ + W\big)}{ds} = \frac{1}{2}\, \mathrm{mmse}\big(Xβ \mid \sqrt{s}\,Xβ + W\big),    (1.31)
where I(β; \sqrt{s}\,Xβ + W) corresponds to the mutual information between the standard linear model input and output, and
\mathrm{mmse}\big(Xβ \mid \sqrt{s}\,Xβ + W\big) = \mathbb{E}\Big[\big\| Xβ - \mathbb{E}\big[Xβ \mid \sqrt{s}\,Xβ + W\big] \big\|_2^2\Big]    (1.32)
is the minimum mean-squared error associated with the estimation of Xβ given \sqrt{s}\,Xβ +
W. Other relations involving information-theoretic quantities, such as mutual infor-
mation, and estimation-theoretic ones have also been established in a wide variety of
settings in recent years, such as Poisson models [97]. These relations have been shown to
have important implications in classical information-theoretic problems – notably in the
analysis and design of communications systems (e.g., [98–101]) – and, more recently, in
data-science ones. In particular, Chapter 7 elaborates further on how the I-MMSE rela-
tionship can be used to gain insight into modern high-dimensional inference problems.
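The scalar special case makes the I-MMSE relation easy to check numerically: for X ∼ N(0, 1) and Y = √s X + W with W ∼ N(0, 1), both I(s) = ½ ln(1 + s) and mmse(s) = 1/(1 + s) are available in closed form, and (1.31) reduces to dI/ds = ½ mmse(s). The sketch below verifies this with a finite-difference derivative.

```python
import numpy as np

# Scalar Gaussian special case of the I-MMSE relation: X ~ N(0,1), Y = sqrt(s) X + W, W ~ N(0,1).
def I(s):
    return 0.5 * np.log(1.0 + s)      # mutual information in nats

def mmse(s):
    return 1.0 / (1.0 + s)            # minimum mean-squared error

for s in [0.1, 1.0, 5.0, 20.0]:
    ds = 1e-6
    dI_ds = (I(s + ds) - I(s - ds)) / (2 * ds)   # numerical derivative of the mutual information
    print(f"s = {s:5.1f}: dI/ds = {dI_ds:.6f},  0.5 * mmse = {0.5 * mmse(s):.6f}")
```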
Hypothesis Testing
Information-theoretic tools have also been advancing our understanding of hypothesis-
testing problems (one of the most widely used statistical inference techniques). In
general, we can distinguish between binary hypothesis-testing problems, where the data
are tested against two hypotheses often known as the null and the alternate hypotheses,
and multiple-hypothesis-testing problems in which the data are tested against multiple
hypotheses. We can also distinguish between Bayesian approaches to hypothesis test-
ing, where one specifies a prior probability associated with each of the hypotheses, and
non-Bayesian ones, in which one does not specify a priori any prior probability.
Formally, a classical formulation of the binary hypothesis-testing problem involves
testing whether a number of i.i.d. data samples (features) x1 , x2 , . . . , xN of a random
variable X ∈ X ∼ pX conform to one or other of the hypotheses H0 : pX = p0 and
H1 : pX = p1 , where under the first hypothesis one postulates that the data are generated
i.i.d. according to model (distribution) p0 and under the second hypothesis one assumes
the data are generated i.i.d. according to model (distribution) p1 . A binary hypothesis
test T : X × · · · × X → {H0 , H1 } is a mapping that outputs an estimate of the hypothesis
given the data samples.
In non-Bayesian settings, the performance of such a binary hypothesis test can be
described by two error probabilities. The type-I error probability, which relates to the
rejection of a true null hypothesis, is given by
P_{e|0}(T) = P\big(T(X_1, X_2, . . . , X_N) = H_1 \mid H_0\big)    (1.33)
and the type-II error probability, which relates to the failure to reject a false null
hypothesis, is given by
P_{e|1}(T) = P\big(T(X_1, X_2, . . . , X_N) = H_0 \mid H_1\big).    (1.34)
In this class of problems, one is typically interested in minimizing one of the error probabilities subject to a constraint on the other error probability, i.e.,
P_e(α) = \min_{T : P_{e|0}(T) \le α} P_{e|1}(T),    (1.35)
where the minimum can be achieved using the well-known Neyman–Pearson test [102].
Information-theoretic tools – such as typicality [7] – have long been used to analyze the performance of this class of problems. For example, the classical Stein lemma asserts that, asymptotically as the number of data samples approaches infinity [7],
\lim_{α \to 0} \lim_{N \to \infty} \frac{1}{N} \log P_e(α) = -D(p_0 \| p_1),    (1.36)
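Because the log-likelihood-ratio statistic is Gaussian under both hypotheses in the Gaussian-mean example below, the Neyman–Pearson error probabilities can be computed in closed form, and the type-II error exponent can be compared directly with D(p_0‖p_1) = μ²/2 nats; the slow convergence of −(1/N) log P_{e|1} toward the Stein exponent is visible in the printout. The mean μ, the α level, and the tail approximation used for very small probabilities are illustrative choices for this sketch.

```python
from math import erf, sqrt, log, pi

def Phi(x):                     # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def log_Phi(x):                 # log CDF, with a tail approximation for very negative x
    return log(Phi(x)) if x > -8 else -0.5 * x * x - log(-x * sqrt(2.0 * pi))

def Phi_inv(q, lo=-10.0, hi=10.0):   # inverse CDF by bisection (sufficient for a sketch)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Phi(mid) < q else (lo, mid)
    return 0.5 * (lo + hi)

# H0: x_i ~ N(0,1) vs H1: x_i ~ N(mu,1). The Neyman-Pearson test thresholds the sum of
# log-likelihood ratios, sum_i (mu*x_i - mu^2/2), which is Gaussian under both hypotheses,
# so both error probabilities are available in closed form.
mu, alpha = 1.0, 1e-3
kl = mu ** 2 / 2.0                                   # D(p0 || p1) in nats
for N in [100, 1_000, 10_000, 100_000]:
    mean0, mean1, std = -N * kl, N * kl, sqrt(N) * mu
    tau = mean0 + std * Phi_inv(1.0 - alpha)         # type-I error P0(sum > tau) = alpha
    log_type2 = log_Phi((tau - mean1) / std)         # log of type-II error P1(sum <= tau)
    print(f"N = {N:6d}: -(1/N) log P_e|1 = {-log_type2 / N:.4f}   (D(p0||p1) = {kl:.4f})")
```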
P_{e,\min} = \min_T P_e(T) \ge 1 - \frac{H(C|X)}{\log_2(M-1)}.    (1.38)
A number of other bounds on the minimum average error probability involving Shannon
information measures, Rényi information measures, or other generalizations have also
been devised over the years [104–106] and have led to stronger converse results not only
in classical information-theory problems but also in data-science ones [107].
12 Some datasets can also be represented by hyper-graphs of interacting items, where vertices denote the different objects and hyper-edges denote multi-way interactions between the different objects.
Supervised Learning
In the supervised learning setup, one desires to learn a hypothesis based on a set of data
examples that can be used to make predictions given new data [90]. In particular, in order
to formalize the problem, let X be the domain set, Y be the label set, Z = X × Y be the
examples domain, μ be a distribution on Z, and W a hypothesis class (i.e., W = {W} is
a set of hypotheses W : X → Y). Let also S = (z_1, . . . , z_N) = ((x_1, y_1), . . . , (x_N, y_N)) ∈ Z^N
be the training set – consisting of a number of data points and their associated labels –
drawn i.i.d. from Z according to μ. A learning algorithm is a Markov kernel that maps
the training set S to an element W of the hypothesis class W according to the probability
law pW|S .
A key challenge relates to understanding the generalization ability of the learning
algorithm, where the generalization error corresponds to the difference between the
expected (or true) error and the training (or empirical) error. In particular, by consider-
ing a non-negative loss function L : W × Z → R+ , one can define the expected error and
the training error associated with a hypothesis W as follows:
loss_μ(W) = E\{L(W, Z)\}   and   loss_S(W) = \frac{1}{N}\sum_{i=1}^{N} L(W, z_i).
The generalization error is then gen(μ, p_{W|S}) = E\big[loss_μ(W) − loss_S(W)\big], where the expectation is with respect to the joint distribution of the algorithm input (the training set) and the algorithm output (the hypothesis).
A number of approaches have been developed throughout the years to characterize
the generalization error of a learning algorithm, relying on either certain complexity
measures of the hypothesis space or certain properties of the learning algorithm. These
include VC-based bounds [110], algorithmic stability-based bounds [111], algorithmic
robustness-based bounds [112], PAC-Bayesian bounds [113], and many more. However,
many of these generalization error bounds cannot explain the generalization abilities of a
variety of machine-learning methods for various reasons: (1) some of the bounds depend
only on the hypothesis class and not on the learning algorithm, (2) existing bounds do
not easily exploit dependences between different hypotheses, and (3) existing bounds
also do not exploit dependences between the learning algorithm input and output.
More recently, approaches leveraging information-theoretic tools have been emerging
to characterize the generalization ability of various learning methods. Such approaches
often express the generalization error in terms of certain information measures between
the algorithm input (the training dataset) and the algorithm output (the hypothesis),
thereby incorporating the various ingredients associated with the learning problem,
including the dataset distribution, the hypothesis space, and the learning algorithm itself.
In particular, inspired by [114], Xu and Raginsky [115] derive an upper bound on the
generalization error, applicable to σ-sub-Gaussian loss functions, given by
\big|\mathrm{gen}(μ, p_{W|S})\big| \le \sqrt{\frac{2σ^2}{N}\, I(S; W)},
where I(S; W) corresponds to the mutual information between the input – the dataset –
and the output – the hypothesis – of the algorithm. This bound supports the intuition that
the less information the output of the algorithm contains about the input to the algorithm
the less it will overfit, providing a means to strike a balance between the ability to fit
data and the ability to generalize to new data by controlling the algorithm’s input–output
mutual information. Raginsky et al. [116] also propose similar upper bounds on the
generalization error based on several information-theoretic measures of algorithmic sta-
bility, capturing the idea that the output of a stable learning algorithm cannot depend “too
much” on any particular training example. Other generalization error bounds involving
information-theoretic quantities appear in [117, 118]. In particular, Asadi et al. [118]
combine chaining and mutual information methods to derive generalization error bounds
that significantly outperform existing ones.
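As a toy numerical illustration of the bound of [115], consider a "learner" that outputs the empirical mean of N Gaussian samples plus independent Gaussian noise, with a truncated (hence bounded and sub-Gaussian) squared loss. For this construction I(S; W) is available in closed form, so the bound can be compared with a Monte Carlo estimate of the generalization gap. The distributions, the truncation level, and the noise scale are all illustrative assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
N, trials = 20, 2000
mu0, sigma_z, sigma_n, c = 0.0, 1.0, 0.3, 4.0   # data law, output noise, loss truncation (illustrative)

def loss(w, z):
    """Truncated squared loss; bounded in [0, c], hence (c/2)-sub-Gaussian."""
    return np.minimum((w - z) ** 2, c)

Z_test = rng.normal(mu0, sigma_z, size=20_000)   # fresh samples to approximate the expected loss
gaps = []
for _ in range(trials):
    S = rng.normal(mu0, sigma_z, size=N)
    W = S.mean() + sigma_n * rng.normal()        # "noisy ERM": W depends on S only through its mean
    gaps.append(loss(W, Z_test).mean() - loss(W, S).mean())
emp_gen = float(np.mean(gaps))

# Since W depends on S only through mean(S), I(S; W) = I(mean(S); W), which for this
# Gaussian construction equals 0.5 * ln(1 + (sigma_z^2 / N) / sigma_n^2) nats.
I_SW = 0.5 * np.log(1.0 + (sigma_z ** 2 / N) / sigma_n ** 2)
bound = np.sqrt(2.0 * (c / 2.0) ** 2 / N * I_SW)
print(f"empirical generalization gap ~ {emp_gen:.4f}, mutual-information bound = {bound:.4f}")
```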
Of particular relevance, these information-theoretically based generalization error
bounds have also been used to gain further insight into machine-learning models and
algorithms. For example, Pensia et al. [119] build upon the work by Xu and Ragin-
sky [115] to derive very general generalization error bounds for a broad class of iterative
algorithms that are characterized by bounded, noisy updates with Markovian structure,
including stochastic gradient Langevin dynamics (SGLD) and variants of the stochas-
tic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. This work demonstrates
that mutual information is a very effective tool for bounding the generalization error of a
large class of iterative empirical risk minimization (ERM) algorithms. Zhang et al. [120],
on the other hand, build upon the work by Xu and Raginsky [115] to study the expected
generalization error of deep neural networks, and offer a bound showing that the error decreases exponentially with the number of convolutional and pooling layers in
the network. Other works that study the generalization ability of deep networks based
on information-theoretic considerations and measures include [121, 122]. Chapters 10
and 11 scope these directions in supervised learning problems.
Unsupervised Learning
In unsupervised learning setups, one desires instead to understand the structure asso-
ciated with a set of data examples. In particular, multivariate information-theoretic
functionals such as partition information, minimum partition information, and multi-
information have been recently used in the formulation of unsupervised clustering
problems [123, 124]. Chapter 9 elaborates further on such approaches to unsupervised
learning problems.
Acknowledgments
The work of Miguel R. D. Rodrigues and Yonina C. Eldar was supported in part by the
Royal Society under award IE160348. The work of Stark C. Draper was supported in
part by a Discovery Research Grant from the Natural Sciences and Engineering Research
Council of Canada (NSERC). The work of Waheed U. Bajwa was supported in part by
the National Science Foundation under award CCF-1453073 and by the Army Research
Office under award W911NF-17-1-0546.
References
[16] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes for
symmetric binary-input memoryless channels,” IEEE Trans. Information Theory, vol. 55,
no. 7, pp. 3051–3073, 2009.
[17] A. Jiménez-Feltström and K. S. Zigangirov, “Time-varying periodic convolutional codes
with low-density parity-check matrix,” IEEE Trans. Information Theory, vol. 45, no. 2,
pp. 2181–2191, 1999.
[18] M. Lentmaier, A. Sridharan, D. J. J. Costello, and K. S. Zigangirov, “Iterative decod-
ing threshold analysis for LDPC convolutional codes,” IEEE Trans. Information Theory,
vol. 56, no. 10, pp. 5274–5289, 2010.
[19] S. Kudekar, T. J. Richardson, and R. L. Urbanke, “Threshold saturation via spatial cou-
pling: Why convolutional LDPC ensembles perform so well over the BEC,” IEEE Trans.
Information Theory, vol. 57, no. 2, pp. 803–834, 2011.
[20] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal
Processing Mag., vol. 25, no. 2, pp. 21–30, 2008.
[21] H. Q. Ngo and D.-Z. Du, “A survey on combinatorial group testing algorithms with appli-
cations to DNA library screening,” Discrete Math. Problems with Medical Appl., vol. 55,
pp. 171–182, 2000.
[22] G. K. Atia and V. Saligrama, “Boolean compressed sensing and noisy group testing,”
IEEE Trans. Information Theory, vol. 58, no. 3, pp. 1880–1901, 2012.
[23] D. Donoho and J. Tanner, “Observed universality of phase transitions in high-dimensional
geometry, with implications for modern data analysis and signal processing,” Phil. Trans.
Roy. Soc. A: Math., Phys. Engineering Sci., pp. 4273–4293, 2009.
[24] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp, “Living on the edge: Phase
transitions in convex programs with random data,” Information and Inference, vol. 3,
no. 3, pp. 224–294, 2014.
[25] J. Banks, C. Moore, R. Vershynin, N. Verzelen, and J. Xu, “Information-theoretic bounds
and phase transitions in clustering, sparse PCA, and submatrix localization,” IEEE Trans.
Information Theory, vol. 64, no. 7, pp. 4872–4894, 2018.
[26] R. Monasson, R. Zecchina, S. Kirkpatrick, B. Selman, and L. Troyansky, “Determining
computational complexity from characteristic ‘phase transitions,”’ Nature, vol. 400, no.
6740, pp. 133–137, 1999.
[27] G. Zeng and Y. Lu, “Survey on computational complexity with phase transitions and
extremal optimization,” in Proc. 48th IEEE Conf. Decision and Control (CDC ’09), 2009,
pp. 4352–4359.
[28] Y. C. Eldar, Sampling theory: Beyond bandlimited systems. Cambridge University Press,
2014.
[29] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[30] A. Kipnis, A. J. Goldsmith, Y. C. Eldar, and T. Weissman, “Distortion-rate function of
sub-Nyquist sampled Gaussian sources,” IEEE Trans. Information Theory, vol. 62, no. 1,
pp. 401–429, 2016.
[31] A. Kipnis, Y. C. Eldar, and A. J. Goldsmith, “Analog-to-digital compression: A new
paradigm for converting signals to bits,” IEEE Signal Processing Mag., vol. 35, no. 3,
pp. 16–39, 2018.
[32] A. Kipnis, Y. C. Eldar, and A. J. Goldsmith, “Fundamental distortion limits of analog-
to-digital compression,” IEEE Trans. Information Theory, vol. 64, no. 9, pp. 6013–6033,
2018.
[53] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data
mining, inference, and prediction, 2nd edn. Springer, 2016.
[54] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component
analysis,” IEEE Trans. Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[55] D. Erdogmus, K. E. Hild, Y. N. Rao, and J. C. Príncipe, “Minimax mutual information
approach for independent component analysis,” Neural Comput., vol. 16, no. 6, pp. 1235–
1252, 2004.
[56] A. Birnbaum, I. M. Johnstone, B. Nadler, and D. Paul, “Minimax bounds for sparse PCA
with noisy high-dimensional data,” Annals Statist., vol. 41, no. 3, pp. 1055–1084, 2013.
[57] R. Krauthgamer, B. Nadler, and D. Vilenchik, “Do semidefinite relaxations solve sparse
PCA up to the information limit?” Annals Statist., vol. 43, no. 3, pp. 1300–1322, 2015.
[58] Q. Berthet and P. Rigollet, “Optimal detection of sparse principal components in high dimension,” Annals Statist., vol. 41, no. 4, pp. 1780–1815, 2013.
[59] T. Cai, Z. Ma, and Y. Wu, “Optimal estimation and rank detection for sparse spiked
covariance matrices,” Probability Theory Related Fields, vol. 161, nos. 3–4, pp. 781–815,
2015.
[60] A. Onatski, M. Moreira, and M. Hallin, “Asymptotic power of sphericity tests for high-
dimensional data,” Annals Statist., vol. 41, no. 3, pp. 1204–1231, 2013.
[61] A. Perry, A. Wein, A. Bandeira, and A. Moitra, “Optimality and sub-optimality of PCA
for spiked random matrices and synchronization,” arXiv:1609.05573, 2016.
[62] Z. Ke, “Detecting rare and weak spikes in large covariance matrices,” arXiv:1609.00883,
2018.
[63] D. L. Donoho and C. Grimes, “Hessian eigenmaps: Locally linear embedding techniques
for high-dimensional data,” Proc. Natl. Acad. Sci. USA, vol. 100, no. 10, pp. 5591–5596,
2003.
[64] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework
for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323,
2000.
[65] R. Jenssen, “Kernel entropy component analysis,” IEEE Trans. Pattern Analysis Machine
Intelligence, vol. 32, no. 5, pp. 847–860, 2010.
[66] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in
Proc. Intl. Conf. Artificial Neural Networks (ICANN ’97), 1997, pp. 583–588.
[67] J. Yang, X. Gao, D. Zhang, and J.-Y. Yang, “Kernel ICA: An alternative formulation and
its application to face recognition,” Pattern Recognition, vol. 38, no. 10, pp. 1784–1787,
2005.
[68] S. Mika, G. Ratsch, J. Weston, B. Schölkopf, and K. R. Mullers, “Fisher discriminant
analysis with kernels,” in Proc. IEEE Workshop Neural Networks for Signal Processing
IX, 1999, pp. 41–48.
[69] H. Narayanan and S. Mitter, “Sample complexity of testing the manifold hypothesis,”
in Proc. Advances in Neural Information Processing Systems (NeurIPS ’10), 2010, pp.
1786–1794.
[70] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski,
“Dictionary learning algorithms for sparse representation,” Neural Comput., vol. 15,
no. 2, pp. 349–396, 2003.
[71] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing over-
complete dictionaries for sparse representation,” IEEE Trans. Signal Processing, vol. 54,
no. 11, pp. 4311–4322, 2006.
[72] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,”
in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’10),
2010, pp. 2691–2698.
[73] Q. Geng and J. Wright, “On the local correctness of 1 -minimization for dictionary learn-
ing,” in Proc. IEEE International Symposium on Information Theory (ISIT ’14), 2014, pp.
3180–3184.
[74] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon, “Learning
sparsely used overcomplete dictionaries,” in Proc. 27th Conference on Learning Theory
(COLT ’14), 2014, pp. 123–137.
[75] S. Arora, R. Ge, and A. Moitra, “New algorithms for learning incoherent and overcom-
plete dictionaries,” in Proc. 27th Conference on Learning Theory (COLT ’14), 2014, pp.
779–806.
[76] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious: Dictionary learning with
noise and outliers,” IEEE Trans. Information Theory, vol. 61, no. 11, pp. 6298–6319,
2015.
[77] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in
Proc. Advances in Neural Information Processing Systems 13 (NeurIPS ’01), 2001,
pp. 556–562.
[78] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative matrix and ten-
sor factorizations: Applications to exploratory multi-way data analysis and blind source
separation. John Wiley & Sons, 2009.
[79] M. Alsan, Z. Liu, and V. Y. F. Tan, “Minimax lower bounds for nonnegative matrix fac-
torization,” in Proc. IEEE Statistical Signal Processing Workshop (SSP ’18), 2018, pp.
363–367.
[80] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444,
2015.
[81] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016,
www.deeplearningbook.org.
[82] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in
Proc. IEEE Information Theory Workshop (ITW ’15), 2015.
[83] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via
information,” arXiv:1703.00810, 2017.
[84] C. W. Huang and S. S. Narayanan, “Flow of Rényi information in deep neural net-
works,” in Proc. IEEE International Workshop Machine Learning for Signal Processing
(MLSP ’16), 2016.
[85] P. Khadivi, R. Tandon, and N. Ramakrishnan, “Flow of information in feed-forward deep
neural networks,” arXiv:1603.06220, 2016.
[86] S. Yu, R. Jenssen, and J. Príncipe, “Understanding convolutional neural network training
with information theory,” arXiv:1804.09060, 2018.
[87] S. Yu and J. Príncipe, “Understanding autoencoders with information theoretic concepts,”
arXiv:1804.00057, 2018.
[88] A. Achille and S. Soatto, “Emergence of invariance and disentangling in deep represen-
tations,” arXiv:1706.01350, 2017.
[89] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler,
and Y. Bengio, “Learning deep representations by mutual information estimation and
maximization,” in International Conference on Learning Representations (ICLR ’19),
2019.
[108] E. Abbe, “Community detection and stochastic block models: Recent developments,”
J. Machine Learning Res., vol. 18, pp. 1–86, 2018.
[109] B. Hajek, Y. Wu, and J. Xu, “Computational lower bounds for community detection on
random graphs,” in Proc. 28th Conference on Learning Theory (COLT ’15), Paris, 2015,
pp. 1–30.
[110] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Networks,
vol. 10, no. 5, pp. 988–999, 1999.
[111] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Machine Learning Res.,
vol. 2, pp. 499–526, 2002.
[112] H. Xu and S. Mannor, “Robustness and generalization,” Machine Learning, vol. 86, no. 3,
pp. 391–423, 2012.
[113] D. A. McAllester, “PAC-Bayesian stochastic model selection,” Machine Learning,
vol. 51, pp. 5–21, 2003.
[114] D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via
information usage,” arXiv:1511.05219, 2016.
[115] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability
of learning algorithms,” in Proc. Advances in Neural Information Processing Systems
(NeurIPS ’17), 2017.
[116] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of
stability and bias of learning algorithms,” in Proc. IEEE Information Theory Workshop
(ITW ’16), 2016.
[117] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayof, “Learners that use little
information,” arXiv:1710.05233, 2018.
[118] A. R. Asadi, E. Abbe, and S. Verdú, “Chaining mutual information and tightening
generalization bounds,” arXiv:1806.03803, 2018.
[119] A. Pensia, V. Jog, and P. L. Loh, “Generalization error bounds for noisy, iterative
algorithms,” arXiv:1801.04295v1, 2018.
[120] J. Zhang, T. Liu, and D. Tao, “An information-theoretic view for deep learning,”
arXiv:1804.09060, 2018.
[121] M. Vera, P. Piantanida, and L. R. Vega, “The role of information complexity and
randomization in representation learning,” arXiv:1802.05355, 2018.
[122] M. Vera, L. R. Vega, and P. Piantanida, “Compression-based regularization with an
application to multi-task learning,” arXiv:1711.07099, 2018.
[123] C. Chan, A. Al-Bashadsheh, and Q. Zhou, “Info-clustering: A mathematical theory of
data clustering,” IEEE Trans. Mol. Biol. Multi-Scale Commun., vol. 2, no. 1, pp. 64–91,
2016.
[124] R. K. Raman and L. R. Varshney, “Universal joint image clustering and registration using
multivariate information measures,” IEEE J. Selected Topics Signal Processing, vol. 12,
no. 5, pp. 928–943, 2018.
[125] Z. Zhang and T. Berger, “Estimation via compressed information,” IEEE Trans. Informa-
tion Theory, vol. 34, no. 2, pp. 198–211, 1988.
[126] T. S. Han and S. Amari, “Parameter estimation with multiterminal data compression,”
IEEE Trans. Information Theory, vol. 41, no. 6, pp. 1802–1833, 1995.
[127] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower
bounds for distributed statistical estimation with communication constraints,” in Proc.
Advances in Neural Information Processing Systems (NeurIPS ’13), 2013.
[128] R. Ahlswede and I. Csiszár, “Hypothesis testing with communication constraints,” IEEE
Trans. Information Theory, vol. 32, no. 4, pp. 533–542, 1986.
[129] T. S. Han, “Hypothesis testing with multiterminal data compression,” IEEE Trans.
Information Theory, vol. 33, no. 6, pp. 759–772, 1987.
[130] T. S. Han and K. Kobayashi, “Exponential-type error probabilities for multiterminal
hypothesis testing,” IEEE Trans. Information Theory, vol. 35, no. 1, pp. 2–14, 1989.
[131] T. S. Han and S. Amari, “Statistical inference under multiterminal data compression,”
IEEE Trans. Information Theory, vol. 44, no. 6, pp. 2300–2324, 1998.
[132] H. M. H. Shalaby and A. Papamarcou, “Multiterminal detection with zero-rate data
compression,” IEEE Trans. Information Theory, vol. 38, no. 2, pp. 254–267, 1992.
[133] G. Katz, P. Piantanida, R. Couillet, and M. Debbah, “On the necessity of binning for
the distributed hypothesis testing problem,” in Proc. IEEE International Symposium on
Information Theory (ISIT ’15), 2015.
[134] Y. Xiang and Y. Kim, “Interactive hypothesis testing against independence,” in Proc.
IEEE International Symposium on Information Theory (ISIT ’13), 2013.
[135] W. Zhao and L. Lai, “Distributed testing against independence with conferencing
encoders,” in Proc. IEEE Information Theory Workshop (ITW ’15), 2015.
[136] W. Zhao and L. Lai, “Distributed testing with zero-rate compression,” in Proc. IEEE
International Symposium on Information Theory (ISIT ’15), 2015.
[137] W. Zhao and L. Lai, “Distributed detection with vector quantizer,” IEEE Trans. Signal
Information Processing Networks, vol. 2, no. 2, pp. 105–119, 2016.
[138] W. Zhao and L. Lai, “Distributed testing with cascaded encoders,” IEEE Trans. Informa-
tion Theory, vol. 64, no. 11, pp. 7339–7348, 2018.
[139] M. Raginsky, “Learning from compressed observations,” in Proc. IEEE Information
Theory Workshop (ITW ’07), 2007.
[140] M. Raginsky, “Achievability results for statistical learning under communication con-
straints,” in Proc. IEEE International Symposium on Information Theory (ISIT ’09),
2009.
[141] A. Xu and M. Raginsky, “Information-theoretic lower bounds for distributed function
computation,” IEEE Trans. Information Theory, vol. 63, no. 4, pp. 2314–2337, 2017.
[142] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations
and Trends Theoretical Computer Sci., vol. 9, no. 3–4, pp. 211–407, 2014.
[143] J. Liao, L. Sankar, V. Y. F. Tan, and F. P. Calmon, “Hypothesis testing under mutual
information privacy constraints in the high privacy regime,” IEEE Trans. Information
Forensics Security, vol. 13, no. 4, pp. 1058–1071, 2018.
[144] F. P. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney, “Data
pre-processing for discrimination prevention: Information-theoretic optimization and
analysis,” IEEE J. Selected Topics Signal Processing, vol. 12, no. 5, pp. 1106–1119, 2018.
2 An Information-Theoretic Approach
to Analog-to-Digital Compression
Alon Kipnis, Yonina C. Eldar, and Andrea J. Goldsmith
Summary
2.1 Introduction
Consider the minimal sampling rate that arises in classical sampling theory due to
Whittaker, Kotelnikov, Shannon, and Landau [1, 3, 4]. These works establish the Nyquist
rate or the spectral occupancy of the signal as the critical sampling rate above which
the signal can be perfectly reconstructed from its samples. This statement, however,
focuses only on the condition for perfectly reconstructing a bandlimited signal from
its infinite-precision samples; it does not incorporate the quantization precision of the
samples and does not apply to signals that are not bandlimited. In fact, as follows from
lossy source coding theory, it is impossible to obtain an exact representation of any
continuous-amplitude sequence of samples by a digital sequence of numbers due to
finite quantization precision, and therefore any digital representation of an analog sig-
nal is prone to error. That is, no continuous amplitude signal can be reconstructed from
its quantized samples with zero distortion regardless of the sampling rate, even when
the signal is bandlimited. This limitation raises the following question. In converting a
signal to bits via sampling and quantization at a given bit precision, can the signal be
reconstructed from these samples with minimal distortion based on sub-Nyquist sam-
pling? One of the goals of this chapter is to discuss this question by extending classical
sampling theory to account for quantization and for non-bandlimited inputs. Namely,
for an arbitrary stochastic input and given a total bitrate budget, we consider the lowest
sampling rate required to sample the signal such that reconstruction of the signal from
a bit-constrained representation of its samples results in minimal distortion. As we shall
see, without assuming any particular structure for the input analog signal, this sampling
rate is often below the signal’s Nyquist rate.
The minimal distortion achievable in recovering a signal from its representation by
a finite number of bits per unit time depends on the particular way the signal is quan-
tized or, more generally, encoded, into a sequence of bits. Since we are interested in the
fundamental distortion limit in recovering an analog signal from its digital representa-
tion, we consider all possible encoding and reconstruction (decoding) techniques. As an
example, in Fig. 2.1 the smartphone display may be viewed as a reconstruction of the
real-world painting The Starry Night from its digital representation. No matter how fine
the smartphone screen, this recovery is not perfect since the digital representation of the
analog image is not accurate. That is, loss of information occurs during the transforma-
tion from analog to digital. Our goal is to analyze this loss as a function of hardware
limitations on the sampling mechanism and the number of bits used in the encoding. It
is convenient to normalize this number of bits by the signal’s free dimensions, that is,
the dimensions along which new information is generated. For example, the free dimen-
sions of a visual signal are usually the horizontal and vertical axes of the frame, and the
free dimension of an audio wave is time. For simplicity, we consider analog signals with
a single free dimension, and we denote this dimension as time. Therefore, our restriction
on the digital representation is given in terms of its bitrate – the number of bits per unit
time.
For an arbitrary continuous-time random signal with a known distribution, the funda-
mental distortion limit due to the encoding of the signal using a limited bitrate is given
by Shannon’s distortion-rate function (DRF) [5–7]. This function provides the optimal
trade-off between the bitrate of the signal’s digital representation and the distortion in
recovering the original signal from this representation. Shannon’s DRF is described only
in terms of the distortion criterion, the probability distribution of the continuous-time
signal, and the maximal bitrate allowed in the digital representation. Consequently, the
Figure 2.2 Analog-to-digital compression (ADX) and reconstruction setting. Our goal is to derive
the minimal distortion between the signal and its reconstruction from any encoding at bitrate R of
the samples of the signal taken at sampling rate fs .
optimal encoding scheme that attains Shannon’s DRF is a general mapping from the
space of continuous-time signals to bits that does not consider any practical constraints
in implementing such a mapping. In practice, the minimal distortion in recovering analog
signals from their mapping to bits considers the digital encoding of the signal samples,
with a constraint on both the sampling rate and the bitrate of the system [8–10]. Here the
sampling rate fs is defined as the number of samples per unit time of the continuous-time
source signal; the bitrate R is the number of bits per unit time used in the representation
of these samples. The resulting system describing our problem is illustrated in Fig. 2.2,
and is denoted as the analog-to-digital compression (ADX) setting.
The digital representation in this setting is obtained by transforming a continuous-time
continuous-amplitude random source signal X(·) through the concatenated operation of
a sampler and an encoder, resulting in a bit sequence. The decoder estimates the original
analog signal from this bit sequence. The distortion is defined to be the mean-squared
error (MSE) between the input signal X(·) and its reconstruction X̂(·). Since we are
interested in the fundamental distortion limit subject to a sampling constraint, we allow
optimization over the encoder and the decoder as the time interval over which X(·) is
sampled goes to infinity. When X(·) is bandlimited and the sampling rate fs exceeds its
Nyquist rate fNyq , the encoder can recover the signal using standard interpolation and
use the optimal source code at bitrate R to attain distortion equal to Shannon’s DRF
of the signal [11]. Therefore, for bandlimited signals, a non-trivial interplay between
the sampling rate and the bitrate arises only when fs is below a signal’s Nyquist rate.
In addition to the optimal encoder and decoder, we also explore the optimal sampling
mechanism, but limit ourselves to the class of linear and continuous deterministic sam-
plers. Namely, each sample is defined by a bounded linear functional over a class of
signals. Finally, in order to account for system imperfections or those due to external
interferences, we assume that the signal X(·) is corrupted by additive noise prior to sam-
pling. The noise-free version is obtained from our results by setting the intensity of this
noise to zero.
The minimal distortion in the ADX system of Fig. 2.2 is bounded from below by
two extreme cases of the sampling rate and the bitrate, as illustrated in Fig. 2.3: (1)
when the bitrate R is unlimited, the minimal ADX distortion reduces to the minimal
Figure 2.3 The minimal sampling rate for attaining the minimal distortion achievable under a
bitrate-limited representation is usually below the Nyquist rate fNyq . In this figure, the noise is
assumed to be zero.
MSE (MMSE) in interpolating a signal from its noisy samples at rate fs [12, 13].
(2) When the sampling rate fs is unlimited or above the Nyquist rate of the signal and
when the noise is zero, the ADX distortion reduces to Shannon’s DRF of the signal.
Indeed, in this case, the optimal encoder can recover the original continuous-time source
without distortion and then encode this recovery in an optimal manner according to the
optimal lossy compression scheme attaining Shannon’s DRF. When fs is unlimited or
above the Nyquist rate and the noise is not zero, the minimal distortion is the indi-
rect (or remote) DRF of the signal given its noise-corrupted version, see Section 3.5 of
[7] and [14]. Our goal is therefore to characterize the MSE due to the joint effect of a
finite bitrate constraint and sampling at a sub-Nyquist sampling rate. In particular, we
are interested in the minimal sampling rate for which Shannon’s DRF, or the indirect
DRF, describing the minimal distortion subject to a bitrate constraint, is attained. As
illustrated in Fig. 2.3, and as will be explained in more detail below, this sampling rate
is usually below the Nyquist rate of X(·), or, more generally, the spectral occupancy
of X(·) when non-uniform or generalized sampling techniques are allowed. We denote
this minimal sampling rate as the critical sampling rate subject to a bitrate constraint,
since it describes the minimal sampling rate required to attain the optimal performance
in systems operating under quantization or bitrate restrictions. The critical sampling
rate extends the minimal-distortion sampling rate considered by Shannon, Nyquist, and
Landau. It is only as the bitrate goes to infinity that sampling at the Nyquist rate is
necessary to attain minimal (namely zero) distortion.
In order to gain intuition as to why the minimal distortion under a bitrate constraint
may be attained by sampling below the Nyquist rate, we first consider in Section 2.2 a
simpler version of the ADX setup involving the lossy compression of linear projections
of signals represented as finite-dimensional random real vectors. Next, in Section 2.3 we
formalize the combined sampling and source coding problem arising from Fig. 2.2 and
provide basic properties of the minimal distortion in this setting. In Section 2.4, we fully
characterize the minimal distortion in ADX as a function of the bitrate and sampling rate
and derive the critical sampling rate that leads to optimal performance. We conclude this
chapter in Section 2.5, where we consider uniform samplers, in particular single-branch
and more general multi-branch uniform samplers, and show that these samplers attain
the fundamental distortion limit.
Let X n be an n-dimensional Gaussian random vector with covariance matrix ΣX n , and let
Y m be a projected version of X n defined by
Y m = HX n , (2.1)
where H ∈ Rm×n is a deterministic matrix and m < n. This projection of X n into a lower-
dimensional space is the counterpart of sampling the continuous-time analog signal
X(·) in the ADX setting. We consider the normalized MMSE estimate of X n from a
representation of Y m using a limited number of bits.
Without constraining the number of bits, the distortion in this estimation is given by
\mathrm{mmse}(X^n \mid Y^m) \triangleq \frac{1}{n}\,\mathrm{tr}\big(\Sigma_{X^n} - \Sigma_{X^n|Y^m}\big),    (2.2)
where ΣX n |Y m is the conditional covariance matrix. However, when Y m is to be encoded
using a code of no more than nR bits, the minimal distortion cannot be smaller than
the indirect DRF of X n given Y m , denoted by DX n |Y m (R). This function is given by the
following parametric expression [14]:
m
D(Rθ ) = tr (ΣX n ) − λi ΣX n |Y m − θ + ,
i=1
(2.3)
1
m
+
Rθ = log λi ΣX n |Y m /θ ,
2 i=1
x+
where = max{x, 0} and λi ΣX n |Y m is the ith eigenvalue of ΣX n |Y m .
It follows from (2.2) that X n can be recovered from Y m with zero MMSE if and
only if
\lambda_i(\Sigma_{X^n}) = \lambda_i\big(\Sigma_{X^n|Y^m}\big),    (2.4)
for all i = 1, . . . , n. When this condition is satisfied, (2.3) takes the form
D(R_θ) = \sum_{i=1}^{n} \min\{\lambda_i(\Sigma_{X^n}), θ\},
R_θ = \frac{1}{2} \sum_{i=1}^{n} \log^+\big[\lambda_i(\Sigma_{X^n})/θ\big],
(2.5)
which is Kolmogorov’s reverse water-filling expression for the DRF of the vector Gaus-
sian source X n [15], i.e., the minimal distortion in encoding X n using codes of rate R bits
per source realization. The key insight is that the requirements for equality between (2.3)
and (2.5) are not as strict as (2.4): all that is needed is equality among those eigenvalues
that affect the value of (2.5). In particular, assume that for a point (R, D) on DX n (R),
only λn (ΣX n ), . . . , λn−m+1 (ΣX n ) are larger than θ, where the eigenvalues are organized in
ascending order. Then we can choose the rows of H to be the m left eigenvectors corre-
sponding to λn (ΣX n ), . . . , λn−m+1 (ΣX n ). With this choice of H, the m largest eigenvalues
of ΣX n |Y m are identical to the m largest eigenvalues of ΣX n , and (2.5) is equal to (2.3).
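To make this concrete, the following short numpy sketch (ours, not from the chapter; the helper reverse_waterfill and all variable names are hypothetical) builds a random covariance matrix, evaluates Kolmogorov's reverse water-filling (2.5), samples with the eigenvectors of the m largest eigenvalues, and checks that the indirect DRF (2.3) of the projected vector matches the DRF of the full vector.

import numpy as np

def reverse_waterfill(eigs, R, iters=200):
    # Kolmogorov reverse water-filling (2.5): bisect on the water level theta so that
    # 0.5 * sum(log2(max(eig/theta, 1))) = R, then return theta and the distortion.
    eigs = np.asarray(eigs, dtype=float)
    lo, hi = 1e-12, float(eigs.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(eigs / theta, 1.0)))
        if rate > R:
            lo = theta          # rate too high: raise the water level
        else:
            hi = theta
    return theta, float(np.sum(np.minimum(eigs, theta)))

rng = np.random.default_rng(0)
n, R = 8, 3.0                                           # toy dimension and bitrate per source vector
U, _ = np.linalg.qr(rng.standard_normal((n, n)))        # random orthonormal eigenvectors
lam = np.sort(rng.uniform(0.1, 4.0, size=n))            # eigenvalues in ascending order
Sigma = U @ np.diag(lam) @ U.T

theta, D_direct = reverse_waterfill(lam, R)             # DRF of X^n, eq. (2.5)
m = int(np.sum(lam > theta))                            # only m eigenvalues lie above the water level

H = U[:, n - m:].T                                      # rows = eigenvectors of the m largest eigenvalues
Sigma_XY = Sigma @ H.T                                  # cross-covariance of X^n and Y^m = H X^n
Sigma_Y = H @ Sigma @ H.T
Sigma_X_given_Y = Sigma_XY @ np.linalg.solve(Sigma_Y, Sigma_XY.T)   # covariance of E[X^n | Y^m]
mu = np.linalg.eigvalsh(Sigma_X_given_Y)
mu = mu[mu > 1e-9]                                      # its nonzero eigenvalues (the m largest of Sigma)

theta2, _ = reverse_waterfill(mu, R)
D_indirect = float(np.trace(Sigma) - np.sum(np.maximum(mu - theta2, 0.0)))   # eq. (2.3)
print(D_direct, D_indirect)                             # coincide up to numerical error

The two printed distortions agree because only the eigenvalues above the water level θ enter (2.5), and the chosen H preserves exactly those eigenvalues.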
Figure 2.4 Optimal sampling occurs whenever D_{X^n}(R) = D_{X^n|Y^m}(R). This condition is satisfied even for m < n, as long as there is equality among the eigenvalues of Σ_{X^n} and Σ_{X^n|Y^m} which are larger than the water-level parameter θ.
Since the rank of the sampling matrix is now m < n, we effectively performed sam-
pling below the “Nyquist rate” of X n without degrading the performance dictated by its
DRF. One way to understand this phenomenon is as an alignment between the range
of the sampling matrix H and the subspace over which X n is represented, according
to Kolmogorov’s expression (2.5). That is, when Kolmogorov’s expression implies that
not all degrees of freedom are utilized by the optimal distortion-rate code, sub-sampling
does not incur further performance loss, provided that the sampling matrix is aligned
with the optimal code. This situation is illustrated in Fig. 2.4. Sampling with an H that has fewer rows than the rank of Σ_{X^n} is the finite-dimensional analog of sub-Nyquist sampling in the infinite-dimensional setting of continuous-time signals.
In the rest of this chapter, we explore the counterpart of the phenomena described
above in the richer setting of continuous-time stationary processes that may or may not
be bandlimited, and whose samples may be corrupted by additive noise.
[Figure 2.5 (block diagram): the noisy signal X(·) + ε(·) enters the bounded linear sampler S(K_H, Λ), producing the samples Y_T ∈ R^{N_T}; an encoder maps Y_T to an index i ∈ {1, . . . , 2^{TR}}, from which a decoder produces an estimate of X(·) whose distortion is measured over [−T/2, T/2].]
The source X(·) is a stationary process with power spectral density (PSD) S_X(f). This PSD is assumed to be a real, symmetric, and absolutely integrable function that satisfies
E[X(t)X(s)] = ∫_{−∞}^{∞} S_X(f) e^{2πj(t−s)f} df,   t, s ∈ R.   (2.6)
The noise process ε(·) is another stationary process independent of X(·) with a PSD S_ε(f) of similar properties, so that the input to the sampler is the stationary process X_ε(·) ≜ X(·) + ε(·) with PSD S_{X_ε}(f) = S_X(f) + S_ε(f).
We note that, by construction, X(·) and ε(·) are regular processes in the sense that
their spectral measure has an absolutely continuous density with respect to the Lebesgue
measure. If in addition the support of S X ( f ), denoted by supp S X , is contained1 within
a bounded interval, then we say that X(·) is bandlimited and denote by fNyq its Nyquist
rate, defined as twice the maximal element in supp S X . The spectral occupancy of X(·)
is defined to be the Lebesgue measure of supp S X .
Although this is not necessary for all parts of our discussion, we assume that the
processes X(·) and ε(·) are Gaussian. This assumption leads to closed-form characteriza-
tions for many of the expressions we consider. In addition, it follows from [16, 17] that
a lossy compression policy that is optimal under a Gaussian distribution can be used to
encode non-Gaussian signals with matching second-order statistics, while attaining the
same distortion as if the signals were Gaussian. Hence, the optimal sampler and encod-
ing system we use to obtain the fundamental distortion limit for Gaussian signals attains
the same distortion limit for non-Gaussian signals as long as the second-order statistics
of the two signals are the same.
Given a sampling set Λ ⊂ R, we write Λ_T ≜ Λ ∩ [−T/2, T/2].
1 Since the PSD is associated with an absolutely continuous spectral measure, sets defined in terms of the PSD, e.g., supp S_X, are understood to be unique up to a symmetric difference of Lebesgue measure zero.
[Figure 2.6 (block diagram): the bounded linear sampler. X_ε(t) is filtered by the kernel K_H(t, s) and evaluated at the sampling times t_n ∈ Λ_T, producing the vector of samples Y_T ∈ R^{N_T}.]
We assume in addition that Λ is uniformly discrete in the sense that there exists ε > 0
such that |t − s| > ε for every non-identical t, s ∈ Λ. The density of ΛT is defined as the
number of points in ΛT divided by T and denoted here by d(ΛT ). Whenever it exists, we
define the limit
|Λ ∩ [−T/2, T/2]|
d(Λ) = lim d(ΛT ) = lim
T →∞ T →∞ T
as the symmetric density of Λ, or simply it’s density.
It is easy to check that d(T_s Z) = f_s for T_s = 1/f_s, and hence, in this case, the density of the sampling set has the usual interpretation of sampling rate.
[Figures 2.7 and 2.8 (block diagrams): a single-branch LTI uniform sampler, in which X_ε(·) is filtered by H(f) and sampled uniformly at rate f_s, and a multi-branch LTI uniform sampler, in which X_ε(·) is passed through L filters H_1(f), . . . , H_L(f) and each output is sampled uniformly at rate f_s/L.]
The encoder is a mapping
f : R^{N_T} → {1, . . . , 2^{TR}},   (2.7)
where N_T = dim(Y_T) = |Λ_T|. That is, the encoder receives the vector of samples Y_T and outputs an index out of 2^{TR} possible indices. The decoder receives this index and produces an estimate X̂(·) for the signal X(·) over the interval [−T/2, T/2]. Thus, it is a mapping
g : {1, . . . , 2^{TR}} → R^{[−T/2,T/2]}.   (2.8)
The goal of the joint operation of the encoder and the decoder is to minimize the expected mean-squared error (MSE)
(1/T) ∫_{−T/2}^{T/2} E[(X(t) − X̂(t))²] dt.
In practice, an encoder may output a finite number of samples that are then interpo-
lated to the continuous-time estimate X(·). Since our goal is to understand the limits in
converting signals to bits, this separation between decoding and interpolation, as well
as the possible restrictions each of these steps encounters in practice, are not explored
within the context of ADX.
Given a particular bounded linear sampler S = (Λ, K_H) and a bitrate R, we are interested in characterizing the function
D_T(S, R) ≜ inf_{f,g} (1/T) ∫_{−T/2}^{T/2} E[(X(t) − X̂(t))²] dt,   (2.9)
or its limit as T → ∞, where the infimum is over all encoders and decoders of the form
(2.7) and (2.8). The function DT (S , R) is defined only in terms of the sampler S and the
bitrate R, and in this sense measures the minimal distortion that can be attained using
the sampler S subject to a bitrate constraint R on the representation of the samples.
Optimal Encoding
Denote by X_T(·) the process that is obtained by estimating X(·) from the output of the sampler according to an MSE criterion. That is,
X_T(t) ≜ E[X(t) | Y_T],   t ∈ R.   (2.10)
From properties of the conditional expectation and the MSE, under any encoder f we may write
(1/T) ∫_{−T/2}^{T/2} E[(X(t) − X̂(t))²] dt = mmse_T(S) + mmse(X_T | f(Y_T)),   (2.11)
where
mmse_T(S) ≜ (1/T) ∫_{−T/2}^{T/2} E[(X(t) − X_T(t))²] dt   (2.12)
is the distortion due to sampling, while the second term in (2.11), mmse(X_T | f(Y_T)), is the distortion associated with the lossy compression procedure and depends on the sampler only through X_T(·).
The decomposition (2.11) already provides important clues on an optimal encoder
and decoder pair that attains DT (S , R). Specifically, it follows from (2.11) that there
is no loss in performance if the encoder tries to describe the process XT (·) subject to
the bitrate constraint, rather than the process X(·). Consequently, the optimal decoder
outputs the conditional expectation of XT (·) given f (YT ). The decomposition (2.11) was
first used in [14] to derive the indirect DRF of a pair of stationary Gaussian processes,
and later in [19] to derive indirect DRF expressions in other settings. An extension of the
principle presented in this decomposition to arbitrary distortion measures is discussed
in [20].
The decomposition (2.11) also sheds light on the behavior of the optimal distortion
DT (S , R) under the two extreme cases of unlimited bitrate and unrestricted sampling
rate, each of which is illustrated in Fig. 2.3. We discuss these two cases next.
Unlimited Bitrate
If we remove the bitrate constraint in the ADX setting (formally, letting R → ∞), loss of
information is only due to noise and sampling. In this case, the second term in the RHS
of (2.11) disappears, and the distortion in ADX is given by mmseT (S ). Namely, we have
lim DT (S , R) = mmseT (S ).
R→∞
The unlimited bitrate setting reduces the ADX problem to a classical problem in sam-
pling theory: the MSE under a given sampling system. Of particular interest is the case
of optimal sampling, i.e., when this MSE vanishes as T → ∞. For example, by con-
sidering the noiseless case and assuming that KH (t, s) = δ(t − s) is the identity operator,
the sampler is defined solely in terms of Λ. The condition on mmseT (S ) to converge
to zero is related to the conditions for stable sampling in Paley–Wiener spaces studied
by Landau and Beurling [21, 22]. In order to see this connection more precisely, note
that (2.6) defines an isomorphism between the Hilbert spaces of finite-variance ran-
dom variables measurable with respect to the sigma algebra generated by X(·) and the
Hilbert space generated by the inverse Fourier transform of e2π jt f S X ( f ), t ∈ R [23].
Specifically, this isomorphism is obtained by extending the map
X(t) ⟷ F^{-1}[e^{2πjtf} S_X(f)](s)
to the two aforementioned spaces. It follows that sampling and reconstructing X(·) with
vanishing MSE is equivalent to the same operation in the Paley–Wiener space of analytic
functions whose Fourier transform vanishes outside supp S_X. In particular, the condition mmse_T(S) → 0 as T → ∞ holds whenever Λ is a set of stable sampling in this Paley–Wiener
space, i.e., there exists a universal constant A > 0 such that the L2 norm of each function
in this space is bounded by A times the energy of the samples of this function. Landau
[21] showed that a necessary condition for this property is that the number of points in
Λ that fall within the interval [−T/2, T/2] is at least the spectral occupancy of X(·) times
T , minus a constant that is logarithmic in T . For this reason, this spectral occupancy is
termed the Landau rate of X(·), and we denote it here by fLnd . In the special case where
supp S X is an interval (symmetric around the origin since X(·) is real), the Landau and
Nyquist rates coincide.
Optimal Sampling
The other special case of the ADX setting is obtained when there is no loss of infor-
mation due to sampling. For example, this is the case when mmseT (S ) goes to zero
under the conditions mentioned above of zero noise, identity kernel, and sampling den-
sity exceeding the spectral occupancy. More generally, this situation occurs whenever
XT (·) converges (in expected norm) to the MMSE estimator of X(·) from Xε (·). This
MMSE estimator is a stationary process obtained by non-causal Wiener filtering, and
its PSD is
S_{X|X_ε}(f) ≜ S_X²(f) / (S_X(f) + S_ε(f)).   (2.13)
Since our setting does not limit the encoder from computing XT (·), the ADX problem
reduces in this case to the indirect source coding problem of recovering X(·) from a
bitrate R representation of its corrupted version Xε (·). This problem was considered and
Figure 2.9 Water-filling interpretation of (2.15). The distortion is the sum of mmse(X|X_ε) and the lossy compression distortion.
solved by Dobrushin and Tsybakov in [14], where the following expression was given
for the optimal trade-off between bitrate and distortion:
D_{X|X_ε}(R_θ) ≜ mmse(X|X_ε) + ∫_{−∞}^{∞} min{S_{X|X_ε}(f), θ} df,   (2.15a)
R_θ = (1/2) ∫_{−∞}^{∞} log^+ [S_{X|X_ε}(f) / θ] df.   (2.15b)
A graphical water-filling interpretation of (2.15) is given in Fig. 2.9. When the noise ε(·)
is zero, S X|Xε ( f ) = S X ( f ), and hence (2.15) reduces to
D_X(R_θ) ≜ ∫_{−∞}^{∞} min{S_X(f), θ} df,   (2.16a)
R_θ = (1/2) ∫_{−∞}^{∞} log^+ [S_X(f) / θ] df,   (2.16b)
which is Pinsker’s expression [15] for the DRF of the process X(·), denoted here by
DX (R). Note that (2.16) is the continuous-time counterpart of (2.3).
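As an illustration, the following numerical sketch (ours, not from the chapter; the frequency grid, the triangular example PSD, and the name drf_from_psd are assumptions made only for the example) evaluates Pinsker's water-filling expression (2.16) by bisecting on the water level θ. The same routine evaluates the indirect DRF (2.15) if S_{X|X_ε}(f) is supplied in place of S_X(f) and mmse(X|X_ε) is added to the result.

import numpy as np

def drf_from_psd(f, S, R, iters=200):
    # Pinsker's water-filling (2.16) on a frequency grid: bisect on theta so that
    # 0.5 * integral of log2^+(S/theta) df equals R [bits per unit time].
    df = f[1] - f[0]
    lo, hi = 1e-15, float(S.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(S / theta, 1.0))) * df
        if rate > R:
            lo = theta
        else:
            hi = theta
    return theta, float(np.sum(np.minimum(S, theta)) * df)

# Example: a triangular PSD supported on [-W, W] with total variance sigma2.
W, sigma2 = 1.0, 1.0
f = np.linspace(-W, W, 4001)
S = (sigma2 / W) * np.maximum(1.0 - np.abs(f) / W, 0.0)
for R in (0.5, 1.0, 2.0):
    theta, D = drf_from_psd(f, S, R)
    # D is D_X(R); the last number is the measure of the preserved band {f : S_X(f) >= theta}.
    print(R, D, np.sum(S >= theta) * (f[1] - f[0]))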
From the discussion above, we conclude that
DT (S , R) ≥ DX|Xε (R) ≥ max{DX (R), mmse(X|Xε )}. (2.17)
Furthermore, when the estimator E[X(t)|X_ε(·)] can be obtained from Y_T as T → ∞, we have that D_T(S, R) → D_{X|X_ε}(R). In this situation, we say that the conditions for optimal
sampling are met, since the only distortion is due to the noise and the bitrate constraint.
The two lower bounds in Fig. 2.3 describe the behavior of DT (S , R) in the two special
cases of unrestricted bitrate and optimal sampling. Our goal in the next section is to
characterize the intermediate case of non-optimal sampling and a finite bitrate constraint.
Given a particular bounded linear sampler S = (Λ, KH ) and a bitrate R, we defined the
function DT (S , R) as the minimal distortion that can be attained in the combined sam-
pling and lossy compression setup of Fig. 2.5. Our goal in this section is to derive and
analyze a function D ( fs , R) that bounds from below DT (S , R) for any such bounded
linear sampler with symmetric density of Λ not exceeding fs . The achievability of this
lower bound is addressed in the next section.
For a given f_s, let F(f_s) denote a set that maximizes
∫_F S_{X|X_ε}(f) df   (2.18)
over all Lebesgue measurable sets F whose Lebesgue measure does not exceed f_s. In other words, F(f_s) consists of the f_s spectral bands with the highest energy in the spectrum of the process {E[X(t)|X_ε(·)], t ∈ R}. Define
D(f_s, R_θ) = mmse(f_s) + ∫_{F(f_s)} min{S_{X|X_ε}(f), θ} df,   (2.19a)
R_θ = (1/2) ∫_{F(f_s)} log^+ [S_{X|X_ε}(f) / θ] df,   (2.19b)
where
mmse(f_s) ≜ σ_X² − ∫_{F(f_s)} S_{X|X_ε}(f) df = ∫_{−∞}^{∞} (S_X(f) − S_{X|X_ε}(f) 1_{F(f_s)}(f)) df.
DT (S , R) ≥ D ( fs , R).
Figure 2.10 Water-filling interpretation of D(f_s, R_θ): the fundamental distortion limit under any bounded linear sampling. This distortion is the sum of the fundamental estimation error mmse(f_s) and the lossy compression distortion.
(ii) Assume that the symmetric density of Λ exists and satisfies d(Λ) ≤ fs . Then, for
any bitrate R,
lim inf_{T→∞} D_T(S, R) ≥ D(f_s, R).
In addition to the negative statement of Theorem 2.1, we show in the next section the
following positive coding result.
theorem 2.2 (achievability) Let X(·) be a Gaussian stationary process corrupted by a
Gaussian stationary noise ε(·). Then, for any fs and ε > 0, there exists a bounded linear
sampler S with a sampling set of symmetric density not exceeding fs such that, for any
R, the distortion in ADX attained by sampling Xε (·) using S over a large enough time
interval T , and encoding these samples using T R bits, does not exceed D ( fs , R) + ε.
A full proof of Theorem 2.1 can be found in [24]. Intuition for Theorem 2.1 may be
obtained by representing X(·) according to its Karhunen–Loève (KL) expansion over
[−T/2, T/2], and then using a sampling matrix that keeps only N_T ≈ T f_s of these
coefficients. The function D ( fs , R) arises as the limiting expression in the noisy version
of (2.5), when the sampling matrix is tuned to keep those KL coefficients corresponding
to the NT largest eigenvalues in the expansion.
In Section 2.5, we provide a constructive proof of Theorem 2.2 that also shows that
D ( fs , R) is attained using a multi-branch LTI uniform sampler with an appropriate
choice of pre-sampling filters. The rest of the current section is devoted to studying
properties of the minimal ADX distortion D ( fs , R).
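The definition (2.19) can be evaluated numerically in the same spirit: on a frequency grid, keep the f_s highest-energy bands of S_{X|X_ε}(f) and water-fill the bitrate R over them. The sketch below (ours; the grid, the example PSDs, and the name adx_limit are assumptions made for illustration) does exactly this.

import numpy as np

def adx_limit(f, Sx, Seps, fs, R, iters=200):
    # Evaluate D(fs, R) of (2.19): keep the fs highest-energy bands of S_{X|X_eps}(f),
    # then water-fill the bitrate R over them and add the estimation error mmse(fs).
    df = f[1] - f[0]
    Scond = np.where(Sx + Seps > 0, Sx ** 2 / np.maximum(Sx + Seps, 1e-30), 0.0)
    keep = np.argsort(Scond)[::-1][: int(round(fs / df))]   # the set F(fs) on the grid
    mmse_fs = np.sum(Sx) * df - np.sum(Scond[keep]) * df
    Sk = Scond[keep]
    lo, hi = 1e-15, float(Sk.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(Sk / theta, 1.0))) * df
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    return mmse_fs + float(np.sum(np.minimum(Sk, theta)) * df)

# Unimodal example: triangular signal PSD, flat noise, and a sub-Nyquist sampling rate.
W = 1.0
f = np.linspace(-W, W, 4001)
Sx = np.maximum(1.0 - np.abs(f) / W, 0.0)
Seps = 0.1 * np.ones_like(f)
print(adx_limit(f, Sx, Seps, fs=1.2, R=1.0))             # fs below the Nyquist rate 2W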
Unlimited Bitrate
As R → ∞, the parameter θ goes to zero and (2.19a) reduces to mmse ( fs ). This function
describes the MMSE that can be attained by any bounded linear sampler with symmetric
density at most fs . In particular, in the non-noisy case, mmse ( fs ) = 0 if and only if fs
exceeds the Landau rate of X(·). Therefore, in view of the explanation in Section 2.3.3
and under unlimited bitrate, zero noise, and the identity pre-sampling operation,
Theorem 2.1 agrees with the necessary condition derived by Landau for stable sampling
in the Paley–Wiener space [21].
Optimal Sampling
The other extreme in the expression for D ( fs , R) is when fs is large enough that it does
not impose any constraint on sampling. In this case, we expect the ADX distortion to
coincide with the function D_{X|X_ε}(R) of (2.15), since the latter is the minimal distortion due only to noise and lossy compression at bitrate R. From the definition of F(f_s), whenever f_s ≥ f_Lnd we have F(f_s) ⊇ supp S_{X|X_ε}, so that mmse(f_s) = mmse(X|X_ε) and
D(f_s, R) = D_{X|X_ε}(R),   f_s ≥ f_Lnd.   (2.20)
In other words, the condition fs ≥ fLnd means that there is no loss due to sampling in
the ADX system. This property of the minimal distortion is not surprising. It merely
expresses the fact anticipated in Section 2.3.3 that, when (2.10) vanishes as T goes to
infinity, the estimator E[X(t)|Xε ] is obtained from the samples in this limit and thus the
only loss of information after sampling is due to the noise.
In the next section we will see that, under some conditions, the equality (2.20) is
extended to sampling rates smaller than the Landau rate of the signal.
Figure 2.11 Water-filling interpretation of the function D(f_s, R) under zero noise, a fixed bitrate R, and three sampling rates: (a) f_s < f_R, (b) f_s = f_R, and (c) f_s > f_R. Panel (d) corresponds to the DRF of X(·) at bitrate R. This DRF is attained whenever f_s ≥ f_R, where f_R is smaller than the Nyquist rate.
Figure 2.12 The function D(f_s, R) for the PSD of Fig. 2.11 and two values of the bitrate R. Also shown is the DRF of X(·) at these values, which is attained at the sub-Nyquist sampling rates marked by f_R.
In the example of Fig. 2.11 we have S_{X|X_ε}(f) = S_X(f) since the noise is zero, f_Lnd = f_Nyq since S_X(f) has a connected support, and F(f_s) is the interval of length f_s centered around the origin since S_X(f)
is unimodal. In all cases in Fig. 2.11 the bitrate R is fixed and corresponds to the pre-
served part of the spectrum through (2.19b). The distortion D ( fs , R) changes with fs ,
and is given by the sum of two terms in (2.19a): mmse ( fs ) and the lossy compres-
sion distortion. For example, the increment in fs from (a) to (b) reduces mmse ( fs ) and
increases the lossy compression distortion, although the overall distortion decreases due
to this increment. However, the increase in fs leading from (b) to (c) is different: while
(c) shows an additional reduction in mmse ( fs ) compared with (b), the sum of the two
distortion terms is identical in both cases and, as illustrated in (d), equals the DRF of
X(·) from (2.16). It follows that, in the case of Fig. 2.11, the optimal ADX performance
is attained at some sampling rate fR that is smaller than the Nyquist rate, and depends
on the bitrate R through expression (2.16). The full behavior of D ( fs , R) as a function
of fs is illustrated in Fig. 2.12 for two values of R.
The phenomenon described above and in Figs. 2.11 and 2.12 can be generalized to any
Gaussian stationary process with arbitrary PSD and noise in the ADX setting, according
to the following theorem.
theorem 2.3 (optimal sampling rate [24]) Let X(·) be a Gaussian stationary process
with PSD S X ( f ) corrupted by a Gaussian noise ε(·). For each point (R, D) on the graph
of DX|Xε (R) associated with a water-level θ via (2.15), let fR be the Lebesgue measure of
the set
F_θ ≜ {f : S_{X|X_ε}(f) ≥ θ}.
Then, for all f_s ≥ f_R,
D(f_s, R) = D_{X|X_ε}(R).
The proof of Theorem 2.3 is relatively straightforward and follows from the definition
of Fθ and D ( fs , R).
We emphasize that the critical frequency fR depends only on the PSDs S X ( f )
and S ε ( f ), and on the operating point on the graph of D ( fs , R). This point may be
parametrized by D, R, or the water-level θ using (2.15). Furthermore, we can consider
a version of Theorem 2.3 in which the bitrate is a function of the distortion and the
sampling rate, by inverting D ( fs , R) with respect to R. This inverse function, R ( fs , D),
is the minimal number of bits per unit time one must provide on the samples of Xε (·),
obtained by any bounded linear sampler with sampling density not exceeding fs , in order
to attain distortion not exceeding D. The following representation of R ( fs , D) in terms
of fR is equivalent to Theorem 2.3.
theorem 2.4 (rate-distortion lower bound) Consider the samples of a Gaussian sta-
tionary process X(·) corrupted by a Gaussian noise ε(·) obtained by a bounded linear
sampler of maximal sampling density fs . The bitrate required to recover X(·) with MSE
at most D > mmse ( fs ) is at least
R(f_s, D) =
  (1/2) ∫_{F(f_s)} log^+ [ f_s S_{X|X_ε}(f) / (D − mmse(f_s)) ] df,   f_s < f_R,
  R_{X|X_ε}(D),   f_s ≥ f_R,   (2.21)
where
R_{X|X_ε}(D_θ) = (1/2) ∫_{−∞}^{∞} log^+ [S_{X|X_ε}(f) / θ] df
is the indirect rate-distortion function of X(·) given X_ε(·), and θ is determined by
D_θ = mmse(f_s) + ∫_{−∞}^{∞} min{S_{X|X_ε}(f), θ} df.
Theorems 2.3 and 2.4 imply that the equality in (2.20), which was previously shown to
hold for fs ≥ fLnd , is extended to all sampling rates above fR ≤ fLnd . As R goes to infinity,
D ( fs , R) converges to mmse ( fs ), the water-level θ goes to zero, the set Fθ coincides
with the support of S X ( f ), and fR converges to fLnd . Theorem 2.3 then implies that
mmse ( fs ) = 0 for all fs ≥ fLnd , a fact that agrees with Landau’s characterization of sets
of sampling for perfect recovery of signals in the Paley–Wiener space, as explained in
Section 2.3.3.
An intriguing way to explain the critical sampling rate subject to a bitrate constraint
arising from Theorem 2.3 follows by considering the degrees of freedom in the repre-
sentation of the analog signal pre- and post-sampling and with lossy compression of the
samples. For stationary Gaussian signals with zero sampling noise, the degrees of free-
dom in the signal representation are those spectral bands in which the PSD is non-zero.
When the signal energy is not uniformly distributed over these bands, the optimal lossy
compression scheme described by (2.16) calls for discarding those bands with the lowest
energy, i.e., the parts of the signal with the lowest uncertainty.
The degree to which the new critical rate fR is smaller than the Nyquist rate depends
on the energy distribution of X(·) across its spectral occupancy. The more uniform this
Figure 2.13 The critical sampling rate f_R [smp/time] as a function of the bitrate R [bit/time] for the PSDs given in the small frames at the top of the figure. For the bandlimited PSDs S_Π(f), S_△(f), and S_ω(f), the critical sampling rate is always at or below the Nyquist rate. The critical sampling rate is finite for any R, even for the non-bandlimited PSD S_Ω(f).
distribution, the more degrees of freedom are required to represent the lossy compressed
signal and therefore the closer fR is to the Nyquist rate. In the examples below we derive
the precise relation between f_R and R for various PSDs. These relations are illustrated in Fig. 2.13 for the PSDs S_Π(f), S_△(f), S_ω(f), and S_Ω(f) defined below.
2.4.4 Examples
Example 2.1 Consider the Gaussian stationary process X_△(·) with PSD
S_△(f) ≜ σ² [1 − |f/W|]^+ / W,
for some W > 0. Assuming that the noise is zero,
F_θ = [W(Wθ − 1), W(1 − Wθ)]
and thus f_R = 2W(1 − Wθ). The exact relation between f_R and R is obtained from (2.19b) and found to be
R = (1/2) ∫_{−f_R/2}^{f_R/2} log [ (1 − |f/W|) / (1 − f_R/(2W)) ] df = W log(2W/(2W − f_R)) − f_R/(2 ln 2).
In particular, note that R → ∞ leads to fR → fNyq = 2W, as anticipated.
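A quick numerical inversion of the relation just derived (a sketch under the same noiseless, triangular-PSD assumptions; the helper names are ours) gives the critical rate f_R for any bitrate R by bisection.

import numpy as np

def rate_of_fr(fr, W):
    # R as a function of f_R for the triangular PSD of Example 2.1 (bits per unit time).
    return W * np.log2(2 * W / (2 * W - fr)) - fr / (2 * np.log(2))

def critical_rate(R, W, iters=100):
    # Invert the relation above by bisection: find f_R in [0, 2W) for a given bitrate R.
    lo, hi = 0.0, 2 * W * (1 - 1e-12)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rate_of_fr(mid, W) < R else (lo, mid)
    return 0.5 * (lo + hi)

W = 1.0
for R in (0.5, 1.0, 2.0, 5.0):
    print(R, critical_rate(R, W))        # f_R approaches the Nyquist rate 2W as R grows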
Example 2.2 considers the process X_Π(·) with the flat, bandlimited PSD S_Π(f) of (2.22). Assume that ε(·) is noise with a flat spectrum within the band (−W, W) such that γ ≜ S_Π(f)/S_ε(f) is the SNR at the spectral component f. Under these conditions, the water-level θ in (2.15) satisfies
θ = σ² (γ/(1 + γ)) 2^{−R/W},
and hence
D_{X|X_ε}(R) / σ² = 1/(1 + γ) + (γ/(1 + γ)) 2^{−R/W}.   (2.23)
In particular, Fθ = [−W, W], so that fR = 2W = fNyq for any bitrate R and D ( fs , R) =
DX|Xε (R) only for fs ≥ fNyq . That is, for the process XΠ (·), optimal sampling under a
bitrate constraint occurs only at or above its Nyquist rate.
Note that, although the Nyquist rate of XΩ (·) is infinite, for any finite R there exists a
critical sampling frequency fR satisfying (2.24) such that DXΩ (R) is attained by sampling
at or above fR .
The asymptotic behavior of (2.24) as R goes to infinity is given by R ∼ fR /ln 2. Thus,
for R sufficiently large, the optimal sampling rate is linearly proportional to R. The ratio
R/ fs is the average number of bits per sample used in the resulting digital representa-
tion. It follows from (2.24) that, asymptotically, the “right” number of bits per sample
converges to 1/ ln 2 ≈ 1.45. If the number of bits per sample is below this value, then
the distortion in ADX is dominated by the DRF DXΩ (·), as there are not enough bits to
represent the information acquired by the sampler. If the number of bits per sample is
greater than this value, then the distortion in ADX is dominated by the sampling distor-
tion, as there are not enough samples for describing the signal up to a distortion equal to
its DRF.
As a numerical example, assume that we encode XΩ (t) using two bits per sample,
i.e., fs = 2R. As R → ∞, the ratio between the minimal distortion D ( fs , R) and DXΩ (R)
converges to approximately 1.08, whereas the ratio between D ( fs , R) and mmse ( fs )
converges to approximately 1.48. In other words, it is possible to attain the optimal
encoding performance within an approximately 8% gap by providing one sample per
each two bits per unit time used in this encoding. On the other hand, it is possible to
attain the optimal sampling performance within an approximately 48% gap by providing
two bits per each sample taken.
We now analyze the distortion in the ADX setting of Fig. 2.5 under the important class of
single- and multi-branch LTI uniform samplers. Our goal in this section is to show that
for any source and noise PSDs, S X ( f ) and S ε ( f ), respectively, the function D ( fs , R)
describing the fundamental distortion limit in ADX is attainable using a multi-branch
LTI uniform sampler. By doing so, we also provide a proof of Theorem 2.2.
We begin by analyzing the ADX system of Fig. 2.5 under an LTI uniform sampler.
As we show, the asymptotic distortion in this case can be obtained in a closed form
that depends only on the signal and noise PSDs, the sampling rate, the bitrate, and the
pre-sampling filter H( f ). We then show that, by taking H( f ) to be a low-pass filter with
cutoff frequency fs /2, we can attain the fundamental distortion limit D ( fs , R) whenever
the function S X|Xε ( f ) of (2.13) attains its maximum at the origin. In the more general
case of an arbitrarily shaped S X|Xε ( f ), we use multi-branch sampling in order to achieve
D ( fs , R).
In the limit of large T, the relevant estimate is the conditional expectation of X(t) with respect to the sigma algebra generated by {X_ε(n/f_s), n ∈ Z}. Using standard linear estimation techniques, this conditional expectation has a representation similar to that of a Wiener filter, given by [12]:
X̃(t) ≜ E[X(t) | {X_ε(n/f_s), n ∈ Z}] = Σ_{n∈Z} X_ε(n/f_s) w(t − n/f_s),   t ∈ R,   (2.25)
where w(·) is the optimal interpolation filter, and where
S_X̃(f) ≜ [Σ_{n∈Z} S_X²(f − f_s n) |H(f − f_s n)|²] / [Σ_{n∈Z} S_{X_ε}(f − f_s n) |H(f − f_s n)|²].   (2.27)
From the decomposition (2.11), it follows that, when S is an LTI uniform sampler, the distortion can be expressed as
D_H(f_s, R) ≜ lim_{T→∞} D_T(S, R) = mmse_H(f_s) + D_X̃(R),
where D_X̃(R) is the DRF of the Gaussian process X̃(·) defined by (2.25), whose law is the limit of the law of the process X_T(·) as T goes to infinity, and mmse_H(f_s) is the corresponding limiting MMSE in estimating X(·) from the samples.
Note that, whenever f_s ≥ f_Nyq and supp S_X is included within the passband of H(f), we have S_X̃(f) = S_{X|X_ε}(f) and thus mmse_H(f_s) = mmse(X|X_ε), i.e., there is no distortion due to sampling. Moreover, in this situation, X̃(t) = E[X(t)|X_ε(·)] and
D_H(f_s, R) = D_{X|X_ε}(R),   f_s ≥ f_Nyq.   (2.28)
The equality (2.28) is a special case of (2.20) for LTI uniform sampling, and says that
there is no loss due to sampling in ADX whenever the sampling rate exceeds the Nyquist
rate of X(·).
When the sampling rate is below fNyq , (2.25) implies that the estimator X(·) has the
form of a stationary process modulated by a deterministic pulse, and is therefore a block-
stationary process, also called a cyclostationary process [25]. The DRF for this class of
processes can be described by a generalization of the orthogonal transformation and rate
allocation that leads to the water-filling expression (2.16) [26]. Evaluating the result-
ing expression for the DRF of the cyclostationary process X(·) leads to a closed-form
expression for DH ( fs , R), which was initially derived in [27].
theorem 2.5 (achievability for LTI uniform sampling) Let X(·) be a Gaussian station-
ary process corrupted by a Gaussian stationary noise ε(·). The minimal distortion in
Figure 2.14 Water-filling interpretation of (2.29) with an all-pass filter H(f). The function Σ_{k∈Z} S_X(f − f_s k) is the aliased PSD that represents the full energy of the original signal within the discrete-time spectrum interval (−f_s/2, f_s/2). The part of the energy recovered by the estimator X̃(·) is S_X̃(f). The distortion due to lossy compression is obtained by water-filling over the recovered energy according to (2.29a). The overall distortion D_H(f_s, R) is the sum of the sampling distortion and the distortion due to lossy compression.
ADX at bitrate R with an LTI uniform sampler with sampling rate f_s and pre-sampling filter H(f) is given by
D_H(f_s, R_θ) = mmse_H(f_s) + ∫_{−f_s/2}^{f_s/2} min{S_X̃(f), θ} df,   (2.29a)
R_θ = (1/2) ∫_{−f_s/2}^{f_s/2} log_2^+ [S_X̃(f) / θ] df.   (2.29b)
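The following sketch (ours) evaluates (2.29) on a grid: it builds S_X̃(f) on (−f_s/2, f_s/2) from the aliased signal and noise PSDs and then water-fills the bitrate R. It assumes mmse_H(f_s) = σ_X² − ∫_{−f_s/2}^{f_s/2} S_X̃(f) df, i.e., the energy not recovered by the sampler, which is consistent with the closed form (2.30) below; the flat-spectrum check at the end reproduces that formula. All names and parameters are assumptions made for illustration.

import numpy as np

def dh_uniform_sampler(Sx, Seps, Hf, fmax, fs, R, n_grid=4001, n_alias=50, iters=200):
    # Evaluate D_H(fs, R) of (2.29): build the recovered spectrum S_Xtilde(f) on
    # (-fs/2, fs/2) from the aliased signal and noise PSDs, then water-fill the bitrate R.
    f = np.linspace(-fs / 2, fs / 2, n_grid)
    df = f[1] - f[0]
    num = np.zeros_like(f)
    den = np.zeros_like(f)
    for n in range(-n_alias, n_alias + 1):
        g = np.abs(Hf(f - fs * n)) ** 2
        num += Sx(f - fs * n) ** 2 * g
        den += (Sx(f - fs * n) + Seps(f - fs * n)) * g
    Sxt = np.where(den > 0, num / np.maximum(den, 1e-30), 0.0)
    ff = np.linspace(-fmax, fmax, 20001)
    sigma2 = np.sum(Sx(ff)) * (ff[1] - ff[0])             # total signal variance
    mmse_H = sigma2 - np.sum(Sxt) * df                    # assumed: energy not recovered by the sampler
    lo, hi = 1e-15, float(Sxt.max())
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = 0.5 * np.sum(np.log2(np.maximum(Sxt / theta, 1.0))) * df
        lo, hi = (theta, hi) if rate > R else (lo, theta)
    return mmse_H + float(np.sum(np.minimum(Sxt, theta)) * df)

# Flat PSD on [-W, W], no noise, all-pass filter: reproduces the closed form (2.30) below.
W, sigma2, fs, R = 1.0, 1.0, 1.5, 1.0
Sx = lambda f: (sigma2 / (2 * W)) * (np.abs(f) <= W)
Szero = lambda f: np.zeros_like(f)
allpass = lambda f: np.ones_like(f)
print(dh_uniform_sampler(Sx, Szero, allpass, W, fs, R))
print(sigma2 * (1 - fs / (2 * W)) + sigma2 * (fs / (2 * W)) * 2 ** (-2 * R / fs))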
Example 2.4 (continuation of Example 2.2) As a simple example for using formula
(2.29), consider the process XΠ (·) of Example 2.2. Assuming that the noise ε(·) ≡ 0
(equivalently, γ → ∞) and that H( f ) passes all frequencies f ∈ [−W, W], the relation
between the distortion in (2.29a) and the bitrate in (2.29b) is given by
Figure 2.15 Distortion as a function of sampling rate for the source with PSD S_Π(f) of (2.22), zero noise, and source coding rates R = 1 and R = 2 bits per time unit; D_H(f_s, R) is compared with the DRF D_X(R).
D_H(f_s, R) =
  mmse_H(f_s) + σ² (f_s/(2W)) 2^{−2R/f_s},   f_s < 2W,
  σ² 2^{−R/W},   f_s ≥ 2W,   (2.30)
where mmse_H(f_s) = σ² [1 − f_s/(2W)]^+. Expression (2.30) is shown in Fig. 2.15 for two fixed values of the bitrate R. It has a very intuitive structure: for sampling rates below f_Nyq = 2W, the distortion at a given bitrate is increased due to the error resulting from non-optimal sampling. This excess distortion completely vanishes once the sampling rate exceeds the Nyquist rate, in which case D_H(f_s, R) coincides with the DRF of X(·).
In the noisy case when γ = S_Π(f)/S_ε(f), we have mmse_H(f_s) = σ²(1 − (f_s/(2W))(γ/(1 + γ))) and the distortion takes the form
D_H(f_s, R) =
  mmse_H(f_s) + σ² (f_s/(2W)) (γ/(1 + γ)) 2^{−2R/f_s},   f_s < 2W,
  mmse(X|X_ε) + σ² (γ/(1 + γ)) 2^{−R/W},   f_s ≥ 2W.   (2.31)
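For reference, a small helper (ours, not from the chapter) evaluates the closed forms (2.30) and (2.31) directly; the noisy branch uses mmse_H(f_s) = σ²(1 − (f_s/(2W)) γ/(1 + γ)) as written above, and γ → ∞ recovers the noiseless case.

import numpy as np

def dh_flat(fs, R, W=1.0, sigma2=1.0, gamma=np.inf):
    # Closed-form D_H(fs, R) of (2.30)/(2.31) for the flat-spectrum process X_Pi
    # observed at SNR gamma; gamma = inf recovers the noiseless case (2.30).
    snr = gamma / (1.0 + gamma) if np.isfinite(gamma) else 1.0
    if fs < 2 * W:
        mmse_H = sigma2 * (1 - (fs / (2 * W)) * snr)
        return mmse_H + sigma2 * (fs / (2 * W)) * snr * 2 ** (-2 * R / fs)
    return sigma2 * (1 - snr) + sigma2 * snr * 2 ** (-R / W)

for fs in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(fs, dh_flat(fs, R=1.0, gamma=10.0))   # saturates at the indirect DRF once fs >= 2W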
Next, we show that, when S X|Xε ( f ) is unimodal, an LTI uniform sampler can be used
to attain D ( fs , R).
Comparing (2.32) with (2.19), we see that the two expressions coincide whenever the interval [−f_s/2, f_s/2] maximizes (2.18). Therefore, we conclude that, when the function S_{X|X_ε}(f) is unimodal in the sense that it attains its maximal value at the origin, the
fundamental distortion in ADX is attained using an LTI uniform sampler with a low-pass
filter of cutoff frequency fs /2 as its pre-sampling operation.
An example for a PSD for which (2.32) describes its fundamental distortion limit is
the one in Fig. 2.11. Note the LPF with cutoff frequency fs /2 in cases (a)–(c) there.
Another example of this scenario for a unimodal PSD is given in Example 2.4 above.
Example 2.5 (continuation of Examples 2.2 and 2.4) In the case of the process XΠ (·)
with a flat spectrum noise as in Examples 2.2 and 2.4, (2.32) leads to (2.31). It follows
that the fundamental distortion limit in ADX with respect to XΠ (·) and a flat spectrum
noise is given by (2.31), which was obtained from (2.29). Namely, the fundamental
distortion limit in this case is obtained using any pre-sampling filter whose passband
contains [−W, W], and using an LPF is unnecessary.
In particular, the distortion in (2.31) corresponding to fs ≥ fNyq equals the indirect
DRF of X(·) given Xε (·), which can be found directly from (2.15). Therefore, (2.31)
implies that optimal sampling for XΠ (·) under LTI uniform sampling occurs only at or
above its Nyquist rate. This conclusion is not surprising since, according to Example 2.2,
super-Nyquist sampling of XΠ (·) is necessary for (2.20) to hold under any bounded linear
sampler.
D_{H_1,...,H_L}(f_s, R_θ) = mmse_{H_1,...,H_L}(f_s) + Σ_{l=1}^L ∫_{−f_s/2}^{f_s/2} min{λ_l(f), θ} df,   (2.33a)
R_θ = (1/2) Σ_{l=1}^L ∫_{−f_s/2}^{f_s/2} log_2^+ [λ_l(f) / θ] df,   (2.33b)
with
(S_Y(f))_{i,j} = Σ_{n∈Z} S_{X_ε}(f − f_s n) H_i(f − f_s n) H_j^*(f − f_s n),   i, j = 1, . . . , L,
(K(f))_{i,j} = Σ_{n∈Z} S_X²(f − f_s n) H_i(f − f_s n) H_j^*(f − f_s n),   i, j = 1, . . . , L.
In addition,
mmse_{H_1,...,H_L}(f_s) ≜ σ_X² − ∫_{−f_s/2}^{f_s/2} tr(S_X̃(f)) df
is the minimal MSE in estimating X(·) from the combined output of the L sampling branches as T approaches infinity.
The most interesting feature in the extension of (2.29) provided by (2.33) is the depen-
dences between samples obtained over different branches, expressed in the definition of
the matrix S_X̃(f). In particular, if f_s ≥ f_Nyq, then we may choose the passbands of the L filters to be a set of L disjoint intervals, each of length at most f_s/L, such that the union of their supports contains the support of S_{X|X_ε}. For this choice, the matrix S_X̃(f) is diagonal and its eigenvalues are
λ_l(f) = [Σ_{n∈Z} S_X²(f − f_s n) / Σ_{n∈Z} S_{X_ε}(f − f_s n)] 1_{supp H_l}(f).
Since the union of the filters' supports contains the support of S_{X|X_ε}, we have
D_{H_1,...,H_L}(f_s, R) = D_{X|X_ε}(R).
While it is not surprising that a multi-branch sampler attains the optimal sampling dis-
tortion when fs is above the Nyquist rate, we note that at each branch the sampling rate
can be as small as fNyq /L. This last remark suggests that a similar principle may be used
under sub-Nyquist sampling to sample those particular parts of the spectrum of maximal
energy whenever S X|Xε ( f ) is not unimodal.
Our goal now is to prove Theorem 2.2 by showing that, for any PSDs S X ( f ) and
S ε ( f ), the distortion in (2.33) can be made arbitrarily close to the fundamental distor-
tion limit D ( fs , R) with an appropriate choice of the number of sampling branches and
their filters. Using the intuition gained above, given a sampling rate fs we cover the set
of maximal energy F ( fs ) of (2.18) using L disjoint intervals, such that the length of
each interval does not exceed fs /L. For any ε > 0, it can be shown that there exists L
large enough such that ∫_Δ S_{X|X_ε}(f) df < ε, where Δ is the part of F(f_s) that is not covered by the L intervals [28].
From this explanation, we conclude that, for any PSD S_{X|X_ε}(f), f_s > 0, and ε > 0, there exists an integer L and a set of L pre-sampling filters H_1(f), . . . , H_L(f) such that, for every bitrate R,
D_{H_1,...,H_L}(f_s, R) ≤ D(f_s, R) + ε.   (2.34)
Since DH1 ,...,HL ( fs , R) is obtained in the limit as T approaches infinity of the mini-
mal distortion in ADX under the aforementioned multi-branch uniform sampler, the
fundamental distortion limit in ADX is achieved up to an arbitrarily small constant.
The description starting from Theorem 2.6 and ending in (2.34) sketches the proof
of the achievability side of the fundamental ADX distortion (Theorem 2.2). Below we
summarize the main points in the procedure described in this section.
(i) Given a sampling rate f_s, use a multi-branch LTI uniform sampler with a number of sampling branches L large enough that the effective passband of all branches is close enough to F(f_s), a set of Lebesgue measure f_s that maximizes (2.18).
(ii) Estimate the signal X(·) under an MSE criterion, leading to X_T(·) defined in (2.10). As T → ∞, this process converges in L² norm to X̃(·) defined in (2.25).
(iii) Given a bitrate constraint R, encode a realization of XT (·) in an optimal manner
subject to an MSE constraint as in standard source coding [7]. For example, for
ρ > 0 arbitrarily small, we may use a codebook consisting of 2T (R+ρ) waveforms
of duration T generated by independent draws from the distribution defined by
the preserved part of the spectrum in Fig. 2.10. We then use minimum distance
encoding with respect to this codebook.
2.6 Conclusion
The processing, communication, and digital storage of an analog signal requires first
representing it as a bit sequence. Hardware and modeling constraints in processing ana-
log information imply that the digital representation is obtained by first sampling the
analog waveform and then quantizing or encoding its samples. That is, the transforma-
tion from analog signals to bits involves the composition of sampling and quantization
or, more generally, lossy compression operations.
In this chapter we explored the minimal sampling rate required to attain the funda-
mental distortion limit in reconstructing a signal from its quantized samples subject to a
strict constraint on the bitrate of the system. We concluded that, when the energy of the
signal is not uniformly distributed over its spectral occupancy, the optimal signal repre-
sentation can be attained by sampling at some critical rate that is lower than the Nyquist
rate or, more generally, the Landau rate, in bounded linear sampling. This critical sam-
pling rate depends on the bitrate constraint, and converges to the Nyquist or Landau rates
in the limit of infinite bitrate. This reduction in the optimal sampling rate under finite bit
precision is made possible by designing the sampling mechanism to sample only those
parts of the signals that are not discarded due to optimal lossy compression.
References
[1] Y. C. Eldar, Sampling theory: Beyond bandlimited systems. Cambridge University Press,
2015.
[2] R. M. Gray and D. L. Neuhoff, “Quantization,” IEEE Trans. Information Theory, vol. 44,
no. 6, pp. 2325–2383, 1998.
[3] C. E. Shannon, “Communication in the presence of noise,” IRE Trans. Information Theory,
vol. 37, pp. 10–21, 1949.
[4] H. Landau, “Sampling, data transmission, and the Nyquist rate,” Proc. IEEE, vol. 55,
no. 10, pp. 1701–1706, 1967.
[5] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical J.,
vol. 27, pp. 379–423, 623–656, 1948.
[6] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[7] T. Berger, Rate-distortion theory: A mathematical basis for data compression. Prentice-
Hall, 1971.
[8] R. Walden, “Analog-to-digital converter survey and analysis,” IEEE J. Selected Areas in
Communications, vol. 17, no. 4, pp. 539–550, 1999.
[9] J. Candy, “A use of limit cycle oscillations to obtain robust analog-to-digital converters,”
IEEE Trans. Communications, vol. 22, no. 3, pp. 298–305, 1974.
[10] B. Oliver, J. Pierce, and C. Shannon, “The philosophy of PCM,” IRE Trans. Information
Theory, vol. 36, no. 11, pp. 1324–1331, 1948.
[11] D. L. Neuhoff and S. S. Pradhan, “Information rates of densely sampled data: Distributed
vector quantization and scalar quantization with transforms for Gaussian sources,” IEEE
Trans. Information Theory, vol. 59, no. 9, pp. 5641–5664, 2013.
[12] M. Matthews, “On the linear minimum-mean-squared-error estimation of an undersampled
wide-sense stationary random process,” IEEE Trans. Signal Processing, vol. 48, no. 1, pp.
272–275, 2000.
[13] D. Chan and R. Donaldson, “Optimum pre- and postfiltering of sampled signals with appli-
cation to pulse modulation and data compression systems,” IEEE Trans. Communication
Technol., vol. 19, no. 2, pp. 141–157, 1971.
[14] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE
Trans. Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[15] A. Kolmogorov, “On the Shannon theory of information transmission in the case of
continuous signals,” IRE Trans. Information Theory, vol. 2, no. 4, pp. 102–108, 1956.
[16] A. Lapidoth, “On the role of mismatch in rate distortion theory,” IEEE Trans. Information
Theory, vol. 43, no. 1, pp. 38–47, 1997.
[17] I. Kontoyiannis and R. Zamir, “Mismatched codebooks and the role of entropy coding in
lossy data compression,” IEEE Trans. Information Theory, vol. 52, no. 5, pp. 1922–1938,
2006.
[18] A. H. Zemanian, Distribution theory and transform analysis: An introduction to general-
ized functions, with applications. Courier Corporation, 1965.
[19] J. Wolf and J. Ziv, “Transmission of noisy information to a noisy receiver with minimum
distortion,” IEEE Trans. Information Theory, vol. 16, no. 4, pp. 406–411, 1970.
[20] H. Witsenhausen, “Indirect rate distortion problems,” IEEE Trans. Information Theory,
vol. 26, no. 5, pp. 518–521, 1980.
[21] H. Landau, “Necessary density conditions for sampling and interpolation of certain entire
functions,” Acta Mathematica, vol. 117, no. 1, pp. 37–52, 1967.
[22] A. Beurling and L. Carleson, The collected works of Arne Beurling: Complex analysis.
Birkhäuser, 1989, vol. 1.
[23] F. J. Beutler, “Sampling theorems and bases in a Hilbert space,” Information and Control,
vol. 4, nos. 2–3, pp. 97–117, 1961.
[24] A. Kipnis, Y. C. Eldar, and A. J. Goldsmith, “Fundamental distortion limits of analog-
to-digital compression,” IEEE Trans. Information Theory, vol. 64, no. 9, pp. 6013–6033,
2018.
[25] W. Bennett, “Statistics of regenerative digital transmission,” Bell Labs Technical J., vol. 37,
no. 6, pp. 1501–1542, 1958.
[26] A. Kipnis, A. J. Goldsmith, and Y. C. Eldar, “The distortion rate function of cyclostationary
Gaussian processes,” IEEE Trans. Information Theory, vol. 64, no. 5, pp. 3810–3824, 2018.
[27] A. Kipnis, A. J. Goldsmith, Y. C. Eldar, and T. Weissman, “Distortion rate function of sub-
Nyquist sampled Gaussian sources,” IEEE Trans. Information Theory, vol. 62, no. 1, pp.
401–429, 2016.
[28] A. Kipnis, “Fundamental distortion limits of analog-to-digital compression,” Ph.D. disser-
tation, Stanford University, 2018.
[29] A. Kipnis, A. J. Goldsmith, and Y. C. Eldar, “The distortion-rate function of sampled Wiener processes,” IEEE Trans. Information Theory, vol. 65, no. 1, pp. 482–499, 2019.
[30] A. Kipnis, G. Reeves, and Y. C. Eldar, “Single letter formulas for quantized compressed
sensing with Gaussian codebooks,” in 2018 IEEE International Symposium on Information
Theory (ISIT), 2018, pp. 71–75.
[31] A. Kipnis, G. Reeves, Y. C. Eldar, and A. J. Goldsmith, “Compressed sensing under opti-
mal quantization,” in 2017 IEEE International Symposium on Information Theory (ISIT),
2017, pp. 2148–2152.
3 Compressed Sensing via
Compression Codes
Shirin Jalali and H. Vincent Poor
Summary
Data acquisition refers to capturing signals that lie in the physical world around us and
converting them to processable digital signals. It is a basic operation that takes place
before any other data processing can begin to happen. Audio recorders, cameras, X-ray
computed tomography (CT) scanners, and magnetic resonance imaging (MRI) machines
are some examples of data-acquisition devices that employ different mechanisms to
perform this crucial step.
Figure 3.1 Block diagram of a CS measurement system. The signal x is measured as y ∈ R^m and later recovered as x̂.
For decades the Nyquist–Shannon sampling theorem served as the theoretical foun-
dation of most data-acquisition systems. The sampling theorem states that for a band-
limited signal with a maximum frequency fc , a sampling rate of 2 fc is enough for perfect
reconstruction. CS is arguably one of the most disruptive ideas conceived after this fun-
damental theorem [1, 2]. CS proves that, for sparse signals, the sampling rate can in
fact be substantially below the rate required by the Nyquist–Shannon sampling theorem
while perfect reconstruction still being possible.
CS is a special data-acquisition technique that is applicable to various measurement
devices that can be described as a noisy linear measurement system. Let x ∈ Rn denote
the desired signal that is to be measured. Then, the measured signal y ∈ Rm in such
systems can be written as
y = Ax + z, (3.1)
where A ∈ Rm×n and z ∈ Rm denote the sensing matrix and the measurement noise,
respectively. A reconstruction algorithm then recovers x from measurements y, while
having access to sensing matrix A. Fig. 3.1 shows a block diagram representation of the
described measurement system.
For x to be recoverable from y = Ax + z, for years, the conventional wisdom has been
that, since there are n unknown variables, the number of measurements m should be
equal to or larger than n. However, in recent years, researchers have observed that, since
most signals we are interested in acquiring do not behave like random noise, and are
typically “structured,” therefore, employing our knowledge about their structure might
enable us to recover them, even if the number of measurements (m) is smaller than the
ambient dimension of the signal (n). CS theory proves that this is in fact the case, at least
for signals that have sparse representations in some transform domain.
More formally, the problem of CS can be stated as follows. Consider a class of sig-
nals denoted by a set Q, which is a compact subset of Rn . A signal x ∈ Q is measured
by m noisy linear projections as y = Ax + z. A CS recovery (reconstruction) algorithm
estimates the signal x from measurements y, while having access to the sensing matrix
A and knowing the set Q.
While the original idea of CS was developed for sparse or approximately sparse
signals, the results were soon extended to other types of structure, such as block-
sparsity and low-rankness as well. (See [3–26] for some examples of this line of work.)
Despite such extensions, for one-dimensional signals, and to a great extent for higher-
dimensional signals such as images and video files, the main focus has still remained on
sparsity and its extensions. For two-dimensional signals, in addition to sparse signals,
there has also been extensive study of low-rank matrices.
The main reasons for such a focus on sparsity have been two-fold. First, sparsity is
a relevant structure that shows up in many signals of interest. For instance, images are
known to have (approximately) sparse wavelet representations. In most applications of
CS in wireless communications, the desired coefficients that are to be estimated are
sparse. The second reason for focusing on sparsity has been theoretical results that show that the ℓ0-“norm” can be replaced with the ℓ1-norm. The ℓ0-“norm” of a signal x ∈ R^n counts the number of non-zero elements of x, and hence serves as a measure of its sparsity. For solving y = Ax, when x is known to be sparse, a natural optimization is the following:
min_x ‖x‖_0   subject to   Ax = y.
While sparsity is an important and attractive structure that is present in many signals of
interest, most such signals, including images and video files, in addition to exhibiting
sparsity, follow structures that are beyond sparsity. A CS recovery algorithm that takes
advantage of the full structure that is present in a signal, potentially, outperforms those
that merely rely on a simple structure. In other words, such an algorithm would poten-
tially require fewer measurements, or, for an equal number of measurements, present a
better reconstruction quality.
To have high-performance CS recovery methods, it is essential to design algorithms
that take advantage of the full structure that is present in a signal. To address this issue,
researchers have proposed different potential solutions. One line of work has been based
3.3 Definitions
3.3.1 Notation
For x ∈ R, δ_x denotes the Dirac measure with an atom at x. Consider x ∈ R and b ∈ N^+. Every real number can be written as x = ⌊x⌋ + x_q, where ⌊x⌋ denotes the largest integer smaller than or equal to x and x_q ≜ x − ⌊x⌋. Since x − 1 < ⌊x⌋ ≤ x, x_q ∈ [0, 1). Let 0.a_1 a_2 . . . denote the binary expansion of x_q. That is,
x_q = Σ_{i=1}^∞ a_i 2^{−i}.
Then, define the b-bit quantized version of x as
[x]_b ≜ ⌊x⌋ + Σ_{i=1}^b a_i 2^{−i}.
Using this definition,
x − 2^{−b} < [x]_b ≤ x.
For a vector x^n ∈ R^n, let [x^n]_b = ([x_1]_b, . . . , [x_n]_b).
Given two vectors x and y, both in R^n, ⟨x, y⟩ = Σ_{i=1}^n x_i y_i denotes their inner product. Throughout this chapter, ln and log denote the natural logarithm and logarithm to base 2, respectively.
Sets are denoted by calligraphic letters and the size of set A is denoted by |A|. The ℓ0-“norm” of x ∈ R^n is defined as ‖x‖_0 = |{i : x_i ≠ 0}|.
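The b-bit quantizer [·]_b above reduces to keeping b binary digits of the fractional part, i.e., [x]_b = ⌊x 2^b⌋/2^b. A minimal sketch (ours; the function name is hypothetical) that implements it and checks the bound x − 2^{−b} < [x]_b ≤ x:

import numpy as np

def quantize_b(x, b):
    # b-bit quantization [x]_b: keep b binary digits of the fractional part,
    # equivalently floor(x * 2^b) / 2^b, so that x - 2^(-b) < [x]_b <= x.
    x = np.asarray(x, dtype=float)
    return np.floor(x * 2.0 ** b) / 2.0 ** b

x = np.array([3.14159, -0.625, 2.0])
for b in (1, 3, 8):
    xb = quantize_b(x, b)
    print(b, xb, bool(np.all((x - 2.0 ** (-b) < xb) & (xb <= x))))   # the bound always holds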
3.3.2 Compression
Data compression is about efficiently storing an acquired signal, and hence is a step that
happens after the data have been collected. The main goal of data-compression codes is
to take advantage of the structure of the data and represent it as efficiently as possible, by
minimizing the required number of bits. Data-compression algorithms are either lossy
or lossless. In this part, we briefly review the definition of a lossy compression scheme
for real-valued signals in Rn . (Note that lossless compression of real-valued signals is
not feasible.)
Consider Q, a compact subset of Rn . A fixed-rate compression code for set Q is
defined via encoding and decoding mappings (E, D), where
E : Rn → {1, . . . , 2r }
and
D : {1, . . . , 2r } → Rn .
Figure 3.2 The encoder maps the signal x to r bits, E(x), and, later, the decoder maps the coded bits to x̂.
Here r denotes the rate of the code. The codebook of such a code is defined as
C = {D(E(x)) : x ∈ Q}.
Note that C is always a finite set with at most 2^r distinct members. Fig. 3.2 shows a block diagram representation of the described compression code. The compression code defined by (E, D) encodes a signal x ∈ Q into r bits as E(x), and decodes the encoded bits to x̂ = D(E(x)). The performance of this code is measured in terms of its rate r and its induced distortion δ, defined as
δ = sup_{x∈Q} ‖x − x̂‖_2.
Consider a family of compression codes {(E_r, D_r) : r > 0} for the set Q indexed by their rate r. The (deterministic) distortion-rate function of this family of compression codes is defined as
δ(r) = sup_{x∈Q} ‖x − D_r(E_r(x))‖_2.
In other words, δ(r) denotes the distortion of the code operating at rate r. The corresponding rate-distortion function of this family of compression codes is defined as
r(δ) = inf{r : δ(r) ≤ δ}.
Finally, the α-dimension of a family of compression codes {(E_r, D_r) : r > 0} is defined as [42]
α ≜ lim sup_{δ→0} r(δ) / log(1/δ).   (3.3)
This dimension, as shown later, serves as a measure of structuredness and plays a key
role in understanding the performance of compression-based CS schemes.
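To make these definitions concrete, here is a toy fixed-rate code of our own construction (not from the chapter) for exactly k-sparse signals with entries in [−1, 1]: the encoder stores the k support indices and a b-bit quantization of each value, so its rate is roughly r ≈ k(log₂ n + b) and its distortion is δ ≤ √k 2^{−b}. Consequently r(δ) ≈ k log(1/δ) + k log₂ n, and the α-dimension of this family is k.

import numpy as np

def encode_sparse(x, k, b):
    # Encoder E_r: store the k support indices and a b-bit uniform quantization of each
    # value in [-1, 1]; about k * (log2(n) + b) bits in total.
    idx = np.argsort(np.abs(x))[-k:]
    cells = np.clip(np.floor((x[idx] + 1.0) / 2.0 * 2 ** b), 0, 2 ** b - 1).astype(int)
    return idx, cells

def decode_sparse(code, n, b):
    # Decoder D_r: place the midpoint of each quantization cell back on the stored support.
    idx, cells = code
    xhat = np.zeros(n)
    xhat[idx] = (cells + 0.5) / 2 ** b * 2.0 - 1.0
    return xhat

n, k = 64, 3
rng = np.random.default_rng(1)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.uniform(-1, 1, k)
for b in (4, 8, 16):
    xhat = decode_sparse(encode_sparse(x, k, b), n, b)
    r = k * (np.ceil(np.log2(n)) + b)                 # rate in bits
    delta = np.sqrt(k) * 2.0 ** (-b)                  # worst-case distortion of this code
    print(b, np.linalg.norm(x - xhat), r / np.log2(1 / delta))   # the ratio tends to alpha = k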
Given the measurements y and a rate-r compression code with codebook C_r, the compressible signal pursuit (CSP) optimization recovers x as
x̂ = arg min_{c∈C_r} ‖y − Ac‖_2².   (3.4)
The following theorem considers the noise-free regime where z = 0, and connects the number of measurements m and the reconstruction quality of CSP, ‖x − x̂‖_2, to the properties of the compression code, i.e., its rate and its distortion.
theorem 3.1 Consider a compact set Q ⊂ R^n with a rate-r compression code (E_r, D_r) operating at distortion δ. Let A ∈ R^{m×n}, where the A_{i,j} are independently and identically distributed (i.i.d.) as N(0, 1). For x ∈ Q, let x̂ denote the reconstruction of x from y = Ax generated by the CSP optimization employing the code (E_r, D_r). Then,
‖x − x̂‖_2 ≤ δ √((1 + τ_1)/(1 − τ_2)),
with probability at least
1 − 2^r e^{(m/2)(τ_2 + ln(1−τ_2))} − e^{−(m/2)(τ_1 − ln(1+τ_1))},
where τ_1 > 0 and τ_2 ∈ (0, 1) are free parameters.
Proof  Let x̃ = D_r(E_r(x)) and x̂ = arg min_{c∈C_r} ‖y − Ac‖_2². Since x̂ is the minimizer of ‖y − Ac‖_2² over all c ∈ C_r, and since x̃ ∈ C_r, it follows that ‖y − Ax̂‖_2 ≤ ‖y − Ax̃‖_2. That is,
‖Ax − Ax̂‖_2 ≤ ‖Ax − Ax̃‖_2.   (3.5)
Given x, define the set U of all possible normalized error vectors as
U ≜ { (x − c)/‖x − c‖_2 : c ∈ C_r }.
Note that, since |C_r| ≤ 2^r and x is fixed, |U| ≤ 2^r as well. For τ_1 > 0 and τ_2 ∈ (0, 1), define the events E_1 and E_2 as
E_1 ≜ { ‖A(x − x̃)‖_2² ≤ m(1 + τ_1) ‖x − x̃‖_2² }
and
E_2 ≜ { ‖Au‖_2² ≥ m(1 − τ_2) for all u ∈ U },
respectively.
For a fixed u ∈ U, since the entries of A are i.i.d. N(0, 1), Au is a vector of m i.i.d. standard normal random variables. Hence, by Lemma 2 in [39],
P(‖Au‖_2² ≥ m(1 + τ_1)) ≤ e^{−(m/2)(τ_1 − ln(1+τ_1))}   (3.6)
and
P(‖Au‖_2² ≤ m(1 − τ_2)) ≤ e^{(m/2)(τ_2 + ln(1−τ_2))}.   (3.7)
Since x̃ ∈ C_r,
(x − x̃)/‖x − x̃‖_2 ∈ U
and, from (3.6),
P(E_1^c) ≤ e^{−(m/2)(τ_1 − ln(1+τ_1))}.
Moreover, since |U| ≤ 2^r, by the union bound, we have
P(E_2^c) ≤ 2^r e^{(m/2)(τ_2 + ln(1−τ_2))}.   (3.8)
Hence, by the union bound,
P(E_1 ∩ E_2) = 1 − P(E_1^c ∪ E_2^c) ≥ 1 − 2^r e^{(m/2)(τ_2 + ln(1−τ_2))} − e^{−(m/2)(τ_1 − ln(1+τ_1))}.
By definition, x̃ is the reconstruction of x using the code (E_r, D_r). Hence, ‖x − x̃‖_2 ≤ δ and, conditioned on E_1,
‖y − Ax̃‖_2 ≤ δ √(m(1 + τ_1)).   (3.9)
On the other hand, conditioned on E_2,
‖y − Ax̂‖_2 ≥ ‖x − x̂‖_2 √(m(1 − τ_2)).   (3.10)
Combining (3.5), (3.9), and (3.10), conditioned on E_1 ∩ E_2, we have
‖x − x̂‖_2 ≤ δ √((1 + τ_1)/(1 − τ_2)).
where θ = 2e^{−(1+ε)/η}.
Proof  In Theorem 3.1, let τ_1 = 3 and τ_2 = 1 − (eδ)^{2(1+ε)/η}. For τ_1 = 3, 0.5(τ_1 − ln(1 + τ_1)) > 0.8. For τ_2 = 1 − (eδ)^{2(1+ε)/η},
r ln 2 + (m/2)(τ_2 + ln(1 − τ_2))
  = r ln 2 + (m/2)(1 − (eδ)^{2(1+ε)/η} + (2(1+ε)/η) ln(eδ))
  ≤ r (ln 2 + η/(2 log(1/(eδ))) − (1 + ε) ln 2)
  ≤ r ln 2 (1 + ε/2 − (1 + ε))   (a)
  ≤ −0.3r,   (3.11)
where (a) is due to the fact that
η/log(1/(eδ)) = η ln 2 / ln(1/(eδ)) < ε ln 2.
Finally,
δ √((1 + τ_1)/(1 − τ_2)) = δ √(4/(eδ)^{2(1+ε)/η}) = θ δ^{1−(1+ε)/η},   (3.12)
where θ = 2e^{−(1+ε)/η}.
Corollary 3.1 implies that, using a family of compression codes {(E_r, D_r) : r > 0}, as r → ∞ and δ → 0, the achieved reconstruction error converges to zero (as lim_{δ→0} θδ^{1−(1+ε)/η} = 0), while the number of measurements converges to ηα, where α is the α-dimension of the compression algorithms and η > 1 is a free parameter. In other words, as long as the number of measurements m is larger than α, using an appropriate compression code, CSP recovers x.
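The CSP optimization itself is just an exhaustive search over the codebook. The sketch below (ours) runs it over a tiny hypothetical codebook of 1-sparse vectors with quantized amplitudes, using m < n noise-free Gaussian measurements; a realistic codebook has 2^r elements, which is exactly why the efficient C-GD algorithm of the next section is needed.

import numpy as np

def csp(y, A, codebook):
    # Compressible signal pursuit (3.4): exhaustive search for the codeword that best explains y.
    errs = [np.linalg.norm(y - A @ c) for c in codebook]
    return codebook[int(np.argmin(errs))]

# Tiny hypothetical codebook: 1-sparse vectors in R^n with amplitudes on a 4-level grid.
n, m = 8, 3
levels = (-1.0, -0.5, 0.5, 1.0)
codebook = []
for i in range(n):
    for a in levels:
        c = np.zeros(n)
        c[i] = a
        codebook.append(c)

rng = np.random.default_rng(2)
A = rng.standard_normal((m, n))
x = codebook[10]                  # the true signal is itself a codeword
y = A @ x                         # noise-free measurements with m < n
xhat = csp(y, A, codebook)
print(np.linalg.norm(x - xhat))   # exact recovery (almost surely over the draw of A)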
Theorem 3.1 and its corollary both ignore the effect of noise and consider noise-free measurements. In practice, noise is always present, so it is important to understand its effect on the performance. For instance, the following theorem characterizes the effect of a deterministic noise with a bounded ℓ2-norm on the performance of CSP.
theorem 3.2 (Theorem 2 in [42]) Consider a compression code (E, D) operating at rate r and distortion δ on the set Q ⊂ R^n. For x ∈ Q and y = Ax + z with ‖z‖_2 ≤ ζ, let x̂ denote the reconstruction of x from y offered by the CSP optimization employing the code (E, D). Then,
‖x − x̂‖_2 ≤ δ √((1 + τ_1)/(1 − τ_2)) + 2ζ/√((1 − τ_2) d),
with probability exceeding
1 − 2^r e^{(d/2)(τ_2 + ln(1−τ_2))} − e^{−(d/2)(τ_1 − ln(1+τ_1))},
where τ_1 > 0 and τ_2 ∈ (0, 1) are free parameters.
CSP optimization is a discrete optimization that minimizes a convex cost function
over a discrete set of exponential size. Hence, solving CSP in its original form is com-
putationally prohibitive. In the next section, we study this critical issue and review an
efficient algorithm with low computational complexity that is designed to approximate
the solution of the CSP optimization.
As discussed in the previous section, CSP is based on an exhaustive search over expo-
nentially many codewords and as a result is computationally infeasible. Compression-
based gradient descent (C-GD) is a computationally efficient and theoretically
analyzable approach to approximating the solution of CSP. The C-GD algorithm
[44, 45], inspired by the projected gradient descent (PGD) algorithm [46], works as
follows. Start from some x0 ∈ Rn . For t = 1, 2, . . ., proceed as follows:
s^{t+1} = x^t + η A^T (y − A x^t),   (3.13)
and
x^{t+1} = P_{C_r}(s^{t+1}),   (3.14)
where P_{C_r}(·) denotes projection onto the set of codewords. In other words, for x ∈ R^n,
P_{C_r}(x) = arg min_{c∈C_r} ‖x − c‖_2.   (3.15)
Here index t denotes the iteration number and η ∈ R denotes the step size. Each iteration
of this algorithm involves performing two operations. The first step is moving in the
direction of the negative of the gradient of y − Ax22 with respect to x to find solutions
that are closer to the y = Ax hyperplane. The second step, i.e., the projection step, ensures
that the estimate C-GD obtains belongs to the codebook and hence conforms with the
source structure. The following theorem characterizes the convergence performance of
the described C-GD algorithm.
theorem 3.3 (Theorem 2 in [45]) Consider $x \in \mathbb{R}^n$. Let $y = Ax + z$ and assume that the entries of the sensing matrix $A$ are i.i.d. $\mathcal{N}(0, 1)$ and that $z_i$, $i = 1, \ldots, m$, are i.i.d. $\mathcal{N}(0, \sigma_z^2)$. Let
$$\eta = \frac{1}{m}$$
and define $\tilde{x} = P_{\mathcal{C}_r}(x)$, where $P_{\mathcal{C}_r}(\cdot)$ is defined in (3.15). Given $\epsilon > 0$, for $m \geq 80r(1+\epsilon)$, with probability larger than $1 - e^{-\frac{m}{2}} - 2^{-40r} - 2^{-2r+0.5} - e^{-0.15m}$, we have
$$\|x^{t+1} - \tilde{x}\|_2 \leq 0.9\,\|x^{t} - \tilde{x}\|_2 + 2\Big(\sqrt{2} + \sqrt{\frac{n}{m}}\Big)\delta + \sigma_z\sqrt{\frac{32(1+\epsilon)r}{m}}, \tag{3.16}$$
for $t = 0, 1, 2, \ldots$.
As will be clear in the proof, the choice of 0.9 in Theorem 3.3 is arbitrary, and the
result could be derived for any positive value strictly smaller than one. We present the
result for this choice as it clarifies the statement of the result and its proof.
As stated earlier, each iteration of the C-GD involves two steps: (i) moving in the
direction of the gradient and (ii) projection of the result onto the set of codewords of
the compression code. The first step is straightforward and requires two matrix–vector
multiplications. For the second step, optimally solving (3.15) might be challenging.
However, for any “good” compression code, it is reasonable to assume that employ-
ing the code’s encoder and decoder consecutively well approximates this operation.
In fact, it can be proved that, if $\|P_{\mathcal{C}_r}(x) - \mathcal{D}_r(\mathcal{E}_r(x))\|_2$ is uniformly bounded by some small constant for all $x$, then replacing $P_{\mathcal{C}_r}(\cdot)$ with $\mathcal{D}_r(\mathcal{E}_r(\cdot))$ only adds that constant to the error bound in (3.16). (Refer to Theorem 3 in [45].) Under this simplification, the algorithm's two steps can be summarized as
$$x^{t+1} = \mathcal{D}_r\big(\mathcal{E}_r\big(x^t + \eta A^T(y - Ax^t)\big)\big).$$
Proof By definition, $x^{t+1} = P_{\mathcal{C}_r}(s^{t+1})$ and $\tilde{x} = P_{\mathcal{C}_r}(x)$. Hence, since $x^{t+1}$ is the closest vector in $\mathcal{C}_r$ to $s^{t+1}$ and $\tilde{x}$ is also in $\mathcal{C}_r$, we have $\|x^{t+1} - s^{t+1}\|_2^2 \leq \|\tilde{x} - s^{t+1}\|_2^2$. Equivalently, $\|(x^{t+1} - \tilde{x}) - (s^{t+1} - \tilde{x})\|_2^2 \leq \|\tilde{x} - s^{t+1}\|_2^2$, or
$$\|x^{t+1} - \tilde{x}\|_2^2 \leq 2\,\langle x^{t+1} - \tilde{x},\, s^{t+1} - \tilde{x}\rangle. \tag{3.17}$$
In the following, $\theta^t$ and $\hat{\theta}^t$ denote $x^t - \tilde{x}$ and its normalized version $\theta^t/\|\theta^t\|_2$, respectively.
$$P\big(\langle\hat{\theta}^{t+1}, g\rangle^2 \geq 1 + \tau_2\big) \leq e^{-\frac{1}{2}(\tau_2 - \ln(1+\tau_2))}. \tag{3.25}$$
Hence, by the union bound,
$$P(\mathcal{E}_4^c) \leq |\mathcal{F}|\, e^{-\frac{1}{2}(\tau_2 - \ln(1+\tau_2))} \leq 2^{2r}\, e^{-\frac{1}{2}(\tau_2 - \ln(1+\tau_2))} \leq 2^{2r - \frac{\tau_2}{2}}, \tag{3.26}$$
where the last inequality holds for all $\tau_2 > 7$. Setting $\tau_2 = 4(1+\epsilon)r - 1$, where $\epsilon > 0$, ensures that $P(\mathcal{E}_4^c) \leq 2^{-2r+0.5}$. Setting $\tau_1 = 1$,
$$P(\mathcal{E}_3^c) \leq e^{-0.15m}. \tag{3.27}$$
Now using the derived bounds and noting that $\|x - \tilde{x}\|_2 \leq \delta$, conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4$, we have
$$2\big(\langle\hat{\theta}^{t+1}, \hat{\theta}^{t}\rangle - \eta\,\langle A\hat{\theta}^{t+1}, A\hat{\theta}^{t}\rangle\big) \leq 0.9, \tag{3.28}$$
$$\frac{2}{m}\,\big(\sigma_{\max}(A)\big)^2\, \|x - \tilde{x}\|_2^2 \leq \frac{2}{m}\big(\sqrt{2m} + \sqrt{n}\big)^2\delta^2 = 2\Big(\sqrt{2} + \sqrt{\frac{n}{m}}\Big)^2\delta^2, \tag{3.29}$$
and
$$2\eta\,\langle\hat{\theta}^{t+1}, A^T z\rangle = \frac{2}{m}\,\langle\hat{\theta}^{t+1}, A^T z\rangle \leq \frac{2\sigma_z}{m}\sqrt{(1+\tau_1)\,m\,(1+\tau_2)} = \frac{2\sigma_z}{m}\sqrt{8m(1+\epsilon)r} = \sigma_z\sqrt{\frac{32(1+\epsilon)r}{m}}. \tag{3.30}$$
Hence, combining (3.28), (3.29), and (3.30) yields the desired error bound. Finally, note that, by the union bound,
$$P(\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4) \geq 1 - \sum_{i=1}^{4} P(\mathcal{E}_i^c) \geq 1 - e^{-\frac{m}{2}} - 2^{-40r} - 2^{-2r+0.5} - e^{-0.15m}.$$
In this section, we consider three well-studied classes of signals, namely, (i) sparse signals, (ii) piecewise polynomials, and (iii) natural images. For each class, we explore the implications of the main results discussed so far. For the first two classes, we consider simple families of compression codes to
study the performance of the compression-based CS methods. For images, on the other
hand, we use standard compression codes such as JPEG and JPEG2000. These examples
enable us to shed light on different aspects of the CSP optimization and the C-GD algo-
rithm, such as (i) their required number of measurements, (ii) the reconstruction error
in a noiseless setting, (iii) the reconstruction error in the presence of noise, and (iv) the
convergence rate of the C-GD algorithm.
Consider the following compression code for signals in $\Gamma^n_k(\rho)$, the set of vectors in $\mathbb{R}^n$ with at most $k$ non-zero entries, each bounded in magnitude by $\rho$. Given $x \in \Gamma^n_k(\rho)$, the encoder describes $x$ by first specifying the locations of its (at most $k$) non-zero entries and then the values of those entries, each quantized to $b$ bits. The described code therefore spends at most $r \leq k(\log n + \log\rho + b + 2)$ bits per signal. The distortion of this code satisfies
$$\delta = \sup_{x\in\Gamma^n_k(\rho)} \|x - \hat{x}\|_2 = \sup_{x\in\Gamma^n_k(\rho)} \sqrt{\sum_{i:\, x_i \neq 0} (x_i - \hat{x}_i)^2} \leq \sqrt{\sum_{i:\, x_i \neq 0} 2^{-2b}} = 2^{-b}\sqrt{k}. \tag{3.34}$$
The $\alpha$-dimension of this compression code can be bounded as
$$\alpha = \lim_{\delta\to 0} \frac{r(\delta)}{\log(1/\delta)} \leq \lim_{b\to\infty} \frac{k(b + \log\rho + \log n + 2)}{b - \frac{1}{2}\log k} = k. \tag{3.35}$$
It can in fact be shown that the $\alpha$-dimension is equal to $k$.
Consider using this specific compression algorithm in the C-GD framework. The
resulting algorithm is very similar to the well-known iterative hard thresholding (IHT)
algorithm [33]. The IHT algorithm, a CS recovery algorithm for sparse signals, is an
iterative algorithm. Its first step, as with C-GD, is moving in the opposite direction of
the gradient of the cost function. At the projection step, IHT keeps the k largest elements
and sets the rest to zero. The C-GD algorithm, on the other hand, after moving in the opposite direction to the gradient, finds the codeword in the described code that is closest to the result. For the special code described earlier, this is equivalent to first finding the $k$ largest entries and setting the rest to zero. Then, each remaining non-zero entry $x_i$ is first clipped to $[-1, 1]$ as
$$x_i\,\mathbb{1}_{x_i\in(-1,1)} + \mathrm{sign}(x_i)\,\mathbb{1}_{|x_i|\geq 1},$$
where $\mathbb{1}_A$ is the indicator of event $A$, and then quantized to $b + 1$ bits.
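As an illustration, the following is a small Python sketch of this projection step for the code just described (with $\rho = 1$); the uniform quantizer and the function name are assumptions made for this example.

```python
import numpy as np

def project_sparse_code(x, k, b):
    """Approximate projection onto the codebook of the k-sparse code:
    keep the k largest-magnitude entries, clip them to [-1, 1], and
    quantize each surviving entry with a uniform quantizer of resolution 2^{-b}.
    (This plays the role of D_r(E_r(.)) discussed above.)"""
    z = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]        # locations of the k largest entries
    vals = np.clip(x[idx], -1.0, 1.0)       # clipping step
    step = 2.0 ** (-b)                      # quantization resolution
    z[idx] = np.round(vals / step) * step   # uniform quantization
    return z

x = np.array([0.03, -0.9, 0.41, 1.7, -0.02])
print(project_sparse_code(x, k=2, b=4))
```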
Consider $x \in \Gamma^n_k(1)$ and let $y = Ax + z$, where $A_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1/n)$ and $z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_z^2)$. Let $\tilde{x}$ denote the projection of $x$, as described earlier. The following corollary of Theorem 3.3 characterizes the convergence performance of C-GD applied to $y$ when using this code.
corollary 3.2 (Corollary 2 of [45]) Given $\gamma > 0$, set $b = \gamma\log n + \frac{1}{2}\log k$ bits. Also, set $\eta = n/m$. Then, given $\epsilon > 0$, for $m \geq 80\, r(1+\epsilon)$, where $r = (1+\gamma)k\log n + (k/2)\log k + 2k$,
$$\frac{1}{\sqrt{n}}\|x^{t+1} - \tilde{x}\|_2 \leq \frac{0.9}{\sqrt{n}}\|x^{t} - \tilde{x}\|_2 + 2\Big(\sqrt{2} + \sqrt{\frac{n}{m}}\Big) n^{-1/2-\gamma} + \sigma_z\sqrt{\frac{8(1+\epsilon)r}{m}}, \tag{3.36}$$
for $t = 0, 1, \ldots$, with probability larger than $1 - 2^{-2r}$.
Proof Consider $u \in [-1, 1]$. Quantizing $u$ by a uniform quantizer that uses $b + 1$ bits yields $\hat{u}$, which satisfies $|u - \hat{u}| < 2^{-b}$. Therefore, using $b + 1$ bits to quantize each non-zero element of $x \in \Gamma^n_k$ yields a code which achieves distortion $\delta \leq 2^{-b}\sqrt{k}$. Hence, for $b + 1 = \gamma\log n + \frac{1}{2}\log k + 1$,
$$\delta \leq n^{-\gamma}.$$
On the other hand, the code rate $r$ can be upper-bounded as
$$r \leq \log\Big(\sum_{i=0}^{k}\binom{n}{i}\Big) + k(b+1) \leq \log n^{k+1} + k(b+1) = (k+1)\log n + k(b+1),$$
where the last inequality holds for all $n$ large enough. The rest of the proof follows directly from inserting these numbers into the statement of Theorem 3.3.
In the noiseless setting, according to this corollary, if the number of measurements $m$ satisfies $m = \Omega(k\log n)$, then, using (3.36), the final reconstruction error can be bounded as
$$\lim_{t\to\infty} \frac{1}{\sqrt{n}}\|x^{t+1} - \tilde{x}\|_2 = O\Big(\Big(\sqrt{2} + \sqrt{\frac{n}{m}}\Big) n^{-1/2-\gamma}\Big),$$
or
$$\lim_{t\to\infty} \frac{1}{\sqrt{n}}\|x^{t+1} - \tilde{x}\|_2 = O\Big(\frac{n^{\frac{1}{2}-\gamma}}{\sqrt{m}}\Big).$$
Hence, if $\gamma > 0.5$, the error vanishes as the ambient dimension of the signal space grows to infinity. There are two key observations regarding the number of measurements used by C-GD and CSP that one can make.
1. The $\alpha$-dimension of the described compression code is equal to $k$. This implies that, using slightly more than $k$ measurements, CSP is able to recover the signal almost accurately from its under-determined linear measurements. However, solving CSP involves an exhaustive search over exponentially many codewords. On the other hand, the computationally efficient C-GD method employs $O(k\log n)$ measurements, which is more than what is required by CSP by a factor of $\log n$.
2. The results derived for C-GD are slightly weaker than those derived for IHT, as (i) the C-GD algorithm's reconstruction is not exact, even in the noiseless setting, and (ii) C-GD requires $O(k\log n)$ measurements, compared with $O(k\log(n/k))$ required by IHT.
In the case where the measurements are distorted by additive white Gaussian noise, Corollary 3.2 states that there will be an additive distortion term of order
$$O\Big(\sigma_z\sqrt{\frac{k\log n}{m}}\Big).$$
While no similar result exists on the performance of the IHT algorithm in the presence of stochastic noise, the noise sensitivity of C-GD is comparable to that of algorithms based on convex optimization, such as the LASSO and the Dantzig selector [48, 49].
Set
$$r = \big((\gamma + 0.5)(N+1)(Q+1) + Q\big)\log n + (N+1)(Q+1)\big(\log(N+1) + 1\big) + 1,$$
which yields
$$\delta \leq n^{-\gamma}.$$
Therefore, for $\gamma > 0.5$, the reconstruction distortion converges to zero as $n$ grows to infinity.
3. In the case of noisy measurements, the effect of the noise on the reconstruction error behaves as
$$O\Big(\sigma_z\sqrt{(Q+1)(N+1)\frac{\log n}{m}}\Big).$$
where the projection operation $P_{\mathcal{C}_r}$ is defined in (3.15). The optimization described in (3.41) is a scalar optimization problem. Derivative-free methods, such as the Nelder–Mead method, also known as the downhill simplex method [50], can be employed to solve it.
Tuning the parameter $r$ is a more challenging task, which can be described as a model-selection problem. (See Chapter 7 of [51].) Standard techniques such as multi-fold cross-validation address this issue. However, finding a low-complexity method that works well for the C-GD algorithm remains an open problem.
Finally, controlling the number of iterations in the C-GD method can be done using standard stopping rules, such as limiting the maximum number of iterations, or lower-bounding the reduction of the squared error per iteration, i.e., $\|x^{t+1} - x^{t}\|_2^2$.
Using these parameter-selection methods results in an algorithm described below as
Algorithm 3.1. Here, as usual, E and D denote the encoder and decoder of the com-
pression code, respectively. K1,max denotes the maximum number of iterations and T
denotes the lower bound in the reduction mentioned earlier. In all the simulation results
cited later, K1,max = 50 and T = 0.001.
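The following is a minimal Python sketch of a loop of this kind, with the two stopping rules just mentioned (a cap on the number of iterations and a threshold T on the per-iteration reduction of the squared error). The `encode`/`decode` arguments stand in for the compression code's encoder and decoder, and the hard-thresholding pair used in the example at the end is an illustrative stand-in, not one of the codes used in [45].

```python
import numpy as np

def c_gd(y, A, encode, decode, eta, K1_max=50, T=1e-3, x0=None):
    """Compression-based gradient descent (C-GD) sketch.

    Each iteration takes a gradient step on ||y - A x||_2^2 and then
    approximately projects onto the codebook by applying the compression
    code's encoder and decoder back to back.
    """
    n = A.shape[1]
    x = np.zeros(n) if x0 is None else x0.copy()
    prev_err = np.inf
    for _ in range(K1_max):
        s = x + eta * A.T @ (y - A @ x)      # gradient step, cf. (3.13)
        x = decode(encode(s))                # approximate projection, cf. (3.14)
        err = np.linalg.norm(y - A @ x) ** 2
        if prev_err - err < T:               # stop when progress stalls
            break
        prev_err = err
    return x

# Example: hard thresholding (keep the k largest entries) as a stand-in code.
rng = np.random.default_rng(1)
n, m, k = 200, 80, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[:k] = rng.standard_normal(k)
y = A @ x_true
keep_k = lambda v: np.where(np.abs(v) >= np.sort(np.abs(v))[-k], v, 0.0)
x_hat = c_gd(y, A, encode=lambda v: v, decode=keep_k, eta=0.5)
print(np.linalg.norm(x_hat - x_true))
```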
Tables 3.1 and 3.2 show some simulation results reported in [45], for noiseless and
noisy measurements, respectively. In both cases the measurement matrix is a random
partial-Fourier matrix, generated by randomly choosing rows from a Fourier matrix.
Compressed Sensing via Compression Codes 91
The derived results are compared with a state-of-the-art recovery algorithm for partial-
Fourier matrices, non-local low-rank regularization (NLR-CS) [52]. In both tables, when
compression algorithm a is employed in the platform of C-GD, the resulting C-GD algo-
rithm is referred to as a-GD. For instance, C-GD that uses JPEG compression code is
referred to as JPEG-GD. The results are derived using the C-GD algorithm applied to
a number of standard test images. In these results, standard JPEG and JPEG2000 codes
available in the Matlab-R2016b image- and video-processing package are used as com-
pression codes for images. Boat, House, and Barbara test images are the standard images
available in the Matlab image toolbox.
The mean square error (MSE) between $x \in \mathbb{R}^n$ and $\hat{x} \in \mathbb{R}^n$ is measured as follows:
$$\mathrm{MSE} = \frac{1}{n}\|x - \hat{x}\|_2^2. \tag{3.42}$$
Then, the corresponding peak signal-to-noise ratio (PSNR) is defined as
$$\mathrm{PSNR} = 20\log\frac{255}{\sqrt{\mathrm{MSE}}}. \tag{3.43}$$
In the case of noisy measurements, $y = Ax + z$, the signal-to-noise ratio (SNR) is defined as
$$\mathrm{SNR} = 20\log\frac{\|Ax\|_2}{\|z\|_2}. \tag{3.44}$$
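For concreteness, these quantities can be computed as follows; base-10 logarithms are assumed here, which is the usual convention for expressing PSNR and SNR in decibels.

```python
import numpy as np

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)                          # (3.42)

def psnr(x, x_hat, peak=255.0):
    return 20.0 * np.log10(peak / np.sqrt(mse(x, x_hat)))     # (3.43)

def snr(Ax, z):
    return 20.0 * np.log10(np.linalg.norm(Ax) / np.linalg.norm(z))  # (3.44)
```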
Table 3.2. PSNR of 512 × 512 reconstructions with Gaussian measurement noise at various SNR values, sampled by a random partial-Fourier measurement matrix.
Note: The bold numbers highlight the best performance achieved in each case.
3.7 Extensions
where as usual Cr denotes the codebook of the code defined by (Er , Dr ). In other words,
similarly to CSP, COPER also seeks the codeword that minimizes the measurement
error. The following theorem characterizes the performance of COPER and connects the
number of measurements m with the rate r, distortion δ, and reconstruction quality.
Before stating the theorem, note that, since the measurements are phaseless, for all
values of $\theta \in [0, 2\pi]$, $e^{j\theta}x$ and $x$ will generate the same measurements. That is, $|Ax| = |A(e^{j\theta}x)|$. Hence, without any additional information about the phase of the signal, any recovery algorithm can only be expected to recover $x$ up to an unknown phase shift.
have adopted the alternative definitions as the main setup in this chapter, and finally
specify how compression-based CS of stochastic processes raises new questions about
fundamental limits of noiseless CS.
Lossy Compression
Consider a stationary stochastic process $\mathbf{X} = \{X_i\}_{i\in\mathbb{Z}}$, with alphabet $\mathcal{X}$, where $\mathcal{X} \subset \mathbb{R}$. A (fixed-length) lossy compression code for process $\mathbf{X}$ is defined by its blocklength $n$ and encoding and decoding mappings $(\mathcal{E}, \mathcal{D})$, where
$$\mathcal{E}: \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}$$
and
$$\mathcal{D}: \{1, \ldots, 2^{nR}\} \to \hat{\mathcal{X}}^n.$$
Here, $\mathcal{X}$ and $\hat{\mathcal{X}}$ denote the source and the reconstruction alphabet, respectively. Unlike the earlier definitions, where the rate $r$ represented the total number of bits used to encode each source vector, here the rate $R$ denotes the number of bits per source symbol. The distortion performance of the described code is measured in terms of its induced expected average distortion, defined as
$$D = \frac{1}{n}\sum_{i=1}^{n} \mathrm{E}\big[d(X_i, \hat{X}_i)\big].$$
A distortion level $D$ is said to be achievable at rate $R$ if, for every $\epsilon > 0$ and all $n$ large enough, there exists a code of blocklength $n$ and rate $R$ such that
$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{E}\big[d(X_i, \hat{X}_i)\big] \leq D + \epsilon.$$
Structured Processes
Consider a real-valued stationary stochastic process X = {Xi }i∈Z . CS of process X, i.e.,
recovering X n from measurements Y m = AX n + Z m , where m < n, is possible only if X is
“structured.” This raises the following important questions.
question 4 What does it mean for a real-valued stationary stochastic process to be
structured?
question 5 How can we measure the level of structuredness of a process?
For stationary processes with a discrete alphabet, information theory provides us with
a powerful tool, namely, the entropy rate function, that measures the complexity or
compressibility of such sources [24, 64]. However, the entropy rate of all real-valued
processes is infinite and hence this measure by itself is not useful to measure the
complexity of real-valued sources.
In this section, we briefly address the above questions in the context of CS. Before
proceeding, to better understand what it means for a process to be structured, let us
review a classic example, which has been studied extensively in CS: an i.i.d. process $\mathbf{X}$ in which each $X_i$, $i \in \mathbb{N}$, is distributed as described in the following example.
Example 3.1 Consider a random variable $X$ distributed such that with probability $1 - p$ it is equal to $0$, and with probability $p$ it is uniformly distributed between $0$ and $1$. That is,
$$X \sim (1-p)\delta_0 + p\,\nu,$$
where $\delta_0$ denotes a point mass at zero and $\nu$ denotes the uniform distribution on $[0, 1]$. Let $[X]_b$ denote the $b$-bit quantized version of $X$. Then
$$P([X]_b = 0) = 1 - p + p2^{-b}$$
and
$$P([X]_b = i) = p2^{-b}, \quad \text{for } i = 1, \ldots, 2^{b}-1.$$
As expected, $H([X]_b)$ grows to infinity as $b$ grows without bound. Dividing both sides by $b$, it follows that, for a fixed $p$,
$$\frac{H([X]_b)}{b} = p + \delta_b, \tag{3.51}$$
where $\delta_b = o(1)$. This suggests that $H([X]_b)$ grows almost linearly with $b$ and the asymptotic slope of its growth is equal to
$$\lim_{b\to\infty} \frac{H([X]_b)}{b} = p.$$
The result derived in Example 3.1 on the slope of the growth of the quantized entropy
function holds for any absolutely continuous distribution satisfying H(X) < ∞ [66]. In
fact this slope is a well-known quantity referred to as the Rényi information dimension
of X [66].
definition 3.1 (Rényi information dimension) Given a random variable $X$, the upper Rényi information dimension of $X$ is defined as
$$\bar{d}(X) = \limsup_{b\to\infty} \frac{H([X]_b)}{b}. \tag{3.52}$$
The lower information dimension of process $\mathbf{X}$ is defined as $\underline{d}_o(\mathbf{X}) = \lim_{k\to\infty} \underline{d}_k(\mathbf{X})$. When $\bar{d}_o(\mathbf{X}) = \underline{d}_o(\mathbf{X})$, the information dimension of process $\mathbf{X}$ is defined as $d_o(\mathbf{X}) = \bar{d}_o(\mathbf{X}) = \underline{d}_o(\mathbf{X})$.
Just like the Rényi information dimension, the information dimension of general sta-
tionary processes has an operational meaning in the context of universal CS of such
sources.2
Another possible measure of structuredness for stochastic processes, which is more
closely connected to compression-based CS, is the rate-distortion dimension defined as
follows.
definition 3.4 (Rate-distortion dimension) The upper rate-distortion dimension of stationary process $\mathbf{X}$ is defined as
$$\overline{\dim}_R(\mathbf{X}) = \limsup_{D\to 0} \frac{2R(\mathbf{X}, D)}{\log(1/D)}. \tag{3.55}$$
The lower rate-distortion dimension of process $\mathbf{X}$, $\underline{\dim}_R(\mathbf{X})$, is defined similarly by replacing $\limsup$ with $\liminf$. Finally, if the two coincide, the rate-distortion dimension of process $\mathbf{X}$ is defined as $\dim_R(\mathbf{X}) = \overline{\dim}_R(\mathbf{X}) = \underline{\dim}_R(\mathbf{X})$.
For memoryless stationary sources, the information dimension simplifies to the Rényi
information dimension of the marginal distribution, which is known to be equal to the
rate-distortion dimension [69]. For more general sources, under some technical condi-
tions, the rate-distortion dimension and the information dimension can be proved to be
equal [70].
2 In information theory, an algorithm is called universal if it does not need to know the source distribution
and yet, asymptotically, achieves the optimal performance.
$$m = \frac{2\eta R}{\log(1/D)}\, n. \tag{3.56}$$
Further, let $\hat{X}^n$ denote the solution of the CSP optimization, that is, $\hat{X}^n = \arg\min_{c\in\mathcal{C}_n} \|Ac - Y^m\|_2^2$. Then,
$$P\bigg(\frac{1}{\sqrt{n}}\big\|X^n - \hat{X}^n\big\|_2 \geq (1+\delta)\sqrt{2 + \frac{n}{m}}\; D^{\frac{1}{2}\left(1-\frac{1}{\eta}\right)}\bigg) \leq 2^{-\frac{1}{2}nR\alpha} + e^{-\frac{m}{2}}.$$
corollary 3.5 (Corollary 3 in [70]) Consider a stationary process $\mathbf{X}$ with upper rate-distortion dimension $\overline{\dim}_R(\mathbf{X})$. Let $Y^m = AX^n$, where the $A_{i,j}$ are i.i.d. $\mathcal{N}(0, 1)$. For any $\Delta > 0$, if the number of measurements $m = m_n$ satisfies
$$\liminf_{n\to\infty} \frac{m_n}{n} > \overline{\dim}_R(\mathbf{X}),$$
then there exists a family of compression codes which, when used by the CSP optimization, yields
$$\lim_{n\to\infty} P\Big(\frac{1}{\sqrt{n}}\big\|X^n - \hat{X}^n\big\|_2 \geq \Delta\Big) = 0.$$
References
[1] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4,
pp. 1289–1306, 2006.
[2] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal
encoding strategies?” IEEE Trans. Information Theory, vol. 52, no. 12, pp. 5406–5425,
2006.
[3] S. Bakin, “Adaptive regression and model selection in data mining problems,” Ph.D.
Thesis, Australian National University, 1999.
[4] Y. C. Eldar and M. Mishali, “Robust recovery of signals from a structured union of
subspaces,” IEEE Trans. Information. Theory, vol. 55, no. 11, pp. 5302–5316, 2009.
[5] Y. C. Eldar, P. Kuppinger, and H. Bölcskei, “Block-sparse signals: Uncertainty
relations and efficient recovery,” IEEE Trans. Signal Processing, vol. 58, no. 6,
pp. 3042–3054, 2010.
[6] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,”
J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp. 49–67, 2006.
[7] S. Ji, D. Dunson, and L. Carin, “Multi-task compressive sensing,” IEEE Trans. Signal
Processing, vol. 57, no. 1, pp. 92–106, 2009.
[8] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk, “Asymptotic analysis of complex lasso
via complex approximate message passing (CAMP),” IEEE Trans. Information Theory,
vol. 59, no. 7, pp. 4290–4308, 2013.
[9] M. Stojnic, “Block-length dependent thresholds in block-sparse compressed sensing,”
arXiv:0907.3679, 2009.
[10] M. Stojnic, F. Parvaresh, and B. Hassibi, “On the reconstruction of block-sparse signals
with an optimal number of measurements,” IEEE Trans. Signal Processing, vol. 57, no. 8,
pp. 3075–3085, 2009.
[11] M. Stojnic, “ℓ2/ℓ1-optimization in block-sparse compressed sensing and its strong thresholds,” IEEE J. Selected Topics Signal Processing, vol. 4, no. 2, pp. 350–357, 2010.
[12] L. Meier, S. Van De Geer, and P. Buhlmann, “The group Lasso for logistic regression,”
J. Roy. Statist. Soc. Ser. B, vol. 70, no. 1, pp. 53–71, 2008.
[13] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of
linear inverse problems,” Found. Comput. Math., vol. 12, no. 6, pp. 805–849, 2012.
[14] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, “Model-based compressive
sensing,” IEEE Trans. Information Theory, vol. 56, no. 4, pp. 1982–2001, 2010.
[15] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum rank solutions to linear matrix
equations via nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[16] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite rate of innovation,”
IEEE Trans. Signal Processing, vol. 50, no. 6, pp. 1417–1428, 2002.
[17] S. Som and P. Schniter, “Compressive imaging using approximate message passing and a
Markov-tree prior,” IEEE Trans. Signal Processing, vol. 60, no. 7, pp. 3439–3448, 2012.
[18] D. Donoho and G. Kutyniok, “Microlocal analysis of the geometric separation problem,” Commun. Pure Appl. Math., vol. 66, no. 1, pp. 1–47, 2013.
[19] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, pp. 1–37, 2011.
[20] A. E. Waters, A. C. Sankaranarayanan, and R. Baraniuk, “Sparcs: Recovering low-rank
and sparse matrices from compressive measurements,” in Proc. Advances in Neural
Information Processing Systems, 2011, pp. 1089–1097.
[21] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. Willsky, “Rank-sparsity incoherence
for matrix decomposition,” SIAM J. Optimization, vol. 21, no. 2, pp. 572–596, 2011.
[22] M. F. Duarte, W. U. Bajwa, and R. Calderbank, “The performance of group Lasso for
linear regression of grouped variables,” Technical Report TR-2010-10, Duke University,
Department of Computer Science, Durham, NC, 2011.
[23] T. Blumensath and M. E. Davies, “Sampling theorems for signals from the union of finite-
dimensional linear subspaces,” IEEE Trans. Information Theory, vol. 55, no. 4, pp. 1872–
1882, 2009.
[24] M. B. McCoy and J. A. Tropp, “Sharp recovery bounds for convex deconvolution, with
applications,” arXiv:1205.1580, 2012.
[25] C. Studer and R. G. Baraniuk, “Stable restoration and separation of approximately sparse
signals,” Appl. Comp. Harmonic Analysis (ACHA), vol. 37, no. 1, pp. 12–35, 2014.
[26] G. Peyré and J. Fadili, “Group sparsity with overlapping partition functions,” in Proc. EUSIPCO, 2011, pp. 303–307.
[27] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,”
SIAM Rev., vol. 43, no. 1, pp. 129–159, 2001.
[28] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser.
B vol. 58, no. 1, pp. 267–288, 1996.
[29] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear
inverse problems,” SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
[30] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal
matching pursuit,” IEEE Trans. Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[31] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear
inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11,
pp. 1413–1457, 2004.
[32] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals
Statist., vol. 32, no. 2, pp. 407–499, 2004.
[33] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,”
Appl. Comp. Harmonic Analysis (ACHA), vol. 27, no. 3, pp. 265–274, 2009.
[34] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inac-
curate samples,” Appl. Comp. Harmonic Analysis (ACHA), vol. 26, no. 3, pp. 301–321,
2009.
[35] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed
sensing,” Proc. Natl. Acad. Sci. USA, vol. 106, no. 45, pp. 18 914–18 919, 2009.
[36] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing signal reconstruc-
tion,” IEEE Trans. Information Theory, vol. 55, no. 5, pp. 2230–2249, 2009.
[37] C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,”
IEEE Trans. Information Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
[38] J. Zhu, D. Baron, and M. F. Duarte, “Recovery from linear measurements with complexity-
matching universal signal estimation,” IEEE Trans. Signal Processing, vol. 63, no. 6, pp.
1512–1527, 2015.
[39] S. Jalali, A. Maleki, and R. G. Baraniuk, “Minimum complexity pursuit for universal
compressed sensing,” IEEE Trans. Information Theory, vol. 60, no. 4, pp. 2253–2268,
2014.
[40] S. Jalali and H. V. Poor, “Universal compressed sensing for almost lossless recovery,” IEEE
Trans. Information Theory, vol. 63, no. 5, pp. 2933–2953, 2017.
[41] S. Jalali and A. Maleki, “New approach to Bayesian high-dimensional linear regression,”
Information and Inference, vol. 7, no. 4, pp. 605–655, 2018.
[42] S. Jalali and A. Maleki, “From compression to compressed sensing,” Appl. Comput
Harmonic Analysis, vol. 40, no. 2, pp. 352–385, 2016.
[43] D. S. Taubman and M. W. Marcellin, JPEG2000: Image compression fundamentals,
standards and practice. Kluwer Academic Publishers, 2002.
[44] S. Beygi, S. Jalali, A. Maleki, and U. Mitra, “Compressed sensing of compressible signals,”
in Proc. IEEE International Symposium on Information Theory, 2017, pp. 2158–2162.
[45] S. Beygi, S. Jalali, A. Maleki, and U. Mitra, “An efficient algorithm for compression-based compressed sensing,” Information and Inference: A Journal of the IMA, vol. 8, no. 2, pp. 343–375, June 2019.
[46] R. T. Rockafellar, “Monotone operators and the proximal point algorithm,” SIAM J. Cont.
Opt., vol. 14, no. 5, pp. 877–898, 1976.
[47] E. J. Candès, J. Romberg, and T. Tao, “Decoding by linear programming,” IEEE Trans.
Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[48] E. Candès and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger
than n,” Annals Statist, vol. 35, no. 6, pp. 2313–2351, 2007.
[49] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, “Simultaneous analysis of Lasso and Dantzig
selector,” Annals Statist, vol. 37, no. 4, pp. 1705–1732, 2009.
[50] J. A. Nelder and R. Mead, “A simplex method for function minimization,” Comp. J, vol. 7,
no. 4, pp. 308–313, 1965.
[51] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning: Data mining,
inference, and prediction, 2nd edn. Springer, 2009.
[52] W. Dong, G. Shi, X. Li, Y. Ma, and F. Huang, “Compressive sensing via nonlocal low-rank
regularization,” IEEE Trans. Image Processing, 2014.
[53] R. W. Harrison, “Phase problem in crystallography,” J. Opt. Soc. America A, vol. 10, no. 5,
pp. 1046–1055, 1993.
[54] C. Fienup and J. Dainty, “Phase retrieval and image reconstruction for astronomy,” Image
Rec.: Theory and Appl., pp. 231–275, 1987.
[55] F. Pfeiffer, T. Weitkamp, O. Bunk, and C. David, “Phase retrieval and differential phase-
contrast imaging with low-brilliance X-ray sources,” Nature Physics, vol. 2, no. 4,
pp. 258–261, 2006.
[56] E. J. Candès, T. Strohmer, and V. Voroninski, “Phaselift: Exact and stable signal recovery
from magnitude measurements via convex programming,” Commun. Pure Appl. Math.,
vol. 66, no. 8, pp. 1241–1274, 2013.
[57] E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski, “Phase retrieval via matrix
completion,” SIAM Rev., vol. 57, no. 2, pp. 225–251, 2015.
[58] H. Ohlsson, A. Yang, R. Dong, and S. Sastry, “CPRL – an extension of compressive sens-
ing to the phase retrieval problem,” in Proc. Advances in Neural Information Processing
Systems 25, 2012, pp. 1367–1375.
[59] P. Schniter and S. Rangan, “Compressive phase retrieval via generalized approximate
message passing,” IEEE Trans. Information Theory, vol. 63, no. 4, pp. 1043–1055, 2015.
[60] M. Bakhshizadeh, A. Maleki, and S. Jalali, “Compressive phase retrieval of struc-
tured signals,” in Proc. IEEE International Symposium on Information Theory, 2018,
pp. 2291–2295.
[61] Y. Steinberg and S. Verdú, “Simulation of random processes and rate-distortion theory,” IEEE Trans. Information Theory, vol. 42, no. 1, pp. 63–86, 1996.
[62] S. Ihara and M. Kubo, “Error exponent of coding for stationary memoryless sources with
a fidelity criterion,” IEICE Trans. Fund. Elec., Comm. and Comp. Sciences, vol. 88, no. 5,
pp. 1339–1345, 2005.
[63] K. Iriyama, “Probability of error for the fixed-length lossy coding of general sources,”
IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1498–1507, 2005.
[64] C. E. Shannon, “A mathematical theory of communication: Parts I and II,” Bell Systems
Technical J., vol. 27, pp. 379–423 and 623–656, 1948.
[65] T. Cover and J. Thomas, Elements of information theory, 2nd edn. Wiley, 2006.
[66] A. Rényi, “On the dimension and entropy of probability distributions,” Acta Math. Acad.
Sci. Hungarica, vol. 10, no. 5, 1–2, pp. 193–215, 1959.
[67] Y. Wu and S. Verdú, “Rényi information dimension: Fundamental limits of almost lossless
analog compression,” IEEE Trans. Information Theory, vol. 56, no. 8, pp. 3721–3748,
2010.
[68] B. C. Geiger and T. Koch, “On the information dimension rate of stochastic processes,” in
Proc. IEEE International Symposium on Information Theory, 2017, pp. 888–892.
[69] T. Kawabata and A. Dembo, “The rate-distortion dimension of sets and measures,” IEEE
Trans. Information Theory, vol. 40, no. 5, pp. 1564–1572, 1994.
[70] F. E. Rezagah, S. Jalali, E. Erkip, and H. V. Poor, “Compression-based compressed
sensing,” IEEE Trans. Information Theory, vol. 63, no. 10, pp. 6735–6752, 2017.
4 Information-Theoretic Bounds
on Sketching
Mert Pilanci
Summary
4.1 Introduction
This chapter focuses on the role of information theory in sketching methods for
solving large-scale statistical estimation and optimization problems, and investigates
fundamental lower bounds on their performance. By exploring these lower bounds, we
obtain interesting trade-offs in computation and accuracy. Moreover, we may hope to
obtain improved sketching constructions by understanding their information-theoretic
properties. The lower-bounding techniques employed here parallel the information-
theoretic techniques used in statistical minimax theory [7, 8]. We apply Fano’s inequality
and packing constructions to understand fundamental lower bounds on the accuracy of
sketching.
Randomness and sketching also have applications in privacy-preserving queries
[9, 10]. Privacy has become an important concern in the age of information where
breaches of sensitive data are frequent. We will illustrate that randomized sketch-
ing offers a computationally simple and effective mechanism to preserve privacy in
optimization and machine learning.
We start with an overview of different constructions of sketching matrices in Sec-
tion 4.2. In Section 4.3, we briefly review some background on convex analysis and
optimization. Then we present upper bounds on the performance of sketching from an
optimization viewpoint in Section 4.4. To be able to analyze upper bounds, we introduce
the notion of localized Gaussian complexity, which also plays an important role in the
characterization of minimax statistical bounds. In Section 4.5, we discuss information-
theoretic lower bounds on the statistical performance of sketching. In Section 4.6,
we turn to non-parametric problems and information-theoretic lower bounds. Finally,
in Section 4.7 we discuss privacy-preserving properties of sketching using a mutual
information characterization, and communication-complexity lower bounds.
Figure 4.1 Sketching a tall matrix A. The smaller matrix $SA \in \mathbb{R}^{m\times d}$ is a compressed version of the original data $A \in \mathbb{R}^{n\times d}$.
from i.i.d. zero-mean Gaussian random variables with variance $1/m$. Note that we have $\mathrm{E}\,S = 0_{m\times n}$ and also $\mathrm{E}\,S^TS = \sum_{i=1}^{m}\mathrm{E}\,s_is_i^T = \sum_{i=1}^{m}\frac{1}{m}I_n = I_n$. Analyzing Gaussian sketches is considerably easier than analyzing sketches of other types, because of special properties of the Gaussian distribution such as rotation invariance. However, Gaussian sketches may not be the most computationally efficient choice for many data matrices, as we will discuss in the following sections.
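As a concrete illustration, a Gaussian sketch can be generated and applied in a few lines; the check below simply verifies numerically that the sketched squared norm is close to the original one, as implied by $\mathrm{E}\,S^TS = I_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 20, 300
A = rng.standard_normal((n, d))

S = rng.normal(scale=np.sqrt(1.0 / m), size=(m, n))  # i.i.d. N(0, 1/m) entries
SA = S @ A                                           # compressed data, m x d

# Since E[S^T S] = I_n, ||S A x||^2 is an unbiased estimate of ||A x||^2.
x = rng.standard_normal(d)
print(np.linalg.norm(S @ (A @ x)) ** 2 / np.linalg.norm(A @ x) ** 2)
```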
or Fourier bases, for which matrix–vector multiplication can be performed in O(n log n)
time via the fast Hadamard or Fourier transforms, respectively. For example, an n × n
Hadamard matrix $H = H_n$ can be recursively constructed as follows:
$$H_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1\\ 1 & -1\end{bmatrix}, \qquad H_4 = \frac{1}{\sqrt{2}}\begin{bmatrix} H_2 & H_2\\ H_2 & -H_2\end{bmatrix}, \qquad H_{2^t} = \underbrace{H_2\otimes H_2\otimes\cdots\otimes H_2}_{\text{Kronecker product } t \text{ times}}.$$
From any such matrix, a sketching matrix $S \in \mathbb{R}^{m\times n}$ from a ROS ensemble can be obtained by sampling i.i.d. rows of the form
$$s^T = \sqrt{n}\, e_j^T H D \quad \text{with probability } 1/n \text{ for } j = 1, \ldots, n,$$
where the random vector $e_j \in \mathbb{R}^n$ is chosen uniformly at random from the set of all $n$ canonical basis vectors, and $D = \mathrm{diag}(r)$ is a diagonal matrix of i.i.d. Rademacher variables $r \in \{-1, +1\}^n$, with $P[r_i = +1] = P[r_i = -1] = \frac{1}{2}$ for all $i$. Alternatively, the rows
of the ROS sketch can be sampled without replacement and one can obtain similar
guarantees to sampling with replacement. Given a fast routine for matrix–vector
multiplication, ROS sketch SA of the data A ∈ Rn×d can be formed in O(n d log m) time
(for instance, see [11]).
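The following is a short Python sketch of the ROS construction based on a fast Walsh–Hadamard transform. It assumes $n$ is a power of two, samples rows with replacement, and includes an additional $1/\sqrt{m}$ scaling (a common convention, assumed here) so that the sketched squared norm is unbiased.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0 (length 2^t)."""
    x = x.copy()
    h = 1
    while h < x.shape[0]:
        for i in range(0, x.shape[0], 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def ros_sketch(A, m, rng):
    """ROS sketch SA with a Hadamard base matrix: random signs D, orthonormal
    Hadamard H, and uniformly sampled coordinates e_j, scaled by sqrt(n/m)."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)        # D = diag(r), Rademacher
    HDA = fwht(signs[:, None] * A) / np.sqrt(n)    # orthonormal H applied to D A
    rows = rng.integers(0, n, size=m)              # sample e_j with replacement
    return np.sqrt(n / m) * HDA[rows]

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 10))
SA = ros_sketch(A, m=128, rng=rng)
x = rng.standard_normal(10)
print(np.linalg.norm(SA @ x) ** 2 / np.linalg.norm(A @ x) ** 2)
```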
where u1 , u2 , ..., un are the rows of U ∈ Rn × d , which is the matrix of left singular vectors
of A. Leverage scores can be obtained using a singular value decomposition A = UΣVT .
Moreover, there also exist faster randomized algorithms to approximate the leverage
scores (e.g., see [12]). In our analysis of lower bounds to follow, we assume that the
weights are $\alpha$-balanced, meaning that
$$\max_{j=1,\ldots,n} p_j \leq \frac{\alpha}{n} \tag{4.4}$$
for some constant α that is independent of n.
In this section, we first briefly review relevant concepts from convex analysis and
optimization. A set C ⊆ Rd is convex if, for any x, y ∈ C,
tx + (1 − t)y ∈ C for all t ∈ [0, 1] .
1 A hash function is from a pair-wise independent family if $P[h(j) = i, h(k) = l] = 1/m^2$ for $j \neq k$ and $P[h(j) = i] = 1/m$, for all $i, j, k, l$.
Figure 4.2 Different types of sketching matrices: (a) Gaussian sketch, (b) ±1 random sign sketch,
(c) randomized orthogonal system sketch, and (d) sparse sketch.
Given a matrix A ∈ Rn×d , we define the linear transform of the convex set C as AC =
{Ax | x ∈ C}. It can be shown that AC is convex if C is convex.
A convex optimization problem is a minimization problem of the form
$$\min_{x\in C} f(x), \tag{4.5}$$
where $f(x)$ is a convex function and $C$ is a convex set. In order to characterize optimality of solutions, we will define the tangent cone of $C$ at a fixed vector $x^*$ as follows:
$$T_C(x^*) = \big\{t(x - x^*) \;\big|\; t \geq 0 \text{ and } x \in C\big\}. \tag{4.6}$$
Figures 4.3 and 4.4 illustrate² examples of tangent cones of a polyhedral convex set in $\mathbb{R}^2$. A first-order characterization of optimality in the convex optimization problem (4.5) is given by the tangent cone. If a vector $x^*$ is optimal in (4.5), it holds that
$$\langle \nabla f(x^*),\, u\rangle \geq 0 \quad \text{for all } u \in T_C(x^*). \tag{4.7}$$
We refer the reader to Hiriart-Urruty and Lemaréchal [17] for details on convex analysis,
and Boyd and Vandenberghe [18] for an in-depth discussion of convex optimization
problems and applications.
2 Note that the tangent cones extend toward infinity in certain directions, whereas the shaded regions in
Figs. 4.3 and 4.4 are compact for illustration.
Figure 4.3 A narrow tangent cone where the Gaussian complexity is small.
Figure 4.4 A wide tangent cone where the Gaussian complexity is large.
Consider the constrained least-squares problem
$$x^* = \arg\min_{x\in C} \|Ax - b\|_2^2, \tag{4.8}$$
where $A \in \mathbb{R}^{n\times d}$ and $b \in \mathbb{R}^n$ are the input data and $C \subseteq \mathbb{R}^d$ is a closed and convex constraint set. In statistical and signal-processing applications, it is typical to use the constraint set to impose structure on the obtained solution $x$. Important examples of the convex constraint $C$ include the non-negative orthant, the $\ell_1$-ball for promoting sparsity, and the $\ell_\infty$-ball as a relaxation of the combinatorial set $\{0, 1\}^d$.
In the unconstrained case when C = Rd , a closed-form solution exists for the solution
of (4.8), which is given by x∗ = (AT A)−1 AT b. However, forming the Gram matrix AT A
and inverting using direct methods such as QR decomposition, or the singular value
decomposition, typically requires O(nd2 ) + O(nd min(n, d)) operations. Faster iterative
algorithms such as the conjugate gradient (CG) method can be used to obtain an approx-
imate solution in O(ndκ(A)) time, where κ(A) is the condition number of the data matrix
A. Using sketching methods, it is possible to obtain even faster approximate solutions,
as we will discuss in what follows.
In the constrained case, a variety of efficient iterative algorithms have been developed
in the last couple of decades to obtain the solution, such as proximal and projected
gradient methods, their accelerated variants, and barrier-based second-order methods.
Sketching can also be used to improve the run-time of these methods.
A natural approach is then to solve the sketched problem
$$\hat{x} = \arg\min_{x\in C} \|SAx - Sb\|_2^2. \tag{4.9}$$
After applying the sketch to the data matrices, the sketched problem has dimensions
m × d, which is lower than the original dimensions when m < n. Note that the objective
in the above problem (4.9) can be seen as an unbiased approximation of the original
objective function (4.8), since it holds that
$$\mathrm{E}\,\|SAx - Sb\|_2^2 = \|Ax - b\|_2^2$$
for any fixed choice of $A$, $x$, and $b$. This is a consequence of the condition (4.1), which
is satisfied by all of the sketching matrices considered in Section 4.2.
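As a simple illustration of (4.9) in the unconstrained case, the following Python snippet compares the sketched solution with the exact least-squares solution; the relative increase of the objective shrinks as the sketch dimension $m$ grows, as quantified by the bounds discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 50, 400
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact least squares vs. the sketched problem min_x ||S A x - S b||_2^2.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
S = rng.normal(scale=np.sqrt(1.0 / m), size=(m, n))
x_sk, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

f = lambda x: np.linalg.norm(A @ x - b) ** 2
print("relative objective gap:", f(x_sk) / f(x_ls) - 1.0)  # shrinks with m
```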
The localized Gaussian complexity of a set $K \subseteq \mathbb{R}^n$ at scale $t$ is defined as
$$W_t(K) = \mathrm{E}_g\Big[\sup_{z\in K,\ \|z\|_2\leq t} \langle g, z\rangle\Big], \tag{4.10}$$
where $g$ is a random vector with i.i.d. standard Gaussian entries, i.e., $g \sim \mathcal{N}(0, I_n)$. The parameter $t > 0$ controls the radius at which the random deviations are localized. For a finite value of $t$, the supremum in (4.10) is always achieved since the constraint set is compact.
Analyzing the sketched optimization problem requires us to control the random devi-
ations constrained to the set of possible descent directions {x − x∗ | x ∈ C}. We now define
a transformed tangent cone at x∗ as follows:
$$K = \big\{tA(x - x^*) \;\big|\; t \geq 0 \text{ and } x \in C\big\},$$
which can be alternatively defined as ATC (x∗ ) using the definition given in (4.6). The
next theorem provides an upper bound on the performance of the sketching method for
constrained optimization based on localized Gaussian complexity.
theorem 4.1 Let $S$ be a Gaussian sketch, and let $\hat{x}$ be the solution of (4.9). Suppose that $m \geq c_0\, W_1(K)^2/\epsilon^2$, where $c_0$ is a universal constant; then it holds that
$$\frac{\|A(\hat{x} - x^*)\|_2^2}{f(x^*)} \leq \epsilon,$$
In particular,
$$f(x^*) \leq f(\hat{x}) \leq f(x^*)(1+\epsilon). \tag{4.11}$$
As predicted by the theorem, the approximation ratio improves as the sketch dimen-
sion m increases, and converges to one as m → ∞. However, we are often interested in the
rate of convergence of the approximation ratio. Theorem 4.1 characterizes this rate by
relating the geometry of the constraint set to the accuracy of the sketching method (4.9).
As an illustration, Figs. 4.3 and 4.4 show narrow and wide tangent cones in R2 , respec-
tively. The proof of Theorem 4.1 combines the convex optimality condition involving
the tangent cone in (4.7) with results on empirical processes, and can be found in Pilanci
and Wainwright [22]. An important feature of Theorem 4.1 is that the approximation
quality is relative to the optimal value f (x∗ ). This is advantageous when f (x∗ ) is small,
e.g., the optimal value can be zero in noiseless signal recovery problems. However, in
problems where the signal-to-noise ratio is low, f (x∗ ) can be large, and hence negatively
affects the approximation quality. We illustrate the implications of Theorem 4.1 on some
concrete examples in what follows.
Proof Let $U \in \mathbb{R}^{n\times q}$ be an orthonormal basis for the subspace $Q$, so that we have the representation $L = \{Ux \mid x \in \mathbb{R}^q\}$. Consequently, the Gaussian complexity $W_t(Q)$ can be written as
$$\mathrm{E}_g \sup_{\|Ux\|_2\leq t} \langle g, Ux\rangle = \mathrm{E}_g \sup_{\|x\|_2\leq t} \langle U^Tg, x\rangle = t\,\mathrm{E}_g\|U^Tg\|_2 \leq t\sqrt{\mathrm{E}\,\mathrm{tr}\big(UU^Tgg^T\big)} = t\sqrt{\mathrm{tr}\big(U^TU\big)} = t\sqrt{q},$$
where the inequality follows from Jensen's inequality and concavity of the square root, and the first and fifth relations are equalities since $U^TU = I_q$. Therefore, the Gaussian complexity of the range of $A$ for $t = 1$ satisfies
$$W_1(\mathrm{range}(A)) \leq \sqrt{\mathrm{rank}(A)}.$$
As a result, we conclude that, for $\ell_1$-constrained problems, the sketch dimension can be substantially smaller when the $\ell_1$-constrained eigenvalues are well behaved.
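The subspace calculation above is also easy to check numerically: for a $q$-dimensional subspace, the localized Gaussian complexity at $t = 1$ equals $\mathrm{E}\,\|U^Tg\|_2$, which is at most $\sqrt{q}$. The following Monte Carlo snippet is an illustration of this, not part of the chapter's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, trials = 500, 7, 2000

# Orthonormal basis U of a random q-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, q)))

# W_1 = E sup_{z in subspace, ||z||<=1} <g, z> = E ||U^T g||_2 <= sqrt(q).
g = rng.standard_normal((trials, n))
w1_estimate = np.mean(np.linalg.norm(g @ U, axis=1))
print(w1_estimate, np.sqrt(q))
```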
Note that the vector $z \in \mathbb{R}^m$ is of smaller dimension than the original variable $x \in \mathbb{R}^d$. After solving the reduced-dimensional problem and obtaining its optimal solution $z^*$, the final estimate for the original variable $x$ can be taken as $\hat{x} = S^Tz^*$. We will investigate this approach in Section 4.5 in non-parametric statistical estimation problems and present concrete theoretical guarantees.
It is instructive to note that, in the special case where we have 2 regularization and
C = Rd , we can easily transform the under-determined least-squares problem into an
over-determined one using convex duality, or the matrix-inversion lemma. We first write
the sketched problem (4.12) as the constrained convex program
$$\min_{\substack{z\in\mathbb{R}^m,\ y\in\mathbb{R}^n\\ y = AS^Tz}} \frac{1}{2}\|y - b\|_2^2 + \rho\|z\|_2^2,$$
and form the convex dual. It can be shown that strong duality holds, and consequently the primal and dual programs can be stated as follows:
$$\min_{z\in\mathbb{R}^m} \Big\{\frac{1}{2}\|AS^Tz - b\|_2^2 + \rho\|z\|_2^2\Big\} = \max_{x\in\mathbb{R}^d} \Big\{-\frac{1}{4\rho}\|SA^Tx\|_2^2 - \frac{1}{2}\|x\|_2^2 + x^Tb\Big\},$$
3 The term support refers to the set of indices where the solution has a non-zero value.
where the primal and dual solutions satisfy z∗ = (1/2ρ)SAT x∗ at the optimum [18].
Therefore the sketching matrix applied from the right, AST , corresponds to a sketch
applied on the left, SAT , in the dual problem which parallels (4.9). This observation can
be used to derive approximation results on the dual program. We refer the reader to [22]
for an application in support vector machine classification where b = 0n .
4 This assumption means that, for any x ∈ C0 and scalar t ∈ [0, 1], the point tx also belongs to C0 .
5 We may also consider an approximation of $C_0$ which does not necessarily satisfy $C \subset C_0$, for example, the $\ell_1$ and $\ell_0$ unit balls.
where $\delta^*(n)$ is the critical radius, equal to the smallest positive solution $\delta > 0$ of the inequality
$$\frac{W_\delta(C)}{\delta\sqrt{n}} \leq \frac{\delta}{\sigma}. \tag{4.14}$$
We refer the reader to [20, 23] for a proof of this theorem. This result provides a
baseline against which to compare the statistical recovery performance of the random-
ized sketching method. In particular, an important goal is characterizing the minimal
projection dimension $m$ that will enable us to find an estimate $\hat{x}$ with the error guarantee
$$\|A(\hat{x} - x^*)\|_2 \leq \epsilon\,\|Ax^* - b\|_2,$$
where $\|Ax^* - b\|_2^2 = f(x^*)$ is the optimal value of the optimization problem (4.8).
However, under the model (4.13) we have
6 See [23] for a proof of this fact for Gaussian and ROS sketches. To be more precise, for ROS sketches, the
condition (4.15) holds when rows are sampled without replacement.
lemma 4.2 Let $S \in \mathbb{R}^{m\times n}$ be a random matrix with i.i.d. Gaussian entries. We have
$$\big\|\mathrm{E}\big[S^T(SS^T)^{-1}S\big]\big\|_{\mathrm{op}} = \frac{m}{n}.$$
Proof Let $S = U\Sigma V^T$ denote the singular value decomposition of the random matrix $S$. Note that we have $S^T(SS^T)^{-1}S = VV^T$. By virtue of the rotation invariance of the Gaussian distribution, the columns of $V$, denoted by $\{v_i\}_{i=1}^m$, are uniformly distributed over the $n$-dimensional unit sphere, and it holds that $\mathrm{E}\,v_iv_i^T = \frac{1}{n}I_n$ for $i = 1, \ldots, m$. Consequently, we obtain
$$\mathrm{E}\big[S^T(SS^T)^{-1}S\big] = \mathrm{E}\sum_{i=1}^{m} v_iv_i^T = m\,\mathrm{E}\,v_1v_1^T = \frac{m}{n}I_n,$$
which proves the claim.
Fano’s inequality follows as a simple consequence of the chain rule for entropy. How-
ever, it is very powerful for deriving lower bounds on the error probabilities in coding
theory, statistics, and machine learning [7, 26–30].
A $\delta$-packing of a set $C$ with respect to a norm $\|\cdot\|$ is a collection $\{x^1, \ldots, x^M\} \subset C$ such that
$$\|x^k - x^l\| > \delta \quad \text{for all } k \neq l.$$
We define the metric entropy of the set $C$ with respect to the norm $\|\cdot\|$ as the logarithm of the corresponding packing number, i.e., $\log_2 M_\delta(C, \|\cdot\|)$, where $M_\delta(C, \|\cdot\|)$ is the cardinality of a largest $\delta$-packing.
The concept of metric entropy provides a way to measure the complexity, or effec-
tive size, of a set with infinitely many elements and dates back to the seminal work
of Kolmogorov and Tikhomirov [31].
A proof of this lemma can be found in Section A4.2. Lemma 4.3 allows us to apply
Fano’s method after transforming the estimation problem into a hypothesis-testing
problem based on sketched data. Let us recall the condition on sketching matrices stated
earlier,
$$\big\|\mathrm{E}\big[S^T(SS^T)^{-1}S\big]\big\|_{\mathrm{op}} \leq \eta\,\frac{m}{n}, \tag{4.18}$$
where η is a constant that is independent of n and m. Now we are ready to present the
lower bound on the statistical performance of sketching.
theorem 4.3 For any random sketching matrix $S \in \mathbb{R}^{m\times n}$ satisfying condition (4.18), any estimator $(SA, Sb) \mapsto \hat{x}$ has MSE lower-bounded as
$$\sup_{x^\dagger\in C_0} \mathrm{E}_{S,w}\,\frac{1}{n}\big\|A(\hat{x} - x^\dagger)\big\|_2^2 \geq \frac{1}{128}\,\frac{\sigma^2\log_2\big(\frac{1}{2}M_{1/2}\big)}{\eta\,\min\{m, n\}}, \tag{4.19}$$
where $M_{1/2}$ is the $1/2$-packing number of $C_0 \cap B_A(1)$ in the semi-norm $\frac{1}{\sqrt{n}}\|A(\cdot)\|_2$.
We defer the proof to Section 4.8, and investigate the implications of the lower bound in
the next section. It can be shown that Theorem 4.3 is tight, since Theorem 4.1 provides
a matching upper bound.
Example 4.3 Unconstrained Least Squares We first consider the simple unconstrained
case, where the constraint is the entire d-dimensional space, i.e., C = Rd . With this
choice, it is well known that, under the observation model (4.13), the least-squares
solution x∗ has prediction mean-squared error upper-bounded as follows:7
$$\mathrm{E}\,\frac{1}{n}\big\|A(x^* - x^\dagger)\big\|_2^2 \;\lesssim\; \frac{\sigma^2\,\mathrm{rank}(A)}{n} \tag{4.20a}$$
$$\leq \frac{\sigma^2 d}{n}, \tag{4.20b}$$
where the expectation is over the noise variable w in (4.13). On the other hand, with the
choice $C_0 = B_2(1)$, it is well known that we can construct a $1/2$-packing with $M = 2^d$ elements, so that Theorem 4.3 implies that any estimator $\hat{x}$ based on $(SA, Sb)$ has prediction MSE lower-bounded as
$$\mathrm{E}_{S,w}\,\frac{1}{n}\big\|A(\hat{x} - x^\dagger)\big\|_2^2 \;\gtrsim\; \frac{\sigma^2 d}{\min\{m, n\}}. \tag{4.20c}$$
Consequently, the sketch dimension m must grow proportionally to n in order for the
sketched solution to have a mean-squared error comparable to the original least-squares
estimate. This may not be desirable for least-squares problems in which $n \gg d$, since it should be possible to sketch down to a dimension proportional to $\mathrm{rank}(A)$, which is always upper-bounded by $d$. Thus, Theorem 4.3 reveals a surprising gap between the classical least-squares sketch (4.9) and the accuracy of the original least-squares estimate. In the regime $n \gg m$, the prediction MSE of the sketched solution is $O(\sigma^2(d/m))$,
7 In fact, a closed-form solution exists for the prediction error, which is straightforward to obtain from the closed-form solution of the least-squares estimator. However, this simple form is sufficient to illustrate information-theoretic lower bounds.
which is a factor of n/m larger than the optimal prediction MSE in (4.20b). In Section
4.5.7, we will see that this gap can be removed by iterative sketching algorithms which
don’t obey the information-theoretic lower bound (4.20c).
Example 4.4 $\ell_1$-Constrained Least Squares We can consider other forms of constrained least-squares estimates as well, such as those involving an $\ell_1$-norm constraint to encourage sparsity in the solution. We now consider the sparse variant of the linear regression problem, which involves the $\ell_0$ “ball”
$$B_0(s) := \Big\{x\in\mathbb{R}^d \;\Big|\; \sum_{j=1}^{d}\mathbb{1}[x_j\neq 0] \leq s\Big\},$$
corresponding to the set of all vectors with at most $s$ non-zero entries. Fixing some radius $R \geq \sqrt{s}$, consider a vector $x^\dagger \in C_0 := B_0(s)\cap\{\|x\|_1 = R\}$, and suppose that we have noisy observations of the form $b = Ax^\dagger + w$.
Given this setup, one way in which to estimate $x^\dagger$ is by computing the least-squares estimate $x^*$ constrained to the $\ell_1$-ball $C = \{x\in\mathbb{R}^d \mid \|x\|_1 \leq R\}$.⁸ This estimator is a form of the Lasso [2, 32], which has been studied extensively in the context of statistical estimation and signal reconstruction.
On the other hand, the $1/2$-packing number $M$ of the set $C_0$ can be lower-bounded as $\log_2 M \gtrsim s\log_2(ed/s)$. We refer the reader to [33] for a proof. Consequently, in application to this particular problem, Theorem 4.3 implies that any estimator $\hat{x}$ based on the pair $(SA, Sb)$ has mean-squared error lower-bounded as
$$\mathrm{E}_{w,S}\,\frac{1}{n}\big\|A(\hat{x} - x^\dagger)\big\|_2^2 \;\gtrsim\; \frac{\sigma^2\, s\log_2(ed/s)}{\min\{m, n\}}. \tag{4.21}$$
Again, we see that the projection dimension m must be of the order of n in order to
match the mean-squared error of the constrained least-squares estimate x∗ up to constant
factors.
8 This setup is slightly unrealistic, since the estimator is assumed to know the radius $R = \|x^\dagger\|_1$. In practice, one solves the least-squares problem with a Lagrangian constraint, but the underlying arguments are essentially the same.
(e.g., [34–37]), it is reasonable to model the matrix X† as being a low-rank matrix. Note
that a rank constraint on the matrix $X$ can be written as an $\ell_0$-“norm” sparsity constraint on its singular values. In particular, we have
$$\mathrm{rank}(X) \leq r \quad\text{if and only if}\quad \sum_{j=1}^{\min\{d_1, d_2\}} \mathbb{1}\big[\gamma_j(X) > 0\big] \leq r,$$
where $\gamma_j(X)$ denotes the $j$th singular value of $X$. This observation motivates a standard relaxation of the rank constraint using the nuclear norm $\|X\|_{\mathrm{nuc}} := \sum_{j=1}^{\min\{d_1,d_2\}}\gamma_j(X)$.
Accordingly, let us consider the constrained least-squares problem
$$X^* = \arg\min_{X\in\mathbb{R}^{d_1\times d_2}} \frac{1}{2}\|Y - AX\|_{\mathrm{fro}}^2 \quad\text{such that } \|X\|_{\mathrm{nuc}} \leq R, \tag{4.23}$$
where $\|\cdot\|_{\mathrm{fro}}$ denotes the Frobenius norm on matrices, or equivalently the Euclidean norm on its vectorized version. Let $C_0$ denote the set of matrices with rank $r < \frac{1}{2}\min\{d_1, d_2\}$ and Frobenius norm at most one. In this case the constrained least-squares solution $X^*$ satisfies the bound
$$\mathrm{E}\,\frac{1}{n}\big\|A(X^* - X^\dagger)\big\|_2^2 \;\lesssim\; \frac{\sigma^2\, r\,(d_1 + d_2)}{n}. \tag{4.24a}$$
On the other hand, the $1/2$-packing number of the set $C_0$ is lower-bounded as $\log_2 M \gtrsim r\,(d_1 + d_2)$ (see [36] for a proof), so that Theorem 4.3 implies that any estimator $\hat{X}$ based on the pair $(SA, SY)$ has MSE lower-bounded as
$$\mathrm{E}_{w,S}\,\frac{1}{n}\big\|A(\hat{X} - X^\dagger)\big\|_2^2 \;\gtrsim\; \frac{\sigma^2\, r\,(d_1 + d_2)}{\min\{m, n\}}. \tag{4.24b}$$
As with the previous examples, we see the sub-optimality of the sketched approach
in the regime m < n.
We may use an iterative method to obtain $x^*$ which uses the gradient $\nabla f(x) = A^T(Ax - b)$ and Hessian $\nabla^2 f(x) = A^TA$ to minimize the second-order Taylor expansion of $f(x)$ at a current iterate $x^t$, using $\nabla f(x^t)$ and $\nabla^2 f(x^t)$, as follows:
$$x^{t+1} = x^t + \arg\min_{x\in C}\Big\{\big\|\big(\nabla^2 f(x^t)\big)^{1/2}x\big\|_2^2 + 2x^T\nabla f(x^t)\Big\} \tag{4.27}$$
$$\phantom{x^{t+1}} = x^t + \arg\min_{x\in C}\Big\{\|Ax\|_2^2 - 2x^TA^T(b - Ax^t)\Big\}. \tag{4.28}$$
We apply a sketching matrix $S$ to the data $A$ in the formulation (4.28) and define this procedure as an iterative sketch:
$$x^{t+1} = x^t + \arg\min_{x\in C}\Big\{\|SAx\|_2^2 - 2x^TA^T(b - Ax^t)\Big\}. \tag{4.29}$$
Note that this procedure uses more information than the classical sketch (4.9); in particular, it calculates the left matrix–vector multiplications with the data $A$ in the following order:
$$s_1^TA,\;\; s_2^TA,\;\; \ldots,\;\; s_m^TA,\;\; (b - Ax^1)^TA,\;\; \ldots,\;\; (b - Ax^t)^TA,$$
where $s_1^T, \ldots, s_m^T$ are the rows of the sketching matrix $S$. This can be considered as an adaptive form of sketching where the residual directions $(b - Ax^t)$ are used after the random directions $s_1, \ldots, s_m$. As a consequence, the information-theoretic bounds
we considered in Section 4.4.6 do not apply to iterative sketching. In Pilanci and
Wainwright [23], it is shown that this algorithm achieves the minimax statistical risk
given in (4.16) using at most O(log2 n) iterations while obtaining equivalent speedups
from sketching. We also note that the iterative sketching method can also be applied
to more general convex optimization problems other than the least-squares objective.
We refer the reader to Pilanci and Wainwright [38] for the application of sketching in
solving general convex optimization problems.
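For the unconstrained case $C = \mathbb{R}^d$, the iterative sketch (4.29) can be implemented in a few lines. The snippet below is a minimal sketch that draws a fresh Gaussian sketching matrix at every iteration, an assumption made here for simplicity.

```python
import numpy as np

def iterative_sketch(A, b, m, iters, rng):
    """Unconstrained version of (4.29): each step solves the sketched
    subproblem  argmin_u ||S A u||^2 - 2 u^T A^T (b - A x_t)."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.normal(scale=np.sqrt(1.0 / m), size=(m, n))
        SA = S @ A
        grad = A.T @ (b - A @ x)
        u = np.linalg.solve(SA.T @ SA, grad)   # sketched Hessian, exact gradient
        x = x + u
    return x

rng = np.random.default_rng(0)
n, d = 4000, 40
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
x_it = iterative_sketch(A, b, m=8 * d, iters=10, rng=rng)
print(np.linalg.norm(A @ (x_it - x_ls)) / np.linalg.norm(A @ x_ls))
```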
[39, 40]. For these regression problems it is customary to consider the kernel ridge
regression (KRR) problem based on convex optimization
$$\hat{f} = \arg\min_{f\in\mathcal{H}}\bigg\{\frac{1}{2n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda\|f\|_{\mathcal{H}}^2\bigg\}. \tag{4.31}$$
for all collections of points {x1 , ..., xn }, {y1 , ..., yn } and ∀r ∈ Z+ . The vector space of all
functions of the form
$$f(\cdot) = \sum_{i=1}^{r} y_i\, K(\cdot, x_i)$$
generates an RKHS by taking closure of all such linear combinations. It can be shown
that this RKHS is uniquely associated with the kernel function K (see Aronszajn [41] for
details). Let us define a finite-dimensional kernel matrix $K$ using the $n$ covariates as follows:
$$K_{ij} = \frac{1}{n}K(x_i, x_j),$$
which is a positive semidefinite matrix. In the linear least-squares regression the
kernel matrix reduces to the Gram matrix given by K = AT A. It is also known that
the above infinite-dimensional program can be recast as a finite-dimensional quadratic
optimization problem involving the kernel matrix
$$\hat{w} = \arg\min_{w\in\mathbb{R}^n} \frac{1}{2}\Big\|Kw - \frac{1}{\sqrt{n}}y\Big\|_2^2 + \lambda\, w^TKw \tag{4.32}$$
$$\phantom{\hat{w}} = \arg\min_{w\in\mathbb{R}^n} \frac{1}{2}w^TK^2w - w^T\frac{Ky}{\sqrt{n}} + \lambda\, w^TKw, \tag{4.33}$$
and we can find the optimal solution to the infinite-dimensional problem (4.31) via the following relation:⁹
$$\hat{f}(\cdot) = \frac{1}{n}\sum_{i=1}^{n}\hat{w}_i\, K(\cdot, x_i). \tag{4.34}$$
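The finite-dimensional problem (4.32)–(4.34) can be solved directly by linear algebra, as in the following sketch; the Gaussian RBF kernel, its bandwidth, and the value of $\lambda$ are illustrative assumptions. Setting the gradient of (4.32) to zero gives $K[(K + 2\lambda I)w - y/\sqrt{n}] = 0$, so that, when $K$ is invertible, $\hat{w} = (K + 2\lambda I)^{-1}y/\sqrt{n}$.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=3.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = rbf_kernel(X, X) / n          # kernel matrix K_ij = K(x_i, x_j)/n
lam = 1e-4

# Stationarity condition of (4.32), assuming K is invertible.
w_hat = np.linalg.solve(K + 2 * lam * np.eye(n), y / np.sqrt(n))

# Evaluate the estimate (4.34) at a new point, and check the in-sample fit:
# under this normalization, K w_hat tracks y / sqrt(n).
x_new = np.array([[0.2]])
print(rbf_kernel(x_new, X) @ w_hat / n)
print(np.linalg.norm(K @ w_hat - y / np.sqrt(n)) / np.linalg.norm(y))
```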
We now define a kernel complexity measure that is based on the eigenvalues of the ker-
nel matrix K. Let λ1 ≥ λ2 ≥ · · · ≥ λn correspond to the real eigenvalues of the symmetric
positive-definite kernel matrix K. The kernel complexity is defined as follows.
9 Our definition of the kernel optimization problem differs slightly from the literature. The classical kernel problem can be recovered by the variable change $\tilde{w} = K^{1/2}w$, where $K^{1/2}$ is the matrix square root. We refer the reader to [40] for more details on kernel-based methods.
where c0 is a numerical constant and δ∗ (n) is the critical radius defined in (4.35).
The lower bound given by Theorem 4.4 can be shown to be tight, and is achieved
by the kernel-based optimization procedure (4.33) and (4.34) (see Bartlett et al. [20]).
The proof of Theorem 4.4 can be found in Yang et al. [42]. We may define the effective
dimension $d^*(n)$ of the kernel via the relation
$$d^*(n) := n\,\delta^*(n)^2.$$
This definition allows us to interpret the convergence rate in (4.36) as d∗ (n)/n, which
resembles the classical parametric convergence rate where the number of variables is
d∗ (n).
$w = S^Tv$. The next theorem shows that the sketched kernel-based optimization method achieves the optimal prediction error.
theorem 4.5 Let $S \in \mathbb{R}^{m\times n}$ be a Gaussian sketching matrix with $m \geq c_3\, d^*(n)$, and choose $\lambda = 3\delta^*(n)$. Given $n$ i.i.d. samples from the model (4.30), the sketching procedure (4.42) produces a regression estimate $\hat{f}$ which satisfies the bound
$$\frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}(x_i) - f^*(x_i)\big)^2 \leq c_2\,\delta^*(n)^2,$$
$$\frac{I(SA; A)}{nd} = \frac{1}{nd}\big\{H(A) - H(A\,|\,SA)\big\} = \frac{1}{nd}\,D\big(P_{SA,A}\,\big\|\,P_{SA}P_{A}\big),$$
where we normalize by $nd$ since the data matrix $A$ has $nd$ entries in total. The following
corollary is a direct application of Theorem 4.1.
corollary 4.1 Let the entries of the matrix $A$ be i.i.d. from an arbitrary distribution with finite variance $\sigma^2$. Using sketched data, we can obtain an $\epsilon$-approximate¹⁰ solution to the optimization problem while ensuring that the revealed mutual information satisfies
$$\frac{I(SA; A)}{nd} \leq \frac{c_0\, W^2(AK)}{\epsilon^2\, n}\,\log_2\big(2\pi e\sigma^2\big).$$
Therefore, we can guarantee the mutual-information privacy of sketching-based methods whenever the term $W(AK)$ is small.
An alternative and popular characterization of privacy is referred to as the differential
privacy (see Dwork et al. [9]), where other randomized methods, such as additive noise
for preserving privacy, were studied. It is also possible to directly analyze differential
privacy-preserving aspects of random projections as considered in Blocki et al. [10].
10 Here an $\epsilon$-approximate solution refers to the approximation defined in Theorem 4.1, relative to the optimal value.
be updated more than once, and the value $a$ is an arbitrary real number. The sketches introduced in this chapter provide a valuable data structure when the matrix is very large, so that storing and updating it directly can be impractical. Owing to the linearity of sketches, we can update the sketch $SA$ by adding $a\,(Se_i)e_j^T$ to $SA$, and thus maintain an approximation with limited memory.
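The following short sketch illustrates this streaming update: when entry $(i, j)$ of $A$ receives an additive update $a$, only one column of the stored sketch $SA$ changes, via the rank-one term $a(Se_i)e_j^T$. The verification against the fully materialized $A$ is only for the example; in a real streaming setting $A$ would never be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 8, 64
A = np.zeros((n, d))
S = rng.normal(scale=np.sqrt(1.0 / m), size=(m, n))
SA = S @ A                                  # sketch maintained in O(m*d) memory

def update(SA, S, i, j, a):
    """Process the stream item 'add a to A[i, j]' without storing A."""
    SA[:, j] += a * S[:, i]                 # SA += a * (S e_i) e_j^T
    return SA

# Feed a few updates and check against sketching the fully materialized matrix.
for _ in range(500):
    i, j, a = rng.integers(n), rng.integers(d), rng.standard_normal()
    A[i, j] += a                            # kept here only for the check below
    SA = update(SA, S, i, j, a)
print(np.allclose(SA, S @ A))
```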
The following theorem due to Clarkson and Woodruff [49] provides a lower bound
of the space used by any algorithm for least-squares regression which performs a single
pass over the data.
theorem 4.6 Any randomized 1-pass algorithm which returns an $\epsilon$-approximate solution to the unconstrained least-squares problem with probability at least $7/9$ needs $\Omega\big(d^2(1/\epsilon + \log(nd))\big)$ bits of space.
This theorem confirms that the space complexity of sketching for unconstrained
least-squares regression is near optimal. Because of the choice of the sketching
dimension m = O(d), the space used by the sketch SA is O(d2 ), which is optimal up to
constants according to the theorem.
In this section, we illustrate the sketching method numerically and confirm the theoreti-
cal predictions of Theorems 4.1 and 4.3. We consider both the classical low-dimensional
statistical regime where $n > d$, and the $\ell_1$-constrained least-squares minimization known as LASSO (see Tibshirani [51]):
$$x^* = \arg\min_{x:\, \|x\|_1\leq\lambda} \|Ax - b\|_2.$$
We generate a random i.i.d. data matrix $A \in \mathbb{R}^{n\times d}$, where $n = 10\,000$ and $d = 1000$, and set the observation vector $b = Ax^\dagger + \sigma w$, where $x^\dagger \in \{-1, 0, 1\}^d$ is a random $s$-sparse vector and $w$ has i.i.d. $\mathcal{N}(0, 10^{-4})$ components. For the sketching matrix $S \in \mathbb{R}^{m\times n}$, we consider Gaussian and Rademacher ($\pm 1$ i.i.d.-valued) random matrices, where $m$ ranges between 10 and 400. Consequently, we solve the sketched program
$$\hat{x} = \arg\min_{x:\, \|x\|_1\leq\lambda} \|SAx - Sb\|_2.$$
Figures 4.5 and 4.6 show the relative prediction mean-squared error given by the ratio
$$\frac{(1/n)\|A(\hat{x} - x^\dagger)\|_2^2}{(1/n)\|A(x^* - x^\dagger)\|_2^2},$$
where it is averaged over 20 realizations of the sketching matrix, and $\hat{x}$ and $x^*$ are the sketched and the original solutions, respectively.
lower bounds given in Theorems 4.1 and 4.3, the prediction mean-squared error of
the sketched estimator scales as O((s log d)/m), since the corresponding Gaussian
complexity W1 (K)2 is O(s log d). These plots reveal that the prediction mean-squared
error of the sketched estimators for both Gaussian and Rademacher sketches are in
agreement with the theory.
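The following is a minimal numpy sketch of this experiment at reduced problem sizes. The constrained programs are solved by projected gradient descent with an ℓ1-ball projection; the sizes, step sizes, and iteration counts are illustrative choices rather than the exact settings behind Figs. 4.5 and 4.6.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, s, m = 2000, 200, 10, 100          # smaller than in the text, for a quick run

A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.choice([-1.0, 1.0], s)
b = A @ x_true + 1e-2 * rng.standard_normal(n)
radius = np.sum(np.abs(x_true))          # lambda in the constraint ||x||_1 <= lambda

def project_l1(v, z):
    """Euclidean projection onto the l1-ball of radius z (Duchi et al.)."""
    if np.sum(np.abs(v)) <= z:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - z)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_lasso(M, y, z, iters=500):
    """Projected gradient descent for min ||Mx - y||_2^2 subject to ||x||_1 <= z."""
    step = 1.0 / np.linalg.norm(M, 2) ** 2
    x = np.zeros(M.shape[1])
    for _ in range(iters):
        x = project_l1(x - step * M.T @ (M @ x - y), z)
    return x

x_orig = constrained_lasso(A, b, radius)                      # original program
S = rng.standard_normal((m, n)) / np.sqrt(m)                  # Gaussian sketch
x_sketch = constrained_lasso(S @ A, S @ b, radius)            # sketched program

ratio = np.sum((A @ (x_sketch - x_true)) ** 2) / np.sum((A @ (x_orig - x_true)) ** 2)
print("relative prediction MSE of the sketched solution:", ratio)
```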
Figure 4.5 Sketching LASSO using Gaussian random projections (relative prediction mean-squared error versus sketch dimension m).
Figure 4.6 Sketching LASSO using Rademacher random projections (relative prediction mean-squared error versus sketch dimension m).
4.9 Conclusion
In this chapter, we have studied information-theoretic bounds for sketching methods
applied to constrained optimization problems. Sketching yields faster algorithms with
lower space complexity while maintaining strong approximation guarantees.
For the upper bound on the approximation accuracy in Theorem 4.2, Gaussian
complexity plays an important role, and also provides a geometric characterization of
the dimension of the sketch. The lower bounds given in Theorem 4.3 are statistical in
nature, and involve packing numbers, and consequently metric entropy, which measures
the complexity of the sets. The upper bounds on the Gaussian sketch can be extended
to Rademacher sketches, sub-Gaussian sketches, and randomized orthogonal system
sketches (see Pilanci and Wainwright [22] and also Yun et al. [42] for the proofs). How-
ever, the results for non-Gaussian sketches often involve superfluous logarithmic factors
and large constants as artifacts of the analysis. As can be observed in Figs. 4.5 and 4.6,
the mean-squared error curves for Gaussian and Rademacher sketches are in agreement
with each other. It can be conjectured that the approximation ratio of sketching is uni-
versal for random matrices with entries sampled from well-behaved distributions. This
is an important theoretical question for future research. We refer the reader to the work
of Donoho and Tanner [51] for observations of the universality in compressed sensing.
Finally, a number of important limitations of the analysis techniques need to be
considered. The minimax criterion (4.16) is by definition a worst-case criterion, and
may not correctly reflect the average error of sketching when the unknown vector x† is
randomly distributed. Furthermore, in some applications it might be suitable to
incorporate prior information on the unknown vector. Studying lower bounds for
sketching in a Bayesian setting is therefore an interesting direction for future research.
where the infimum ranges over all testing functions ψ. Consequently, it suffices to show
that the testing error is lower-bounded by 1/2.
In order to do so, we first apply Fano’s inequality [27] conditionally on the sketching
matrix S and get
    P[ψ(Ȳ) ≠ J] = E_S { P[ψ(Ȳ) ≠ J | S] } ≥ 1 − ( E_S[I_S(Ȳ; J)] + 1 ) / log₂ M,    (4.39)
where I_S(Ȳ; J) denotes the mutual information between Ȳ and J with S fixed. Our next
step is to upper-bound the expectation E_S[I_S(Ȳ; J)].
Letting D(P_{x_j} ‖ P_{x_k}) denote the Kullback–Leibler (KL) divergence between the
distributions P_{x_j} and P_{x_k}, the convexity of the KL divergence implies that

    I_S(Ȳ; J) = (1/M) Σ_{j=1}^M D( P_{x_j} ‖ (1/M) Σ_{k=1}^M P_{x_k} )
              ≤ (1/M²) Σ_{j,k=1}^M D( P_{x_j} ‖ P_{x_k} ).
By Markov's inequality applied to the random variable ‖x̂ − x†‖²_A, we have

    E[ ‖x̂ − x†‖²_A ] ≥ δ² P[ ‖x̂ − x†‖²_A ≥ δ² ].    (4.40)
Now note that

    sup_{x*∈C} P[ ‖x̂ − x*‖_A ≥ δ ] ≥ max_{j∈{1,...,M}} P[ ‖x̂ − x^{(j)}‖_A ≥ δ | J_δ = j ]
                                    ≥ (1/M) Σ_{j=1}^M P[ ‖x̂ − x^{(j)}‖_A ≥ δ | J_δ = j ],    (4.41)
since every element of the packing set satisfies x^{(j)} ∈ C and the discrete maximum
is upper-bounded by the average over {1, . . . , M}. Since we have P[J_δ = j] = 1/M, we
equivalently have

    (1/M) Σ_{j=1}^M P[ ‖x̂ − x^{(j)}‖_A ≥ δ | J_δ = j ] = Σ_{j=1}^M P[ ‖x̂ − x^{(j)}‖_A ≥ δ | J_δ = j ] P[J_δ = j]
                                                      = P[ ‖x̂ − x^{(J_δ)}‖_A ≥ δ ].    (4.42)
Combining the above with the earlier lower bound (4.41) and the identity (4.42), we
obtain
    sup_{x*∈C} P[ ‖x̂ − x*‖_A ≥ δ ] ≥ P[ φ(Z) ≠ J_δ ] ≥ inf_φ P[ φ(Z) ≠ J_δ ],
where the second inequality follows by taking the infimum over all tests, which can only
make the probability smaller. Plugging in the above lower bound in (4.40) completes
the proof of the lemma.
References
[1] S. Vempala, The random projection method. American Mathematical Society, 2004.
[2] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Univer-
sal encoding strategies?” IEEE Trans. Information Theory, vol. 52, no. 12, pp. 5406–5425,
2006.
[3] N. Halko, P. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilis-
tic algorithms for constructing approximate matrix decompositions,” SIAM Rev., vol. 53,
no. 2, pp. 217–288, 2011.
[4] M. W. Mahoney, Randomized algorithms for matrices and data. Now Publishers, 2011.
[5] D. P. Woodruff, “Sketching as a tool for numerical linear algebra,” Foundations and Trends
Theoretical Computer Sci., vol. 10, nos. 1–2, pp. 1–157, 2014.
[6] S. Muthukrishnan, “Data streams: Algorithms and applications,” Foundations and Trends
Theoretical Computer Sci., vol. 1, no. 2, pp. 117–236, 2005.
[7] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift in Honor of Lucien Le Cam. Springer,
1997, pp. 423–435.
[8] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of
convergence,” Annals Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[9] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in
private data analysis,” in Proc. Theory of Cryptography Conference, 2006, pp. 265–284.
[10] J. Blocki, A. Blum, A. Datta, and O. Sheffet, “The Johnson–Lindenstrauss transform
itself preserves differential privacy,” in Proc. 2012 IEEE 53rd Annual Symposium on
Foundations of Computer Science, 2012, pp. 410–419.
[11] N. Ailon and B. Chazelle, “Approximate nearest neighbors and the fast Johnson-
Lindenstrauss transform,” in Proc. 38th Annual ACM Symposium on Theory of Computing,
2006, pp. 557–563.
[12] P. Drineas and M. W. Mahoney, “Effective resistances, statistical leverage, and applications
to linear equation solving,” arXiv:1005.3097, 2010.
[13] D. A. Spielman and N. Srivastava, “Graph sparsification by effective resistances,” SIAM J.
Computing, vol. 40, no. 6, pp. 1913–1926, 2011.
[14] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in
International Colloquium on Automata, Languages, and Programming, 2002, pp. 693–703.
[15] D. M. Kane and J. Nelson, “Sparser Johnson–Lindenstrauss transforms,” J. ACM, vol. 61,
no. 1, article no. 4, 2014.
[16] J. Nelson and H. L. Nguyên, “Osnap: Faster numerical linear algebra algorithms via sparser
subspace embeddings,” in Proc. 2013 IEEE 54th Annual Symposium on Foundations of
Computer Science (FOCS), 2013, pp. 117–126.
[17] J. Hiriart-Urruty and C. Lemaréchal, Convex analysis and minimization algorithms.
Springer, 1993, vol. 1.
[18] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[19] M. Ledoux and M. Talagrand, Probability in Banach spaces: Isoperimetry and processes.
Springer, 1991.
[20] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local Rademacher complexities,” Annals
Statist., vol. 33, no. 4, pp. 1497–1537, 2005.
[21] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of
linear inverse problems,” Foundations Computational Math., vol. 12, no. 6, pp. 805–849,
2012.
[22] M. Pilanci and M. J. Wainwright, “Randomized sketches of convex programs with sharp
guarantees,” UC Berkeley, Technical Report, 2014, full-length version at arXiv:1404.7203;
Presented in part at ISIT 2014.
[23] M. Pilanci and M. J. Wainwright, “Iterative Hessian sketch: Fast and accurate solution
approximation for constrained least-squares,” J. Machine Learning Res., vol. 17, no. 1,
pp. 1842–1879, 2016.
[46] Q. Le, T. Sarlós, and A. Smola, “Fastfood – approximating kernel expansions in loglinear
time,” in Proc. 30th International Conference on Machine Learning, 2013, 9 unnumbered
pages.
[47] S. Zhou, J. Lafferty, and L. Wasserman, “Compressed regression,” IEEE Trans.
Information Theory, vol. 55, no. 2, pp. 846–866, 2009.
[48] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information,” IEEE Trans. Information
Theory, vol. 52, no. 2, pp. 489–509, 2004.
[49] K. L. Clarkson and D. P. Woodruff, “Numerical linear algebra in the streaming model,” in
Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, 2009,
pp. 205–214.
[50] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc.
Ser. B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[51] D. Donoho and J. Tanner, “Observed universality of phase transitions in high-dimensional
geometry, with implications for modern data analysis and signal processing,” Phil. Trans.
Roy. Soc. London A: Math., Phys. Engineering Sci., vol. 367, no. 1906, pp. 4273–4293,
2009.
5 Sample Complexity Bounds for
Dictionary Learning from Vector- and
Tensor-Valued Data
Zahra Shakeri, Anand D. Sarwate, and Waheed U. Bajwa
Summary
During the last decade, dictionary learning has emerged as one of the most powerful
methods for data-driven extraction of features from data. While the initial focus on dic-
tionary learning had been from an algorithmic perspective, recent years have seen an
increasing interest in understanding the theoretical underpinnings of dictionary learning.
Many such results rely on the use of information-theoretic analytic tools and help us to
understand the fundamental limitations of different dictionary-learning algorithms. This
chapter focuses on the theoretical aspects of dictionary learning and summarizes existing
results that deal with dictionary learning both from vector-valued data and from tensor-
valued (i.e., multi-way) data, which are defined as data having multiple modes. These
results are primarily stated in terms of lower and upper bounds on the sample complexity
of dictionary learning, defined as the number of samples needed to identify or recon-
struct the true dictionary underlying data from noiseless or noisy samples, respectively.
Many of the analytic tools that help yield these results come from the information-theory
literature; these include restating the dictionary-learning problem as a channel coding
problem and connecting the analysis of minimax risk in statistical estimation to Fano's
inequality. In addition to highlighting the effects of different parameters on the sample
complexity of dictionary learning, this chapter also brings out the potential advantages
of dictionary learning from tensor data and concludes with a set of open problems that
remain unaddressed for dictionary learning.
5.1 Introduction
Modern machine learning and signal processing relies on finding meaningful and suc-
cinct representations of data. Roughly speaking, data representation entails transforming
“raw” data from its original domain to another domain in which it can be processed
more effectively and efficiently. In particular, the performance of any information-
processing algorithm is dependent on the representation it is built on [1]. There are
two major approaches to data representation. In model-based approaches, a predeter-
mined basis is used to transform data. Such a basis can be formed using predefined
134
Sample Complexity Bounds for Dictionary Learning 135
transforms such as the Fourier transform [2], wavelets [3], and curvelets [4]. The
data-driven approach infers transforms from the data to yield efficient representations.
Prior works on data representation show that data-driven techniques generally outper-
form model-based techniques as the learned transformations are tuned to the input
signals [5, 6].
Since contemporary data are often high-dimensional and high-volume, we need
efficient algorithms to manage them. In addition, rapid advances in sensing and data-
acquisition technologies in recent years have resulted in individual data samples or
signals with multimodal structures. For example, a single observation may contain
measurements from a two-dimensional array over time, leading to a data sample
with three modes. Such data are often termed tensors or multi-way arrays [7]. Spe-
cialized algorithms can take advantage of this tensor structure to handle multimodal
data more efficiently. These algorithms represent tensor data using fewer parame-
ters than vector-valued data-representation methods by means of tensor decomposi-
tion techniques [8–10], resulting in reduced computational complexity and storage
costs [11–15].
In this chapter, we focus on data-driven representations. As data-collection sys-
tems grow and proliferate, we will need efficient data representations for processing,
storage, and retrieval. Data-driven representations have successfully been used for sig-
nal processing and machine-learning tasks such as data compression, recognition, and
classification [5, 16, 17]. From a theoretical standpoint, there are several interest-
ing questions surrounding data-driven representations. Assuming there is an unknown
generative model forming a “true” representation of data, these questions include the
following. (1) What algorithms can be used to learn the representation effectively? (2)
How many data samples are needed in order for us to learn the representation? (3) What
are the fundamental limits on the number of data samples needed in order for us to
learn the representation? (4) How robust are the solutions addressing these questions to
parameters such as noise and outliers? In particular, state-of-the-art data-representation
algorithms have excellent empirical performance but their non-convex geometry makes
analyzing them challenging.
The goal of this chapter is to provide a brief overview of some of the aforementioned
questions for a class of data-driven-representation methods known as dictionary learning
(DL). Our focus here will be both on the vector-valued and on the tensor-valued (i.e.,
multidimensional/multimodal) data cases.
The first class consists of linear methods, which include principal component analysis
(PCA) [5], linear discriminant analysis (LDA) [18], and
independent component analysis (ICA) [19].
The second class consists of nonlinear methods. Despite the fact that historically lin-
ear representations have been preferred over nonlinear methods because of their lesser
computational complexity, recent advances in available analytic tools and computational
power have resulted in an increased interest in nonlinear representation learning. These
techniques have enhanced performance and interpretability compared with linear tech-
niques. In nonlinear methods, data are transformed into a higher-dimensional space,
in which they lie on a low-dimensional manifold [6, 20–22]. In the world of nonlin-
ear transformations, nonlinearity can take different forms. In manifold-based methods
such as diffusion maps, data are projected onto a nonlinear manifold [20]. In kernel
(nonlinear) PCA, data are projected onto a subspace in a higher-dimensional space [21].
Auto-encoders encode data according to the desired task [22]. DL uses a union of sub-
spaces as the underlying geometric structure and projects input data onto one of the
learned subspaces in the union. This leads to sparse representations of the data, which
can be written in the form of an overcomplete matrix multiplied by a sparse vector [6].
Although nonlinear representation methods result in non-convex formulations,
we can often take advantage of the problem structure to guarantee the existence of a
unique solution and hence an optimal representation.
Focusing specifically on DL, it is known to have slightly higher computational com-
plexity than linear methods, but it surpasses their performance in applications such as
image denoising and inpainting [6], audio processing [23], compressed sensing [24],
and data classification [17, 25]. More specifically, given input training signals y ∈ Rm ,
the goal in DL is to construct a basis such that y ≈ Dx. Here, D ∈ R^{m×p} is the dictionary,
which has unit-norm columns, and x ∈ R^p is the dictionary coefficient vector
that has a few non-zero entries. While the initial focus in DL had been on algorith-
mic development for various problem setups, works in recent years have also provided
fundamental analytic results that help us understand the fundamental limits and perfor-
mance of DL algorithms for both vector-valued [26–33] and tensor-valued [12, 13, 15]
data.
There are two paradigms in the DL literature: the dictionary can be assumed to be a
complete or an overcomplete basis (effectively, a frame [34]). In both cases, columns of
the dictionary span the entire space [27]; in complete dictionaries, the dictionary matrix
is square (m = p), whereas in overcomplete dictionaries the matrix has more columns
than rows (m < p). In general, overcomplete representations result in more flexibility to
allow both sparse and accurate representations [6].
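A minimal numpy sketch of this generative model, with a unit-norm-column dictionary and s-sparse coefficient vectors, is given below; all sizes and the noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, s, N = 20, 50, 3, 1000          # signal dimension, dictionary size, sparsity, samples

# Overcomplete dictionary (m < p) with unit-norm columns.
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)

# Sparse coefficient vectors: s non-zero entries in random positions.
X = np.zeros((p, N))
for n in range(N):
    support = rng.choice(p, size=s, replace=False)
    X[support, n] = rng.standard_normal(s)

Y = D @ X + 0.01 * rng.standard_normal((m, N))   # noisy observations y ≈ Dx
print(Y.shape)   # (m, N)
```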
(Figure: taxonomy of data representation, Model-Based vs. Data-Driven, Linear vs. Nonlinear, Square vs. Overcomplete, Vector-Valued Data vs. Tensor Data; the scope of this chapter is marked in the figure.)
Figure 5.1 A graphical representation of the scope of this chapter in relation to the literature on
representation learning.
estimation, i.e., the number of observations that are necessary to recover the true
dictionary that generates the data up to some predefined error. The main information-
theoretic tools that are used to derive these results range from reformulating the
dictionary-learning problem as a channel coding problem to connecting the minimax
risk analysis to Fano’s inequality. We refer the reader to Fig. 5.1 for a graphical overview
of the relationship of this chapter to other themes in representation learning.
We address the DL problem for vector-valued data in Section 5.2, and that for ten-
sor data in Section 5.3. Finally, we talk about extensions of these works and some open
problems in DL in Section 5.4. We focus here only on the problems of identifiability and
fundamental limits; in particular, we do not survey DL algorithms in depth apart from
some brief discussion in Sections 5.2 and 5.3. The monograph of Okoudjou [35] dis-
cusses algorithms for vector-valued data. Algorithms for tensor-valued data are relatively
more recent and are described in our recent paper [13].
We first address the problem of reliable estimation of dictionaries underlying data that
have a single mode, i.e., are vector-valued. In particular, we focus on the subject of the
sample complexity of the DL problem from two perspectives: (1) fundamental limits
on the sample complexity of DL using any DL algorithm, and (2) the numbers of sam-
ples that are needed for different DL algorithms to reliably estimate a true underlying
dictionary that generates the data.
1 A frame F ∈ R^{m×p}, m ≤ p, is defined as a collection of vectors {F_i ∈ R^m}_{i=1}^p in some separable Hilbert
space H that satisfy c1‖v‖₂² ≤ Σ_{i=1}^p |⟨F_i, v⟩|² ≤ c2‖v‖₂² for all v ∈ H and for some constants c1 and c2 such
that 0 < c1 ≤ c2 < ∞. If c1 = c2, then F is a tight frame [36, 37].
    E_Y { d( D̂(Y), D0 )² },    (5.4)
where d(·, ·) is some distance metric and D(Y) is the recovered dictionary according to
observations Y. For example, if we restrict the analysis to a local neighborhood of the
generating dictionary, then we can use the Frobenius norm as the distance metric.
We now discuss an optimization approach to solving the dictionary recovery problem.
Understanding the objective function within this approach is the key to understanding
the sample complexity of DL. Recall that solving the DL problem involves using the
observations to estimate a dictionary D̂ such that D̂ is close to D0. In the ideal case, the
objective function involves solving the statistical risk minimization problem

    D̂ ∈ arg min_{D∈D} E_y{ inf_{x∈X} (1/2)‖y − Dx‖₂² + R(x) }.    (5.5)
Here, R(·) is a regularization operator that enforces the pre-specified structure, such
as sparsity, on the coefficient vectors. Typical choices for this parameter include
functions of ‖x‖₀ or its convex relaxation, ‖x‖₁.2 However, solving (5.5) requires knowl-
edge of exact distributions of the problem parameters as well as high computational
power. Hence, works in the literature resort to algorithms that solve the empirical risk
minimization (ERM) problem [40]:
    D̂ ∈ arg min_{D∈D} { Σ_{n=1}^N inf_{x^n∈X} (1/2)‖y^n − Dx^n‖₂² + R(x^n) }.    (5.6)
In particular, to provide analytic results, many estimators solve this problem in lieu
of (5.5) and then show that the solution of (5.6) is close to (5.5).
There are a number of computational algorithms that have been proposed to solve
(5.6) directly for various regularizers, or indirectly using heuristic approaches. One of
the most popular heuristic approaches is the K-SVD algorithm, which can be thought
of as solving (5.6) with ℓ0-norm regularization [6]. There are also other methods such
as the method of optimal directions (MOD) [41] and online DL [25] that solve (5.6)
with convex regularizers. While these algorithms have been known to perform well in
practice, attention has shifted in recent years to theoretical studies to (1) find the funda-
mental limits of solving the statistical risk minimization problem in (5.5), (2) determine
conditions on objective functions like (5.6) to ensure recovery of the true dictionary, and
(3) characterize the number of samples needed for recovery using either (5.5) or (5.6). In
this chapter, we are also interested in understanding the sample complexity for the DL
statistical risk minimization and ERM problems. We summarize such results in the exist-
ing literature for the statistical risk minimization of DL in Section 5.2.2 and for the ERM
problem in Section 5.2.3. Because the measure of closeness or error differs between
these theoretical results, the corresponding sample complexity bounds are different.
2 The so-called ℓ0-norm counts the number of non-zero entries of a vector; it is not a norm.
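As an illustration of the alternating structure these methods share, the following is a minimal toy sketch in the spirit of MOD [41] with an ℓ1 (ISTA) sparse-coding step. The parameter choices and iteration counts are illustrative assumptions, and this is not the K-SVD, MOD, or online DL algorithm as published.

```python
import numpy as np

def ista(D, Y, lam, iters=50):
    """Sparse coding: approximately minimize 0.5*||y - Dx||_2^2 + lam*||x||_1 per column of Y."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(iters):
        G = X - step * D.T @ (D @ X - Y)
        X = np.sign(G) * np.maximum(np.abs(G) - step * lam, 0.0)   # soft threshold
    return X

def dictionary_learning(Y, p, lam=0.1, outer_iters=20, seed=0):
    """Alternate between sparse coding and a MOD-style least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(outer_iters):
        X = ista(D, Y, lam)                       # coefficient update with D fixed
        D = Y @ np.linalg.pinv(X)                 # dictionary update: D = Y X^+
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # re-normalize columns
    return D, X

# Toy usage on synthetic data generated from the DL model (sizes are illustrative).
rng = np.random.default_rng(5)
D0 = rng.standard_normal((20, 40)); D0 /= np.linalg.norm(D0, axis=0)
X0 = np.zeros((40, 500))
for n in range(500):
    X0[rng.choice(40, 3, replace=False), n] = rng.standard_normal(3)
Y = D0 @ X0 + 0.01 * rng.standard_normal((20, 500))
D_hat, _ = dictionary_learning(Y, p=40)
```

In practice the sparse-coding step would typically use OMP or a dedicated ℓ1 solver, and the dictionary update would use K-SVD-style atom-by-atom refinement; the sketch above only conveys the alternating structure.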
remark 5.1 In this section, we assume that the data are available in a batch, central-
ized setting and the dictionary is deterministic. In the literature, DL algorithms have
been proposed for other settings such as streaming data, distributed data, and Bayesian
dictionaries [42–45]. Discussion of these scenarios is beyond the scope of this chapter.
In addition, some works have looked at ERM problems that are different from (5.6). We
briefly discuss these works in Section 5.4.
Note that the minimax risk does not depend on any specific DL method and provides a
lower bound for the error achieved by any estimator.
The first result we present pertains to lower bounds on the minimax risk, i.e., minimax
lower bounds, for the DL problem using the Frobenius norm as the distance metric
between dictionaries. The result is based on the following assumption.
A1.1 (Local recovery) The true dictionary lies in the neighborhood of a fixed, known
reference dictionary,3 D* ∈ D, i.e., D0 ∈ D̃, where

    D̃ = { D | D ∈ D, ‖D − D*‖_F ≤ r }.    (5.8)

The range for the neighborhood radius r in (5.8) is (0, 2√p]. This conditioning comes
from the fact that, for any D, D′ ∈ D, ‖D − D′‖_F ≤ ‖D‖_F + ‖D′‖_F = 2√p. By restricting
dictionaries to this class, for small enough r, ambiguities that are a consequence of using
the Frobenius norm can be prevented. We also point out that any lower bound on ε* is
also a lower bound on the global DL problem.
theorem 5.1 (Minimax lower bounds [33]) Consider a DL problem for vector-valued
data with N i.i.d. observations and true dictionary D0 satisfying assumption A1.1 for
some r ∈ (0, 2√p]. Then, for any coefficient distribution with mean zero and covariance
matrix Σ_x, and white Gaussian noise with mean zero and variance σ², the minimax risk
ε* is lower-bounded as

    ε* ≥ c1 min{ r², (σ²/(N‖Σ_x‖₂)) ( c2 p(m − 1) − 1 ) },    (5.9)
for some positive constants c1 and c2 .
3 The use of a reference dictionary is an artifact of the proof technique and, for sufficiently large r, D̃ ≈ D.
Theorem 5.1 holds both for square and for overcomplete dictionaries. To obtain
this lower bound on the minimax risk, a standard information-theoretic approach is
taken in [33] to reduce the dictionary-estimation problem to a multiple-hypothesis-testing
problem. In this technique, given fixed D* and r, and L ∈ N, a packing D_L =
{D^1, D^2, . . . , D^L} ⊆ D of D is constructed. The distance of the packing is chosen to
ensure a tight lower bound on the minimax risk. Given observations Y = D^l X + W, where
D^l ∈ D_L and the index l is chosen uniformly at random from {1, . . . , L}, and any estima-
tion algorithm that recovers a dictionary D̂(Y), a minimum-distance detector can be used
to find the recovered dictionary index l̂ ∈ {1, . . . , L}. Then, Fano's inequality can be used
to relate the probability of error, i.e., P(l̂(Y) ≠ l), to the mutual information between the
observations and the dictionary (equivalently, the dictionary index l), i.e., I(Y; l) [46].
Let us assume that r is sufficiently large that the minimizer of the left-hand side
of (5.9) is the second term. In this case, Theorem 5.1 states that, to achieve any error
ε ≥ ε∗ , we need the number of samples to be on the order of
    N = Ω( σ² mp / (‖Σ_x‖₂ ε) ).4
Hence, the lower bound on the minimax risk of DL can be translated to a lower bound
on the number of necessary samples, as a function of the desired dictionary error. This
can further be interpreted as a lower bound on the sample complexity of the dictionary
recovery problem.
We can also specialize this result to sparse coefficient vectors. Assume xn has up to s
non-zero elements, and the random support of the non-zero elements of xn is assumed to
be uniformly distributed over the set {S ⊆ {1, . . . , p} : |S| = s}, for n ∈ {1, . . . , N}. Assum-
ing that the non-zero entries of x^n are i.i.d. with variance σ_x², we get Σ_x = (s/p)σ_x² I_p.
Therefore, for sufficiently large r, the sample complexity scaling to achieve any error ε
becomes Ω((σ² mp²)/(σ_x² sε)). In this special case, it can be seen that, in order to achieve
a fixed error ε, the sample complexity scales with the number of degrees of freedom of
the dictionary multiplied by the number of dictionary columns, i.e., N = Ω(mp2 ). There
is also an inverse dependence on the sparsity level s. Defining the signal-to-noise ratio
of the observations as SNR = (sσ2x )/(mσ2 ), this can be interpreted as an inverse relation-
ship with the SNR. Moreover, if all parameters except the data dimension, m, are fixed,
increasing m requires a linear increase in N. Evidently, this linear relation is limited by
the fact that m ≤ p has to hold in order to maintain completeness or overcompleteness of
the dictionary: increasing m by a large amount requires increasing p also.
While the tightness of this result remains an open problem, Jung et al. [33] have
shown that for a special class of square dictionaries that are perturbations of the identity
matrix, and for sparse coefficients following a specific distribution, this result is order-
wise tight. In other words, a square dictionary that is perturbed from the identity matrix
can be recovered from this sample size order. Although this result does not extend to
overcomplete dictionaries, it suggests that the lower bounds may be tight.
4 We use f (n) = Ω(g(n)) and f (n) = O(g(n)) if, for sufficiently large n ∈ N, f (n) > c1 g(n) and f (n) < c2 g(n),
respectively, for some positive constants c1 and c2 .
Finally, while distance metrics that are invariant to dictionary ambiguities have
been used for achievable overcomplete dictionary recovery results [30, 31], obtaining
minimax lower bounds for DL using these distance metrics remains an open problem.
In this subsection, we discussed the number of necessary samples for reliable dictio-
nary recovery (the sample complexity lower bound). In the next subsection, we focus
on achievability results, i.e., the number of sufficient samples for reliable dictionary
recovery (the sample complexity upper bound).
Noiseless Recovery
We begin by discussing the first work that proves local identifiability of the overcomplete
DL problem. The objective function that is considered in that work is
    (X̂, D̂) = arg min_{D∈D, X} ‖X‖₁   subject to   Y = DX,    (5.10)

where ‖X‖₁ ≜ Σ_{i,j} |X_{i,j}| denotes the sum of the absolute values of the entries of X.
This result is based on the following set of assumptions.
A2.1 (Gaussian random coefficients) The values of the non-zero entries of the x^n's are
independent Gaussian random variables with zero mean and common standard
deviation σ_x = p/(sN).
A2.2 (Sparsity level) The sparsity level satisfies s ≤ min{ c1/μ(D0), c2 p } for some
constants c1 and c2.
theorem 5.2 (Noiseless, local recovery [29]) There exist positive constants c1 , c2 such
that if assumptions A2.1 and A2.2 are satisfied for true (X, D0 ), then (X, D0 ) is a local
minimum of (5.10) with high probability.
The probability in this theorem depends on various problem parameters and implies
that N = Ω(sp³) samples are sufficient for the desired solution, i.e., true dictionary and
coefficient matrix, to be locally recoverable. The proof of this theorem relies on studying
the local properties of (5.10) around its optimal solution and does not require defining a
distance metric.
We now present a result that is based on the use of a combinatorial algorithm, which
can provably and exactly recover the true dictionary. The proposed algorithm solves
the objective function (5.6) with R(x) = λ‖x‖₁, where λ is the regularization parameter
and the distance metric that is used is the column-wise distance. Specifically, for two
dictionaries D1 and D2 , their column-wise distance is defined as
    d(D1_j, D2_j) = min_{l∈{−1,1}} ‖D1_j − lD2_j‖₂,   j ∈ {1, . . . , p},    (5.11)
where D1j and D2j are the jth column of D1 and D2 , respectively. This distance metric
avoids the sign ambiguity among dictionaries belonging to the same equivalence class.
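A minimal numpy implementation of the column-wise distance (5.11) is given below; it handles only the sign ambiguity, not the permutation ambiguity used later by Arora et al. [31].

```python
import numpy as np

def columnwise_distance(D1, D2):
    """d(D1_j, D2_j) = min_{l in {-1,+1}} ||D1_j - l*D2_j||_2 for each column j, as in (5.11)."""
    d_plus = np.linalg.norm(D1 - D2, axis=0)
    d_minus = np.linalg.norm(D1 + D2, axis=0)
    return np.minimum(d_plus, d_minus)

# Example: flipping the sign of a column does not change the distance.
rng = np.random.default_rng(6)
D = rng.standard_normal((10, 5)); D /= np.linalg.norm(D, axis=0)
D_flipped = D * np.array([1, -1, 1, -1, 1])
print(columnwise_distance(D, D_flipped))   # all (numerically) zero
```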
To solve (5.6), Agarwal et al. [30] provide a novel DL algorithm that consists of an
initial dictionary-estimation stage and an alternating minimization stage to update the
dictionary and coefficient vectors. The provided guarantees are based on using this algo-
rithm to update the dictionary and coefficients. The forthcoming result is based on the
following set of assumptions.
A3.1 (Bounded random coefficients) The non-zero entries of the x^n's are drawn from a
zero-mean unit-variance distribution and their magnitudes satisfy x_min ≤ |x^n_i| ≤ x_max.
A3.2 (Sparsity level) The sparsity level satisfies s ≤ min{ c1/μ(D0), c2 m^{1/9}, c3 p^{1/8} }
for some positive constants c1, c2, c3 that depend on x_min, x_max, and the spectral
norm of D0.
A3.3 (Dictionary assumptions) The true dictionary has bounded spectral norm, i.e.,
‖D0‖₂ ≤ c4 √(p/m), for some positive c4.
theorem 5.3 (Noiseless, exact recovery [30]) Consider a DL problem with N i.i.d.
observations and assume that assumptions A3.1–A3.3 are satisfied. Then, there exists a
universal constant c such that, for a given η > 0, if

    N ≥ c (x_max/x_min)² p² log(2p/η),    (5.12)

there exists a procedure consisting of an initial dictionary-estimation stage and an alter-
nating minimization stage such that, after T = O(log(1/ε)) iterations of the second stage,
with probability at least 1 − 2η − 2ηN², d(D̂, D0) ≤ ε, ∀ε > 0.
This theorem guarantees that the true dictionary can be recovered to an arbitrary pre-
cision given N = Ω(p2 log p) samples. This result is based on two steps. The first step is
guaranteeing an error bound for the initial dictionary-estimation step. This step involves
using a clustering-style algorithm to approximate the dictionary columns. The second
step is proving a local convergence result for the alternating minimization stage. This
step involves improving estimates of the coefficient vectors and the dictionary through
Lasso [47] and least-square steps, respectively. More details for this work can be found
in [30].
While the works in [29, 30] study the sample complexity of the overcomplete DL
problem, they do not take noise into account. Next, we present works that obtain sample
complexity results for noisy reconstruction of dictionaries.
Noisy Reconstruction
The next result we discuss is based on the following objective function:
    max_{D∈D} (1/N) Σ_{n=1}^N max_{|S|=s} ‖P_S(D) y^n‖₂²,    (5.13)

where P_S(D) denotes the projection onto the span of D_S = {D_j}_{j∈S}.5 Here, the distance
metric that is used is d(D1, D2) = max_{j∈{1,...,p}} ‖D1_j − D2_j‖₂. In addition, the results are
based on the following set of assumptions.
A4.1 (Unit-norm tight frame) The true dictionary is a unit-norm tight frame, i.e., for
all v ∈ R^m we have Σ_{j=1}^p ⟨D0_j, v⟩² = p‖v‖₂²/m.
5 This objective function can be thought of as a manipulation of (5.6) with the ℓ0-norm regularizer for the
coefficient vectors. See equation 2 of [38] for more details.
theorem 5.4 (Noisy, local recovery [38]) Consider a DL problem with N i.i.d. obser-
vations and assume that assumptions A4.1–A4.5 are satisfied. If, for some 0 < q < 1/4,
the number of samples satisfies

    2N^{−q} + N^{−2q} ≤ c1 (1 − δ_s(D0)) / [ s ( 1 + c2 √( log(c3 p √(ps)) / (c4 s (1 − δ_s(D0))) ) ) ],    (5.16)

then, with high probability, there is a local maximum of (5.13) within distance at most
2N^{−q} of D0.
The constants c1 , c2 , c3 , and c4 in Theorem 5.4 depend on the underlying dictio-
nary, coefficient vectors, and the underlying noise. The proof of this theorem relies on
the fact that, for the true dictionary and its perturbations, the maximal response, i.e.,
‖P_S(D̃)D0 x^n‖₂,7 is attained for the set S = S_π for most signals. A detailed explanation
of the theorem and its proof can be found in the paper of Schnass [38].
In order to understand Theorem 5.4, let us set q ≈ 1/4 − (log p)/(log N). We can then
understand this theorem as follows. Given N/log N = Ω(mp³), except with probability
O(N^{−mp}), there is a local maximum of (5.13) within distance O(pN^{−1/4}) of the true
dictionary. Moreover, since the objective function that is considered in this work is also
the one solved by the K-SVD algorithm, this result gives an understanding of the performance
of the K-SVD algorithm. Compared with results with R(x) being a function of the ℓ1-
norm [29, 30], this result requires the true dictionary to be a tight frame. On the flip side,
6 A probability measure ν on the unit sphere S^{p−1} is called symmetric if, for all measurable sets X ⊆ S^{p−1},
for all sign sequences l ∈ {−1, 1}^p and all permutations π : {1, . . . , p} → {1, . . . , p}, we have
    ν(lX) = ν(X), where lX = { (l1 x1, . . . , l_p x_p) : x ∈ X },
    ν(π(X)) = ν(X), where π(X) = { (x_{π(1)}, . . . , x_{π(p)}) : x ∈ X }.    (5.14)
7 D̃ can be D0 itself or some perturbation of D0.
the coefficient vector in Theorem 5.4 is not necessarily sparse; instead, it only has to
satisfy a decaying condition.
Next, we present a result obtained by Arora et al. [31] that is similar to that of
Theorem 5.3 in the sense that it uses a combinatorial algorithm that can provably recover
the true dictionary given noiseless observations. It further obtains dictionary reconstruc-
tion results for the case of noisy observations. The objective function considered in this
work is similar to that of the K-SVD algorithm and can be thought of as (5.6) with
R(x) = λ‖x‖₀, where λ is the regularization parameter.
Arora et al. [31] define two dictionaries D1 and D2 to be column-wise ε close if
there exists a permutation π and l ∈ {−1, 1} such that D1j − lD2π( j) ≤ ε. This distance
2
metric captures the distance between equivalent classes of dictionaries and avoids the
sign-permutation ambiguity. They propose a DL algorithm that first uses combinatorial
techniques to recover the support of coefficient vectors, by clustering observations into
overlapping clusters that use the same dictionary columns. To find these large clusters,
a clustering algorithm is provided. Then, the dictionary is roughly estimated given the
clusters, and the solution is further refined. The provided guarantees are based on using
the proposed DL algorithm. In addition, the results are based on the following set of
assumptions.
A5.1 (Bounded coefficient distribution) Non-zero entries of x^n are drawn from a
zero-mean distribution and lie in [−x_max, −1] ∪ [1, x_max], where x_max = O(1).
Moreover, conditioned on any subset of coordinates in x^n being non-zero, the
non-zero values of x^n_i are independent of each other. Finally, the distribution has
bounded 3-wise moments, i.e., the probability that x^n is non-zero in any subset
S of three coordinates is at most c3 times ∏_{i∈S} P[x^n_i ≠ 0], where c3 = O(1).8
A5.2 (Gaussian noise) The w^n's are independent and follow a spherical Gaussian
distribution with standard deviation σ = o(√m).
A5.3 (Dictionary coherence) The true dictionary is μ-incoherent, that is, for all i ≠ j,
|⟨D0_i, D0_j⟩| ≤ μ(D0)/√m, and μ(D0) = O(log(m)).
A5.4 (Sparsity level) The sparsity level satisfies s ≤ c1 min{ p^{2/5}, √m/(μ(D0) log m) },
for some positive constant c1.
theorem 5.5 (Noisy, exact recovery [31]) Consider a DL problem with N i.i.d.
observations and assume that assumptions A5.1–A5.4 are satisfied. Provided that

    N = Ω( σ² ε^{−2} p log p ( p/s² + s² + log(1/ε) ) ),    (5.17)

there is a universal constant c1 and a polynomial-time algorithm that learns the under-
lying dictionary. With high probability, this algorithm returns a D̂ that is column-wise ε
close to D0.
8 This condition is trivially satisfied if the set of the locations of non-zero entries of xn is a random subset
of size s.
For desired error ε, the run-time of the algorithm and the sample complexity depend
on log(1/ε). With the addition of noise, there is also a dependence on ε^{−2} for N, which
is inevitable for noisy reconstruction of the true dictionary [31, 38]. In the noiseless
setting, this result translates into N = Ω( p log p ( p/s² + s² + log(1/ε) ) ).
9 The sign of the vector v is defined as l = sign(v), whose elements are l_i = v_i/|v_i| for v_i ≠ 0 and l_i = 0 for
v_i = 0, where i denotes any index of the elements of v.
where cmin and cmax depend on problem parameters such as s, the coefficient
distribution, and D0 .
" 2
A6.5 (Sparsity level) The sparsity level satisfies s ≤ p 16 D0 2 + 1 .
A6.6 (Radius range) The error radius ε > 0 satisfies ε ∈ λ̄cmin , λ̄cmax .
A6.7 (Outlier energy) Given inlier matrix Y = {yn }n=1 N and outlier matrix Yout =
n Nout
{y }n=1 , the energy of Yout satisfies
√ 2 ⎡ & ⎤
Yout 1,2 c1 ε sE x2 A0 3/2 ⎢⎢⎢ 1 c λ̄ mp + η ⎥⎥⎥
≤ ⎢⎣ 1 − min − c2 ⎥⎦, (5.21)
N λ̄E{|x|} p p ε N
where Yout 1,2 denotes the sum of the 2 -norms of the columns of Yout , c1 and
c2 are positive constants, independent of parameters, and A0 is the lower frame
2
bound of D0 , i.e., A0 v22 ≤ D0T v2 for any v ∈ Rm .
theorem 5.6 (Noisy with outliers, local recovery [32]) Consider a DL problem with N
i.i.d. observations and assume that assumptions A6.1–A6.6 are satisfied. Suppose

    N > c0 (mp + η) p² ( M_x² / E{‖x‖₂²} )² ( ε + ((M_w/M_x) + λ̄)² + (M_w/M_x) + λ̄ ) / ( ε − c_min λ̄ );    (5.22)

then, with probability at least 1 − 2^{−η}, (5.6) admits a local minimum within distance ε of
D0. In addition, this result is robust to the addition of the outlier matrix Y_out, provided that
the assumption in A6.7 is satisfied.
The proof of this theorem relies on using the Lipschitz continuity property of the
objective function in (5.6) with respect to the dictionary and sample complexity analysis
using Rademacher averages and Slepian’s Lemma [48]. Theorem 5.6 implies that
    N = Ω( (mp³ + ηp²) ( M_w/(M_x ε) )² )    (5.23)

samples are sufficient for the existence of a local minimum within distance ε of the true
dictionary D0, with high probability. In the noiseless setting, this result translates into
N = Ω(mp³), and the sample complexity becomes independent of the radius ε. Furthermore,
complexity results depend on the presence or absence of noise and outliers. All the
presented results require that the underlying dictionary satisfies incoherence conditions
in some way. For a one-to-one comparison of these results, the bounds for the case of
absence of noise and outliers can be compared. A detailed comparison of the noiseless
recovery for square and overcomplete dictionaries can be found in Table I of [32].
Many of today’s data are collected using various sensors and tend to have a multidi-
mensional/tensor structure (Fig. 5.2). Examples of tensor data include (1) hyperspectral
images that have three modes, two spatial and one spectral; (2) colored videos that have
four modes, two spatial, one depth, and one temporal; and (3) dynamic magnetic res-
onance imaging in a clinical trial that has five modes, three spatial, one temporal, and
one subject. To find representations of tensor data using DL, one can follow two paths.
A naive approach is to vectorize tensor data and use traditional vectorized representa-
tion learning techniques. A better approach is to take advantage of the multidimensional
structure of data to learn representations that are specific to tensor data. While the main
focus of the literature on representation learning has been on the former approach, recent
works have shifted focus to the latter approaches [8–11]. These works use various tensor
decompositions to decompose tensor data into smaller components. The representation
learning problem can then be reduced to learning the components that represent different
modes of the tensor. This results in a reduction in the number of degrees of freedom in
the learning problem, due to the fact that the dimensions of the representations learned
for each mode are significantly smaller than the dimensions of the representation learned
for the vectorized tensor. Consequently, this approach gives rise to compact and efficient
representation of tensors.
To understand the fundamental limits of dictionary learning for tensor data, one can
use the sample complexity results in Section 5.2, which are a function of the underlying
dictionary dimensions. However, considering the reduced number of degrees of freedom
in the tensor DL problem compared with vectorized DL, this problem should be solv-
able with a smaller number of samples. In this section, we formalize this intuition and
address the problem of reliable estimation of dictionaries underlying tensor data. As in
the previous section, we will focus on the subject of sample complexity of the DL prob-
lem from two perspectives: (1) fundamental limits on the sample complexity of DL for
tensor data using any DL algorithm, and (2) the numbers of samples that are needed for
different DL algorithms in order to reliably estimate the true dictionary from which the
tensor data are generated.
Table 5.1 Summary of sample complexity results for DL of vector-valued data

Reference              Distance metric                              Regularizer   Sample complexity
Jung et al. [33]       ‖D1 − D2‖_F                                  ℓ0            mp²/ε²
Geng et al. [29]       –                                            ℓ1            sp³
Agarwal et al. [30]    min_{l∈{±1}} ‖D1_j − lD2_j‖₂                 ℓ1            p² log p
Schnass et al. [38]    max_j ‖D1_j − D2_j‖₂                         ℓ1            mp³/ε²
Arora et al. [31]      min_{l∈{±1},π} ‖D1_j − lD2_{π(j)}‖₂          ℓ0            p log p (p/s² + s² + log(1/ε))
Gribonval et al. [32]  ‖D1 − D2‖_F                                  ℓ1            mp³/ε²
Figure 5.2 Two of countless examples of tensor data in today’s sensor-rich world.
better understand the results reported in this section, we first need to define some tensor
notation that will be useful throughout this section.
Tensor Unfolding: Elements of tensors can be rearranged to form matrices. Given
a Kth-order tensor X ∈ R^{p1×···×pK}, its mode-k unfolding is denoted by
X_(k) ∈ R^{pk × ∏_{i≠k} pi}. The columns of X_(k) are formed by fixing all the indices
except the one in the kth mode.
Tensor Multiplication: The mode-k product between the Kth-order tensor X ∈
R^{p1×···×pK} and a matrix A ∈ R^{mk×pk} is defined as

    (X ×_k A)_{i1,...,ik−1, j, ik+1,...,iK} = Σ_{ik=1}^{pk} X_{i1,...,ik−1, ik, ik+1,...,iK} A_{j, ik}.    (5.24)

    Y = X ×1 D1 ×2 D2 ×3 · · · ×K DK,    (5.25)

where ⊗ denotes the matrix Kronecker product [50] and vec(·) denotes the stacking of
the columns of a matrix into one column. We will use the shorthand notation
vec(Y) to denote vec(Y_(1)) and ⊗_k D_k to denote D1 ⊗ · · · ⊗ DK.
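The following small numpy check illustrates the link between mode-wise products and Kronecker-structured matrices for a second-order tensor. Note that with column-major vectorization the factors appear as D2 ⊗ D1; the exact ordering implied by the chapter's shorthand ⊗_k D_k depends on its unfolding convention, so this is an illustration rather than a statement of that convention.

```python
import numpy as np

rng = np.random.default_rng(7)
m1, m2, p1, p2 = 4, 3, 6, 5
D1 = rng.standard_normal((m1, p1))
D2 = rng.standard_normal((m2, p2))
X = rng.standard_normal((p1, p2))          # second-order coefficient tensor

# Mode products: Y = X x_1 D1 x_2 D2, which for a matrix X equals D1 @ X @ D2.T.
Y = D1 @ X @ D2.T

# Kronecker form with column-major vectorization: vec(Y) = (D2 kron D1) vec(X).
vecY = np.kron(D2, D1) @ X.flatten(order="F")
print(np.allclose(vecY, Y.flatten(order="F")))   # True
```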
data modes, and (2) Kronecker-structured matrices have successfully been used for
data representation in applications such as magnetic resonance imaging, hyperspectral
imaging, video acquisition, and distributed sensing [8, 9].
coefficient tensor, and W^n ∈ R^{m1×···×mK} is the underlying noise tensor. In this case, the
true dictionary D0 ∈ R^{m×p} is Kronecker-structured (KS) and has the form

    D0 = ⊗_k D0_k,   m = ∏_{k=1}^K m_k,   and   p = ∏_{k=1}^K p_k,

where

    D0_k ∈ D_k = { D_k ∈ R^{mk×pk} : ‖D_{k,j}‖₂ = 1  ∀ j ∈ {1, . . . , pk} }.    (5.28)
Comparing (5.27) with the traditional formulation in (5.1), it can be seen that KS-DL
also involves vectorizing the observation tensor, but it has the main difference that the
structure of the tensor is captured in the underlying KS dictionary. An illustration of this
for a second-order tensor is shown in Fig. 5.3. As with (5.3), we can stack the vectorized
observations, yn = vec(Yn ), vectorized coefficient tensors, xn = vec(Xn ), and vectorized
noise tensors, wn = vec(Wn ), in columns of Y, X, and W, respectively. We now discuss
the role of sparsity in coefficient tensors for dictionary learning. While in vectorized DL
it is usually assumed that the random support of non-zero entries of xn is uniformly dis-
tributed, there are two different definitions of the random support of Xn for tensor data.
(1) Random sparsity. The random support of xn is uniformly distributed over the set
{S ⊆ {1, . . . , p} : |S| = s}.
(2) Separable sparsity. The random support of x^n is uniformly distributed over the
set S that is related to {S1 × · · · × SK : Sk ⊆ {1, . . . , pk}, |Sk| = sk} via lexicographic
indexing. Here, s = ∏_k s_k.
Separable sparsity requires non-zero entries of the coefficient tensor to be grouped in
blocks. This model also implies that the columns of Y(k) have sk -sparse representations
with respect to D0k [53].
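A small numpy sketch contrasting the two support models for a second-order coefficient tensor is given below; the dimensions and sparsity levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
p1, p2, s1, s2 = 8, 6, 2, 3
p, s = p1 * p2, s1 * s2

# Random sparsity: s non-zero positions anywhere in the p1 x p2 coefficient tensor.
X_rand = np.zeros((p1, p2))
flat_support = rng.choice(p, size=s, replace=False)
X_rand[np.unravel_index(flat_support, (p1, p2))] = rng.standard_normal(s)

# Separable sparsity: the support is a Cartesian product S1 x S2, so non-zeros form a block.
S1 = rng.choice(p1, size=s1, replace=False)
S2 = rng.choice(p2, size=s2, replace=False)
X_sep = np.zeros((p1, p2))
X_sep[np.ix_(S1, S2)] = rng.standard_normal((s1, s2))

print(np.count_nonzero(X_rand), np.count_nonzero(X_sep))   # both equal s = s1*s2
```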
Figure 5.3 Illustration of the distinctions of KS-DL versus vectorized DL for a second-order
tensor: both vectorize the observation tensor, but the structure of the tensor is exploited in the KS
dictionary, leading to the learning of two coordinate dictionaries with fewer parameters than the
dictionary learned in vectorized DL.
The aim in KS-DL is to estimate coordinate dictionaries D̂_k such that they are close
to the D0_k's. In this scenario, the statistical risk minimization problem has the form

    { D̂_1, . . . , D̂_K } ∈ arg min_{ {D_k∈D_k}_{k=1}^K } E{ inf_{x∈X} (1/2)‖ y − (⊗_k D_k) x ‖₂² + R(x) },    (5.30)
where R(·) is a regularization operator on the coefficient vectors. Various KS-DL algo-
rithms have been proposed that solve (5.31) heuristically by means of optimization tools
such as alternating minimization [9] and tensor rank minimization [54], and by taking
advantage of techniques in tensor algebra such as the higher-order SVD for tensors [55].
In particular, an algorithm is proposed in [11], which shows that the Kronecker prod-
uct of any number of matrices can be rearranged to form a rank-1 tensor. In order
to solve (5.31), therefore, in [11] a regularizer is added to the objective function that
enforces this low-rankness on the rearrangement tensor. The dictionary update stage of
this algorithm involves learning the rank-1 tensor and rearranging it to form the KS dic-
tionary. This is in contrast to learning the individual coordinate dictionaries by means of
alternating minimization [9].
In the case of theory for KS-DL, the notion of closeness can have two interpretations.
One is the distance between the true KS dictionary and the recovered KS dictionary, i.e.,
d(D̂(Y), D0). The other is the distance between each true coordinate dictionary and the
corresponding recovered coordinate dictionary, i.e., d(D̂_k(Y), D0_k). While small recovery
errors for coordinate dictionaries imply a small recovery error for the KS dictionary,
the other side of the statement does not necessarily hold. Hence, the latter notion is of
importance when we are interested in recovering the structure of the KS dictionary.
In this section, we focus on the sample complexity of the KS-DL problem. The ques-
tions that we address in this section are as follows. (1) What are the fundamental limits
of solving the statistical risk minimization problem in (5.30)? (2) Under what kind of
conditions do objective functions like (5.31) recover the true coordinate dictionaries and
how many samples do they need for this purpose? (3) How do these limits compare with
their vectorized DL counterparts? Addressing these questions will help in understanding
the benefits of KS-DL for tensor data.
theorem 5.7 (KS-DL minimax lower bounds [13]) Consider a KS-DL problem with
N i.i.d. observations and true KS dictionary D0 satisfying assumption A7.1 for some
r ∈ (0, 2√p]. Then, for any coefficient distribution with mean zero and covariance matrix
Σ_x, and white Gaussian noise with mean zero and variance σ², the minimax risk ε* is
lower-bounded as

    ε* ≥ (t/4) min{ p, r²/(2K), (σ²/(4NK‖Σ_x‖₂)) ( c1 Σ_{k=1}^K (m_k − 1)p_k − (K/2) log₂(2K) − 2 ) },    (5.33)
for any 0 < t < 1 and any 0 < c1 < ((1 − t)/(8 log 2)).
Similarly to Theorem 5.1, the proof of this theorem relies on using the standard
procedure for lower-bounding the minimax risk by connecting it to the maximum prob-
ability of error of a multiple-hypothesis-testing problem. Here, since the constructed
hypothesis-testing class consists of KS dictionaries, the construction procedure and the
minimax risk analysis are different from that in [33].
To understand this theorem, let us assume that r and p are sufficiently large that the
minimizer of the left-hand side of (5.33) is the third term. In this case, Theorem 5.7
states that to achieve any error ε for the Kth-order tensor dictionary-recovery problem,
we need the number of samples to be on the order of
    N = Ω( σ² Σ_k m_k p_k / (K ‖Σ_x‖₂ ε) ).
Comparing this scaling with the results for the unstructured dictionary-learning problem
provided in Theorem 5.1, the lower bound here is decreased from the scaling Ω(mp)
to Ω( Σ_k m_k p_k / K ). This reduction can be attributed to the fact that the average number
of degrees of freedom in a KS-DL problem is Σ_k m_k p_k / K, compared with the number
of degrees of freedom of the vectorized DL problem, which is mp. For the case of K = 2,
m1 = m2 = √m, and p1 = p2 = √p, the sample complexity lower bound scales with
Ω(mp) for vectorized DL and with Ω(√(mp)) for KS-DL. On the other hand, when
m1 = αm, m2 = 1/α and p1 = αp, p2 = 1/α, where α < 1, 1/α ∈ N, the sample com-
plexity lower bound scales with Ω(mp) for KS-DL, which is similar to the scaling for
vectorized DL.
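A short numerical illustration of this parameter-count comparison is given below; the specific dimensions are illustrative. For a balanced factorization the KS-DL count Σ_k m_k p_k is orders of magnitude smaller than mp, while for a highly unbalanced factorization the two are of the same order.

```python
import numpy as np

def dof_vectorized(m_dims, p_dims):
    return int(np.prod(m_dims) * np.prod(p_dims))               # mp parameters

def dof_ks(m_dims, p_dims):
    return int(sum(mk * pk for mk, pk in zip(m_dims, p_dims)))  # sum_k mk*pk parameters

# Balanced case: m1 = m2 = sqrt(m), p1 = p2 = sqrt(p).
print(dof_vectorized([64, 64], [256, 256]), dof_ks([64, 64], [256, 256]))
# 268435456 vs 32768: the KS dictionary has orders of magnitude fewer parameters.

# Unbalanced case: one factor carries almost all of the dimensions.
print(dof_vectorized([2048, 2], [32768, 2]), dof_ks([2048, 2], [32768, 2]))
# Here sum_k mk*pk is within a small constant factor of mp.
```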
Specializing this result to random sparse coefficient vectors and assuming that the
non-zero entries of xn are i.i.d. with variance σ2x , we get Σ x = (s/p)σ2x I p . Therefore, for
sufficiently large r, the sample complexity scaling required in order to achieve any error
ε for strictly sparse representations becomes
    Ω( σ² p Σ_k m_k p_k / (σ_x² s K ε) ).
A very simple KS-DL algorithm is also provided in [13], which can recover a square
KS dictionary that consists of the Kronecker product of two smaller dictionaries and
is a perturbation of the identity matrix. It is shown that, in this case, the lower bound
provided in (5.33) is order-wise achievable for the case of sparse coefficient vectors. This
suggests that the obtained sample complexity lower bounds for overcomplete KS-DL are
not too loose.
In the next section, we focus on achievability results for the KS dictionary recovery
problem, i.e., upper bounds on the sample complexity of KS-DL.
Noisy Recovery
We present a result that states conditions that ensure reliable recovery of the coordinate
dictionaries from noisy observations using (5.31). Shakeri et al. [15] solve (5.31) with
R(x) = λx1 , where λ is a regularization parameter. Here, the coordinate dictionary error
is defined as
εk = D k − D0 , k ∈ {1, . . . , K}, (5.34)
k F
where Dk is the recovered coordinate dictionary. The result is based on the following set
of assumptions.
Table 5.2 Sample complexity scaling of vectorized DL and KS-DL for sparse coefficient vectors

                         Vectorized DL         KS-DL
Minimax lower bound      mp²/ε²  [33]          p Σ_k m_k p_k / (K ε²)  [13]
Achievability bound      mp³/ε²  [32]          max_k m_k p_k³/ε_k²  [15]
The proof of this theorem relies on coordinate-wise Lipschitz continuity of the objec-
tive function in (5.31) with respect to coordinate dictionaries and the use of similar
sample complexity analysis arguments to those in [32]. Theorem 5.8 implies that, for
fixed K and SNR, N = max_{k∈{1,...,K}} Ω( m_k p_k³ ε_k⁻² ) is sufficient for the existence of a local
minimum within distance ε_k of the true coordinate dictionaries, with high probability. This
result holds for coefficients that are generated according to the separable sparsity model.
The case of coefficients generated according to the random sparsity model requires a
different analysis technique that is not explored in [15].
We compare this result with the scaling in the vectorized DL problem in
Theorem 5.6, which stated that N = Ω(mp³ε⁻²) = Ω( (∏_k m_k p_k³) ε⁻² ) is sufficient for the
existence of D0 as a local minimum of (5.6) up to the predefined error ε. In contrast,
N = max_k Ω( m_k p_k³ ε_k⁻² ) is sufficient in the case of tensor data for the existence of the D0_k's as
local minima of (5.31) up to predefined errors ε_k. This reduction in the scaling can be
attributed to the reduction in the number of degrees of freedom of the KS-DL problem.
We can also compare this result with the sample complexity lower-bound scaling
obtained in Theorem 5.7 for KS-DL, which stated that, given sufficiently large r and p,
N = Ω( p Σ_k m_k p_k ε⁻² / K ) is necessary in order to recover the true KS dictionary D0
up to error ε. We can relate ε to the ε_k's using the relation ε ≤ √p Σ_k ε_k [15]. Assum-
ing all of the ε_k's are equal to each other, this implies that ε ≤ √p K ε_k, and we have
N = max_k Ω( 2^K K² p m_k p_k³ ε⁻² ). It can be seen from Theorem 5.7 that the sample com-
plexity lower bound depends on the average dimension of coordinate dictionaries; in
contrast, the sample complexity upper bound reported in this section depends on the
maximum dimension of coordinate dictionaries. There is also a gap between the lower
bound and the upper bound of order max_k p_k². This suggests that the obtained bounds
may be loose.
The sample complexity scaling results in Theorems 5.1, 5.6, 5.7, and 5.8 are shown
in Table 5.2 for sparse coefficient vectors.
In Sections 5.2 and 5.3, we summarized some of the key results of dictionary identifica-
tion for vectorized and tensor data. In this section, we look at extensions of these works
and discuss related open problems.
    max_{D∈D} Σ_{n=1}^N max_{|S|=s} ‖D_S^∗ y^n‖₁.    (5.38)
Given the distance metric d(D1, D2) = max_j ‖D1_j − D2_j‖₂, Schnass shows that the sample
complexity needed to recover a true generating dictionary up to precision ε scales as
O(mp³ε⁻²) using this objective. This sample complexity is achieved by a novel DL
algorithm, called Iterative Thresholding and K-Means (ITKM), that solves (5.38) under
certain conditions on the coefficient distribution, noise, and the underlying dictionary.
Efficient representations can help improve the complexity and performance of
machine-learning tasks such as prediction. This means that a DL algorithm could explic-
itly tune the representation to optimize prediction performance. For example, some
works learn dictionaries to improve classification performance [17, 25]. These works
add terms to the objective function that measure the prediction performance and min-
imize this loss. While these DL algorithms can yield improved performance for their
desired prediction task, proving sample complexity bounds for these algorithms remains
an open problem.
Tightness guarantees. While dictionary identifiability has been well studied for
vector-valued data, there remains a gap between the upper and lower bounds on the sam-
ple complexity. The lower bound presented in Theorem 5.1 is for the case of a particular
distance metric, i.e., the Frobenius norm, whereas the presented achievability results in
Theorems 5.2–5.6 are based on a variety of distance metrics. Restricting the distance
metric to the Frobenius norm, we still observe a gap of order p between the sample
complexity lower bound in Theorem 5.1 and the upper bound in Theorem 5.6. The par-
tial converse result for square dictionaries that is provided in [33] shows that the lower
bound is achievable for square dictionaries close to the identity matrix. For more gen-
eral square matrices, however, the gap may be significant: either improved algorithms
can achieve the lower bounds or the lower bounds may be further tightened. For over-
complete dictionaries the question of whether the upper bound or lower bound is tight
remains open. For metrics other than the Frobenius norm, the bounds are incomparable,
making it challenging to assess the tightness of many achievability results.
Finally, the works reported in Table 5.1 differ significantly in terms of the mathemat-
ical tools they use. Each approach yields a different insight into the structure of the DL
problem. However, there is no unified analytic framework encompassing all of these
perspectives. This gives rise to the following question. Is there a unified mathematical
tool that can be used to generalize existing results on DL?
Acknowledgments
This work is supported in part by the National Science Foundation under awards CCF-
1525276 and CCF-1453073, and by the Army Research Office under award W911NF-
17-1-0546.
References
[1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new
perspectives,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 8,
pp. 1798–1828, 2013.
[2] R. N. Bracewell and R. N. Bracewell, The Fourier transform and its applications.
McGraw-Hill, 1986.
[3] I. Daubechies, Ten lectures on wavelets. SIAM, 1992.
[4] E. J. Candès and D. L. Donoho, “Curvelets: A surprisingly effective nonadaptive repre-
sentation for objects with edges,” in Proc. 4th International Conference on Curves and
Surfaces, 1999, vol. 2, pp. 105–120.
[5] I. T. Jolliffe, “Principal component analysis and factor analysis,” in Principal component
analysis. Springer, 1986, pp. 115–128.
[6] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcom-
plete dictionaries for sparse representation,” IEEE Trans. Signal Processing, vol. 54, no. 11,
pp. 4311–4322, 2006.
[7] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Rev.,
vol. 51, no. 3, pp. 455–500, 2009.
[8] M. F. Duarte and R. G. Baraniuk, “Kronecker compressive sensing,” IEEE Trans. Image
Processing, vol. 21, no. 2, pp. 494–504, 2012.
[9] C. F. Caiafa and A. Cichocki, “Multidimensional compressed sensing and their applica-
tions,” Wiley Interdisciplinary Rev.: Data Mining and Knowledge Discovery, vol. 3, no. 6,
pp. 355–380, 2013.
[10] S. Hawe, M. Seibert, and M. Kleinsteuber, “Separable dictionary learning,” in Proc. IEEE
Conference Computer Vision and Pattern Recognition, 2013, pp. 438–445.
[11] M. Ghassemi, Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, “STARK: Structured dictionary
learning through rank-one tensor recovery,” in Proc. IEEE 7th International Workshop
Computational Advances in Multi-Sensor Adaptive Processing, 2017, pp. 1–5.
[12] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds for Kronecker-
structured dictionary learning,” in Proc. IEEE International Symposium on Information
Theory, 2016, pp. 1148–1152.
[13] Z. Shakeri, W. U. Bajwa, and A. D. Sarwate, “Minimax lower bounds on dictionary learn-
ing for tensor data,” IEEE Trans. Information Theory, vol. 64, no. 4, pp. 2706–2726, 2018.
[14] Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, “Identification of Kronecker-structured dictio-
naries: An asymptotic analysis,” in Proc. IEEE 7th International Workshop Computational
Advances in Multi-Sensor Adaptive Processing, 2017, pp. 1–5.
[15] Z. Shakeri, A. D. Sarwate, and W. U. Bajwa, “Identifiability of Kronecker-structured
dictionaries for tensor data,” IEEE J. Selected Topics Signal Processing, vol. 12, no. 5,
pp. 1047–1062, 2018.
[16] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” IEEE
Trans. Pattern Analysis Machine Intelligence, vol. 27, no. 12, pp. 1945–1959, 2005.
[17] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learn-
ing from unlabeled data,” in Proc. 24th International Conference on Machine Learning,
2007, pp. 759–766.
[18] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals Human
Genetics, vol. 7, no. 2, pp. 179–188, 1936.
[19] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley &
Sons, 2004.
Sample Complexity Bounds for Dictionary Learning 161
[20] R. R. Coifman and S. Lafon, “Diffusion maps,” Appl. Comput. Harmonic Analysis, vol. 21,
no. 1, pp. 5–30, 2006.
[21] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in Proc.
International Conference on Artificial Neural Networks, 1997, pp. 583–588.
[22] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[23] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng, “Shift-invariance sparse coding for audio
classification,” in Proc. 23rd Conference on Uncertainty in Artificial Intelligence, 2007,
pp. 149–158.
[24] J. M. Duarte-Carvajalino and G. Sapiro, “Learning to sense sparse signals: Simultaneous
sensing matrix and sparsifying dictionary optimization,” IEEE Trans. Image Processing,
vol. 18, no. 7, pp. 1395–1408, 2009.
[25] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern
Analysis Machine Intelligence, vol. 34, no. 4, pp. 791–804, 2012.
[26] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictio-
naries, and a practical way to retrieve them,” Linear Algebra Applications, vol. 416, no. 1,
pp. 48–67, 2006.
[27] R. Gribonval and K. Schnass, “Dictionary identification – sparse matrix-factorization via ℓ1-
minimization,” IEEE Trans. Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.
[28] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,”
in Proc. Conference on Learning Theory, 2012, pp. 37.11–37.18.
[29] Q. Geng and J. Wright, “On the local correctness of ℓ1-minimization for dictionary
learning,” in Proc. IEEE International Symposium on Information Theory, 2014, pp. 3180–
3184.
[30] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon, “Learning sparsely used
overcomplete dictionaries,” in Proc. 27th Annual Conference on Learning Theory, 2014,
pp. 1–15.
[31] S. Arora, R. Ge, and A. Moitra, “New algorithms for learning incoherent and overcomplete
dictionaries,” in Proc. 25th Annual Conference Learning Theory, 2014, pp. 1–28.
[32] R. Gribonval, R. Jenatton, and F. Bach, “Sparse and spurious: Dictionary learning with
noise and outliers,” IEEE Trans. Information Theory, vol. 61, no. 11, pp. 6298–6319, 2015.
[33] A. Jung, Y. C. Eldar, and N. Görtz, “On the minimax risk of dictionary learning,” IEEE
Trans. Information Theory, vol. 62, no. 3, pp. 1501–1515, 2015.
[34] O. Christensen, An introduction to frames and Riesz bases. Springer, 2016.
[35] K. A. Okoudjou, Finite frame theory: A complete introduction to overcompleteness.
American Mathematical Society, 2016.
[36] W. U. Bajwa, R. Calderbank, and D. G. Mixon, “Two are better than one: Fundamen-
tal parameters of frame coherence,” Appl. Comput. Harmonic Analysis, vol. 33, no. 1,
pp. 58–78, 2012.
[37] W. U. Bajwa and A. Pezeshki, “Finite frames for sparse signal processing,” in Finite
frames, P. Casazza and G. Kutyniok, eds. Birkhäuser, 2012, ch. 10, pp. 303–335.
[38] K. Schnass, “On the identifiability of overcomplete dictionaries via the minimisation prin-
ciple underlying K-SVD,” Appl. Comput. Harmonic Analysis, vol. 37, no. 3, pp. 464–491,
2014.
[39] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,”
J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp. 49–67, 2006.
[40] V. Vapnik, “Principles of risk minimization for learning theory,” in Proc. Advances in
Neural Information Processing Systems, 1992, pp. 831–838.
162 Zahra Shakeri et al.
[41] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in
Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5,
1999, pp. 2443–2446.
[42] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and
sparse coding,” J. Machine Learning Res., vol. 11, no. 1, pp. 19–60, 2010.
[43] H. Raja and W. U. Bajwa, “Cloud K-SVD: A collaborative dictionary learning algorithm
for big, distributed data,” IEEE Trans. Signal Processing, vol. 64, no. 1, pp. 173–188, 2016.
[44] Z. Shakeri, H. Raja, and W. U. Bajwa, “Dictionary learning based nonlinear classifier train-
ing from distributed data,” in Proc. 2nd IEEE Global Conference Signal and Information
Processing, 2014, pp. 759–763.
[45] M. Zhou, H. Chen, J. Paisley, L. Ren, L. Li, Z. Xing, D. Dunson, G. Sapiro, and
L. Carin, “Nonparametric Bayesian dictionary learning for analysis of noisy and incom-
plete images,” IEEE Trans. Image Processing, vol. 21, no. 1, pp. 130–144, 2012.
[46] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Springer, 1997,
pp. 423–435.
[47] M. J. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery
using ℓ1-constrained quadratic programming (lasso),” IEEE Trans. Information Theory,
vol. 55, no. 5, pp. 2183–2202, 2009.
[48] P. Massart, Concentration inequalities and model selection. Springer, 2007.
[49] L. R. Tucker, “Implications of factor analysis of three-way matrices for measurement
of change,” in Problems Measuring Change. University of Wisconsin Press, 1963,
pp. 122–137.
[50] C. F. Van Loan, “The ubiquitous Kronecker product,” J. Comput. Appl. Math., vol. 123,
no. 1, pp. 85–100, 2000.
[51] R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an
explanatory multi-modal factor analysis,” in UCLA Working Papers in Phonetics, vol. 16,
pp. 1–84, 1970.
[52] M. E. Kilmer, C. D. Martin, and L. Perrone, “A third-order generalization of the matrix
SVD as a product of third-order tensors,” Technical Report, 2008.
[53] C. F. Caiafa and A. Cichocki, “Computing sparse representations of multidimensional
signals using Kronecker bases,” Neural Computation, vol. 25, no. 1, pp. 186–220, 2013.
[54] S. Gandy, B. Recht, and I. Yamada, “Tensor completion and low-n-rank tensor recovery
via convex optimization,” Inverse Problems, vol. 27, no. 2, p. 025010, 2011.
[55] L. De Lathauwer, B. De Moor, and J. Vandewalle, “A multilinear singular value decompo-
sition,” SIAM J. Matrix Analysis Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
[56] K. Schnass, “Local identification of overcomplete dictionaries,” J. Machine Learning Res.,
vol. 16, pp. 1211–1242, 2015.
[57] S. Zubair and W. Wang, “Tensor dictionary learning with sparse Tucker decomposition,”
in Proc. IEEE 18th International Conference on Digital Signal Processing, 2013, pp. 1–6.
[58] F. Roemer, G. Del Galdo, and M. Haardt, “Tensor-based algorithms for learning multidi-
mensional separable dictionaries,” in Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing, 2014, pp. 3963–3967.
[59] C. F. Dantas, M. N. da Costa, and R. da Rocha Lopes, “Learning dictionaries as a sum of
Kronecker products,” IEEE Signal Processing Lett., vol. 24, no. 5, pp. 559–563, 2017.
[60] Y. Zhang, X. Mou, G. Wang, and H. Yu, “Tensor-based dictionary learning for spectral CT
reconstruction,” IEEE Trans. Medical Imaging, vol. 36, no. 1, pp. 142–154, 2017.
[61] S. Soltani, M. E. Kilmer, and P. C. Hansen, “A tensor-based dictionary learning approach to
tomographic image reconstruction,” BIT Numerical Math., vol. 56, no. 4, pp. 1–30, 2015.
6 Uncertainty Relations and Sparse
Signal Recovery
Erwin Riegler and Helmut Bölcskei
Summary
6.1 Introduction
The uncertainty principle in quantum mechanics says that certain pairs of physical prop-
erties of a particle, such as position and momentum, can be known to within a limited
precision only [1]. Uncertainty relations in signal analysis [2–5] state that a signal
and its Fourier transform cannot both be arbitrarily well concentrated; corresponding
mathematical formulations exist for square-integrable or integrable functions [6, 7], for
vectors in $(\mathbb{C}^m, \|\cdot\|_2)$ or $(\mathbb{C}^m, \|\cdot\|_1)$ [6–10], and for finite abelian groups [11, 12]. These
results feature prominently in many areas of the mathematical data sciences. Specifi-
cally, in compressed sensing [6–9, 13, 14] uncertainty relations lead to sparse signal
recovery thresholds, in Gabor and Wilson frame theory [15] they characterize limits on
the time–frequency localization of frame elements, in communications [16] they play a
fundamental role in the design of pulse shapes for orthogonal frequency division multi-
plexing (OFDM) systems [17], in the theory of partial differential equations they serve to
characterize existence and smoothness properties of solutions [18], and in coding theory
they help to understand questions around the existence of good cyclic codes [19].
This chapter provides a principled introduction to uncertainty relations underlying
sparse signal recovery, starting with the seminal work by Donoho and Stark [6], rang-
ing over the Elad–Bruckstein coherence-based uncertainty relation for general pairs
of orthonormal bases [8], later extended to general pairs of dictionaries [10], to the
recently discovered set-theoretic uncertainty relation [13] which leads to information-
theoretic recovery thresholds for general notions of parsimony. We also elaborate on the
remarkable connection [7] between uncertainty relations for signals and their Fourier
transforms, with concentration measured in terms of support, and the “large sieve,” a
family of inequalities involving trigonometric polynomials, originally developed in the
field of analytic number theory [20, 21].
Uncertainty relations play an important role in data science beyond sparse sig-
nal recovery, specifically in the sparse signal separation problem, which comprises
numerous practically relevant applications such as (image or audio signal) inpainting,
declipping, super-resolution, and the recovery of signals corrupted by impulse noise or
by narrowband interference. We provide a systematic treatment of the sparse signal sep-
aration problem and develop its limits out of uncertainty relations for general pairs of
dictionaries as introduced in [10]. While the flavor of these results is that beyond certain
thresholds something is not possible, for example a non-zero vector cannot be concen-
trated with respect to two different orthonormal bases beyond a certain limit, uncertainty
relations can also reveal that something unexpected is possible. Specifically, we demon-
strate that signals that are sparse in certain bases can be recovered in a stable fashion
from partial and noisy observations.
In practice one often encounters more general concepts of parsimony, such as, e.g.,
manifold structures and fractal sets. Manifolds are prevalent in the data sciences, e.g.,
in compressed sensing [22–27], machine learning [28], image processing [29, 30],
and handwritten-digit recognition [31]. Fractal sets find application in image com-
pression and in modeling of Ethernet traffic [32]. In the last part of this chapter, we
develop an information-theoretic framework for sparse signal separation and recovery,
which applies to arbitrary signals of “low description complexity.” The complexity mea-
sure our results are formulated in, namely Minkowski dimension, is agnostic to signal
structure and goes beyond the notion of sparsity in terms of the number of non-zero
entries or concentration in 1-norm or 2-norm. The corresponding recovery thresholds
are information-theoretic in the sense of applying to arbitrary signal structures and pro-
viding results of best possible nature that are, however, not constructive in terms of
recovery algorithms.
To keep the exposition simple and to elucidate the main conceptual aspects, we restrict
ourselves to the finite-dimensional cases $(\mathbb{C}^m, \|\cdot\|_2)$ and $(\mathbb{C}^m, \|\cdot\|_1)$ throughout. Refer-
ences to uncertainty relations for the infinite-dimensional case will be given wherever
possible and appropriate. Some of the results in this chapter have not been reported
before in the literature. Detailed proofs will be provided for most of the statements, with
the goal of allowing the reader to acquire a technical working knowledge that can serve
as a basis for own further research.
The chapter is organized as follows. In Sections 6.2 and 6.3, we derive uncertainty
relations for vectors in $(\mathbb{C}^m, \|\cdot\|_2)$ and $(\mathbb{C}^m, \|\cdot\|_1)$, respectively, discuss the connec-
tion to the large sieve, present applications to noisy signal recovery problems, and
establish a fundamental relation between uncertainty relations for sparse vectors and
null-space properties of the accompanying dictionary matrices. Section 6.4 is devoted
to understanding the role of uncertainty relations in sparse signal separation problems.
In Section 6.5, we generalize the classical sparsity notion as used in compressed
sensing to a more comprehensive concept of description complexity, namely, lower
modified Minkowski dimension, which in turn leads to a set-theoretic null-space
property and corresponding recovery thresholds. Section 6.6 presents a large sieve
inequality in $(\mathbb{C}^m, \|\cdot\|_2)$ that one of our results in Section 6.2 is based on. Section 6.7
lists infinite-dimensional counterparts, available in the literature, to some of the results
in this chapter. In Section 6.8, we provide a proof of the set-theoretic null-space
property stated in Section 6.5. Finally, Section 6.9 contains results on operator norms
used frequently in this chapter.
Notation. For $A \subseteq \{1,\ldots,m\}$, $D_A$ denotes the $m\times m$ diagonal matrix with diagonal entries $(D_A)_{i,i}=1$ for $i\in A$, and $(D_A)_{i,i}=0$ else. With $U\in\mathbb{C}^{m\times m}$ unitary and $A\subseteq\{1,\ldots,m\}$, we define the orthogonal projection $\Pi_A(U)=UD_AU^*$ and set $\mathcal{W}_{U,A}=\mathrm{range}(\Pi_A(U))$. For $x\in\mathbb{C}^m$ and $A\subseteq\{1,\ldots,m\}$, we let $x_A=D_Ax$. With $A\in\mathbb{C}^{m\times m}$, $|||A|||_1=\max_{x:\|x\|_1=1}\|Ax\|_1$ refers to the operator 1-norm, $|||A|||_2=\max_{x:\|x\|_2=1}\|Ax\|_2$ designates the operator 2-norm, $\|A\|_2=\sqrt{\mathrm{tr}(AA^*)}$ is the Frobenius norm, and $\|A\|_1=\sum_{i,j=1}^{m}|A_{i,j}|$. The $m\times m$ discrete Fourier transform (DFT) matrix $F$ has entry $(1/\sqrt{m})\,e^{-2\pi jkl/m}$ in its $k$th row and $l$th column for $k,l\in\{1,\ldots,m\}$. For $x\in\mathbb{R}$, we set $[x]_+=\max\{x,0\}$. The vector $x\in\mathbb{C}^m$ is said to be $s$-sparse if it has at most $s$ non-zero entries. The open ball in $(\mathbb{C}^m,\|\cdot\|_2)$ of radius $\rho$ centered at $u\in\mathbb{C}^m$ is denoted by $\mathcal{B}_m(u,\rho)$, and $V_m(\rho)$ refers to its volume. The indicator function on the set $A$ is $\chi_A$. We use the convention $0\cdot\infty=0$.
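The following minimal Python sketch (the function names D_A, Pi, and Delta are ours, introduced only for illustration) implements the quantities defined in this Notation paragraph and evaluates $\Delta_{P,Q}(U)=|||D_P\Pi_Q(U)|||_2$ for the DFT matrix:

```python
import numpy as np

def D_A(A, m):
    """Diagonal 'time-limitation' matrix D_A with ones on the indices in A (0-based here)."""
    d = np.zeros(m)
    d[list(A)] = 1.0
    return np.diag(d)

def Pi(U, A):
    """Orthogonal projection Pi_A(U) = U D_A U* onto span{u_i : i in A}."""
    m = U.shape[0]
    return U @ D_A(A, m) @ U.conj().T

def Delta(U, P, Q):
    """Delta_{P,Q}(U): operator 2-norm (largest singular value) of D_P Pi_Q(U)."""
    m = U.shape[0]
    return np.linalg.norm(D_A(P, m) @ Pi(U, Q), 2)

m = 8
F = np.exp(-2j * np.pi * np.outer(np.arange(m), np.arange(m)) / m) / np.sqrt(m)  # DFT matrix
print(Delta(F, {0}, set(range(m))))   # P a single index, Q = all indices -> Delta = 1
```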
Donoho and Stark [6] define uncertainty relations as upper bounds on the operator norm
of the band-limitation operator followed by the time-limitation operator. We adopt this
elegant concept and extend it to refer to an upper bound on the operator norm of a general
orthogonal projection operator (replacing the band-limitation operator) followed by the
“time-limitation operator” $D_P$ as an uncertainty relation. More specifically, let $U\in\mathbb{C}^{m\times m}$ be a unitary matrix, $P,Q\subseteq\{1,\ldots,m\}$, and consider the orthogonal projection $\Pi_Q(U)$ onto the subspace $\mathcal{W}_{U,Q}$, which is spanned by $\{u_i : i\in Q\}$. Let$^1$ $\Delta_{P,Q}(U)=|||D_P\Pi_Q(U)|||_2$. In the setting of [6], $U$ would correspond to the DFT matrix $F$ and $\Delta_{P,Q}(F)$ is the operator
2-norm of the band-limitation operator followed by the time-limitation operator, both in
finite dimensions. By Lemma 6.12 we have
1 We note that, for general unitary A, B ∈ Cm×m , unitary invariance of · 2 yields |||ΠP (A)ΠQ (B)|||2 =
|||DP ΠQ (U)|||2 with U = A∗ B. The situation where both the band-limitation operator and the time-limitation
operator are replaced by general orthogonal projection operators can hence be reduced to the case
considered here.
$$\Delta_{P,Q}(U) = \max_{x\in\mathcal{W}_{U,Q}\setminus\{0\}} \frac{\|x_P\|_2}{\|x\|_2}. \qquad (6.1)$$
An uncertainty relation in $(\mathbb{C}^m,\|\cdot\|_2)$ is an upper bound of the form $\Delta_{P,Q}(U)\le c$ with $c\ge 0$, and states that $\|x_P\|_2\le c\|x\|_2$ for all $x\in\mathcal{W}_{U,Q}$. $\Delta_{P,Q}(U)$ hence quantifies how well a vector supported on $Q$ in the basis $U$ can be concentrated on $P$. Note that an uncertainty relation in $(\mathbb{C}^m,\|\cdot\|_2)$ is non-trivial only if $c<1$. Application of Lemma 6.13 now yields
$$\frac{\|D_P\Pi_Q(U)\|_2}{\sqrt{\mathrm{rank}(D_P\Pi_Q(U))}} \le \Delta_{P,Q}(U) \le \|D_P\Pi_Q(U)\|_2, \qquad (6.2)$$
where the upper bound constitutes an uncertainty relation and the lower bound will allow
us to assess its tightness. Next, note that
DP ΠQ (U)2 = tr(DP ΠQ (U)) (6.3)
and
where (6.5) follows from rank(DP UDQ ) ≤ min{|P|, |Q|} and Property (c) in Section 0.4.5
of [33]. When used in (6.2) this implies
tr(DP ΠQ (U))
≤ ΔP,Q (U) ≤ tr(DP ΠQ (U)). (6.6)
min{|P|, |Q|}
Particularizing to $U=F$, we obtain
$$\mathrm{tr}(D_P\Pi_Q(F)) = \mathrm{tr}(D_P F D_Q F^*) \qquad (6.7)$$
$$= \sum_{i\in P}\sum_{j\in Q} |F_{i,j}|^2 \qquad (6.8)$$
$$= \frac{|P||Q|}{m}, \qquad (6.9)$$
so that (6.6) reduces to
$$\sqrt{\frac{\max\{|P|,|Q|\}}{m}} \le \Delta_{P,Q}(F) \le \sqrt{\frac{|P||Q|}{m}}. \qquad (6.10)$$
There exist sets $P,Q\subseteq\{1,\ldots,m\}$ that saturate both bounds in (6.10), e.g., $P=\{1\}$ and $Q=\{1,\ldots,m\}$, which yields $\sqrt{\max\{|P|,|Q|\}/m}=\sqrt{|P||Q|/m}=1$ and therefore $\Delta_{P,Q}(F)=1$.
An example of sets $P,Q\subseteq\{1,\ldots,m\}$ saturating only the lower bound in (6.10) is as follows. Take $n$ to divide $m$ and set
$$P = \left\{\frac{m}{n}, \frac{2m}{n}, \ldots, \frac{(n-1)m}{n}, m\right\} \qquad (6.11)$$
and
Q = {l + 1, . . . , l + n}, (6.12)
with $l\in\{1,\ldots,m\}$ and $Q$ interpreted circularly in $\{1,\ldots,m\}$. Then, the upper bound in (6.10) is
$$\sqrt{\frac{|P||Q|}{m}} = \frac{n}{\sqrt{m}}, \qquad (6.13)$$
whereas the lower bound becomes
$$\sqrt{\frac{\max\{|P|,|Q|\}}{m}} = \sqrt{\frac{n}{m}}. \qquad (6.14)$$
Thus, for $m\to\infty$ with fixed ratio $m/n$, the upper bound in (6.10) tends to infinity whereas the corresponding lower bound remains constant. The following result states that the lower bound in (6.10) is tight for $P$ and $Q$ as in (6.11) and (6.12), respectively. This implies a lack of tightness of the uncertainty relation $\Delta_{P,Q}(F)\le\sqrt{|P||Q|/m}$ by a factor of $\sqrt{n}$. The large sieve-based uncertainty relation developed in the next section will be seen to remedy this problem.
lemma 6.1 (Theorem 11 of [6]) Let $n$ divide $m$ and consider
$$P = \left\{\frac{m}{n}, \frac{2m}{n}, \ldots, \frac{(n-1)m}{n}, m\right\} \qquad (6.15)$$
and
$$Q = \{l+1, \ldots, l+n\}, \qquad (6.16)$$
with $l\in\{1,\ldots,m\}$ and $Q$ interpreted circularly in $\{1,\ldots,m\}$. Then, $\Delta_{P,Q}(F)=\sqrt{n/m}$.
Proof We have
$$\Delta_{P,Q}(F) = |||\Pi_Q(F)D_P|||_2 \qquad (6.17)$$
$$= |||D_Q F^* D_P|||_2 \qquad (6.18)$$
$$= \max_{x:\,x\ne 0} \frac{\|D_Q F^* D_P x\|_2}{\|x\|_2} \qquad (6.19)$$
$$= \max_{x:\,x\ne 0} \frac{\|D_Q F^* x_P\|_2}{\|x\|_2} \qquad (6.20)$$
$$= \max_{x:\,x=x_P,\ x\ne 0} \frac{\|D_Q F^* x\|_2}{\|x\|_2}, \qquad (6.21)$$
where in (6.17) we applied Lemma 6.12 and in (6.18) we used unitary invariance of $|||\cdot|||_2$. Next, consider an arbitrary but fixed $x\in\mathbb{C}^m$ with $x=x_P$ and define $y\in\mathbb{C}^n$ according to $y_s = x_{ms/n}$ for $s=1,\ldots,n$. It follows that
$$\|D_Q F^* x\|_2^2 = \sum_{q\in Q}\left|\frac{1}{\sqrt{m}}\sum_{p\in P} x_p\, e^{\frac{2\pi jpq}{m}}\right|^2 \qquad (6.22)$$
$$= \frac{1}{m}\sum_{q\in Q}\left|\sum_{s=1}^{n} x_{ms/n}\, e^{\frac{2\pi jsq}{n}}\right|^2 \qquad (6.23)$$
$$= \frac{1}{m}\sum_{q\in Q}\left|\sum_{s=1}^{n} y_s\, e^{\frac{2\pi jsq}{n}}\right|^2 \qquad (6.24)$$
$$= \frac{n}{m}\|F^* y\|_2^2 \qquad (6.25)$$
$$= \frac{n}{m}\|y\|_2^2, \qquad (6.26)$$
where $F$ in (6.25) is the $n\times n$ DFT matrix and in (6.26) we used unitary invariance of $\|\cdot\|_2$. With (6.22)–(6.26) and $\|x\|_2=\|y\|_2$ in (6.21), we get $\Delta_{P,Q}(F)=\sqrt{n/m}$.
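A quick numerical check of Lemma 6.1 follows (a sketch with arbitrarily chosen $m$, $n$, and $l$; indices are 0-based in the code, whereas the text uses 1-based indexing):

```python
import numpy as np

def dft(m):
    return np.exp(-2j * np.pi * np.outer(np.arange(m), np.arange(m)) / m) / np.sqrt(m)

def delta_PQ(U, P, Q):
    m = U.shape[0]
    DP = np.diag(np.isin(np.arange(m), list(P)).astype(float))
    DQ = np.diag(np.isin(np.arange(m), list(Q)).astype(float))
    return np.linalg.norm(DP @ U @ DQ @ U.conj().T, 2)   # Delta_{P,Q}(U)

m, n, l = 128, 8, 5
F = dft(m)
P = {(k * m // n) % m for k in range(1, n + 1)}   # spike train with spacing m/n, cf. (6.15)
Q = {(l + i) % m for i in range(1, n + 1)}        # n consecutive indices, circular, cf. (6.16)
print(delta_PQ(F, P, Q), np.sqrt(n / m))          # the two values agree (Lemma 6.1)
```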
$$|x_p|^2 = |(Fa)_p|^2 \qquad (6.30)$$
$$= \frac{1}{m}\left|\sum_{q\in Q} a_q\, e^{-\frac{2\pi jpq}{m}}\right|^2 \qquad (6.31)$$
$$= \frac{1}{m}\left|\sum_{k=1}^{n} a_k\, e^{-\frac{2\pi jpk}{m}}\right|^2 \qquad (6.32)$$
$$= \frac{1}{m}\left|\psi\!\left(\frac{p}{m}\right)\right|^2 \quad\text{for } p\in\{1,\ldots,m\}, \qquad (6.33)$$
where we defined the 1-periodic trigonometric polynomial $\psi(s)$ according to
$$\psi(s) = \sum_{k=1}^{n} a_k\, e^{-2\pi jks}. \qquad (6.34)$$
Next, let $\nu_t$ denote the unit Dirac measure centered at $t\in\mathbb{R}$ and set $\mu=\sum_{p\in P}\nu_{p/m}$ with 1-periodic extension outside $[0,1)$. Then,
$$\|x_P\|_2^2 = \frac{1}{m}\sum_{p\in P}\left|\psi\!\left(\frac{p}{m}\right)\right|^2 \qquad (6.35)$$
$$= \frac{1}{m}\int_{[0,1)}|\psi(s)|^2\,d\mu(s) \qquad (6.36)$$
$$\le \left(\frac{n-1}{m}+\frac{1}{\lambda}\right)\sup_{r\in[0,1)}\mu\!\left(\left(r,\, r+\frac{\lambda}{m}\right)\right) \qquad (6.37)$$
for all $\lambda\in(0,m]$, where (6.35) is by (6.30)–(6.33) and in (6.37) we applied the large sieve inequality Lemma 6.10 with $\delta=\lambda/m$ and $\|a\|_2=1$. Now,
$$\sup_{r\in[0,1)}\mu\!\left(\left(r,\, r+\frac{\lambda}{m}\right)\right) \qquad (6.38)$$
$$= \sup_{r\in[0,m)}\sum_{p\in P}\big(\nu_p((r,r+\lambda))+\nu_{m+p}((r,r+\lambda))\big) \qquad (6.39)$$
We next apply Theorem 6.1 to specific choices of P and Q. First, consider P = {1} and
Q = {1, . . . , m}, which were shown to saturate the upper and the lower bound in (6.10),
leading to ΔP,Q (F) = 1. Since P consists of a single point, ρ(P, λ) = 1/λ for all λ ∈ (0, m].
Thus, Theorem 6.1 with $n=m$ yields
$$\Delta_{P,Q}(F) \le \sqrt{\frac{m-1}{m}+\frac{1}{\lambda}} \quad\text{for all } \lambda\in(0,m]. \qquad (6.45)$$
Setting $\lambda=m$ in (6.45) yields $\Delta_{P,Q}(F)\le 1$.
Next, consider $P$ and $Q$ as in (6.11) and (6.12), respectively, which, as already mentioned, have the uncertainty relation in (6.10) lacking tightness by a factor of $\sqrt{n}$. Since $P$ consists of points spaced $m/n$ apart, we get $\rho(P,\lambda)=1/\lambda$ for all $\lambda\in(0,m/n]$. The upper bound (6.29) now becomes
$$\Delta_{P,Q}(F) \le \sqrt{\frac{n-1}{m}+\frac{1}{\lambda}} \quad\text{for all } \lambda\in\left(0,\frac{m}{n}\right]. \qquad (6.46)$$
Setting $\lambda=m/n$ in (6.46) yields
$$\Delta_{P,Q}(F) \le \sqrt{\frac{2n-1}{m}} \le \sqrt{2}\sqrt{\frac{n}{m}}, \qquad (6.47)$$
which is tight up to a factor of $\sqrt{2}$ (cf. Lemma 6.1). We hasten to add, however, that the large sieve technique applies to $U=F$ only.
Combining Lemma 6.3 with the uncertainty relation Lemma 6.2 yields the announced
result stating that a non-zero vector cannot be arbitrarily well concentrated with respect
to two different orthonormal bases.
corollary 6.1 Let $A,B\in\mathbb{C}^{m\times m}$ be unitary and $P,Q\subseteq\{1,\ldots,m\}$. Suppose that there exist a non-zero $\varepsilon_P$-concentrated $p\in\mathbb{C}^m$ and a non-zero $\varepsilon_Q$-concentrated $q\in\mathbb{C}^m$ such that $Ap=Bq$. Then,
$$|P||Q| \ge \frac{[1-\varepsilon_P-\varepsilon_Q]_+^2}{\mu^2([A\ B])}. \qquad (6.63)$$
Proof Let $U=A^*B$. Then, by Lemmata 6.2 and 6.3, we have
$$[1-\varepsilon_P-\varepsilon_Q]_+ \le \Delta_{P,Q}(U) \le \sqrt{|P||Q|}\,\mu([I\ U]). \qquad (6.64)$$
The claim now follows by noting that $\mu([I\ U])=\mu([A\ B])$.
For $\varepsilon_P=\varepsilon_Q=0$, we recover the well-known Elad–Bruckstein result.
corollary 6.2 (Theorem 1 of [8]) Let $A,B\in\mathbb{C}^{m\times m}$ be unitary. If $Ap=Bq$ for non-zero $p,q\in\mathbb{C}^m$, then $\|p\|_0\|q\|_0 \ge 1/\mu^2([A\ B])$.
$$|P||Q| < \frac{1}{\mu^2([I\ U])} \qquad (6.67)$$
is sufficient for $\Delta_{P,Q}(U)<1$.
Proof For $\Delta_{P,Q}(U)<1$, it follows that (see p. 301 of [33]) $(I-D_P\Pi_Q(U))$ is invertible with
$$|||(I-D_P\Pi_Q(U))^{-1}|||_2 \le \frac{1}{1-|||D_P\Pi_Q(U)|||_2} \qquad (6.68)$$
$$= \frac{1}{1-\Delta_{P,Q}(U)}. \qquad (6.69)$$
We now set $L=(I-D_P\Pi_Q(U))^{-1}D_{P^c}$ and note that
$$Lp_{P^c} = (I-D_P\Pi_Q(U))^{-1}p_{P^c} \qquad (6.70)$$
$$= (I-D_P\Pi_Q(U))^{-1}(I-D_P)p \qquad (6.71)$$
$$= (I-D_P\Pi_Q(U))^{-1}(I-D_P\Pi_Q(U))p \qquad (6.72)$$
$$= p, \qquad (6.73)$$
where in (6.72) we used $\Pi_Q(U)p=p$, which is by assumption. Next, we upper-bound $\|Ly-p\|_2$ according to
$$\|Ly-p\|_2 = \|Lp_{P^c}+Ln-p\|_2 \qquad (6.74)$$
$$= \|Ln\|_2 \qquad (6.75)$$
$$\le |||(I-D_P\Pi_Q(U))^{-1}|||_2\,\|n_{P^c}\|_2 \qquad (6.76)$$
$$\le \frac{1}{1-\Delta_{P,Q}(U)}\|n_{P^c}\|_2, \qquad (6.77)$$
where in (6.75) we used (6.70)–(6.73). Finally, Lemma 6.2 implies that (6.67) is sufficient for $\Delta_{P,Q}(U)<1$.
We next particularize Lemma 6.4 for $U=F$,
$$P = \left\{\frac{m}{n}, \frac{2m}{n}, \ldots, \frac{(n-1)m}{n}, m\right\} \qquad (6.78)$$
and
$$Q = \{l+1, \ldots, l+n\}, \qquad (6.79)$$
with $l\in\{1,\ldots,m\}$ and $Q$ interpreted circularly in $\{1,\ldots,m\}$. This means that $p$ is $n$-sparse in $F$ and we are missing $n$ entries in the noisy observation $y$. From Lemma 6.1, we know that $\Delta_{P,Q}(F)=\sqrt{n/m}$. Since $n$ divides $m$ by assumption, stable recovery of $p$ is possible for $n\le m/2$. In contrast, the coherence-based uncertainty relation in Lemma 6.2 yields $\Delta_{P,Q}(F)\le n/\sqrt{m}$, and would hence suggest that $n^2<m$ is needed for stable recovery.
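The following sketch (our own illustration of the recovery operator $L=(I-D_P\Pi_Q(U))^{-1}D_{P^c}$ from the proof of Lemma 6.4, with arbitrarily chosen parameters and noise level) verifies the stability bound (6.77) numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, l = 64, 4, 3
F = np.exp(-2j*np.pi*np.outer(np.arange(m), np.arange(m))/m) / np.sqrt(m)
P = np.isin(np.arange(m), [(k*m//n) % m for k in range(1, n+1)])   # unreliable/missing entries
Q = np.isin(np.arange(m), [(l+i) % m for i in range(1, n+1)])      # support in the F-basis
DP, DPc, DQ = np.diag(P.astype(float)), np.diag((~P).astype(float)), np.diag(Q.astype(float))
PiQ = F @ DQ @ F.conj().T

p = PiQ @ rng.standard_normal(m)                # a vector in W_{F,Q}
noise = 0.01 * rng.standard_normal(m)
y = p + noise

L = np.linalg.inv(np.eye(m) - DP @ PiQ) @ DPc   # recovery operator from the proof of Lemma 6.4
p_hat = L @ y

Delta = np.linalg.norm(DP @ PiQ, 2)             # equals sqrt(n/m) for this P, Q
err = np.linalg.norm(p_hat - p)
bound = np.linalg.norm(DPc @ noise) / (1 - Delta)   # right-hand side of (6.77)
print(err, bound, err <= bound + 1e-12)
```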
and consider the orthogonal projection $\Pi_Q(U)$ onto the subspace $\mathcal{W}_{U,Q}$, which is spanned by $\{u_i : i\in Q\}$. Let$^2$ $\Sigma_{P,Q}(U)=|||D_P\Pi_Q(U)|||_1$. By Lemma 6.12 we have
$$\Sigma_{P,Q}(U) = \max_{x\in\mathcal{W}_{U,Q}\setminus\{0\}}\frac{\|x_P\|_1}{\|x\|_1}. \qquad (6.80)$$
Moreover,
$$\frac{1}{m}\|D_P\Pi_Q(U)\|_1 \le \Sigma_{P,Q}(U) \le \|D_P\Pi_Q(U)\|_1, \qquad (6.81)$$
which constitutes the 1-norm equivalent of (6.2).
Proof Let ui denote the column vectors of U∗ . It follows from Lemma 6.14 that
With
For P = {1}, Q = {1, . . . , m}, and U = F, the upper bounds on ΣP,Q (F) in (6.81) and
(6.82) coincide and equal 1. We next present an example where (6.82) is sharper
2 In contrast to the operator 2-norm, the operator 1-norm is not invariant under unitary transformations, so
that we do not have |||ΠP (A)ΠQ (B)|||1 =|||DP ΠQ (A∗ B)|||1 for general unitary A, B. This, however, does not
constitute a problem as, whenever we apply uncertainty relations in (Cm , · 1 ), the case of general unitary
A, B can always be reduced directly to ΠP (I) = DP and ΠQ (A∗ B), simply by rewriting Ap = Bq according
to p = A∗ Bq.
than (6.81). Let $m$ be even, $P=\{m\}$, $Q=\{1,\ldots,m/2\}$, and $U=F$. Then, (6.82) becomes $\Sigma_{P,Q}(F)\le 1/2$, whereas
$$\|D_P\Pi_Q(F)\|_1 = \frac{1}{m}\sum_{l=1}^{m}\left|\sum_{k=1}^{m/2} e^{\frac{2\pi jlk}{m}}\right| \qquad (6.87)$$
$$= \frac{1}{2} + \frac{1}{m}\sum_{l=1}^{m-1}\left|\frac{1-e^{\pi jl}}{1-e^{\frac{2\pi jl}{m}}}\right| \qquad (6.88)$$
$$= \frac{1}{2} + \frac{2}{m}\sum_{l=1}^{m/2}\left|\frac{1}{1-e^{\frac{2\pi j(2l-1)}{m}}}\right| \qquad (6.89)$$
$$= \frac{1}{2} + \frac{1}{m}\sum_{l=1}^{m/2}\frac{1}{\sin(\pi(2l-1)/m)}. \qquad (6.90)$$
$$|P||Q| \ge \frac{1-\varepsilon_P}{\mu^2([A\ B])}. \qquad (6.91)$$
$$y = p + n, \qquad (6.95)$$
If $\Sigma_{P,Q}(U)<1/2$, then $\|z-p\|_1 \le C\,\varepsilon_P\|n\|_1$ with $C=2/(1-2\Sigma_{P,Q}(U))$. In particular,
$$|P||Q| < \frac{1}{2\mu^2([I\ U])} \qquad (6.97)$$
is sufficient for $\Sigma_{P,Q}(U)<1/2$.
Proof Set Pc = {1, . . . , m}\P and let q = U∗ p. Note that qQ = q as a consequence of
p ∈ WU,Q , which is by assumption. We have
≥ y − z1 (6.99)
= n − z1 (6.100)
Note that, for εP = 0, i.e., the noise vector is supported on P, we can recover p from
y = p + n perfectly, provided that ΣP,Q (U) < 1/2. For the special case U = F, this is
guaranteed by
$$|P||Q| < \frac{m}{2}, \qquad (6.107)$$
and perfect recovery of p from y = p + n amounts to the finite-dimensional version of
what is known as Logan’s phenomenon (see Section 6.2 of [6]).
$$|a_i^*Ap| = \left|p_i + \sum_{\substack{k=1\\ k\ne i}}^{p} a_i^*a_k p_k\right| \qquad (6.111)$$
$$\ge |p_i| - \left|\sum_{\substack{k=1\\ k\ne i}}^{p} a_i^*a_k p_k\right| \qquad (6.112)$$
$$\ge |p_i| - \sum_{\substack{k=1\\ k\ne i}}^{p} |a_i^*a_k||p_k| \qquad (6.113)$$
$$\ge |p_i| - \mu(A)\sum_{\substack{k=1\\ k\ne i}}^{p} |p_k| \qquad (6.114)$$
$$= (1+\mu(A))|p_i| - \mu(A)\|p\|_1, \qquad (6.115)$$
where (6.112) is by the reverse triangle inequality and in (6.114) we used Definition 6.1. Next, we upper-bound the right-hand side of (6.110) according to
$$|a_i^*Bq| = \left|\sum_{k=1}^{q} a_i^*b_k q_k\right| \qquad (6.116)$$
$$\le \sum_{k=1}^{q} |a_i^*b_k||q_k| \qquad (6.117)$$
$$\le \bar\mu(A,B)\,\|q\|_1, \qquad (6.118)$$
where the last step is by Definition 6.4. Combining the lower bound (6.111)–(6.115) and the upper bound (6.116)–(6.118) yields
$$|p_i| \le \frac{\mu(A)\|p\|_1 + \bar\mu(A,B)\|q\|_1}{1+\mu(A)}. \qquad (6.119)$$
Since (6.119) holds for arbitrary $i\in\{1,\ldots,p\}$, we can sum over all $i\in P$ and get
$$\|p_P\|_1 \le |P|\,\frac{\mu(A)\|p\|_1 + \bar\mu(A,B)\|q\|_1}{1+\mu(A)}. \qquad (6.120)$$
For the special case $A=I\in\mathbb{C}^{m\times m}$ and $B\in\mathbb{C}^{m\times m}$ with $B$ unitary, we have $\mu(A)=\mu(B)=0$ and $\bar\mu(I,B)=\mu([I\ B])$, so that (6.108) and (6.109) simplify to
$$\|p_P\|_1 \le |P|\,\mu([I\ B])\,\|q\|_1 \qquad (6.121)$$
and
$$\|q_Q\|_1 \le |Q|\,\mu([I\ B])\,\|p\|_1, \qquad (6.122)$$
respectively. Thus, for arbitrary but fixed $p\in\mathcal{W}_{B,Q}$ and $q=B^*p$, we have $q_Q=q$ so that (6.121) and (6.122) taken together yield
$$\|p_P\|_1 \le |P||Q|\,\mu^2([I\ B])\,\|p\|_1. \qquad (6.123)$$
As $p$ was assumed to be arbitrary, by (6.80) this recovers the uncertainty relation
$$\Sigma_{P,Q}(B) \le |P||Q|\,\mu^2([I\ B]) \qquad (6.124)$$
in Lemma 6.5.
Using (6.130) in (6.128) yields (6.125). The relation (6.126) follows from (6.125) by swapping the roles of $A$ and $B$, $p$ and $q$, and $P$ and $Q$, and upon noting that $\bar\mu(A,B)=\bar\mu(B,A)$. It remains to establish (6.127). Using $\|p_P\|_1\ge(1-\varepsilon_P)\|p\|_1$ in (6.128) and $\|q_Q\|_1\ge(1-\varepsilon_Q)\|q\|_1$ in (6.129) yields
$$\|p\|_1\big[(1+\mu(A))(1-\varepsilon_P)-\mu(A)|P|\big]_+ \le \bar\mu(A,B)\,\|q\|_1\,|P| \qquad (6.131)$$
and
$$\|q\|_1\big[(1+\mu(B))(1-\varepsilon_Q)-\mu(B)|Q|\big]_+ \le \bar\mu(A,B)\,\|p\|_1\,|Q|, \qquad (6.132)$$
respectively. Suppose first that $p=0$. Then $q\ne 0$ by assumption, and (6.132) becomes
$$\big[(1+\mu(B))(1-\varepsilon_Q)-\mu(B)|Q|\big]_+ = 0. \qquad (6.133)$$
In this case (6.127) holds trivially. Similarly, if $q=0$, then $p\ne 0$ again by assumption, and (6.131) becomes
$$\big[(1+\mu(A))(1-\varepsilon_P)-\mu(A)|P|\big]_+ = 0. \qquad (6.134)$$
As before, (6.127) holds trivially. Finally, if $p\ne 0$ and $q\ne 0$, then we multiply (6.131) by (6.132) and divide the result by $\bar\mu^2(A,B)\|p\|_1\|q\|_1$, which yields (6.127).
Corollary 6.3 will be used in Section 6.4 to derive recovery thresholds for sparse
signal separation. The lower bound on |P||Q| in (6.127) is Theorem 1 of [9] and states
that a non-zero vector cannot be arbitrarily well concentrated with respect to two
different general matrices A and B. For the special case εQ = 0 and A and B unitary, and
hence μ(A) = μ(B) = 0 and μ̄(A, B) = μ([A B]), (6.127) recovers Lemma 6.6.
Particularizing (6.127) to εP = εQ = 0 yields the following result.
corollary 6.4 (Lemma 33 of [10]) Let $A\in\mathbb{C}^{m\times p}$ and $B\in\mathbb{C}^{m\times q}$, both with column vectors $\|\cdot\|_2$-normalized to 1, and consider $p\in\mathbb{C}^p$ and $q\in\mathbb{C}^q$ with $(p^T\ q^T)^T\ne 0$. Suppose that $Ap=Bq$. Then, $\|p\|_0\|q\|_0 \ge f_{A,B}(\|p\|_0,\|q\|_0)$, where
$$f_{A,B}(u,v) = \frac{[1+\mu(A)(1-u)]_+\,[1+\mu(B)(1-v)]_+}{\bar\mu^2(A,B)}. \qquad (6.135)$$
Proof Let $P=\{i\in\{1,\ldots,p\} : p_i\ne 0\}$ and $Q=\{i\in\{1,\ldots,q\} : q_i\ne 0\}$, so that $p_P=p$, $q_Q=q$, $|P|=\|p\|_0$, and $|Q|=\|q\|_0$. The claim now follows directly from (6.127) with $\varepsilon_P=\varepsilon_Q=0$.
If A and B are both unitary, then μ(A) = μ(B) = 0 and μ̄(A, B) = μ([A B]), and
Corollary 6.4 recovers the Elad–Bruckstein result in Corollary 6.2.
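A small sketch of the quantities entering Corollary 6.4, assuming the standard definitions of coherence and mutual coherence (Definitions 6.1 and 6.4 are not reproduced in this excerpt); the helper names below are ours:

```python
import numpy as np

def coherence(A):
    """mu(A): largest absolute inner product between distinct (unit-norm) columns."""
    G = np.abs(A.conj().T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

def mutual_coherence(A, B):
    """mu_bar(A, B): largest absolute inner product between a column of A and one of B."""
    return np.abs(A.conj().T @ B).max()

def f_AB(A, B, u, v):
    """The function f_{A,B}(u, v) from (6.135)."""
    mu_A, mu_B, mu_bar = coherence(A), coherence(B), mutual_coherence(A, B)
    return max(1 + mu_A*(1-u), 0.0) * max(1 + mu_B*(1-v), 0.0) / mu_bar**2

m = 32
I = np.eye(m)
F = np.exp(-2j*np.pi*np.outer(np.arange(m), np.arange(m))/m) / np.sqrt(m)
# For A = I and B = F: mu(A) = mu(B) = 0, mu_bar = 1/sqrt(m), so f_{A,B}(u, v) = m.
print(f_AB(I, F, 3, 4), m)
```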
Corollary 6.4 admits the following appealing geometric interpretation in terms of a
null-space property, which will be seen in Section 6.5 to pave the way to an extension
of the classical notion of sparsity to a more general concept of parsimony.
lemma 6.8 Let $A\in\mathbb{C}^{m\times p}$ and $B\in\mathbb{C}^{m\times q}$, both with column vectors $\|\cdot\|_2$-normalized to 1. Then, the set (which actually is a finite union of subspaces)
$$\mathcal{S} = \left\{\begin{pmatrix} p \\ q \end{pmatrix} : p\in\mathbb{C}^p,\ q\in\mathbb{C}^q,\ \|p\|_0\|q\|_0 < f_{A,B}(\|p\|_0,\|q\|_0)\right\} \qquad (6.136)$$
satisfies $\ker([A\ B])\cap\mathcal{S}=\{0\}$.
Proof The statement of this lemma is equivalent to the statement of Corollary 6.4
through a chain of equivalences between the following statements:
(1) ker([A B]) ∩ S = {0},
(2) if $(p^T\ {-q^T})^T \in \ker([A\ B])\setminus\{0\}$, then $\|p\|_0\|q\|_0 \ge f_{A,B}(\|p\|_0,\|q\|_0)$,
(3) if $Ap=Bq$ with $(p^T\ q^T)^T\ne 0$, then $\|p\|_0\|q\|_0 \ge f_{A,B}(\|p\|_0,\|q\|_0)$,
where (1) ⇔ (2) is by the definition of $\mathcal{S}$, (2) ⇔ (3) follows from the fact that $Ap=Bq$
with $(p^T\ q^T)^T\ne 0$ is equivalent to $(p^T\ {-q^T})^T\in\ker([A\ B])\setminus\{0\}$, and (3) is the statement
in Corollary 6.4.
Numerous practical signal recovery tasks can be cast as sparse signal separation problems of the following form. We want to recover $y\in\mathbb{C}^p$ with $\|y\|_0\le s$ and/or $z\in\mathbb{C}^q$ with $\|z\|_0\le t$ from the noiseless observation
$$w = Ay + Bz, \qquad (6.138)$$
where A ∈ Cm×p and B ∈ Cm×q . Here, s and t are the sparsity levels of y and z with cor-
responding ambient dimensions p and q, respectively. Prominent applications include
(image) inpainting, declipping, super-resolution, the recovery of signals corrupted by
impulse noise, and the separation of (e.g., audio or video) signals into two distinct
components (see Section I of [9]). We next briefly describe some of these problems.
1. Clipping. Non-linearities in power-amplifiers or in analog-to-digital converters
often cause signal clipping or saturation [35]. This effect can be cast into the signal
model (6.138) by setting B = I, identifying s = Ay with the signal to be clipped, and
setting z = (ga (s) − s) with ga (·) realizing entry-wise clipping of the amplitude to the
interval [0, a]. If the clipping level a is not too small, then z will be sparse, i.e., t ≪ q (a small synthetic instance is sketched after this list).
2. Missing entries. Our framework also encompasses super-resolution [36, 37] and
inpainting [38] of, e.g., images, audio, and video signals. In both these applications
only a subset of the entries of the (full-resolution) signal vector s = Ay is available
and the task is to fill in the missing entries, which are accounted for by writing
w = s + z with zi = −si if the ith entry of s is missing and zi = 0 else. If the number
of entries missing is not too large, then z is sparse, i.e., t ≪ q.
3. Signal separation. Separation of (audio, image, or video) signals into two struc-
turally distinct components also fits into the framework described above. A promi-
nent example is the separation of texture from cartoon parts in images (see [39, 40]
and references therein). The matrices A and B are chosen to allow sparse representa-
tions of the two distinct features. Note that here Bz no longer plays the role of unde-
sired noise, and the goal is to recover both y and z from the observation w = Ay+Bz.
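As a concrete instance of the first example above, the following sketch builds a toy clipping problem in the form $w=Ay+Bz$ with $B=I$; the DCT-like choice of $A$, the clipping level, and all dimensions are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(7)
m, p, s, a = 128, 128, 5, 0.8

# DCT-II-like orthonormal basis as the sparsifying dictionary A (illustrative choice).
k = np.arange(m)
A = np.sqrt(2 / m) * np.cos(np.pi * (k[:, None] + 0.5) * k[None, :] / m)
A[:, 0] /= np.sqrt(2)

y = np.zeros(p)
y[rng.choice(p, s, replace=False)] = rng.standard_normal(s)  # s-sparse coefficients
signal = A @ y                                # s = A y, the signal to be clipped
clipped = np.clip(signal, -a, a)              # entry-wise amplitude clipping g_a(.)
z = clipped - signal                          # sparse if the clipping level is not too small
w = signal + z                                # observation w = A y + I z
print("fraction of clipped entries t/q =", np.mean(z != 0))
```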
The first two examples above demonstrate that in many practically relevant applica-
tions the locations of the possibly non-zero entries of one of the sparse vectors, say z,
may be known. This can be accounted for by removing the columns of B corresponding
to the other entries, which results in t = q, i.e., the sparsity level of z equals the ambient
dimension. We next show how Corollary 6.3 can be used to state a sufficient condition
for recovery of y from w = Ay + Bz when t = q. For recovery guarantees in the case
where the sparsity levels of both y and z are strictly smaller than their corresponding
ambient dimensions, we refer to Theorem 8 of [9].
theorem 6.3 (Theorems 4 and 7 of [9]) Let $y\in\mathbb{C}^p$ with $\|y\|_0\le s$, $z\in\mathbb{C}^q$, $A\in\mathbb{C}^{m\times p}$, and $B\in\mathbb{C}^{m\times q}$, both with column vectors $\|\cdot\|_2$-normalized to 1 and $\bar\mu(A,B)>0$. Suppose that (6.139) holds, with
$$f_{A,B}(u,v) = \frac{[1+\mu(A)(1-u)]_+\,[1+\mu(B)(1-v)]_+}{\bar\mu^2(A,B)}. \qquad (6.140)$$
Then, $y$ can be recovered from $w=Ay+Bz$ by either of the following algorithms:
$$\text{(P0)}\quad \begin{cases} \text{minimize} & \|y\|_0 \\ \text{subject to} & Ay\in\{w+Bz : z\in\mathbb{C}^q\}, \end{cases} \qquad (6.141)$$
$$\text{(P1)}\quad \begin{cases} \text{minimize} & \|y\|_1 \\ \text{subject to} & Ay\in\{w+Bz : z\in\mathbb{C}^q\}. \end{cases} \qquad (6.142)$$
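For real-valued data, (P1) can be solved as a linear program. The sketch below is our own illustration (it is not the algorithm of [9], and the complex-valued case would instead call for second-order cone programming); whether the true $y$ is recovered depends on the sufficient condition (6.139):

```python
import numpy as np
from scipy.optimize import linprog

def solve_P1(A, B, w):
    """Real-valued sketch of (P1): minimize ||y||_1 subject to A y = w + B z for some z,
    i.e., A y - B z = w, written as a linear program with y = y_plus - y_minus."""
    m, p = A.shape
    q = B.shape[1]
    c = np.concatenate([np.ones(2 * p), np.zeros(q)])        # objective: sum(y+ + y-)
    A_eq = np.hstack([A, -A, -B])                             # A y+ - A y- - B z = w
    bounds = [(0, None)] * (2 * p) + [(None, None)] * q       # y+, y- >= 0, z free
    res = linprog(c, A_eq=A_eq, b_eq=w, bounds=bounds, method="highs")
    return res.x[:p] - res.x[p:2 * p]

# Tiny synthetic separation instance (dimensions illustrative only).
rng = np.random.default_rng(2)
m, p, q, s = 40, 60, 10, 2
A = rng.standard_normal((m, p)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((m, q)); B /= np.linalg.norm(B, axis=0)
y_true = np.zeros(p); y_true[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
z_true = rng.standard_normal(q)
w = A @ y_true + B @ z_true
print(np.linalg.norm(solve_P1(A, B, w) - y_true))   # small if recovery succeeds
```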
Proof We provide the proof for (P1) only. The proof for recovery through (P0) is very similar and can be found in Appendix B of [9].
Let $w=Ay+Bz$ and suppose that (P1) delivers $\hat{y}\in\mathbb{C}^p$. This implies $\|\hat{y}\|_1\le\|y\|_1$ and the existence of a $\hat{z}\in\mathbb{C}^q$ such that
$$A\hat{y} = w + B\hat{z}. \qquad (6.143)$$
By assumption, we also have
$$Ay = w - Bz. \qquad (6.144)$$
Subtracting (6.144) from (6.143) yields
$$A(\underbrace{\hat{y}-y}_{=\,p}) = B(\underbrace{\hat{z}+z}_{=\,q}). \qquad (6.145)$$
We now set
$$\mathcal{U} = \{i\in\{1,\ldots,p\} : y_i \ne 0\} \qquad (6.146)$$
and
and show that $p$ is $\varepsilon_\mathcal{U}$-concentrated (with respect to the 1-norm) for $\varepsilon_\mathcal{U}=1/2$, i.e.,
$$\|p_{\mathcal{U}^c}\|_1 \le \frac{1}{2}\|p\|_1. \qquad (6.148)$$
We have
where (6.151) follows from the definition of $\mathcal{U}$ in (6.146), and in (6.152) we applied the reverse triangle inequality. Now, (6.149)–(6.153) implies $\|p_\mathcal{U}\|_1\ge\|p_{\mathcal{U}^c}\|_1$. Thus, $2\|p_{\mathcal{U}^c}\|_1 \le \|p_\mathcal{U}\|_1 + \|p_{\mathcal{U}^c}\|_1 = \|p\|_1$, which establishes (6.148). Next, set $\mathcal{V}=\{1,\ldots,q\}$ and note that $q$ is trivially $\varepsilon_\mathcal{V}$-concentrated (with respect to the 1-norm) for $\varepsilon_\mathcal{V}=0$.
Suppose, toward a contradiction, that $p\ne 0$. Then, we have
The notion of sparsity underlying the theory developed so far is that of either the number
of non-zero entries or of concentration in terms of 1-norm or 2-norm. In practice, one
often encounters more general concepts of parsimony, such as manifold or fractal
set structures. Manifolds are prevalent in data science, e.g., in compressed sensing
[22–27], machine learning [28], image processing [29, 30], and handwritten-digit
recognition [31]. Fractal sets find application in image compression and in modeling
of Ethernet traffic [32]. Based on the null-space property established in Lemma 6.8,
we now extend the theory to account for more general notions of parsimony. To this
end, we first need a suitable measure of “description complexity” that goes beyond
the concepts of sparsity and concentration. Formalizing this idea requires an adequate
dimension measure, which, as it turns out, is lower modified Minkowski dimension. We
start by defining Minkowski dimension and modified Minkowski dimension.
definition 6.5 (from Section 3.1 of [50])$^3$ For $U\subseteq\mathbb{C}^m$ non-empty, the lower and upper Minkowski dimensions of $U$ are defined as
$$\underline{\dim}_B(U) = \liminf_{\rho\to 0}\frac{\log N_U(\rho)}{\log(1/\rho)} \qquad (6.168)$$
and
$$\overline{\dim}_B(U) = \limsup_{\rho\to 0}\frac{\log N_U(\rho)}{\log(1/\rho)}, \qquad (6.169)$$
respectively, where
$$N_U(\rho) = \min\Big\{k\in\mathbb{N} : U \subseteq \bigcup_{i\in\{1,\ldots,k\}}\mathcal{B}_m(u_i,\rho),\ u_i\in U\Big\} \qquad (6.170)$$
is the covering number of $U$ for radius $\rho>0$. If $\underline{\dim}_B(U)=\overline{\dim}_B(U)$, this common value, denoted by $\dim_B(U)$, is the Minkowski dimension of $U$.
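The covering number in (6.170) suggests a simple numerical proxy for the Minkowski dimension. The sketch below (a crude illustration: the greedy cover only upper-bounds $N_U(\rho)$, and the limit $\rho\to 0$ is replaced by a finite range of radii) estimates the dimension of a sampled circle, which is 1:

```python
import numpy as np

def covering_number(points, rho):
    """Greedy upper bound on N_U(rho): cover the point cloud with balls of radius rho
    centered at points of the set itself (cf. (6.170))."""
    remaining = points.copy()
    k = 0
    while remaining.shape[0] > 0:
        c = remaining[0]
        remaining = remaining[np.linalg.norm(remaining - c, axis=1) > rho]
        k += 1
    return k

# Sample a curve (a circle in the plane), whose Minkowski dimension is 1.
t = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
U = np.stack([np.cos(t), np.sin(t)], axis=1)

rhos = np.array([0.2, 0.1, 0.05, 0.025])
Ns = np.array([covering_number(U, r) for r in rhos])
slope = np.polyfit(np.log(1 / rhos), np.log(Ns), 1)[0]   # slope of log N_U(rho) vs log(1/rho)
print(slope)   # close to 1
```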
definition 6.6 (from Section 3.3 of [50]) For $U\subseteq\mathbb{C}^m$ non-empty, the lower and upper modified Minkowski dimensions of $U$ are defined as
$$\underline{\dim}_{MB}(U) = \inf\Big\{\sup_{i\in\mathbb{N}}\underline{\dim}_B(U_i) : U \subseteq \bigcup_{i\in\mathbb{N}} U_i\Big\} \qquad (6.171)$$
and
$$\overline{\dim}_{MB}(U) = \inf\Big\{\sup_{i\in\mathbb{N}}\overline{\dim}_B(U_i) : U \subseteq \bigcup_{i\in\mathbb{N}} U_i\Big\}, \qquad (6.172)$$
respectively, where in both cases the infimum is over all possible coverings $\{U_i\}_{i\in\mathbb{N}}$ of $U$ by non-empty compact sets $U_i$. If $\underline{\dim}_{MB}(U)=\overline{\dim}_{MB}(U)$, this common value, denoted by $\dim_{MB}(U)$, is the modified Minkowski dimension of $U$.
For further details on (modified) Minkowski dimension, we refer the interested reader
to Section 3 of [50].
We are now ready to extend the null-space property in Lemma 6.8 to the following
set-theoretic null-space property.
theorem 6.4 Let $U\subseteq\mathbb{C}^{p+q}$ be non-empty with $\underline{\dim}_{MB}(U)<2m$, and let $B\in\mathbb{C}^{m\times q}$ with $m\ge q$ be a full-rank matrix. Then, $\ker([A\ B])\cap(U\setminus\{0\})=\emptyset$ for Lebesgue-a.a. $A\in\mathbb{C}^{m\times p}$.
Proof See Section 6.8.
3 Minkowski dimension is sometimes also referred to as box-counting dimension, which is the origin of the
subscript B in the notation dimB (·) used henceforth.
The set U in this set-theoretic null-space property generalizes the finite union of linear
subspaces S in Lemma 6.8. For U ⊆ R p+q , the equivalent of Theorem 6.4 was reported
previously in Proposition 1 of [13]. The set-theoretic null-space property can be inter-
preted in geometric terms as follows. If p + q ≤ m, then [A B] is a tall matrix so that the
kernel of [A B] is {0} for Lebesgue-a.a. matrices A. The statement of the theorem holds
trivially in this case. If p + q > m, then the kernel of [A B] is a (p + q − m)-dimensional
subspace of the ambient space C p+q for Lebesgue-a.a. matrices A. The theorem
therefore says that, for Lebesgue-a.a. A, the set U intersects the subspace ker([A B])
at most trivially if the sum of dim ker([A B]) and4 dimMB (U)/2 is strictly smaller
than the dimension of the ambient space. What is remarkable here is that the notions
of Euclidean dimension (for the kernel of [A B]) and of lower modified Minkowski
dimension (for the set U) are compatible. We finally note that, by virtue of the chain
of equivalences in the proof of Lemma 6.8, the set-theoretic null-space property in
Theorem 6.4 leads to a set-theoretic uncertainty relation, albeit not in the form of an
upper bound on an operator norm; for a detailed discussion of this equivalence the
interested reader is referred to [13].
We next put the set-theoretic null-space property in Theorem 6.4 in perspective with the null-space property in Lemma 6.8. Fix the sparsity levels $s$ and $t$, consider the set
$$S_{s,t} = \left\{\begin{pmatrix} p \\ q \end{pmatrix} : p\in\mathbb{C}^p,\ q\in\mathbb{C}^q,\ \|p\|_0\le s,\ \|q\|_0\le t\right\}, \qquad (6.173)$$
which is a finite union of $(s+t)$-dimensional linear subspaces, and, for the sake of concreteness, let $A=I$ and $B=F$ of size $q\times q$. Lemma 6.8 then states that the kernel of $[I\ F]$ intersects $S_{s,t}$ trivially, provided that
$$st < m, \qquad (6.174)$$
which leads to a recovery threshold in the signal separation problem that is quadratic in
the sparsity levels s and t (see Theorem 8 of [9]). To see what the set-theoretic null-space
property gives, we start by noting that, by Example II.2 of [26], dimMB (S s,t ) = 2(s + t).
Theorem 6.4 hence states that, for Lebesgue-a.a. matrices A ∈ Cm×p , the kernel of
[A B] intersects S s,t trivially, provided that
m > s + t. (6.175)
This is striking as it says that, while the threshold in (6.174) is quadratic in the sparsity
levels s and t and, therefore, suffers from the square-root bottleneck, the threshold in
(6.175) is linear in s and t.
To understand the operational implications of the observation just made, we demon-
strate how the set-theoretic null-space property in Theorem 6.4 leads to a sufficient con-
dition for the recovery of vectors in sets of small lower modified Minkowski dimension.
4 The factor 1/2 stems from the fact that (modified) Minkowski dimension “counts real dimensions.”
For example, the modified Minkowski dimension of an n-dimensional linear subspace of Cm is 2n (see
Example II.2 of [26]).
lemma 6.9 Let $\mathcal{S}\subseteq\mathbb{C}^{p+q}$ be non-empty with $\underline{\dim}_{MB}(\mathcal{S}-\mathcal{S})<2m$, where $\mathcal{S}-\mathcal{S}=\{u-v : u,v\in\mathcal{S}\}$, and let $B\in\mathbb{C}^{m\times q}$, with $m\ge q$, be a full-rank matrix. Then, $[A\ B]$ is one-to-one on $\mathcal{S}$ for Lebesgue-a.a. $A\in\mathbb{C}^{m\times p}$.
Proof The proof follows immediately from the set-theoretic null-space property in
Theorem 6.4 and the linearity of matrix-vector multiplication.
To elucidate the implications of Lemma 6.9, consider $S_{s,t}$ defined in (6.173). Since $S_{s,t}-S_{s,t}$ is again a finite union of linear subspaces of dimensions no larger than $\min\{p,2s\}+\min\{q,2t\}$, where the $\min\{\cdot,\cdot\}$ operation accounts for the fact that the dimension of a linear subspace cannot exceed the dimension of its ambient space, we have (see Example II.2 of [26])
$$\dim_{MB}(S_{s,t}-S_{s,t}) = 2(\min\{p,2s\}+\min\{q,2t\}). \qquad (6.176)$$
Application of Lemma 6.9 now yields that, for Lebesgue-a.a. matrices $A\in\mathbb{C}^{m\times p}$, we can recover $y\in\mathbb{C}^p$ with $\|y\|_0\le s$ and $z\in\mathbb{C}^q$ with $\|z\|_0\le t$ from $w=Ay+Bz$, provided that $m>\min\{p,2s\}+\min\{q,2t\}$. This qualitative behavior (namely, linear in $s$ and $t$) is best possible as it cannot be improved even if the support sets of $y$ and $z$ were known prior to recovery. We emphasize, however, that the statement in Lemma 6.9 guarantees injectivity of $[A\ B]$ only absent computational considerations for recovery.
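For a very small instance, the injectivity claim of Lemma 6.9 on $S_{s,t}$ can be checked by brute force, since $[A\ B]$ is one-to-one on $S_{s,t}$ exactly when every submatrix with at most $2s$ columns from $A$ and $2t$ columns from $B$ has full column rank. The following sketch (dimensions chosen by us; the check is exponential in $p$ and $q$) does exactly that:

```python
import numpy as np
from itertools import combinations

def injective_on_sparse_pairs(A, B, s, t, tol=1e-10):
    """Check that ker([A B]) intersects S_{s,t} - S_{s,t} only in 0 by testing that every
    submatrix of [A B] with at most 2s columns from A and 2t from B has full column rank."""
    m, p = A.shape
    q = B.shape[1]
    M = np.hstack([A, B])
    for SA in combinations(range(p), min(2 * s, p)):
        for SB in combinations(range(q), min(2 * t, q)):
            cols = list(SA) + [p + j for j in SB]
            sub = M[:, cols]
            if np.linalg.matrix_rank(sub, tol=tol) < sub.shape[1]:
                return False
    return True

rng = np.random.default_rng(3)
p, q, s, t = 6, 4, 1, 1
m = min(p, 2 * s) + min(q, 2 * t) + 1          # m > min{p,2s} + min{q,2t}
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, q))                # full rank with probability one
print(injective_on_sparse_pairs(A, B, s, t))   # True for Lebesgue-a.a. A
```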
We present a slightly improved and generalized version of the large sieve inequality stated in Equation (32) of [7].
lemma 6.10 Let $\mu$ be a 1-periodic, $\sigma$-finite measure on $\mathbb{R}$, $n\in\mathbb{N}$, $\varphi\in[0,1)$, $a\in\mathbb{C}^n$, and consider the 1-periodic trigonometric polynomial
$$\psi(s) = e^{2\pi j\varphi}\sum_{k=1}^{n} a_k\, e^{-2\pi jks}. \qquad (6.177)$$
Then,
$$\int_{[0,1)}|\psi(s)|^2\,d\mu(s) \le \left(n-1+\frac{1}{\delta}\right)\sup_{r\in[0,1)}\mu((r,r+\delta))\,\|a\|_2^2 \qquad (6.178)$$
for all $\delta\in(0,1]$.
Proof Since
$$|\psi(s)| = \left|\sum_{k=1}^{n} a_k\, e^{-2\pi jks}\right|, \qquad (6.179)$$
we can assume, without loss of generality, that $\varphi=0$. The proof now follows closely the line of argumentation on pp. 185–186 of [51] and in the proof of Lemma 5 of [7].
Specifically, we make use of the result on p. 185 of [51] saying that, for every δ > 0,
there exists a function g ∈ L2 (R) with Fourier transform
$$G(s) = \int_{-\infty}^{\infty} g(t)\,e^{-2\pi jst}\,dt \qquad (6.180)$$
such that $\|G\|_2^2=n-1+1/\delta$, $|g(t)|^2\ge 1$ for all $t\in[1,n]$, and $G(s)=0$ for all $s\notin[-\delta/2,\delta/2]$. With this $g$, consider the 1-periodic trigonometric polynomial
$$\theta(s) = \sum_{k=1}^{n}\frac{a_k}{g(k)}\,e^{-2\pi jks}. \qquad (6.181)$$
$$= \|G\|_2^2\sum_{i=-1}^{1}\int_0^1 \mu\big((r-\delta/2,\, r+\delta/2)\cap[i,1+i)\big)\,|\theta(r)|^2\,dr \qquad (6.190)$$
$$= \|G\|_2^2\int_0^1 \mu\big((r-\delta/2,\, r+\delta/2)\cap[-1,2)\big)\,|\theta(r)|^2\,dr \qquad (6.191)$$
$$= \|G\|_2^2\int_0^1 \mu\big((r-\delta/2,\, r+\delta/2)\big)\,|\theta(r)|^2\,dr \qquad (6.192)$$
for all $\delta\in(0,1]$, where (6.185) follows from (6.182)–(6.184), in (6.186) we applied the Cauchy–Schwarz inequality (Theorem 1.37 of [52]), (6.188) is by Fubini's theorem (Theorem 1.14 of [53]) (recall that $\mu$ is $\sigma$-finite by assumption) upon noting that
$$\{(r,s) : s\in[0,1),\ r\in(s-\delta/2,s+\delta/2)\} = \{(r,s) : r\in[-1,2),\ s\in(r-\delta/2,r+\delta/2)\cap[0,1)\} \qquad (6.193)$$
for all $\delta\in(0,1]$, in (6.190) we used the 1-periodicity of $\mu$ and $\theta$, and (6.191) is by $\sigma$-additivity of $\mu$. Now,
$$\int_0^1 \mu\big((r-\delta/2,r+\delta/2)\big)\,|\theta(r)|^2\,dr \le \sup_{r\in[0,1)}\mu((r,r+\delta))\int_0^1|\theta(r)|^2\,dr \qquad (6.194)$$
$$= \sup_{r\in[0,1)}\mu((r,r+\delta))\sum_{k=1}^{n}\frac{|a_k|^2}{|g(k)|^2} \qquad (6.195)$$
$$\le \sup_{r\in[0,1)}\mu((r,r+\delta))\,\|a\|_2^2 \qquad (6.196)$$
for all $\delta>0$, where (6.196) follows from $|g(t)|^2\ge 1$ for all $t\in[1,n]$. Using (6.194)–(6.196) and $\|G\|_2^2=n-1+1/\delta$ in (6.192) establishes (6.178).
Lemma 6.10 is a slightly strengthened version of the large sieve inequality (Equation
(32) of [7]). Specifically, in (6.178) it is sufficient to consider open intervals (r, r + δ),
whereas Equation (32) of [7] requires closed intervals [r, r + δ]. Thus, the upper bound
in Equation (32) of [7] can be strictly larger than that in (6.178) whenever μ has mass
points.
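A direct numerical check of Lemma 6.10 for a discrete measure (random atoms and coefficients of our choosing; the supremum over $r$ is only approximated on a grid, so the reported right-hand side slightly underestimates the true bound):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 16
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
pts = rng.random(12)                          # atoms of the (1-periodic) measure mu in [0, 1)

psi = lambda s: np.exp(-2j*np.pi*np.outer(s, np.arange(1, n+1))) @ a   # psi(s), phi = 0
lhs = np.sum(np.abs(psi(pts))**2)             # integral of |psi|^2 against the atomic measure

delta = 0.05
ext = np.concatenate([pts, pts + 1.0])        # 1-periodic extension for wrap-around intervals
grid = np.linspace(0, 1, 4000, endpoint=False)
sup_mu = max(((ext > r) & (ext < r + delta)).sum() for r in grid)   # approx sup_r mu((r, r+delta))

rhs = (n - 1 + 1/delta) * sup_mu * np.linalg.norm(a)**2
print(lhs <= rhs, lhs, rhs)                   # the inequality (6.178) holds with ample margin
```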
By definition of lower modified Minkowski dimension, there exists a covering $\{U_i\}_{i\in\mathbb{N}}$ of $U$ by non-empty compact sets $U_i$ satisfying $\underline{\dim}_B(U_i)<2m$ for all $i\in\mathbb{N}$. The countable sub-additivity of Lebesgue measure $\lambda$ now implies
$$\lambda\big(\{A\in\mathbb{C}^{m\times p} : \ker[A\ B]\cap(U\setminus\{0\})\ne\emptyset\}\big) \le \sum_{i=1}^{\infty}\lambda\big(\{A\in\mathbb{C}^{m\times p} : \ker[A\ B]\cap(U_i\setminus\{0\})\ne\emptyset\}\big). \qquad (6.197)$$
We next establish that every term in the sum on the right-hand side of (6.197) equals zero. Take an arbitrary but fixed $i\in\mathbb{N}$. Repeating the steps in Equations (10)–(14) of [13] shows that it suffices to prove that
$$P[\ker([A\ B])\cap\mathcal{V}\ne\emptyset] = 0 \qquad (6.198)$$
with
$$\mathcal{V} = \left\{\begin{pmatrix} u \\ v \end{pmatrix} : u\in\mathbb{C}^p,\ v\in\mathbb{C}^q,\ \|u\|_2>0\right\}\cap U_i \qquad (6.199)$$
and $A=(A_1\ \ldots\ A_m)^*$, where the random vectors $A_i$ are independent and uniformly distributed on $\mathcal{B}_p(0,r)$ for arbitrary but fixed $r>0$. Suppose, toward a contradiction, that (6.198) is false. This implies
$$0 = \liminf_{\rho\to 0}\frac{\log P[\ker([A\ B])\cap\mathcal{V}\ne\emptyset]}{\log(1/\rho)} \qquad (6.200)$$
$$\le \liminf_{\rho\to 0}\frac{\log\sum_{i=1}^{N_\mathcal{V}(\rho)}P[\ker([A\ B])\cap\mathcal{B}_{p+q}(c_i,\rho)\ne\emptyset]}{\log(1/\rho)}, \qquad (6.201)$$
where we have chosen $\{c_i : i=1,\ldots,N_\mathcal{V}(\rho)\}\subseteq\mathcal{V}$ such that
$$\mathcal{V} \subseteq \bigcup_{i=1}^{N_\mathcal{V}(\rho)}\mathcal{B}_{p+q}(c_i,\rho), \qquad (6.202)$$
with $N_\mathcal{V}(\rho)$ denoting the covering number of $\mathcal{V}$ for radius $\rho>0$ (cf. (6.170)). Now let $i\in\{1,\ldots,N_\mathcal{V}(\rho)\}$ be arbitrary but fixed and write $c_i=(u_i^T\ v_i^T)^T$. It follows that
where in the last step we made use of $\|A_i\|_2\le r$ for $i=1,\ldots,m$. We now have
$$\le \frac{C(p,m,r)}{\|u_i\|_2^{2m}}\,\rho^{2m}\big(1+r\sqrt{m}+\|B\|_2\big)^{2m}, \qquad (6.211)$$
where (6.214) follows from $\underline{\dim}_B(\mathcal{V})\le\underline{\dim}_B(U_i)<2m$, which constitutes a contradiction. Therefore, (6.198) must hold.
lemma 6.11 Let $A=(A_1\ \ldots\ A_m)^*$ with independent random vectors $A_i$ uniformly distributed on $\mathcal{B}_p(0,r)$ for $r>0$. Then
$$P[\|Au+v\|_2<\delta] \le \frac{C(p,m,r)}{\|u\|_2^{2m}}\,\delta^{2m}, \qquad (6.215)$$
with
$$C(p,m,r) = \left(\frac{\pi V_{p-1}(r)}{V_p(r)}\right)^m \qquad (6.216)$$
for all $u\in\mathbb{C}^p\setminus\{0\}$, $v\in\mathbb{C}^m$, and $\delta>0$.
Proof Since
$$P[\|Au+v\|_2<\delta] \le \prod_{i=1}^{m} P[|A_i^*u+v_i|<\delta] \qquad (6.217)$$
owing to the independence of the $A_i$ and as $\|Au+v\|_2<\delta$ implies $|A_i^*u+v_i|<\delta$ for $i=1,\ldots,m$, it is sufficient to show that
$$P[|B^*u+v|<\delta] \le \frac{D(p,r)}{\|u\|_2^2}\,\delta^2 \qquad (6.218)$$
for all $u\in\mathbb{C}^p\setminus\{0\}$, $v\in\mathbb{C}$, and $\delta>0$, where the random vector $B$ is uniformly distributed on $\mathcal{B}_p(0,r)$ and
$$D(p,r) = \frac{\pi V_{p-1}(r)}{V_p(r)}. \qquad (6.219)$$
We have
$$P[|B^*u+v|<\delta] = P\!\left[\frac{|B^*u+v|}{\|u\|_2}<\frac{\delta}{\|u\|_2}\right] \qquad (6.220)$$
$$= P[|B^*U^*e_1+\tilde{v}|<\tilde\delta] \qquad (6.221)$$
$$= P[|B^*e_1+\tilde{v}|<\tilde\delta] \qquad (6.222)$$
$$= \frac{1}{V_p(r)}\int_{\mathcal{B}_p(0,r)}\chi_{\{b_1:\,|b_1+\tilde{v}|<\tilde\delta\}}(b_1)\,db \qquad (6.223)$$
$$\le \frac{1}{V_p(r)}\int_{|b_1+\tilde{v}|\le\tilde\delta}\int_{\mathcal{B}_{p-1}(0,r)} db_1\, d(b_2\ \ldots\ b_p)^T \qquad (6.224)$$
$$= \frac{V_{p-1}(r)}{V_p(r)}\int_{|b_1+\tilde{v}|<\tilde\delta} db_1 \qquad (6.225)$$
$$= \frac{V_{p-1}(r)}{V_p(r)}\,\pi\tilde\delta^2 \qquad (6.226)$$
$$= \frac{\pi V_{p-1}(r)}{V_p(r)\,\|u\|_2^2}\,\delta^2, \qquad (6.227)$$
where the unitary matrix $U$ in (6.221) has been chosen such that $U(u/\|u\|_2)=e_1=(1\ 0\ \ldots\ 0)^T\in\mathbb{C}^p$ and we set $\tilde\delta := \delta/\|u\|_2$ and $\tilde{v} := v/\|u\|_2$. Further, (6.222) follows from the unitary invariance of the uniform distribution on $\mathcal{B}_p(0,r)$, and in (6.223) the factor $1/V_p(r)$ is owing to the assumption of a uniform probability density function on $\mathcal{B}_p(0,r)$.
lemma 6.12 Let $U\in\mathbb{C}^{m\times m}$ be unitary, $P,Q\subseteq\{1,\ldots,m\}$, and consider the orthogonal projection $\Pi_Q(U)=UD_QU^*$ onto the subspace $\mathcal{W}_{U,Q}$. Then
$$|||D_P\Pi_Q(U)|||_2 = |||\Pi_Q(U)D_P|||_2. \qquad (6.228)$$
Moreover, we have
$$|||D_P\Pi_Q(U)|||_2 = \max_{x\in\mathcal{W}_{U,Q}\setminus\{0\}}\frac{\|x_P\|_2}{\|x\|_2} \qquad (6.229)$$
and
$$|||D_P\Pi_Q(U)|||_1 = \max_{x\in\mathcal{W}_{U,Q}\setminus\{0\}}\frac{\|x_P\|_1}{\|x\|_1}. \qquad (6.230)$$
Proof The identity (6.228) follows from
$$|||D_P\Pi_Q(U)|||_2 = |||(D_P\Pi_Q(U))^*|||_2 = |||\Pi_Q(U)D_P|||_2, \qquad (6.231)$$
where in (6.231) we used that $|||\cdot|||_2$ is self-adjoint (see p. 309 of [33]), $\Pi_Q(U)^*=\Pi_Q(U)$, and $D_P^*=D_P$. To establish (6.229), we note that
$$\le \max_{x:\,\Pi_Q(U)x\ne 0}\frac{\|D_P\Pi_Q(U)x\|_2}{\|\Pi_Q(U)x\|_2} \qquad (6.237)$$
$$= \max_{x\in\mathcal{W}_{U,Q}\setminus\{0\}}\frac{\|x_P\|_2}{\|x\|_2} \qquad (6.238)$$
$$= \max_{x:\,\Pi_Q(U)x\ne 0}\left\|D_P\Pi_Q(U)\,\frac{\Pi_Q(U)x}{\|\Pi_Q(U)x\|_2}\right\|_2 \qquad (6.239)$$
$$\le \max_{x:\,\|x\|_2=1}\|D_P\Pi_Q(U)x\|_2, \qquad (6.240)$$
where in (6.236) we used $\|\Pi_Q(U)x\|_2\le\|x\|_2$, which implies $\|\Pi_Q(U)x\|_2\le 1$ for all $x$ with $\|x\|_2=1$. Finally, (6.230) follows by repeating the steps in (6.234)–(6.241) with $\|\cdot\|_2$ replaced by $\|\cdot\|_1$ at all occurrences.
lemma 6.13 Let $A\in\mathbb{C}^{m\times n}$. Then
$$\frac{\|A\|_2}{\sqrt{\mathrm{rank}(A)}} \le |||A|||_2 \le \|A\|_2. \qquad (6.242)$$
Proof The proof is trivial for $A=0$. If $A\ne 0$, set $r=\mathrm{rank}(A)$ and let $\sigma_1,\ldots,\sigma_r$ denote the non-zero singular values of $A$ organized in decreasing order. Unitary invariance of $|||\cdot|||_2$ and $\|\cdot\|_2$ (see Problem 5 on p. 311 of [33]) yields $|||A|||_2=\sigma_1$ and $\|A\|_2=\sqrt{\sum_{i=1}^{r}\sigma_i^2}$.
lemma 6.14 Let $A\in\mathbb{C}^{m\times n}$. Then
$$|||A|||_1 = \max_{j\in\{1,\ldots,n\}}\sum_{i=1}^{m}|A_{i,j}| \qquad (6.244)$$
and
$$\frac{1}{n}\|A\|_1 \le |||A|||_1 \le \|A\|_1. \qquad (6.245)$$
Proof The identity (6.244) is established on p. 294 of [33], and (6.245) follows directly from (6.244).
References
[1] W. Heisenberg, The physical principles of the quantum theory. University of Chicago
Press, 1930.
[2] W. G. Faris, “Inequalities and uncertainty principles,” J. Math. Phys., vol. 19, no. 2, pp.
461–466, 1978.
[3] M. G. Cowling and J. F. Price, “Bandwidth versus time concentration: The Heisenberg–
Pauli–Weyl inequality,” SIAM J. Math. Anal., vol. 15, no. 1, pp. 151–165, 1984.
[4] J. J. Benedetto, Wavelets: Mathematics and applications. CRC Press, 1994, ch. Frame
decompositions, sampling, and uncertainty principle inequalities.
[5] G. B. Folland and A. Sitaram, “The uncertainty principle: A mathematical survey,” J.
Fourier Analysis and Applications, vol. 3, no. 3, pp. 207–238, 1997.
[6] D. L. Donoho and P. B. Stark, “Uncertainty principles and signal recovery,” SIAM J. Appl.
Math., vol. 49, no. 3, pp. 906–931, 1989.
[7] D. L. Donoho and B. F. Logan, “Signal recovery and the large sieve,” SIAM J. Appl. Math.,
vol. 52, no. 2, pp. 577–591, 1992.
[8] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse represen-
tation in pairs of bases,” IEEE Trans. Information Theory, vol. 48, no. 9, pp. 2558–2567,
2002.
[9] C. Studer, P. Kuppinger, G. Pope, and H. Bölcskei, “Recovery of sparsely corrupted
signals,” IEEE Trans. Information Theory, vol. 58, no. 5, pp. 3115–3130, 2012.
[10] P. Kuppinger, G. Durisi, and H. Bölcskei, “Uncertainty relations and sparse signal recovery
for pairs of general signal sets,” IEEE Trans. Information Theory, vol. 58, no. 1, pp.
263–277, 2012.
[11] A. Terras, Fourier analysis on finite groups and applications. Cambridge University Press,
1999.
[12] T. Tao, “An uncertainty principle for cyclic groups of prime order,” Math. Res. Lett.,
vol. 12, no. 1, pp. 121–127, 2005.
[13] D. Stotz, E. Riegler, E. Agustsson, and H. Bölcskei, “Almost lossless analog signal
separation and probabilistic uncertainty relations,” IEEE Trans. Information Theory,
vol. 63, no. 9, pp. 5445–5460, 2017.
[14] S. Foucart and H. Rauhut, A mathematical introduction to compressive sensing.
Birkhäuser, 2013.
[15] K. Gröchenig, Foundations of time–frequency analysis. Birkhäuser, 2001.
[16] D. Gabor, “Theory of communications,” J. Inst. Elec. Eng., vol. 96, pp. 429–457, 1946.
[17] H. Bölcskei, Advances in Gabor analysis. Birkhäuser, 2003, ch. Orthogonal frequency
division multiplexing based on offset QAM, pp. 321–352.
[18] C. Fefferman, “The uncertainty principle,” Bull. Amer. Math. Soc., vol. 9, no. 2, pp.
129–206, 1983.
[19] S. Evra, E. Kowalski, and A. Lubotzky, “Good cyclic codes and the uncertainty principle,”
L’Enseignement Mathématique, vol. 63, no. 2, pp. 305–332, 2017.
[20] E. Bombieri, Le grand crible dans la théorie analytique des nombres. Société
Mathématique de France, 1974.
[21] H. L. Montgomery, Twentieth century harmonic analysis – A celebration. Springer, 2001,
ch. Harmonic analysis as found in analytic number theory.
[22] R. G. Baraniuk and M. B. Wakin, “Random projections of smooth manifolds,” Found.
Comput. Math., vol. 9, no. 1, pp. 51–77, 2009.
[23] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found.
Comput. Math., vol. 9, no. 6, pp. 717–772, 2009.
[24] Y. C. Eldar, P. Kuppinger, and H. Bölcskei, “Block-sparse signals: Uncertainty relations
and efficient recovery,” IEEE Trans. Signal Processing, vol. 58, no. 6, pp. 3042–3054,
2010.
[25] E. J. Candès and Y. Plan, “Tight oracle inequalities for low-rank matrix recovery from
a minimal number of noisy random measurements,” IEEE Trans. Information Theory,
vol. 4, no. 57, pp. 2342–2359, 2011.
Summary
7.1 Introduction
Traditional problems in information theory assume that all statistics are known and
that certain system parameters can be chosen to optimize performance. In contrast, data-
science problems typically assume that the important distributions are either given by
the problem or must be estimated from the data. When the distributions are unknown,
the implied inference problems are more challenging and their analysis can become
intractable. Nevertheless, similar behavior has also been observed in more general high-
dimensional inference problems such as Gaussian mixture clustering [2].
In this chapter, we use the standard linear model as a simple example to illustrate
phase transitions in high-dimensional inference. In Section 7.2, basic properties of the
standard linear model are described and examples are given to describe its behav-
ior. In Section 7.3, a number of connections to information theory are introduced. In
Section 7.4, we present an overview of the authors’ proof that the replica formula for
mutual information is exact. In Section 7.5, connections between posterior correla-
tion and phase transitions are discussed. Finally, in Section 7.6, we offer concluding
remarks.
Notation The probability P[Y = y|X = x] is denoted succinctly by pY|X (y|x) and short-
ened to p(y|x) when the meaning is clear. Similarly, the distinction between discrete and
continuous distributions is neglected when it is inconsequential.
$$p(x\,|\,y) = \frac{p(y\,|\,x)\,p(x)}{p(y)}. \qquad (7.1)$$
In the high-dimensional setting where both M and N are large, direct evaluation of the
posterior distribution can become intractable, and one often resorts to summary statistics
such as the posterior mean/covariance or the marginal posterior distribution of a small
subset of the variables.
The analysis of high-dimensional inference problems focuses on two questions.
• What is the fundamental limit of inference without computational constraints?
• What can be inferred from data using computationally efficient methods?
It is becoming increasingly common for the answers to these questions to be framed
in terms of phase diagrams, which provide important information about fundamental
trade-offs involving the amount and quality of data. For example, the phase diagram
in Fig. 7.1 shows that increasing the amount of data not only provides more informa-
tion, but also moves the problem into a regime where efficient methods are optimal. By
contrast, increasing the SNR may lead to improvements that can be attained only with
significant computational complexity.
Figure 7.1 Phase diagram as a function of the amount of data and the SNR. In the "Impossible" region, all methods fail regardless of computational complexity; in the "Hard" region, all known efficient methods fail, but brute-force methods can still succeed.
Information-Theoretic Analysis
The standard approach taken in information theory is to first obtain precise characteri-
zations of the fundamental limits, without any computational constraints, and then use
these limits to inform the design and analysis of practical methods. In many cases, the
fundamental limits can be understood by studying macroscopic system properties in the
large-system limit, such as the mutual information and the minimum mean-squared error
(MMSE). There are a wide variety of mathematical tools to analyze these quantities in
the context of compression and communication [3, 4]. Unfortunately, these tools alone
are often unable to provide simple descriptions for the behavior of high-dimensional
statistical inference problems.
A first example is the Gaussian channel, in which each unknown variable is observed directly in Gaussian noise:

Yn = √s Xn + Wn,    n = 1, . . . , N,    (7.2)

where {Wn} is i.i.d. standard Gaussian noise and s ∈ [0, ∞) parameterizes the signal-to-
noise ratio. The Gaussian channel, which is also known as the Gaussian sequence model
in the statistics literature [16], provides a useful first-order approximation for a wide
variety of applications in science and engineering.
The standard linear model is an important generalization of the Gaussian channel in
which the observations consist of noisy linear measurements:
Ym = ⟨Am, X⟩ + Wm,    m = 1, . . . , M,    (7.3)

where ⟨·, ·⟩ denotes the standard Euclidean inner product, {Am} is a known sequence of N-length measurement vectors, and {Wm} is i.i.d. Gaussian noise. Unlike for the
Gaussian channel, the number of observations M may be different from the number of
unknown variables N. For this reason the measurement indices are denoted by m instead
of n. In matrix form, the standard linear model can be expressed as
Y = AX + W, (7.4)
where A is the M × N matrix whose mth row is given by Am. This model arises in many applications, including compressed sensing [17, 18]. In the standard linear model, the matrix A induces dependences between the unknown variables, which make the inference problem significantly more difficult.
Typical inference questions for the standard linear model include the following.
• Estimation of unknown variables. The performance of an estimator Y → X̂ is often measured using its mean-squared error (MSE),

E[ ‖X − X̂‖² ].
The optimal MSE, computed by minimizing over all possible estimators, is called
the minimum mean-squared error (MMSE),
E[ ‖X − E[X | Y]‖² ].
The MMSE is also equivalent to the Bayes risk under squared-error loss.
• Prediction of a new observation Ynew = ⟨Anew, X⟩ + Wnew. The performance of an estimator (Y, Anew) → Ŷnew is often measured using the prediction mean-squared error,

E[ (Ynew − Ŷnew)² ].
• Detection of whether the ith entry belongs to a subset K of the real line. For example, K = R \ {0} tests whether entries are non-zero. In practice, one typically defines a test statistic

T(y) = ln [ p(Y = y | Xi ∈ K) / p(Y = y | Xi ∉ K) ]

and then uses a threshold rule that chooses Xi ∈ K if T(y) ≥ λ and Xi ∉ K otherwise. The performance of this detection rule can be measured using the true-positive rate (TPR) and the false-positive rate (FPR) given by

TPR = p(T(Y) ≥ λ | Xi ∈ K),    FPR = p(T(Y) ≥ λ | Xi ∉ K).
The receiver operating characteristic (ROC) curve for this binary decision prob-
lem is obtained by plotting the TPR versus the FPR as a parametric function of
the threshold λ. An example is given in Fig. 7.3 below.
• Posterior marginal approximation of a subset S of unknown variables. The goal
is to compute an approximation p(xS | Y) of the marginal distribution of entries in
S, which can be used to provide summary statistics and measures of uncertainty. In
some cases, accurate approximation of the posterior for small subsets is possible
even though the full posterior distribution is intractable.
with a very wide distribution). Consider a sequence of problems where the number of measurements per signal dimension converges to δ. In this case, the normalized mutual information and MMSE corresponding to the large-system limit¹ are given by

I(δ) ≜ lim_{M,N→∞, M/N→δ} (1/N) I(X; Y | A),    M(δ) ≜ lim_{M,N→∞, M/N→δ} (1/N) mmse(X | Y, A).
Part of what makes this problem interesting is that the MMSE can have discon-
tinuities, which are referred to as phase transitions. The values of δ at which these
discontinuities occur are of significant interest because they correspond to problem set-
tings in which a small change in the number of measurements makes a large difference
in the ability to estimate the unknown variables. In the above limit, the value of M(δ) is
undefined at these points.
Replica-Symmetric Formula
Using the heuristic replica method from statistical physics, Guo and Verdú [19] derived
single-letter formulas for the mutual information and MMSE in the standard linear
model with i.i.d. variables and an i.i.d. Gaussian matrix:
I(δ) = min_{s ≥ 0} { I(X; √s X + W) + (δ/2)[ log(δ/s) + s/δ − 1 ] }    (7.5)

and

M(δ) = mmse( X | √(s*(δ)) X + W ),    (7.6)

where F(s) denotes the objective function being minimized in (7.5).
In these expressions, X is a univariate random variable drawn according to the prior pX ,
W ∼ N(0, 1) is independent Gaussian noise, and s∗ (δ) is a minimizer of the objective
function F (s). Precise definitions of the mutual information and the MMSE are pro-
vided in Section 7.3 below. By construction, the replica mutual information (7.5) is a
continuous function of the measurement rate δ. However, the replica MMSE predic-
tion (7.6) may have discontinuities when the global minimizer s∗ (δ) jumps from one
minimum to another, and M(δ) is well defined only if s∗ (δ) is the unique minimizer.
In [20–22], the authors prove these expressions are exact for the standard linear model
with an i.i.d. Gaussian measurement matrix. An overview of this proof is presented in
Section 7.4.
1 Under reasonable conditions, one can show that these limits are well defined for almost all non-negative
values of δ and that I(δ) is continuous.
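To make the minimization in (7.5) concrete, the following minimal Python sketch evaluates the replica objective numerically for a Bernoulli–Gaussian prior and tracks the global minimizer s*(δ) as the measurement rate varies. The function names, the grid, and the parameter values are illustrative choices for this sketch, not material taken from the chapter.

import numpy as np
from scipy.integrate import quad

gamma, sigma2 = 0.2, 1e6          # prior inclusion probability and slab variance (illustrative)

def scalar_mi(s):
    # I(X; sqrt(s) X + W) in nats for X ~ BG(0, sigma2, gamma) and W ~ N(0, 1).
    # Y is a two-component Gaussian mixture, so h(Y) is computed by quadrature.
    v0, v1 = 1.0, 1.0 + s * sigma2
    def p_y(y):
        return ((1 - gamma) * np.exp(-y * y / (2 * v0)) / np.sqrt(2 * np.pi * v0)
                + gamma * np.exp(-y * y / (2 * v1)) / np.sqrt(2 * np.pi * v1))
    def integrand(y):
        p = p_y(y)
        return -p * np.log(p) if p > 0 else 0.0
    lim = 8.0 * np.sqrt(v1)
    h_y, _ = quad(integrand, -lim, lim, points=[-8.0, 0.0, 8.0], limit=200)
    return h_y - 0.5 * np.log(2 * np.pi * np.e)

def replica_objective(mi_values, s_grid, delta):
    # The objective F(s) from (7.5), evaluated on a grid of s values.
    return mi_values + 0.5 * delta * (np.log(delta / s_grid) + s_grid / delta - 1.0)

s_grid = np.logspace(-8, 2, 400)
mi_values = np.array([scalar_mi(s) for s in s_grid])
for delta in np.linspace(0.1, 0.5, 9):
    F = replica_objective(mi_values, s_grid, delta)
    k = np.argmin(F)
    print(f"delta = {delta:.2f}   I(delta) = {F[k]:.4f}   s*(delta) = {s_grid[k]:.3e}")

Printing s*(δ) over a range of δ makes the jump of the minimizer, and hence of the replica MMSE in (7.6), visible.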
The fixed points of the AMP state evolution correspond to the stationary points of the objective function F(s) in (7.5). For cases where the replica formulas are exact, this means that AMP-type algorithms can be optimal with respect to marginal inference problems whenever the largest local minimizer of F(s) is also the global minimizer [32].
The Bernoulli–Gaussian distribution BG(x | μ, σ², γ) is a mixture of a point mass at zero and a Gaussian N(x | μ, σ²), in which the parameter γ ∈ (0, 1) determines the expected fraction of non-zero entries. The mean and variance of a random variable X ∼ BG(x | μ, σ², γ) are given by

E[X] = γμ,    Var(X) = γ(1 − γ)μ² + γσ².    (7.9)
For the Gaussian channel Yn = √s Xn + Wn, the posterior distribution of Xn given Yn = yn is again Bernoulli–Gaussian, BG(xn | μn, σ²n, γn), with posterior inclusion probability

γn = [ 1 + (1/γ − 1) √(1 + sσ²) exp( (sμ² − 2√s μ yn − sσ² yn²) / (2(1 + sσ²)) ) ]⁻¹.    (7.13)
Given these parameters, the posterior mean E[Xn | Yn ] and the posterior variance
Var(Xn | Yn ) can be computed using (7.9). The parameter γn is the conditional probability
that Xn is non-zero given Yn . This parameter is often called the posterior inclusion
probability in the statistics literature.
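As a small illustration of these formulas, the following Python sketch computes the posterior inclusion probability from (7.13) together with the posterior mean and variance obtained from the mixture moments in (7.9). The slab mean and variance are written out using standard Gaussian conditioning and, like the default parameter values, are assumptions of the sketch rather than quoted formulas.

import numpy as np

def bg_posterior(y, s, mu=0.0, sigma2=1e6, gamma=0.2):
    # Posterior inclusion probability for the channel Y = sqrt(s) X + W, cf. (7.13).
    expo = (s * mu**2 - 2 * np.sqrt(s) * mu * y - s * sigma2 * y**2) / (2 * (1 + s * sigma2))
    gamma_n = 1.0 / (1.0 + (1.0 / gamma - 1.0) * np.sqrt(1 + s * sigma2) * np.exp(expo))
    # Conditional on X being drawn from the Gaussian slab, (X, Y) are jointly Gaussian:
    mu_n = mu + np.sqrt(s) * sigma2 * (y - np.sqrt(s) * mu) / (1 + s * sigma2)
    sigma2_n = sigma2 / (1 + s * sigma2)
    # Posterior mean and variance via the Bernoulli-Gaussian moments in (7.9).
    post_mean = gamma_n * mu_n
    post_var = gamma_n * (1 - gamma_n) * mu_n**2 + gamma_n * sigma2_n
    return gamma_n, post_mean, post_var

# One noisy observation drawn from the model, followed by its posterior summary.
rng = np.random.default_rng(0)
x = rng.normal(0.0, np.sqrt(1e6)) if rng.random() < 0.2 else 0.0
s = 1e-4
y = np.sqrt(s) * x + rng.normal()
print(x, bg_posterior(y, s))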
The decoupling of the posterior distribution makes it easy to characterize the funda-
mental limits of performance measures. For example the MMSE is the expectation of the
posterior variance Var(Xn | Yn ) and the optimal trade-off between the true-positive rate
and false-positive rate for detecting the event {Xn ≠ 0} is characterized by the distribution
of γn .
To investigate the statistical properties of the posterior distribution we perform a
numerical experiment. First, we draw N = 10 000 variables according to a Bernoulli–Gaussian distribution with μ = 0, σ² = 10⁶, and prior inclusion probability γ = 0.2. Then,
for various values of the signal-to-noise-ratio parameter s, we evaluate the posterior
distribution corresponding to the output of the Gaussian channel.
In Fig. 7.2 (left panel), we plot three quantities associated with the estimation error:
average squared error:    (1/N) Σ_{n=1}^{N} ( Xn − E[Xn | Yn] )²,

average posterior variance:    (1/N) Σ_{n=1}^{N} Var(Xn | Yn),

average MMSE:    (1/N) Σ_{n=1}^{N} E[ Var(Xn | Yn) ].
Note that the squared error and posterior variance are both random quantities because
they are functions of the data. This means that the corresponding plots would look slightly different if the experiment were repeated multiple times.
Figure 7.2 Comparison of average squared error for the Gaussian channel as a function of the
signal-to-noise ratio (left panel) and the standard linear model as a function of the number of
observations (right panel). In both cases, the unknown variables are i.i.d. Bernoulli–Gaussian
with zero mean and a fraction γ = 0.20 of non-zero entries.
Figure 7.3 ROC curves (true-positive rate versus false-positive rate) for detecting the non-zero variables, evaluated at the points labeled A and B in Fig. 7.2.
The MMSE, however,
is a function of the joint distribution of (X, Y) and is thus non-random. In this setting, the
fact that there is little difference between the averages of these quantities can be seen
as a consequence of the decoupling of the posterior distribution and the law of large
numbers.
In Fig. 7.3, we plot the ROC curve for the problem of detecting the non-zero vari-
ables. The curves are obtained by thresholding the posterior inclusion probabilities {γn }
associated with the values of the signal-to-noise ratio at the points A and B in Fig. 7.2.
For the standard linear model with a Bernoulli–Gaussian prior, the posterior distribution can be written exactly as a mixture over the possible support vectors u ∈ {0, 1}^N:

p(x | y, A) = Σ_{u ∈ {0,1}^N} p(u | y, A) N( x | E[X | y, A, u], Cov(X | y, A, u) ),
where the summation is over all possible support sets. The posterior probability of the
support set is given by
p(u | y, A) ∝ p(u) N( y | μ A_u 1, I + σ² A_u A_uᵀ ),

where A_u denotes the submatrix of A consisting of the columns indexed by the support u. Because exact evaluation of this posterior requires a summation over all 2^N support sets, it is intractable for large N. Instead, the AMP algorithm provides approximations of the marginal posterior distributions of the form

p̂(xn | y, A) = BG(xn | μn, σ²n, γn),    (7.14)

where the parameters (μn, σ²n, γn) are the outputs of the AMP algorithm.
Similarly to the previous example, we perform a numerical experiment to investi-
gate the statistical properties of the marginal approximations. First, we draw N = 10 000 variables according to a Bernoulli–Gaussian distribution with μ = 0, σ² = 10⁶, and prior inclusion probability γ = 0.2. Then, for various values of M, we obtain measurements from the standard linear model with i.i.d. Gaussian measurement vectors Am ∼ N(0, N⁻¹ I) and use AMP to compute the parameters (μn, σ²n, γn) used in the
marginal posterior approximations.
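A compact sketch of one standard form of AMP with a Bernoulli–Gaussian posterior-mean denoiser is given below, intended only to make the roles of the parameters (μn, σ²n, γn) concrete. The matrix scaling (approximately unit-norm columns), the problem sizes, and the derivative-of-denoiser identity used for the Onsager term are assumptions of this sketch, not details taken from the chapter.

import numpy as np

def bg_denoiser(r, tau2, sigma2, gamma):
    # Posterior mean, variance, and inclusion probability for X ~ BG(0, sigma2, gamma)
    # observed through the scalar channel R = X + N(0, tau2).
    ratio = ((1 - gamma) / gamma) * np.sqrt((sigma2 + tau2) / tau2) \
            * np.exp(-r**2 * sigma2 / (2 * tau2 * (sigma2 + tau2)))
    pi = 1.0 / (1.0 + ratio)                     # posterior inclusion probability
    m_slab = r * sigma2 / (sigma2 + tau2)        # slab posterior mean
    v_slab = sigma2 * tau2 / (sigma2 + tau2)     # slab posterior variance
    mean = pi * m_slab
    var = pi * (1 - pi) * m_slab**2 + pi * v_slab
    return mean, var, pi

def amp(y, A, sigma2, gamma, n_iter=50):
    M, N = A.shape
    x_hat, z = np.zeros(N), y.copy()
    for _ in range(n_iter):
        tau2 = np.sum(z**2) / M                  # estimate of the effective noise variance
        r = x_hat + A.T @ z                      # pseudo-data, approximately X + N(0, tau2)
        x_hat, var, pi = bg_denoiser(r, tau2, sigma2, gamma)
        onsager = (N / M) * np.mean(var) / tau2  # mean derivative of the denoiser
        z = y - A @ x_hat + onsager * z
    return x_hat, var, pi

rng = np.random.default_rng(1)
N, M, gamma, sigma2 = 2000, 1000, 0.2, 1e6       # smaller than the experiment in the text
x = rng.normal(0.0, np.sqrt(sigma2), N) * (rng.random(N) < gamma)
A = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))    # i.i.d. Gaussian, approximately unit-norm columns
y = A @ x + rng.normal(0.0, 1.0, M)
x_hat, var, pi = amp(y, A, sigma2, gamma)
print("per-coordinate squared error:         ", np.mean((x - x_hat) ** 2))
print("average estimated posterior variance: ", np.mean(var))

When the two printed quantities are close, this mirrors, on a smaller scale, the self-consistency of the AMP variance estimate discussed below.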
In Fig. 7.2 (right panel), we plot the squared error and the approximation of the
posterior variance associated with the AMP marginal approximations:
average AMP squared error:    (1/N) Σ_{n=1}^{N} ( Xn − E_p̂[Xn | Y, A] )²,

average AMP posterior variance:    (1/N) Σ_{n=1}^{N} Var_p̂(Xn | Y, A).
In these expressions, the expectation and the variance are computed with respect to the
marginal approximation in (7.14). Because these quantities are functions of the random
data, one expects that they would look slightly different if the experiment were repeated
multiple times.
At this point, there are already some interesting observations that can be made. First,
we note that the AMP approximation of the mean can be viewed as a point-estimate
of the unknown variables. Similarly, the AMP approximation of the posterior variance
(which depends on the observations but not the ground truth) can be viewed as a point-
estimate of the squared error. From this perspective, the close correspondence between
the squared error and the AMP approximation of the variance seen in Fig. 7.2 suggests that AMP is self-consistent in the sense that it provides an accurate estimate of its own squared error.
Another observation is that the squared error undergoes an abrupt change at around
3500 observations, between the points labeled A and B. Before this, the squared error
is within an order of magnitude of the prior variance. After this, the squared error drops
discontinuously. This illustrates that the estimator provided by AMP is quite accurate in
this setting.
However, there are still some important questions that remain. For example, how
accurate are the AMP posterior marginal approximations? Is it possible that a different
algorithm (e.g., one that computes the true posterior marginals) would lead to estimates
with significantly smaller squared error? Further questions concern how much informa-
tion is lost in focusing only on the marginals of the posterior distribution as opposed to
the full posterior distribution.
Unlike the Gaussian channel, it is not possible to evaluate the MMSE directly because the summation over all 2^10 000 support vectors is prohibitively large. For comparison, we
plot the large-system MMSE predicted by (7.6), which corresponds to the large-N limit
where the fraction of observations is parameterized by δ = M/N.
The behavior of the large-system MMSE is qualitatively similar to the AMP squared
error because it has a single jump discontinuity (or phase transition). However, the jump
occurs after only 2850 observations for the MMSE as opposed to after 3600 observations
for AMP. By comparing the AMP squared error with the asymptotic MMSE, we see that
the AMP marginal approximations are accurate in some cases (e.g., when the number
of observations is fewer than 2840 or greater than 3600) but highly inaccurate in others
(e.g., when the number of observations is between 2840 and 3600).
In Fig. 7.3, we plot the ROC curve for the problem of detecting the non-zero variables
in the standard linear model. In this case, the curves are obtained by thresholding AMP
approximations of the posterior inclusion probabilities {γn }. It is interesting to note that
the ROC curves corresponding to the two different observation models (the Gaussian
channel and the standard linear model) have similar shapes when they are evaluated in
problem settings with matched squared error.
The amount one learns about an unknown vector X from an observation Y can be quan-
tified in terms of the difference between the prior and posterior distributions. For a
particular realization of the observations y, a fundamental measure of this difference
is provided by the relative entropy
D( pX|Y(· | y) ‖ pX(·) ).
This quantity is non-negative and equal to zero if and only if the posterior is the same as
the prior almost everywhere.
For real vectors, another way to assess the difference between the prior and posterior
distributions is to compare the first and second moments of their distributions. These
moments are summarized by the mean E[X], the conditional mean E[X | Y = y], the covariance matrix

Cov(X) ≜ E[ (X − E[X])(X − E[X])ᵀ ],

and the conditional covariance matrix

Cov(X | Y = y) ≜ E[ (X − E[X | Y = y])(X − E[X | Y = y])ᵀ | Y = y ].

Together, these provide some measure of how much “information” there is in the data.
One of the difficulties of working with the posterior distribution directly is that it can
depend non-trivially on the particular realization of the data. It can be much easier to
focus on the behavior for typical realizations of the data by studying the distribution of
the relative entropy when Y is drawn according to the marginal distribution p(y). For
example, the expectation of the relative entropy is the mutual information
I(X; Y) ≜ E[ D( pX|Y(· | Y) ‖ pX(·) ) ].
Similarly, the expected conditional covariance matrix E[Cov(X | Y)] summarizes the second-order behavior of the posterior for typical realizations of the data. The trace of this matrix equals the Bayes risk for squared-error loss, which is more commonly called the MMSE and is defined by

mmse(X | Y) ≜ tr( E[Cov(X | Y)] ) = E[ ‖X − E[X | Y]‖²₂ ],

where ‖·‖₂ denotes the Euclidean norm. Part of the appeal of working with the mutual
information and the MMSE is that they satisfy a number of useful functional properties,
including chain rules and data-processing inequalities.
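For jointly Gaussian models both quantities have closed forms, which allows a quick numerical sanity check. The following sketch, with arbitrary dimensions, evaluates I(X; Y) and mmse(X | Y) for Y = AX + W with X ~ N(0, I) and W ~ N(0, I), and compares the MMSE with a Monte Carlo estimate of E‖X − E[X | Y]‖²; the closed-form expressions are standard Gaussian identities assumed for the sketch rather than formulas quoted from the chapter.

import numpy as np

rng = np.random.default_rng(0)
M, N = 30, 20
A = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))

# Closed forms for the jointly Gaussian case.
mi = 0.5 * np.linalg.slogdet(np.eye(M) + A @ A.T)[1]   # I(X; Y) in nats
cov_post = np.linalg.inv(np.eye(N) + A.T @ A)          # Cov(X | Y), constant in Y here
mmse = np.trace(cov_post)                              # tr(E[Cov(X | Y)])

# Monte Carlo estimate of E||X - E[X|Y]||^2 using E[X|Y] = cov_post A^T Y.
n_trials = 20000
X = rng.normal(size=(n_trials, N))
Y = X @ A.T + rng.normal(size=(n_trials, M))
X_hat = Y @ A @ cov_post
mc_mmse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mi, mmse, mc_mmse)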
The prudence of focusing on the expectation with respect to the data depends on the
extent to which the random quantities of interest deviate from their expectations. In the
statistical physics literature, the concentration of the relative entropy and squared error
around their expectations is called the self-averaging property and is often assumed for
large systems [5].
2 Observe that Cov(X | Y) is a random variable in the same sense as E[X | Y].
Proof The MMSE function is real analytic, and hence infinitely differentiable, on (0, ∞) [41]. Differentiating kX(s) = 1/MX(s) − s, one finds that

(d/ds) kX(s) = −M′X(s)/M²X(s) − 1.    (7.18)
Let λ1, . . . , λN ∈ [0, ∞) be the eigenvalues of the N × N matrix E[Cov(X | Y)], where Y = √s X + W. Starting with (7.16), we find that

−M′X(s) = (1/N) E[ ‖Cov(X | Y)‖²_F ] ≥ (1/N) ‖E[Cov(X | Y)]‖²_F
        = (1/N) Σ_{n=1}^{N} λn² ≥ ( (1/N) Σ_{n=1}^{N} λn )² = M²X(s),
where both inequalities are due to Jensen’s inequality. Combining this inequality with
(7.18) establishes that the derivative of kX (s) is non-negative and hence kX (s) is non-
decreasing.
We remark that Theorem 7.1 implies the single-crossing property. To see this, note
that if Z ∼ N(0, σ2 I) then kZ (s) = σ−2 is a constant and thus kX (s) and kZ (s) cross
at most once. Furthermore, Theorem 7.1 shows that, for many problems, the Gaussian
distribution plays an extremal role for distributions with finite second moments. For
example, if we let Z be a Gaussian random vector with the same mean and covariance
as X, then we have
IX(s) ≤ IZ(s),    MX(s) ≤ MZ(s),
where equality holds if and only if X is Gaussian. The importance of these inequalities
follows from the fact that the Gaussian distribution is easy to analyze and often well
behaved.
• Low Error Probability. For the MAP decoding decision X̂, we have

Pr[ X̂ = X ] ≥ P[ pX|Y(x(J) | Y) > max_{ℓ ≠ J} pX|Y(x(ℓ) | Y) ] ≥ 1 − η.
≤ ∫₀^snr ( 1/(1 + t) − MX(t) ) dt    (7.28)

= log(1 + snr) − 2 IX(snr)    (7.29)

≤ 2ε,    (7.30)

where (7.26) follows from Theorem 7.1, (7.28) holds because the upper bound in (7.23) ensures that the integrand is non-negative, and (7.30) follows from the assumed lower bound on the mutual information. Exponentiating both sides, rearranging terms, and recalling the definition of kX(s) leads to the stated lower bound.
An immediate consequence of Theorem 7.2 is that the MMSE function associated
with a sequence of good codes undergoes a phase transition in the large-N limit. In
particular,
lim_{N→∞} MX(s) = 1/(1 + s) for 0 ≤ s < snr, and lim_{N→∞} MX(s) = 0 for snr < s.    (7.31)

The case s ∈ [snr, ∞) follows from the definition of a good code and the monotonicity of the MMSE function. The case s ∈ [0, snr) follows from Theorem 7.2 and the fact that ε can be arbitrarily small.
An analogous result for binary linear codes on the Gaussian channel can be
found in [43]. The characterization of the asymptotic MMSE for good Gaussian
codes, described by (7.31), was also obtained previously using ideas from statisti-
cal physics [44]. The derivation presented in this chapter, which relies only on the
monotonicity of kX (s), bypasses some technical difficulties encountered in the previous
approach.
Define the first- and second-order differences of the order-averaged mutual information sequence Im ≜ (1/M!) Σ_π I(X; Y_{π(1)}, . . . , Y_{π(m)}) by

I′m ≜ I_{m+1} − I_m,    I″m ≜ I′_{m+1} − I′_m.

Using the chain rule for mutual information, it is straightforward to show that the first- and second-order difference sequences can also be expressed as

I′m = (1/M!) Σ_π I(X; Y_{π(m+1)} | Y_{π(1)}, . . . , Y_{π(m)}),

I″m = (1/M!) Σ_π I(Y_{π(m+2)}; Y_{π(m+1)} | X, Y_{π(1)}, . . . , Y_{π(m)})
      − (1/M!) Σ_π I(Y_{π(m+2)}; Y_{π(m+1)} | Y_{π(1)}, . . . , Y_{π(m)}).    (7.32)
Throughout, we assume that the observations are conditionally independent given the unknown variables, i.e.,

p(y1, . . . , yM | x) = Π_{m=1}^{M} p(ym | x).    (7.33)

The class of models satisfying this condition is quite broad and includes memoryless channels and generalized linear models as special cases. The significance of the conditional independence assumption is summarized in the following result.
theorem 7.3 (Monotonicity of incremental information) The first-order difference
sequence {Im } is monotonically decreasing for any observation model satisfying the
conditional independence condition in (7.33).
Proof Under assumption (7.33), two new observations Yπ(m+1) and Yπ(m+2) are conditionally independent given X, and thus the first term on the right-hand side of (7.32) is zero. This means that the second-order difference sequence is non-positive, which implies monotonicity of the first-order difference.
The monotonicity in Theorem 7.3 can also be seen as a consequence of the subset
inequalities studied by Han; see Chapter 17 of [3]. Our focus on the incremental infor-
mation is also related to prior work in coding theory that uses an integral–derivative
relationship for the mutual information called the area theorem [45].
Similarly to the monotonicity properties studied in Section 7.3.1, the monotonicity
of the first-order difference imposes a number of constraints on the mutual informa-
tion sequence. Some examples illustrating the usefulness of these constraints will be
provided in the following sections.
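The monotonicity in Theorem 7.3 can also be checked numerically in the jointly Gaussian case, where I(X; Y_T) = (1/2) log det(I + A_T A_Tᵀ) for any subset T of the measurements, so the order-averaged sequence Im can be computed exactly by averaging over subsets. The dimensions and matrix below are arbitrary; the sketch only illustrates the statement.

import itertools
import numpy as np

rng = np.random.default_rng(0)
M, N = 12, 8
A = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))

def mi_subset(rows):
    # I(X; Y_rows) for X ~ N(0, I_N), Y = A X + W, W ~ N(0, I_M), restricted to `rows`.
    A_T = A[list(rows)]
    return 0.5 * np.linalg.slogdet(np.eye(len(rows)) + A_T @ A_T.T)[1]

# Order-averaged information sequence: average of I(X; Y_T) over all subsets of size m.
I_bar = [0.0] + [
    np.mean([mi_subset(T) for T in itertools.combinations(range(M), m)])
    for m in range(1, M + 1)
]
first_diff = np.diff(I_bar)                      # the sequence I'_m
print(np.round(first_diff, 4))
print("non-increasing:", bool(np.all(np.diff(first_diff) <= 1e-12)))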
The authors’ prior work [20–22] provided the first rigorous proof of the replica formulas
(7.5) and (7.6) for the standard linear model with an i.i.d. signal and a Gaussian sensing
matrix.
The following theorem, from [20, 21], quantifies the precise sense in which these approximations hold.
theorem 7.6 Consider the standard linear model (7.3) with i.i.d. Gaussian measurement vectors Am ∼ N(0, N⁻¹ I_N). If the entries of X are i.i.d. with bounded fourth moment E[Xn⁴] ≤ B, then the sequences {I′m} and {Mm} satisfy

Σ_{m=1}^{δN} | I′m − (1/2) log(1 + Mm) | ≤ C_{B,δ} N^α,    (7.44)

Σ_{m=1}^{δN} | Mm − MX( (m/N) / (1 + Mm) ) | ≤ C_{B,δ} N^α    (7.45)

for every integer N and δ > 0, where α ∈ (0, 1) is a universal constant and C_{B,δ} is a constant that depends only on the pair (B, δ).
Theorem 7.6 shows that the normalized sum of the cumulative absolute error in
approximations (7.42) and (7.43) grows sub-linearly with the vector length N. Thus,
if one normalizes these sums by M = δN, then the resulting expressions converge to
zero as N → ∞. This is sufficient to establish (7.40) and (7.41).
Figure 7.4 The derivative of the mutual information I′(δ) as a function of the measurement rate δ for linear estimation with i.i.d. Gaussian matrices. The information fixed-point curve is the graph of all possible solutions to the MMSE fixed-point equation (7.40) and the I-MMSE formula (7.41); the correct solution is the non-increasing subset of the information fixed-point curve that matches the boundary conditions for the mutual information. A phase transition occurs when the derivative jumps from one branch of the information fixed-point curve to another.
Because the derivative of the mutual information must be non-increasing, in this case it must jump from the upper solution branch to the lower solution branch (see Fig. 7.4) at the phase transition.
The final step in the authors’ proof technique is to resolve the location of the phase
transition using boundary conditions on the mutual information for δ = 0 and δ → ∞,
which can be obtained directly using different arguments. Under the signal property
stated below in Definition 7.1, it is shown that the only solution consistent with the
boundary conditions is the one predicted by the replica method. A graphical illustration
of this argument is provided in Fig. 7.4.
3 Regrettably, this is unrelated to the “single-crossing property” described earlier that says the MMSE
function MX (s) may cross the matched Gaussian MMSE MZ (s) at most once.
One of the key differences in the compressed sensing problem is that the conditional entropy of the signal does not drop to zero after the phase transition.
The authors’ proof technique also differs from previous approaches that use system-
wide interpolation methods to obtain one-sided bounds [5, 48] or that focus on special
cases, such as sparse matrices [49], Gaussian mixture models [50], or the detection
problem of support recovery [51, 52]. After [20, 22], Barbier et al. obtained similar
results using a substantially different method [53, 54]. More recent work has provided
rigorous results for the generalized linear model [55].
Going beyond the MMSE, there is also important information contained in the off-diagonal entries of the conditional covariance matrix Cov(X | Y), which describe the pair-wise correlations. A useful measure of this correlation is provided by the mean-squared covariance:
E[ ‖Cov(X | Y)‖²_F ] = Σ_{k=1}^{N} Σ_{n=1}^{N} E[ Cov²(Xk, Xn | Y) ].    (7.49)
Note that, while the MMSE corresponds to N terms, the mean-squared covariance corresponds to N² terms. If the entries in X have bounded fourth moments (i.e., E[Xi⁴] ≤ B), then it follows from the Cauchy–Schwarz inequality that each summand is upper-bounded by B.
Taking the expectations of these random quantities and rearranging terms, one finds that the mean-squared covariance can be decomposed into three non-negative terms:

E[ ‖Cov(X | Y)‖²_F ] = N (E[Λ̄])² + N Var(Λ̄) + E[ Σ_{n=1}^{N} (Λn − Λ̄)² ],    (7.51)

where Λ̄ = (1/N) Σ_{n=1}^{N} Λn denotes the arithmetic mean of the eigenvalues. The first term
on the right-hand side corresponds to the square of the MMSE and is equal to the lower
bound in (7.50). The second term on the right-hand side corresponds to the variance of
(1/N) tr(Cov(X | Y)) with respect to the randomness in Y. This term is equal to zero if
X and Y are jointly Gaussian. The last term corresponds to the expected variation in
the eigenvalues. If a small number of eigenvalues are significantly larger than the others
then it is possible for this term to be N times larger than the first term. When this occurs,
most of the uncertainty in the posterior distribution is concentrated on a low-dimensional
subspace.
without bound. From (7.53), we see that this also implies significant correlation in the
posterior distribution.
Evaluating the conditional MMSE function and its derivative at s = 0 provides expressions for the MMSE and mean-squared covariance associated with the original observation model:

MX|Y(0) = (1/N) mmse(X | Y),    (7.54)

M′X|Y(0) = −(1/N) E[ ‖Cov(X | Y)‖²_F ].    (7.55)
In light of the discussion above, the mean-squared covariance can be interpreted as the
rate of MMSE change with s that occurs when one is presented with an independent
observation of X from a Gaussian channel with infinitesimally small signal-to-noise
ratio. Furthermore, we see that significant correlation in the posterior distribution
corresponds to a jump discontinuity in the large-N limit of MX|Y (s) at the point s = 0.
theorem 7.8 (Few large second differences) For any observation model satisfying the conditional independence condition in (7.33) and any positive number T, the second-order difference sequence {I″m} satisfies

|{ m : |I″m| ≥ T }| ≤ I₁ / T,    (7.57)

where I₁ = (1/M) Σ_{m=1}^{M} I(X; Ym) is the first term in the information sequence.

Proof The monotonicity of the first-order difference (Theorem 7.3) means that I″m is non-positive, and hence the indicator function of the event {|I″m| ≥ T} is upper-bounded by −I″m/T. Summing this inequality over m, we obtain

|{ m : |I″m| ≥ T }| = Σ_{m=0}^{M−2} 1_{[T,∞)}(|I″m|) ≤ Σ_{m=0}^{M−2} (−I″m)/T = (I′₀ − I′_{M−1})/T,

and the stated bound follows because I′_{M−1} ≥ 0 and I′₀ = I₁.
for all integers N and m = 1, . . . , M, where C B is a constant that depends only on the
fourth-moment upper bound B.
Theorem 7.9 shows that significant correlation in the posterior distribution implies
pair-wise dependence in the joint distribution of new measurements and, hence, a
significant decrease in the first-order difference sequence I′m. In particular, if the mean-squared covariance is of order N² (corresponding to the upper bound in (7.50)), then |I″m| is lower-bounded by a constant. If we consider the large-N limit in which the number of observations is parameterized by the fraction δ = m/N, then an order-one difference in I′m corresponds to a jump discontinuity with respect to δ. In other words, signif-
icant pair-wise correlation implies a phase transition with respect to the fraction of
observations.
Viewed in the other direction, Theorem 7.9 also shows that small changes in the first-
order difference sequence imply that the average pair-wise correlation is small. From
Theorem 7.8, we see that this is, in fact, the typical situation. Under the assumptions of Theorem 7.9, it can be verified that

|{ m : E[ ‖Cov(X | Y^m, A^m)‖²_F ] ≥ N^{2−ε} }| ≤ C_{ε,B} N^{1−ε/4}    (7.59)

for all 0 ≤ ε ≤ 1, where C_{ε,B} is a constant that depends only on ε and the fourth-moment bound B. In other words, the number of m-values for which the mean-squared covariance has the same order as the upper bound in (7.50) must be sub-linear in N. This fact plays a key role in the proof of Theorem 7.6; see [20, 21] for details.
7.6 Conclusion
This chapter provides a tutorial introduction to high-dimensional inference and its con-
nection to information theory. The standard linear model is analyzed in detail and used
as a running example. The primary goal is to present intuitive links between phase tran-
sitions, mutual information, and estimation error. To that end, we show how general
functional properties (e.g., the chain rule, data-processing inequality, and I-MMSE rela-
tionship) of mutual information and MMSE can imply meaningful constraints on the
solutions of challenging problems. In particular, the replica prediction of the mutual
information and MMSE is described and an outline is given for the authors’ proof that it
is exact in some cases. We hope that the approach described here will make this material
accessible to a wider audience.
This appendix describes the mapping between the standard linear model and a signal-
plus-noise response model for a subset of the observations. Recall the problem
formulation
Y = AX + W (7.60)
where A is an M × N matrix. Suppose that we are interested in the posterior distribution
of a subset S ⊂ {1, . . . , N} of the signal entries where the size of the subset K = |S | is
small relative to the signal length N and the number of measurements M. Letting S^c = {1, . . . , N} \ S denote the complement of S, the measurements can be decomposed as
Y = A_S X_S + A_{S^c} X_{S^c} + W,    (7.61)

where A_S is an M × K matrix corresponding to the columns of A indexed by S and A_{S^c} is an M × (N − K) matrix corresponding to the columns indexed by the complement of S. This decomposition suggests an alternative interpretation of the linear model in which X_S is a low-dimensional signal of interest and X_{S^c} is a high-dimensional interference term. Note that A_S is a tall skinny matrix, and thus the noiseless measurements of X_S lie in a K-dimensional subspace of the M-dimensional measurement space.
Next, we introduce a linear transformation of the problem that attempts to separate the signal of interest from the interference term. The idea is to consider the QR decomposition of the tall skinny matrix A_S of the form

A_S = [Q1, Q2] [ R ; 0 ],

where Q = [Q1, Q2] is an M × M orthogonal matrix, Q1 contains the first K columns, and R is a K × K upper triangular matrix. Writing B ≜ Qᵀ A_{S^c} with blocks B1 ≜ Q1ᵀ A_{S^c} and B2 ≜ Q2ᵀ A_{S^c}, and letting Ỹ1 ≜ Q1ᵀ Y, Ỹ2 ≜ Q2ᵀ Y, and W̃1 ≜ Q1ᵀ W, we define

Z ≜ Ỹ1 − B1 E[ X_{S^c} | Ỹ2, B2 ]

to be the measurements Ỹ1 after subtracting the conditional expectation of the interference term. Rearranging terms, one finds that the relationship between Z and X_S can be expressed succinctly as

Z = R X_S + V,    V ∼ p(v | Ỹ2, B),    (7.63)
where

V ≜ B1 ( X_{S^c} − E[ X_{S^c} | Ỹ2, B2 ] ) + W̃1    (7.64)

is the error due to both the interference and the measurement noise.
Thus far, this decomposition is quite general in the sense that it can be applied for any
matrix A and subset S of size less than M. The key question at this point is whether the
error term V is approximately Gaussian.
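The decomposition above can be formed explicitly from a full QR factorization. The short sketch below does this for a random instance; the symbols B1, B2, Ỹ1, and Ỹ2 are the rotated quantities inferred from the derivation (assumptions of the sketch), and the conditional-expectation step that defines Z is omitted since it depends on the prior on X_{S^c}.

import numpy as np

rng = np.random.default_rng(0)
M, N, K = 200, 400, 5
A = rng.normal(0.0, 1.0 / np.sqrt(N), (M, N))
S, Sc = np.arange(K), np.arange(K, N)            # subset of interest and its complement
x, w = rng.normal(size=N), rng.normal(size=M)
y = A @ x + w

Q, R_full = np.linalg.qr(A[:, S], mode='complete')
Q1, Q2, R = Q[:, :K], Q[:, K:], R_full[:K, :]    # Q1: M x K, Q2: M x (M-K), R: K x K
y1, y2 = Q1.T @ y, Q2.T @ y                      # rotated measurements (Y~1 and Y~2)
B1, B2 = Q1.T @ A[:, Sc], Q2.T @ A[:, Sc]

# y1 carries the signal of interest plus interference and noise; y2 carries no signal term.
print(np.allclose(y1, R @ x[S] + B1 @ x[Sc] + Q1.T @ w))   # True
print(np.allclose(y2, B2 @ x[Sc] + Q2.T @ w))              # True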
References
[18] Y. C. Eldar and G. Kutyniok, Compressed sensing theory and applications. Cambridge
University Press, 2012.
[19] D. Guo and S. Verdú, “Randomly spread CDMA: Asymptotics via statistical physics,”
IEEE Trans. Information Theory, vol. 51, no. 6, pp. 1983–2010, 2005.
[20] G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing
with Gaussian matrices is exact,” in Proc. IEEE International Symposium on Information
Theory (ISIT), 2016, pp. 665–669.
[21] G. Reeves and H. D. Pfister, “The replica-symmetric prediction for compressed sensing
with Gaussian matrices is exact,” 2016, https://arxiv.org/abs/1607.02524.
[22] G. Reeves, “Understanding the MMSE of compressed sensing one measurement at a time,”
presented at the Institut Henri Poincaré Spring 2016 Thematic Program on the Nexus of
Information and Computation Theories, Paris, 2016, https://youtu.be/vmd8-CMv04I.
[23] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with
applications to compressed sensing,” IEEE Trans. Information Theory, vol. 57, no. 2, pp.
764–785, 2011.
[24] S. Rangan, “Generalized approximate message passing for estimation with random linear
mixing,” in Proc. IEEE International Symposium on Information Theory (ISIT), 2011, pp.
2174–2178.
[25] J. P. Vila and P. Schniter, “Expectation-maximization Gaussian-mixture approximate
message passing,” IEEE Trans. Signal Processing, vol. 61, no. 19, pp. 4658–4672, 2013.
[26] Y. Ma, J. Zhu, and D. Baron, “Compressed sensing via universal denoising and approxi-
mate message passing,” IEEE Trans. Signal Processing, vol. 64, no. 21, pp. 5611–5622,
2016.
[27] C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,”
IEEE Trans. Information Theory, vol. 62, no. 9, pp. 5117–5144, 2016.
[28] P. Schniter, S. Rangan, and A. K. Fletcher, “Vector approximate message passing for the
generalized linear model,” in Asilomar Conference on Signals, Systems and Computers,
2016.
[29] S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” in Proc.
IEEE International Symposium on Information Theory (ISIT), 2017, pp. 1588–1592.
[30] Y. Kabashima, “A CDMA multiuser detection algorithm on the basis of belief propaga-
tion,” J. Phys. A: Math. General, vol. 36, no. 43, pp. 11 111–11 121, 2003.
[31] M. Bayati, M. Lelarge, and A. Montanari, “Universality in polytope phase transitions and
iterative algorithms,” in IEEE International Symposium on Information Theory, 2012.
[32] G. Reeves, “Additivity of information in multilayer networks via additive Gaussian noise
transforms,” in Proc. Allerton Conference on Communication, Control, and Computing,
2017, https://arxiv.org/abs/1710.04580.
[33] B. Çakmak, O. Winther, and B. H. Fleury, “S-AMP: Approximate message passing for
general matrix ensembles,” 2014, http://arxiv.org/abs/1405.2767.
[34] A. Fletcher, M. Sahree-Ardakan, S. Rangan, and P. Schniter, “Expectation consistent
approximate inference: Generalizations and convergence,” in Proc. IEEE International
Symposium on Information Theory (ISIT), 2016.
[35] S. Rangan, P. Schniter, and A. K. Fletcher, “Vector approximate message passing,” 2016,
https://arxiv.org/abs/1610.03082.
[36] P. Schniter, S. Rangan, and A. K. Fletcher, “Vector approximate message passing for the
generalized linear model,” 2016, https://arxiv.org/abs/1612.01186.
[37] B. Çakmak, M. Opper, O. Winther, and B. H. Fleury, “Dynamical functional theory for
compressed sensing,” 2017, https://arxiv.org/abs/1705.04284.
[38] H. He, C.-K. Wen, and S. Jin, “Generalized expectation consistent signal recovery for
nonlinear measurements,” in Proc. IEEE International Symposium on Information Theory
(ISIT), 2017.
[39] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error in
Gaussian channels,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
[40] A. J. Stam, “Some inequalities satisfied by the quantities of information of Fisher and
Shannon,” Information and Control, vol. 2, no. 2, pp. 101–112, 1959.
[41] D. Guo, Y. Wu, S. Shamai, and S. Verdú, “Estimation in Gaussian noise: Properties of
the minimum mean-square error,” IEEE Trans. Information Theory, vol. 57, no. 4, pp.
2371–2385, 2011.
[42] G. Reeves, H. D. Pfister, and A. Dytso, “Mutual information as a function of matrix SNR
for linear Gaussian channels,” in Proc. IEEE International Symposium on Information
Theory (ISIT), 2018.
[43] K. Bhattad and K. R. Narayanan, “An MSE-based transfer chart for analyzing iterative
decoding schemes using a Gaussian approximation,” IEEE Trans. Information Theory,
vol. 58, no. 1, pp. 22–38, 2007.
[44] N. Merhav, D. Guo, and S. Shamai, “Statistical physics of signal estimation in Gaussian
noise: Theory and examples of phase transitions,” IEEE Trans. Information Theory, vol. 56,
no. 3, pp. 1400–1416, 2010.
[45] C. Méasson, A. Montanari, T. J. Richardson, and R. Urbanke, “The generalized area theo-
rem and some of its consequences,” IEEE Trans. Information Theory, vol. 55, no. 11, pp.
4793–4821, 2009.
[46] G. Reeves, “Conditional central limit theorems for Gaussian projections,” in Proc. IEEE
International Symposium on Information Theory (ISIT), 2017, pp. 3055–3059.
[47] G. Reeves, “Two-moment inequalities for Rényi entropy and mutual information,” in Proc.
IEEE International Symposium on Information Theory (ISIT), 2017, pp. 664–668.
[48] S. B. Korada and N. Macris, “Tight bounds on the capacity of binary input random CDMA
systems,” IEEE Trans. Information Theory, vol. 56, no. 11, pp. 5590–5613, 2010.
[49] A. Montanari and D. Tse, “Analysis of belief propagation for non-linear problems: The
example of CDMA (or: How to prove Tanaka’s formula),” in Proc. IEEE Information
Theory Workshop (ITW), 2006, pp. 160–164.
[50] W. Huleihel and N. Merhav, “Asymptotic MMSE analysis under sparse representation
modeling,” Signal Processing, vol. 131, pp. 320–332, 2017.
[51] G. Reeves and M. Gastpar, “The sampling rate–distortion trade-off for sparsity pattern
recovery in compressed sensing,” IEEE Trans. Information Theory, vol. 58, no. 5, pp.
3065–3092, 2012.
[52] G. Reeves and M. Gastpar, “Approximate sparsity pattern recovery: Information-theoretic
lower bounds,” IEEE Trans. Information Theory, vol. 59, no. 6, pp. 3451–3465, 2013.
[53] J. Barbier, M. Dia, N. Macris, and F. Krzakala, “The mutual information in random linear
estimation,” in Proc. Allerton Conference on Communication, Control, and Computing,
2016.
[54] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová, “Phase transitions,
optimal errors and optimality of message-passing in generalized linear models,” 2017,
https://arxiv.org/abs/1708.03395.
8 Computing Choice

Devavrat Shah

8.1 Background
the store. More precisely, such a “choice model” can be viewed as a black-box that spits
out the probability of purchase of a particular option when the customer is presented
with a collection of options.
A canonical fine-grained representation for such a “choice model” is the distribution
over permutations of all the possible options (including the no-purchase option). Then,
probability of purchasing a particular option when presented with a collection of options
is simply the probability that this particular option has the highest (relative) order or rank
amongst all the presented options (including the no-purchase option).
Therefore, one way to operationalize such a “choice model” is to learn the distribution
over permutations of all options that a store owner can stock in the store. Clearly, such a
distribution needs to be learned from the observations or data. The data available to the
store owner are the historical transactions as well as what was stocked in the store when
each transaction happened. Such data effectively provide a bag of pair-wise comparisons
between options: consumer exercises or purchases option A over option B corresponds
to a pair-wise comparison A > B or “A is preferred to B.”
In summary, to model consumer choice, we wish to learn the distribution over per-
mutations of all possible options using observations in terms of a collection of pair-wise
comparisons that are consistent with the learned distribution.
In the context of sports, we wish to go a step further, to obtain a ranking of sports
teams or players that is based on outcomes of games, which are simply pair-wise
comparisons (between teams or players). Similarly, for the purpose of data-driven
policy-making, we wish to aggregate people’s opinions about socio-economic issues
such as modes of transportation according to survey data; for designing online recom-
mendation systems that are based on historical online activity of individuals, we wish
to recommend the top few options; or to sort objects on the basis of noisy outcomes of
pair-wise comparisons.
knowing the “first-order” marginal distribution information that states the probability of
a given option being in a certain position in the permutation. Therefore, to track con-
sumers in the store, we wish to learn the distribution over consumer trajectories that is
consistent with this first-order marginal information over time.
In summary, the model to track trajectories of individuals boils down to continually
learning the distribution over permutations that is consistent with the first-order marginal
information and subsequently finding the most likely ranking or permutation from the
learned distribution. It is the very same question that arises in the context of aggregating
web-page rankings obtained through results of search from multiple search engines in a
computationally efficient manner.
on, simple statistical tests as well as simple estimation procedures were developed to fit
such a model to observed data [9]. Now the IIA property possessed by the MNL model
is not necessarily desirable as evidenced in many empirical scenarios. Despite such
structural limitations, the MNL model has been widely utilized across application areas
primarily due to the ability to learn the model parameters easily from observed data. For
example, see [12–14] for applications in transportation and [15, 16] for applications in
operations management and marketing.
With a view to addressing the structural limitations of the MNL model, a number of
generalizations to this model have been proposed over the years. Notable among these
are the so-called “nested” MNL model, as well as mixtures of MNL models (or MMNL
models). These generalizations avoid the IIA property and continue to be consistent with
the random utility maximization framework at the expense of increased model complex-
ity; see [13, 17–20] for example. The interested reader is also referred to an overview
article on this line of research [14]. While generalized models of this sort are in prin-
ciple attractive, their complexity makes them difficult to learn while avoiding the risk
of overfitting. More generally, specifying an appropriate parametric model is a difficult
task, and the risks associated with mis-specification are costly in practice. For an applied
view of these issues, see [10, 21, 22].
As an alternative to the MNL model (and its extensions), one might also consider
the parametric family of choice models induced by the exponential family of distri-
butions over permutations. These may be viewed as the models that have maximum
entropy among those models that satisfy the constraints imposed by the observed data.
The number of parameters in such a model is equal to the number of constraints in the
maximum-entropy optimization formulation, or equivalently the effective dimension of
the underlying data, see the Koopman–Pitman–Darmois theorem [23]. This scaling of
the number of parameters with the effective data dimension makes the exponential fam-
ily obtained via the maximum-entropy principle very attractive. Philosophically, this
approach imposes on the model only those constraints implied by the observed data.
On the flip side, learning the parameters of an exponential family model is a computa-
tionally challenging task (see [24–26]) as it requires computing a “partition function,”
possibly over a complex state space.
Very recently, Jagabathula and Shah [27, 28] introduced a non-parametric sparse
model. Here the distribution over permutations is assumed to have sparse (or small)
support. While this may not be exactly true, it can be an excellent approximation to the
reality and can provide computationally efficient ways to both infer the model [27, 28]
in a manner consistent with observations and utilize it for effective decision-making
[29, 30].
8.2 Setup
Given N objects or items denoted as [N] = {1, . . . , N}, we are interested in the distribution
over permutations of these N items. A permutation σ : [N] → [N] is a one-to-one and onto mapping, with σ(i) denoting the position or ordering of element i ∈ [N].
Let S_N denote the space of N! permutations of these N items. The set of distributions over S_N is denoted as M(S_N) = {ν : S_N → [0, 1] : Σ_{σ∈S_N} ν(σ) = 1}. Given ν ∈ M(S_N), the first-order marginal information, M(ν) = [Mij(ν)], is an N × N doubly stochastic matrix with non-negative entries defined as

Mij(ν) = Σ_{σ∈S_N} ν(σ) 1{σ(i) = j},    (8.1)

where, for σ ∈ S_N, σ(i) denotes the rank of item i under permutation σ, and 1{x} is the standard indicator with 1{true} = 1 and 1{false} = 0. The comparison marginal information, C(ν) = [Cij(ν)], is an N × N matrix with non-negative entries defined as

Cij(ν) = Σ_{σ∈S_N} ν(σ) 1{σ(i) > σ(j)}.    (8.2)

By definition, the diagonal entries of C(ν) are all 0s, and Cij(ν) + Cji(ν) = 1 for all 1 ≤ i ≠ j ≤ N. We shall abuse the notation by using M(σ) and C(σ) to denote the matrices obtained by applying them to the distribution where the support of the distribution is simply {σ}.
Throughout, we assume that there is a ground-truth model ν. We observe marginal
information M(ν) or C(ν), or their noisy versions.
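The two marginal operators are easy to compute directly for small N. The following sketch builds M(ν) and C(ν) from (8.1) and (8.2) for a toy sparse distribution over permutations of N = 4 items; positions are 0-indexed, and the particular support and probabilities are arbitrary choices.

import numpy as np

N = 4
# Toy sparse distribution: sigma[i] is the (0-indexed) rank of item i, as in sigma(i).
support = {(0, 1, 2, 3): 0.7,    # identity permutation
           (3, 2, 1, 0): 0.3}    # fully reversed permutation

M = np.zeros((N, N))
C = np.zeros((N, N))
for sigma, prob in support.items():
    for i in range(N):
        M[i, sigma[i]] += prob                     # first-order marginal, cf. (8.1)
        for j in range(N):
            if i != j and sigma[i] > sigma[j]:
                C[i, j] += prob                    # comparison marginal, cf. (8.2)

print(M)                  # doubly stochastic: rows and columns sum to 1
print(C + C.T)            # zero diagonal, off-diagonal entries sum to 1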
in a sense, is ill-defined due to the impossibility result of Arrow [31]: there is no ranking
algorithm that works for all ν and satisfies certain basic hypotheses expected from any
ranking algorithm even when N = 3.
For this reason, like in the context of recovering ν, we will have to impose structure
on ν. In particular, the structure that we shall impose (e.g., the sparse model or the
multinomial logit model) seems to suggest a natural answer for ranking or the most
“relevant” permutation: find the σ that has maximal probability, i.e., find σ∗(ν), where

σ∗(ν) ∈ arg max_{σ ∈ S_N} ν(σ).    (8.3)
Again, the goals would include the ability to recover σ∗ (ν) (exactly or approximately)
using observations (a) when η = 0 and (b) when there is a non-trivial η, and (c) the ability
to do this in a computationally efficient manner.
8.3 Models
We shall consider two types of model here: the non-parametric sparse model and the
parametric random utility model. As mentioned earlier, a large number of models have
been studied in the literature and are not discussed in detail here.
The non-parametric sparse model posits that ν has a small support,

supp(ν) = {σ ∈ S_N : ν(σ) > 0},    (8.4)

where the sparsity of ν is measured by its support size,

‖ν‖₀ = |supp(ν)|.    (8.5)

Given the observed marginal information, a natural approach is to seek the sparsest distribution that is consistent with it, i.e., to minimize ‖μ‖₀ over μ ∈ M(S_N) subject to matching the observations, where P(μ) ∈ {M(μ), C(μ)} depending upon the type of information considered.
In the random utility model, each option i ∈ [N] has an inherent utility ui, and the random utility associated with it is given by

Yi = ui + εi,    (8.7)
where εi are independent random variables across all i ∈ [N] – they represent “random
perturbation” of the “inherent utility” ui . We assume that all of the εi have identical mean
across all i ∈ [N], but can have varying distribution. The specific form of the distribution
gives rise to different types of models. We shall describe a few popular examples of this
in what follows. Before we do that, we explain how this setup gives rise to the distribu-
tion over permutations by describing the generative form of the distribution. Specifically,
to generate a random permutation over the N options, we first sample random variable
Yi , i ∈ [N], independently. Then we sort Y1 , . . . , YN in decreasing order,1 and this sorted
order of indices [N] provides the permutation. Now we describe two popular examples
of this model.
Probit model. Let εi have a Gaussian distribution with mean 0 and variance σ2i for i ∈
[N]. Then, the resulting model is known as the Probit model. In the homogeneous setting,
we shall assume that σ2i = σ2 for all i ∈ [N].
Multinomial logit (MNL) model. Let εi have a Gumbel distribution with mode μi and
scaling parameter βi > 0, i.e., the PDF of εi is given by
1 x − μi
f (x) = exp(−(z + exp(−z))), where z = , for x ∈ R. (8.8)
βi βi
In the homogeneous setting, μi = μ and βi = β for all i ∈ [N]. In this scenario, the resulting
distribution over permutations turns out to be equivalent to the following generative
model.
Let wi > 0 be a parameter associated with i ∈ [N]. Then the probability of permutation
σ ∈ S N is given by (for example, see [32])
P(σ) = Π_{j=1}^{N}  w_{σ⁻¹(j)} / ( w_{σ⁻¹(j)} + w_{σ⁻¹(j+1)} + · · · + w_{σ⁻¹(N)} ).    (8.9)
1 We shall assume that the distribution of εi , i ∈ [N], has a density and hence ties never happen between
Yi , Y j for any i j ∈ [N].
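The equivalence between the Gumbel-perturbation description and the sequential form (8.9) can be checked by simulation. In the sketch below, sample_order draws Yi = ui + εi with i.i.d. Gumbel noise and sorts in decreasing order, and mnl_probability evaluates (8.9) with wi = exp(ui/β); the utilities, sample size, and function names are illustrative.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
u, beta = np.array([1.0, 0.5, 0.0]), 1.0
w = np.exp(u / beta)

def mnl_probability(order, w):
    # P(sigma) from (8.9); `order` lists the items from best to worst.
    remaining = w[list(order)]
    return np.prod([remaining[j] / remaining[j:].sum() for j in range(len(order))])

def sample_order(u, beta, rng):
    y = u + rng.gumbel(loc=0.0, scale=beta, size=len(u))
    return tuple(np.argsort(-y))                  # items sorted by decreasing random utility

n_samples = 200_000
counts = Counter(sample_order(u, beta, rng) for _ in range(n_samples))
for order in sorted(counts):
    print(order, round(counts[order] / n_samples, 4), round(mnl_probability(order, w), 4))

The empirical frequencies should match the probabilities computed from (8.9) up to sampling error.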
lemma 8.1 Let εi , ε j be independent random variables with Gumbel distributions with
mode μi , μ j , respectively, with scaling parameters βi = β j = β > 0. Then, Δi j = εi − ε j has
a logistic distribution with parameters μi − μ j (location) and β (scale).
The proof of Lemma 8.1 follows by, for example, using the characteristic function
associated with the Gumbel distribution along with the property of the gamma function
(Γ(1 + z)Γ(1 − z) = zπ/ sin(πz)) and then identifying the characteristic function of the
logistic distribution.
Returning to our model, when we compare the random utilities associated with
options i and j, Yi and Y j , respectively, we assume the corresponding random pertur-
bation to be homogeneous, i.e., μi = μ j = μ and βi = β j = β > 0. Therefore, Lemma 8.1
suggests that
P[Yi > Yj] = P[εi − εj > uj − ui]
           = P[Logistic(0, β) > uj − ui]
           = 1 − P[Logistic(0, β) < uj − ui]
           = 1 − 1 / ( 1 + exp(−(uj − ui)/β) )
           = exp(ui/β) / ( exp(ui/β) + exp(uj/β) )
           = wi / (wi + wj),    (8.11)

where wi = exp(ui/β) and wj = exp(uj/β). It is worth remarking that the property in (8.11) relates the MNL model to the Bradley–Terry model.
Learning the model and ranking. For the random utility model, the question of learn-
ing the model from data effectively boils down to learning the model parameters from
observations. In the context of a homogeneous model, i.e., εi in (8.7), for which we have
an identical distribution across all i ∈ [N], the primary interest is in learning the inherent
utility parameters ui , for i ∈ [N]. The question of recovering the ranking, on the other
hand, is about recovering σ ∈ S N , which is the sorted (decreasing) order of the inherent
utilities ui , i ∈ [N]: for example, if u1 ≥ u2 ≥ · · · ≥ uN , then the ranking is the identity
permutation.
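As one concrete way to carry out this parameter learning from pair-wise comparisons, the sketch below fits the weights wi in (8.11) by the classical minorization–maximization (Zermelo-type) update for the Bradley–Terry likelihood; the data are simulated, and the estimation method is a standard choice assumed for the sketch rather than the procedure prescribed by the references above.

import numpy as np

rng = np.random.default_rng(0)
N, n_per_pair = 5, 500
w_true = np.exp(rng.normal(size=N))               # ground-truth weights (illustrative)

# Simulate comparisons: i beats j with probability w_i / (w_i + w_j), as in (8.11).
wins = np.zeros((N, N))                           # wins[i, j] = number of times i beat j
for i in range(N):
    for j in range(i + 1, N):
        k = rng.binomial(n_per_pair, w_true[i] / (w_true[i] + w_true[j]))
        wins[i, j], wins[j, i] = k, n_per_pair - k
n_games = wins + wins.T

w = np.ones(N)
for _ in range(200):                              # MM iterations for the maximum-likelihood fit
    denom = np.array([sum(n_games[i, j] / (w[i] + w[j]) for j in range(N) if j != i)
                      for i in range(N)])
    w = wins.sum(axis=1) / denom
    w /= w.sum()                                  # weights are identifiable only up to scale

print("estimated ranking:", np.argsort(-w), "  true ranking:", np.argsort(-w_true))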
In this section, we describe the conditions under which we can learn the underlying
sparse distribution using first-order marginal and comparison marginal information. We
divide the presentation into two parts: first, we consider access to exact or noise-less
marginals for exact recovery; and, then, we discuss its robustness.
We can recover a ranking in terms of the most likely permutation once we have
recovered the sparse model by simply sorting the likelihoods of the permutations in the
support of the distribution, which requires time O(K log K), where K is the sparsity of
the model. Therefore, the key question in the context of the sparse model is the recovery
of the distribution, on which we shall focus in the remainder of this section.
Now suppose that ν(σi ) = pi , where pi ∈ [0, 1] for 1 ≤ i ≤ 3, and ν(σ) = 0 for all other
σ ∈ S N . Without loss of generality, let p1 ≤ p2 . Then
p1 M(σ1 ) + p2 M(σ2 ) + p3 M(σ3 ) = (p2 − p1 )M(σ2 ) + (p3 + p1 )M(σ3 ) + p1 M(σ4 ).
Here, note that {M(σ1 ), M(σ2 ), M(σ3 )} are linearly independent, yet the sparsest
solution is not unique. Therefore, it is not feasible to recover the sparse model uniquely.
Note that the above example can be extended to any N ≥ 4 by letting each permutation act as the identity on all elements larger than 4. Therefore, for any N, there exist distributions with support size 3 that cannot be recovered uniquely.
Signature condition for recovery. Example 8.1 suggests that it is not feasible to expect an
RIP-like condition for the “projection matrix” corresponding to the first-order marginals
or comparison marginals so that any sparse probability distribution can be recovered.
The next best thing we can hope for is the ability to recover almost all of the sparse
probability distributions. This leads us to the signature condition of the matrix for a
given sparse vector, which, as we shall see, allows recovery of the particular sparse
vector [27, 28].
condition 8.1 (Signature condition) A given matrix A ∈ R^{m×n} is said to satisfy a signature condition with respect to an index set S ⊂ {1, . . . , n} if, for each i ∈ S, there exists j(i) ∈ [m] such that A_{j(i) i} ≠ 0 and A_{j(i) i′} = 0 for all i′ ∈ S with i′ ≠ i.
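For a given 0/1 matrix and candidate support, Condition 8.1 can be verified directly by inspecting rows. The helper below does exactly that; the function name and the small example matrix are illustrative.

import numpy as np

def satisfies_signature(A, S):
    # Condition 8.1: every i in S needs a row j(i) whose only non-zero entry
    # among the columns indexed by S is in column i.
    A_S = np.asarray(A)[:, list(S)] != 0
    row_support = A_S.sum(axis=1)
    return all(np.any(A_S[:, k] & (row_support == 1)) for k in range(len(S)))

A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
print(satisfies_signature(A, [0, 1]))      # True: rows 0 and 1 isolate columns 0 and 1
print(satisfies_signature(A, [0, 1, 2]))   # False: every row touches two of the columns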
The signature condition allows recovery of sparse vector using a simple “peeling”
algorithm. We summarize the recovery result followed by an algorithm that will imply
the result.
theorem 8.1 Let A ∈ {0, 1}^{m×n} with all of its columns being distinct. Let x ∈ R^n_{≥0} be such that A satisfies a signature condition with respect to the set supp(x). Let the non-zero components of x, i.e., {xi : i ∈ supp(x)}, be such that, for any two distinct S1, S2 ⊂ supp(x), Σ_{i∈S1} xi ≠ Σ_{i′∈S2} xi′. Then x can be recovered from y, where y = Ax.
Proof To establish Theorem 8.1, we shall describe the algorithm that recovers x under the conditions of the theorem and simultaneously argue for its correctness. To that end, the algorithm starts by sorting the components of y. Since A ∈ {0, 1}^{m×n}, for each j ∈ [m], y_j = Σ_{i∈S_j} x_i with S_j ⊆ supp(x). Owing to the signature condition, for each i ∈ supp(x), there exists j(i) ∈ [m] such that y_{j(i)} = x_i. If we can identify j(i) for each i, we recover the values x_i, but not necessarily the position i. To identify the position, we identify the ith column of matrix A, and, since the columns of matrix A are all distinct, this will help us identify the position. This will require use of the property that, for any S1 ≠ S2 ⊂ supp(x), Σ_{i∈S1} x_i ≠ Σ_{i′∈S2} x_{i′}. This implies, to begin with, that all of the non-zero elements of x are distinct. Without loss of generality, let the non-zero elements of x be x_1, . . . , x_K, with K = |supp(x)|, such that 0 < x_1 < · · · < x_K.

Now consider the smallest non-zero element of y. Let it be y_{j1}. From the property of x, it follows that y_{j1} must be the smallest non-zero element of x, namely x_1. The second smallest component of y that is distinct from x_1, call it y_{j2}, must be x_2. The third distinct smallest component, y_{j3}, however, could be x_1 + x_2 or x_3. Since we know x_1, x_2, and
Then ν can be recovered from its first-order marginal distribution with probability 1 − o(1) as long as K ≤ (1 − ε) N log N for a fixed ε > 0.

The proof of Theorem 8.2 can be found in [27, 28]. In a nutshell, it states that almost all sparse distributions over permutations with sparsity up to N log N can be recovered from their first-order marginals. This is in sharp contrast with the counterexample, Example 8.1, which shows that, for any N, a distribution with sparsity 3 cannot always be recovered uniquely.
Recovering the distribution using comparison marginals via the signature condition.
Next, we utilize the signature condition, Condition 8.1, in the context of recovering the
distribution over permutations from its comparison marginals. Let Ac ∈ {0, 1}^{N(N−1)×N!} denote the comparison marginal matrix that maps the N!-dimensional vector corresponding to the distribution over permutations to an N(N − 1)-dimensional vector corresponding to the comparison marginals of the distribution. We state the signature property of
Ac next.
lemma 8.3 Let S be a randomly chosen subset of {1, . . . , N!} of size K. Then the com-
parison marginal matrix Ac satisfies the signature condition with respect to S with
probability 1 − o(1) as long as K = o(log N).
The proof of the above lemma can be found in [29, 30]. Lemma 8.3 and Theorem 8.1
immediately imply the following result.
theorem 8.3 Let S ⊂ S N be a randomly chosen subset of S N of size K, denoted as S =
{σ1 , . . . , σK }. Let p1 , . . . , pK be chosen from a joint distribution with a continuous density
over the subspace of [0, 1]K corresponding to p1 + · · · + pK = 1. Let ν be a distribution
over S N such that
ν(σ) = pk if σ = σk for some k ∈ [K], and ν(σ) = 0 otherwise.    (8.13)
Then ν can be recovered from its comparison marginal distribution with probability
1 − o(1) as long as K = o(log N).
The proof of Theorem 8.3 can be found in [29, 30]. It shows that it is feasible to recover a sparse model whose support size grows with N, as long as it is o(log N). However, this is exponentially smaller than the support size that is recoverable from the first-order marginals. This seems to be related to the fact that the first-order marginal is relatively information-rich compared with the comparison marginal.
Suppose now that we observe a noisy version of the marginals, D = P(ν) + η, where P(ν) ∈ {M(ν), C(ν)} depending upon the type of marginal information, with η being noise such that some norm of η, e.g., ‖η‖₂ or ‖η‖∞, is bounded above by δ, with δ > 0 being small if we have access to enough samples. Here δ represents the error observed due to access to finitely many samples and is assumed to be known.
For example, if we have access to n independent samples for each marginal entry (e.g.,
i ranked in position j for a first-order marginal or i found to be better than j for a com-
parison marginal) according to ν, and we create an empirical estimation of each entry
in M(ν) or C(ν), then, using the Chernoff bound for the binomial distribution and the
union bound over a collection of events, it can be argued that ‖η‖_∞ ≤ δ with probability
1 − δ as long as n ∼ (1/δ^2) log(4N/δ) for each entry. Using more sophisticated methods
from the matrix-estimation literature, it is feasible to obtain better estimations of M(ν)
or C(ν) from fewer samples of entries and even when some of the entries are entirely
unobserved as long as M(ν) or C(ν) has structure. This is beyond the scope of this expo-
sition; however, we refer the interested reader to [39–42] as well as the discussion in
Section 8.6.4.
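To make the estimation step concrete, the sketch below (the function name and interface are ours, not from the text) simulates the empirical estimation of comparison-marginal entries from n independent samples per pair; the Chernoff-plus-union-bound argument above then controls ‖Ĉ − C‖_∞.

```python
import numpy as np

def empirical_comparison_marginals(C, n, rng=None):
    """Estimate each comparison-marginal entry from n independent samples.

    C[i, j] is the true probability that i beats j; the estimate is the empirical
    win frequency, and C_hat[j, i] is set to 1 - C_hat[i, j] (same samples).
    Per the Chernoff/union-bound argument, n ~ (1/delta^2) log(N/delta) samples
    per entry suffice for ||C_hat - C||_inf <= delta with high probability."""
    rng = np.random.default_rng() if rng is None else rng
    N = C.shape[0]
    C_hat = np.zeros_like(C, dtype=float)
    for i in range(N):
        for j in range(i + 1, N):
            C_hat[i, j] = rng.binomial(n, C[i, j]) / n
            C_hat[j, i] = 1.0 - C_hat[i, j]
    return C_hat
```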
Given this, the goal is to recover a sparse distribution whose marginals are close to the
observations. More precisely, we wish to find a distribution ν̂ such that ‖P(ν̂) − D‖_2 ≤ f(δ)
and ‖ν̂‖_0 is small. Here, ideally we would like f(δ) = δ, but we may settle for any f such
that f(δ) → 0 as δ → 0.
Following the line of reasoning in Section 8.4.1, we shall assume that there is a sparse
model ν_s with respect to which the marginal matrix satisfies the signature condition,
Condition 8.1, and ‖P(ν_s) − D‖_2 ≤ δ. The goal would be to produce an estimate ν̂ so that
‖ν̂‖_0 = ‖ν_s‖_0 and ‖P(ν̂) − D‖_2 ≤ f(δ).
This is the exact analog of the robust recovery of a sparse signal in the context of
compressed sensing where the RIP-like condition allowed recovery of a sparse approx-
imation to the original signal from linear projections through linear optimization. The
computational complexity of such an algorithm scales at least linearly in n, the ambient
dimension of the signal. As discussed earlier, in our context this would lead to the com-
putation cost scaling as N!, which is prohibitive. The exact recovery algorithm discussed
in Section 8.4.1 has computation cost O(2^K + N^2 log N) in the context of recovering a
sparse model satisfying the signature condition. The brute-force search for the sparse
model will lead to cost at least \binom{N!}{K} ≈ (N!)^K, or exp(Θ(KN log N)), for K ≪ N!. The question
is whether it is possible to get rid of the dependence on N!, and ideally to achieve a
scaling of O(2^K + N^2 log N), as in the case of exact model recovery.
In what follows, we describe the conditions on noise under which the algorithm
described in Section 8.4.1 is robust. This requires an assumption that the underlying
ground-truth distribution is sparse and satisfies the signature condition. This recovery
result requires noise to be small. How to achieve such a recovery in a higher-noise regime
remains broadly unknown; initial progress toward it has been made in [43].
Robust recovery under the signature condition: low-noise regime. Recall that the “peel-
ing” algorithm recovers the sparse model when the signature condition is satisfied using
exact marginals. Here, we discuss the robustness of the “peeling” algorithm under noise.
Specifically, we argue that the “peeling” algorithm as described is robust as long as the
noise is “low.” We formalize this in the statement below.
theorem 8.4 Let A ∈ {0, 1}^{m×n}, with all of its columns being distinct. Let x ∈ R^n_{≥0} be such
that A satisfies the signature condition with respect to the set supp(x). Let the non-zero
components of x, i.e., {x_i : i ∈ supp(x)}, be such that, for any S_1 ≠ S_2 ⊂ supp(x),

\Bigl| \sum_{i \in S_1} x_i - \sum_{i \in S_2} x_i \Bigr| > 2\delta K, \qquad (8.14)

for some δ > 0. Then, given y = Ax + η with ‖η‖_∞ < δ, it is feasible to find x̂ so that
‖x̂ − x‖_∞ ≤ δ.
Proof To establish Theorem 8.4, we shall utilize effectively the same algorithm as in the
proof of Theorem 8.1. However, we will have to deal with the “error” in the measurement y delicately.
To begin with, according to the arguments in the proof of Theorem 8.1, it follows that all
non-zero elements of x are distinct. Without loss of generality, let the non-zero elements
of x be x_1, . . . , x_K, with K = |supp(x)| ≥ 2, such that 0 < x_1 < · · · < x_K; x_i = 0 for K + 1 ≤
i ≤ n. From (8.14), it follows that x_{i+1} ≥ x_i + 4δ for 1 ≤ i < K and x_1 ≥ 2δ. Therefore,

≥ y_j, \qquad (8.17)

for any j ∈ J^{(1)} ⊂ J_1. But this is a contradiction, since y_{j_1} ≤ y_j for all j ∈ J. That is,
S = {1}, i.e., y_{j_1} = x_1 + η_{j_1}. Thus, we have found x̂_1 = y_{j_1} such that |x̂_1 − x_1| < δ.
Now, for any j ∈ J_1 with y_j = \sum_{i \in S} x_i + η_j and S ∩ {2, . . . , K} ≠ ∅, we have, with the
notation x(S) = \sum_{i \in S} x_i,

|x̂_1 − y_j| = |x_1 − y_j + x̂_1 − x_1|
            = |x_1 − x(S) − η_j + x̂_1 − x_1|
J_{k+2} ← J_{k+1} \ { j ∈ J_{k+1} : |y_j − x̂(S)| ≤ (|S| + 1)δ for some S ⊂ {1, . . . , k + 1} },
it follows that
This completes the induction step. It also establishes the desired result that we can
recover x̂ such that ‖x̂ − x‖_∞ ≤ δ.
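To make the value-recovery step of the “peeling” algorithm concrete, here is a minimal sketch (the function name is ours, and it is a simplification of the procedure analyzed in the proof, not a verbatim transcription): measurements are scanned in increasing order, and a measurement is accepted as the next non-zero value of x unless it is explained, within the tolerance (|S| + 1)δ, as a subset sum of values already recovered. Positions can then be recovered by matching, for each value, the set of measurements it participates in against the distinct columns of A.

```python
import numpy as np
from itertools import combinations

def peeling_values(y, delta=0.0):
    """Recover the distinct non-zero values x_1 < ... < x_K of x from measurements
    y_j = sum_{i in S_j} x_i + eta_j, assuming the signature condition, the subset-sum
    separation (8.14), and ||eta||_inf < delta (delta = 0 is the noise-free case)."""
    recovered = []                                    # values found so far, smallest first
    for v in sorted(v for v in y if v > 2 * delta):   # ignore (near-)zero measurements
        # discard v if it is (within tolerance) a subset sum of values already found
        explained = any(
            abs(v - sum(combo)) <= (len(combo) + 1) * delta
            for r in range(1, len(recovered) + 1)
            for combo in combinations(recovered, r)
        )
        if not explained:
            recovered.append(v)   # the smallest unexplained measurement is the next value
    return recovered

# Toy check with a hypothetical signature matrix A and sparse x.
A = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]])
x = np.array([0.1, 0.25, 0.6])
print(peeling_values(A @ x, delta=1e-9))   # -> [0.1, 0.25, 0.6]
```

The subset-sum enumeration over the recovered values echoes the O(2^K) term in the computation cost quoted earlier.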
Naturally, as before, Theorem 8.4 implies robust versions of Theorems 8.2 and 8.3.
In particular, if we form an empirical estimation of M(ν) or C(ν) based on
independently drawn samples, then a simple application of the Chernoff bound along with
a union bound will imply that it may be sufficient to have samples that scale as δ^{-2} log N
in order to obtain M̂(ν) or Ĉ(ν) so that ‖M̂(ν) − M(ν)‖_∞ < δ or ‖Ĉ(ν) − C(ν)‖_∞ < δ with high
probability (i.e., 1 − o_N(1)). Then, as long as ν satisfies condition (8.14) in addition to the
signature condition, Theorem 8.4 guarantees approximate recovery as discussed above.
Robust recovery under the signature condition: high-noise regime. Theorem 8.4
provides conditions under which the “peeling” algorithm manages to recover the
distribution as long as the elements in the support are far enough apart. Putting it
another way, for a given x, the error tolerance needs to be small enough compared with
the gap that is implicitly defined by (8.14) for recovery to be feasible.
Here, we make an attempt to go beyond such restrictions. In particular, if we can view
the observations as a noisy version of a model that satisfies the signature condition,
then we will be content with recovering any model that satisfies the signature condition
and is consistent with the observations (up to noise). For this, we shall assume knowledge of the sparsity K.
assume knowledge of the sparsity K.
Now, we need to learn supp(x), i.e., the positions of x that are non-zero and the non-zero
values at those positions. The determination of supp(x) corresponds to selecting the
columns of A. Now, if A satisfies the signature condition with respect to supp(x), then
we can simply choose the entries in the positions of y corresponding to the signature
components. If the choice of supp(x) is correct, then this will provide an estimate x̂
so that ‖x̂ − x‖_2 ≤ δ. In general, if we assume that there exists x such that A satisfies
the signature condition with respect to supp(x) with K = ‖x‖_0 and ‖y − Ax‖_2 ≤ δ, then
a suitable approach would be to find x̂ such that ‖x̂‖_0 = K, A satisfies the signature
condition with respect to supp(x̂), and x̂ minimizes ‖y − Ax̂‖_2.
In summary, we are solving a combinatorial optimization problem over the space
of columns of A that collectively satisfy the signature condition. Formally, the space
of subsets of columns of A of size K can be encoded through a binary-valued matrix
Z ∈ {0, 1}^{m×m} as follows: all but K columns of Z are zero, and the non-zero columns
of Z are distinct columns of A collectively satisfying the signature condition. More
precisely, for any 1 ≤ i1 < i2 < · · · < iK ≤ m, representing the signature columns, the
variable Z should satisfy
Z_{i_j i_j} = 1, for 1 ≤ j ≤ K,  (8.20)
Z_{i_j i_k} = 0, for 1 ≤ j ≠ k ≤ K,  (8.21)
[Z_{a i_j}]_{a∈[m]} ∈ col(A), for 1 ≤ j ≤ K,  (8.22)
Z_{ab} = 0, for a ∈ [m], b ∉ {i_1, . . . , i_K}.  (8.23)
In the above, col(A) = {[A_{ij}]_{i∈[m]} : 1 ≤ j ≤ n} represents the set of columns of matrix A.
Then the optimization problem of interest is

minimize ‖y − Zy‖_2 over Z ∈ {0, 1}^{m×m}
such that Z satisfies constraints (8.20)–(8.23).  (8.24)
The constraint set (8.20)–(8.23) can be viewed as a disjoint union of \binom{m}{K} sets, each one
corresponding to a choice of 1 ≤ i_1 < · · · < i_K ≤ m. For each such choice, we can solve
the optimization (8.24) and choose the best solution across all of them. That is, the
computation cost is O(m^K) times the cost of solving the optimization problem (8.24).
The complexity of solving (8.24) fundamentally depends on the constraint (8.22) – it
captures the structural complexity of describing the column set of matrix A.
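For very small instances, the underlying combinatorial problem can be made concrete by direct enumeration. The sketch below is our own illustration, not the Z-matrix program above: it enumerates size-K column subsets of A, keeps those satisfying the signature condition, reads candidate values off the signature rows, and returns the subset with the smallest residual. It is of course infeasible when the number of columns is of order N!.

```python
import numpy as np
from itertools import combinations

def signature_subset_search(A, y, K):
    """Brute-force search over size-K column subsets of A that satisfy the
    signature condition; values are read off the signature rows and the subset
    minimizing ||y - A_S x_S||_2 is returned."""
    m, n = A.shape
    best_resid, best_support, best_vals = np.inf, None, None
    for S in combinations(range(n), K):
        sub = A[:, S]
        # a signature row for column k is a row where it is the only selected 1
        sig_rows = [np.flatnonzero((sub[:, k] == 1) & (sub.sum(axis=1) == 1))
                    for k in range(K)]
        if any(rows.size == 0 for rows in sig_rows):
            continue                                       # signature condition fails
        x_S = np.array([y[rows[0]] for rows in sig_rows])  # values at signature rows
        resid = np.linalg.norm(y - sub @ x_S)
        if resid < best_resid:
            best_resid, best_support, best_vals = resid, S, x_S
    return best_support, best_vals, best_resid
```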
A natural convex relaxation of the optimization problem (8.24) involves replacing
(8.22) and Z ∈ {0, 1}m×m by
[Z_{a i_j}]_{a∈[m]} ∈ convex-hull(col(A)), for 1 ≤ j ≤ K;  Z ∈ [0, 1]^{m×m}.  (8.25)

In the above, for any set S,

convex-hull(S) ≡ \Bigl\{ \sum_{\ell=1}^{Q} a_\ell x_\ell : a_\ell ≥ 0,\ x_\ell ∈ S \text{ for } \ell ∈ [Q],\ \sum_{\ell=1}^{Q} a_\ell = 1,\ Q ≥ 2 \Bigr\}.
In the best case, it may be feasible to solve the optimization with the convex relaxation
efficiently. However, the relaxation may not yield a solution that is achieved at the
extreme points of convex-hull(col(A)), which is what we desire. This is due to the fact
that the objective we are considering, the ℓ_2-norm of the error, is strictly convex. To overcome
this challenge, we can replace the ℓ_2-norm by the ℓ_∞-norm. The constraints of interest are, for a given ε > 0,
Z_{i_j i_j} = 1, for 1 ≤ j ≤ K,  (8.26)
Z_{i_j i_k} = 0, for 1 ≤ j ≠ k ≤ K,  (8.27)
y_i − (Zy)_i ≤ ε, for 1 ≤ i ≤ m,  (8.28)
y_i − (Zy)_i ≥ −ε, for 1 ≤ i ≤ m,  (8.29)
[Z_{a i_j}]_{a∈[m]} ∈ convex-hull(col(A)), for 1 ≤ j ≤ K,  (8.30)
Z_{ab} = 0, for a ∈ [m], b ∉ {i_1, . . . , i_K}.  (8.31)
This results in the linear program

minimize \sum_{i,j=1}^{m} ζ_{ij} Z_{ij} over Z ∈ [0, 1]^{m×m}
for some L = N^δ with δ ∈ (0, 1). Then there exists ν̂ such that |supp(ν̂)| = O(N/ε^2),
ν̂ satisfies the signature condition with respect to the first-order marginals, and
‖M(ν̂) − M(ν)‖_2 ≤ ε.
We discuss recovery of an exact model for the MNL model and recovery of the ranking
for a generic random utility model with homogeneous random perturbation.
Recovering MNL: first-order marginals. Without loss of generality, let us assume that
the parameters w_1, . . . , w_N are normalized so that \sum_i w_i = 1. Then, under the MNL model,
according to (8.9),

P(σ(i) = 1) = w_i.  (8.34)
That is, the first column of the first-order marginal matrix M(ν) = [Mi j (ν)] precisely
provides the parameters of the MNL model!
Recovering the MNL model: comparison marginals. Under the MNL model, according
to (8.11), for any i ≠ j ∈ [N],

P(σ(i) > σ(j)) = \frac{w_i}{w_i + w_j}.  (8.35)

The comparison marginals C(ν) provide access to P(σ(i) > σ(j)) for all i ≠ j ∈ [N].
Using these, we wish to recover the parameters w_1, . . . , w_N.
Next, we describe a reversible Markov chain over N states whose stationary distribution
is precisely given by the parameters of interest, and whose transition kernel is built from C(ν).
This alternative representation provides an intuitive algorithm for recovering the MNL
parameters, known more generally as rank centrality [44, 45].
To that end, the Markov chain of interest has N states. The transition kernel or
transition probability matrix Q = [Q_{ij}] ∈ [0, 1]^{N×N} of the Markov chain is defined using the
comparison marginals C = C(ν) as follows:

Q_{ij} = \begin{cases} C_{ji}/2N & \text{if } i ≠ j, \\ 1 - \sum_{k ≠ i} C_{ki}/2N & \text{if } i = j. \end{cases} \qquad (8.36)
The Markov chain has a unique stationary distribution because (a) Q is irreducible,
since C_{ij}, C_{ji} > 0 for all i ≠ j as long as w_i > 0 for all i ∈ [N], and (b) Q_{ii} > 0 by definition
for all i ∈ [N] and hence it is aperiodic. Further, w = [w_i]_{i∈[N]} ∈ [0, 1]^N is a stationary
distribution since it satisfies the detailed-balance condition, i.e., for any i ≠ j ∈ [N],

w_i Q_{ij} = w_i \frac{C_{ji}}{2N} = \frac{w_i w_j}{2N(w_i + w_j)} = \frac{w_j w_i}{2N(w_i + w_j)} = w_j \frac{C_{ij}}{2N} = w_j Q_{ji}. \qquad (8.37)
Thus, by finding the stationary distribution of the Markov chain defined above, we
can find the parameters of the MNL model. This boils down to finding the leading left
eigenvector of Q (the eigenvector with eigenvalue 1), which can be done using various efficient algorithms, including the
standard power-iteration method.
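As an illustration, the following sketch (function name ours) builds the chain (8.36) from exact comparison marginals and recovers the MNL weights by power iteration; with exact C it reproduces w up to numerical precision.

```python
import numpy as np

def mnl_from_comparisons(C, iters=10_000, tol=1e-12):
    """Build the Markov chain of (8.36) from comparison marginals
    C[i, j] = P(sigma(i) > sigma(j)) and return its stationary distribution."""
    N = C.shape[0]
    Q = C.T / (2.0 * N)                        # Q[i, j] = C[j, i] / (2N) for i != j
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, 1.0 - Q.sum(axis=1))   # Q[i, i] = 1 - sum_{k != i} C[k, i] / (2N)
    pi = np.full(N, 1.0 / N)                   # initial distribution
    for _ in range(iters):
        nxt = pi @ Q                           # one power-iteration step
        if np.abs(nxt - pi).sum() < tol:
            pi = nxt
            break
        pi = nxt
    return pi / pi.sum()

# Hypothetical example: comparisons generated from known MNL weights w, cf. (8.35).
w = np.array([0.4, 0.3, 0.2, 0.1])
C = w[:, None] / (w[:, None] + w[None, :])
np.fill_diagonal(C, 0.0)
print(mnl_from_comparisons(C))                 # approximately [0.4, 0.3, 0.2, 0.1]
```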
We note that the algorithm employed to find the parameters of the MNL model does
not need access to all entries of C. Let E ⊂ {(i, j) : i ≠ j ∈ [N]} be a subset of all possible
\binom{N}{2} pairs for which we have access to C. Let us define a Markov chain with Q such that,
for i ≠ j ∈ [N], Q_{ij} is defined according to (8.36) if (i, j) ∈ E (we assume that (i, j) ∈ E if and only if
(j, i) ∈ E, since C_{ji} = 1 − C_{ij} by definition), and Q_{ij} = 0 otherwise; and
Q_{ii} = 1 − \sum_{j ≠ i} Q_{ij}. The resulting Markov chain is aperiodic, since by definition Q_{ii} > 0.
Therefore, as long as the resulting Markov chain is irreducible, it has a unique stationary
distribution. The Markov chain is irreducible if all N states are reachable
from each other via the transitions {(i, j), (j, i) : (i, j) ∈ E}. That is, there are data that compare
any two i ≠ j ∈ [N] through, potentially, chains of comparisons, which, in a sense, is a
minimal requirement in order to have a consistent ranking across all i ∈ [N]. Once we have
this, it again follows that the stationary distribution is given by w = [w_i]_{i∈[N]} ∈ [0, 1]^N,
since the detailed-balance equation (8.37) holds for all i ≠ j ∈ [N] with (i, j) ∈ E.
Recovering ranking for a homogeneous RUM. As mentioned in Section 8.3.2, we wish
to recover the ranking or ordering of inherent utilities for a homogeneous RUM. That
is, if u1 ≥ · · · ≥ uN , then the ranking of interest is identity, i.e., σ ∈ S N such that σ(i) = i
for all i ∈ [N]. Recall that the homogeneous RUM random perturbations εi in (8.7)
have an identical distribution for all i ∈ [N]. We shall assume that the distribution of the
random perturbation is absolutely continuous with respect to the Lebesgue measure on
R. Operationally, for any t_1 < t_2 ∈ R,

P(ε_1 ∈ (t_1, t_2)) > 0.  (8.38)
The following is the key characterization of a homogeneous RUM with (8.38) that will
enable recovery of ranking from marginal data (both comparison and first-order); also
see [46, 47].
lemma 8.4 Consider a homogeneous RUM with property (8.38). Then, for i ≠ j ∈ [N],

u_i > u_j ⇔ P(Y_i > Y_j) > \frac{1}{2}.  (8.39)

Further, for any k ≠ i, j ∈ [N],

u_i > u_j ⇔ P(Y_i > Y_k) > P(Y_j > Y_k).  (8.40)
Proof By definition,

P(Y_i > Y_j) = P(ε_i − ε_j > u_j − u_i).  (8.41)

Since ε_i, ε_j are independent and identically distributed (i.i.d.) with property (8.38), the
difference random variable ε_i − ε_j has zero mean, is symmetric, and satisfies property (8.38).
That is, 0 is its unique median as well, and, for any t > 0,

P(ε_i − ε_j > t) = P(ε_i − ε_j < −t) < \frac{1}{2}.  (8.42)

This leads to the conclusion that

u_i > u_j ⇔ P(Y_i > Y_j) > \frac{1}{2}.

Similarly,

P(Y_i > Y_k) = P(ε_i − ε_k > u_k − u_i),  (8.43)
P(Y_j > Y_k) = P(ε_j − ε_k > u_k − u_j).  (8.44)

Now ε_i − ε_k and ε_j − ε_k are identically distributed with property (8.38). That is, each has
a strictly monotonically increasing cumulative distribution function (CDF). Therefore,
(8.40) follows immediately.
Recovering the ranking: comparison marginals. From (8.39) of Lemma 8.4, using
comparison marginals C(ν), we can recover a ranking of [N] that corresponds to the
ranking of the inherent utilities for a generic homogeneous RUM as follows. For each
i ∈ [N], assign the rank

rank(i) = N − |{ j ∈ [N] : j ≠ i, C_{ij} > 1/2 }|.  (8.45)

From Lemma 8.4, it immediately follows that rank provides the ranking of [N] as
desired.
We also note that (8.40) of Lemma 8.4 suggests an alternative way (which will turn
out to be robust and more useful) to find the same ranking. To that end, for each i ∈ [N],
define the score as

score(i) = \frac{1}{N − 1} \sum_{k ≠ i} C_{ik}.  (8.46)

From (8.40) of Lemma 8.4, it follows that, for any i ≠ j ∈ [N],

score(i) > score(j) ⇔ u_i > u_j.  (8.47)

That is, by ordering [N] in decreasing order of score values, we obtain the desired
ranking.
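Both rules are immediate given the comparison marginals; the sketch below (function name ours) computes the rank of (8.45) and the score of (8.46) directly from C.

```python
import numpy as np

def rank_and_score(C):
    """Compute rank(i) as in (8.45) and score(i) as in (8.46) from comparison
    marginals C[i, j] = P(Y_i > Y_j); the diagonal of C is ignored."""
    N = C.shape[0]
    off = ~np.eye(N, dtype=bool)                 # mask selecting the off-diagonal entries
    rank = N - ((C > 0.5) & off).sum(axis=1)     # N - |{j != i : C_ij > 1/2}|
    score = (C * off).sum(axis=1) / (N - 1)      # (1/(N-1)) * sum_{k != i} C_ik
    return rank, score
```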
Recovering the ranking: first-order marginals. We are given the first-order marginal data
matrix M = M(ν) ∈ [0, 1]^{N×N}, where M_{ij} represents P(σ(i) = j) under ν for i, j ∈ [N].
To recover the ranking under a generic homogeneous RUM using M, we shall introduce
the notion of the Borda count; see [48]. More precisely, for any i ∈ [N],

borda(i) = E[σ(i)] = \sum_{j∈[N]} j\, P(σ(i) = j) = \sum_{j∈[N]} j M_{ij}.  (8.48)
That is, borda(i) can be computed using M for any i ∈ [N]. Recall that we argued earlier
that the score(·) (in decreasing order) provides the desired ordering or ranking of [N].
However, computing the score requires access to the comparison marginals C, and it is not
feasible to recover C from M.
On the other hand, intuitively it seems that borda (in increasing order) provides an
ordering over [N] that might be what we want. Next, we state a simple invariant that ties
score(i) and borda(i), which will lead to the conclusion that we can recover the desired
ranking by sorting [N] in increasing order of borda count [46, 47].
lemma 8.5 For any i ∈ [N] and any distribution over permutations,

(N − 1)\, score(i) = N − borda(i).  (8.49)
Proof Consider any permutation σ ∈ S_N. For any i ∈ [N], σ(i) denotes the position in
[N] to which i is ranked. That is, N − σ(i) is precisely the number of elements of [N] (other
than i) that are ranked below i. Formally,

N − σ(i) = \sum_{j ≠ i} 1(σ(i) > σ(j)).  (8.50)

Taking the expectation on both sides with respect to the underlying distribution over
permutations and rearranging terms, we obtain

N = E[σ(i)] + \sum_{j ≠ i} P(σ(i) > σ(j)).  (8.51)
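The identity behind Lemma 8.5 is easy to verify numerically. The toy check below is our own (N = 4, a randomly chosen distribution over permutations), and it reads “σ(i) > σ(j)” in the chapter's sense, i.e., “i is placed in a better (smaller-numbered) position than j,” consistent with (8.34) and (8.35).

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
N = 4
perms = list(permutations(range(1, N + 1)))       # sigma encoded as (sigma(1), ..., sigma(N))
nu = rng.random(len(perms))
nu /= nu.sum()                                    # a random distribution over permutations

for i in range(N):
    borda_i = sum(p * sigma[i] for p, sigma in zip(nu, perms))             # E[sigma(i)]
    wins_i = sum(p * sum(sigma[i] < sigma[j] for j in range(N) if j != i)  # i ranked above j
                 for p, sigma in zip(nu, perms))
    assert abs(borda_i + wins_i - N) < 1e-9       # N = E[sigma(i)] + sum_j P(i above j)
```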
That is, using the same algorithm for estimating a parameter as in the case of access
to the exact marginals, we obtain an estimator which seems to have reasonably good
properties.
Recovering the MNL model: comparison marginals. We shall utilize the noisy compari-
son data to create a Markov chain as in Section 8.5.1. The stationary distribution of this
noisy or perturbed Markov chain will be a good approximation of that of the original Markov
chain, i.e., of the true MNL parameters. This will lead to a good estimator of the MNL
model using noisy comparison data.
To that end, we have access to noisy comparison marginals Ĉ = C + η. To keep things
generic, we shall assume that we have access to comparisons only for a subset of pairs. Let
E ⊂ {(i, j) : i ≠ j ∈ [N]} denote the subset of all possible \binom{N}{2} pairs for which we have
access to noisy comparison marginals, and we shall assume that (i, j) ∈ E iff (j, i) ∈ E.
Define d_i = |{ j ∈ [N] : j ≠ i, (i, j) ∈ E }| and d_max = max_i d_i. Define a noisy Markov chain
with transition matrix Q̂ as

Q̂_{ij} = \begin{cases} Ĉ_{ji}/2d_{max} & \text{if } i ≠ j,\ (i, j) ∈ E, \\ 1 − (1/2d_{max}) \sum_{j: (i,j) ∈ E} Ĉ_{ji} & \text{if } i = j, \\ 0 & \text{if } (i, j) ∉ E. \end{cases} \qquad (8.54)
We shall assume that E is such that the resulting Markov chain with transition matrix Q̂
is irreducible; it is aperiodic since Q̂_{ii} > 0 for all i ∈ [N] by definition (8.54). As before,
it can be verified that this noisy Markov chain is reversible and has a unique stationary
distribution that satisfies the detailed-balance condition. Let π̂ denote this stationary
distribution. The corresponding ideal Markov chain has transition matrix Q defined as

Q_{ij} = \begin{cases} C_{ji}/2d_{max} & \text{if } i ≠ j,\ (i, j) ∈ E, \\ 1 − (1/2d_{max}) \sum_{j: (i,j) ∈ E} C_{ji} & \text{if } i = j, \\ 0 & \text{if } (i, j) ∉ E, \end{cases} \qquad (8.55)

and its stationary distribution π satisfies

π^T Q = π^T.  (8.56)
We can find π̂ using a power-iteration algorithm. More precisely, let ν_0 ∈ [0, 1]^N be a
probability distribution serving as our initial guess. Iteratively, for iteration t ≥ 0,

ν_{t+1}^T = ν_t^T Q̂.  (8.57)
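A sketch of this procedure follows (our own interface: the observed entries are passed as a dictionary keyed by ordered pairs, with (i, j) present iff (j, i) is). It builds Q̂ according to (8.54) and runs the power iteration (8.57).

```python
import numpy as np

def noisy_rank_centrality(C_hat, N, iters=5000):
    """C_hat[(i, j)] is the noisy estimate of P(i beats j) for observed pairs E.
    Build Q_hat as in (8.54) and return the limit of the power iteration (8.57)."""
    deg = np.zeros(N, dtype=int)
    for (i, j) in C_hat:
        deg[i] += 1                                  # d_i = |{j : (i, j) in E}|
    d_max = deg.max()
    Q_hat = np.zeros((N, N))
    for (i, j) in C_hat:
        Q_hat[i, j] = C_hat[(j, i)] / (2.0 * d_max)  # off-diagonal entries of (8.54)
    np.fill_diagonal(Q_hat, 1.0 - Q_hat.sum(axis=1)) # diagonal entries of (8.54)
    nu = np.full(N, 1.0 / N)
    for _ in range(iters):
        nu = nu @ Q_hat                              # nu_{t+1}^T = nu_t^T Q_hat
    return nu / nu.sum()
```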
Here Δ = Q̂ − Q, π_max = max_i π_i, and π_min = min_i π_i; let λ_max(Q) be the second largest (in
norm) eigenvalue of Q; ρ = λ_max(Q) + ‖Δ‖_2 \sqrt{π_max/π_min}; and it is assumed that ρ < 1.
Before we provide the proof of this lemma, let us consider its implications. It quantifies
the robustness of our approach for identifying the parameters of the MNL model using
comparison data. Specifically, since lim_{t→∞} ν_t = π̂, (8.58) implies

\frac{‖π̂ − π‖}{‖π‖} ≤ \frac{1}{1 − ρ}\, ‖Δ‖_2 \sqrt{\frac{π_{max}}{π_{min}}}.  (8.59)
Proof of Lemma 8.6. Define the inner-product space induced by π. For any u, v ∈ R^N,
define the inner product ⟨·, ·⟩_π as

⟨u, v⟩_π = \sum_i u_i v_i π_i.  (8.64)

This defines the norm ‖u‖_π = \sqrt{⟨u, u⟩_π} for all u ∈ R^N. Let L_2(π) denote the space of all vectors
with finite ‖·‖_π norm, endowed with the inner product ⟨·, ·⟩_π. Then, for any u, v ∈ L_2(π),

⟨u, Qv⟩_π = \sum_i u_i \Bigl( \sum_j Q_{ij} v_j \Bigr) π_i = \sum_{i,j} u_i v_j π_i Q_{ij} = \sum_{i,j} u_i v_j π_j Q_{ji} = \sum_j π_j v_j \Bigl( \sum_i Q_{ji} u_i \Bigr) = ⟨Qu, v⟩_π.  (8.65)
That is, Q is self-adjoint over L_2(π). For a self-adjoint matrix Q over L_2(π), define the
norm

‖Q‖_{2,π} = \max_{u ≠ 0} \frac{‖Qu‖_π}{‖u‖_π}.  (8.66)

It can be verified that, for any u ∈ R^N and Q,

\sqrt{π_{min}}\, ‖u‖_2 ≤ ‖u‖_π ≤ \sqrt{π_{max}}\, ‖u‖_2,  (8.67)
\sqrt{π_{min}/π_{max}}\, ‖Q‖_2 ≤ ‖Q‖_{2,π} ≤ \sqrt{π_{max}/π_{min}}\, ‖Q‖_2.  (8.68)
Consider the symmetrized version of Q given by S = Π^{1/2} Q Π^{−1/2}, where Π^{±1/2} is the N × N
diagonal matrix whose ith diagonal entry is π_i^{±1/2}. The symmetry of S
follows from the detailed-balance property of Q, i.e., π_i Q_{ij} = π_j Q_{ji} for all i, j. Since
Q is a probability transition matrix, by the Perron–Frobenius theorem its eigenvalues lie
in [−1, 1], with the top eigenvalue being 1 and unique. Let the
eigenvalues be 1 = λ_1 > λ_2 ≥ · · · ≥ λ_N > −1, and let the corresponding (left) eigenvectors of
Q be v_1, . . . , v_N. By definition v_1 = π. Therefore, u_i = Π^{−1/2} v_i are (left) eigenvectors of S
with eigenvalue λ_i for 1 ≤ i ≤ N, since

u_i^T S = (Π^{−1/2} v_i)^T Π^{1/2} Q Π^{−1/2} = v_i^T Q Π^{−1/2} = λ_i v_i^T Π^{−1/2} = λ_i (Π^{−1/2} v_i)^T = λ_i u_i^T.  (8.69)

That is, u_1 = π^{1/2}, or Π^{−1/2} u_1 = 1. By the singular value decomposition, we can write
S = S_1 + S_{\setminus 1}, where S_1 = λ_1 u_1 u_1^T and S_{\setminus 1} = \sum_{i=2}^{N} λ_i u_i u_i^T. That is,

Q = Π^{−1/2} S Π^{1/2} = Π^{−1/2} S_1 Π^{1/2} + Π^{−1/2} S_{\setminus 1} Π^{1/2} = 1π^T + Π^{−1/2} S_{\setminus 1} Π^{1/2}.  (8.70)
Since ν_{t+1}^T = ν_t^T Q̂ = ν_t^T (Q + Δ) and π^T Q = π^T, we have

ν_{t+1}^T − π^T = (ν_t − π)^T (1π^T + Π^{−1/2} S_{\setminus 1} Π^{1/2}) + (ν_t − π)^T Δ + π^T Δ
               = (ν_t − π)^T Π^{−1/2} S_{\setminus 1} Π^{1/2} + (ν_t − π)^T Δ + π^T Δ,  (8.71)
where we used the fact that (ν_t − π)^T 1 = 0, since both ν_t and π are probability vectors.
Now, for any matrix W, ‖Π^{−1/2} W Π^{1/2}‖_{2,π} = ‖W‖_2. Therefore,

‖ν_{t+1} − π‖_π ≤ ‖ν_t − π‖_π (‖S_{\setminus 1}‖_2 + ‖Δ‖_{2,π}) + ‖π^T Δ‖_π.  (8.72)

By definition ‖S_{\setminus 1}‖_2 = max{λ_2, |λ_N|} = λ_max(Q). Let γ = λ_max(Q) + ‖Δ‖_{2,π}. Then

‖ν_t − π‖_π ≤ γ^t ‖ν_0 − π‖_π + \sum_{s=0}^{t−1} γ^s ‖π^T Δ‖_π.  (8.73)
where η_{i·} = [η_{ik}]_{k∈[N]}. Therefore, the relative order of any pair of i, j ∈ [N] is preserved
under a noisy score as long as
That is,
That is, error(i) is like computing the Borda count for i using η^+ ≡ [|η_{ik}|], which we
define as borda^{η^+}(i). Then the relative order of any pair i, j ∈ [N] under the noisy Borda count
is preserved if

borda^{η^+}(i) + borda^{η^+}(j) < |borda(i) − borda(j)|.  (8.85)
8.6 Discussion
“low-rank” structure on the model parameter matrix to enable recovery. In another line
of work, using higher-moment information for a separable mixture model, in [57] Oh
and Shah provided a tensor-decomposition-based approach for recovering the mixture
MNL model.
References
[1] J. Huang, C. Guestrin, and L. Guibas, “Fourier theoretic probabilistic inference over
permutations,” J. Machine Learning Res., vol. 10, no. 5, pp. 997–1070, 2009.
[2] P. Diaconis, Group representations in probability and statistics. Institute of Mathematical
Statistics, 1988.
[3] L. Thurstone, “A law of comparative judgement,” Psychol. Rev., vol. 34, pp. 237–286,
1927.
[4] J. Marschak, “Binary choice constraints on random utility indicators,” Cowles Foundation
Discussion Paper, 1959.
[5] J. Marschak and R. Radner, Economic theory of teams. Yale University Press, 1972.
[6] J. I. Yellott, “The relationship between Luce’s choice axiom, Thurstone’s theory of
comparative judgment, and the double exponential distribution,” J. Math. Psychol.,
vol. 15, no. 2, pp. 109–144, 1977.
[7] R. Luce, Individual choice behavior: A theoretical analysis. Wiley, 1959.
[8] R. Plackett, “The analysis of permutations,” Appl. Statist., vol. 24, no. 2, pp. 193–202,
1975.
[9] D. McFadden, “Conditional logit analysis of qualitative choice behavior,” in Frontiers in
Econometrics, P. Zarembka, ed. Academic Press, 1973, pp. 105–142.
[10] G. Debreu, “Review of R. D. Luce, ‘individual choice behavior: A theoretical analysis,’”
Amer. Economic Rev., vol. 50, pp. 186–188, 1960.
[11] R. A. Bradley, “Some statistical methods in taste testing and quality evaluation,”
Biometrics, vol. 9, pp. 22–38, 1953.
[12] D. McFadden, “Econometric models of probabilistic choice,” in Structural analysis of
discrete data with econometric applications, C. F. Manski and D. McFadden, eds. MIT
Press, 1981.
[13] M. E. Ben-Akiva and S. R. Lerman, Discrete choice analysis: Theory and application to
travel demand. MIT Press, 1985.
[14] D. McFadden, “Disaggregate behavioral travel demand’s RUM side: A 30-year
retrospective,” Travel Behaviour Res., pp. 17–63, 2001.
[15] P. M. Guadagni and J. D. C. Little, “A logit model of brand choice calibrated on scanner
data,” Marketing Sci., vol. 2, no. 3, pp. 203–238, 1983.
[16] S. Mahajan and G. J. van Ryzin, “On the relationship between inventory costs and variety
benefits in retail assortments,” Management Sci., vol. 45, no. 11, pp. 1496–1509, 1999.
[17] M. E. Ben-Akiva, “Structure of passenger travel demand models,” Ph.D. dissertation,
Department of Civil Engineering, MIT, 1973.
[18] J. H. Boyd and R. E. Mellman, “The effect of fuel economy standards on the U.S.
automotive market: An hedonic demand analysis,” Transportation Res. Part A: General,
vol. 14, nos. 5–6, pp. 367–378, 1980.
[19] N. S. Cardell and F. C. Dunbar, “Measuring the societal impacts of automobile
downsizing,” Transportation Res. Part A: General, vol. 14, nos. 5–6, pp. 423–434, 1980.
[20] D. McFadden and K. Train, “Mixed MNL models for discrete response,” J. Appl.
Econometrics, vol. 15, no. 5, pp. 447–470, 2000.
[21] K. Bartels, Y. Boztug, and M. M. Muller, “Testing the multinomial logit model,” 1999,
unpublished working paper.
[22] J. L. Horowitz, “Semiparametric estimation of a work-trip mode choice model,” J.
Econometrics, vol. 58, pp. 49–70, 1993.
[23] B. Koopman, “On distributions admitting a sufficient statistic,” Trans. Amer. Math. Soc.,
vol. 39, no. 3, pp. 399–409, 1936.
[24] B. Crain, “Exponential models, maximum likelihood estimation, and the Haar condition,”
J. Amer. Statist. Assoc., vol. 71, pp. 737–745, 1976.
[25] R. Beran, “Exponential models for directional data,” Annals Statist., vol. 7, no. 6, pp.
1162–1178, 1979.
[26] M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational
inference,” Foundations and Trends Machine Learning, vol. 1, nos. 1–2, pp. 1–305, 2008.
[27] S. Jagabathula and D. Shah, “Inferring rankings under constrained sensing,” in Advances
in Neural Information Processing Systems, 2009, pp. 753–760.
[28] S. Jagabathula and D. Shah, “Inferring rankings under constrained sensing,” IEEE Trans.
Information Theory, vol. 57, no. 11, pp. 7288–7306, 2011.
[29] V. Farias, S. Jagabathula, and D. Shah, “A data-driven approach to modeling choice,” in
Advances in Neural Information Processing Systems, 2009.
[30] V. Farias, S. Jagabathula, and D. Shah, “A nonparametric approach to modeling choice
with limited data,” Management Sci., vol. 59, no. 2, pp. 305–322, 2013.
[31] K. J. Arrow, “A difficulty in the concept of social welfare,” J. Political Economy, vol. 58,
no. 4, pp. 328–346, 1950.
[32] J. Marden, Analyzing and modeling rank data. Chapman & Hall/CRC, 1995.
[33] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Information
Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[34] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete
and inaccurate measurements,” Communications Pure Appl. Math., vol. 59, no. 8, pp.
1207–1223, 2006.
[35] E. J. Candès and J. Romberg, “Quantitative robust uncertainty principles and optimally
sparse decompositions,” Foundations Computational Math., vol. 6, no. 2, pp. 227–254,
2006.
[36] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal
reconstruction from highly incomplete frequency information,” IEEE Trans. Information
Theory, vol. 52, no. 2, pp. 489–509, 2006.
[37] D. L. Donoho, “Compressed sensing,” IEEE Trans. Information Theory, vol. 52, no. 4,
pp. 1289–1306, 2006.
[38] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss, “Combining geometry
and combinatorics: A unified approach to sparse signal recovery,” in Proc. 46th Annual
Allerton Conference on Communications, Control, and Computation, 2008, pp. 798–805.
[39] S. Chatterjee, “Matrix estimation by universal singular value thresholding,” Annals Statist.,
vol. 43, no. 1, pp. 177–214, 2015.
[40] D. Song, C. E. Lee, Y. Li, and D. Shah, “Blind regression: Nonparametric regression
for latent variable models via collaborative filtering,” in Advances in Neural Information
Processing Systems, 2016, pp. 2155–2163.
[41] C. Borgs, J. Chayes, C. E. Lee, and D. Shah, “Thy friend is my friend: Iterative collabora-
tive filtering for sparse matrix estimation,” in Advances in Neural Information Processing
Systems, 2017, pp. 4715–4726.
[42] N. Shah, S. Balakrishnan, A. Guntuboyina, and M. Wainwright, “Stochastically transitive
models for pairwise comparisons: Statistical and computational issues,” in International
Conference on Machine Learning, 2016, pp. 11–20.
[43] V. F. Farias, S. Jagabathula, and D. Shah, “Sparse choice models,” in 46th Annual
Conference on Information Sciences and Systems (CISS), 2012, pp. 1–28.
[44] S. Negahban, S. Oh, and D. Shah, “Iterative ranking from pair-wise comparisons,” in
Advances in Neural Information Processing Systems, 2012, pp. 2474–2482.
[45] S. Negahban, S. Oh, and D. Shah, “Rank centrality: Ranking from pairwise comparisons,”
Operations Res., vol. 65, no. 1, pp. 266–287, 2016.
[46] A. Ammar and D. Shah, “Ranking: Compare, don’t score,” in Proc. 49th Annual Allerton
Conference on Communication, Control, and Computing, 2011, pp. 776–783.
[47] A. Ammar and D. Shah, “Efficient rank aggregation using partial data,” in ACM
SIGMETRICS Performance Evaluation Rev., vol. 40, no. 1, 2012, pp. 355–366.
[48] P. Emerson, “The original Borda count and partial voting,” Social Choice and Welfare,
vol. 40, no. 2, pp. 353–358, 2013.
[49] B. Hajek, S. Oh, and J. Xu, “Minimax-optimal inference from partial rankings,” in
Advances in Neural Information Processing Systems, 2014, pp. 1475–1483.
[50] Y. Chen and C. Suh, “Spectral MLE: Top-K rank aggregation from pairwise comparisons,”
in International Conference on Machine Learning, 2015, pp. 371–380.
[51] Y. Chen, J. Fan, C. Ma, and K. Wang, “Spectral method and regularized MLE are both
optimal for top-K ranking,” arXiv:1707.09971, 2017.
[52] M. Jang, S. Kim, C. Suh, and S. Oh, “Top-k ranking from pairwise comparisons: When
spectral ranking is optimal,” arXiv:1603.04153, 2016.
[53] L. Maystre and M. Grossglauser, “Fast and accurate inference of Plackett–Luce models,”
in Advances in Neural Information Processing Systems, 2015, pp. 172–180.
[54] A. Ammar, S. Oh, D. Shah, and L. F. Voloch, “What’s your choice?: Learning the mixed
multi-nomial,” in ACM SIGMETRICS Performance Evaluation Rev., vol. 42, no. 1, 2014,
pp. 565–566.
[55] S. Oh, K. K. Thekumparampil, and J. Xu, “Collaboratively learning preferences from ordi-
nal data,” in Advances in Neural Information Processing Systems, 2015, pp. 1909–1917.
[56] Y. Lu and S. N. Negahban, “Individualized rank aggregation using nuclear norm regu-
larization,” in Proc. 53rd Annual Allerton Conference on Communication, Control, and
Computing, 2015, pp. 1473–1479.
[57] S. Oh and D. Shah, “Learning mixed multinomial logit model from ordinal data,” in
Advances in Neural Information Processing Systems, 2014, pp. 595–603.
[58] N. B. Shah, S. Balakrishnan, and M. J. Wainwright, “Feeling the bern: Adaptive estimators
for Bernoulli probabilities of pairwise comparisons,” in IEEE International Symposium on
Information Theory (ISIT), 2016, pp. 1153–1157.
[59] M. Adler, P. Gemmell, M. Harchol-Balter, R. M. Karp, and C. Kenyon, “Selection in the
presence of noise: The design of playoff systems,” in SODA, 1994, pp. 564–572.
[60] K. G. Jamieson and R. Nowak, “Active ranking using pairwise comparisons,” in Advances
in Neural Information Processing Systems, 2011, pp. 2240–2248.
[61] A. Karbasi, S. Ioannidis, and L. Massoulié, “From small-world networks to comparison-
based search,” IEEE Trans. Information Theory, vol. 61, no. 6, pp. 3056–3074, 2015.
[62] M. Braverman and E. Mossel, “Noisy sorting without resampling,” in Proc. 19th Annual
ACM–SIAM Symposium on Discrete Algorithms, 2008, pp. 268–276.
[63] Y. Yue and T. Joachims, “Interactively optimizing information retrieval systems as a
dueling bandits problem,” in Proc. 26th Annual International Conference on Machine
Learning, 2009, pp. 1201–1208.
[64] K. G. Jamieson, S. Katariya, A. Deshpande, and R. D. Nowak, “Sparse dueling bandits,”
in AISTATS, 2015.
[65] M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi, “Contextual dueling
bandits,” in Proc. 28th Conference on Learning Theory, 2015, pp. 563–587.
[66] R. Kondor, A. Howard, and T. Jebara, “Multi-object tracking with representations of the
symmetric group,” in Artificial Intelligence and Statistics, 2007, pp. 211–218.
[67] M. Jerrum, A. Sinclair, and E. Vigoda, “A polynomial-time approximation algorithm for
the permanent of a matrix with nonnegative entries,” J. ACM, vol. 51, no. 4, pp. 671–697,
2004.
[68] T. L. Saaty and G. Hu, “Ranking by eigenvector versus other methods in the analytic
hierarchy process,” Appl. Math. Lett., vol. 11, no. 4, pp. 121–125, 1998.
[69] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, “Rank aggregation methods for the
web,” in Proc. 10th International Conference on World Wide Web, 2001, pp. 613–622.
[70] A. Rajkumar and S. Agarwal, “A statistical convergence perspective of algorithms for rank
aggregation from pairwise data,” in International Conference on Machine Learning, 2014,
pp. 118–126.
[71] H. Azari, D. Parks, and L. Xia, “Random utility theory for social choice,” in Advances in
Neural Information Processing Systems, 2012, pp. 126–134.
9 Universal Clustering
Ravi Kiran Raman and Lav R. Varshney
Summary
Clustering is a general term for the set of techniques that, given a set of objects, aim to
select those that are closer to one another than to the rest of the objects, according to
a chosen notion of closeness. It is an unsupervised learning problem since objects are
not externally labeled by category. Much research effort has been expended on finding
natural mathematical definitions of closeness and then developing/evaluating algorithms
in these terms [1]. Many have argued that there is no domain-independent mathematical
notion of similarity, but that it is context-dependent [2]; categories are perhaps natural in
that people can evaluate them when they see them [3, 4]. Some have dismissed the prob-
lem of unsupervised learning in favor of supervised learning, saying it is not a powerful
natural phenomenon (see p. 159 of [5]).
Yet, most of the learning carried out by people and animals is unsupervised. We
largely learn how to think through categories by observing the world in its unlabeled
state. This is a central problem in data science. Whether grouping behavioral traces into
categories to understand their neural correlates [6], grouping astrophysical data to under-
stand galactic potentials [7], or grouping light spectra into color names [8], clustering is
crucial to data-driven science.
Drawing on insights from universal information theory, in this chapter we ask whether
there are universal approaches to unsupervised clustering. In particular, we consider
instances wherein the ground-truth clusters are defined by the unknown statistics gov-
erning the data to be clustered. By universality, we mean that the system does not have
prior access to such statistical properties of the data to be clustered (as is standard in
machine learning), nor does it have a strong sense of the appropriate notion of similarity
to measure which objects are close to one another.
In an age with explosive data-acquisition rates, we have seen a drastic increase in the
amount of raw, unlabeled data in the form of pictures, videos, text, and voice. By 2023,
it is projected that the per capita amount of data stored in the world will exceed the entire
Library of Congress (10^14 bits) at the time Shannon first estimated it in 1949 [9, 10].
Making sense of such large volumes of data, e.g., for statistical inference, requires
extensive, efficient pre-processing to transform the data into meaningful forms. In addi-
tion to the increasing volumes of known data forms, we are also collecting new varieties
of data with which we have little experience [10].
In astronomy, GalaxyZoo uses the general intelligence of people in crowdsourcing
systems to classify celestial imagery (and especially point out unknown and anoma-
lous objects in the universe); supervised learning is not effective since there is a limited
amount of training data available for this task, especially given the level of noise in the
images [11]. Analyses of such new data forms often need to be performed with minimal
or no contextual information.
Crowdsourcing is also popular for collecting (noisy) labeled training data for
machine-learning algorithms [12, 13]. As an example, impressive classification perfor-
mance of images has been attained using deep convolutional networks on the ImageNet
database [14], but this is after training over a set of 1000 classes of images with 1000
labeled samples for each class, i.e., one million labeled samples, which were obtained
from costly crowdsourcing [15]. Similarly, Google Translate has achieved near-human-
level translation using deep recurrent neural networks (RNNs) [16] at the cost of large
labeled training corpora on the order of millions of examples [17]. For instance, the
training dataset for English to French consisted of 36 million pairs – far more than what
the average human uses. On the other hand, reinforcement learning methods in their
current form rely on exploring state–action pairs to learn the underlying cost function
by Q-learning. While this might be feasible in learning games such as chess and Go, in
practice such exploration through reinforcement might be expensive. For instance, the
recently successful Alpha Go algorithm uses a combination of supervised and reinforce-
ment learning of the optimal policy, by learning from a dataset consisting of 29.4 million
positions learned from 160 000 games [18]. Besides this dataset, the algorithm learned
from the experience of playing against other Go programs and players. Although col-
lecting such a database and the games to learn from might be feasible for a game of Go,
in general inferring from experience is far more expensive.
Although deep learning has given promising solutions for supervised learning prob-
lems, results for unsupervised learning pale in comparison [19]. Is it possible to make
sense of (training) data without people? Indeed, one of the main goals of artificial
general intelligence has always been to develop solutions that work with minimal
supervision [20]. To move beyond data-intensive supervised learning methods, there
is growing interest in unsupervised learning problems such as density/support estima-
tion, clustering, and independent-component analysis. Careful choice and application
of unsupervised methods often reveal structure in the data that is otherwise not evi-
dent. Good clustering solutions help sort unlabeled data into simpler sub-structures,
not only allowing for subsequent inferential learning, but also adding insights into
the characteristics of the data being studied. This chapter studies information-theoretic
formulations of unsupervised clustering, emphasizing their strengths under limited
contextual information.
in data compression the shortest code length cannot be achieved without taking advantage of the
regular features in data, while in estimation it is these regular features, the underlying mecha-
nism, that we want to learn . . . like a jigsaw puzzle where the pieces almost fit but not quite, and,
moreover, vital pieces were missing.
This observation is significant not only in reaffirming the connection between the two
fields, but also in highlighting the fact that the unification/translation of ideas often
requires closer inspection and some additional effort.
Compression, by definition, is closely related to the task of data clustering. Consider
the k-means clustering algorithm where each data point is represented by one of k pos-
sible centroids, chosen according to Euclidean proximity (and thereby similar to lossy
data compression under the squared loss function where the compressed representation
can be viewed as a cluster center). Explicit clustering formulations for communication
under the universal setting, wherein the transmitted messages represent cluster labels,
have also been studied [28]; this work and its connection to clustering are elaborated in
Section 9.5.2.
It is evident from history and philosophy that universal compression and commu-
nication may yield much insight into both the design and the analysis of inference
algorithms, especially of clustering. Thus, we give some brief insight into prominent
prior work in these areas.
Compression
Compression, as originally introduced by Shannon [21], considers the task of efficiently
representing discrete data samples, X1 , . . . , Xn , generated by a source independently and
identically according to a distribution PX (·), as shown in Fig. 9.1. Shannon showed that
the best rate of lossless compression for such a case is given by the entropy, H(PX ), of
the source.
A variety of subsequently developed methods, such as arithmetic coding [29], Huff-
man coding [30], and Shannon–Fano–Elias coding [26], approach this limit to perform
optimal compression of memoryless sources. However, they often require knowledge of
the source distribution for encoding and decoding.
On the other hand, Lempel and Ziv devised a universal encoding scheme for com-
pressing unknown sources [31], which has proven to be asymptotically optimal for
strings generated by stationary, ergodic sources [32, 33]. This result is impressive as
it highlights the feasibility of optimal compression (in the asymptotic sense), without
contextual knowledge, by using a simple encoding scheme. Unsurprisingly, this work
also inspired the creation of practical compression algorithms such as gzip. The method
has subsequently been generalized to random fields (such as images) with lesser stor-
age requirements as well [34]. Another approach to universal compression is to first
approximate the source distribution from the symbol stream, and then perform optimal
compression for this estimated distribution [35–37]. Davisson identifies the necessary
and sufficient condition for noiseless universal coding of ergodic sources as being that
the per-letter average mutual information between the parameter that defines the source
distribution and the message is zero [38].
Figure 9.1 Model of data compression. The minimum rate of compression for a memoryless
source is given by the entropy of the source.
Communication
In the context of communicating messages across a noisy, discrete memoryless chan-
nel (DMC), W, Shannon established that the rate of communication is limited by the
capacity of the channel [21, 48], which is given by

C(W) = \max_{P_X} I(X; Y).  (9.1)
Here, the maximum is over all information sources generating the codeword symbol X
that is transmitted through the channel W to receive the noisy symbol Y, as shown in
Fig. 9.2.
Shannon showed that using a random codebook for encoding the messages and joint
typicality decoding achieves the channel capacity. Thus, the decoder requires knowledge
of the channel. Goppa later defined the maximum mutual information (MMI) decoder
to perform universal decoding [49]. The decoder estimates the message as the one with
the codeword that maximizes the empirical mutual information with the received vec-
tor. This decoder is universally optimal in the error exponent [50] and has also been
generalized to account for erasures using constant-composition random codes [51].
Universal communication over channels subject to conditions on the decoding error
probability has also been studied [52]. Besides DMCs, the problem of communica-
tion over channels with memory has also been extensively studied both with channel
knowledge [53–55] and in the universal sense [56, 57].
Note that the problem of universal communication translates to one of classification
of the transmitted codewords, on the basis of the message, using noisy versions without
knowledge of the channel. In addition, if the decoder is unaware of the transmission
codebook, then the best one could hope for is to cluster the messages. This problem
has been studied in [28, 58], which demonstrate that the MMI decoder is sub-optimal
Figure 9.2 Model of data communication. The maximum rate of communication for discrete
memoryless channels is given by the capacity of the channel.
for this context, and therefore introduce minimum partition information (MPI) decod-
ing to cluster the received noisy codewords. More recently, universal communication in
the presence of noisy codebooks has been considered [59]. A comprehensive survey of
universal communication under a variety of channel models can be found in [60].
L_G(y) = \frac{\sup_{θ ∈ Θ_1} p_θ(y)}{\sup_{θ ∈ Θ_0} p_θ(y)},
is used in place of the likelihood ratio. The generalized likelihood ratio test (GLRT) is
asymptotically optimal under specific hypothesis classes [61].
A universal version of composite hypothesis testing is given by the competitive
minimax formulation [62]:
\min_{δ} \max_{(θ_0, θ_1) ∈ Θ_0 × Θ_1} \frac{P_e(δ | θ_0, θ_1)}{\inf_{\tilde{δ}} P_e(\tilde{δ} | θ_0, θ_1)}.
That is, the formulation compares the worst-case error probabilities with the correspond-
ing Bayes error probability of the binary hypothesis test. The framework highlights the
loss from the composite nature of the hypotheses and from the lack of knowledge of
these families, proving to be a benchmark. This problem has also been studied in the
Neyman–Pearson formulation [63].
Universal information theory has also been particularly interested in the problem of
prediction. In particular, given the sequence X1 , . . . , Xn , the aim is to predict Xn+1 such
Figure 9.3 Composite hypothesis testing: infer a hypothesis without knowledge of the conditional
distribution of the observations. In this and all other subsequent figures in this chapter we adopt
the convention that the decoders and algorithms are aware of the system components enclosed in
solid lines and unaware of those enclosed by dashed lines.
Figure 9.4 Prediction: estimate the next symbol given all past symbols without knowledge of the
stationary, ergodic source distribution.
Figure 9.5 Denoising: estimate a transmitted sequence of symbols without knowledge of one or
both of the channel and source distributions.
Figure 9.6 Multireference alignment: estimate the transmitted sequence up to a cyclic rotation,
without knowledge of the channel distribution and permutation.
Figure 9.7 Image registration: align a corrupted copy Y to reference X without knowledge of the
channel distribution.
Figure 9.8 Delay estimation: estimate the finite cyclic delay from a noisy, cyclically rotated
sequence without channel knowledge.
Figure 9.9 Clustering: determine the correct partition without knowledge of the conditional
distributions of observations.
only to align the copy to the reference and not necessarily to denoise the signal. The max
mutual information method has been considered for universal image registration [71].
In the universal version of the registration problem with unknown discrete memory-
less channels, the decoder has been shown to be universally asymptotically optimal in
the error exponent for registering two images (but not for more than two images) [72].
The registration approach is derived from that used for universal delay estimation for
discrete channels and memoryless sources, given cyclically shifted, noisy versions of
the transmitted signal, in [73]. The model for the delay estimation problem is shown in
Fig. 9.8.
Classification and clustering of data sources have also been considered from the uni-
versal setting and can broadly be summarized by Fig. 9.9. As described earlier, even
the communication problem can be formulated as one of universal clustering of mes-
sages, encoded according to an unknown random codebook, and transmitted across an
unknown discrete memoryless channel, by using the MPI decoder [58].
Note that the task of clustering such objects is strictly stronger than the universal
binary hypothesis test of identifying whether two objects are drawn from the same source
or not. Naturally, several ideas from the hypothesis literature translate to the design of
clustering algorithms. Binary hypothesis testing, formulated as a classification problem
using empirically observed statistics, and the relation of the universal discriminant func-
tions to universal data compression are elaborated in [74]. An asymptotically optimal
decision rule for the problem under hidden Markov models with unknown statistics in
the Neyman–Pearson formulation is designed and compared with the GLRT in [75].
Unsupervised clustering of discrete objects under universal crowdsourcing was stud-
ied in [76]. In the absence of knowledge of the crowd channels, we define budget-
optimal universal clustering algorithms that use distributional identicality and response
Figure 9.10 Outlier hypothesis testing: determine outlier without knowledge of typical and outlier
hypothesis distributions.
Figure 9.11 Data-generation model for universal clustering. Here we cluster n data sources with
labels (L_1, . . . , L_n) = (1, 2, 2, . . . , ℓ). True labels define unknown source distributions. Latent
vectors are drawn from the source distributions and corrupted by the unknown noise channel,
resulting in m-dimensional observation vectors for each source. Universal clustering algorithms
cluster the sources using the observation vectors generated according to this model.
let W(Y1 , . . . , Yn |X1 , . . . , Xn ) be the channel that corrupts the latent vectors to generate
observation vectors Y1 , . . . , Yn ∈ Ym . The specific source or channel models that define
the data samples are application-specific.
A clustering algorithm infers the correct clustering of the data sources using the obser-
vation vectors, such that any two objects are in the same cluster if and only if they share
the same label. The statistical model is depicted in Fig. 9.11.
definition 9.4 (Universal clustering) A clustering algorithm Φ is universal if it per-
forms clustering in the absence of knowledge of the source and/or channel distributions,
{P_1, . . . , P_ℓ}, W, and other parameters defining the data samples.
Example 9.1 Let us now consider a simple example to illustrate the statistical model
under consideration. Let μ_1, . . . , μ_ℓ ∈ R^n be the latent vector distributions corresponding
to the labels, i.e., the source distribution is given by Q j (μ j ) = 1. That is, if source i has
label j, i.e., Li = j, then Xi = μ j with probability one. Let W be a memoryless additive
white Gaussian noise (AWGN) channel, i.e., Yi = Xi + Zi , where Zi ∼ N(0, I).
Since the latent vector representation is dictated by the mean of the observation vec-
tors corresponding to the label, and since the distribution of each observation vector is
Gaussian, the maximum-likelihood estimation of the correct cluster translates to the par-
tition that minimizes the intra-cluster distance of points from the centroids of the cluster.
That is, it is the same problem as that which the k-means algorithm attempts to solve.
the distance criterion, and the ML estimate as obtained from the k-means algorithm (with
appropriate initializations) being the optimal clustering algorithm. One can extrapolate
the example to see that a Laplacian model for data generation with Hamming loss would
translate to an L1 -distance criterion among the data vectors.
In the absence of knowledge of the statistics governing data generation, we require
appropriate notions of similarity to define insightful clustering algorithms. Let us define
the similarity score of the set of objects {Y1 , . . . , Yn } clustered according to a partition
P as S(Y_1, . . . , Y_n; P), where S : (Y^m)^n × 𝒫 → R. Clustering can equivalently be defined
by a notion of distance between objects, where the similarity can be treated as just the
negative/inverse of the distance.
Identifying the best clustering for a task is thus equivalent to maximizing the intra-
cluster (or minimizing the inter-cluster) similarity of objects over the space of all viable
partitions, i.e.,

P^* ∈ \arg\max_{P ∈ 𝒫} S(Y_1, . . . , Y_n; P).  (9.2)
Such optimization may not be computationally efficient except for certain similarity
functions, as the partition space is discrete and exponentially large.
In the absence of a well-defined similarity criterion, S , defining the “true” clusters that
satisfy (9.2), the task may be more art than science [2]. Developing contextually appro-
priate similarity functions is thus of standalone interest, and we outline some popular
similarity functions used in the literature.
Efforts have long been made at identifying universal discriminant functions that define
the notion of similarity in classification [74]. Empirical approximation of the Kullback–
Leibler (KL) divergence has been considered as a viable candidate for a universal
discriminant function to perform binary hypothesis testing in the absence of knowledge
of the alternate hypothesis, such that it incurs an error exponent arbitrarily close to the
optimal exponent.
The Euclidean distance is the most common measure of pair-wise distance between
observations used to cluster objects, as evidenced in the k-means clustering algorithm.
A variety of other distances such as the Hamming distance, edit distance, Lempel–Ziv
distance, and more complicated formulations such as those in [78, 79] have also been
considered to quantify similarity in a variety of applications. The more general class of
Bregman divergences have been considered as inter-object distance functions for clus-
tering [80]. On the other hand, a fairly broad class of f -divergence functionals were used
as the notion of distance between two objects in [76]. Both of these studies estimate the
distributions from the data samples Y1 , . . . , Yn and use a notion of pair-wise distance
between the sources on the probability simplex to cluster them.
Clusters can also be viewed as alternate efficient representations of the data, com-
prising sufficient information about the objects. The Kolmogorov complexity is widely
accepted as a measure of the information content in an object [81]. The Kolmogorov
complexity, K(x), of x is the length of the shortest binary program to compute x using a
universal Turing machine. Similarly, K(x|y), the Kolmogorov complexity of x given y, is
the length of the shortest binary program to compute x, given y, using a universal Turing
machine. Starting from the Kolmogorov complexity, in [82] Bennett et al. introduce the
notion of the information distance between two objects X, Y as

max{K(X | Y), K(Y | X)},
and argue that this is a universal cognitive similarity distance. The smaller the informa-
tion distance between two sources, the easier it is to represent one by the other, and hence
the more similar they are. A corresponding normalized version of this distance, called the
normalized information distance, is obtained by normalizing the information distance
by the maximum Kolmogorov complexity of the individual objects [83] and is a useful
metric in clustering sources according to the pair-wise normalized cognitive similarity.
Salient properties of this notion of universal similarity have been studied [84], and
heuristically implemented using word-similarity scores, computed using Google page
counts, as empirical quantifications of the score [85]. While such definitions are the-
oretically universal and generalize a large class of similarity measures, it is typically
not feasible to compute the Kolmogorov complexity. They do, however, inspire other
practical notions of similarity. For instance, the normalized mutual information between
objects has been used as the notion of similarity to maximize the inter-cluster correlation
[86]. The normalized mutual information has also been used for feature selection [87],
a problem that translates to clustering in the feature space.
Whereas a similarity score guides us toward the design of a clustering algorithm, it
is also important to be able to quantitatively compare and evaluate the results of such
methods. One standard notion of the quality of a clustering scheme is comparing it
against the benchmark set by random clustering ensembles. A variety of metrics based
on this notion have been defined.
Arguably the most popular such index was introduced by Rand [1]. Let P1 , P2 ∈ Pc
be two viable partitions of a set of objects {X1 , . . . , Xn }. Let a be the number of pairs
of objects that are in the same cluster in both partitions and b be the number of pairs
of objects that are in different clusters in both partitions. Then, the Rand index for
deterministic clustering algorithms is given by
$$R = \frac{a+b}{\binom{n}{2}},$$
quantifying the similarity of the two partitions as observed through the fraction of pair-
wise comparisons they agree on. For randomized algorithms, this index is adjusted using
expectations of these quantities.
A quantitative comparison of clustering methods can also be made through the nor-
malized mutual information [88]. Specifically, for two partitions P1 , P2 , if p(C1 ,C2 ) is
the fraction of objects that are in cluster C1 in P1 and in cluster C2 in P2 , and if
$\hat{p}_1$, $\hat{p}_2$ are the corresponding marginal distributions, then the normalized mutual information is defined as
$$\mathrm{NMI}(P_1, P_2) = \frac{2\, I(\hat{p})}{H(\hat{p}_1) + H(\hat{p}_2)}.$$
Naturally, the NMI value is higher for more similar partitions.
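As a concrete illustration of these two comparison scores, here is a minimal sketch in Python (assuming only NumPy; the function names are ours and purely illustrative) that evaluates the Rand index and the normalized mutual information for two given label assignments.

import numpy as np
from itertools import combinations

def rand_index(labels1, labels2):
    # Fraction of object pairs on which the two partitions agree
    # (same cluster in both, or different clusters in both).
    agree = sum((a1 == b1) == (a2 == b2)
                for (a1, a2), (b1, b2) in combinations(zip(labels1, labels2), 2))
    n_pairs = len(labels1) * (len(labels1) - 1) // 2
    return agree / n_pairs

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def normalized_mutual_information(labels1, labels2):
    # Empirical joint distribution over pairs of cluster labels.
    c1, c2 = np.unique(labels1), np.unique(labels2)
    joint = np.array([[np.mean((labels1 == a) & (labels2 == b)) for b in c2] for a in c1])
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    mi = entropy(p1) + entropy(p2) - entropy(joint.ravel())
    return 2 * mi / (entropy(p1) + entropy(p2))

labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([0, 0, 1, 2, 2, 2])
print(rand_index(labels_a, labels_b), normalized_mutual_information(labels_a, labels_b))

Both scores equal 1 when the two partitions coincide, and the NMI routine uses the empirical joint distribution of cluster memberships exactly as in the definition above.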
The most prominent criterion for clustering objects is based on a notion of distance
between the samples. Conventional clustering tools in machine learning such as k-
means, support vector machines, linear discriminant analysis, and k-nearest-neighbor
classifiers [91–93] have used Euclidean distances as the notion of distance between
objects represented as points in a feature space. As in Example 9.1, these for instance
translate to optimal universal clustering methods under Gaussian observation models
and Hamming loss. Other similar distances in the feature-space representation of the
observations can be used to perform clustering universally and optimally under varying
statistical and loss models.
In this section, however, we restrict the focus to universal clustering of data sources
in terms of the notions of distance in the conditional distributions generating them. That
is, according to the statistical model of Fig. 9.11, we cluster according to the distance
between the estimated conditional distributions of p(Yi |Li ). We describe clustering under
sample independence in detail first, and then highlight some distance-based methods
used for clustering sources with memory.
Figure 9.12 Data-generation model for universal distance-based clustering. Here we cluster n
sources with labels (L1 , . . . , Ln ) = (1, 2, 2, . . . , ). True labels define the source distributions and,
for each source, m i.i.d. samples are drawn to generate the observation vectors. Source
identicality is used to cluster the sources.
is equivalent to data clustering using pair-wise distances between the spectral densities
of the sources. A similar algorithm using KL divergences between probability distri-
butions was considered in [95]. More generally, the k-means clustering algorithm can
be extended directly to solve the clustering problem under a variety of distortion func-
tions between the distributions corresponding to the sources, in particular, the class of
Bregman divergences [96].
definition 9.5 (Bregman divergence) Let f : R+ → R be a strictly convex function,
and let p0 , p1 be two probability mass functions (p.m.f.s), represented by vectors. The
Bregman divergence is defined as
$$B_f(p_0, p_1) = \sum_{i} \Big[ f(p_0(i)) - f(p_1(i)) - f'(p_1(i))\,\big(p_0(i) - p_1(i)\big) \Big].$$
In particular, we cluster objects that have empirical distributions that are close to each
other in the sense of an f -divergence functional.
definition 9.6 ( f -divergence) Let p, q be discrete probability distributions defined on an alphabet of size m. Given a convex function f : [0, ∞) → R, the f -divergence is defined as
$$D_f(p \,\|\, q) = \sum_{i=1}^{m} q_i\, f\!\left(\frac{p_i}{q_i}\right). \qquad (9.4)$$
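As a small worked example of (9.4), the following sketch (Python with NumPy and SciPy; all names are our own) evaluates the f-divergence between two distributions, recovering the KL divergence and the total variation distance as special cases of the convex function f.

import numpy as np
from scipy.special import xlogy

def f_divergence(p, q, f):
    # D_f(p || q) = sum_i q_i * f(p_i / q_i); assumes q_i > 0 wherever p_i > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = q > 0
    return np.sum(q[mask] * f(p[mask] / q[mask]))

kl_gen = lambda t: xlogy(t, t)           # f(t) = t log t   gives the KL divergence
tv_gen = lambda t: 0.5 * np.abs(t - 1)   # f(t) = |t - 1|/2 gives the total variation distance

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, kl_gen), f_divergence(p, q, tv_gen))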
Figure 9.13 Distance-based clustering of nine objects of three types according to [76]. The graph
is obtained by thresholding the f -divergences of empirical response distributions. The clustering
is then done by identifying the maximal clusters in the thresholded graph.
whose data are generated according to a distribution $p_s$ for some s ∈ S. In general the
set of parameters can be uncountably infinite, and in this sense the set of labels need
not necessarily be discrete. However, this does not affect our study of clustering of a
finite collection of sources. We are interested in separating the outliers from the typical
sources, and thus we wish to cluster the sources into two clusters. In the presence of
knowledge of the typical and outlier distributions, the optimal detection rule is charac-
terized by the generalized likelihood ratio test (GLRT) wherein each sequence can be
tested for being an outlier as
where the threshold η is chosen depending on the priors of typical and outlier
distributions, and decision 1 implies that the sequence is drawn from an outlier.
In universal outlier hypothesis testing [77], we are unaware of the typical and out-
lier hypothesis distributions. The aim is then to design universal tests such that we
can detect the outlier sequences universally. We know that, for any distribution p and
a sequence of i.i.d. observations $y^m$ drawn according to this distribution, if $\hat{p}$ is the empirical distribution of the observed sequence, then
$$p(y^m) = \exp\!\big\{-m\big(H(\hat{p}) + D(\hat{p} \,\|\, p)\big)\big\}. \qquad (9.5)$$
Then any likelihood ratio for two distributions essentially depends on the difference
in KL divergences of the corresponding distributions from the empirical estimate.
Hence, the outlier testing problem is formulated as one of clustering typical and out-
lier sequences according to the KL divergences of the empirical distributions from
the cluster centroids by using (9.5) in a variety of settings of interest [99–101].
Efficient clustering-based outlier detection methods, with a linear computational com-
plexity in the number of sources, with universal exponential consistency have also
been devised [102]. Here the problem is addressed by translating it into one of clus-
tering according to the empirical distributions, with the KL divergence as the similarity
measure.
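A minimal sketch in this spirit, assuming i.i.d. samples over a finite alphabet (Python with NumPy/SciPy; the scoring rule and names are ours and only illustrate the idea, not the exact procedures of [99–102]): each sequence is scored by the KL divergence of its empirical distribution from the empirical distribution pooled over all other sequences, and the largest score flags the outlier.

import numpy as np
from scipy.special import rel_entr  # elementwise x * log(x / y)

def empirical(seq, alphabet_size):
    counts = np.bincount(seq, minlength=alphabet_size)
    return counts / counts.sum()

def outlier_scores(sequences, alphabet_size):
    # Score sequence i by D(p_hat_i || q_hat_i), where q_hat_i pools all other sequences.
    scores = []
    for i, seq in enumerate(sequences):
        p_hat = empirical(seq, alphabet_size)
        rest = np.concatenate([s for j, s in enumerate(sequences) if j != i])
        q_hat = empirical(rest, alphabet_size)
        scores.append(rel_entr(p_hat, q_hat).sum())
    return np.array(scores)

rng = np.random.default_rng(0)
typical = [rng.choice(4, size=200, p=[0.4, 0.3, 0.2, 0.1]) for _ in range(5)]
outlier = [rng.choice(4, size=200, p=[0.1, 0.2, 0.3, 0.4])]
scores = outlier_scores(typical + outlier, alphabet_size=4)
print(scores.argmax())  # index 5: the outlier sequence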
A sequential testing variation of the problem has also been studied [103], where the
sequential probability ratio test (SPRT) is adapted to incorporate (9.5) in the likelihood
ratios. In the presence of a unique outlier distribution μ, the universal test is exponen-
tially consistent with the optimal error exponent given by 2B(π, μ), where B(p, q) is the
Bhattacharyya distance between the distributions p and q. Further, this approach is also
exponentially consistent when there exists at least one outlier, and it is consistent in the
case of the null hypothesis (no outlier distributions) [77].
components of the vector are dependent. Let us assume that the channel W is the
same – memoryless. For such sources, we now describe some novel universal clustering
methods.
As noted earlier, compressed streams represent cluster centers either as themselves
or as clusters in the space of the codewords. In this sense, compression techniques
have been used in defining clustering algorithms for images [104]. In particular, a vari-
ety of image segmentation algorithms have been designed starting from compression
techniques that fundamentally employ distance-based clustering mechanisms over the
sources generating the image features [105–107].
Compression also inspires more distance-based clustering algorithms, using the com-
pression length of each sequence when compressed by a selected compressor C. The
normalized compression distance between sequences X, Y is defined as
$$\mathrm{NCD}(X, Y) = \frac{C(X, Y) - \min\{C(X), C(Y)\}}{\max\{C(X), C(Y)\}}, \qquad (9.6)$$
where C(X, Y) is the length of compressing the pair. Thus, the normalized compression
distance represents the closeness of the compressed representations of the sequences, thereby translating to closeness of the sources as dictated by the compressor C.
Thus, the choice of compression scheme here characterizes the notion of similarity
and the corresponding universal clustering model it applies to. Clustering schemes
based on the NCD minimizing (maximizing) the intra-cluster (inter-cluster) distance
are considered in [108].
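A minimal sketch of (9.6) with an off-the-shelf compressor playing the role of C (Python with zlib; estimating C(X, Y) by compressing the concatenation of the two sequences is our implementation choice):

import zlib

def clen(data: bytes) -> int:
    # Compressed length under the chosen compressor C (here, zlib at level 9).
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance, with C(X, Y) estimated by compressing
    # the concatenation of the two sequences.
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

x1 = b"the quick brown fox jumps over the lazy dog " * 20
x2 = b"the quick brown fox leaps over the lazy dog " * 20
x3 = b"colorless green ideas sleep furiously today " * 20
print(ncd(x1, x2), ncd(x1, x3))  # the similar pair should score lower

Any universal compressor can be substituted for zlib; as discussed above, the choice of C determines which regularities count as similarity.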
Clustering of stationary, ergodic random processes according to the source distribu-
tion has also been studied, wherein two objects are in the same cluster if and only if
they are drawn from the same distribution [109]. The algorithm uses empirical esti-
mates of a weighted L1 distance between the source distributions as obtained from the
data streams corresponding to each source. Statistical consistency of the algorithm is
established for generic ergodic sources. This method has also been used for clustering
time-series data [110].
The works we have summarized here represent a small fraction of the distance-based
clustering methods in the literature. It is important to notice the underlying thread of
universality in each of these methods as they do not assume explicit knowledge of the
statistics that defines the objects to be clustered. Thus we observe that, in the universal
framework, one is able to perform clustering reliably, and often with strong guaran-
tees such as order-optimality and exponential consistency using information-theoretic
methods.
Another large class of clustering algorithms consider the dependence among random
variables to cluster them. Several applications such as epidemiology and meteorology,
for instance, generate temporally or spatially correlated sources of information with an
element of dependence across sources belonging to the same label. We study the task
of clustering such sources in this section, highlighting some of these formulations and
universal solutions under two main classes – independence clustering using graphical
models and clustering using multivariate information functionals.
In this section we study the problem of independence clustering under similar con-
straints on the statistical models to obtain insightful results and algorithms. A class of
graphical models that has been well understood is that of the Ising model [120–122].
Most generally, a fundamental threshold for conditional mutual information has been
used in devising an iterative algorithm for structure recovery in Ising models [122]. For
the independence clustering problem, we can use the same threshold to design a simple
extension of this algorithm to define an iterative independence clustering algorithm for
Ising models.
Beside such restricted distribution classes, a variety of other techniques involving
greedy algorithms [123], maximum-likelihood estimation [124], variational methods
[125], locally tree-like grouping strategies [126], and temporal sampling [127] have been
studied for dependence structure recovery using samples, for all possible distributions
under specific families of graphs and alphabets. The universality of these methods in
recovering the graphical model helps define direct extensions of these algorithms to
perform universal independence clustering.
Clustering using Bayesian network recovery has been studied in a universal crowd-
sourcing context [76], where the conditional dependence structure is used to identify
object similarity. We consider object clustering using responses of a crowdsourcing sys-
tem with long-term employees. Here, the system employs m workers such that each
worker labels each of the n objects. The responses of a given worker are assumed to
be dependent in a Markov fashion on the responses to the previous object and the most
recent object of the same class as shown in Fig. 9.14.
More generally, relating to the universal clustering model of Fig. 9.11, this version
of independence clustering reduces to Fig. 9.15. That is, the crowd workers observe the
object according to their true label, and provide a label appropriately, depending on the
observational noise introduced by the channel W. Thus, the latent vector representation
is a simple repetition code of the true label. The responses of the crowd workers are
defined by the Markov channel W, such that the response to an object is dependent on
that offered for the most recent object of the same label, and the most recent object.
Here, we define a clustering algorithm that identifies the clusters by recovering the
Bayesian network from the responses of workers. In particular, the algorithm com-
putes the maximum-likelihood (ML) estimates of the mutual information between the
responses to pairs of objects and, using the data-processing inequality, reconstructs the
Bayesian network defining the worker responses. As elaborated in [98] the number of
responses per object required for reliable clustering of n objects is m = O(log n), which is
order-optimal. The method and algorithm directly extend to graphical models with any
finite-order Markov dependences.
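A highly simplified sketch of the mutual-information step (Python with NumPy; this is our illustration of the idea rather than the algorithm of [76, 98]): estimate the pairwise mutual information between the response vectors of each pair of objects with the plug-in (ML) estimator, and place objects whose responses are strongly dependent in the same cluster.

import numpy as np

def plugin_mi(x, y, k):
    # Plug-in (ML) estimate of I(X; Y) from paired response vectors with entries in {0, ..., k-1}.
    joint = np.zeros((k, k))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def cluster_by_dependence(responses, k, threshold):
    # responses[i][t] = label given by worker t to object i; objects whose responses
    # are strongly dependent (large estimated MI) are placed in the same cluster.
    n = len(responses)
    labels, next_label = -np.ones(n, dtype=int), 0
    for i in range(n):
        if labels[i] < 0:
            labels[i], next_label = next_label, next_label + 1
        for j in range(i + 1, n):
            if labels[j] < 0 and plugin_mi(responses[i], responses[j], k) > threshold:
                labels[j] = labels[i]
    return labels

In practice the threshold would be tied to the channel parameters; as noted above, m = O(log n) responses per object suffice for reliable clustering [98].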
Figure 9.14 Bayesian network model of responses to a set of seven objects chosen from a set of
three types in [76]. The most recent response and the response to the most recent object of the
same type influence a response.
Figure 9.15 Block diagram of independence clustering under a temporal dependence structure.
Here we cluster n objects with labels (L1 , . . . , Ln ) = (1, 2, 2, . . . , ). The latent vectors are
m-dimensional repetitions of the true label. The observations are drawn according to the Markov
channel that introduces dependences in observations across objects depending on the true labels.
Clustering is performed according to the inferred conditional dependences in the observations.
where H(XC ) is the joint entropy of all random variables in the cluster C.
From the definition of the partition information, it is evident that the average
inter-cluster (intra-cluster) dependence is minimized (maximized) by a partition that
minimizes the partition information.
definition 9.9 (Minimum partition information) The minimum partition information
(MPI) of a set of random variables X1 , . . . , Xn is defined as
$$I(X_1; \ldots; X_n) = \min_{P \in \mathcal{P},\, |P| > 1} I_P(X_1; \ldots; X_n). \qquad (9.12)$$
The finest partition minimizing the partition information is referred to as the fundamental partition; that is, if $\mathcal{P}^* := \arg\min_{P \in \mathcal{P},\, |P|>1} I_P(X_1; \ldots; X_n)$ denotes the set of minimizers, then the fundamental partition is the $P^* \in \mathcal{P}^*$ that is finer than every other $P \in \mathcal{P}^*$.
The MPI finds operational significance as the capacity of the multiterminal secret-key
agreement problem [146]. More recently, the change in MPI with the addition or removal
of sources of common randomness was identified [147], giving us a better understanding
of the multivariate information functional.
The MPI finds functional use in universal clustering under communication, in the
absence of knowledge of the codebook (also thought of as communicating with aliens)
[28, 58]. It is of course infeasible to recover the messages when the codebook is not
available, and so the focus here is on clustering similar messages according to the trans-
mitted code stream. In particular, the empirical MPI was used as the universal decoder
for clustering messages transmitted through an unknown discrete memoryless channel,
when the decoder is unaware of the codebook used for encoding the messages. By
identifying the received codewords that are most dependent on each other, the decoder
optimally clusters the messages up to the capacity of the channel.
Clustering random variables under the Chow–Liu tree approximation by minimizing
the partition information is studied in [148]. A more general version of the clustering
problem that is obtained by minimizing the partition information is considered in [149],
where the authors describe the clustering algorithm, given the joint distribution.
Identifying the optimal partition is often computationally expensive as the number
of partitions is exponential in the number of objects. An efficient method to cluster the
random variables is identified in [149]. We know that entropy is a submodular function
as for any set of indices A, B,
H(XA∪B ) + H(XA∩B ) ≤ H(XA ) + H(XB ),
as conditioning reduces entropy. The equality holds if and only if XA ⊥ XB . For any
A ⊆ [n], let
$$h_\gamma(A) = H(X_A) - \gamma, \qquad \hat{h}_\gamma(A) = \min_{P \in \mathcal{P}(A)} \sum_{C \in P} h_\gamma(C), \qquad (9.13)$$
where $\mathcal{P}(A)$ is the set of all partitions of the index set A. Then, for any two disjoint subsets $A_1$ and $A_2$ of [n], we have
$$\hat{h}_\gamma(A_1 \cup A_2) = \min_{P \in \mathcal{P}(A_1 \cup A_2)} \sum_{C \in P} h_\gamma(C) \;\le\; \min_{P \in \mathcal{P}(A_1)} \sum_{C \in P} h_\gamma(C) + \min_{P \in \mathcal{P}(A_2)} \sum_{C \in P} h_\gamma(C) \;=\; \hat{h}_\gamma(A_1) + \hat{h}_\gamma(A_2).$$
Example 9.4 Continuing with the example of the multivariate Gaussian random vari-
able in Example 9.3, the minimum partition information is determined by submodular minimization over the functions $f : 2^{[n]} \to \mathbb{R}$ defined by $f(C) = \log(|\Sigma_C|)$, since the optimal partition is obtained as
$$P^* = \arg\min_{P \in \mathcal{P},\, |P| > 1} \frac{1}{|P| - 1} \left( \sum_{C \in P} \log(|\Sigma_C|) - \log(|\Sigma|) \right),$$
where $\log(|\Sigma_C|) = \sum_i \log \lambda_i^C$ and $\log(|\Sigma|) = \sum_i \log \lambda_i$, with $\lambda_i^C$ the eigenvalues of the covariance submatrix $\Sigma_C$ and $\lambda_i$ the eigenvalues of $\Sigma$. In this sense, clustering according to the multivariate information translates to a clustering similar to spectral clustering.
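A minimal sketch of this computation for a small Gaussian example (Python with NumPy; we brute-force the search over partitions rather than use the efficient submodular-minimization route of [149], so it is only practical for small n):

import numpy as np

def partitions(elements):
    # Recursively enumerate all set partitions of a list of indices.
    if len(elements) == 1:
        yield [elements]
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller

def gaussian_partition_information(Sigma, P):
    # I_P for a zero-mean Gaussian with covariance Sigma (in nats).
    terms = sum(np.linalg.slogdet(Sigma[np.ix_(C, C)])[1] for C in P)
    return 0.5 * (terms - np.linalg.slogdet(Sigma)[1]) / (len(P) - 1)

def fundamental_partition(Sigma):
    idx = list(range(Sigma.shape[0]))
    candidates = [P for P in partitions(idx) if len(P) > 1]
    values = [gaussian_partition_information(Sigma, P) for P in candidates]
    best = min(values)
    # Among minimizers, return the finest one (largest number of blocks).
    return max((P for P, v in zip(candidates, values) if np.isclose(v, best)),
               key=len), best

# Two weakly coupled pairs of strongly correlated variables.
Sigma = np.array([[1.0, 0.9, 0.1, 0.1],
                  [0.9, 1.0, 0.1, 0.1],
                  [0.1, 0.1, 1.0, 0.9],
                  [0.1, 0.1, 0.9, 1.0]])
print(fundamental_partition(Sigma))  # expected: blocks {0, 1} and {2, 3}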
For another view on the optimal clustering solution to this problem, relating to the
independence clustering equivalent, let us compare this case with Example 9.2. The
independence clustering solution for Gaussian graphical models recovered the block-
diagonal decomposition of the covariance matrix. Here, we perform a relaxed version of
the same retrieval wherein we return the block-diagonal decomposition with minimum
normalized entropy difference from the given distribution.
The multiinformation has, for instance, been used in identifying function transforma-
tions that maximize the multivariate correlation within the set of random variables, using
a greedy search algorithm [86]. It has also been used in devising unsupervised methods
for identifying abstract structure in data from areas like genomics and natural language
[152]. In both of these studies the multiinformation proves to be a more robust measure
of correlation between data sources and hence a stronger notion to characterize the rela-
tionship between them. The same principle is exploited in devising meaningful universal
clustering algorithms.
Figure 9.16 Source model for multi-image clustering according to mutual independence of
images. Here we cluster n images with labels (L1 , . . . , Ln ) = (1, 2, 2, . . . , ). The labels define the
base scene being imaged (latent vectors). The scenes are subject to an image-capture noise which
is independent across scenes and pixels. Clustering is performed according to these captured
images universally.
Example 9.5 For a jointly Gaussian random vector (X1 , . . . , Xn ) ∼ N(0, Σ), the multi-
information is given by
$$I_M(X_1; \ldots; X_n) = \frac{1}{2} \log\!\left(\frac{\prod_{i \in [n]} \sigma_i^2}{|\Sigma|}\right), \qquad (9.16)$$
where $\sigma_i^2$ is the variance of $X_i$.
Then, the cluster information is given by
$$I_C^{(P)}(X_1; \ldots; X_n) = \frac{1}{2} \log\!\left(\frac{\prod_{i \in [n]} \sigma_i^2}{|\Sigma_P|}\right), \qquad (9.17)$$
where again $[\Sigma_P]_{i,j} = [\Sigma]_{i,j}\, \mathbf{1}\{i, j \in C \text{ for some } C \in P\}$ for all $(i, j) \in [n]^2$. If $\tilde{P} = \{\{1\}, \ldots, \{n\}\}$ denotes the partition of singletons, then $\Sigma_{\tilde{P}}$ is diagonal and the cluster information is essentially given by
$$I_C^{(P)}(X_1; \ldots; X_n) = \frac{1}{2} \log\frac{|\Sigma_{\tilde{P}}|}{|\Sigma_P|}.$$
Thus the cluster information represents the information in the clustered form as compared against the singletons.
The illum information is defined as
$$L(X_1; \ldots; X_n) := D\!\left(\prod_{i \in [n]} Q_i \,\Big\|\, Q\right),$$
where $(X_1, \ldots, X_n) \sim Q$, and $Q_i$ is the marginal distribution of $X_i$ for all $i \in [n]$. The cluster version of the illum information is correspondingly defined as
$$L_C^{(P)} = D\!\left(\prod_{C \in P} Q_C \,\Big\|\, Q\right).$$
Example 9.6 For the jointly Gaussian random vector $(X_1, \ldots, X_n) \sim \mathcal{N}(0, \Sigma)$, let $\Sigma = \sum_{i=1}^{n} \lambda_i u_i u_i^T$ be the orthonormal eigendecomposition, so that $|\Sigma| = \prod_{i=1}^{n} \lambda_i$, and write $\tilde{\Sigma} = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ for the diagonal matrix of marginal variances. Then the illum information is given by
$$L(X_1; \ldots; X_n) = \frac{1}{2} \sum_{i=1}^{n} \left[ \frac{u_i^T \tilde{\Sigma} u_i}{\lambda_i} - \log\!\left(\frac{\sigma_i^2}{\lambda_i}\right) - 1 \right], \qquad (9.19)$$
Example 9.7 For a pair-wise Markov random field (MRF) defined on a graph G =
(V, E) as
$$Q_G(X_1, \ldots, X_n) = \exp\!\left(\sum_{i \in V} \psi_i(X_i) + \sum_{(i,j) \in E} \psi_{ij}(X_i, X_j) - A(\psi)\right), \qquad (9.20)$$
where $A(\psi)$ is the log partition function, the sum information [155], which is defined as
$$S(X_1; \ldots; X_n) := D\!\left(Q \,\Big\|\, \prod_{i \in [n]} Q_i\right) + D\!\left(\prod_{i \in [n]} Q_i \,\Big\|\, Q\right),$$
reduces to
$$S(X_1; \ldots; X_n) = \sum_{(i,j) \in E} \Big( \mathbb{E}_{Q_G}\big[\psi_{ij}(X_i, X_j)\big] - \mathbb{E}_{Q_i Q_j}\big[\psi_{ij}(X_i, X_j)\big] \Big).$$
That is, the sum information is dependent only on the expectation of the pair-wise poten-
tial functions defining the MRF. In particular, it is independent of the partition function
and thus statistically and computationally easier to estimate when the potential functions
are known. This also gives us an alternative handle on information-based dependence
structure recovery.
From Example 9.7, we see the advantages of the illum information and other corre-
spondingly defined multivariate information functionals in universal clustering of data
sources. Thus, depending on the statistical family defining the data sources, one multivariate information functional may be better suited than another for universal clustering.
9.6 Applications
Data often include all the information we require for processing them, and the difficulty
in extracting useful information typically lies in understanding their fundamental struc-
ture and in designing effective methods to process such data. Universal methods aim to
learn this information and perform unsupervised learning tasks such as clustering, and they provide considerable insight into problems about which we have little prior understanding. In this sec-
tion we give a very brief overview of some of the applications of information-theoretic
clustering algorithms.
For instance, in [28] Misra hypothesizes the ability to communicate with aliens by
clustering messages under a universal communication setting without knowledge of the
codebook. That is, if one presumes that aliens communicate using a random codebook,
then the codewords can be clustered reliably, albeit without the ability to recover the
mapping between message and codeword.
The information-bottleneck method has been used for document clustering [157] and
for image clustering [158] applications. Similarly, clustering of pieces of music through
universal compression techniques has been studied [159, 160].
Clustering using the minimum partition information has been considered in biology
[149]. The information content in gene-expression patterns is exploited through info-
clustering for genomic studies of humans. Similarly, firing patterns of neurons over time
are used as data to better understand neural stimulation of the human brain by identifying
modular structures through info-clustering.
Clustering of images in the absence of knowledge of image and noise models along
with joint registration has been considered [153]. It is worth noting that statistically
consistent clustering and registration of images can be performed with essentially no
information about the source and channel models.
Multivariate information functionals in the form of multiinformation have also been
utilized in clustering data sources according to their latent factors in a hierarchical fash-
ion [152, 161]. This method, which is related to ICA in the search for latent factors
using information, has been found to be efficient in clustering statements according
to personality types, in topical characterization of news, and in learning hierarchical
clustering from DNA strands. The method has in fact been leveraged in social-media
content categorization with the motivation of identifying useful content for disaster
response [162].
Beside direct clustering applications, the study of universal clustering is significant
for understanding fundamental performance limits of systems. As noted earlier, under-
standing the task of clustering using crowdsourced responses, without knowledge of the
crowd channels, helps establish fundamental benchmarks for practical systems [76].
9.7 Conclusion
Most analyses performed within this framework have considered first-order optimality criteria for the performance of the algorithms, such as error exponents. These results are
useful when we have sufficiently large (asymptotic) amounts of data. However, when
we have finite (moderately large) data, stronger theoretical results focusing on second-
and third-order characteristics of the performance through CLT-based finite-blocklength
analyses are necessary [165].
Information functionals, which are often used in universal clustering algorithms,
have notoriously high computational and/or sample complexities of estimation and thus
an important future direction of research concerns the design of robust and efficient
estimators [166].
These are just some of the open problems in the rich and burgeoning area of universal
clustering.
References
[1] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” J. Am. Statist.
Assoc., vol. 66, no. 336, pp. 846–850, 1971.
[2] U. von Luxburg, R. C. Williamson, and I. Guyon, “Clustering: Science or art?” in Proc.
29th International Conference on Machine Learning (ICML 2012), 2012, pp. 65–79.
[3] G. C. Bowker and S. L. Star, Sorting things out: Classification and its consequences. MIT
Press, 1999.
[4] D. Niu, J. G. Dy, and M. I. Jordan, “Iterative discovery of multiple alternative clustering
views,” IEEE Trans. Pattern Analysis Machine Intelligence, vol. 36, no. 7, pp. 1340–1353,
2014.
[5] L. Valiant, Probably approximately correct: Nature’s algorithms for learning and pros-
pering in a complex world. Basic Books, 2013.
[6] J. T. Vogelstein, Y. Park, T. Ohyama, R. A. Kerr, J. W. Truman, C. E. Priebe, and M. Zlatic,
“Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure
learning,” Science, vol. 344, no. 6182, pp. 386–392, 2014.
[7] R. E. Sanderson, A. Helmi, and D. W. Hogg, “Action-space clustering of tidal streams to
infer the Galactic potential,” Astrophys. J., vol. 801, no. 2, 18 pages, 2015.
[8] E. Gibson, R. Futrell, J. Jara-Ettinger, K. Mahowald, L. Bergen, S. Ratnasingam, M. Gib-
son, S. T. Piantadosi, and B. R. Conway, “Color naming across languages reflects color
use,” Proc. Natl. Acad. Sci. USA, vol. 114, no. 40, pp. 10 785–10 790, 2017.
[9] C. E. Shannon, “Bits storage capacity,” Manuscript Division, Library of Congress,
handwritten note, 1949.
[10] M. Weldon, The Future X Network: A Bell Labs perspective. CRC Press, 2015.
[11] C. Lintott, K. Schawinski, S. Bamford, A. Slosar, K. Land, D. Thomas, E. Edmondson,
K. Masters, R. C. Nichol, M. J. Raddick, A. Szalay, D. Andreescu, P. Murray, and J. Van-
denberg, “Galaxy Zoo 1: Data release of morphological classifications for nearly 900 000
galaxies,” Monthly Notices Roy. Astron. Soc., vol. 410, no. 1, pp. 166–178, 2010.
[12] A. Kittur, E. H. Chi, and B. Suh, “Crowdsourcing user studies with Mechanical Turk,”
in Proc. SIGCHI Conference on Human Factors in Computational Systems (CHI 2008),
2008, pp. 453–456.
[32] J. Ziv, “Coding theorems for individual sequences,” IEEE Trans. Information Theory, vol.
24, no. 4, pp. 405–412, 1978.
[33] A. D. Wyner and J. Ziv, “The rate-distortion function for source coding with side infor-
mation at the decoder,” IEEE Trans. Information Theory, vol. 22, no. 1, pp. 1–10,
1976.
[34] J. J. Rissanen, “A universal data compression system,” IEEE Trans. Information Theory,
vol. 29, no. 5, pp. 656–664, 1983.
[35] R. Gallager, “Variations on a theme by Huffman,” IEEE Trans. Information Theory, vol.
24, no. 6, pp. 668–674, 1978.
[36] J. C. Lawrence, “A new universal coding scheme for the binary memoryless source,”
IEEE Trans. Information Theory, vol. 23, no. 4, pp. 466–472, 1977.
[37] J. Ziv, “Coding of sources with unknown statistics – Part I: Probability of encoding error,”
IEEE Trans. Information Theory, vol. 18, no. 3, pp. 384–389, 1972.
[38] L. D. Davisson, “Universal noiseless coding,” IEEE Trans. Information Theory, vol. 19,
no. 6, pp. 783–795, 1973.
[39] D. Slepian and J. K. Wolf, “Noiseless coding of correlated information sources,” IEEE
Trans. Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
[40] T. M. Cover, “A proof of the data compression theorem of Slepian and Wolf for ergodic
sources,” IEEE Trans. Information Theory, vol. 21, no. 2, pp. 226–228, 1975.
[41] I. Csiszár, “Linear codes for sources and source networks: Error exponents, universal
coding,” IEEE Trans. Information Theory, vol. 28, no. 4, pp. 585–592, 1982.
[42] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[43] T. Berger, Rate distortion theory: A mathematical basis for data compression. Prentice-
Hall, 1971.
[44] J. Ziv, “Coding of sources with unknown statistics – Part II: Distortion relative to a fidelity
criterion,” IEEE Trans. Information Theory, vol. 18, no. 3, pp. 389–394, May 1972.
[45] J. Ziv, “On universal quantization,” IEEE Trans. Information Theory, vol. 31, no. 3, pp.
344–347, 1985.
[46] E. Hui Yang and J. C. Kieffer, “Simple universal lossy data compression schemes derived
from the Lempel–Ziv algorithm,” IEEE Trans. Information Theory, vol. 42, no. 1, pp.
239–245, 1996.
[47] A. D. Wyner, J. Ziv, and A. J. Wyner, “On the role of pattern matching in information
theory,” IEEE Trans. Information Theory, vol. 44, no. 6, pp. 2045–2056, 1998.
[48] C. E. Shannon, “Communication in the presence of noise,” Proc. IRE, vol. 37, no. 1, pp.
10–21, 1949.
[49] V. D. Goppa, “Nonprobabilistic mutual information without memory,” Problems Control
Information Theory, vol. 4, no. 2, pp. 97–102, 1975.
[50] I. Csiszár, “The method of types,” IEEE Trans. Information Theory, vol. 44, no. 6, pp.
2505–2523, 1998.
[51] P. Moulin, “A Neyman–Pearson approach to universal erasure and list decoding,” IEEE
Trans. Information Theory, vol. 55, no. 10, pp. 4462–4478, 2009.
[52] N. Merhav, “Universal decoding for arbitrary channels relative to a given class of
decoding metrics,” IEEE Trans. Information Theory, vol. 59, no. 9, pp. 5566–5576, 2013.
[53] C. E. Shannon, “Certain results in coding theory for noisy channels,” Information Control,
vol. 1, no. 1, pp. 6–25, 1957.
[54] A. Feinstein, “On the coding theorem and its converse for finite-memory channels,”
Information Control, vol. 2, no. 1, pp. 25–44, 1959.
[55] I. Csiszár and P. Narayan, “Capacity of the Gaussian arbitrarily varying channel,” IEEE
Trans. Information Theory, vol. 37, no. 1, pp. 18–26, 1991.
[56] J. Ziv, “Universal decoding for finite-state channels,” IEEE Trans. Information Theory,
vol. 31, no. 4, pp. 453–460, 1985.
[57] M. Feder and A. Lapidoth, “Universal decoding for channels with memory,” IEEE Trans.
Information Theory, vol. 44, no. 5, pp. 1726–1745, 1998.
[58] V. Misra and T. Weissman, “Unsupervised learning and universal communication,” in
Proc. 2013 IEEE International Symposium on Information Theory, 2013, pp. 261–265.
[59] N. Merhav, "Universal decoding using a noisy codebook," IEEE Trans. Information Theory, vol. 64, no. 4, pp. 2231–2239, 2018.
[60] A. Lapidoth and P. Narayan, "Reliable communication under channel uncertainty," IEEE
Trans. Information Theory, vol. 44, no. 6, pp. 2148–2177, 1998.
[61] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio test
optimal?” IEEE Trans. Information Theory, vol. 38, no. 5, pp. 1597–1602, 1992.
[62] M. Feder and N. Merhav, “Universal composite hypothesis testing: A competitive
minimax approach,” IEEE Trans. Information Theory, vol. 48, no. 6, pp. 1504–1517,
2002.
[63] E. Levitan and N. Merhav, “A competitive Neyman–Pearson approach to universal
hypothesis testing with applications,” IEEE Trans. Information Theory, vol. 48, no. 8,
pp. 2215–2229, 2002.
[64] M. Feder, N. Merhav, and M. Gutman, “Universal prediction of individual sequences,”
IEEE Trans. Information Theory, vol. 38, no. 4, pp. 1258–1270, 1992.
[65] M. Feder, “Gambling using a finite state machine,” IEEE Trans. Information Theory,
vol. 37, no. 5, pp. 1459–1465, 1991.
[66] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger, “Universal
discrete denoising: Known channel,” IEEE Trans. Information Theory, vol. 51, no. 1,
pp. 5–28, 2005.
[67] E. Ordentlich, K. Viswanathan, and M. J. Weinberger, “Twice-universal denoising,” IEEE
Trans. Information Theory, vol. 59, no. 1, pp. 526–545, 2013.
[68] T. Bendory, N. Boumal, C. Ma, Z. Zhao, and A. Singer, "Bispectrum inversion with application to multireference alignment," IEEE Trans. Signal Processing, vol. 66, no. 4, pp. 1037–1050, 2018.
[69] E. Abbe, J. M. Pereira, and A. Singer, “Sample complexity of the Boolean multirefer-
ence alignment problem,” in Proc. 2017 IEEE International Symposium on Information
Theory, 2017, pp. 1316–1320.
[70] A. Pananjady, M. J. Wainwright, and T. A. Courtade, “Denoising linear models with per-
muted data,” in Proc. 2017 IEEE International Symposium on Information Theory, 2017,
pp. 446–450.
[71] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” Int. J.
Computer Vision, vol. 24, no. 2, pp. 137–154, 1997.
[72] R. K. Raman and L. R. Varshney, “Universal joint image clustering and registration using
partition information,” in Proc. 2017 IEEE International Symposium on Information
Theory, 2017, pp. 2168–2172.
[73] J. Stein, J. Ziv, and N. Merhav, “Universal delay estimation for discrete channels,” IEEE
Trans. Information Theory, vol. 42, no. 6, pp. 2085–2093, 1996.
[74] J. Ziv, “On classification with empirically observed statistics and universal data compres-
sion,” IEEE Trans. Information Theory, vol. 34, no. 2, pp. 278–286, 1988.
[75] N. Merhav, “Universal classification for hidden Markov models,” IEEE Trans. Informa-
tion Theory, vol. 37, no. 6, pp. 1586–1594, Nov. 1991.
[76] R. K. Raman and L. R. Varshney, “Budget-optimal clustering via crowdsourcing,” in Proc.
2017 IEEE International Symposium on Information Theory, 2017, pp. 2163–2167.
[77] Y. Li, S. Nitinawarat, and V. V. Veeravalli, “Universal outlier hypothesis testing,” IEEE
Trans. Information Theory, vol. 60, no. 7, pp. 4066–4082, 2014.
[78] G. Cormode, M. Paterson, S. C. Sahinalp, and U. Vishkin, “Communication complex-
ity of document exchange,” in Proc. 11th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA ’00), 2000, pp. 197–206.
[79] S. Muthukrishnan and S. C. Sahinalp, “Approximate nearest neighbors and sequence
comparison with block operations,” in Proc. 32nd Annual ACM Symposium on Theory
Computation (STOC ’00), 2000, pp. 416–424.
[80] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman
divergences,” J. Machine Learning Res., vol. 6, pp. 1705–1749, 2005.
[81] M. Li and P. Vitányi, An introduction to Kolmogorov complexity and its applications,
3rd edn. Springer, 2008.
[82] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. H. Zurek, “Information distance,”
IEEE Trans. Information Theory, vol. 44, no. 4, pp. 1407–1423, 1998.
[83] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, “The similarity metric,” IEEE Trans.
Information Theory, vol. 50, no. 12, pp. 3250–3264, 2004.
[84] P. Vitanyi, “Universal similarity,” in Proc. IEEE Information Theory Workshop (ITW ’05),
2005, pp. 238–243.
[85] R. L. Cilibrasi and P. M. B. Vitányi, “The Google similarity distance,” IEEE Trans.
Knowledge Data Engineering, vol. 19, no. 3, pp. 370–383, 2007.
[86] H. V. Nguyen, E. Müller, J. Vreeken, P. Efros, and K. Böhm, “Multivariate maximal
correlation analysis,” in Proc. 31st Internatinal Conference on Machine Learning (ICML
2014), 2014, pp. 775–783.
[87] P. A. Estévez, M. Tesmer, C. A. Perez, and J. M. Zurada, “Normalized mutual infor-
mation feature selection,” IEEE Trans. Neural Networks, vol. 20, no. 2, pp. 189–201,
2009.
[88] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure
identification,” J. Statist. Mech., vol. 2005, p. P09008, 2005.
[89] A. J. Gates and Y.-Y. Ahn, “The impact of random models on clustering similarity,” J.
Machine Learning Res., vol. 18, no. 87, pp. 1–28, 2017.
[90] J. Lewis, M. Ackerman, and V. de Sa, “Human cluster evaluation and formal quality
measures: A comparative study,” in Proc. 34th Annual Conference on Cognitive Science
in Society, 2012.
[91] J. MacQueen, “Some methods for classification and analysis of multivariate observa-
tions,” in Proc. 5th Berkeley Symposium on Mathematics Statistics and Probability, 1967,
pp. 281–297.
[92] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification, 2nd edn. Wiley, 2001.
[93] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[94] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE
Trans. Communication, vol. 28, no. 1, pp. 84–95, 1980.
[113] F. R. Bach and M. I. Jordan, “Beyond independent components: Trees and clusters,” J.
Machine Learning Res., vol. 4, no. 12, pp. 1205–1233, 2003.
[114] C. K. Chow and C. N. Liu, “Approximating discrete probability distributions with
dependence trees,” IEEE Trans. Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[115] D. M. Chickering, “Learning Bayesian networks is NP-complete,” in Learning from data,
D. Fisher and H.-J. Lenz, eds. Springer, 1996, pp. 121–130.
[116] A. Montanari and J. A. Pereira, “Which graphical models are difficult to learn?” in Proc.
Advances in Neural Information Processing Systems 22, 2009, pp. 1303–1311.
[117] P. Abbeel, D. Koller, and A. Y. Ng, “Learning factor graphs in polynomial time and
sample complexity,” J. Machine Learning Res., vol. 7, pp. 1743–1788, 2006.
[118] Z. Ren, T. Sun, C.-H. Zhang, and H. H. Zhou, “Asymptotic normality and optimali-
ties in estimation of large Gaussian graphical models,” Annals Statist., vol. 43, no. 3,
pp. 991–1026, 2015.
[119] P.-L. Loh and M. J. Wainwright, “Structure estimation for discrete graphical models: Gen-
eralized covariance matrices and their inverses,” in Proc. Advances in Neural Information
Processing Systems 25, 2012, pp. 2087–2095.
[120] N. P. Santhanam and M. J. Wainwright, “Information-theoretic limits of selecting binary
graphical models in high dimensions,” IEEE Trans. Information Theory, vol. 58, no. 7,
pp. 4117–4134, 2012.
[121] L. Bachschmid-Romano and M. Opper, “Inferring hidden states in a random kinetic Ising
model: Replica analysis,” J. Statist. Mech., vol. 2014, no. 6, p. P06013, 2014.
[122] G. Bresler, “Efficiently learning Ising models on arbitrary graphs,” in Proc. 47th Annual
ACM Symposium Theory of Computation (STOC ’15), 2015, pp. 771–782.
[123] P. Netrapalli, S. Banerjee, S. Sanghavi, and S. Shakkottai, “Greedy learning of Markov
network structure,” in Proc. 48th Annual Allerton Conference on Communication Control
Computation, 2010, pp. 1295–1302.
[124] V. Y. F. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, “A large-deviation analysis of
the maximum-likelihood learning of Markov tree structures,” IEEE Trans. Information
Theory, vol. 57, no. 3, pp. 1714–1735, 2011.
[125] M. J. Beal and Z. Ghahramani, “Variational Bayesian learning of directed graphical
models with hidden variables,” Bayesian Analysis, vol. 1, no. 4, pp. 793–831, 2006.
[126] A. Anandkumar and R. Valluvan, “Learning loopy graphical models with latent variables:
Efficient methods and guarantees,” Annals Statist., vol. 41, no. 2, pp. 401–435, 2013.
[127] G. Bresler, D. Gamarnik, and D. Shah, “Learning graphical models from the Glauber
dynamics,” arXiv:1410.7659 [cs.LG], 2014, to be published in IEEE Trans. Information
Theory.
[128] A. P. Dawid, “Conditional independence in statistical theory,” J. Roy. Statist. Soc. Ser. B.
Methodol., vol. 41, no. 1, pp. 1–31, 1979.
[129] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, “Testing ran-
dom variables for independence and identity,” in Proc. 42nd Annual Symposium on the
Foundations Computer Science, 2001, pp. 442–451.
[130] A. Gretton and L. Györfi, “Consistent non-parametric tests of independence,” J. Machine
Learning Res., vol. 11, no. 4, pp. 1391–1423, 2010.
[131] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkottai, “Model-powered
conditional independence test,” in Proc. Advances in Neural Information Processing
Systems 30, 2017, pp. 2955–2965.
Summary
10.1 Introduction
In the standard framework of statistical learning theory (see, e.g., [1]), we are faced with
the stochastic optimization problem
$$\text{minimize } L_\mu(w) := \int_Z \ell(w, z)\, \mu(dz),$$
where w takes values in some hypothesis space W, μ is an unknown probability law on
an instance space Z, and $\ell : W \times Z \to \mathbb{R}_+$ is a given loss function. The quantity $L_\mu(w)$
defined above is referred to as the expected (or population) risk of a hypothesis w ∈ W.
We are given a training sample of size n, i.e., an n-tuple Z = (Z1 , . . . , Zn ) of i.i.d. random
elements of Z drawn from μ. A (possibly randomized) learning algorithm1 is a Markov
kernel PW|Z that maps the training data Z to a random element W of W, and the objective
is to generate W with a suitably small population risk
$$L_\mu(W) = \int_Z \ell(W, z)\, \mu(dz).$$
The empirical risk
$$L_Z(W) := \frac{1}{n} \sum_{i=1}^{n} \ell(W, Z_i)$$
is a natural proxy that can be computed from the data Z and from the output of the
algorithm W. The generalization error of PW|Z is the difference Lμ (W) − LZ (W), and we
are interested in its expected value:
$$\mathrm{gen}(\mu, P_{W|Z}) := \mathbb{E}\big[L_\mu(W) - L_Z(W)\big],$$
where the expectation is w.r.t. the joint probability law P := μ⊗n ⊗ PW|Z of the training
data Z and the algorithm’s output W.
1 We are using the term “algorithm” as a synonym for “rule” or “procedure,” without necessarily assuming
computational efficiency.
One motivation to study this quantity is as follows. Let us assume, for simplicity, that
the infimum inf w∈W Lμ (w) exists, and is achieved by some w◦ ∈ W. We can decompose
the expected value of the excess risk Lμ (W) − Lμ (w◦ ) as
$$\mathbb{E}[L_\mu(W) - L_\mu(w^\circ)] = \mathbb{E}\big[L_\mu(W) - L_Z(W)\big] + \mathbb{E}[L_Z(W)] - L_\mu(w^\circ) = \mathbb{E}\big[L_\mu(W) - L_Z(W)\big] + \mathbb{E}[L_Z(W) - L_Z(w^\circ)] = \mathrm{gen}(\mu, P_{W|Z}) + \mathbb{E}[L_Z(W) - L_Z(w^\circ)], \qquad (10.1)$$
where in the second line we have used the fact that, for any fixed w ∈ W, the empiri-
cal risk LZ (w) is an unbiased estimate of the population risk Lμ (w): ELZ (w) = Lμ (w).
This decomposition shows that the expected excess risk of a learning algorithm will
be small if its expected generalization error is small and if the expected difference
of the empirical risks of W and w◦ is bounded from above by a small non-negative
quantity.
For example, we can consider the empirical risk minimization (ERM) algorithm [1]
that returns any minimizer of the empirical risk:
$$W \in \arg\min_{w \in W} L_Z(w).$$
Evidently, the second term in (10.1) is non-positive, so the expected excess risk of ERM
is upper-bounded by its expected generalization error. A crude upper bound on the latter
is given by
$$\mathrm{gen}(\mu, P_{W|Z}) \le \mathbb{E}\Big[\sup_{w \in W} |L_\mu(w) - L_Z(w)|\Big], \qquad (10.2)$$
and it can be shown that, under some restrictions on the complexity of W, the expected
supremum on the right-hand side decays to zero as the sample size n → ∞, i.e., empiri-
cal risks converge to population risks uniformly over the hypothesis class. However, in
many cases it is possible to attain asymptotically vanishing excess risk without uniform
convergence (see the article by Shalev-Shwartz et al. [2] for several practically relevant
examples); moreover, the bound in (10.2) is oblivious to the interaction between the data
Z and the algorithm’s output W, and may be rather loose in some settings (for example,
if the algorithm has a fixed computational budget and therefore cannot be expected to
explore the entire hypothesis space W).
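To make these definitions concrete, the following sketch (Python with NumPy; the toy data model, the finite class of threshold rules, and all names are our own choices) estimates the expected generalization error of ERM by Monte Carlo, approximating the population risk with a large held-out sample, and compares it with an estimate of the uniform deviation appearing on the right-hand side of (10.2).

import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(-2.0, 2.0, 41)   # finite hypothesis class W of threshold rules
n, n_trials, n_test = 30, 2000, 100_000

def sample(size):
    # Instance z = (x, y): y is a fair coin and x ~ N(y, 1); the loss below is the 0-1 loss.
    y = rng.integers(0, 2, size)
    x = rng.normal(y.astype(float), 1.0)
    return x, y

def risks(x, y):
    # Average 0-1 loss of every threshold rule on the sample (x, y); one entry per hypothesis.
    preds = (x[None, :] > thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)

x_te, y_te = sample(n_test)
population_risk = risks(x_te, y_te)        # proxy for L_mu(w), w in W

gen_gaps, unif_devs = [], []
for _ in range(n_trials):
    x_tr, y_tr = sample(n)
    empirical_risk = risks(x_tr, y_tr)     # L_Z(w) for every w
    w_hat = empirical_risk.argmin()        # ERM output W
    gen_gaps.append(population_risk[w_hat] - empirical_risk[w_hat])
    unif_devs.append(np.abs(population_risk - empirical_risk).max())

print("estimated gen(mu, P_{W|Z})           :", np.mean(gen_gaps))
print("estimated E sup_w |L_mu - L_Z| (10.2):", np.mean(unif_devs))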
10.2 Preliminaries
Moreover, if $\mu \ll \nu$, then
$$d_{TV}(\mu, \nu) = \frac{1}{2} \int_\Omega \left| \frac{d\mu}{d\nu} - 1 \right| d\nu. \qquad (10.4)$$
For these and related results, see Section 1.2 of [17].
If (U, V) is a random couple with joint probability law $P_{UV}$, the mutual information between U and V is defined as $I(U; V) := D(P_{UV} \,\|\, P_U \otimes P_V)$. The conditional mutual information between U and V given a third random element Y jointly distributed with (U, V) is defined as
$$I(U; V|Y) := \int P_Y(dy)\, D\big(P_{UV|Y=y} \,\|\, P_{U|Y=y} \otimes P_{V|Y=y}\big).$$
If we use the total variation distance instead of the relative entropy, we obtain the T-information $T(U; V) := d_{TV}(P_{UV}, P_U \otimes P_V)$ (see, e.g., [18]) and the conditional T-information
$$T(U; V|Y) := \int P_Y(dy)\, d_{TV}\big(P_{UV|Y=y}, P_{U|Y=y} \otimes P_{V|Y=y}\big).$$
The erasure mutual information [19] between jointly distributed random objects U and $V = (V_1, \ldots, V_m)$ is
$$I^-(U; V) = I^-(U; V_1, \ldots, V_m) := \sum_{k=1}^{m} I(U; V_k | V^{-k}),$$
where $V^{-k} := (V_1, \ldots, V_{k-1}, V_{k+1}, \ldots, V_m)$. By analogy, we define the erasure T-information as
$$T^-(U; V) = T^-(U; V_1, \ldots, V_m) := \sum_{k=1}^{m} T(U; V_k | V^{-k}).$$
The erasure mutual information I − (U; V) is related to the usual mutual information
$I(U; V) = I(U; V_1, \ldots, V_m)$ via the identity
$$I^-(U; V) = m\, I(U; V) - \sum_{k=1}^{m} I(U; V^{-k}) \qquad (10.5)$$
(Theorem 7 of [19]). Moreover, I − (U; V) may be larger or smaller than I(U; V), as can
be seen from the following examples [19]:
• if U takes at most countably many values and V1 = . . . = Vm = U, then I(U; V) =
H(U), the Shannon entropy of U, while I − (U; V) = 0;
• if $V_1, \ldots, V_m$ are i.i.d. $\mathrm{Bern}(\tfrac{1}{2})$ and $U = V_1 \oplus V_2 \oplus \cdots \oplus V_m$, then $I(U; V) = \log 2$, while $I^-(U; V) = m \log 2$;
• if $U \sim \mathcal{N}(0, \sigma^2)$ and $V_k = U + N_k$, where $N_1, \ldots, N_m$ are i.i.d. $\mathcal{N}(0, 1)$ independent of U, then
$$I(U; V) = \frac{1}{2} \log(1 + m\sigma^2), \qquad I^-(U; V) = \frac{m}{2} \log\!\left(1 + \frac{\sigma^2}{1 + (m-1)\sigma^2}\right).$$
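A quick numerical check of the Gaussian example (Python with NumPy; our own verification script): both mutual informations are computed from Gaussian log-determinants, with the erasure quantity obtained from ordinary mutual informations through the identity (10.5).

import numpy as np

def gaussian_mi(cov, idx_a, idx_b):
    # I(A; B) = (1/2) log( det(Sigma_A) det(Sigma_B) / det(Sigma_{A,B}) ) for jointly Gaussian vectors.
    ld = lambda s: np.linalg.slogdet(cov[np.ix_(s, s)])[1]
    return 0.5 * (ld(idx_a) + ld(idx_b) - ld(idx_a + idx_b))

m, sigma2 = 5, 2.0
# Joint covariance of (U, V_1, ..., V_m) with U ~ N(0, sigma2) and V_k = U + N_k, N_k ~ N(0, 1).
cov = sigma2 * np.ones((m + 1, m + 1)) + np.diag([0.0] + [1.0] * m)

U, V = [0], list(range(1, m + 1))
mi = gaussian_mi(cov, U, V)
erasure = m * mi - sum(gaussian_mi(cov, U, [j for j in V if j != k]) for k in V)  # identity (10.5)
print(mi, 0.5 * np.log(1 + m * sigma2))                                # both ~ I(U; V)
print(erasure, 0.5 * m * np.log(1 + sigma2 / (1 + (m - 1) * sigma2)))  # both ~ I^-(U; V)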
We also have the following.
proposition 10.1 If $V_1, \ldots, V_m$ are independent, then, for an arbitrary U jointly distributed with V,
$$I(U; V) \le I^-(U; V). \qquad (10.6)$$
Proof  Writing $V^{k-1} := (V_1, \ldots, V_{k-1})$, for each $k \in [m]$ we have
$$I(U; V_k | V^{k-1}) = I(U, V^{k-1}; V_k) \le I(U, V^{-k}; V_k) = I(U; V_k | V^{-k}),$$
where the first and third steps use the chain rule and the independence of $V_1, \ldots, V_m$, while the second step is provided by the data-processing inequality. Summing over all k, we get (10.6).
A real-valued random variable U is said to be $\sigma^2$-sub-Gaussian if, for every $\lambda \in \mathbb{R}$,
$$\mathbb{E}\big[e^{\lambda(U - \mathbb{E}U)}\big] \le e^{\lambda^2 \sigma^2 / 2}. \qquad (10.7)$$
A classic result due to Hoeffding states that any almost surely bounded random variable
is sub-Gaussian.
lemma 10.1 (Hoeffding [20]) If a ≤ U ≤ b almost surely, for some −∞ < a ≤ b < ∞, then
$$\mathbb{E}\big[e^{\lambda(U - \mathbb{E}U)}\big] \le e^{\lambda^2 (b-a)^2 / 8}, \qquad (10.8)$$
i.e., U is $(b-a)^2/4$-sub-Gaussian.
Consider a pair of random elements U and V of some spaces U and V, respectively,
with joint distribution PUV . Let Ū and V̄ be independent copies of U and V, such
that PŪ V̄ = PU ⊗ PV . For an arbitrary real-valued function f : U × V → R, we have the
following upper bound on the absolute difference between E[ f (U, V)] and E[ f (Ū, V̄)].
lemma 10.2 If f (u, V) is σ2 -sub-Gaussian under PV for every u, then
$$\big| \mathbb{E}[f(U, V)] - \mathbb{E}[f(\bar{U}, \bar{V})] \big| \le \sqrt{2\sigma^2 I(U; V)}. \qquad (10.9)$$
Proof  The proof relies on the Donsker–Varadhan variational representation of the relative entropy: for any two probability measures $\mu \ll \nu$ on $\Omega$,
$$D(\mu \,\|\, \nu) = \sup_F \left\{ \int_\Omega F\, d\mu - \log \int_\Omega e^F d\nu \right\}, \qquad (10.10)$$
where the supremum is over all measurable functions $F : \Omega \to \mathbb{R}$ such that $\int e^F d\nu < \infty$.
From (10.10), we know that, for any $\lambda \in \mathbb{R}$,
$$D(P_{V|U=u} \,\|\, P_V) \ge \mathbb{E}[\lambda f(u, V) \,|\, U = u] - \log \mathbb{E}\big[e^{\lambda f(u, V)}\big] \ge \lambda\big(\mathbb{E}[f(u, V) \,|\, U = u] - \mathbb{E}[f(u, V)]\big) - \frac{\lambda^2 \sigma^2}{2}, \qquad (10.11)$$
where the second step follows from the sub-Gaussian assumption on $f(u, V)$:
$$\log \mathbb{E}\big[e^{\lambda(f(u, V) - \mathbb{E}[f(u, V)])}\big] \le \frac{\lambda^2 \sigma^2}{2} \qquad \forall \lambda \in \mathbb{R}.$$
By maximizing the right-hand side of (10.11) over all $\lambda \in \mathbb{R}$ and rearranging, we obtain
$$\big| \mathbb{E}[f(u, V) \,|\, U = u] - \mathbb{E}[f(u, V)] \big| \le \sqrt{2\sigma^2 D(P_{V|U=u} \,\|\, P_V)}.$$
Then, using the law of iterated expectation and Jensen’s inequality,
$$\big|\mathbb{E}[f(U, V)] - \mathbb{E}[f(\bar{U}, \bar{V})]\big| = \big|\mathbb{E}\big[\mathbb{E}[f(U, V) - f(U, \bar{V}) \,|\, U]\big]\big| \le \int P_U(du)\, \big|\mathbb{E}[f(u, V) \,|\, U = u] - \mathbb{E}[f(u, V)]\big| \le \int P_U(du)\, \sqrt{2\sigma^2 D(P_{V|U=u} \,\|\, P_V)} \le \sqrt{2\sigma^2 D(P_{UV} \,\|\, P_U \otimes P_V)}.$$
The result follows by noting that $I(U; V) = D(P_{UV} \,\|\, P_U \otimes P_V)$.
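A small numerical check of Lemma 10.2 (Python with NumPy; the toy example is ours): take (U, V) bivariate Gaussian with correlation ρ, for which I(U; V) = −½ log(1 − ρ²), and f(u, v) = sin(u) tanh(v), which takes values in [−1, 1] and is therefore 1-sub-Gaussian in V for every u by Hoeffding's lemma.

import numpy as np

rng = np.random.default_rng(1)
rho, n_samples = 0.8, 2_000_000

# Correlated pair (U, V) and an independent copy (U_bar, V_bar).
u = rng.standard_normal(n_samples)
v = rho * u + np.sqrt(1 - rho**2) * rng.standard_normal(n_samples)
u_bar = rng.standard_normal(n_samples)
v_bar = rng.standard_normal(n_samples)

f = lambda a, b: np.sin(a) * np.tanh(b)      # bounded in [-1, 1], hence 1-sub-Gaussian in V
lhs = abs(f(u, v).mean() - f(u_bar, v_bar).mean())
mi = -0.5 * np.log(1 - rho**2)               # I(U; V) for a bivariate Gaussian
print(lhs, "<=", np.sqrt(2 * 1.0 * mi))      # the bound (10.9) with sigma^2 = 1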
The magnitude of gen(μ, PW|Z ) is determined by the stability properties of PW|Z , i.e.,
by the sensitivity of PW|Z to local modifications of the training data Z. We wish to
quantify this variability in information-theoretic terms. Let PW|Z=z denote the prob-
ability distribution of the output of the algorithm in response to a deterministic z =
$(z_1, \ldots, z_n) \in Z^n$; this coincides with the conditional distribution of W given Z = z. Recalling the notation $z^{-i} := (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n)$, we can write
the conditional distribution of W given Z−i = z−i in the following form:
$$P_{W|Z^{-i}=z^{-i}}(\cdot) = \int_Z \mu(dz_i)\, P_{W|Z=(z_1, \ldots, z_i, \ldots, z_n)}(\cdot); \qquad (10.12)$$
unlike PW|Z=z , this conditional distribution is determined by both μ and PW|Z . We put
forward the following definition.
definition 10.1 Given the data-generating distribution μ, we say that a learning
algorithm PW|Z is (ε, μ)-stable in erasure T -information if
T − (W; Z) = T − (W; Z1 , . . . , Zn ) ≤ nε, (10.13)
and (ε, μ)-stable in erasure mutual information if
I − (W; Z) = I − (W; Z1 , . . . , Zn ) ≤ nε, (10.14)
where all expectations are w.r.t. P. We say that PW|Z is ε-stable (in erasure T -information
or in erasure mutual information) if it is (ε, μ)-stable in the appropriate sense for every μ.
These two notions of stability are related.
lemma 10.3 Consider a learning algorithm PW|Z and a data-generating distribution μ.
√
1. If PW|Z is (ε, μ)-stable in erasure mutual information, then it is ( ε/2, μ)-stable in
erasure T -information.
2. If PW|Z is (ε, μ)-stable in erasure T -information with ε ≤ 1/4 and the hypothesis
class W is finite, i.e., |W| < ∞, then PW|Z is (ε log(|W|/ε), μ)-stable in erasure mutual
information.
Proof For Part 1, using Pinsker’s inequality and the concavity of the square root,
we have
$$\frac{1}{n} T^-(W; Z) = \frac{1}{n} \sum_{i=1}^{n} T(W; Z_i | Z^{-i}) \le \frac{1}{n} \sum_{i=1}^{n} \sqrt{\frac{1}{2} I(W; Z_i | Z^{-i})} \le \sqrt{\frac{1}{2n} \sum_{i=1}^{n} I(W; Z_i | Z^{-i})} \le \sqrt{\frac{\varepsilon}{2}}.$$
For Part 2, since the output W is finite-valued, we can express the conditional mutual
information $I(W; Z_i|Z^{-i})$ as the difference of two conditional entropies:
$$I(W; Z_i | Z^{-i}) = H(W | Z^{-i}) - H(W | Z) = \int_{Z^n} \mu^{\otimes n}(dz)\, \Big[ H\big(P_{W|Z^{-i}=z^{-i}}\big) - H\big(P_{W|Z=z}\big) \Big]. \qquad (10.15)$$
We now recall the following inequality (Theorem 17.3.3 of [22]): for any two probability
distributions μ and ν on a common finite set Ω with dTV (μ, ν) ≤ 1/4,
$$|H(\mu) - H(\nu)| \le d_{TV}(\mu, \nu) \log\frac{|\Omega|}{d_{TV}(\mu, \nu)}. \qquad (10.16)$$
Applying (10.16) to (10.15), we get
$$I(W; Z_i | Z^{-i}) \le \int_{Z^n} \mu^{\otimes n}(dz)\, d_{TV}\big(P_{W|Z^{-i}=z^{-i}}, P_{W|Z=z}\big) \log\frac{|W|}{d_{TV}\big(P_{W|Z^{-i}=z^{-i}}, P_{W|Z=z}\big)}. \qquad (10.17)$$
remark 10.1 Note that the sufficient conditions of Lemma 10.4 are distribution-free;
that is, they do not involve μ. These notions of stability were introduced recently by
Bassily et al. [23] under the names of KL- and TV-stability.
Proof We give the proof for the erasure mutual information; the case of the erasure
T-information is analogous. Fix $\mu \in \mathcal{P}(Z)$, $z \in Z^n$, and $i \in [n]$. For $z' \in Z$, let $z^{i,z'}$ denote the n-tuple obtained by replacing $z_i$ in z with $z'$. Then
$$D\big(P_{W|Z=z} \,\|\, P_{W|Z^{-i}=z^{-i}}\big) \le \int_Z \mu(dz')\, D\big(P_{W|Z=z} \,\|\, P_{W|Z=z^{i,z'}}\big) \le \varepsilon,$$
where the first inequality follows from the convexity of the relative entropy. The claimed
result follows by substituting this estimate into the expression
$$I(W; Z_i | Z^{-i}) = \int_{Z^n} \mu^{\otimes n}(dz)\, D\big(P_{W|Z=z} \,\|\, P_{W|Z^{-i}=z^{-i}}\big)$$
for the conditional mutual information.
Another notion of stability results if we consider plain T -information and mutual
information.
definition 10.2 Given a pair (μ, PW|Z ), we say that PW|Z is (ε, μ)-stable in
T -information if
T (W; Z) = T (W; Z1 , . . . , Zn ) ≤ nε, (10.18)
and (ε, μ)-stable in mutual information if
I(W; Z) = I(W; Z1 , . . . , Zn ) ≤ nε, (10.19)
where all expectations are w.r.t. P. We say that PW|Z is ε-stable (in T-information or in
mutual information) if it is (ε, μ)-stable in the appropriate sense for every μ.
In this section, we will relate the generalization error of an arbitrary learning algorithm
to its information-theoretic stability properties. We start with the following simple, but
important, result.
theorem 10.1 If the loss function takes values in [0, 1], then, for any pair (μ, PW|Z ),
$$|\mathrm{gen}(\mu, P_{W|Z})| \le \frac{1}{n} T^-(W; Z). \qquad (10.20)$$
In particular, if PW|Z is (ε, μ)-stable in erasure T -information, then | gen(μ, PW|Z )| ≤ 2ε.
Proof The proof technique is standard in the literature on algorithmic stability (see,
e.g., the proof of Lemma 7 in [7], the discussion at the beginning of Section 3.1 in [10],
or the proof of Lemma 11 in [2]); note, however, that we do not require PW|Z to be
symmetric. Introduce an auxiliary sample $Z' = (Z'_1, \ldots, Z'_n) \sim \mu^{\otimes n}$ that is independent of $(Z, W) \sim P$. Since $\mathbb{E}[L_\mu(W)] = \mathbb{E}[\ell(W, Z'_i)]$ for each $i \in [n]$, we write
$$\mathrm{gen}(\mu, P_{W|Z}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\ell(W, Z'_i) - \ell(W, Z_i)\big].$$
Now, for each $i \in [n]$ let us denote by $W^{(i)}$ the output of the algorithm when the input is equal to $Z^{i,Z'_i}$. Then, since the joint probability law of $(W, Z, Z'_i)$ evidently coincides with the joint probability law of $(W^{(i)}, Z^{i,Z'_i}, Z_i)$, we have
$$\mathbb{E}\big[\ell(W, Z'_i) - \ell(W, Z_i)\big] = \mathbb{E}\big[\ell(W^{(i)}, Z_i) - \ell(W^{(i)}, Z'_i)\big]. \qquad (10.21)$$
Moreover,
$$\mathbb{E}\big[\ell(W^{(i)}, Z_i)\big] = \int \mu^{\otimes(n-1)}(dz^{-i})\, \mu(dz_i)\, \mu(dz'_i) \int P_{W|Z=z^{i,z'_i}}(dw)\, \ell(w, z_i) = \int \mu^{\otimes n}(dz) \int P_{W|Z^{-i}=z^{-i}}(dw)\, \ell(w, z_i), \qquad (10.22)$$
where in the second line we used the fact that $Z_1, \ldots, Z_n$ and $Z'_i$ are i.i.d. draws from $\mu$; by the same reasoning,
$$\mathbb{E}\big[\ell(W^{(i)}, Z'_i)\big] = \int \mu^{\otimes n}(dz) \int P_{W|Z=z}(dw)\, \ell(w, z_i). \qquad (10.23)$$
Using (10.22) and (10.23), we obtain
$$\big| \mathbb{E}[\ell(W^{(i)}, Z_i)] - \mathbb{E}[\ell(W^{(i)}, Z'_i)] \big| \le \int_{Z^n} \mu^{\otimes n}(dz) \left| \int_W P_{W|Z^{-i}=z^{-i}}(dw)\, \ell(w, z_i) - \int_W P_{W|Z=z}(dw)\, \ell(w, z_i) \right| \le \int_{Z^n} \mu^{\otimes n}(dz)\, d_{TV}\big(P_{W|Z=z}, P_{W|Z^{-i}=z^{-i}}\big) = T(W; Z_i | Z^{-i}),$$
where we have used (10.3) and the fact that T (U; V|Y) = EdTV (PU|VY , PU|Y ) [18].
Summing over i ∈ [n] and using the definition of T − (W; Z), we get the claimed
bound.
The following theorem replaces the assumption of bounded loss with a less restrictive
sub-Gaussianity condition.
theorem 10.2 Consider a pair $(\mu, P_{W|Z})$, where $\ell(w, Z)$ is $\sigma^2$-sub-Gaussian under $\mu$ for every $w \in W$. Then
$$|\mathrm{gen}(\mu, P_{W|Z})| \le \sqrt{\frac{2\sigma^2}{n} I(W; Z)}. \qquad (10.24)$$
In particular, if $P_{W|Z}$ is $(\varepsilon, \mu)$-stable in mutual information, then $|\mathrm{gen}(\mu, P_{W|Z})| \le \sqrt{2\sigma^2 \varepsilon}$.
remark 10.2 Upper bounds on the expected generalization error in terms of the mutual
information I(Z; W) go back to earlier results of McAllester on PAC-Bayes methods [24]
(see also the tutorial paper [25]).
remark 10.3 For a bounded loss function $\ell(\cdot, \cdot) \in [a, b]$, $\ell(w, Z)$ is $(b-a)^2/4$-sub-Gaussian for all $\mu$ and all $w \in W$, by Hoeffding's lemma.
remark 10.4 Since $Z_1, \ldots, Z_n$ are i.i.d., $I(W; Z) \le I^-(W; Z)$, by Proposition 10.1. Thus, if $P_{W|Z}$ is $\varepsilon$-stable in erasure mutual information, then it is automatically $\varepsilon$-stable in mutual information, and therefore the right-hand side of (10.24) is bounded by $\sqrt{2\sigma^2 \varepsilon}$. On the other hand, proving stability in erasure mutual information is often easier than proving stability in mutual information.
Proof  For each w, $L_Z(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, Z_i)$ is $(\sigma^2/n)$-sub-Gaussian. Thus, we can apply Lemma 10.2 to $U = W$, $V = Z$, $f(U, V) = L_Z(W)$.
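As a rough numerical illustration of (10.24), the following sketch (Python with NumPy; it reuses the threshold-class toy problem from the earlier ERM sketch, now with a Gibbs-style randomized learner, and all modeling choices are ours) compares a Monte Carlo estimate of the generalization error with a relaxed version of the bound obtained from I(W; Z) ≤ H(W), where H(W) is estimated from the empirical marginal of the outputs; the loss takes values in [0, 1], so σ² = 1/4 by Remark 10.3.

import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(-2.0, 2.0, 21)   # finite hypothesis class W
n, beta, n_trials, n_test = 50, 5.0, 5000, 200_000

def sample(size):
    y = rng.integers(0, 2, size)
    x = rng.normal(y.astype(float), 1.0)
    return x, y

def risks(x, y):
    # Average 0-1 loss (values in [0, 1]) of every threshold rule on the sample.
    preds = (x[None, :] > thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)

x_te, y_te = sample(n_test)
population_risk = risks(x_te, y_te)       # proxy for L_mu(w)

gaps, outputs = [], []
for _ in range(n_trials):
    x_tr, y_tr = sample(n)
    empirical_risk = risks(x_tr, y_tr)
    logits = -beta * n * empirical_risk
    p = np.exp(logits - logits.max()); p /= p.sum()   # Gibbs-style posterior over W
    w = rng.choice(len(thresholds), p=p)              # randomized output W
    gaps.append(population_risk[w] - empirical_risk[w])
    outputs.append(w)

marginal = np.bincount(outputs, minlength=len(thresholds)) / n_trials
H_W = -np.sum(marginal[marginal > 0] * np.log(marginal[marginal > 0]))  # upper-bounds I(W; Z)
print("estimated |gen(mu, P_{W|Z})|      :", abs(np.mean(gaps)))
print("relaxed bound sqrt(2 sigma^2 H/n) :", np.sqrt(2 * 0.25 * H_W / n))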
is the collection of empirical risks of the hypotheses in W. Using Lemma 10.2 by set-
ting U = ΛW (Z), V = W, and f (ΛW (z), w) = Lz (w), we immediately recover the result
obtained by Russo and Zou even when W is uncountably infinite.
theorem 10.3 (Russo and Zou [26]) Suppose $\ell(w, Z)$ is $\sigma^2$-sub-Gaussian under $\mu$ for all $w \in W$. Then
$$\mathrm{gen}(\mu, P_{W|Z}) \le \sqrt{\frac{2\sigma^2}{n} I(W; \Lambda_W(Z))}. \qquad (10.26)$$
It should be noted that Theorem 10.2 can also be obtained as a consequence of
Theorem 10.3 because
$$I(W; \Lambda_W(Z)) \le I(W; Z)$$
by the data-processing inequality for the Markov chain $\Lambda_W(Z) - Z - W$. The latter holds
since, for each w ∈ W, LZ (w) is a function of Z. However, if the output W depends on Z
only through the empirical risks ΛW (Z) (i.e., the Markov chain Z − ΛW (Z) − W holds),
then Theorems 10.2 and 10.3 are equivalent. The advantage of Theorem 10.2 is that
I(W; Z) is often much easier to evaluate than I(W; ΛW (Z)). We will elaborate on this
when we discuss the Gibbs algorithm and adaptive composition of learning algorithms.
Recent work by Jiao et al. [27] extends the results of Russo and Zou by introducing
a generalization of mutual information that can handle the cases when $\ell(w, Z)$ is not
sub-Gaussian.
size of
$$n \ge \frac{2\sigma^2}{\alpha^2} \log\frac{2}{\beta} \qquad (10.28)$$
suffices to guarantee
$$P\big[|L_\mu(W) - L_Z(W)| > \alpha\big] \le \beta. \qquad (10.29)$$
The following results pertain to the case when W and Z are dependent, but the mutual
information I(W; Z) is sufficiently small. The tail probability now is taken with respect
to the joint distribution P = μ⊗n ⊗ PW|Z .
theorem 10.4 Suppose $\ell(w, Z)$ is $\sigma^2$-sub-Gaussian under $\mu$ for all $w \in W$. If a learning algorithm satisfies $I(W; \Lambda_W(Z)) \le C$, then for any $\alpha > 0$ and $0 < \beta \le 1$, (10.29) can be guaranteed by a sample complexity of
$$n \ge \frac{8\sigma^2}{\alpha^2} \left( \frac{C}{\beta} + \log\frac{2}{\beta} \right). \qquad (10.30)$$
In view of (10.27), any learning algorithm that is (ε, μ)-stable in mutual information
satisfies the condition I(W; ΛW (Z)) ≤ nε. We also have the following corollary.
corollary 10.1 Under the conditions in Theorem 10.4, if C ≤ (g(n) − 1)β log(2/β)
for some function g(n) ≥ 1, then a sample complexity that satisfies n/g(n) ≥
(8σ2 /α2 ) log(2/β) guarantees (10.29).
For example, taking g(n) = 2, Corollary 10.1 implies that, if C ≤ β log(2/β), then
(10.29) can be guaranteed by a sample complexity of n = (16σ2 /α2 ) log(2/β), which
is on the same order as the sample complexity when Z and W are independent as in (10.28). As another example, taking $g(n) = \sqrt{n}$, Corollary 10.1 implies that, if $C \le (\sqrt{n} - 1)\beta \log(2/\beta)$, then a sample complexity of $n = (64\sigma^4/\alpha^4)(\log(2/\beta))^2$ guarantees (10.29).
Recent papers of Dwork et al. [28, 29] give “high-probability” bounds on the abso-
lute generalization error of differentially private algorithms with bounded loss functions,
i.e., the tail bound P[|Lμ (W) − LZ (W)| > α] ≤ β is guaranteed to hold whenever n
(1/α2 ) log(1/β). By contrast, Theorem 10.4 does not require differential privacy and
assumes that (w, Z) is sub-Gaussian. Bassily et al. [30] obtain a concentration inequal-
ity on the absolute generalization error on the same order as the bound of Theorem 10.4
and show that this bound is sharp – they give an example of a learning problem (μ, W, )
and a learning algorithm PW|Z that satisfies I(W; Z) ≤ O(1) and
1 1
P |LZ (W) − Lμ (W)| ≥ ≥ .
2 n
They also give an example where a sufficient amount of mutual information is necessary
in order for the ERM algorithm to generalize.
The proof of Theorem 10.4 is based on Lemma 10.2 and an adaptation of the “monitor
technique” of Bassily et al. [23]. We first need the following two lemmas. The first
lemma is a simple consequence of the tensorization property of mutual information.
lemma 10.6 Consider the parallel execution of m independent copies of PW|Z on
independent datasets Z1 , . . . , Zm : for t = 1, . . . , m, an independent copy of PW|Z takes
Z_t ∼ μ^{⊗n} as input and outputs W_t. Let Z^m := (Z_1, ..., Z_m) be the overall dataset. If,
under μ, P_{W|Z} satisfies I(W; Λ_W(Z)) ≤ C, then the overall algorithm P_{W^m|Z^m} satisfies
I(W^m; Λ_W(Z_1), ..., Λ_W(Z_m)) ≤ mC.
Proof The proof is by the independence among (Zt , Wt ), t = 1, . . . , m, and the chain rule
of mutual information.
The next lemma is the key piece. It will be used to construct a procedure that executes
m copies of a learning algorithm in parallel and then selects the one with the largest
absolute generalization error.
lemma 10.7 Let Z^m = (Z_1, ..., Z_m), where each Z_t ∼ μ^{⊗n} is independent of
all of the others. If an algorithm P_{W,T,R|Z^m}: Z^{m×n} → W × [m] × {±1} satisfies
I(W, T, R; Λ_W(Z_1), ..., Λ_W(Z_m)) ≤ C, and if ℓ(w, Z) is σ²-sub-Gaussian for all
w ∈ W, then
E[ R (L_{Z_T}(W) − L_μ(W)) ] ≤ √( 2σ²C/n ).
Proof The proof is based on Lemma 10.2. Let U = (Λ_W(Z_1), ..., Λ_W(Z_m)), V =
(W, T, R), and
f( (Λ_W(z_1), ..., Λ_W(z_m)), (w, t, r) ) = r L_{z_t}(w).
If ℓ(w, Z) is σ²-sub-Gaussian under Z ∼ μ for all w ∈ W, then (r/n) Σ_{i=1}^n ℓ(w, Z_{t,i}) is
(σ²/n)-sub-Gaussian for all w ∈ W, t ∈ [m], and r ∈ {±1}, and hence f(u, V) is (σ²/n)-
sub-Gaussian for every u. Lemma 10.2 implies that
E[R L_{Z_T}(W)] − E[R L_μ(W)] ≤ √( (2σ²/n) I(W, T, R; Λ_W(Z_1), ..., Λ_W(Z_m)) ),
which proves the claim.
Note that the upper bound in Lemma 10.7 does not depend on m. With these lemmas,
we can prove Theorem 10.4.
Proof of Theorem 10.4 First, let PW m |Zm be the parallel execution of m independent
copies of PW|Z , as in Lemma 10.6. Given Zm and W m , let the output of the “monitor” be
a sample (W ∗ , T ∗ , R∗ ) drawn from W × [m] × {±1} according to
(T*, R*) = argmax_{t∈[m], r∈{±1}} r ( L_μ(W_t) − L_{Z_t}(W_t) )  and  W* = W_{T*}.   (10.31)
This gives
R* ( L_μ(W*) − L_{Z_{T*}}(W*) ) = max_{t∈[m]} | L_μ(W_t) − L_{Z_t}(W_t) |.
Note that, conditionally on W^m, the tuple (W*, T*, R*) can take only 2m values, which
means that
I(W*, T*, R*; Λ_W(Z_1), ..., Λ_W(Z_m) | W^m) ≤ log(2m).   (10.33)
In addition, since PW|Z is assumed to satisfy I(W; ΛW (Z)) ≤ C, Lemma 10.6 implies that
I(W^m; Λ_W(Z_1), ..., Λ_W(Z_m)) ≤ mC.
Therefore, by the chain rule of mutual information and the data-processing inequality,
we have
This result improves on Proposition 3.2 of [26], which states that E[L_Z(W) − L_μ(W)] ≤
σ/√n + 36√(2σ²C/n). Theorem 10.5 together with Markov's inequality implies that
(10.29) can be guaranteed by n ≥ (2σ²/(α²β²)) ( C + log 2 ), but it has a worse dependence
on β than does the sample complexity given by Theorem 10.4.
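To make the monitor construction concrete, the following minimal sketch runs m independent copies of a learning algorithm on fresh datasets and returns the copy, index, and sign with the largest signed generalization gap, as in (10.31). The base learner learn, the loss function loss, the dataset sampler, and the large held-out sample standing in for the population risk are illustrative assumptions, not objects from the text.

import numpy as np

def monitor(learn, loss, sample_dataset, population, m, n, rng):
    """Run m independent copies of `learn` and pick the copy with the
    largest signed generalization gap (the 'monitor' construction)."""
    best = None
    for t in range(m):
        Z_t = sample_dataset(n, rng)                       # fresh dataset Z_t ~ mu^n
        w_t = learn(Z_t, rng)                              # hypothesis W_t
        emp = np.mean([loss(w_t, z) for z in Z_t])         # empirical risk L_{Z_t}(W_t)
        pop = np.mean([loss(w_t, z) for z in population])  # proxy for L_mu(W_t)
        for r in (+1, -1):
            gap = r * (pop - emp)                          # r * (L_mu - L_{Z_t})
            if best is None or gap > best[0]:
                best = (gap, w_t, t, r)
    gap, w_star, t_star, r_star = best
    return w_star, t_star, r_star, gap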
Using standard estimates for covering numbers in finite-dimensional Banach spaces [31],
we can write
I(Z; W′) ≤ log N(W, ‖·‖, r) ≤ d log(3B/r),
and therefore, under the sub-Gaussian assumption on ℓ, the composite learning algorithm
P_{W′|Z} satisfies
gen(μ, P_{W′|Z}) ≤ √( (2σ²d/n) log(3B/r) ).   (10.42)
For example, if we set r = 3B/√n, the above bound on the generalization error will scale
as √( σ²d log n / n ). If ℓ(·, z) is Lipschitz, i.e., |ℓ(w, z) − ℓ(w′, z)| ≤ ‖w − w′‖_2, then we can
use (10.42) to obtain the following generalization bound for the original algorithm P_{W|Z}:
gen(μ, P_{W|Z}) ≤ inf_{r≥0} ( 2r + √( (2σ²d/n) log(3B/r) ) ).
Again, taking r = 3B/√n, we get
gen(μ, P_{W|Z}) ≤ √( (2σ²d log n)/n ) + 6B/√n.
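The discretization behind this covering argument can be sketched as follows (a minimal illustration only; the hypothetical base learner learn and the coordinate-wise clipping to the ball of radius B are our simplifying assumptions): the learned weights are projected onto a grid of resolution r, so that the quantized output W′ takes at most on the order of (3B/r)^d values and I(Z; W′) ≤ d log(3B/r).

import numpy as np

def quantize(w, r, B):
    """Project w onto a grid of spacing r, restricted (coordinate-wise here)
    to the box of radius B; the result has at most ~(3B/r)^d possible values."""
    w = np.clip(w, -B, B)
    return np.round(w / r) * r          # nearest grid point

def quantized_learner(learn, Z, r, B, rng):
    w = learn(Z, rng)                   # original (possibly continuous) output W
    return quantize(np.asarray(w, dtype=float), r, B)   # W' used in (10.42)

# resolution suggested in the text: r = 3 * B / np.sqrt(len(Z))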
strings (w(X_1), ..., w(X_{n_1})) for w ∈ W_1 are all distinct and {(w(X_1), ..., w(X_{n_1})) : w ∈
W_1} = {(w(X_1), ..., w(X_{n_1})) : w ∈ W}. In other words, W_1 forms an empirical cover of
W with respect to Z_1. Then pick a hypothesis from W_1 with the minimal empirical risk
on Z_2, i.e.,
W = argmin_{w∈W_1} L_{Z_2}(w).   (10.43)
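A schematic implementation of this two-stage procedure might look as follows (a sketch only; representing hypotheses as callables w(x) and splitting the data into Z_1 and Z_2 are illustrative assumptions): keep one hypothesis per distinct prediction pattern on Z_1, then run ERM over that empirical cover on Z_2, as in (10.43).

import numpy as np

def empirical_cover_erm(hypotheses, Z1, Z2, loss):
    """Two-stage ERM: (i) keep one representative hypothesis per distinct
    prediction pattern on Z1 (an empirical cover W_1 of W w.r.t. Z1);
    (ii) return the representative with minimal empirical risk on Z2."""
    X1 = [z[0] for z in Z1]                      # assume each z = (x, y)
    cover = {}
    for w in hypotheses:
        pattern = tuple(w(x) for x in X1)        # (w(X_1), ..., w(X_{n_1}))
        cover.setdefault(pattern, w)             # one hypothesis per pattern
    W1 = list(cover.values())
    emp_risk = lambda w: np.mean([loss(w, z) for z in Z2])   # L_{Z_2}(w)
    return min(W1, key=emp_risk)                 # eq. (10.43)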
and
d_TV( P*_{W|Z=z}, P*_{W|Z=z′} ) = (1/2) ∫_W | dP*_{W|Z=z′}/dP*_{W|Z=z} − 1 | dP*_{W|Z=z} ≤ 1 − e^{−2β/n},
where we have used (10.4). Another bound on the relative entropy between P*_{W|Z=z} and
P*_{W|Z=z′} can be obtained as follows. We start with
D( P*_{W|Z=z} ‖ P*_{W|Z=z′} ) = (β/n) ∫_W P*_{W|Z=z}(dw) ( ℓ(w, z′_i) − ℓ(w, z_i) ) + log( E_Q[e^{−βL_{z′}(W)}] / E_Q[e^{−βL_z(W)}] ).   (10.52)
Then
log( E_Q[e^{−βL_{z′}(W)}] / E_Q[e^{−βL_z(W)}] ) = log ∫_W exp( (β/n) ( ℓ(w, z_i) − ℓ(w, z′_i) ) ) P*_{W|Z=z}(dw)
≤ (β/n) ∫_W P*_{W|Z=z}(dw) ( ℓ(w, z_i) − ℓ(w, z′_i) ) + β²/(2n²),   (10.53)
where the inequality follows by applying Hoeffding's lemma to the random vari-
able ℓ(W, z_i) − ℓ(W, z′_i) with W ∼ P*_{W|Z=z}, which takes values in [−1, 1] almost surely.
Combining (10.52) and (10.53), we get the bound
D( P*_{W|Z=z} ‖ P*_{W|Z=z′} ) ≤ β²/(2n²)
for any z, z′ ∈ Z^n with d_H(z, z′) = 1. Invoking Lemma 10.4, we see that P*_{W|Z} is (1 −
e^{−2β/n})-stable in erasure T-information and ((2β/n) ∧ (β²/(2n²)))-stable in erasure mutual
information.
Therefore, Theorem 10.1 gives |gen(μ, P*_{W|Z})| ≤ 1 − e^{−2β/n}. Moreover, since ℓ takes
values in [0, 1], the sub-Gaussian assumption of Theorem 10.2 is satisfied with σ² = 1/4
by Hoeffding's lemma, and thus |gen(μ, P*_{W|Z})| ≤ √(β/n) ∧ (β/(2n)). Taking the minimum
of the two bounds, we obtain (10.51).
With the above guarantees on the generalization error, we can analyze the population
risk of the Gibbs algorithm. We first present a result for countable hypothesis spaces.
corollary 10.2 Suppose W is countable. Let W denote the output of the Gibbs
algorithm applied to Z, and let wo denote the hypothesis that achieves the minimum
population risk among W. If ℓ takes values in [0, 1], the population risk of W satisfies
E[L_μ(W)] ≤ inf_{w∈W} L_μ(w) + (1/β) log(1/Q(w_o)) + β/(2n).   (10.54)
Proof We can bound the expected empirical risk of the Gibbs algorithm P∗W|Z as
E[L_Z(W)] ≤ E[L_Z(W)] + (1/β) D( P*_{W|Z} ‖ Q | P_Z )   (10.55)
≤ E[L_Z(w)] + (1/β) D( δ_w ‖ Q )  for all w ∈ W,   (10.56)
where δw is the point mass at w. The second inequality is obtained via the optimality of
the Gibbs P∗W|Z in (10.49), since one can view δw as a sub-optimal learning algorithm that
simply ignores the dataset and always outputs w. Taking w = wo , noting that E[LZ (wo )] =
Lμ (wo ), and combining this with the upper bound of Theorem 10.7 we obtain
E[L_μ(W)] ≤ inf_{w∈W} L_μ(w) + (1/β) D( δ_{w_o} ‖ Q ) + β/(2n).   (10.57)
This leads to (10.54), as D( δ_{w_o} ‖ Q ) = −log Q(w_o) when W is countable.
The auxiliary distribution Q in the Gibbs algorithm can encode any prior knowledge
of the population risks of the hypotheses in W. For example, we can order the hypotheses
according to our (possibly imperfect) prior knowledge of their population risks, and set
Q(w_m) = 6/(π²m²) for the mth hypothesis in the order.² Then, setting β = √n, (10.54)
becomes
E[L_μ(W)] ≤ inf_{w∈W} L_μ(w) + (2 log m_o + 1)/√n,   (10.58)
where mo is the index of wo in the order. Thus, better prior knowledge of the population
risks leads to a smaller sample complexity to achieve a certain expected excess risk. As
another example, if |W| = k < ∞ and we do not have any a priori preferences among
the hypotheses,
then we can take Q to be the uniform distribution on W. Upon setting
β = √(2n log k), (10.54) becomes
E[L_μ(W)] ≤ inf_{w∈W} L_μ(w) + √( (2 log k)/n ).
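On a finite hypothesis class, the Gibbs algorithm amounts to sampling a hypothesis with probability proportional to Q(w) exp(−β L_Z(w)). A minimal sketch is given below (an illustration only; representing hypotheses as callables and the bounded loss function loss are assumptions); with the uniform prior and β = √(2n log k) it realizes the bound above.

import numpy as np

def gibbs_algorithm(hypotheses, Z, loss, beta, prior=None, rng=None):
    """Sample W ~ P*_{W|Z=z}(w) proportional to Q(w) exp(-beta * L_Z(w))."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(hypotheses)
    Q = np.full(k, 1.0 / k) if prior is None else np.asarray(prior, dtype=float)
    emp = np.array([np.mean([loss(w, z) for z in Z]) for w in hypotheses])  # L_Z(w_m)
    logits = np.log(Q) - beta * emp
    p = np.exp(logits - logits.max())
    p /= p.sum()                                    # Gibbs posterior over hypotheses
    return hypotheses[rng.choice(k, p=p)]

# uniform-prior choice from the text: beta = np.sqrt(2 * len(Z) * np.log(len(hypotheses)))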
For uncountable hypothesis spaces, we can proceed in an analogous fashion to analyze
the population risk under the assumption that the loss function is Lipschitz.
corollary 10.3 Suppose W = R^d with the Euclidean norm ‖·‖_2. Let w_o be the hypoth-
esis that achieves the minimum population risk among W. Suppose that ℓ takes values
in [0, 1], and ℓ(·, z) is ρ-Lipschitz for all z ∈ Z. Let W denote the output of the Gibbs
algorithm applied to Z. The population risk of W satisfies
E[L_μ(W)] ≤ inf_{w∈W} L_μ(w) + β/(2n) + inf_{a>0} [ aρ√d + (1/β) D( N(w_o, a²I_d) ‖ Q ) ],   (10.59)
where N(v, Σ) denotes the d-dimensional normal distribution with mean v and covari-
ance matrix Σ.
Proof Just as we did in the proof of Corollary 10.2, we first bound the expected
empirical risk of the Gibbs algorithm P∗W|Z . For any a > 0, we can view N(wo , a2 Id )
as a learning algorithm that ignores the dataset and always draws a hypothesis from
this Gaussian distribution. Denote by γd the standard normal density on Rd . The
non-negativity of relative entropy and (10.49) imply that
E[L_Z(W)] ≤ E[L_Z(W)] + (1/β) D( P*_{W|Z} ‖ Q | P_Z )   (10.60)
2 Recall that Σ_{m=1}^∞ (1/m²) = π²/6 < 2.
≤ ∫_W E[L_Z(w)] a^{−d} γ_d((w − w_o)/a) dw + (1/β) D( N(w_o, a²I_d) ‖ Q )   (10.61)
= ∫_W L_μ(w) a^{−d} γ_d((w − w_o)/a) dw + (1/β) D( N(w_o, a²I_d) ‖ Q ).   (10.62)
Combining this with the upper bound on the expected generalization error of Theo-
rem 10.7, we obtain
E[L_μ(W)] ≤ inf_{a>0} [ ∫_W L_μ(w) a^{−d} γ_d((w − w_o)/a) dw + (1/β) D( N(w_o, a²I_d) ‖ Q ) ] + β/(2n).   (10.63)
Since ℓ(·, z) is ρ-Lipschitz for all z ∈ Z, we have that, for any w ∈ W,
|L_μ(w) − L_μ(w_o)| ≤ E[ |ℓ(w, Z) − ℓ(w_o, Z)| ] ≤ ρ‖w − w_o‖_2.   (10.64)
Then
∫_W L_μ(w) a^{−d} γ_d((w − w_o)/a) dw ≤ L_μ(w_o) + ∫_W ρ‖w − w_o‖_2 a^{−d} γ_d((w − w_o)/a) dw   (10.65)
≤ L_μ(w_o) + ρa√d.   (10.66)
Substituting this into (10.63), we obtain (10.59).
Again, we can use the distribution Q to express our preference of the hypotheses in
W. For example, we can choose Q = N(w_Q, b²I_d) with b = n^{−1/4} d^{−1/4} ρ^{−1/2} and choose
β = n^{3/4} d^{1/4} ρ^{1/2}. Then, setting a = b in (10.59), we have
E[L_μ(W)] ≤ inf_{w∈W} L_μ(w) + ( d^{1/4} ρ^{1/2} / (2n^{1/4}) ) ( ‖w_Q − w_o‖_2² + 3 ).   (10.67)
This result places essentially no restrictions on W, which could be unbounded, and
only requires the Lipschitz condition on (·, z), which could be non-convex. The sample
complexity decreases with better prior knowledge of the optimal hypothesis.
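To unpack the arithmetic behind (10.67): with Q = N(w_Q, b²I_d) and a = b, the relative entropy in (10.59) is D( N(w_o, b²I_d) ‖ N(w_Q, b²I_d) ) = ‖w_Q − w_o‖_2²/(2b²), so the bound reads β/(2n) + bρ√d + ‖w_Q − w_o‖_2²/(2βb²). Substituting b = n^{−1/4} d^{−1/4} ρ^{−1/2} and β = n^{3/4} d^{1/4} ρ^{1/2} makes each of the three terms a multiple of d^{1/4} ρ^{1/2}/(2n^{1/4}), namely 1, 2, and ‖w_Q − w_o‖_2² times that factor, which sums to (10.67).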
Proof We will prove that any (ε, δ)-differentially private learning algorithm is (1 −
e−ε (1 − δ))-stable in total variation. To that end, let us rewrite the differential privacy
condition (10.68) as follows. For any z, z′ with d_H(z, z′) ≤ 1,
E_{e^ε}( P_{W|Z=z} ‖ P_{W|Z=z′} ) ≤ δ,   (10.71)
where the E_γ-divergence (with γ ≥ 1) between two probability measures μ and ν on a
measurable space (Ω, F) is defined as
E_γ(μ ‖ ν) := max_{E∈F} ( μ(E) − γν(E) )
(see, e.g., [34] or p. 28 of [35]). It satisfies the following inequality (see Section VII.A
of [36]):
E_γ(μ ‖ ν) ≥ 1 − γ(1 − d_TV(μ, ν)).   (10.72)
Using (10.72) in (10.71) with γ = e^ε, we get
d_H(z, z′) ≤ 1 ⟹ d_TV( P_{W|Z=z}, P_{W|Z=z′} ) ≤ 1 − e^{−ε}(1 − δ),
and therefore A is (1 − e−ε (1 − δ))-stable in erasure T -information by Lemma 10.4.
Using Theorem 10.1, we obtain Eq. (10.69); the inequality (10.70) follows from (10.72),
Lemma 10.3, and Theorem 10.2.
For example, the Gibbs algorithm is (2β/n, 0)-differentially private [37], so
Theorem 10.7 is a special case of Theorem 10.8.
Just as in the case of the Gibbs algorithm, we can encode our preferences for (or prior
knowledge about) various hypotheses by controlling the amount of noise added to each
hypothesis. The following result formalizes this idea.
corollary 10.4 Suppose W = {w_m}_{m=1}^∞ is countably infinite, and the ordering is such
that a hypothesis with a lower index is preferred over one with a higher index. Also sup-
pose ℓ ∈ [0, 1]. For the noisy ERM algorithm in (10.73), choosing N_m to be an exponential
random variable with mean b_m, we have
E[L_μ(W)] ≤ min_m L_μ(w_m) + b_{m_o} + √( (1/2n) Σ_{m=1}^∞ L_μ(w_m)/b_m ) − ( Σ_{m=1}^∞ 1/b_m )^{−1},   (10.74)
and, with the choice b_m = m^{1.1}/n^{1/3},
E[L_μ(W)] ≤ min_m L_μ(w_m) + ( m_o^{1.1} + 3 ) / n^{1/3}.   (10.75)
Without adding noise, the ERM algorithm applied to the above case when card(W) =
k < ∞ can achieve E[L_μ(W_ERM)] ≤ min_{m∈[k]} L_μ(w_m) + √( (1/(2n)) log k ). Compared with
(10.75), we see that performing noisy ERM may be beneficial when we have high-
quality prior knowledge of w_o and when k is large.
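As an illustration, noisy ERM can be sketched in a few lines (the representation of hypotheses as callables, the bounded loss function loss, and the mean schedule b_m = m^{1.1}/n^{1/3} are assumptions consistent with the discussion above): perturb each empirical risk by independent exponential noise of mean b_m and return the minimizer.

import numpy as np

def noisy_erm(hypotheses, Z, loss, b, rng=None):
    """Noisy ERM: return argmin_m  L_Z(w_m) + N_m  with N_m ~ Exp(mean b[m])."""
    rng = np.random.default_rng() if rng is None else rng
    emp = np.array([np.mean([loss(w, z) for z in Z]) for w in hypotheses])  # L_Z(w_m)
    noise = rng.exponential(scale=np.asarray(b, dtype=float))               # N_m, mean b_m
    return hypotheses[int(np.argmin(emp + noise))]

# ordering-aware schedule assumed above:
# n = len(Z); b = [(m + 1) ** 1.1 / n ** (1 / 3) for m in range(len(hypotheses))]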
Proof We prove the result assuming card (W) = k < ∞. When W is countably infinite,
the proof carries over by taking the limit as k → ∞.
First, we upper-bound the expected generalization error via I(W; Z). We have the
following chain of inequalities:
I(W; Z) ≤ I( (L_Z(w_m) + N_m)_{m∈[k]} ; (L_Z(w_m))_{m∈[k]} )
≤ Σ_{m=1}^k I( L_Z(w_m); L_Z(w_m) + N_m )
≤ Σ_{m=1}^k log( 1 + E[L_Z(w_m)]/b_m )
= Σ_{m=1}^k log( 1 + L_μ(w_m)/b_m ),   (10.76)
where we have used the data-processing inequality for mutual information; the fact that,
for product channels, the overall input–output mutual information is upper-bounded by
the sum of the input–output mutual informations of the individual channels [38]; the formula
for the capacity of the additive exponential noise channel under an input mean constraint
[39]; and the fact that E[L_Z(w_m)] = L_μ(w_m). The assumption that ℓ takes values in [0, 1]
implies that ℓ(w, Z) is 1/4-sub-Gaussian for all w ∈ W, and, as a consequence of (10.76),
gen(μ, P_{W|Z}) ≤ √( (1/(2n)) Σ_{m=1}^k log( 1 + L_μ(w_m)/b_m ) ).   (10.77)
We now upper-bound the expected empirical risk. By the construction of the algorithm,
almost surely
L_Z(W) = L_Z(W) + N_W − N_W
≤ L_Z(w_{m_o}) + N_{m_o} − N_W
≤ L_Z(w_{m_o}) + N_{m_o} − min{ N_m : m ∈ [k] }.
Taking the expectation on both sides, we get
E[L_Z(W)] ≤ L_μ(w_{m_o}) + b_{m_o} − ( Σ_{m=1}^k 1/b_m )^{−1}.   (10.78)
Combining (10.77) and (10.78), we have
E[L_μ(W)] ≤ min_{m∈[k]} L_μ(w_m) + √( (1/(2n)) Σ_{m=1}^k log( 1 + L_μ(w_m)/b_m ) ) + b_{m_o} − ( Σ_{m=1}^k 1/b_m )^{−1},
and (10.74) follows by using log(1 + x) ≤ x. To obtain (10.75), set b_m = m^{1.1}/n^{1/3} and use
Σ_{m=1}^k 1/m^{1.1} ≤ 11 − 10 k^{−1/10}.
If the Markov chain Z − Λ_{W_j}(Z) − W_j holds conditional on W^{j−1} for j = 1, ..., k, then the
upper bound in (10.79) can be sharpened to Σ_{j=1}^k I(W_j; Λ_{W_j}(Z) | W^{j−1}). Thus, we can
control the generalization error of the final output by controlling the conditional mutual
information at each step of the composition. This also gives us a way to analyze the gen-
eralization error of the composite learning algorithm using local information-theoretic
properties of the constituent algorithms. Recent work by Feldman and Steinke [42] pro-
poses a stronger notion of stability in erasure mutual information (uniformly with respect
to all, not necessarily product, distributions of Z) and applies it to adaptive data analysis.
Armed with this notion, they design and analyze a noise-adding algorithm that calibrates
the noise variance to the empirical variance of the data. A related information-theoretic
notion of stability was also proposed by Cuff and Yu [43] in the context of differential
privacy.
Acknowledgments
The authors would like to thank the referees and Nir Weinberger for reading the
manuscript and for suggesting a number of corrections, and Vitaly Feldman for valu-
able general comments and for pointing out the connection to the work of McAllester
on PAC-Bayes bounds. The work of A. Rakhlin is supported in part by the NSF under
grant no. CDS&E-MSS 1521529 and by the DARPA Lagrange program. The work of
M. Raginsky and A. Xu is supported in part by NSF CAREER award CCF–1254041 and
in part by the Center for Science of Information (CSoI), an NSF Science and Technology
Center, under grant agreement CCF-0939370.
References
[7] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Machine Learning Res.,
vol. 2, pp. 499–526, 2002.
[8] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, “General conditions for predictivity in
learning theory,” Nature, vol. 428, no. 6981, pp. 419–422, 2004.
[9] S. Kutin and P. Niyogi, “Almost-everywhere algorithmic stability and generalization error,”
in Proc. 18th Conference on Uncertainty in Artificial Intelligence (UAI 2002), 2002, pp.
275–282.
[10] A. Rakhlin, S. Mukherjee, and T. Poggio, “Stability results in learning theory,” Analysis
Applications, vol. 3, no. 4, pp. 397–417, 2005.
[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in
private data analysis,” in Proc. Theory of Cryptography Conference, 2006, pp. 265–284.
[12] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generalization in
adaptive data analysis and holdout reuse,” arXiv:1506.02629, 2015.
[13] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations
and Trends in Theoretical Computer Sci., vol. 9, nos. 3–4, pp. 211–407, 2014.
[14] P. Kairouz, S. Oh, and P. Viswanath, “The composition theorem for differential privacy,” in
Proc. 32nd International Conference on Machine Learning (ICML), 2015, pp. 1376–1385.
[15] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of
stability and bias of learning algorithms,” in Proc. IEEE Information Theory Workshop,
2016.
[16] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of
learning algorithms,” in Proc. Conference on Neural Information Processing Systems,
2017.
[17] H. Strasser, Mathematical theory of statistics: Statistical experiments and asymptotic
decision theory. Walter de Gruyter, 1985.
[18] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,”
IEEE Trans. Information Theory, vol. 62, no. 1, pp. 35–55, 2016.
[19] S. Verdú and T. Weissman, “The information lost in erasures,” IEEE Trans. Information
Theory, vol. 54, no. 11, pp. 5030–5058, 2008.
[20] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer.
Statist. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
[21] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic
theory of independence. Oxford University Press, 2013.
[22] T. M. Cover and J. A. Thomas, Elements of information theory, 2nd edn. Wiley, 2006.
[23] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman, “Algorithmic sta-
bility for adaptive data analysis,” in Proc. 48th ACM Symposium on Theory of Computing
(STOC), 2016, pp. 1046–1059.
[24] D. McAllester, “PAC-Bayesian model averaging,” in Proc. 1999 Conference on Learning
Theory, 1999.
[25] D. McAllester, “A PAC-Bayesian tutorial with a dropout bound,” arXiv:1307.2118, 2013,
http://arxiv.org/abs/1307.2118.
[26] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,”
in Proc. 19th International Conference on Artificial Intelligence and Statistics (AISTATS),
2016, pp. 1232–1240.
[27] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for
general measurements,” in Proc. IEEE International Symposium on Information Theory,
2017, pp. 1475–1479.
[28] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Preserving sta-
tistical validity in adaptive data analysis,” in Proc. 47th ACM Symposium on Theory of
Computing (STOC), 2015.
[29] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generaliza-
tion in adaptive data analysis and holdout reuse,” in 28th Annual Conference on Neural
Information Processing Systems (NIPS), 2015.
[30] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, “Learners that use little
information,” in Proc. Conference on Algorithmic Learning Theory (ALT), 2018.
[31] R. Vershynin, High-dimensional probability: An introduction with applications in data
science. Cambridge University Press, 2018.
[32] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition.
Springer, 1996.
[33] K. L. Buescher and P. R. Kumar, “Learning by canonical smooth estimation. I. Simultane-
ous estimation,” IEEE Trans. Automatic Control, vol. 41, no. 4, pp. 545–556, 1996.
[34] J. E. Cohen, J. H. B. Kemperman, and G. Zbǎganu, Comparisons of stochastic matrices,
with applications in information theory, statistics, economics, and population sciences.
Birkhäuser, 1998.
[35] Y. Polyanskiy, “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation,
Princeton University, 2010.
[36] I. Sason and S. Verdú, “f-divergence inequalities,” IEEE Trans. Information Theory,
vol. 62, no. 11, pp. 5973–6006, 2016.
[37] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in Proc. 48th
Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
[38] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” Lecture Notes for
ECE563 (UIUC) and 6.441 (MIT), 2012–2016, http://people.lids.mit.edu/yp/homepage/
data/itlectures_v4.pdf.
[39] S. Verdú, “The exponential distribution in information theory,” Problems of Information
Transmission, vol. 32, no. 1, pp. 86–95, 1996.
[40] M. Raginsky, “Strong data processing inequalities and Φ-Sobolev inequalities for discrete
channels,” IEEE Trans. Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[41] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
[42] V. Feldman and T. Steinke, “Calibrating noise to variance in adaptive data analysis,” in
Proc. 2018 Conference on Learning Theory, 2018.
[43] P. Cuff and L. Yu, “Differential privacy as a mutual information constraint,” in Proc.
2016 ACM SIGSAC Conference on Computer and Communication Security (CCS), 2016,
pp. 43–54.
11 Information Bottleneck and
Representation Learning
Pablo Piantanida and Leonardo Rey Vega
Information theory aims to characterize the fundamental limits for data compression,
communication, and storage. Although the coding techniques used to prove these funda-
mental limits are impractical, they provide valuable insight, highlighting key properties
of good codes and leading to designs approaching the theoretical optimum (e.g., turbo
codes, ZIP and JPEG compression algorithms). On the other hand, statistical models and
machine learning are used to acquire knowledge from data. Models identify relationships
between variables that allow one to make predictions and assess their accuracy. A good
choice of data representation is paramount for performing large-scale data processing
in a computationally efficient and statistically meaningful manner [1], allowing one to
decrease the need for storage, or to reduce inter-node communication if the data are
distributed.
Shannon’s abstraction of information merits careful study [2]. While a layman might
think that the problem of communication is to convey meaning, Shannon clarified
that “the fundamental problem of communication is that of reproducing at one point
a message selected at another point.” Shannon further argued that the meaning of a
message is subjective, i.e., dependent on the observer, and irrelevant to the engineering
problem of communication. However, what does matter for the theory of communica-
tion is finding suitable representations for given data. In source coding, for example, one
generally aims at distilling the relevant information from the data by removing unneces-
sary redundancies. This can be cast in information-theoretic terms, as higher redundancy
makes data more predictable and lowers their information content.
In the context of learning [3, 4], we propose to distinguish between these two rather
different aspects of data: information and knowledge. Information contained in data is
unpredictable and random, while additional structure and redundancy in the data stream
constitute knowledge about the data-generation process, which a learner must acquire.
Indeed, according to connectionist models [5], the redundancy contained within mes-
sages enables the brain to build up its cognitive maps and the statistical regularities in
these messages are being used for this purpose. Hence, this knowledge, provided by
redundancy [6, 7] in the data, must be what drives unsupervised learning. While infor-
mation theory is a unique success story, from its birth, it discarded knowledge as being
irrelevant to the engineering problem of communication. However, knowledge is recog-
nized as being a critical – almost central – component of representation learning. The
present text provides an information-theoretic treatment of this problem.
Knowledge representation. The data deluge of recent decades has led to new expec-
tations for scientific discoveries from massive data. While mankind is drowning in data, a
significant part of the data is unstructured, and it is difficult to discover relevant informa-
tion. A common denominator in these novel scenarios is the challenge of representation
learning: how to extract salient features or statistical relationships from data in order
to build meaningful representations of the relevant content. In many ways, deep neural
networks have turned out to be very good at discovering structures in high-dimensional
data and have dramatically improved the state-of-the-art in several pattern-recognition
tasks [8]. The global learning task is decomposed into a hierarchy of layers with non-
linear processing, a method achieving great success due to its ability not only to fit
different types of datasets but also to generalize incredibly well. The representational
capabilities of neural networks [9] have attracted significant interest from the machine-
learning community. These networks seem to be able to learn multi-level abstractions,
with a capability to harness unlabeled data, multi-task learning, and multiple inputs,
while learning from distributed and hierarchical data, to represent context at multiple
levels.
The actual goal of representation learning is neither accurate estimation of model
parameters [10] nor compact representation of the data themselves [11, 12]; rather,
we are mostly interested in the generalization capabilities, meaning the ability to suc-
cessfully apply rules extracted from previously seen data to characterize unseen data.
According to the statistical learning theory [13], models with many parameters tend to
overfit by representing the learned data too accurately, therefore diminishing their abil-
ity to generalize to unseen data. In order to reduce this “generalization gap,” i.e., the
difference between the “training error” and the “test error” (a measure of how well the
learner has learned), several regularization methods were proposed in the literature [13].
A recent breakthrough in this area has been the development of dropout [14] for training
deep neural networks. This consists in randomly dropping units during training to
prevent their co-adaptation, including some information-based regularization [15] that
yields a slightly more general form of the variational auto-encoder [16].
Why is it that we succeed in learning high-dimensional representations? Recently
there has been much interest in understanding the importance of implicit regularization.
Numerical experiments in [17] demonstrate that network size may not be the main form
of capacity control for deep neural networks and, hence, some other, unknown, form of
capacity control plays a central role in learning multilayer feed-forward networks. From
a theoretical perspective, regularization seems to be an indispensable component in order
to improve the final misclassification probability, while convincing experiments support
the idea that the absence of all regularization does not necessarily induce a poor gener-
alization gap. Possible explanations were approached via rate-distortion theory [18, 19]
by exploring heuristic connections with the celebrated information-bottleneck princi-
ple [20]. Within the same line of work, in [21, 22] Russo and Zou and Xu and Raginsky
have proven bounds showing that the square root of the mutual information between
the training inputs and the parameters inferred from the training algorithm provides a
concise bound on the generalization gap. These bounds crucially depend on the Markov
operator that maps the training set into the network parameters, whose characteriza-
tion may not be an easy task. Similarly, in [23] Achille and Soatto explored how the
use of an information-bottleneck objective on the network parameters (and not on the
representations) may help to avoid overfitting while enforcing invariant representations.
The interplay between information and complexity. The goal of data representation
may be cast as trying to find regularity in the data. Regularity may be identified with the
“ability to compress” by viewing representation learning as lossy data compression: this
tells us that, for a given set of encoder models and dataset, we should try to find the
encoder or combination of encoders that compresses the data most. In this sense, we
may speak of the information complexity of a structure, meaning the minimum amount
of information (number of bits) we need to store enough information about the structure
that allows us to achieve its reconstruction. The central result in this chapter states that
good representation models should squeeze out as much regularity as possible from
the given data. In other words, representations are expected to distill the meaningful
information present in the data, i.e., to separate structure, seen as the regularity, from
noise, interpreted as the accidental information.
The structure of this chapter. This chapter can be read without any prior knowledge
of information theory and statistical learning theory. In the first part, the basic learning
framework for analysis is developed and an accessible overview of basic concepts in
statistical learning theory and the information-bottleneck principle is presented. The
second part introduces an upper bound to the generalization gap corresponding to the
cross-entropy loss and shows that, when this penalty term times a suitable multiplier
and the cross-entropy empirical risk are minimized jointly, the problem is equivalent
to optimizing the information-bottleneck objective with respect to the empirical data
distribution. The notion of information complexity is introduced and intuitions behind it
are developed.
Minimizing over all possible classifiers QY|X gives the smallest average probability
of misclassification. An optimum classifier c (·) chooses the hypothesis y ∈ Y with
largest posterior probability PY|X given the observation x, that is the maximum a pos-
teriori (MAP) decision. The MAP test that breaks ties randomly with equal probability
is given by1
Q^MAP_{Ŷ|X}(y|x) = 1/|B(x)|  if y ∈ B(x),  and  0  otherwise,   (11.2)
where B(x) ⊆ Y denotes the set of labels attaining max_{y′∈Y} P_{Y|X}(y′|x).
This classification rule is called the Bayes decision rule. The Bayes decision rule is opti-
mal in the sense that no other decision rule has a smaller probability of misclassification.
It is straightforward to obtain the following lemma.
1 In general, the optimum classifier given in (11.2) is not unique. Any conditional p.m.f. with support in
B(x) for each x ∈ X will be equally good.
lemma 11.1 (Bayes error) The misclassification error rate of the Bayes decision rule is
given by
P_E( Q^MAP_{Ŷ|X} ) = 1 − E_{P_X}[ max_{y′∈Y} P_{Y|X}(y′|X) ].   (11.3)
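For a finite joint p.m.f. the Bayes error (11.3) can be computed directly; the following toy example (ours, with an arbitrary 3 × 2 joint distribution) simply evaluates 1 − Σ_x max_y P_XY(x, y).

import numpy as np

def bayes_error(P_XY):
    """Misclassification rate of the Bayes (MAP) rule for a finite joint p.m.f.
    P_XY of shape (|X|, |Y|):  1 - E_{P_X}[ max_y P_{Y|X}(y|X) ]."""
    P_XY = np.asarray(P_XY, dtype=float)
    # 1 - sum_x max_y P_XY(x, y); the P_X factor cancels inside the max
    return 1.0 - P_XY.max(axis=1).sum()

# toy joint distribution over |X| = 3 inputs and |Y| = 2 labels
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.20, 0.10]])
print(bayes_error(P))   # 1 - (0.30 + 0.25 + 0.20) = 0.25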
Finding the Bayes decision rule requires knowledge of the underlying distribution
PXY , but typically in applications these distributions are not known. In fact, even a para-
metric form or an approximation to the true distribution is unknown. In this case, the
learner tries to overcome the lack of knowledge by resorting to labeled examples. In
addition, the probability of misclassification using the labeled examples has the par-
ticularity that it is mathematically hard to solve for the optimal decision rule. As a
consequence, it is common to work with a surrogate (information measure) given by
the average logarithmic loss or cross-entropy loss. This loss is used when a probabilis-
tic interpretation of the scores is desired by measuring the dissimilarity between the true
label distribution P_{Y|X} and the predicted label distribution Q_{Ŷ|X}, and is defined below.
lemma 11.2 (Surrogate based on the average logarithmic loss) A natural surrogate for
the probability of misclassification P_E(Q_{Ŷ|X}) corresponding to a classifier Q_{Ŷ|X} is given
by the average logarithmic loss E_{P_XY}[ −log Q_{Ŷ|X}(Y|X) ], which satisfies
P_E(Q_{Ŷ|X}) ≤ 1 − exp( −E_{P_XY}[ −log Q_{Ŷ|X}(Y|X) ] ).   (11.4)
The average logarithmic loss can provide an effective and better-behaved surrogate for
the particular problem of minimizing the probability of misclassification [9]. Evidently,
the optimal decision rule for the average logarithmic loss is Q_{Ŷ|X} ≡ P_{Y|X}. This does not
match in general with the optimal decision rule Q^MAP_{Ŷ|X} for the probability of misclassification
in expression (11.2). Although the average logarithmic loss may induce an irre-
ducible gap with respect to the probability of misclassification, it is clear that when the
true P_{Y|X} concentrates around a particular value y(x) for each x ∈ X (which is necessary
for a statistical model P_{Y|X} to induce a low probability of misclassification) this gap could
be significantly reduced.
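The inequality (11.4) follows from a short application of Jensen's inequality: the probability of a correct decision of the randomized classifier is E_{P_XY}[ Q_{Ŷ|X}(Y|X) ] = E[ exp( log Q_{Ŷ|X}(Y|X) ) ] ≥ exp( −E_{P_XY}[ −log Q_{Ŷ|X}(Y|X) ] ), so that P_E(Q_{Ŷ|X}) = 1 − E_{P_XY}[ Q_{Ŷ|X}(Y|X) ] is at most the right-hand side of (11.4).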
The problem of finding a good classifier can be divided into that of simultaneously
finding a (possibly randomized) encoder QU|X : X → P(U) that maps raw data to a
representation, possibly living in a higher-dimensional (feature) space U, and a soft-
decoder Q_{Ŷ|U}: U → P(Y), which maps the representation to a probability distribution
on the label space Y. Although these mappings induce an equivalent classifier,
Q_{Ŷ|X}(y|x) = Σ_{u∈U} Q_{U|X}(u|x) Q_{Ŷ|U}(y|u),   (11.6)
We measure the expected performance of (Q_{U|X}, Q_{Ŷ|U}) via the risk function:
(Q_{U|X}, Q_{Ŷ|U}) ↦ L(Q_{U|X}, Q_{Ŷ|U}) ≜ E_{P_XY}[ ℓ( Q_{U|X}(·|X), Q_{Ŷ|U}(Y|·) ) ].   (11.10)
In addition to the points noted earlier, another crucial component of knowledge rep-
resentation is the use of deep representations. Formally speaking, we consider Kth-layer
randomized encoders {Q_{U_k|U_{k−1}}}_{k=1}^K with U_0 ≡ X instead of one randomized encoder
Q_{U|X}. Although this appears at first to be more general, it can be cast using the one-
layer randomized encoder formulation induced by the marginal distribution that relates
layer randomized encoder formulation induced by the marginal distribution that relates
the input layer and the output layer of the network. Therefore any result for the one-
layer formulation immediately implies a result for the Kth-layer formulation, and for
this reason we shall focus on the one-layer case without loss of generality.
lemma 11.3 (Optimal decoders) The minimum cross-entropy loss risk satisfies
inf_{Q_{Ŷ|U}: U→P(Y)} L(Q_{U|X}, Q_{Ŷ|U}) = H(Q_{Y|U} | Q_U),   (11.11)
where
Q_{Y|U}(y|u) = Σ_{x∈X} Q_{U|X}(u|x) P_{XY}(x, y) / Σ_{x∈X} Q_{U|X}(u|x) P_X(x).   (11.12)
Proof The proof follows from the non-negativity of the relative entropy by noticing that
L(Q_{U|X}, Q_{Ŷ|U}) = D( Q_{Y|U} ‖ Q_{Ŷ|U} | Q_U ) + H(Q_{Y|U} | Q_U).
which is only a function of the encoder model QU|X . However, the optimal decoder
cannot be determined since PXY is unknown.
The learner’s goal is to select QU|X and Q
Y|U by minimizing the risk (11.10). However,
since PXY is unknown the learner cannot directly measure the risk, and it is common to
measure the agreement of a pair of candidates with a finite training dataset in terms of
the empirical risk.
definition 11.3 (Empirical risk) Let P̂_XY denote the empirical distribution induced by the
training dataset S_n ≜ {(x_1, y_1), ..., (x_n, y_n)}. The empirical risk is
L_emp(Q_{U|X}, Q_{Ŷ|U}) ≜ E_{P̂_XY}[ ℓ( Q_{U|X}(·|X), Q_{Ŷ|U}(Y|·) ) ]   (11.14)
= (1/n) Σ_{i=1}^n ℓ( Q_{U|X}(·|x_i), Q_{Ŷ|U}(y_i|·) ).   (11.15)
lemma 11.4 (Optimality of empirical decoders) Given a randomized encoder Q_{U|X}:
X → P(U), define the empirical decoder with respect to the empirical distribution
P̂_XY as
Q̂_{Y|U}(y|u) ≜ Σ_{x∈X} Q_{U|X}(u|x) P̂_XY(x, y) / Σ_{x∈X} Q_{U|X}(u|x) P̂_X(x).   (11.16)
Then, the empirical risk can be lower-bounded uniformly over Q_{Ŷ|U}: U → P(Y) as
L_emp(Q_{U|X}, Q_{Ŷ|U}) ≥ L_emp(Q_{U|X}, Q̂_{Y|U}),   (11.17)
where equality holds provided that Q_{Ŷ|U} ≡ Q̂_{Y|U}, i.e., the optimal decoder is computed
from the encoder and the empirical distribution as done in (11.16).
Proof The inequality follows along the lines of Lemma 11.3 by noticing that
L_emp(Q_{U|X}, Q_{Ŷ|U}) = D( Q̂_{Y|U} ‖ Q_{Ŷ|U} | Q̂_U ) + L_emp(Q_{U|X}, Q̂_{Y|U}). Finally, the non-negativity
of the conditional relative entropy completes the proof.
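For finite alphabets, the empirical decoder (11.16) is a single matrix computation; a minimal sketch follows (ours; the array shapes and the numerical safeguard are assumptions).

import numpy as np

def empirical_decoder(Q_U_given_X, P_hat_XY):
    """Empirical decoder (11.16):
    Q_hat_{Y|U}(y|u) = sum_x Q_{U|X}(u|x) P_hat_XY(x,y) / sum_x Q_{U|X}(u|x) P_hat_X(x).
    Q_U_given_X has shape (|U|, |X|); P_hat_XY has shape (|X|, |Y|)."""
    joint_UY = Q_U_given_X @ P_hat_XY               # shape (|U|, |Y|): empirical P(u, y)
    Q_hat_U = joint_UY.sum(axis=1, keepdims=True)   # empirical marginal Q_hat_U(u)
    return joint_UY / np.maximum(Q_hat_U, 1e-12)    # rows are Q_hat_{Y|U}(.|u)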
Since the empirical risk is evaluated on finite samples, its evaluation may be sensi-
tive to sampling (noise) error, thus giving rise to the issue of generalization. It can be
argued that a key component of learning is not just the development of a representa-
tion model on the basis of a finite training dataset, but its use in order to generalize to
unseen data. Clearly, successful generalization necessitates the closeness (in some sense)
of the selected representation and decoder models. Therefore, successful representation
learning would involve successful generalization. This chapter deals with the informa-
tion complexity of successful generalization. The generalization gap defined below is a
measure of how an algorithm could perform on new data, i.e., data that are not available
during the training phase. In the light of Lemmas 11.3 and 11.4, we will restrict our anal-
ysis to encoders only and assume that the optimal empirical decoder has been selected,
i.e., Q_{Ŷ|U} ≡ Q̂_{Y|U} both in the empirical risk (11.14) and in the true risk (11.10). This is
reasonable given the fact that the true PXY is not known, and the only decoder that can
be implemented in practice is the empirical one.
definition 11.4 (Generalization gap) Given a stochastic mapping Q_{U|X}: X → P(U),
the generalization gap is defined as
(Q_{U|X}, S_n) ↦ E_gap(Q_{U|X}, S_n) ≜ L(Q_{U|X}, Q̂_{Y|U}) − L_emp(Q_{U|X}, Q̂_{Y|U}),   (11.18)
which represents the error incurred by the selected Q_{U|X} when the empirical risk
L_emp(Q_{U|X}, Q̂_{Y|U}) is used instead of the true risk L(Q_{U|X}, Q̂_{Y|U}).
provided that the chosen model class F is restricted [25]. The learner chooses a pair
(Q̂_{U|X}, Q̂_{Y|U}) with Q̂_{U|X} ∈ F that minimizes the empirical risk:
L_emp(Q̂_{U|X}, Q̂_{Y|U}) ≤ L_emp(Q_{U|X}, Q̂_{Y|U}),  for all Q_{U|X} ∈ F.   (11.20)
which depends on the empirical risk and the so-called generalization gap, respectively.
Expression (11.21) states that an adequate selection of the encoder should be performed
in order to minimize the empirical risk and the generalization gap simultaneously. It
is reasonable to expect that the optimal encoder achieving the minimal risk in (11.19)
does not belong to our restricted class of models F , so the learner may want to enlarge
the model classes F as much as possible. However, this could induce a larger value of
the generalization gap, which could lead to a trade-off between these two fundamental
quantities.
definition 11.9 (Approximation and estimation error) The sub-optimality of the model
class F is measured in terms of the approximation error:
E_app(F) ≜ inf_{Q_{U|X}∈F} L(Q_{U|X}, Q̂_{Y|U}) − L*.   (11.22)
definition 11.10 (Excess risk) The excess risk of the algorithm (11.20) selecting an
optimal pair (Q̂_{U|X}, Q̂_{Y|U}) can be decomposed as
E_exc(F, Q̂_{U|X}, Q̂_{Y|U}) ≜ E[ L(Q̂_{U|X}, Q̂_{Y|U}) ] − L*
= E_app(F) + E[ E_est(F, Q̂_{U|X}, Q̂_{Y|U}) ],
where the expectation is taken with respect to the random choice of the dataset S_n, which
induces the optimal pair (Q̂_{U|X}, Q̂_{Y|U}).
The approximation error Eapp (F ) measures how closely encoders in the model class
F can approximate the optimal solution L . On the other hand, the estimation error
, Q
Eest (F , Q ) measures the effect of minimizing the empirical risk instead of the
U|X Y|U
true risk, which is caused by the finite size of the training data. The estimation error is
determined by the number of training samples and by the complexity of the model class,
i.e., large models have smaller approximation error but lead to higher estimation errors,
and it is also related to the generalization error [25]. However, for the sake of simplicity,
in this chapter we restrict our attention only to the generalization gap.
(log M_n)/n ≤ R + ε,   (11.24)
E_{P_X^n}[ d̄( X^n; g_n(f_n(X^n)) ) ] ≤ D + ε,   (11.25)
where d̄(x^n; x̂^n) ≡ (1/n) Σ_{i=1}^n d(x_i; x̂_i).
The set of all achievable pairs (R, D) contains the complete characterization of all
the possible trade-offs between the rate R (which quantifies the level of compression of
the source X measuring the necessary number of bits per symbol) and the distortion D
(which quantifies the average fidelity level per symbol in the reconstruction g_n(f_n(X^n))).
2 In the asymptotic regime one considers that the number of realizations of the stochastic source to be
compressed tends to infinity. Although this could be questionable in practice, the asymptotic problem
reflects accurately the important trade-offs of the problem. In this presentation, our focus will be on the
asymptotic problem originally solved by Shannon.
It is the great achievement of Shannon [27] to have obtained the following result.
theorem 11.1 (Rate-distortion function) The rate-distortion function for source X with
reconstruction alphabet X̂ and with distortion function d(·; ·) is given by
R_{X,d}(D) = inf_{P_{X̂|X}: X→P(X̂), E_{P_{XX̂}}[d(X;X̂)] ≤ D} I( P_X; P_{X̂|X} ).   (11.27)
This function depends solely on the distribution PX and the distortion function d(·; ·)
and contains the exact trade-off between compression and fidelity that can be expected
for the particular source and distortion function. It is easy to establish that this func-
tion is positive, non-increasing in D, and convex. Moreover, there exists D > 0 such
that R_{X,d}(D) is finite, and we denote the minimum of such values of D by D_min, with
R_max ≜ lim_{D→D_min⁺} R_{X,d}(D). Although R_{X,d}(D) could be hard to compute in closed form
for a particular PX and d(·; ·), the problem in (11.27) is a convex optimization one, for
which there exist efficient numerical techniques. However, several important cases admit
closed-form expressions, such as the Gaussian case with quadratic distortion3 [24].
Another important function related to the rate-distortion function is the distortion-rate
function. This function can be defined independently from the rate-distortion function
and directly from information-theoretic principles. Intuitively, this function is the infi-
mum value of the distortion D as a function of the rate R for all (R, D) achievable pairs.
We will define it directly from the rate-distortion function:
R^{−1}_{X,d}(I) ≜ inf{ D ∈ R_{≥0} : R_{X,d}(D) ≤ I }.   (11.28)
Besides their obvious importance in the problem of source coding, the definitions of
the rate-distortion and distortion-rate functions will be useful for the problem of learn-
ing as presented in the previous section. They will permit one to establish connections
between the misclassification probability, the cross-entropy, and the mutual information
3 Although the Gaussian case does not correspond to a finite cardinality set X, the result in (11.27) can
easily be extended to that case using quantization arguments.
4 It is worth mentioning that by using R^{−1}_{X,d}(I) we are abusing notation. This is because in general it is not
true that R_{X,d}(D) is injective for every D ≥ 0. However, when I ∈ [R_min, R_max) with R_min ≜ R_{X,d}(D_max)
and D_max ≜ min_{x̂∈X̂} E_{P_X}[d(X; x̂)], under some very mild conditions on P_X and d(·; ·), R^{−1}_{X,d}(I) is the true
inverse of R_{X,d}(D), which is guaranteed to be injective in the interval D ∈ (D_min, D_max].
between the input X and the output of the encoder QU|X . These connections will be con-
ceptually important for the rest of the chapter, at least from a qualitative point of view.
From the above derivation, we can set a distortion measure: d(u; y) ≜ 1 − Q_{Ŷ|U}(y|u).
In this way, the probability of misclassification can be written as an average over the
outcomes of Y (taken as the source) and U (taken as the reconstruction) of the distortion
measure 1 − Q_{Ŷ|U}(y|u). In this manner, we can consider the following rate-distortion
function:
R_{Y,Q_{Ŷ|U}}(D) ≜ inf_{P_{U|Y}: Y→P(U), E_{P_UY}[1−Q_{Ŷ|U}(Y|U)] ≤ D} I( P_Y; P_{U|Y} ),   (11.30)
which provides a connection between the misclassification probability and the mutual
information I PY ; PU|Y .
From this formulation, we are able to obtain the following lemma, which provides an
upper and a lower bound on the probability of misclassification via the distortion-rate
function and the cross-entropy loss.
Proof The upper bound simply follows by using Jensen's inequality [24],
while the lower bound is a consequence of the definitions of the rate-distortion and
distortion-rate functions. The probability of misclassification corresponding to the clas-
sifier can be expressed by the expected distortion E_{P_XY Q_{U|X}}[d(Y, U)] = P_E(Q_{Ŷ|U}, Q_{U|X}),
which is based on the fidelity function d(y, u) ≜ 1 − Q_{Ŷ|U}(y|u) as shown in (11.29).
Because of the Markov chain Y −− X −− U, we can use the data-processing inequality
[24] and the definition of the rate-distortion function, obtaining the following bound for
the classification error:
I(P_X; Q_{U|X}) ≥ I(P_Y; Q_{U|Y})   (11.34)
≥ inf_{P_{U|Y}: Y→P(U), E_{P_UY}[d(Y,U)] ≤ E_{P_XY Q_{U|X}}[d(Y,U)]} I( P_Y; P_{U|Y} )   (11.35)
= R_{Y,Q_{Ŷ|U}}( P_E(Q_{Ŷ|U}, Q_{U|X}) ).   (11.36)
For E_{P_XY Q_{U|X}}[d(Y, U)], we can use the definition of R^{−1}_{Y,Q_{Ŷ|U}}(·), and thus obtain from
(11.34) the fundamental bound
R^{−1}_{Y,Q_{Ŷ|U}}( I(P_X; Q_{U|X}) ) ≤ R^{−1}_{Y,Q_{Ŷ|U}}( I(P_Y; Q_{U|Y}) ) ≤ P_E(Q_{Ŷ|U}, Q_{U|X}).
The lower bound in the above expression states that any limitation in terms of the
mutual information between the raw data and their representation will bound from below
the probability of misclassification while the upper bound shows that the cross-entropy
loss introduced in (11.10) can be used as a surrogate to optimize the probability of mis-
classification, as was also pointed out in Lemma 11.2. As a matter of fact, it appears that
the probability of misclassification is controlled by two fundamental information quan-
tities: the mutual information I(PX ; QU|X ) and the cross-entropy loss L(QY|U , QU|X ).
Consider the case of logarithmic distortion d(y; u) = −log P_{Y|U}(y|u), where
P_{Y|U}(y|u) = Σ_{x∈X} P_{U|X}(u|x) P_{XY}(x, y) / Σ_{x∈X} P_{U|X}(u|x) P_X(x).   (11.38)
The noisy lossy source coding problem with this choice of distortion function gives rise to the
celebrated information bottleneck [20]. In precise terms,
R_{XY,d}(D) = inf_{P_{U|X}: X→P(U), H(P_{Y|U}|P_U) ≤ D} I( P_X; P_{U|X} ).   (11.39)
Noticing that H(P_{Y|U}|P_U) = −I(P_Y; P_{U|Y}) + H(P_Y) and defining μ ≜ H(P_Y) − D, we can
write (11.39) as
R̄_XY(μ) = inf_{P_{U|X}: X→P(U), I(P_Y;P_{U|Y}) ≥ μ} I( P_X; P_{U|X} ).   (11.40)
Equation (11.40) summarizes the trade-off that exists between the level of compression
of the observable source X, using representation U, and the level of information
about the hidden source Y preserved by this representation. This function is called the
rate-relevance function, where μ is the minimum level of relevance we expect from
representation U when the rate used for the compression of X is R̄XY (μ). Notice that in
the information-bottleneck case the distortion d(y; u) depends on the optimal conditional
distribution P∗U|X through (11.38). This makes the problem of characterizing R̄XY (μ)
more difficult than (11.37), in which the distortion function is fixed. In fact, although
R̄XY (μ) is positive, non-decreasing, and convex, the problem in (11.40) is not convex,
which leads to the need for more sophisticated tools for its solution. Moreover, from the
corresponding operational definition for the lossy source coding problem (analogous
to Definitions 11.11 and 11.12), it is clear that the distortion function for sequences
Y^n and U^n is applied symbol-by-symbol, d̄(Y^n; U^n) = −(1/n) Σ_{i=1}^n log P_{Y|U}(Y_i|U_i),
implying a memoryless condition between the hidden source realization Y^n and the description
U^n = f_n(X^n). It is possible to show [29, 30] that, if we apply a full logarithmic distortion
d̄(Y^n; U^n) = −(1/n) log P_{Y^n|U^n}(Y^n|U^n), not necessarily additive as in the previous case,
the rate-relevance function in (11.40) remains unchanged, where relevance is measured
by the non-additive multi-letter mutual information:
d̄(Y^n; U^n) ≡ (1/n) I( P_{Y^n}; P_{f_n(X^n)|Y^n} ).   (11.41)
As a simple example in which the rate-relevance function in (11.40) can be calculated
in closed form, we can consider the case in which X and Y are jointly Gaussian with
zero mean, variances σ2X and σ2Y , and Pearson correlation coefficient given by ρXY .
Using standard information-theoretic arguments [30], it can be shown that the optimal
distribution PU|X is also Gaussian, with mean X and variance given by
σ²_{U|X} = σ²_X ( 2^{−2μ} − (1 − ρ²_{XY}) ) / ( 1 − 2^{−2μ} ).   (11.42)
With this choice for P_{U|X} we easily obtain that I(P_Y; P_{U|Y}) = μ and that
R̄_XY(μ) = (1/2) log( ρ²_{XY} / ( 2^{−2μ} − (1 − ρ²_{XY}) ) ),   0 ≤ μ ≤ (1/2) log( 1/(1 − ρ²_{XY}) ).   (11.43)
It is interesting to observe that R̄XY (μ) depends only on the structure of the sources X
and Y through the correlation coefficient ρXY and not on their variances. It should also
be noted that the level of relevance μ is constrained to lie in a bounded interval. This is
not surprising because of the Markov chain U −− X −− Y, the maximum value for the
relevance level is I PX ; PY|X , which is easily shown to be equal to 12 log (1/(1 − ρ2XY )).
The maximum level of relevance is achievable only as long as the rate R → ∞, that is,
when the source X is minimally compressed. The trade-off between rate and relevance
for this simple example can be appreciated in Fig. 11.1 for ρXY = 0.9.
Figure 11.1 R̄_XY(μ) (vertical axis, from 0 to 10) as a function of μ (horizontal axis, from 0 to 1.2) for ρ_XY = 0.9.
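The curve of Fig. 11.1 can be reproduced directly from the closed form (11.43); the following short sketch (ours, using base-2 logarithms and a grid of relevance values below the maximum) evaluates R̄_XY(μ) for ρ_XY = 0.9.

import numpy as np

def rate_relevance_gaussian(mu, rho):
    """R_bar_XY(mu) from (11.43) for jointly Gaussian (X, Y) with correlation rho."""
    mu_max = 0.5 * np.log2(1.0 / (1.0 - rho ** 2))      # maximum achievable relevance
    assert np.all((0 <= mu) & (mu < mu_max)), "relevance outside [0, mu_max)"
    return 0.5 * np.log2(rho ** 2 / (2.0 ** (-2 * mu) - (1.0 - rho ** 2)))

mu = np.linspace(0.0, 1.19, 50)          # grid below mu_max ~ 1.198 for rho = 0.9
R = rate_relevance_gaussian(mu, rho=0.9)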
On recognizing that E_{P_XY Q_{U|X}}[ −log Q_{Y|U}(Y|U) ] = H(Q_{Y|U}|Q_U), where Q_U(u) =
Σ_{x∈X} Q_{U|X}(u|x) P_X(x), we see that (11.45) is closely related to the information bot-
tleneck and to the rate-relevance function defined in (11.40). In fact, the problem in
(11.45) can be equivalently written as
sup_{Q_{U|X}: X→P(U)} I( P_Y; Q_{U|Y} ) − β · I( P_X; Q_{U|X} ),   (11.47)
with Q_{U|Y}(u|y) = Σ_{x∈X} Q_{U|X}(u|x) P_{X|Y}(x|y). We can easily see that (11.47) is the
dual problem to (11.40): it looks for the supremum of the relevance μ
subject to a given rate R. The value of β (which can be interpreted as a Lagrange
multiplier [33]) acts as a hyperparameter which controls the trade-off
between I(P_Y; Q_{U|Y}) (relevance) and I(P_X; Q_{U|X}) (rate). In more precise terms, consider
the following set:
R ≜ { (μ, R) ∈ R²_{≥0} : ∃ Q_{U|X}: X → P(U) s.t. R ≥ I(P_X; Q_{U|X}), μ ≤ I(P_Y; Q_{U|Y}), U −− X −− Y }.   (11.48)
It is easy to show that this region corresponds to the set of achievable values of
relevance and rate (μ, R) for the corresponding noisy lossy source coding problem with
logarithmic distortion as was defined in Section 11.3.3. This set is closed and convex
and it is not difficult to show that [34]
sup_{Q_{U|X}: X→P(U), I(P_X;Q_{U|X}) ≤ R} I( P_Y; Q_{U|Y} ) = sup{ μ : (μ, R) ∈ R }.   (11.49)
Using convex optimization theory [33], we can easily conclude that (11.47) corresponds
to obtaining the supporting hyperplane of region R with slope β. As any convex and
closed set is characterized by all of its supporting hyperplanes, by varying β and solving
(11.47) we are reconstructing the upper boundary of R which coincides with (11.49).
In other words, the hyperparameter β is directly related to the value of R at which we
are considering the maximum possible value of the relevance μ; equivalently, the value of β controls the complexity of the representations of X, as was
pointed out above.
It remains only to discuss the implementation of a procedure for solving (11.47).
Unfortunately, although the set R characterizing the solutions of (11.47) is convex, it
is not true that (11.47) is itself a convex optimization problem. However, the structure
of the problem allows the use of efficient numerical optimization procedures that
guarantee convergence to locally optimal solutions. These numerical procedures are
basically Blahut–Arimoto (BA)-type algorithms, a name that refers to the
class of algorithms for numerically computing the capacity of a noisy channel and the
rate-distortion function for given channel and source distributions, respectively [35, 36].
For these reasons, these algorithms can be applied with minor changes to the problem
(11.47), as was done in [20].
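A minimal BA-type iteration for (11.47) over finite alphabets is sketched below (an illustration following the self-consistent equations of [20]; the initialization, the fixed number of iterations, and the numerical safeguards are our own choices, and convergence is to a local optimum only).

import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) between two p.m.f. vectors (nats), with 0 log 0 = 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-300)))

def ib_blahut_arimoto(P_XY, n_u, beta, iters=200, seed=0):
    """BA-type iterations for the objective (11.47):
    maximize I(P_Y; Q_{U|Y}) - beta * I(P_X; Q_{U|X}) over Q_{U|X}.
    Returns Q_{U|X} with shape (|U|, |X|)."""
    P_XY = np.asarray(P_XY, float)
    P_X = P_XY.sum(axis=1)                       # P_X(x)
    P_Y_given_X = P_XY / P_X[:, None]            # P_{Y|X}(y|x), shape (|X|, |Y|)
    rng = np.random.default_rng(seed)
    Q_U_given_X = rng.random((n_u, P_XY.shape[0]))
    Q_U_given_X /= Q_U_given_X.sum(axis=0, keepdims=True)
    for _ in range(iters):
        Q_U = Q_U_given_X @ P_X                  # marginal Q_U(u)
        Q_UY = Q_U_given_X @ P_XY                # joint over (u, y)
        Q_Y_given_U = Q_UY / np.maximum(Q_UY.sum(axis=1, keepdims=True), 1e-300)
        # self-consistent update: Q(u|x) proportional to
        # Q(u) * exp(-(1/beta) * D(P_{Y|X}(.|x) || Q_{Y|U}(.|u)))
        D = np.array([[kl(P_Y_given_X[x], Q_Y_given_U[u])
                       for x in range(P_XY.shape[0])] for u in range(n_u)])
        log_unnorm = np.log(np.maximum(Q_U, 1e-300))[:, None] - D / beta
        Q_U_given_X = np.exp(log_unnorm - log_unnorm.max(axis=0, keepdims=True))
        Q_U_given_X /= Q_U_given_X.sum(axis=0, keepdims=True)
    return Q_U_given_X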
Clearly, for the solution of (11.47) we need as input the distribution PXY . When
only training samples and labels Sn {(x1 , y1 ), . . . , (xn , yn )} are available, we use the
empirical distribution PXY instead of the true distribution PXY .
In Fig. 11.2, we plot what we call the excess risk (as presented in Definition 11.10),
rewritten as
Excess risk ≜ H( Q^{*,β}_{Y|U} | Q^{*,β}_U ) − H( P_{Y|X} | P_X ),   (11.50)
where Q^{*,β}_{Y|U} and Q^{*,β}_U are computed by using the optimal solution Q^{*,β}_{U|X} of (11.47) and the
empirical distribution P̂_XY. As β defines unequivocally the value of I(P_X; Q^{*,β}_{U|X}), which
is basically the rate or complexity associated with the chosen encoder, we choose the
horizontal axis to be labeled by rate R. Experiments were performed by using synthetic
data with alphabets |X| = 128 and |Y| = 4. The excess-risk curve as a function of the
rate constraint for different sizes of training samples is plotted.
Figure 11.2 Excess risk (11.50) as a function of the rate R, i.e., the mutual information between the representation U and the corresponding input X, for training sets of 10^3, 10^4, and 10^5 samples (the horizontal axis runs from 0 to H(P_X)).
With dashed vertical
lines, we denote the rate for which the excess risk achieves its minimum. When the
number of training samples increases, the optimal rate R approaches its maximum
possible value: H(PX ) (black vertical dashed line on the far right). We emphasize that
for every curve there exists a different limiting rate Rlim , such that, for each R ≥ Rlim , the
excess risk remains constant. It is not difficult to check that R_lim = H(P̂_X).
Furthermore, for every training-set size there is an optimal value R_opt
which provides the lowest excess risk in (11.50). In a sense, this indicates that the
rate R can be interpreted as an effective regularization term and, thus, it can provide
robustness for learning in practical scenarios in which the true input distribution is not
known and the empirical data distribution is used. It is worth mentioning that when
more data are available the optimal value of the regularizing rate R becomes less critical.
This fact was expected since, when the amount of training data increases, the empirical
distribution approaches the data-generating distribution.
In the next section, we provide a formal mathematical proof of the explicit relation
between the generalization gap and the rate constraint, which explains the heuristic
observations presented in Fig. 11.2.
In the following, we will denote L(Q_{U|X}) ≡ L(Q_{U|X}, Q̂_{Y|U}) and L_emp(Q_{U|X}) ≡
L_emp(Q_{U|X}, Q̂_{Y|U}). We will study informational bounds on the generalization
gap (11.18). More precisely, the goal is to find the learning rate ε_n(Q_{U|X}, S_n, γ_n) such that
P[ E_gap(Q_{U|X}, S_n) > ε_n(Q_{U|X}, S_n, γ_n) ] ≤ γ_n,   (11.51)
which depends on the empirical risk and the so-called generalization gap. Expression
(11.52) states that a suitable selection of the encoder can be obtained by minimizing the
empirical risk and the generalization gap simultaneously, that is
L_emp(Q_{U|X}) + λ · ε_n(Q_{U|X}, S_n, γ_n),   (11.53)
for some suitable multiplier λ ≥ 0. It is reasonable to expect that the optimal encoder
achieving the minimal risk in (11.10) does not belong to F , so we may want to enlarge
the model classes as much as possible. However, as usual, we expect a sensitive trade-off
between these two fundamental quantities.
L(Q_{U|X}, Q̂_{Y|U}) ≤ H( Q̂_{Y|U} | Q̂_U ) + A_δ √( I(P̂_X; Q_{U|X}) log(n) / n ) + C_δ/√n + O( log(n)/n ).   (11.57)
An interesting connection between the empirical risk minimization of the cross-
entropy loss and the information-bottleneck method presented in the previous section
arises which motivates formally the following algorithm [15, 20, 37].
definition 11.13 (Information-bottleneck algorithm) A representation learning
algorithm inspired by the information-bottleneck principle [20] consists in finding an
encoder Q_{U|X} ∈ F that minimizes over the random choice S_n ∼ P^n_XY the functional
L^{(λ)}_IB(Q_{U|X}) ≜ H( Q̂_{Y|U} | Q̂_U ) + λ · I( P̂_X; Q_{U|X} ),   (11.58)
for a suitable multiplier λ > 0, where Q̂_{Y|U} is given by (11.16) and Q̂_U is its denominator.
where there are two kinds of parameters: a structure parameter k and real-valued
parameters θ that depend on the structure; e.g., Θ_k may account for
different numbers of layers or nonlinearities, while P_k(Z) indicates different kinds of
noise distribution. Theorem 11.2 motivates the following model-selection principle for
learning compact representations.
Find a parameter k and real-valued parameters θ for the observed data S_n with which
the corresponding data representation can be encoded with the shortest code length:
inf_{θ∈Θ_k, k∈[1:K]} { L_emp( Q^{(θ,k)}_{U|X}, S_n ) + λ · I( P̂_X; Q^{(θ,k)}_{U|X} ) },   (11.61)
where the mutual information penalty term indicates the minimum of the expected
redundancy between the minimum code length⁵ (measured in bits) −log Q^{(θ,k)}_{U|X}(·|x) to
encode representations under a known data source and the best code length −log Q_U(·)
chosen to encode the data representations without knowing the input samples:
I( P̂_X; Q^{(θ,k)}_{U|X} ) = min_{Q_U∈P(U)} E_{P̂_X} E_{Q^{(θ,k)}_{U|X}}[ −log Q_U(U) + log Q^{(θ,k)}_{U|X}(U|X) ].   (11.62)
U|X
This information principle combines the empirical cross-entropy risk (11.14) with the
“information complexity” of the selected encoder (11.62), a regularization term
that acts as a sample-dependent penalty against overfitting. One may view (11.62) as
a means of comparing the appropriateness of distinct representation models
(e.g., the number of layers or the amount of noise), once a parametric class has been selected.
The coding interpretation of the penalty term in (11.61) is that the length of the
description of the representations themselves can be quantified in the same units
as the code length in data compression, namely, bits. In other words, for each data
sample x, a randomized encoder can induce different types of representations U(x) with
expected information length given by $H\big(Q_{U|X}(\cdot|x)\big)$. When this representation has to
be encoded without knowing $Q_{U|X}$, since x is not given to us (e.g., in a communication
problem where the sender wishes to communicate the representations only), the
required average length under an encoding distribution $Q_U$ is $\mathbb{E}_{Q_{U|X}}\big[-\log Q_U(U)\big]$.
In this sense, expression (11.61) suggests that we should select encoders that allow
us to then encode representations efficiently. Interestingly, this is closely related to the
celebrated minimum-description-length (MDL) method for density estimation [38, 39].
However, the fundamental difference between these principles is that the information
complexity (11.62) follows from the generalization gap and measures the amount of
information conveyed by the representations relative to an encoder model, as opposed
to the model parameters of the encoder itself.
The information-theoretic significance of (11.62) goes beyond simply a regulariza-
tion term, since it leads us to introduce the fundamental notion of encoder capacity.
This key idea of encoder capacity is made possible thanks to Theorem 11.2 that
connects mathematically the generalization gap to the information complexity, which
is intimately related to the number of distinguishable samples from the representations.
Notice that the information complexity can be upper-bounded as
5 As is well known in information theory, the shortest expected code length is achievable by a uniquely
decodable code under a known data source [24].
$I\big(\hat{P}_X; Q_{U|X}\big) \;=\; \frac{1}{n}\sum_{i=1}^{n} D\bigg( Q_{U|X}(\cdot|x_i) \,\bigg\|\, \frac{1}{n}\sum_{j=1}^{n} Q_{U|X}(\cdot|x_j) \bigg)$   (11.63)

$\;\le\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} D\big( Q_{U|X}(\cdot|x_i) \,\big\|\, Q_{U|X}(\cdot|x_j) \big),$   (11.64)
where $\{x_i\}_{i=1}^{n}$ are the training examples from the dataset $S_n$ and the last inequality
follows from the convexity of the relative entropy. This bound measures the average
degree of closeness between the representations induced by the different sample
inputs. When two distributions $Q_{U|X}(\cdot|x_i)$ and $Q_{U|X}(\cdot|x_j)$ are very close to each
other, i.e., when $Q_{U|X}$ assigns high likelihood to similar representations for
different inputs $x_i \neq x_j$, they contribute little to the complexity of the overall
representations. In other words, the more sample inputs an encoder can differentiate,
the more patterns it can fit well, and hence the larger the mutual information and the
greater the risk of overfitting. This observation suggests that the complexity of a representation
model with respect to a sample dataset can be related to the number of data samples
that essentially yield different (distinguishable) representations. Inspired by the concept
of stochastic complexity [39], we introduce below the notion of encoder capacity to
measure the complexity of a representation model.
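The bound (11.64) is straightforward to check numerically. The sketch below, a toy illustration with hypothetical names of our own choosing, computes both the exact mutual information (11.63) and its pairwise-KL upper bound for an encoder specified by its rows $Q_{U|X}(\cdot|x_i)$ on the training inputs.

```python
import numpy as np

def mi_and_pairwise_bound(q_rows, eps=1e-12):
    """q_rows: (n, |U|) array whose i-th row is Q_{U|X}(.|x_i) for training input x_i.
    Returns ( I(hat P_X; Q_{U|X}) as in (11.63), pairwise-KL bound of (11.64) ), in nats."""
    n = q_rows.shape[0]
    q_bar = q_rows.mean(axis=0)                       # (1/n) sum_j Q(.|x_j)

    def kl(p, q):
        return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

    mi = np.mean([kl(q_rows[i], q_bar) for i in range(n)])          # (11.63)
    bound = np.mean([kl(q_rows[i], q_rows[j])                       # (11.64)
                     for i in range(n) for j in range(n)])
    return mi, bound

# Toy usage: two nearly identical rows (two inputs mapped to similar representations)
# contribute little to either quantity, as discussed in the text.
rows = np.array([[0.9, 0.1], [0.88, 0.12], [0.1, 0.9]])
print(mi_and_pairwise_bound(rows))
```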
definition 11.14 (Capacity of randomized encoders) The encoder capacity Ce of a
randomized encoder QU|X with respect to a sample set A ⊆ X is defined as
$C_e\big(\mathcal{A}, Q_{U|X}\big) \;\triangleq\; \max_{\psi:\,\mathcal{U}\to\mathcal{A}} \log\bigg( \sum_{u\in\mathcal{U}} Q_{U|X}\big(u\,|\,\psi(u)\big) \bigg) \;=\; \log|\mathcal{A}| - \log\frac{1}{1-\varepsilon},$   (11.65)

where

$\varepsilon \;\triangleq\; \min_{\psi:\,\mathcal{U}\to\mathcal{A}} \frac{1}{|\mathcal{A}|} \sum_{x\in\mathcal{A}} \sum_{u\in\mathcal{U}} Q_{U|X}(u|x)\, \mathbb{1}\big[\psi(u) \neq x\big] \;\le\; 1 - \frac{1}{|\mathcal{A}|}.$   (11.66)
The argument of the logarithm in the second term of (11.65) is determined by the probability
$1-\varepsilon$ of being able to distinguish samples from their representations, i.e., the average
probability that the samples estimated from $Q_{U|X}$ via the maximum-likelihood estimator ψ(·)
are equal to the true samples. Therefore, the encoder capacity is the logarithm of the total
number of samples minus a term that depends on the probability of misclassifying
the input samples from their representations. When ε is small, $C_e(\mathcal{A}, Q_{U|X}) \approx
\log|\mathcal{A}| - \varepsilon$ and thus all samples are essentially perfectly distinguishable. The following proposition
gives simple bounds$^6$ on the encoder capacity in terms of the information complexity (11.62),
which, as we already know, is closely related to the generalization gap.
proposition 11.1 Let $Q_{U|X}$ be an encoder distribution and let $\hat{P}_X$ be an empirical
distribution with support $\mathcal{A}_n \equiv \mathrm{supp}(\hat{P}_X)$. Then, the information complexity and the
encoder capacity satisfy
6 Notice that it is possible to provide better bounds on ε by relying on the results in [40]. However, we
preferred simplicity to “tightness” since the purpose of Proposition 11.1 is to link the encoder capacity
and the information complexity.
$C_e\big(\mathcal{A}_n, Q_{U|X}\big) = \log|\mathcal{A}_n| - \log\frac{1}{1-\varepsilon}$   (11.67)

and

$g^{-1}\Big( \log|\mathcal{A}_n| - I\big(\hat{P}_X; Q_{U|X}\big) \Big) \;\le\; \varepsilon \;\le\; \frac{1}{2}\Big( \log|\mathcal{A}_n| - I\big(\hat{P}_X; Q_{U|X}\big) \Big),$   (11.68)

where ε is defined by (11.66) with respect to $\mathcal{A}_n$, and the function $g^{-1}$ is characterized in Remark 11.1 below.
Proof We begin with the lower bound (11.70). Consider the inequalities
$I\big(\hat{P}_X; Q_{U|X}\big) = \min_{Q_U\in\mathcal{P}(\mathcal{U})} D\big( Q_{U|X} \,\big\|\, Q_U \,\big|\, \hat{P}_X \big)$   (11.71)

$\le \min_{Q_U\in\mathcal{P}(\mathcal{U})} \mathbb{E}_{\hat{P}_X}\mathbb{E}_{Q_{U|X}}\bigg[ \max_{x\in\mathcal{A}_n} \log\frac{Q_{U|X}(U|x)}{Q_U(U)} \bigg]$   (11.72)

$\le \min_{Q_U\in\mathcal{P}(\mathcal{U})} \max_{u\in\mathcal{U}} \log\frac{Q_{U|X}\big(u\,|\,\psi^\star(u)\big)}{Q_U(u)}$   (11.73)

$= \log\bigg( \sum_{u\in\mathcal{U}} Q_{U|X}\big(u\,|\,\psi^\star(u)\big) \bigg) = C_e\big(\mathcal{A}_n, Q_{U|X}\big),$   (11.74)
where (11.73) follows by letting $\psi^\star$ be the mapping maximizing $C_e\big(\mathcal{A}_n, Q_{U|X}\big)$,
and (11.74) follows by noticing that (11.73) is the smallest worst-case regret, known as
the minimax regret; choosing $Q_U$ to be the normalized maximum-likelihood
distribution on the restricted set $\mathcal{A}_n$, the claim is a consequence of the remarkable result
of Shtarkov [41].
It remains to show the bounds in (11.68). In order to show the lower bound, we can
simply apply Fano’s lemma (Lemma 2.10 of [42]), from which we can bound from
below the error probability (11.66) that is based on An . As for the upper bound,
$\log|\mathcal{A}_n| - I\big(\hat{P}_X; Q_{U|X}\big) \;\ge\; H\big(\hat{P}_X\big) - I\big(\hat{P}_X; Q_{U|X}\big)$   (11.75)

$= \sum_{u\in\mathcal{U}} \hat{Q}_U(u)\, H\big(\hat{Q}_{X|U}(\cdot|u)\big)$   (11.76)

$\ge 2 \sum_{u\in\mathcal{U}} \hat{Q}_U(u)\Big( 1 - \max_{x'\in\mathcal{X}} \hat{Q}_{X|U}(x'|u) \Big)$   (11.77)

$= 2\varepsilon,$   (11.78)

where (11.75) follows from the assumption $\mathcal{A}_n = \mathrm{supp}(\hat{P}_X)$ and the fact that the entropy
is maximized by the uniform distribution; (11.77) follows by using Equation (7) of [43],
and (11.78) follows by the definition of ε in (11.66). This concludes the proof.
remark 11.1 In Proposition 11.1, the function $g^{-1}(t) \triangleq 0$ for $t < 0$ and, for
$0 < t < \log|\mathcal{A}_n|$, $g^{-1}(t)$ is the solution of the equation $g(\varepsilon) = t$ with respect to
$\varepsilon \in \big[0,\, 1 - 1/|\mathcal{A}_n|\big]$; this solution exists since the function g is continuous and increasing
on $\big[0,\, 1 - 1/|\mathcal{A}_n|\big]$, with $g(0) = 0$ and $g\big(1 - 1/|\mathcal{A}_n|\big) = \log|\mathcal{A}_n|$.
remark 11.2 (Generalization requires learning invariant representations) An important
consequence of the lower bound in (11.68) in Proposition 11.1 is that by limiting the
information complexity, i.e., by controlling the generalization gap according to the
criterion (11.61), we bound from below the error probability of distinguishing input
samples from their representations. In other words, from expression (11.67) and Theorem
11.2 we can conclude that encoders that induce a large probability of misclassifying
input samples from their representations, i.e., encoders under which different inputs share similar
representations, are expected to achieve better generalization. This also
implies, formally, that it suffices to enforce invariant representations (e.g., by injecting
noise during training) to control the encoder capacity, from which the generalization gap
is naturally upper-bounded thanks to Theorem 11.2 and its connection with the
information complexity. However, there is a delicate trade-off between the amount of
noise (enforcing both invariance and generalization) and the minimization of the cross-
entropy loss. Additionally, it is not difficult to show from the data-processing inequality
that stacking noisy encoder layers reinforces increasingly invariant representations
since distinguishing inputs from their representations becomes harder – or equivalently
the encoder capacity decreases – the deeper the network.
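The maximization over ψ in (11.65) decomposes across u, so the capacity is obtained by letting ψ(u) be the maximum-likelihood guess of the input, as noted above. The sketch below (our own toy setup, not the authors' code) computes $C_e$ and ε for a sharp encoder and for the same encoder composed with a noisy layer; consistently with the data-processing argument in Remark 11.2, stacking the noisy layer can only decrease the capacity.

```python
import numpy as np

def encoder_capacity(q_u_given_x):
    """Encoder capacity (11.65) for a finite sample set A = {x_1, ..., x_|A|}.

    q_u_given_x : (|A|, |U|) row-stochastic matrix, rows indexed by the samples in A.
    Since the maximum over psi : U -> A decomposes per u, psi(u) is the
    maximum-likelihood guess argmax_x Q_{U|X}(u|x)."""
    best_per_u = q_u_given_x.max(axis=0)          # Q_{U|X}(u | psi(u)) for each u
    return np.log(best_per_u.sum())               # C_e = log sum_u Q(u | psi(u))

def error_prob(q_u_given_x):
    """Misclassification probability epsilon of (11.66) under the ML decoder."""
    n = q_u_given_x.shape[0]
    return 1.0 - q_u_given_x.max(axis=0).sum() / n

# A sharp encoder on |A| = 4 inputs and |U| = 4 representation values.
q = np.eye(4) * 0.9 + 0.1 / 4
q /= q.sum(axis=1, keepdims=True)

# Stack a noisy layer: a channel on U that randomly perturbs the representation.
noise = 0.3
k = np.eye(4) * (1 - noise) + noise / 4
q_deep = q @ k                                    # composed encoder

for name, enc in [("one layer", q), ("noisy stack", q_deep)]:
    print(name, "C_e =", encoder_capacity(enc), "epsilon =", error_prob(enc))
```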
where we define

$\widetilde{E}_{\mathrm{gap}}(Q_{U|X}, S_n) = \mathbb{E}_{\hat{P}_{XY}}\bigg[ -\sum_{u\in\mathcal{U}} Q_{U|X}(u|X) \log Q_{Y|U}(Y|u) \bigg] - \mathbb{E}_{P_{XY}}\bigg[ -\sum_{u\in\mathcal{U}} Q_{U|X}(u|X) \log Q_{Y|U}(Y|u) \bigg].$   (11.80)
That is, $\widetilde{E}_{\mathrm{gap}}(Q_{U|X}, S_n)$ is the gap corresponding to the selection of the optimal decoder, which
depends on the true $P_{XY}$, according to Lemma 11.3. It is not difficult to show that

$\widetilde{E}_{\mathrm{gap}}(Q_{U|X}, S_n) \;\le\; H(Q_{Y|U}|Q_U) - H(\hat{Q}_{Y|U}|\hat{Q}_U) + \mathbb{E}_{\hat{Q}_U}\Big[ D\big( \hat{Q}_{Y|U} \,\big\|\, Q_{Y|U} \big) \Big],$   (11.81)

where the second term can be bounded as $\mathbb{E}_{\hat{Q}_U}\big[ D\big( \hat{Q}_{Y|U} \,\|\, Q_{Y|U} \big) \big] \le D\big( \hat{P}_{XY} \,\|\, P_{XY} \big)$. The first
term of (11.81) is bounded as

$H(Q_{Y|U}|Q_U) - H(\hat{Q}_{Y|U}|\hat{Q}_U) \;\le\; H(Q_U) - H(\hat{Q}_U) + H(P_Y) - H(\hat{P}_Y) + H(Q_{U|Y}|P_Y) - H(\hat{Q}_{U|Y}|\hat{P}_Y).$   (11.82)
where

$\phi(x) = \begin{cases} 0, & x \le 0,\\ -x\log(x), & 0 < x < e^{-1},\\ e^{-1}, & x \ge e^{-1}, \end{cases}$   (11.85)

and $V(a) = \|a - \bar{a}\mathbf{1}_d\|_2^2$, with $a\in\mathbb{R}^d$, $d\in\mathbb{N}^+$, $\bar{a} = (1/d)\sum_{i=1}^{d} a_i$, and $\mathbf{1}_d$ the all-ones vector of length d.
It is clear that $P_Y \mapsto H(P_Y)$ is a differentiable function and, thus, we can apply a
first-order Taylor expansion to obtain

$H(P_Y) - H(\hat{P}_Y) = \Big\langle \frac{\partial H(P_Y)}{\partial p_Y},\, p_Y - \hat{p}_Y \Big\rangle + o\big( \| p_Y - \hat{p}_Y \|_2 \big),$   (11.86)

where $\partial H(P_Y)/\partial P_Y(y) = -\log P_Y(y) - \log(e)$ for each $y\in\mathcal{Y}$. Then, using the
Cauchy–Schwarz inequality, we have

$H(P_Y) - H(\hat{P}_Y) \;\le\; \sqrt{V\big( \{\log p_Y(y)\}_{y\in\mathcal{Y}} \big)}\, \| p_Y - \hat{p}_Y \|_2 + o\big( \| p_Y - \hat{p}_Y \|_2 \big).$   (11.87)
$\le\; \frac{\log(n)}{\sqrt{n}}\, B_\delta \sum_{u\in\mathcal{U}} \sqrt{ V\big( \{ q_{U|X}(u|x) \}_{x\in\mathcal{X}} \big) } \;+\; B_\delta \sqrt{\frac{|\mathcal{Y}|\log|\mathcal{U}|}{n}} \;+\; \frac{2|\mathcal{U}|e^{-1}}{\sqrt{n}} \;+\; \sqrt{ V\big( \{ \log p_Y(y) \}_{y\in\mathcal{Y}} \big) }\, \frac{B_\delta}{\sqrt{n}} \;+\; O\!\left( \frac{\log(n)}{n} \right),$   (11.91)

where we use $n \ge a^2 e^2$ and $\phi\big( a/\sqrt{n} \big) \le \frac{(a/2)\log(n)}{\sqrt{n}} + \frac{e^{-1}}{\sqrt{n}}$. By combining this
result with the following inequality [18],

$\sum_{u\in\mathcal{U}} \sqrt{ V\big( \{ q_{U|X}(u|x) \}_{x\in\mathcal{X}} \big) } \;\le\; \sqrt{2}\left( 1 + \sqrt{\frac{1}{\hat{p}_X(x_{\min})\,|\mathcal{X}|}} \right) \sqrt{ I\big( \hat{P}_X; Q_{U|X} \big) },$   (11.92)
we relate the variance terms to the mutual information. Finally, using Taylor arguments as above, we can
easily write

$\sqrt{ I\big( \hat{P}_X; Q_{U|X} \big) } - \sqrt{ I\big( P_X; Q_{U|X} \big) } \equiv O\big( \| \hat{p}_X - p_X \|_2 \big) \le O\big( n^{-1/2} \big)$   (11.93)
with probability 1 − δ. It only remains to analyze the second term on the right-hand
side of (11.79). Using standard manipulations, we can easily show that this term can be
equivalently written as
$\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} \big( P_{XY}(x,y) - \hat{P}_{XY}(x,y) \big) \sum_{u\in\mathcal{U}} Q_{U|X}(u|x) \log\left( \frac{Q_{Y|U}(y|u)}{\hat{Q}_{Y|U}(y|u)} \right).$   (11.94)
It is not difficult to see that, given $Q_{U|X}$, the map $P_{XY} \mapsto \log Q_{Y|U}(y|u)$ is differentiable
and, thus, we can apply a first-order Taylor expansion to obtain

$\sum_{u\in\mathcal{U}} Q_{U|X}(u|x) \log\left( \frac{Q_{Y|U}(y|u)}{\hat{Q}_{Y|U}(y|u)} \right) = -\sum_{u\in\mathcal{U}} Q_{U|X}(u|x) \Big\langle \frac{\partial \log Q_{Y|U}(y|u)}{\partial p_{XY}},\, \hat{p}_{XY} - p_{XY} \Big\rangle + o\big( \| p_{XY} - \hat{p}_{XY} \|_2 \big)$   (11.95)

and

$\frac{\partial \log Q_{Y|U}(y|u)}{\partial P_{XY}(x', y')} = \frac{ Q_{U|X}(u|x') \big( \mathbb{1}\{ y' = y \} - Q_{Y|U}(y|u) \big) }{ Q_{UY}(u, y) }.$   (11.96)
With the assumption that every encoder $Q_{U|X}$ in the family $\mathcal{F}$ satisfies $Q_{U|X}(u|x) > \alpha$
for every $(u, x) \in \mathcal{U}\times\mathcal{X}$, with $\alpha > 0$, we obtain

$\left| \frac{\partial \log Q_{Y|U}(y|u)}{\partial P_{XY}(x', y')} \right| \;\le\; \frac{2}{\alpha}, \qquad \forall (x, x', y', u) \in \mathcal{X}\times\mathcal{X}\times\mathcal{Y}\times\mathcal{U}.$   (11.97)
From simple algebraic manipulations, we can bound the term in (11.94) as

$\left| \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} \big( P_{XY}(x,y) - \hat{P}_{XY}(x,y) \big) \sum_{u\in\mathcal{U}} Q_{U|X}(u|x) \log\left( \frac{Q_{Y|U}(y|u)}{\hat{Q}_{Y|U}(y|u)} \right) \right| \;\le\; \frac{2}{\alpha}\left( \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} \big| P_{XY}(x,y) - \hat{P}_{XY}(x,y) \big| \right)^{2}.$   (11.98)
Again, using McDiarmid’s concentration inequality, it can be shown that with proba-
bility close to one this term is O(1/n), which can be neglected compared with the other
terms calculated previously. This concludes the proof of the theorem.
References
[1] National Research Council, Frontiers in massive data analysis. National Academies Press,
2013.
[2] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical J.,
vol. 27, nos. 3–4, pp. 379–423, 623–656, 1948.
[3] V. Vapnik, The nature of statistical learning theory, 2nd edn. Springer, 2000.
[26] A. El Gamal and Y.-H. Kim, Network information theory. Cambridge University Press,
2012.
[27] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE
National Convention Record, vol. 4, no. 1, pp. 142–163, 1959.
[28] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IEEE
Trans. Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[29] T. Courtade and T. Weissman, “Multiterminal source coding under logarithmic loss,”
IEEE Trans. Information Theory, vol. 60, no. 1, pp. 740–761, 2014.
[30] M. Vera, L. R. Vega, and P. Piantanida, “Collaborative representation learning,”
arXiv:1604.01433 [cs.IT], 2016.
[31] N. Slonim and N. Tishby, “Document clustering using word clusters via the information
bottleneck method,” in Proc. 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2000, pp. 208–215.
[32] L. Wang, M. Chen, M. Rodrigues, D. Wilcox, R. Calderbank, and L. Carin, “Information-
theoretic compressive measurement design,” IEEE Trans. Pattern Analysis Machine
Intelligence, vol. 39, no. 6, pp. 1150–1164, 2017.
[33] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[34] M. Vera, L. R. Vega, and P. Piantanida, “Compression-based regularization with an
application to multi-task learning,” IEEE J. Selected Topics Signal Processing, vol. 12,
no. 5, pp. 1063–1076, 2018.
[35] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless
channels,” IEEE Trans. Information Theory, vol. 18, no. 1, pp. 14–20, 1972.
[36] R. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Trans.
Information Theory, vol. 18, no. 4, pp. 460–473, 1972.
[37] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information
bottleneck,” CoRR, vol. abs/1612.00410, 2016.
[38] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5,
pp. 465–471, 1978.
[39] P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in minimum description length:
Theory and applications. MIT Press, 2005.
[40] S. Arimoto, “On the converse to the coding theorem for discrete memoryless channels
(corresp.),” IEEE Trans. Information Theory, vol. 19, no. 3, pp. 357–359, 1973.
[41] Y. M. Shtarkov, “Universal sequential coding of single messages,” Problems Information
Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[42] A. B. Tsybakov, Introduction to nonparametric estimation, 1st edn. Springer, 2008.
[43] D. Tebbe and S. Dwyer, “Uncertainty and the probability of error (corresp.),” IEEE Trans.
Information Theory, vol. 14, no. 3, pp. 516–518, 1968.
12 Fundamental Limits in Model
Selection for Modern Data Analysis
Jie Ding, Yuhong Yang, and Vahid Tarokh
Summary
12.1 Introduction
Model selection is the task of selecting a statistical model or learning method from a
model class, given a set of data. Some common examples are selecting the variables for
low- or high-dimensional linear regression, basis terms such as polynomials, splines, or
wavelets in function estimation, the order of an autoregressive process, the best machine-
learning techniques for solving real-data challenges on an online competition platform,
etc. There has been a long history of model-selection techniques that arise from fields
such as statistics, information theory, and signal processing. A considerable number
of methods have been proposed, following different philosophies and with sometimes
drastically different performances. Reviews of the literature can be found in [1–6] and
references therein. In this chapter, we aim to provide an integrated understanding of
the properties of various approaches, and introduce two recent advances leading to the
improvement of classical model-selection methods.
We first introduce some notation. We use Mm = {pθ , θ ∈ Hm } to denote a model which
is a set of probability density functions, where Hm is the parameter space and pθ is short
for pm (Z1 , . . . , Zn | θ), the probability density function of data Z1 , . . . , Zn ∈ Z. A model
class, {Mm }m∈M , is a collection of models indexed by m ∈ M. We denote by n the sample
size, and by dm the size/dimension of the parameter in model Mm . We use p∗ to denote
the true-data generating distribution, and E∗ for the expectation associated with it. In
the parametric framework, there exists some m ∈ M and some θ∗ ∈ Hm such that p∗ is
equivalent to $p_{\theta^*}$ almost surely; otherwise we are in the non-parametric framework. We use
$\to_p$ to denote convergence in probability (under $p_*$), and $N(\mu, \sigma^2)$ to denote a Gaussian
distribution with mean μ and variance $\sigma^2$. We use capital and lower-case letters to denote
random variables and their realized values, respectively.
The chapter’s content can be outlined as follows. In Section 12.2, we review the two
statistical/machine-learning goals (i.e., prediction and inference) and the fundamental
limits associated with each of them. In Section 12.3, we explain how model selection is
the key to reliable data analysis through a toy example. In Section 12.4, we introduce
the background and theoretical properties of the Akaike information criterion (AIC)
and the Bayesian information criterion (BIC), as two fundamentally important model-
selection criteria (and other principles sharing similar asymptotic properties). We shall
discuss their conflicts in terms of large-sample performances in relation to two statistical
goals. In Section 12.5 we introduce a new information criterion, referred to as the bridge
criterion (BC), that bridges the conflicts of the AIC and the BIC. We provide its back-
ground, theoretical properties, and a related quantity referred to as the parametricness
index that is practically very useful to describe how likely it is that the selected model
can be practically trusted as the “true model.” In Section 12.6, we review recent develop-
ments in modeling-procedure selection, which, differently from model selection in the
narrow sense of choosing among parametric models, aims to select the better statistical
or machine-learning procedure.
Data analysis usually consists of two steps. In the first step, candidate models are pos-
tulated, and for each candidate model $M_m = \{p_\theta,\ \theta\in H_m\}$ we estimate its parameter
$\theta\in H_m$. In the second step, from the set of estimated candidate models $p_{\hat{\theta}_m}$ ($m\in\mathcal{M}$) we
select the most appropriate one (for either interpretation or prediction purposes). We note
that not every data analysis and its associated model-selection procedure formally rely
on probability distributions. An example is nearest-neighbor learning with the neighbor
size chosen by cross-validation, which requires only that the data splitting is meaningful
and that the predictive performance of each candidate model/method can be assessed
in terms of some measure (e.g., quadratic loss, hinge loss, or perceptron loss). Also,
motivated by computational feasibility, there are methods (e.g., LASSO1 ) that combine
the two steps into a single step.
Before we proceed, we introduce the concepts of “fitting” and “optimal model” rele-
vant to the above two steps. The fitting procedure given a certain candidate model Mm
is usually achieved by minimizing the negative log-likelihood function
$\theta \mapsto -\ell_{n,m}(\theta)$,
with $\hat{\theta}_m$ being the maximum-likelihood estimator (MLE) under model $M_m$. The maximized
log-likelihood value is $\ell_{n,m}(\hat{\theta}_m)$. The above notion applies to non-i.i.d.
data as well. Another function often used in time-series analysis is the quadratic loss
$\{z_t - E_p(Z_t \,|\, z_1,\ldots,z_{t-1})\}^2$ instead of the negative log-likelihood. (Here the
expectation is taken over the distribution p.) Since there can be a number of other variations,
the notion of the negative log-likelihood can be extended to a general
loss function $\ell(p, z)$ involving a density function $p(\cdot)$ and data z. Likewise, the notion of
the MLE can be thought of as a specific form of M-estimator.
To define what the “optimal model” means, let $\hat{p}_m = p_{\hat{\theta}_m}$ denote the estimated distribution
under model $M_m$. The predictive performance may be assessed via the out-sample
prediction loss $E_*\big(\ell(\hat{p}_m, Z')\,\big|\,Z\big) = \int_{\mathcal{Z}} \ell(\hat{p}_m, z')\, p_*(z')\, dz'$, where $Z'$ is independent of and
identically distributed to the data Z used to obtain $\hat{p}_m$, and $\mathcal{Z}$ is the data domain. A sample
analog of $E_*\big(\ell(\hat{p}_m, Z')\,\big|\,Z\big)$ is the in-sample prediction loss (also referred to as the
empirical loss), defined as $E_n\big(\ell(\hat{p}_m, z)\big) = n^{-1}\sum_{t=1}^{n}\ell(\hat{p}_m, z_t)$, which measures the fit of
the model $M_m$ to the observed data. In view of this definition, the optimal model can be
naturally defined as the candidate model with the smallest out-sample prediction loss,
i.e., $m_0 = \arg\min_{m\in\mathcal{M}} E_*\big(\ell(\hat{p}_m, Z')\,\big|\,Z\big)$. In other words, $M_{m_0}$ is the model whose predictive
power is the best offered by the candidate models given the observed data and the
specified model class. It can be regarded as the theoretical limit of learning given the
current data and the model list.
In a parametric framework, the true data-generating model is usually the optimal
model for sufficiently large sample size [8]. In this vein, if the true density function p∗
belongs to some model Mm , or equivalently p∗ = pθ∗ for some θ∗ ∈ Hm and m ∈ M, then
we aim to select such Mm (from {Mm }m∈M ) with probability going to one as the sample
size increases. This is called consistency in model selection. In addition, the MLE of pθ
for θ ∈ Hm is known to be an asymptotically efficient estimator of the true parameter
θ∗ [9]. In a non-parametric framework, the optimal model depends on the sample size:
for a larger sample size, the optimal model tends to be larger since more observations
can help reveal weak effects that are out of reach at a small sample size. In that situa-
tion, it can be statistically unrealistic to pursue selection consistency [10]. We note that
the aforementioned equivalence between the optimal model and the true model may not
1 The least absolute shrinkage and selection operator (LASSO) [7] is a penalized regression method whose
penalty term is of the form $\lambda\|\beta\|_1$, where β is the regression coefficient vector and λ is a tuning parameter
that controls how many (and which) variables are selected. In practice, data analysts often select an
appropriate λ based on, e.g., five-fold cross-validation.
hold for high-dimensional regression settings where the number of independent vari-
ables is large relative to the sample size [8]. Here, even if the true model is included as a
candidate, its dimension may be too high to be appropriately identified on the basis of a
relatively small amount of data. Then the (literally) parametric setting becomes virtually
non-parametric.
There are two main objectives in learning from data. One is to understand the data-
generation process for scientific discoveries. Under this objective, a notion of funda-
mental limits is concerned with the consistency of selecting the optimal model. For
example, a scientist may use the data to support his/her physics model or identify genes
that clearly promote early onset of a disease. Another objective of learning from data
is for prediction, where the data scientist does not necessarily care about obtaining an
accurate probabilistic description of the data (e.g., which covariates are independent of
the response variable given a set of other covariates). Under this objective, a notion of
fundamental limits is taken in the sense of achieving optimal predictive performance.
Of course, one may also be interested in both directions. For example, scientists may be
interested in a physical model that explains well the causes of precipitation (inference),
and at the same time they may also want to have a good model for predicting the amount
of precipitation on the next day or during the next year (prediction).
In line with the two objectives above, model selection can also have two directions:
model selection for inference and model selection for prediction. The first is intended to
identify the optimal model for the data, in order to provide a reliable characterization of
the sources of uncertainty for scientific interpretation. The second is to choose a model
as a vehicle to arrive at a model/method that offers a satisfying top performance. For
the former goal, it is crucial that the selected model is stable for the data, meaning that
a small perturbation of the data does not affect the selection result. For the latter goal,
however, the selected model may be simply the lucky winner among a few close com-
petitors whose predictive performance can be nearly optimal. If so, the model selection
is perfectly fine for prediction, but the use of the selected model for insight and interpre-
tation may be misleading. For instance, in linear regression, because the covariates are
often correlated, it is quite possible that two very different sets of covariates may offer
nearly identical top predictive performances yet neither can justify its own explanation
of the regression relationship against that by the other.
Associated with the first goal of model selection for inference or identifying the best
candidate is the concept of selection consistency. The selection consistency means that
the optimal model/method is selected with probability going to one as the sample size
goes to infinity. This idealization that the optimal model among the candidates can be
practically deemed the “true model” is behind the derivations of several model-selection
methods. In the context of variable selection, in practical terms, model-selection con-
sistency is intended to mean that the useful variables are identified and their statistical
significance can be ascertained in a follow-up study while the rest of the variables can-
not. However, in reality, with limited data and large noise the goal of model-selection
consistency may not be reachable. Thus, to certify the selected model as the “true” model
for reliable statistical inference, the data scientist must conduct a proper model-selection
diagnostic assessment (see [11] and references therein). Otherwise, the use of
the selected model for drawing conclusions on the data-generation process may give
irreproducible results, which is a major concern in the scientific community [12].
In various applications where prediction accuracy is the dominating consideration,
the optimal model as defined earlier is the target. When it can be selected with high
probability, the selected model can not only be trusted for optimal prediction but also
comfortably be declared the best. However, even when the optimal model is out of reach
in terms of selection with high confidence, other models may provide asymptotically
equivalent predictive performance. In this regard, asymptotic efficiency is a natural con-
sideration for the second goal of model selection. When prediction is the goal, obviously
prediction accuracy is the criterion to assess models. For theoretical examination, the
convergence behavior of the loss of the prediction based on the selected model charac-
terizes the performance of the model-selection criterion. Two properties are often used
to describe good model-selection criteria. The asymptotic efficiency property demands
that the loss of the selected model/method is asymptotically equivalent to the smallest
among all the candidates. The asymptotic efficiency is technically defined by
$\frac{\min_{m\in\mathcal{M}} L_m}{L_{\hat{m}}} \;\to_p\; 1$   (12.1)

as $n\to\infty$, where $\hat{m}$ denotes the selected model. Here, $L_m = E_*\big(\ell(\hat{p}_m, Z)\big) - E_*\big(\ell(p_*, Z)\big)$
is the adjusted prediction loss, where $\hat{p}_m$ denotes the estimated distribution under model
m. The subtraction of $E_*\big(\ell(p_*, Z)\big)$ makes the definition more refined, which allows one
to make a better comparison of competing model-selection methods. Overall, the goal
of prediction is to select a model that is comparable to the optimal model regardless of
whether it is stable or not as the sample size varies. This formulation works both for
parametric and for non-parametric settings.
We provide a synthetic experiment to demonstrate that better fitting does not imply better
predictive performance due to inflated variances in parameter estimation.
Example 12.1 Linear regression Suppose that we generate synthetic data from a
regression model $Y = f(X) + \varepsilon$, where each item of data is of the form $z_i = (y_i, x_i)$.
Each response $y_i$ ($i = 0,\ldots,n-1$) is observed at $x_i = i/n$ (fixed design points), namely
$y_i = f(i/n) + \varepsilon_i$. Suppose that the $\varepsilon_i$ are independent standard Gaussian noises, and suppose
that we use polynomial regression, with specified models of the form $f(x) = \sum_{j=0}^{m}\beta_j x^j$
($0 \le x < 1$, m a positive integer). The candidate models are specified to be
$\{M_m,\ m = 1,\ldots,d_n\}$, with $M_m$ corresponding to $f(x) = \sum_{j=0}^{m}\beta_j x^j$. Clearly, the dimension
of $M_m$ is $d_m = m + 1$.
The prediction loss for regression is calculated as $L_m = n^{-1}\sum_{i=1}^{n}\big( f(x_i) - \hat{f}_m(x_i) \big)^2$,
where $\hat{f}_m$ is the least-squares estimate of f using model $M_m$ (see, for example, [8]).
The efficiency, as before, is defined as $\min_{m\in\mathcal{M}} L_m / L_{\hat{m}}$.
When the data-generating model is unknown, one critical problem is the identification
of the degree of the polynomial model fitted to the data. We need to first estimate polyno-
mial coefficients with different degrees 1, . . . , dn , and then select one of them according
to a certain principle.
In an experiment, we first generate independent data using each of the following true
data-generating models, with sample sizes n = 100, 500, 2000, 3000. We then fit the data
using the model class as given above, with the maximal order dn = 15.
(1) Parametric framework. The data are generated by $f(x) = 10(1 + x + x^2)$.
Suppose that we adopt the quadratic loss in this example. Then we obtain the in-sample
prediction loss $e_m = n^{-1}\sum_{i=1}^{n}\big( Y_i - \hat{f}_m(x_i) \big)^2$. If we plot $e_m$ against
$d_m$, the curve must be monotonically decreasing, because a larger model fits the
same data better. We then compute the out-sample prediction loss $E_*\big(\ell(\hat{p}_m, Z)\big)$, which is
equivalent to

$E_*\big(\ell(\hat{p}_m, Z)\big) = n^{-1}\sum_{i=1}^{n}\big( f(x_i) - \hat{f}_m(x_i) \big)^2 + \sigma^2$   (12.2)
in this example. The above expectation is taken over the true distribution of an inde-
pendent future data item Zt . Instead of showing the out-sample prediction loss of each
candidate model, we plot its rescaled version (on [0, 1]). Recall the asymptotic efficiency
as defined in (12.1). Under quadratic loss, we have $E_*\big(\ell(p_*, Z)\big) = \sigma^2$, and the asymptotic
efficiency requires

$\frac{\min_{m\in\mathcal{M}} n^{-1}\sum_{i=1}^{n}\big( f(x_i) - \hat{f}_m(x_i) \big)^2}{n^{-1}\sum_{i=1}^{n}\big( f(x_i) - \hat{f}_{\hat{m}}(x_i) \big)^2} \;\to_p\; 1,$   (12.3)

where $\hat{m}$ denotes the selected model. In order to describe how the predictive performance
of each model deviates from the best possible, we define the efficiency of each model as
the term on the left-hand side in (12.1), or (12.3) in this example.
We now plot the efficiency of each candidate model on the left-hand side of Fig. 12.1.
The curves show that the predictive performance is optimal only for the true model. We
note that the minus-σ2 adjustment of the out-sample prediction loss in the numerator and
denominator of (12.3), compared with (12.2), makes it highly non-trivial to achieve the
property (see, for example, [8, 13–15]). Consider, for example, the comparison between
two nested polynomial models with degrees d = 2 and d = 3, the former being the true
data-generating model. It can be proved that, without subtracting σ2 , the ratio (of the
mean-square prediction errors) for each candidate model approaches 1; on subtracting
σ2 , the ratio for the former model still approaches 1, while the ratio for the latter model
approaches 2/3.
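As a minimal illustration of this parametric case (our own sketch, not the code used for Fig. 12.1), the following generates data from $f(x) = 10(1 + x + x^2)$, fits polynomial models of degrees 1 to 15 by least squares, and evaluates the in-sample loss $e_m$ together with the adjusted loss $L_m$ and the per-model efficiency of (12.3).

```python
import numpy as np

rng = np.random.default_rng(0)
n, max_degree, sigma = 500, 15, 1.0
x = np.arange(n) / n                                  # fixed design x_i = i/n
f_true = 10 * (1 + x + x**2)                          # parametric case (1)
y = f_true + sigma * rng.standard_normal(n)

in_sample, adj_loss = [], []
for m in range(1, max_degree + 1):
    X = np.vander(x, m + 1, increasing=True)          # columns 1, x, ..., x^m
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares fit of model M_m
    f_hat = X @ beta
    in_sample.append(np.mean((y - f_hat) ** 2))       # e_m: decreases monotonically in m
    adj_loss.append(np.mean((f_true - f_hat) ** 2))   # L_m: out-sample loss minus sigma^2

adj_loss = np.array(adj_loss)
efficiency = adj_loss.min() / adj_loss                # left-hand side of (12.3), per model
best = int(np.argmin(adj_loss)) + 1                   # typically the true degree 2 here
print("degree with smallest adjusted loss:", best)
print("efficiency of degrees 1..4:", np.round(efficiency[:4], 3))
```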
(2) Non-parametric framework. The data are generated by $f(x) = 20x^{1/3}$.
As for framework (1), we plot the efficiency on the right-hand side of Fig. 12.1. Dif-
ferently from the case (1), the predictive performance is optimal at increasing model
dimensions (as the sample size n increases). As mentioned before, in such a non-
parametric framework (i.e., the true f is not in any of the candidate models), the optimal
model is highly unstable as the sample size varies, so that pursuing an inference of a
fixed good model becomes improper.

Figure 12.1 The efficiency of each candidate model under two different data-generating processes.
The best-performing model is the true model of dimension 3 in the parametric framework
(left figure), whereas the best-performing model varies with sample size in the non-parametric
framework (right figure).

Intuitively, this is because, in a non-parametric
framework, more complex models are needed to accommodate more observed data in
order to strike an appropriate trade-off between the estimation bias (i.e., the smallest
approximation error between the data-generating model and a model in the model space)
and the variance (i.e., the variance due to parameter estimation) so that the prediction
loss can be reduced. Thus, in the non-parametric framework, the optimal model changes,
and the model-selection task aims to select a model that is optimal for prediction (e.g.,
asymptotically efficient), while recognizing that it is not feasible to identify the true/optimal
model for inference purposes. Note that Fig. 12.1 is drawn using information about the
underlying true model, but that information is unavailable in practice, hence the need for
a model-selection method so that the asymptotically best efficiency can still be achieved.
This toy experiment illustrates the general rules that (1) a larger model tends to fit data
better, and (2) the predictive performance is optimal with a candidate model that typi-
cally depends both on the sample size and on the true data-generating process (which
is unknown in practice). In a virtually (or practically) parametric scenario, the optimal
model is stable around the present sample size and it may be practically treated as the
true model. In contrast, in a virtually (or practically) non-parametric scenario, the opti-
mal model changes sensitively to the sample size (around the present sample size) and
the task of identifying the elusive optimal model for reliable inference is unrealistic.
With this understanding, an appropriate model-selection technique is called for so as to
single out the optimal model for inference and prediction in a strong practically paramet-
ric scenario, or to strike a good balance between the goodness of fit and model complexity
(i.e., the number of free unknown parameters) on the observed data to facilitate optimal
prediction in a practically non-parametric scenario.
Various model-selection criteria have been proposed in the literature. Though each
approach was motivated by a different consideration, many of them originally aimed to
select either the order in an autoregressive model or a subset of variables in a regression
model. We shall revisit an important class of them referred to as information criteria.
Such criteria typically select the model

$\hat{m} = \arg\min_{m\in\mathcal{M}} \bigg\{ n^{-1}\sum_{t=1}^{n} \ell\big( p_{\hat{\theta}_m}, z_t \big) + f_{n,d_m} \bigg\},$

where the objective function is the estimated in-sample loss plus a penalty $f_{n,d}$ (indexed
by the sample size n and model dimension d).
The Akaike information criterion (AIC) [16, 17] is a model-selection principle that
was originally derived by minimizing the Kullback–Leibler (KL) divergence from a
candidate model to the true data-generating model p∗ . Equivalently, the idea is to approx-
imate the out-sample prediction loss by the sum of the in-sample prediction loss and a
correction term. In the typical setting where the loss is logarithmic, the AIC procedure
is to select the model $M_m$ that minimizes

$-2\,\ell_{n,m}(\hat{\theta}_m) + 2\,d_m.$   (12.4)

The Bayesian information criterion (BIC) instead selects the model that minimizes

$-2\,\ell_{n,m}(\hat{\theta}_m) + d_m\log(n).$   (12.5)

The only difference from the AIC is that the constant 2 in the penalty term is replaced
with the logarithm of the sample size. Its original derivation by Schwarz was only for
an exponential family from a frequentist perspective. But it turned out to have a nice
Bayesian interpretation, as its current name suggests. Recall that, in Bayesian data anal-
ysis, marginal likelihood is commonly used for model selection [19]. In a Bayesian
setting, we would introduce a prior with density θ → pm (θ) (θ ∈ Hm ), and a likelihood
of data pm (Z | θ), where Z = [Z1 , . . . , Zn ], for each m ∈ M. We first define the marginal
likelihood of model Mm by
$p(Z \,|\, M_m) = \int_{H_m} p_m(Z \,|\, \theta)\, p_m(\theta)\, d\theta.$   (12.6)
The candidate model with the largest marginal likelihood should be selected. Inter-
estingly, this Bayesian principle is asymptotically equivalent to the BIC in selecting
models. To see the equivalence, we assume that Z1 , . . . , Zn are i.i.d., and π(·) is any
prior distribution on θ, which has dimension d. We let $\ell_n(\theta) = \sum_{i=1}^{n}\log p_\theta(z_i)$ be the log-likelihood
function, and $\hat{\theta}_n$ the MLE of θ. Note that $\ell_n$ implicitly depends on the model.
A proof of the Bernstein–von Mises theorem (see Chapter 10.2 of [20]) implies (under
regularity conditions)

$p(Z_1,\ldots,Z_n)\, \exp\Big( -\ell_n(\hat{\theta}_n) + \frac{d}{2}\log n \Big) \;\to_p\; c_*$   (12.7)
as n → ∞, for some constant c∗ that does not depend on n. Therefore, selecting a
model with the largest marginal likelihood p(Z1 , . . . , Zn ) (as advocated by Bayesian
model comparison) is asymptotically equivalent to selecting a model with the small-
est BIC in (12.5). It is interesting to see that the marginal likelihood of a model does
not depend on the imposed prior at all, given a sufficiently large sample size. We note
that, in many cases of practical data analysis, especially when likelihoods cannot be
written analytically, the BIC is implemented less than the Bayesian marginal likeli-
hood, because the latter can easily be implemented by utilizing Monte Carlo-based
computation methods [21].
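Under a Gaussian likelihood with the noise variance profiled out, both criteria reduce to simple functions of the residual sum of squares. The sketch below (illustrative only; additive constants common to all models are dropped) selects a polynomial degree by minimizing each criterion on the setup of Example 12.1.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns residual sum of squares and #coefficients."""
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return rss, degree + 1

def aic_bic(x, y, max_degree=15):
    """Gaussian-likelihood AIC/BIC for polynomial orders 1..max_degree."""
    n = len(y)
    scores = []
    for m in range(1, max_degree + 1):
        rss, k = fit_poly(x, y, m)
        loglik = -0.5 * n * np.log(rss / n)           # profiled Gaussian log-likelihood (+ const)
        scores.append((m, -2 * loglik + 2 * k,        # AIC, cf. (12.4)
                          -2 * loglik + k * np.log(n)))  # BIC, cf. (12.5)
    aic_choice = min(scores, key=lambda s: s[1])[0]
    bic_choice = min(scores, key=lambda s: s[2])[0]
    return aic_choice, bic_choice

rng = np.random.default_rng(0)
n = 500
x = np.arange(n) / n
y = 10 * (1 + x + x**2) + rng.standard_normal(n)
print(aic_bic(x, y))   # the BIC typically picks degree 2; the AIC may pick a slightly larger degree
```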
Cross-validation (CV) [22–26] is a class of model-selection methods that are widely
used in machine-learning practice. CV does not require the candidate models to be
parametric, and it works as long as data splittings make sense and one can assess the
predictive performance in terms of some measure. A specific type of CV is the delete-1
CV method [27] (or leave-one-out, LOO). The idea is explained as follows. For brevity,
let us consider a parametric model class as before. Recall that we wish to select a
model $M_m$ with as small an out-sample loss $E_*\big(\ell(p_{\hat{\theta}_m}, Z)\big)$ as possible. Its computation
involves the unknown true data-generating process, but we may approximate it by
$n^{-1}\sum_{i=1}^{n}\ell\big( p_{\hat{\theta}_{m,-i}}, z_i \big)$, where
$\hat{\theta}_{m,-i}$ is the MLE under model $M_m$ using all the observations
except $z_i$. In other words, given n observations, we leave each observation out
and attempt to predict that data point by using the n − 1 remaining observations, and
record the average prediction loss over n rounds. It is worth mentioning that LOO is
asymptotically equivalent to the AIC under some regularity conditions [27].
The general practice of CV works in the following manner. It randomly splits the
original data into a training set of nt data and a validation set of nv = n − nt data; each
candidate model is trained from the nt data and validated on the remaining data (i.e., to
record the average validation loss); the above procedure is replicated a few times, each
with a different validation set, in order to alleviate the variance caused by splitting; in
the end, the model with the least average validation loss is selected, and the model is
re-trained using the complete data for future use. The v-fold CV (with v being a positive
integer) is a specific version of CV. It randomly partitions data into v subsets of (approx-
imately) equal size; each model is trained on v − 1 folds and validated on the remaining
1 fold; the procedure is repeated v times, and the model with the smallest average vali-
dation loss is selected. The v-fold CV is perhaps more commonly used than LOO, partly
due to the large computational complexity involved in LOO. The holdout method, which
is often used in data competitions (e.g., Kaggle competition), may be viewed as a special
case of CV: it does data splitting only once, producing one part as the training set and
the remaining part as the validation set.
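A minimal v-fold CV loop for the same polynomial-degree selection task might look as follows; the fold-splitting choices and function names are our own and are not tied to any particular library.

```python
import numpy as np

def v_fold_cv(x, y, degrees, v=5, seed=0):
    """Select a polynomial degree by v-fold cross-validation under quadratic loss."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, v)
    cv_loss = {}
    for m in degrees:
        losses = []
        for k in range(v):
            val = folds[k]
            train = np.concatenate([folds[j] for j in range(v) if j != k])
            X_tr = np.vander(x[train], m + 1, increasing=True)
            X_va = np.vander(x[val], m + 1, increasing=True)
            beta, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)
            losses.append(np.mean((y[val] - X_va @ beta) ** 2))   # validation loss on fold k
        cv_loss[m] = np.mean(losses)                              # average over the v folds
    return min(cv_loss, key=cv_loss.get)

rng = np.random.default_rng(1)
n = 500
x = np.arange(n) / n
y = 10 * (1 + x + x**2) + rng.standard_normal(n)
print("degree selected by 5-fold CV:", v_fold_cv(x, y, range(1, 16)))
```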
We have mentioned that LOO was asymptotically equivalent to the AIC. How about a
general CV with nt training data and nv validation data? For regression problems, it has
been proved that CV is asymptotically similar to the AIC when nv /nt → 0 (including
LOO as a special case), and to the BIC when nv /nt → ∞ (see, e.g., [8]). Additional
comments on CV and corrections of some misleading folklore will be elaborated in
Section 12.6.
Note that many other model-selection methods are closely related to the AIC and
the BIC. For example, methods that are asymptotically equivalent to the AIC include
finite-sample-corrected AIC [28], which was proposed as a corrected version of the
AIC, generalized cross-validation [29], which was proposed for selecting the degree of
smoothing in spline regression, the final-prediction error criterion which was proposed
as a predecessor of the AIC [30, 31], and Mallows’ $C_p$ method [32] for regression-
variable selection. Methods that share the selection-consistency property of the BIC
include the Hannan and Quinn criterion [33], which was proposed as the smallest
penalty that achieves strong consistency (meaning that the best-performing model is
selected almost surely for sufficiently large sample size), the predictive least-squares
(PLS) method based on the minimum-description-length principle [34, 35], and the use
of Bayes factors, which is another form of Bayesian marginal likelihood. For some
methods such as CV and the generalized information criterion (GIC, or written as
$\mathrm{GIC}_{\lambda_n}$) [8, 36, 37], their asymptotic behavior usually depends on the tuning parameters.
In general, the AIC and the BIC have served as the golden rules for model selection
in statistical theory since their coming into existence. Their asymptotic properties have
been rigorously established for autoregressive models and regression models, among
many other models. Though cross-validations or Bayesian procedures have also been
widely used, their asymptotic justifications are still rooted in frequentist approaches in
the form of the AIC, the BIC, etc. Therefore, understanding the asymptotic behavior of
the AIC and the BIC is of vital value both in theory and in practice. We therefore focus
on the properties of the AIC and the BIC in the rest of this section and Section 12.5.
references therein). For example, consider the minimax risk of estimating the regression
function f ∈ F under the squared error
$\inf_{\hat{f}} \sup_{f\in\mathcal{F}} n^{-1}\sum_{i=1}^{n} E_*\big( \hat{f}(x_i) - f(x_i) \big)^2,$   (12.8)

where the infimum is over all estimators $\hat{f}$ based on the observations, and $f(x_i)$ equals the expectation
of the ith response variable (or the ith value of the regression function) conditional
on the ith vector of variables $x_i$. Each $x_i$ can refer to a vector of explanatory variables,
or polynomial basis terms, etc. For a model-selection method δ, its worst-case
risk is $\sup_{f\in\mathcal{F}} R(f,\delta,n)$, where $R(f,\delta,n) = n^{-1}\sum_{i=1}^{n} E_*\big\{ \hat{f}_\delta(x_i) - f(x_i) \big\}^2$ and
$\hat{f}_\delta$ is the least-squares estimate of f under the variables selected by δ. The method δ is said to be
minimax-rate optimal over $\mathcal{F}$ if $\sup_{f\in\mathcal{F}} R(f,\delta,n)$ converges (as $n\to\infty$) at the same rate
as the minimax risk in (12.8).
Another good property of the AIC is that it is asymptotically efficient (as defined in
(12.1)) in a non-parametric framework. In other words, the predictive performance of its
selected model is asymptotically equivalent to the best offered by the candidate models
(even though it is highly sensitive to the sample size). However, the BIC is known to be
consistent in selecting the data-generating model with the smallest dimension in a para-
metric framework. For example, suppose that the data are truly generated by a quadratic
function corrupted by random noise, and the candidate models are a quadratic polyno-
mial, a cubic polynomial, and an exponential function. Then the quadratic polynomial
is selected with probability going to one as the sample size tends to infinity. The expo-
nential function is not selected because it is a wrong model, and the cubic polynomial is
not selected because it overfits (even though it nests the true model as a special case).
12.5 The Bridge Criterion – Bridging the Conflicts between the AIC and the BIC
In this section, we review a recent advance in the understanding of the AIC, the BIC, and
related criteria. We choose to focus on the AIC and the BIC here because they represent
two cornerstones of model-selection principles and theories. We are concerned only with
settings where the sample size is larger than the model dimensions. Many details of the
following discussion can be found in technical papers such as [8, 13, 15, 39–41] and
references therein.
Recall that the AIC is asymptotically efficient for the non-parametric scenario and
is also minimax optimal. In contrast, the BIC is consistent and asymptotically effi-
cient for the parametric scenario. Despite the good properties of the AIC and the BIC,
they have their own drawbacks. The AIC is known to be inconsistent in a paramet-
ric scenario where there are at least two correct candidate models. As a result, the
AIC is not asymptotically efficient in such a scenario. For example, if data are truly
generated by a quadratic polynomial function corrupted by random noise, and the can-
didate models include quadratic and cubic polynomials, then the former model cannot
be selected with probability going to one as the sample size increases. The asymptotic
(with the default $c_n = n^{2/3}$) over all the candidate models whose dimensions are no larger
than $d_{\hat{m}_{\mathrm{AIC}}}$, defined as the dimension of the model selected by the AIC ($d_{\hat{m}}$ in (12.4)).
Note that the penalty is approximately $c_n \log d_m$, but it is written as a harmonic number
to highlight some of its nice interpretations. Its original derivation was motivated by
the recent discovery that the information loss of underfitting a model of dimension d
using dimension $d-1$ is asymptotically $\chi^2_1/d$ (where $\chi^2_1$ denotes a chi-squared random
variable with one degree of freedom) for large d, assuming that nature generates the
model from a non-informative uniform distribution over its model space (in particular
the coefficient space of all stationary autoregressions). Its heuristic derivation is reviewed
in Section 12.5.2. The BC was later proved to be asymptotically equivalent to the AIC
in a non-parametric framework, and equivalent to the BIC otherwise in rather general
settings [10, 15]. Some intuitive explanations will be given in Section 12.5.3. A technical
explanation of how the AIC, BIC, and BC relate to each other can be found in [15].
where $E_*$ is with respect to the stationary distribution of $\{z_t\}$ (see Chapter 3 of [47]).
The above quantity can also be regarded as the minimum KL divergence from the
space of order-d AR models to the true data-generating model (rescaled) under Gaussian noise
assumptions. We further define the relative loss $g_d = \log(e_{d-1}/e_d)$ for any positive
integer d.
Now suppose that nature generates the data from an AR(d) process, which is in turn
randomly generated from the uniform distribution Ud . Here, Ud is defined over the
space of all the stable AR filters of order d whose roots have modulus less than 1:
$\mathcal{S}_d = \bigg\{ [\psi_{d,1},\ldots,\psi_{d,d}]^{\mathrm{T}} :\ z^d + \sum_{\ell=1}^{d} \psi_{d,\ell}\, z^{d-\ell} = \prod_{\ell=1}^{d} (z - a_\ell),\ \psi_{d,\ell}\in\mathbb{R},\ |a_\ell| < 1,\ \ell = 1,\ldots,d \bigg\}.$
This result suggests that the underfitting loss $g_d \approx \chi^2_1/d$ tends to decrease with d.
Because the increment of the penalty from dimension d − 1 to dimension d can be treated
as a quantity to compete with the underfitting loss [15], it suggests that we penalize in a
way not necessarily linear in model dimension.
12.5.3 Interpretation
In this section, we provide some explanations of how the BC can be related to the AIC
and BIC. The explanation does not depend on any specific model assumption, which
shows that it can be applied to a wide range of other models.
The penalty curves (normalized by n) for the AIC, BIC, and BC can be respectively
denoted by

$f_{n,d_m}(\mathrm{AIC}) = \frac{2}{n}\, d_m, \qquad f_{n,d_m}(\mathrm{BIC}) = \frac{\log(n)}{n}\, d_m, \qquad f_{n,d_m}(\mathrm{BC}) = \frac{c_n}{n} \sum_{k=1}^{d_m} \frac{1}{k}.$
Any of the above penalty curves can be written in the form $\sum_{k=1}^{d} t_k$, and only the slopes
$t_k$ ($k = 1,\ldots,d_n$) matter to the performance of order selection. For example, suppose that
$k_2$ is selected instead of $k_1$ ($k_2 > k_1$) by some criterion. This implies that the gain in
prediction loss $L_{k_1} - L_{k_2}$ is greater than the sum of slopes $\sum_{k=k_1+1}^{k_2} t_k$. Thus, without loss
of generality, we can shift the curves of the above three criteria to be tangent to the bent
curve of the BC in order to highlight their differences and connections. Here, two curves
are referred to as tangent to each other if one is above the other and they intersect at one
point, the tangent point.
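The three normalized penalties are easy to tabulate. The sketch below (illustrative, using the default $c_n = n^{2/3}$ mentioned earlier) computes the curves for a given sample size, along with the quantity $c_n/\log n$ that appears as the tangent point in the discussion that follows.

```python
import numpy as np

def penalty_curves(n, d_max, c_n=None):
    """Normalized penalties f_{n,d} of the AIC, BIC, and BC for d = 1..d_max."""
    if c_n is None:
        c_n = n ** (2 / 3)                     # default c_n mentioned in the text
    d = np.arange(1, d_max + 1)
    f_aic = 2 * d / n
    f_bic = np.log(n) * d / n
    f_bc = (c_n / n) * np.cumsum(1.0 / d)      # harmonic-sum penalty of the BC
    return d, f_aic, f_bic, f_bc

d, f_aic, f_bic, f_bc = penalty_curves(n=2000, d_max=15)
print("first few slopes of the BC penalty:", np.round(np.diff(f_bc, prepend=0.0)[:5], 5))
print("tangent point c_n / log(n):", 2000 ** (2 / 3) / np.log(2000))
```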
Given a sample size n, the tangent point between the $f_{n,d_m}(\mathrm{BC})$ and $f_{n,d_m}(\mathrm{BIC})$ curves
is at $T_{\mathrm{BC:BIC}} = c_n/\log n$. Consider the example $c_n = n^{1/3}$. If the true order $d_0$ is finite,
$T_{\mathrm{BC:BIC}}$ will be larger than $d_0$ for all sufficiently large n. In other words, there will be
an infinitely large region as n tends to infinity, namely $1 \le k \le T_{\mathrm{BC:BIC}}$, where $d_0$ falls
into and where the BC penalizes more than does the BIC. As a result, asymptotically
the BC does not overfit. On the other hand, the BC will not underfit because the largest
penalty preventing one from selecting dimension k + 1 versus k is cn /n, which will be
less than any fixed positive constant (close to the KL divergence from a smaller model
to the true model) with high probability for large n. This reasoning suggests that the BC
is consistent.
Since the BC penalizes less for larger orders and finally becomes similar to the AIC,
it is able to share the asymptotic optimality of the AIC under suitable conditions. A full
illustration of why the BC is expected to work well in general can be found in [15]. As
we shall see, the bent curve of the BC well connects the BIC and the AIC so that a good
balance between the underfitting and overfitting risks is achieved.
Moreover, in many applications, the data analyst would like to quantify to what
extent the framework under consideration can be virtually treated as parametric, or,
in other words, how likely it is that the postulated model class is well specified. This
motivated the concept of the “parametricness index” (PI) [15, 45] to evaluate the relia-
bility of model selection. One definition of the PI, which we shall use in the following
experiment, is the quantity
$\mathrm{PI}_n = \frac{ \big| d_{\hat{m}_{\mathrm{BC}}} - d_{\hat{m}_{\mathrm{AIC}}} \big| }{ \big| d_{\hat{m}_{\mathrm{BC}}} - d_{\hat{m}_{\mathrm{AIC}}} \big| + \big| d_{\hat{m}_{\mathrm{BC}}} - d_{\hat{m}_{\mathrm{BIC}}} \big| }$

on [0, 1] if the denominator is well defined, and $\mathrm{PI}_n = 0$ otherwise. Here, $d_{\hat{m}_\delta}$ is the
dimension of the model selected by the method δ. Under some conditions, it can be
proved that $\mathrm{PI}_n \to_p 1$ in a parametric framework and $\mathrm{PI}_n \to_p 0$ otherwise.
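Given the dimensions selected by the three criteria, the PI is a one-line computation; the helper below is a hypothetical illustration of the definition, not part of any published implementation.

```python
def parametricness_index(d_bc, d_aic, d_bic):
    """PI_n from the selected dimensions; returns 0 when the denominator vanishes."""
    denom = abs(d_bc - d_aic) + abs(d_bc - d_bic)
    return abs(d_bc - d_aic) / denom if denom > 0 else 0.0

# If the BC agrees with the BIC (parametric-looking data) the PI is 1;
# if it agrees with the AIC (non-parametric-looking data) the PI is 0.
print(parametricness_index(d_bc=3, d_aic=7, d_bic=3),
      parametricness_index(d_bc=7, d_aic=7, d_bic=3))
```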
In an experiment concerning Example 1, we generate data using each of those two
true data-generating processes, with sample sizes n = 500. We replicate 100 independent
trials and then summarize the performances of LOO, GCV, the AIC, the BC, the BIC,
and delete-$n_v$ CV (abbreviated as CV) with $n_v = n - n/\log n$ in Table 12.1 (regression variable selection; standard errors are given in parentheses). Note that
CV with such a validation size is proved to be asymptotically close to the BIC, because
delete-$n_v$ CV has the same asymptotic behavior as $\mathrm{GIC}_{\lambda_n}$ introduced in Section 12.4
with

$\lambda_n = \frac{n}{n - n_v} + 1$   (12.10)
in selecting regression models [8]. From Table 12.1, it can be seen that the methods on
the right-hand side of the AIC perform better than the others in the parametric setting,
while the methods on the left-hand side of the BIC perform better than the others in the
non-parametric setting. The PI being close to one (zero) indicates the parametricness
(non-parametricness). These are in accord with the existing theory.
In this section, we introduce cross-validation as a general tool not only for model
selection but also for modeling-procedure selection, and highlight the point that there
is no one-size-fits-all data-splitting ratio of cross-validation. In particular, we clarify
some widespread folklore on training/validation data size that may lead to improper
data analysis.
Before we proceed, it is helpful to first clarify some commonly used terms involved
in cross-validation (CV). Recall that a model-selection method would conceptually fol-
low two phases: (A) the estimation and selection phase, where each candidate model is
trained using all the available data and one of them is selected; and (B) the test phase,
where the future data are predicted and the true out-sample performance is checked. For
CV methods, the above phase (A) is further split into two phases, namely (A1), the train-
ing phase to build up each statistical/machine-learning model on the basis of the training
data; and (A2), the validation phase which selects the best-performing model on the
basis of the validation data. In the above step (B), analysts are ready to use/implement
the obtained model for predicting the future (unseen) data. Nevertheless, in some data
practice (e.g., in writing papers), analysts may wonder how that model is going to per-
form on completely unseen real-world data. In that scenario, part of the original dataset
(referred to as the “test set”) is taken out before phase (A), in order to approximate the
true predictive performance of the finally selected model after phase (B). In a typical
application that is based on the given data, the test set is not really available, and thus
we need only consider the original dataset being split into two parts: (1) the training set
and (2) the validation set.
It has been proved that the CV method is consistent in choosing between the AIC and the
BIC given nt → ∞, nv /nt → ∞, and some other regularity assumptions (Theorem 1 of
[48]). In other words, the probability of the BIC being selected goes to 1 in a paramet-
ric framework, and the probability of the AIC being selected goes to 1 otherwise. In
this way, the modeling-procedure selection using CV naturally leads to a hybrid model-
selection criterion that builds upon the strengths of the AIC and the BIC. Such hybrid
selection is going to combine some of the theoretical advantages of both the AIC and
the BIC, as the BC does. The task of classification is somewhat more relaxed than the
task of regression. In order to achieve consistency in selecting the better classifier, the
splitting ratio may be allowed to converge to infinity or any positive constant, depending
on the situation [14]. In general, it is safe to let nt → ∞ and nv /nt → ∞ for consistency
in modeling-procedure selection.
Example 12.2 A set of real-world data reported by Scheetz et al. in [49] (and avail-
able from http://bioconductor.org) consists of over 31 000 gene probes represented on an
Affymetrix expression microarray, from an experiment using 120 rats. Using domain-
specific prescreening procedures as proposed in [49], 18 976 probes were selected that
exhibit sufficient signals for reliable analysis. One purpose of the real data was to dis-
cover how genetic variation relates to human eye disease. Researchers are interested in
finding the genes whose expression is correlated with that of the gene TRIM32, which
has recently been found to cause human eye diseases. Its probe is 1389163_at, one of the
18 976 probes. In other words, we have n = 120 data observations and 18 975 variables
that possibly relate to the observation. From domain knowledge, one expects that only a
few genes are related to TRIM32.
With $n_v$ holdout observations, the standard error of the measured classification accuracy is roughly
$\sqrt{p(1-p)/n_v}$,
where p is the true out-sample success rate of a procedure. Suppose the top-ranked
team achieves slightly above p = 50%. Then, in order for the competition to declare it
to be the true winner, at the 95% confidence level, over the second-ranked team with only a
1% observed difference in classification accuracy, the standard error $1/\sqrt{4 n_v}$ is roughly
required to be smaller than 0.5%, which demands $n_v \ge 10\,000$. Thus, if the holdout sample
size is not large enough, the winning team may well be just the lucky one among the
top-ranking teams. It is worth pointing out that the discussion above is based on sum-
mary accuracy measures (e.g., classification accuracy on the holdout data), and other
hypothesis tests can be employed for formal comparisons of the competing teams (e.g.,
tests based on differencing of the prediction errors).
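The required holdout size can be checked with a few lines of arithmetic; the snippet below simply restates the standard-error calculation above (the 95% level enters through the requirement that the standard error be about half of the 1% accuracy gap).

import math

p = 0.5                      # out-sample accuracy near 50%
gap = 0.01                   # observed accuracy difference between the top two teams
se_target = gap / 2          # want the standard error to be at most about 0.5%

def std_err(n_v, p=0.5):
    # Standard error of a single team's measured accuracy on n_v holdout points.
    return math.sqrt(p * (1 - p) / n_v)

# Smallest n_v with sqrt(p(1-p)/n_v) <= 0.5%:
n_v = math.ceil(p * (1 - p) / se_target ** 2)
print(n_v, std_err(n_v))     # 10000, 0.005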
To summarize this section, we introduced the use of cross-validation as a general tool
for modeling-procedure selection, and we discussed the issue of choosing the test data
size so that such selection is indeed reliable for the purpose of consistency. For modeling-
procedure selection, it is often safe to let the validation size take a large proportion (e.g.,
half) of the data in order to achieve good selection behavior. In particular, the use of
LOO for the goal of comparing procedures is the least trustworthy method. The popular 10-fold CV may leave too few observations for evaluation to be stable; indeed, 5-fold CV often produces more stable selection results for high-dimensional regression. Moreover, $v$-fold CV is often unstable regardless of $v$, and a repeated $v$-fold approach usually improves performance. A quantitative relation between model stability and selection consistency remains to be established. We also introduced the
paradox that using more training data and validation data does not necessarily lead to
better modeling-procedure selection. That further indicates the importance of choosing
an appropriate splitting ratio.
12.7 Conclusion
Over the past decades, there has been a debate regarding whether to use the AIC or the BIC, centering on whether the true data-generating model is parametric or not with respect to
the specified model class, which in turn affects the achievability of fundamental limits
of learning the optimal model given a dataset and a model class at hand. Compared
with the BIC, the AIC seems to be more widely used in practice, perhaps mainly due
to the thought that “all models are wrong” and the minimax-rate optimality of the AIC
offers more protection than does the BIC. Nevertheless, the parametric setting is still
of vital importance. One reason for this is that being consistent in selecting the true
model if it is really among the candidates is certainly mathematically appealing. Also, a
non-parametric scenario can be virtually a parametric scenario, in which the optimal model (even if it is not the true model) is stable at the current sample size. The war between the AIC and the BIC originates from two fundamentally different goals: one is to minimize a certain loss for prediction purposes, and the other is to select the optimal model for
inference purposes. A unified perspective on integrating their fundamental limits is a
central issue in model selection, which remains an active line of research.
References
[1] S. Greenland, “Modeling and variable selection in epidemiologic analysis,” Am. J. Public
Health, vol. 79, no. 3, pp. 340–349, 1989.
[2] C. M. Andersen and R. Bro, “Variable selection in regression – a tutorial,” J. Chemomet-
rics, vol. 24, nos. 11–12, pp. 728–737, 2010.
[3] J. B. Johnson and K. S. Omland, “Model selection in ecology and evolution,” Trends
Ecology Evolution, vol. 19, no. 2, pp. 101–108, 2004.
[4] P. Stoica and Y. Selen, “Model-order selection: A review of information criterion rules,”
IEEE Signal Processing Mag., vol. 21, no. 4, pp. 36–47, 2004.
[5] J. B. Kadane and N. A. Lazar, “Methods and criteria for model selection,” J. Amer. Statist.
Assoc., vol. 99, no. 465, pp. 279–290, 2004.
[6] J. Ding, V. Tarokh, and Y. Yang, “Model selection techniques: An overview,” IEEE Signal
Processing Mag., vol. 35, no. 6, pp. 16–34, 2018.
[7] R. Tibshirani, “Regression shrinkage and selection via the LASSO,” J. Roy. Statist. Soc.
Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[8] J. Shao, “An asymptotic theory for linear model selection,” Statist. Sinica, vol. 7, no. 2,
pp. 221–242, 1997.
[9] C. R. Rao, “Information and the accuracy attainable in the estimation of statistical
parameters,” in Breakthroughs in statistics. Springer, 1992, pp. 235–247.
[10] J. Ding, V. Tarokh, and Y. Yang, “Optimal variable selection in regression models,”
http://jding.org/jie-uploads/2017/11/regression.pdf, 2016.
[11] Y. Nan and Y. Yang, “Variable selection diagnostics measures for high-dimensional
regression,” J. Comput. Graphical Statist., vol. 23, no. 3, pp. 636–656, 2014.
[12] J. P. Ioannidis, “Why most published research findings are false,” PLoS Medicine, vol. 2,
no. 8, p. e124, 2005.
[13] R. Shibata, “Asymptotically efficient selection of the order of the model for estimating
parameters of a linear process,” Annals Statist., vol. 8, no. 1, pp. 147–164, 1980.
[14] Y. Yang, “Comparing learning methods for classification,” Statist. Sinica, vol. 16, no. 2,
pp. 635–657, 2006.
[15] J. Ding, V. Tarokh, and Y. Yang, “Bridging AIC and BIC: A new criterion for autoregres-
sion,” IEEE Trans. Information Theory, vol. 64, no. 6, pp. 4024–4043, 2018.
[16] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Automation
Control, vol. 19, no. 6, pp. 716–723, 1974.
[17] H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in
Selected papers of Hirotugu Akaike. Springer, 1998, pp. 199–213.
[18] G. Schwarz, “Estimating the dimension of a model,” Annals Statist., vol. 6, no. 2, pp. 461–
464, 1978.
[19] A. Gelman, H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian
data analysis. Chapman and Hall/CRC, 2013.
[20] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 1998, vol. 3.
[21] J. S. Liu, Monte Carlo strategies in scientific computing. Springer Science & Business
Media, 2008.
[22] D. M. Allen, “The relationship between variable selection and data agumentation and a
method for prediction,” Technometrics, vol. 16, no. 1, pp. 125–127, 1974.
[23] S. Geisser, “The predictive sample reuse method with applications,” J. Amer. Statist. Assoc.,
vol. 70, no. 350, pp. 320–328, 1975.
[47] G. E. Box, G. M. Jenkins, and G. C. Reinsel, Time series analysis: Forecasting and control.
John Wiley & Sons, 2011.
[48] Y. Zhang and Y. Yang, “Cross-validation for selecting a model selection procedure,”
J. Econometrics, vol. 187, no. 1, pp. 95–112, 2015.
[49] T. E. Scheetz, K.-Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L. Knudtson,
A. M. Dorrance, G. F. DiBona, J. Huang, T. L. Casavant, V. C. Sheffield, and E. M. Stone,
“Regulation of gene expression in the mammalian eye and its relevance to eye disease,”
Proc. Natl. Acad. Sci. USA, vol. 103, no. 39, pp. 14 429–14 434, 2006.
[50] J. Huang, S. Ma, and C.-H. Zhang, “Adaptive lasso for sparse high-dimensional regression
models,” Statist. Sinica, pp. 1603–1618, 2008.
13 Statistical Problems with Planted
Structures: Information-Theoretical
and Computational Limits
Yihong Wu and Jiaming Xu
Summary
This chapter provides a survey of the common techniques for determining the sharp sta-
tistical and computational limits in high-dimensional statistical problems with planted
structures, using community detection and submatrix detection problems as illustrative
examples. We discuss tools including the first- and second-moment methods for ana-
lyzing the maximum-likelihood estimator, information-theoretic methods for proving
impossibility results using mutual information and rate-distortion theory, and methods
originating from statistical physics such as the interpolation method. To investigate com-
putational limits, we describe a common recipe to construct a randomized polynomial-
time reduction scheme that approximately maps instances of the planted-clique problem
to the problem of interest in total variation distance.
13.1 Introduction
The interplay between information theory and statistics is a constant theme in the devel-
opment of both fields. Since its inception, information theory has been indispensable for
understanding the fundamental limits of statistical inference. The classical information
bound provides fundamental lower bounds for the estimation error, including Cramér–
Rao and Hammersley–Chapman–Robbins lower bounds in terms of Fisher information
and χ2 -divergence [1, 2]. In the classical “large-sample” regime in parametric statistics,
Fisher information also governs the sharp minimax risk in regular statistical models [3].
The prominent role of information-theoretic quantities such as mutual information, met-
ric entropy, and capacity in establishing the minimax rates of estimation has long been
recognized since the seminal work of [4–7], etc.
Instead of focusing on the large-sample asymptotics, the attention of contemporary
statistics has shifted toward high dimensions, where the problem size and the sample
size grow simultaneously and the main objective is to obtain a tight characterization of
the optimal statistical risk. Certain information-theoretic methods have been remarkably
successful for high-dimensional problems. Such methods include those based on metric
entropy and Fano’s inequality for determining the minimax risk within universal con-
stant factors (minimax rates) [7]. Unfortunately, the aforementioned methods are often
too crude for the task of determining the sharp constant, which requires more refined
analysis and stronger information-theoretic tools.
An additional challenge in dealing with high dimensionality is the need to address
the computational aspect of statistical inference. An important element absent from the
classical statistical paradigm is the computational complexity of inference procedures,
which is becoming increasingly relevant for data scientists dealing with large-scale noisy
datasets. Indeed, recent results [8–13] revealed the surprising phenomenon that certain
problems concerning large networks and matrices undergo an “easy–hard–impossible”
phase transition, and computational constraints can severely penalize the statistical per-
formance. It is worth pointing out that here the notion of complexity differs from the
worst-case computational hardness studied in the computer science literature which
focused on the time and space complexity of various worst-case problems. In contrast,
in a statistical context, the problem is of a stochastic nature and the existing theory on
average-case hardness is significantly underdeveloped. Here, the hardness of a statistical
problem is often established either within the framework of certain computation models,
such as the sums-of-squares relaxation hierarchy, or by means of a reduction argument
from another problem, notably the planted-clique problem, which is conjectured to be
computationally intractable.
In this chapter, we provide an exposition on some of the methods for determining
the information-theoretic as well as the computational limits for high-dimensional sta-
tistical problems with a planted structure, with a specific focus on characterizing sharp
thresholds. Here the planted structure refers to the true parameter, which is often of
a combinatorial nature (e.g., partition) and hidden in the presence of random noise.
To characterize the information-theoretic limit, we will discuss tools including the
first- and second-moment methods for analyzing the maximum-likelihood estimator,
information-theoretic methods for proving impossibility results using mutual informa-
tion and rate-distortion theory, and methods originating from statistical physics such
as the interpolation method. There is no established recipe for determining the com-
putational limit of statistical problems, especially the “easy–hard–impossible” phase
transition, and it is usually done on a case-by-case basis; nevertheless, the common
element is to construct a randomized polynomial-time reduction scheme that approxi-
mately maps instances of a given hard problem to one that is close to the problem of
interest in total variation distance.
so that the criterion T (A) ≥ τ determines with high probability whether A is drawn from
P or Q.
definition 13.4 (Correlated recovery) The estimator $\hat{\sigma}$ achieves correlated recovery of $\sigma^*$ if there exists a fixed constant $\epsilon > 0$ such that $\mathbb{E}\big[|\langle\hat{\sigma}, \sigma^*\rangle|\big] \geq \epsilon n$ for all $n$.
The detection problem can be understood as a binary hypothesis-testing problem.
Given a test statistic T (A), we consider its distribution under the planted and null
models. If these two distributions are asymptotically disjoint, i.e., their total variation
distance tends to 1 in the limit of large datasets, then it is information-theoretically pos-
sible to distinguish the two models with high probability by measuring T (A). A classic
choice of statistic for binary hypothesis testing is the likelihood ratio,
$$\frac{P(A)}{Q(A)} = \frac{\sum_\sigma P(A,\sigma)}{Q(A)} = \frac{\sum_\sigma P(A|\sigma)\,P(\sigma)}{Q(A)}.$$
This object will figure heavily both in our upper bounds and in our lower bounds of the
detection threshold.
Before presenting our proof techniques, we first give the sharp threshold for detection
and correlated recovery under the binary symmetric community model.
1 This criterion is also known as strong detection, in contrast to weak detection which requires only that
P(T (A) < τ) + Q(T (A) ≥ τ) be bounded away from 1 as n → ∞. Here we focus exclusively on strong
detection. See [24, 25] for detailed discussions on weak detection.
2 Throughout this chapter, logarithms are with respect to the natural base.
sub-optimal in view of Theorem 13.1. One reason is that the naive union bound in the first-moment analysis may not be tight; it does not take into account the correlation between $P(A|\sigma)$ and $P(A|\sigma')$ for two different $\sigma, \sigma'$ under the null model.
Next we explain how to carry out the first-moment analysis in the Gaussian case with $P = N(\mu/\sqrt{n}, 1)$ and $Q = N(-\mu/\sqrt{n}, 1)$. Specifically, assume $A = (\mu/\sqrt{n})\big(\sigma^*(\sigma^*)^\top - I\big) + W$, where $W$ is a symmetric Gaussian random matrix with zero diagonal and $W_{ij} \overset{\text{i.i.d.}}{\sim} N(0,1)$ for $i < j$. It follows that $\log\big(P(A|\sigma)/Q(A)\big) = (\mu/\sqrt{n})\sum_{i<j} A_{ij}\sigma_i\sigma_j - \mu^2(n-1)/4$. Therefore, the generalized likelihood test reduces to the test statistic $\max_\sigma T(\sigma)$, where $T(\sigma) \triangleq \sum_{i<j} A_{ij}\sigma_i\sigma_j$. Under the null model $Q$, $T(\sigma) \sim N(0, n(n-1)/2)$. Under the planted model $P$, $T(\sigma) = (\mu/\sqrt{n})\sum_{i<j}\sigma^*_i\sigma^*_j\sigma_i\sigma_j + \sum_{i<j}W_{ij}\sigma_i\sigma_j$. Hence the distribution of $T(\sigma)$ depends on the overlap $|\langle\sigma,\sigma^*\rangle|$ between $\sigma$ and the planted partition $\sigma^*$. Suppose $|\langle\sigma,\sigma^*\rangle| = n\omega$. Then
$$T(\sigma) \sim N\left(\frac{\mu n(n\omega^2 - 1)}{2\sqrt{n}},\ \frac{n(n-1)}{2}\right).$$
To prove that detection is possible, notice that, in the planted model, $\max_\sigma T(\sigma) \geq T(\sigma^*)$. Setting $\omega = 1$, Gaussian tail bounds yield that
$$P\left\{T(\sigma^*) \leq \frac{\mu n(n-1)}{2\sqrt{n}} - n\sqrt{\log n}\right\} \leq n^{-1}.$$
Under the null model, taking the union bound over at most $2^n$ ways to choose $\sigma$, we can bound the probability that any partition is as good, according to $T$, as the planted one, by
$$Q\left\{\max_\sigma T(\sigma) > \frac{\mu n(n-1)}{2\sqrt{n}} - n\sqrt{\log n}\right\} \leq 2^n \exp\left(-n\left(\frac{\mu}{2}\sqrt{\frac{n-1}{n}} - \sqrt{\frac{\log n}{n-1}}\right)^2\right).$$
Thus the probability of this event is $e^{-\Omega(n)}$ whenever $\mu > 2\sqrt{\log 2}$, meaning that above this threshold we can distinguish the null and planted models with the generalized likelihood test.
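For intuition, the following Python sketch simulates the Gaussian model and evaluates the statistic $T(\sigma)$ by brute force; the tiny problem size $n = 12$ and the value of $\mu$ are illustrative choices, since exhaustive maximization over all $2^n$ labelings is only feasible for very small $n$.

import itertools
import numpy as np

rng = np.random.default_rng(1)
n, mu = 12, 2.0                # tiny n: brute force over 2^n labelings is feasible; mu is illustrative

def T(A, sigma):
    # Test statistic T(sigma) = sum_{i<j} A_ij * sigma_i * sigma_j.
    # (A has zero diagonal here, so sigma @ A @ sigma = 2 * sum_{i<j} A_ij sigma_i sigma_j.)
    return (sigma @ A @ sigma) / 2

def sample(planted):
    W = rng.standard_normal((n, n))
    W = np.triu(W, 1); W = W + W.T                       # symmetric Gaussian noise, zero diagonal
    if not planted:
        return W, None
    sigma = rng.choice([-1, 1], size=n)
    M = (mu / np.sqrt(n)) * (np.outer(sigma, sigma) - np.eye(n))
    return M + W, sigma

def glrt(A):
    # Generalized likelihood ratio statistic: maximize T(sigma) over all +-1 labelings.
    return max(T(A, np.array(s)) for s in itertools.product([-1, 1], repeat=n))

A0, _ = sample(planted=False)
A1, sigma_star = sample(planted=True)
print("null:", glrt(A0), "planted:", glrt(A1), "T(sigma*):", T(A1, sigma_star))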
To prove that correlated recovery is possible, since $\mu > 2\sqrt{\log 2}$, there exists a fixed $\epsilon > 0$ such that $\mu(1 - \epsilon^2) > 2\sqrt{\log 2}$. Taking the union bound over every $\sigma$ with $|\langle\sigma,\sigma^*\rangle| \leq \epsilon n$ gives
$$P\left\{\max_{|\langle\sigma,\sigma^*\rangle| \leq \epsilon n} T(\sigma) \geq \frac{\mu n(n-1)}{2\sqrt{n}} - n\sqrt{\log n}\right\} \leq 2^n \exp\left(-n\left(\frac{\mu(1-\epsilon^2)}{2}\sqrt{\frac{n}{n-1}} - \sqrt{\frac{\log n}{n-1}}\right)^2\right).$$
3 In fact, the quantity $\rho \triangleq \int\frac{(dP - dQ)^2}{2\,d(P+Q)}$ is an $f$-divergence known as the Vincze–Le Cam distance [4, 36].
$$= \mathbb{E}\Bigg[\prod_{i<j}\bigg(1 + \underbrace{\int\frac{(dP - dQ)^2}{2\,d(P+Q)}}_{\rho}\,\sigma_i\tilde{\sigma}_i\sigma_j\tilde{\sigma}_j\bigg)\Bigg] \tag{13.4}$$
$$\leq \mathbb{E}\Bigg[\exp\bigg(\rho\sum_{i<j}\sigma_i\tilde{\sigma}_i\sigma_j\tilde{\sigma}_j\bigg)\Bigg]. \tag{13.5}$$
For the Bernoulli setting where $P = \mathrm{Bern}(a/n)$ and $Q = \mathrm{Bern}(b/n)$ for fixed constants $a, b$, we have $\rho = \tau/n + O(1/n^2)$, where $\tau \triangleq (a-b)^2/(2(a+b))$. Thus,
$$\chi^2(P\|Q) + 1 \leq \mathbb{E}\bigg[\exp\Big(\frac{\tau}{2n}\langle\sigma,\tilde{\sigma}\rangle^2 + O(1)\Big)\bigg].$$
We then write $\sigma = 2\xi - 1$, where $\xi \in \{0,1\}^n$ is the indicator vector for the first community, which is drawn uniformly at random from all binary vectors with Hamming weight $n/2$, and $\tilde{\xi}$ is its independent copy. Then $\langle\sigma,\tilde{\sigma}\rangle = 4\langle\xi,\tilde{\xi}\rangle - n$, where $H \triangleq \langle\xi,\tilde{\xi}\rangle \sim \mathrm{Hypergeometric}(n, n/2, n/2)$. Thus
$$\chi^2(P\|Q) + 1 \leq \mathbb{E}\bigg[\exp\bigg(\frac{\tau}{2}\Big(\frac{4H-n}{\sqrt{n}}\Big)^2 + O(1)\bigg)\bigg].$$
Since $(H - n/4)\big/\sqrt{n/16} \to N(0,1)$ as $n \to \infty$ by the central limit theorem for hypergeometric distributions (see, e.g., p. 194 of [37]), using Theorem 1 of [38] for the convergence of the moment-generating function, we conclude that $\chi^2(P\|Q)$ is bounded if $\tau < 1$.
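The boundedness condition $\tau < 1$ can be probed numerically; the Monte Carlo sketch below estimates the expectation appearing in the bound for illustrative values of $a$ and $b$ (it is only a sanity check of the displayed bound, not part of the proof).

import numpy as np

rng = np.random.default_rng(0)
n, trials = 2000, 20000

def second_moment_estimate(a, b):
    # Estimate E[exp((tau/2) * ((4H - n)/sqrt(n))^2)] with H ~ Hypergeometric(n, n/2, n/2),
    # i.e., the quantity bounding chi^2(P||Q) + 1 up to the e^{O(1)} factor.
    tau = (a - b) ** 2 / (2 * (a + b))
    H = rng.hypergeometric(n // 2, n // 2, n // 2, size=trials)
    return tau, np.mean(np.exp(0.5 * tau * ((4 * H - n) / np.sqrt(n)) ** 2))

print(second_moment_estimate(a=3.0, b=1.0))   # tau = 0.5 < 1: estimate stays near 1/sqrt(1 - tau)
print(second_moment_estimate(a=6.0, b=1.0))   # tau > 1: heavy-tailed; grows without bound as n grows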
4 Indeed, since P{σ2 = −|σ1 = +} = n/(2n − 2), I(σ1 ; σ2 ) = log 2 − h(n/(2n − 2)) = Θ(n−2 ), where h is the
binary entropy function in (13.34).
equivalent to stating that σ1 and σ2 are asymptotically independent given the observa-
tion A; this is shown in Theorem 2.1 of [39] for the SBM below the recovery threshold
τ = (a − b)2 /(2(a + b)) < 1.
Polyanskiy and Wu recently [40] proposed an information-percolation method based
on strong data-processing inequalities for mutual information to bound the mutual infor-
mation in (13.6) in terms of bond percolation probabilities, which yields bounds or a
sharp recovery threshold for correlated recovery; a similar program is carried out inde-
pendently in [41] for a variant of mutual information defined via the χ2 -divergence. For
two communities, this method yields the sharp threshold in the Gaussian model but not
in the SBM.
Next, we describe another method of proving (13.6) via second-moment analysis that reaches the sharp threshold. Let $P_+$ and $P_-$ denote the conditional distributions of $A$ conditioned on $\sigma_1 = \sigma_2$ and on $\sigma_1 \neq \sigma_2$, respectively. The following result can be distilled from [42] (see Appendix A13.2 for a proof): for any probability distribution $Q$, if
$$\int\frac{(dP_+ - dP_-)^2}{dQ} = o(1), \tag{13.7}$$
then (13.6) holds and hence correlated recovery is impossible. The LHS of (13.7) can
be computed similarly to the usual second moment (13.4) when Q is chosen to be the
distribution of A under the null model. In Appendix A13.2 we verify that (13.7) is sat-
isfied below the correlated recovery threshold τ = (a − b)2 /(2(a + b)) < 1 for the binary
symmetric SBM.
Blockwise mutual information I(σ; A). Although this quantity is not directly related
to correlated recovery per se, its derivative with respect to some appropriate signal-to-
noise-ratio (SNR) parameter can be related to or coincides with the reconstruction error
thanks to the I-MMSE formula [43] or variants. Using this method, we can prove that
the Kullback–Leibler divergence D(P Q) = o(n) implies the impossibility of correlated
recovery in the Gaussian case. As shown in (13.2), a bounded second moment read-
ily implies a bounded KL divergence. Hence, as a corollary, we prove that a bounded
second moment also implies the impossibility of correlated recovery in the Gaussian
case. Below, we sketch the proof of the impossibility of correlated recovery in the Gaus-
sian case, by assuming D(P Q) = o(n). The proof makes use of mutual information, the
I-MMSE formula, and a type of interpolation argument [44–46].
Assume that $A(\beta) = \sqrt{\beta}\,M + W$ in the planted model and $A = W$ in the null model, where $\beta \in [0,1]$ is an SNR parameter, $M = (\mu/\sqrt{n})(\sigma\sigma^\top - I)$, $W$ is a symmetric Gaussian random matrix with zero diagonal, and $W_{ij} \overset{\text{i.i.d.}}{\sim} N(0,1)$ for all $i < j$. Note that $\beta = 1$ corresponds to the binary symmetric community model in Definition 13.2 with $P = N(\mu/\sqrt{n}, 1)$ and $Q = N(-\mu/\sqrt{n}, 1)$. Below we abbreviate $A(\beta)$ as $A$ whenever the context is clear. First, recall that the minimum mean-squared error estimator is given by the posterior mean of $M$:
$$\widehat{M}_{\mathrm{MMSE}}(A) = \mathbb{E}[M|A],$$
and the resulting (rescaled) minimum mean-squared error is
$$\mathrm{MMSE}(\beta) = \frac{1}{n}\,\mathbb{E}\big\|M - \mathbb{E}[M|A]\big\|_F^2. \tag{13.8}$$
We will start by proving that, if $D(P\|Q) = o(n)$, then, for all $\beta \in [0,1]$, the MMSE tends to that of the trivial estimator $\widehat{M} = 0$, i.e.,
$$\lim_{n\to\infty}\mathrm{MMSE}(\beta) = \lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\|M\|_F^2 = \mu^2. \tag{13.9}$$
Note that $\lim_{n\to\infty}\mathrm{MMSE}(\beta)$ exists by virtue of Proposition III.2 of [44]. Let us compute the mutual information between $M$ and $A$:
$$I(\beta) \triangleq I(M;A) = \mathbb{E}_{M,A}\left[\log\frac{P(A|M)}{P(A)}\right] \tag{13.10}$$
$$= \mathbb{E}_A\left[\log\frac{Q(A)}{P(A)}\right] + \mathbb{E}_{M,A}\left[\log\frac{P(A|M)}{Q(A)}\right]$$
$$= -D(P\|Q) + \frac{1}{2}\,\mathbb{E}_{M,A}\left[\sqrt{\beta}\,\langle M, A\rangle - \frac{\beta\,\|M\|_F^2}{2}\right]$$
$$= -D(P\|Q) + \frac{\beta}{4}\,\mathbb{E}\|M\|_F^2. \tag{13.11}$$
By assumption, we have that D(P Q) = o(n) holds for β = 1; by the data-processing
inequality for KL divergence [47], this holds for all β < 1 as well. Thus (13.11) becomes
$$\lim_{n\to\infty}\frac{1}{n}I(\beta) = \frac{\beta}{4}\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\|M\|_F^2 = \frac{\beta\mu^2}{4}. \tag{13.12}$$
Next we compute the MMSE. Recall the I-MMSE formula [43] for Gaussian channels:
$$\frac{dI(\beta)}{d\beta} = \frac{1}{2}\sum_{i<j}\mathbb{E}\big(M_{ij} - \mathbb{E}[M_{ij}|A]\big)^2 = \frac{n}{4}\,\mathrm{MMSE}(\beta). \tag{13.13}$$
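As a sanity check of the I-MMSE relation in its simplest form, the following snippet verifies the scalar identity $dI/d\beta = \mathrm{MMSE}(\beta)/2$ for a scalar Gaussian signal; the factor $n/4$ in (13.13) arises only because $M$ is a symmetric matrix with zero diagonal, and the scalar prior variance below is an arbitrary illustrative choice.

import math

sigma2 = 2.0                         # prior variance of the scalar signal X ~ N(0, sigma2)

def I(beta):
    # Mutual information I(X; sqrt(beta) X + Z) for the scalar Gaussian channel, Z ~ N(0, 1).
    return 0.5 * math.log(1.0 + beta * sigma2)

def mmse(beta):
    # Minimum mean-squared error of estimating X from sqrt(beta) X + Z.
    return sigma2 / (1.0 + beta * sigma2)

beta, h = 0.7, 1e-6
finite_diff = (I(beta + h) - I(beta - h)) / (2 * h)
print(finite_diff, 0.5 * mmse(beta))   # dI/dbeta = MMSE(beta)/2, the scalar I-MMSE identity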
Note that the MMSE is by definition bounded above by the squared error of the trivial estimator $\widehat{M} = 0$, so that for all $\beta$ we have
$$\mathrm{MMSE}(\beta) \leq \frac{1}{n}\,\mathbb{E}\|M\|_F^2 \leq \mu^2. \tag{13.14}$$
On combining these we have
$$\frac{\mu^2}{4} \overset{(a)}{=} \lim_{n\to\infty}\frac{I(1)}{n} \overset{(b)}{=} \frac{1}{4}\lim_{n\to\infty}\int_0^1 \mathrm{MMSE}(\beta)\,d\beta \overset{(c)}{\leq} \frac{1}{4}\int_0^1 \lim_{n\to\infty}\mathrm{MMSE}(\beta)\,d\beta \overset{(d)}{\leq} \frac{1}{4}\int_0^1 \mu^2\,d\beta = \frac{\mu^2}{4},$$
where (a) and (b) hold due to (13.12) and (13.13), (c) follows from Fatou’s lemma, and
(d) follows from (13.14), i.e., MMSE(β) ≤ μ2 pointwise. Since we began and ended with
the same expression, these inequalities must all be equalities. In particular, since (d)
holds with equality, we have that (13.9) holds for almost all β ∈ [0, 1]. Since MMSE(β)
$$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\big[-2\langle M, \mathbb{E}[M|A]\rangle + \|\mathbb{E}[M|A]\|_F^2\big] = 0. \tag{13.15}$$
From the tower property of conditional expectation and the linearity of the inner product,
it follows that
$$\lim_{n\to\infty}\frac{1}{n}\,\mathbb{E}\,\|\mathbb{E}[M|A]\|_F^2 = 0. \tag{13.16}$$
Hence, for any estimator $\widehat{M}$ with $\|\widehat{M}\|_F \leq \mu\sqrt{n}$,
$$\mathbb{E}_{M,A}\big[\langle M, \widehat{M}\rangle\big] = \mathbb{E}_A\big[\langle \mathbb{E}[M|A], \widehat{M}\rangle\big] \leq \mathbb{E}_A\big[\|\mathbb{E}[M|A]\|_F\,\|\widehat{M}\|_F\big] \leq \sqrt{\mathbb{E}_A\big[\|\mathbb{E}[M|A]\|_F^2\big]}\times\mu\sqrt{n} \overset{(13.16)}{=} o(n).$$
To obtain upper bounds on the thresholds for almost-exact and exact recovery, we turn to the MLE. Specifically,
• to show that the MLE achieves almost-exact recovery, it suffices to prove that there exists $\epsilon_n = o(1)$ such that, with high probability, $P(A|\xi) < P(A|\xi^*)$ for all $\xi$ with $d_H(\xi, \xi^*) \geq \epsilon_n K$; and
• to show that the MLE achieves exact recovery, it suffices to prove that, with high probability, $P(A|\xi) < P(A|\xi^*)$ for all $\xi \neq \xi^*$.
This type of argument often involves two key steps. First, upper-bound the probability
that P(A|ξ) ≥ P(A|ξ∗ ) for a fixed ξ using large-deviation techniques. Second, take an
appropriate union bound over all possible ξ using a “peeling” argument which takes into
account the fact that the further away ξ is from ξ∗ the less likely it is for P(A|ξ) ≥ P(A|ξ∗ )
to occur. Below we discuss these two key steps in more detail.
Given the data matrix $A$, a sufficient statistic for estimating the community $C^*$ is the log likelihood ratio (LLR) matrix $L \in \mathbb{R}^{n\times n}$, where $L_{ij} = \log(dP/dQ)(A_{ij})$ for $i \neq j$ and $L_{ii} = 0$. For $S, T \subset [n]$, define
$$e(S,T) = \sum_{(i<j):(i,j)\in(S\times T)\cup(T\times S)} L_{ij}. \tag{13.17}$$
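A direct transcription of the statistic $e(S,T)$ in code may help fix the notation; the Bernoulli pair $(P,Q)$, the community size, and the graph size below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 30, 0.6, 0.2                       # illustrative sizes and Bernoulli parameters
C_star = set(range(10))                      # planted community C*

# Adjacency matrix of the single-community model.
A = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        prob = p if (i in C_star and j in C_star) else q
        A[i, j] = A[j, i] = rng.random() < prob

# LLR matrix: L_ij = log(dP/dQ)(A_ij) for i != j, and L_ii = 0.
L = np.where(A == 1, np.log(p / q), np.log((1 - p) / (1 - q)))
np.fill_diagonal(L, 0.0)

def e(S, T):
    # e(S, T): sum of L_ij over pairs i < j with (i, j) in (S x T) u (T x S), as in (13.17).
    return sum(L[i, j] for i in range(n) for j in range(i + 1, n)
               if (i in S and j in T) or (i in T and j in S))

print(e(C_star, C_star), e(set(range(10, 20)), set(range(10, 20))))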
almost-exact recovery; nevertheless, we choose to analyze the MLE due to its simplicity,
and it turns out to be asymptotically optimal for almost-exact recovery as well.
To state the main results, we introduce some standard notations associated with
binary hypothesis testing based on independent samples. We assume the KL divergences
$D(P\|Q)$ and $D(Q\|P)$ are finite. In particular, $P$ and $Q$ are mutually absolutely continuous, and the likelihood ratio $dP/dQ$ satisfies $\mathbb{E}_Q[dP/dQ] = \mathbb{E}_P\big[(dP/dQ)^{-1}\big] = 1$. Let $L = \log(dP/dQ)$ denote the LLR. The likelihood-ratio test for $n$ observations and threshold $n\theta$ is to declare $P$ to be the true distribution if $\sum_{k=1}^n L_k \geq n\theta$ and to declare $Q$ otherwise.
For $\theta \in [-D(Q\|P), D(P\|Q)]$, the standard Chernoff bounds for the error probability of this likelihood-ratio test are given by
$$Q\Bigg[\sum_{k=1}^n L_k \geq n\theta\Bigg] \leq \exp\big(-nE_Q(\theta)\big), \tag{13.19}$$
$$P\Bigg[\sum_{k=1}^n L_k \leq n\theta\Bigg] \leq \exp\big(-nE_P(\theta)\big), \tag{13.20}$$
where the log moment generating functions of $L$ are denoted by $\psi_Q(\lambda) = \log\mathbb{E}_Q[\exp(\lambda L)]$ and $\psi_P(\lambda) = \log\mathbb{E}_P[\exp(\lambda L)] = \psi_Q(\lambda+1)$, and the large-deviation exponents are given by the Legendre transforms of the log moment generating functions, namely $E_Q(\theta) = \sup_{\lambda\geq 0}\{\lambda\theta - \psi_Q(\lambda)\}$ and $E_P(\theta) = \sup_{\lambda\leq 0}\{\lambda\theta - \psi_P(\lambda)\}$.
In particular, $E_P$ and $E_Q$ are convex functions. Moreover, since $\psi_Q'(0) = -D(Q\|P)$ and $\psi_Q'(1) = D(P\|Q)$, we have $E_Q(-D(Q\|P)) = E_P(D(P\|Q)) = 0$, and hence $E_Q(D(P\|Q)) = D(P\|Q)$ and $E_P(-D(Q\|P)) = D(Q\|P)$.
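The exponents $E_Q$ and $E_P$ can be computed numerically for a concrete pair $(P,Q)$; the sketch below does so for an illustrative Bernoulli pair by a grid search over $\lambda$ (an approximation, not a method used in the chapter) and checks the identities just stated.

import numpy as np

p, q = 0.6, 0.2                                                  # illustrative Bernoulli pair (P, Q)
L_vals = np.array([np.log(p / q), np.log((1 - p) / (1 - q))])    # LLR values on {1, 0}

def psi_Q(lam):
    # Log moment generating function of L under Q = Bern(q).
    return np.log(q * np.exp(lam * L_vals[0]) + (1 - q) * np.exp(lam * L_vals[1]))

def E_Q(theta, grid=np.linspace(0.0, 20.0, 20001)):
    # Legendre transform sup_{lambda >= 0} { lambda * theta - psi_Q(lambda) } on a grid.
    return np.max(grid * theta - psi_Q(grid))

def E_P(theta):
    # Uses the relation E_P(theta) = E_Q(theta) - theta noted later in the text.
    return E_Q(theta) - theta

D_PQ = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))   # D(P||Q)
print(E_Q(D_PQ), D_PQ)    # E_Q(D(P||Q)) = D(P||Q)
print(E_P(D_PQ))          # E_P(D(P||Q)) = 0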
Under mild assumptions on the distribution (P, Q) (cf. Assumption 1 of [51]) which
are satisfied both by the Gaussian distribution and by the Bernoulli distribution, the sharp
thresholds for almost exact and exact recovery under the single-community model are
given by the following result.
theorem 13.2 Consider the single-community model with $P = N(\mu,1)$ and $Q = N(0,1)$, or $P = \mathrm{Bern}(p)$ and $Q = \mathrm{Bern}(q)$ with $\log(p/q)$ and $\log((1-p)/(1-q))$ bounded. If
$$K \cdot D(P\|Q) \to \infty \quad\text{and}\quad \liminf_{n\to\infty}\frac{(K-1)\,D(P\|Q)}{\log(n/K)} > 2, \tag{13.23}$$
then almost-exact recovery is information-theoretically possible. If, in addition to (13.23),
$$\liminf_{n\to\infty}\frac{K\,E_Q\big((1/K)\log(n/K)\big)}{\log n} > 1 \tag{13.24}$$
holds, then exact recovery is information-theoretically possible.
Conversely, if almost-exact recovery is information-theoretically possible, then
$$K \cdot D(P\|Q) \to \infty \quad\text{and}\quad \liminf_{n\to\infty}\frac{(K-1)\,D(P\|Q)}{\log(n/K)} \geq 2. \tag{13.25}$$
Next we proceed to describe the union bound for the proof of almost-exact recovery. Note that showing that the MLE achieves almost-exact recovery is equivalent to showing $P\big\{|\widehat{C}_{\mathrm{ML}} \cap C^*| \leq (1-\epsilon_n)K\big\} = o(1)$. The first layer of the union bound is straightforward:
$$\big\{|\widehat{C}_{\mathrm{ML}} \cap C^*| \leq (1-\epsilon_n)K\big\} = \bigcup_{\ell=0}^{(1-\epsilon_n)K}\big\{|\widehat{C}_{\mathrm{ML}} \cap C^*| = \ell\big\}. \tag{13.28}$$
For the second layer of the union bound, one naive way to proceed is
$$\big\{|\widehat{C}_{\mathrm{ML}} \cap C^*| = \ell\big\} \subset \big\{\exists\, C\in\mathcal{C}_\ell : e(C,C) \geq e(C^*,C^*)\big\} = \bigcup_{C\in\mathcal{C}_\ell}\big\{e(C,C) \geq e(C^*,C^*)\big\},$$
where $\mathcal{C}_\ell$ denotes the set of candidate communities $C$ of size $K$ with $|C \cap C^*| = \ell$.
Note that $E_P(\theta) = E_Q(\theta) - \theta$. Hence, we set $\theta = (1/K)\log(n/K)$ so that $E_3 = E_4$, which goes to $+\infty$ under the assumption of (13.24). The proof of exact recovery is completed by taking the union bound over all $\ell$.
Necessary Conditions
To derive lower bounds on the almost-exact recovery threshold, we resort to a simple rate-distortion argument. Suppose $\hat{\xi}$ achieves almost-exact recovery of $\xi^*$. Then $\mathbb{E}[d_H(\hat{\xi},\xi^*)] = \epsilon_n K$ with $\epsilon_n \to 0$. On the one hand, consider the following chain of inequalities, which lower-bounds the amount of information required for a distortion level $\epsilon_n$:
$$I(A;\xi^*) \overset{(a)}{\geq} I(\hat{\xi};\xi^*) \geq \min_{\mathbb{E}[d_H(\tilde{\xi},\xi^*)]\leq \epsilon_n K} I(\tilde{\xi};\xi^*),$$
where (a) follows from the data-processing inequality for mutual information since $\xi^* \to A \to \hat{\xi}$ forms a Markov chain; (b) is due to the fact that $\max_{\mathbb{E}[w(X)]\leq pn} H(X) = nh(p)$ for any $p \leq 1/2$, where
$$h(p) \triangleq p\log\frac{1}{p} + (1-p)\log\frac{1}{1-p} \tag{13.34}$$
is the binary entropy function and $w(x) = \sum_i x_i$; and (c) follows from the bound $\binom{n}{K} \geq (n/K)^K$, the assumption that $K/n$ is bounded away from one, and the bound $h(p) \leq -p\log p + p$ for $p \in [0,1]$.
On the other hand, consider the following upper bound on the mutual information:
$$I(A;\xi^*) = \min_{\mathsf{Q}} D\big(P_{A|\xi^*}\,\big\|\,\mathsf{Q}\,\big|\,P_{\xi^*}\big) \leq D\Big(P_{A|\xi^*}\,\Big\|\,Q^{\otimes\binom{n}{2}}\,\Big|\,P_{\xi^*}\Big) = \binom{K}{2}D(P\|Q),$$
where the first equality follows from the geometric interpretation of mutual information
as an “information radius” (see, e.g., Corollary 3.1 of [52]); the last equality follows
from the tensorization property of KL divergence for product distributions. Combining
the last two displays, we conclude that the second condition in (13.25) is necessary for
almost-exact recovery.
To show the necessity of the first condition in (13.25), we can reduce almost-exact
recovery to a local hypothesis testing via a genie-type argument. Given $i, j \in [n]$, let $\xi_{\backslash i,j}$ denote $\{\xi_k : k \neq i, j\}$. Consider the following binary hypothesis-testing problem for determining $\xi_i$. If $\xi_i = 0$, a node $J$ is randomly and uniformly chosen from $\{j : \xi_j = 1\}$, and we observe $(A, J, \xi_{\backslash i,J})$; if $\xi_i = 1$, a node $J$ is randomly and uniformly chosen from $\{j : \xi_j = 0\}$, and we observe $(A, J, \xi_{\backslash i,J})$. It is straightforward to verify that this hypothesis-testing problem is equivalent to testing $H_0: Q^{\otimes(K-1)} \otimes P^{\otimes(K-1)}$ versus $H_1: P^{\otimes(K-1)} \otimes Q^{\otimes(K-1)}$.
Let E denote the optimal average probability of testing error, pe,0 denote the type-I error
probability, and $p_{e,1}$ denote the type-II error probability. Then we have the following chain of inequalities:
$$\mathbb{E}[d_H(\xi,\hat{\xi})] \geq \sum_{i=1}^n \min_{\hat{\xi}_i(A)} P[\xi_i \neq \hat{\xi}_i] \geq \sum_{i=1}^n \min_{\hat{\xi}_i(A,J,\xi_{\backslash i,J})} P[\xi_i \neq \hat{\xi}_i] = n\min_{\hat{\xi}_1(A,J,\xi_{\backslash 1,J})} P[\xi_1 \neq \hat{\xi}_1] = nE.$$
By the assumption $\mathbb{E}[d_H(\xi,\hat{\xi})] = o(K)$, it follows that $E = o(K/n)$. Under the assumption that $K/n$ is bounded away from one, $E = o(K/n)$ further implies that the sum of type-I and type-II probabilities of error satisfies $p_{e,0} + p_{e,1} = o(1)$, or equivalently, $\mathrm{TV}\big((P\otimes Q)^{\otimes(K-1)}, (Q\otimes P)^{\otimes(K-1)}\big) \to 1$, where $\mathrm{TV}(P,Q) \triangleq \int|dP - dQ|/2$ denotes the total variation distance. Using $D(P\|Q) \geq \log\frac{1}{2(1-\mathrm{TV}(P,Q))}$ (Equation (2.25) of [53]) and the tensorization property of KL divergence for product distributions, we conclude that $(K-1)\big(D(P\|Q) + D(Q\|P)\big) \to \infty$ is necessary for almost-exact recovery. It turns out that, both for the Bernoulli distribution and for the Gaussian distribution as specified in the theorem statement, $D(P\|Q) \asymp D(Q\|P)$, and hence $K\,D(P\|Q) \to \infty$ is necessary for almost-exact recovery.
Clearly, any estimator achieving exact recovery also achieves almost-exact recovery.
Hence lower bounds for almost-exact recovery hold automatically for exact recovery.
Finally, we show the necessity of (13.26) for exact recovery. Since the MLE minimizes the error probability among all estimators if the true community $C^*$ is uniformly distributed, it follows that, if exact recovery is possible, then, with high probability, $C^*$ has a strictly higher likelihood than any other community $C \neq C^*$, in particular, $C = C^*\setminus\{i\}\cup\{j\}$ for any pair of vertices $i \in C^*$ and $j \notin C^*$. To further illustrate the proof ideas, consider the Bernoulli case of the single-community model. Then $C^*$ has a strictly higher likelihood than $C^*\setminus\{i\}\cup\{j\}$ if and only if $e(i, C^*)$, the number of edges connecting $i$ to vertices in $C^*$, is larger than $e(j, C^*\setminus\{i\})$, the number of edges connecting $j$ to vertices in $C^*\setminus\{i\}$. Therefore, with high probability, it holds that
$$\min_{i\in C^*} e(i, C^*) > \max_{j\notin C^*} e\big(j, C^*\setminus\{i_0\}\big), \tag{13.35}$$
where $i_0$ is the random index such that $i_0 \in \arg\min_{i\in C^*} e(i, C^*)$. Note that the $e(j, C^*\setminus\{i_0\})$s are i.i.d. for different $j \notin C^*$, and hence a large-probability lower bound on their maximum can be derived using inverse concentration inequalities. Specifically, for the
sake of argument by contradiction, suppose that (13.26) does not hold. Furthermore, for
ease of presentation, assume that the large-deviation inequality (13.19) also holds in the
reverse direction (cf. Corollary 5 of [51] for a precise statement). Then it follows that
$$P\bigg\{e\big(j, C^*\setminus\{i_0\}\big) \geq \log\Big(\frac{n}{K}\Big)\bigg\} \geq \exp\bigg(-K E_Q\Big(\frac{1}{K}\log\Big(\frac{n}{K}\Big)\Big)\bigg) \geq n^{-1+\delta}$$
for some small $\delta > 0$. Since the $e(j, C^*\setminus\{i_0\})$s are i.i.d. and there are $n - K$ of them, it further follows that, with large probability,
$$\max_{j\notin C^*} e\big(j, C^*\setminus\{i_0\}\big) \geq \log\Big(\frac{n}{K}\Big).$$
Similarly, by assuming that the large-deviation inequality (13.20) also holds in the opposite direction and using the fact that $E_P(\theta) = E_Q(\theta) - \theta$, we get that
$$P\bigg\{e(i, C^*) \leq \log\Big(\frac{n}{K}\Big)\bigg\} \geq \exp\bigg(-K E_P\Big(\frac{1}{K}\log\Big(\frac{n}{K}\Big)\Big)\bigg) \geq K^{-1+\delta}.$$
Although the $e(i, C^*)$s are not independent for different $i \in C^*$, the dependence is weak and can be controlled properly. Hence, following the same argument as above, we get that, with large probability,
$$\min_{i\in C^*} e(i, C^*) \leq \log\Big(\frac{n}{K}\Big).$$
Combining the large-probability lower and upper bounds and (13.35) yields the contra-
diction. Hence, (13.26) is necessary for exact recovery.
remark 13.1 Note that, instead of using the MLE, one could also apply a two-step
procedure to achieve exact recovery: first use an estimator capable of almost-exact recov-
ery and then clean up the residual errors through a local voting procedure for every
vertex. Such a two-step procedure has been analyzed in [51]. From the computational
perspective, both for the Bernoulli case and for the Gaussian case we have the following
results:
• if K = Θ(n), a linear-time degree-thresholding algorithm achieves the information
limit of almost exact recovery (see Appendix A of [54] and Appendix A of [55]);
• if K = ω(n/ log n), whenever information-theoretically possible, exact recovery can
be achieved in polynomial time using semidefinite programming [56];
• if $K \geq (n/\log n)\big(1/(8e) + o(1)\big)$ for the Gaussian case and $K \geq (n/\log n)\big(\rho_{\mathrm{BP}}(p/q) + o(1)\big)$ for the Bernoulli case, exact recovery can be attained in nearly linear time via message passing plus clean-up [54, 55] whenever information-theoretically possible. Here $\rho_{\mathrm{BP}}(p/q)$ denotes a constant depending only on $p/q$.
However, it remains unknown whether any polynomial-time algorithm can achieve the respective information limit of almost-exact recovery for $K = o(n)$, or exact recovery for $K \leq (n/\log n)\big(1/(8e) - \epsilon\big)$ in the Gaussian case and for $K \leq (n/\log n)\big(\rho_{\mathrm{BP}}(p/q) - \epsilon\big)$ in the Bernoulli case, for any fixed $\epsilon > 0$.
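For concreteness, here is a minimal sketch of the degree-thresholding idea mentioned in the first bullet above (return the $K$ highest-degree vertices); the parameter values are illustrative, and this is not the refined algorithm analyzed in [54, 55].

import numpy as np

rng = np.random.default_rng(0)
n, K, p, q = 400, 100, 0.3, 0.1          # K = Theta(n): the regime where degree thresholding works

# Plant a community on the first K vertices of G(n, K, p, q).
C_star = np.arange(K)
U = rng.random((n, n)); U = np.triu(U, 1)
P_mat = np.full((n, n), q)
P_mat[np.ix_(C_star, C_star)] = p
A = (U < np.triu(P_mat, 1)).astype(int)
A = A + A.T

# Degree thresholding: output the K vertices of largest degree.
deg = A.sum(axis=1)
C_hat = np.argsort(-deg)[:K]
overlap = len(set(C_hat) & set(C_star)) / K
print(overlap)                           # fraction of the planted community recovered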
Similar techniques can be used to derive the almost-exact and exact recovery thresh-
olds for the binary symmetric community model. For the Bernoulli case, almost-exact
recovery is efficiently achieved by a simple spectral method if $n(p-q)^2/(p+q) \to \infty$ [57],
which turns out to be also information-theoretically necessary [58]. An exact recov-
ery threshold for the binary community model has been derived and further shown to
be efficiently achievable by a two-step procedure consisting of a spectral method plus
clean-up [58, 59]. For the binary symmetric community model with general discrete dis-
tributions P and Q, the information-theoretic limit of exact recovery has been shown to
be determined by the Rényi divergence of order 1/2 between P and Q [60]. The analysis
of the MLE has been carried out under k-symmetric community models for general k,
and the information-theoretic exact recovery threshold has been identified in [16] up to
a universal constant. The precise information-theoretic limit of exact recovery has been
determined in [61] for k = Θ(1) with a sharp constant and has further been shown to be
efficiently achievable by a polynomial-time two-step procedure.
In this section we discuss the computational limits (performance limits of all possible
polynomial-time procedures) of detecting the planted structure under the planted-clique
hypothesis (to be defined later). To investigate the computational hardness of a given
statistical problem, one main approach is to find an approximate randomized
polynomial-time reduction, which maps certain graph-theoretic problems, in particular,
the planted-clique problem, to our problem approximately in total variation, thereby
showing that these statistical problems are at least as hard as solving the planted-clique
problem.
We focus on the single-community model in Definition 13.1 and present results
for both the submatrix detection problem (Gaussian) [9] and the community detec-
tion problem (Bernoulli) [10]. Surprisingly, under appropriate parameterizations, the
two problems share the same “easy–hard–impossible” phase transition. As shown in
Fig. 13.1, where the horizontal and vertical axes correspond to the relative community
size and the noise level, respectively, the hardness of the detection has a sharp phase
transition: optimal detection can be achieved by computationally efficient procedures
for a relatively large community, but provably not for a small community. This is one
of the first results in high-dimensional statistics where the optimal trade-off between
statistical performance and computational efficiency can be precisely quantified.
Specifically, consider the submatrix detection problem in the Gaussian case of Def-
inition 13.1, where P = N(μ, 1) and Q = N(0, 1). In other words, the goal is to test
the null model, where the observation is an N × N Gaussian noise matrix, versus the
planted model, where there exists a $K \times K$ submatrix of elevated mean $\mu$. Consider the high-dimensional setting of $K = N^\alpha$ and $\mu = N^{-\beta}$ with $N \to \infty$, where $\alpha, \beta > 0$ parameterize the cluster size and the signal strength, respectively. Information-theoretically, it can be shown that there exist detection procedures achieving vanishing error probability if and only if $\beta < \beta^* \triangleq \max(\alpha/2,\ 2\alpha - 1)$ [21]. In contrast, if only randomized polynomial-time algorithms are allowed, then reliable detection is impossible if $\beta > \beta^\sharp \triangleq \max(0,\ 2\alpha - 1)$; conversely, if $\beta < \beta^\sharp$, there exists a near-linear-time detection algorithm with vanishing error probability. The plots of $\beta^*$ and $\beta^\sharp$ in Fig. 13.1 correspond to the statistical and computational limits of submatrix detection, respectively, revealing the following striking phase transition: for a large community ($\alpha \geq 2/3$), optimal detection can be achieved by computationally efficient procedures; however, for a small community ($\alpha < 2/3$), the computational constraint incurs a severe penalty on the statistical performance and the optimal computationally intensive procedure cannot be mimicked by any efficient algorithm.
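The two boundaries can be packaged into a small helper that classifies a parameter pair $(\alpha, \beta)$; the function below merely restates $\beta^* = \max(\alpha/2, 2\alpha - 1)$ and $\beta^\sharp = \max(0, 2\alpha - 1)$ from the discussion above, with illustrative example points.

def submatrix_detection_regime(alpha, beta):
    # Statistical and computational boundaries for K = N^alpha and mu = N^{-beta}.
    beta_star = max(alpha / 2, 2 * alpha - 1)    # information-theoretic boundary beta*
    beta_sharp = max(0.0, 2 * alpha - 1)         # conjectured computational boundary beta#
    if beta >= beta_star:
        return "impossible"
    return "easy" if beta < beta_sharp else "hard"

print(submatrix_detection_regime(0.8, 0.3))   # large community (alpha >= 2/3): easy
print(submatrix_detection_regime(0.5, 0.2))   # small community, 0 < beta < alpha/2: hard
print(submatrix_detection_regime(0.5, 0.3))   # beta above alpha/2: impossible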
Figure 13.1 Computational versus statistical limits: the phase diagram in the $(\alpha,\beta)$-plane is divided into the easy, hard, and impossible regions. For the submatrix detection problem, the size of the submatrix is $K = N^\alpha$ and the elevated mean is $\mu = N^{-\beta}$. For the community detection problem, the cluster size is $K = N^\alpha$, and the in-cluster and inter-cluster edge probabilities $p$ and $q$ are both on the order of $N^{-2\beta}$.
For the Bernoulli case, it has been shown that, to detect a planted dense subgraph, when the in-cluster and inter-cluster edge probabilities $p$ and $q$ are on the same order and parameterized as $N^{-2\beta}$, and the cluster size is $K = N^\alpha$, the easy–hard–impossible phase transition obeys the same diagram as that in Fig. 13.1 [10].
Our intractability result is based on the common hardness assumption of the planted-
clique problem in the Erdős–Rényi graph when the clique size is of smaller order than
the square root of the graph cardinality [62], which has been widely used to establish
various hardness results in theoretical computer science [63–68] as well as the hardness
of detecting sparse principal components [8]. Recently, the average-case hardness of
the planted-clique problem has been established under certain computational models
[69, 70] and within the sum-of-squares relaxation hierarchy [71–73].
The rest of the section is organized as follows. Section 13.4.1 gives the precise def-
inition of the planted-clique problem, which forms the basis of reduction both for the
submatrix detection problem and for the community detection problem, with the latter
requiring a slightly stronger assumption. Section 13.4.2 discusses how to approximately
reduce the planted-clique problem to the single-community detection problem in poly-
nomial time in both Bernoulli and Gaussian settings. Finally, Section 13.4.3 presents the
key techniques to bound the total variation between the reduced instance and the target
hypothesis.
definition 13.7 The PC detection problem with parameters (n, k, γ), denoted by
PC(n, k, γ) henceforth, refers to the problem of testing the following hypotheses:
$$H_0^C: G \sim G(n,\gamma), \qquad H_1^C: G \sim G(n,k,\gamma).$$
The problem of finding the planted clique has been extensively studied for $\gamma = \frac{1}{2}$, and the state-of-the-art polynomial-time algorithms [14, 62, 74–78] work only for $k = \Omega(\sqrt{n})$. There is no known polynomial-time solver for the PC problem for $k = o(\sqrt{n})$ and any constant $\gamma > 0$. It has been conjectured [63, 64, 67, 70, 79] that the PC problem cannot be solved in polynomial time for $k = o(\sqrt{n})$ with $\gamma = \frac{1}{2}$, which we refer to as the PC hypothesis.
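For experimentation, a planted-clique instance can be sampled in a few lines; the sizes below are illustrative, and the sampler is only meant to make Definition 13.7 concrete.

import numpy as np

rng = np.random.default_rng(0)

def planted_clique(n, k, gamma):
    # Sample G(n, k, gamma): plant a clique on k random vertices of an Erdos-Renyi graph G(n, gamma).
    A = (rng.random((n, n)) < gamma).astype(int)
    A = np.triu(A, 1); A = A + A.T
    clique = rng.choice(n, size=k, replace=False)
    A[np.ix_(clique, clique)] = 1
    np.fill_diagonal(A, 0)
    return A, clique

A, clique = planted_clique(n=200, k=20, gamma=0.5)
print(A.sum() // 2, sorted(clique)[:5])     # number of edges and a few clique vertices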
5 We can also consider a planted dense subgraph with a fixed size K, where K vertices are chosen uniformly
at random to plant a dense subgraph with edge probability p. Our reduction scheme extends to this fixed-
size model; however, we have not been able to prove that the distributions are approximately matched
under the alternative hypothesis. Nevertheless, the recent work [13] showed that the computational limit
for detecting fixed-sized community is the same as that in Fig. 13.1, resolving an open problem in [10].
definition 13.8 The planted dense subgraph detection problem with parameters
(N, K, p, q), henceforth denoted by PDS(N, K, p, q), refers to the problem of distinguish-
ing between the following hypotheses:
$$H_0: G \sim G(N, q) \triangleq P_0, \qquad H_1: G \sim G(N, K, p, q) \triangleq P_1.$$
We aim to reduce the PC(n, k, γ) problem to the PDS(N, K, cq, q) problem. For sim-
plicity, we focus on the case of c = 2; the general case follows similarly with a change in
some numerical constants that come up in the proof. We are given an adjacency matrix
$A \in \{0,1\}^{n\times n}$, or, equivalently, a graph $G$, and, with the help of additional randomness, will map it to an adjacency matrix $\widetilde{A} \in \{0,1\}^{N\times N}$, or, equivalently, a graph $\widetilde{G}$, such that the hypothesis $H_0^C$ ($H_1^C$) in Definition 13.7 is mapped to $H_0$ exactly ($H_1$ approximately) in Definition 13.8. In other words, if $A$ is drawn from $G(n,\gamma)$, then $\widetilde{A}$ is distributed according to $P_0$; if $A$ is drawn from $G(n,k,1,\gamma)$, then the distribution of $\widetilde{A}$ is close in total variation to $P_1$.
Our reduction scheme works as follows. Each vertex in $\widetilde{G}$ is randomly assigned a parent vertex in $G$, with the choice of parent being made independently for different vertices in $\widetilde{G}$, and uniformly over the set $[n]$ of vertices in $G$. Let $V_s$ denote the set of vertices in $\widetilde{G}$ with parent $s \in [n]$ and let $\ell_s \triangleq |V_s|$. Then the sets of children nodes $\{V_s : s \in [n]\}$ will form a random partition of $[N]$. For any $1 \leq s \leq t \leq n$, the number of edges, $E(V_s,V_t)$, from vertices in $V_s$ to vertices in $V_t$ in $\widetilde{G}$ will be selected randomly with a conditional probability distribution specified below. Given $E(V_s,V_t)$, the particular set of edges with cardinality $E(V_s,V_t)$ is chosen uniformly at random.
It remains to specify, for $1 \leq s \leq t \leq n$, the conditional distribution of $E(V_s,V_t)$ given $\ell_s$, $\ell_t$, and $A_{s,t}$. Ideally, conditioned on $\ell_s$ and $\ell_t$, we want to construct a Markov kernel from $A_{s,t}$ to $E(V_s,V_t)$ which maps $\mathrm{Bern}(1)$ to the desired edge distribution $\mathrm{Binom}(\ell_s\ell_t, p)$, and $\mathrm{Bern}(1/2)$ to $\mathrm{Binom}(\ell_s\ell_t, q)$, depending on whether both $s$ and $t$ are in the clique or not, respectively. Such a kernel, unfortunately, provably does not exist. Nevertheless, this objective can be accomplished approximately in terms of the total variation. For $s = t \in [n]$, let $E(V_s,V_t) \sim \mathrm{Binom}\big(\binom{\ell_t}{2}, q\big)$. For $1 \leq s < t \leq n$, denote $P_{\ell_s\ell_t} \triangleq \mathrm{Binom}(\ell_s\ell_t, p)$ and $Q_{\ell_s\ell_t} \triangleq \mathrm{Binom}(\ell_s\ell_t, q)$. Fix $0 < \gamma \leq \frac{1}{2}$ and put $m_0 \triangleq \log_2(1/\gamma)$. Define
$$P'_{\ell_s\ell_t}(m) = \begin{cases} P_{\ell_s\ell_t}(m) + a_{\ell_s\ell_t} & \text{for } m = 0,\\ P_{\ell_s\ell_t}(m) & \text{for } 1 \leq m \leq m_0,\\ (1/\gamma)\,Q_{\ell_s\ell_t}(m) & \text{for } m_0 < m \leq \ell_s\ell_t,\end{cases}$$
where $a_{\ell_s\ell_t} = \sum_{m_0 < m \leq \ell_s\ell_t}\big[P_{\ell_s\ell_t}(m) - (1/\gamma)Q_{\ell_s\ell_t}(m)\big]$. Let $Q'_{\ell_s\ell_t} = \big(1/(1-\gamma)\big)\big(Q_{\ell_s\ell_t} - \gamma P'_{\ell_s\ell_t}\big)$.
The idea behind our choice of $P'_{\ell_s\ell_t}$ and $Q'_{\ell_s\ell_t}$ is as follows. For a given $P'_{\ell_s\ell_t}$, we choose $Q'_{\ell_s\ell_t}$ to map $\mathrm{Bern}(\gamma)$ to $\mathrm{Binom}(\ell_s\ell_t, q)$ exactly; however, in order for $Q'_{\ell_s\ell_t}$ to be a well-defined probability distribution, we need to ensure that $Q_{\ell_s\ell_t}(m) \geq \gamma P'_{\ell_s\ell_t}(m)$, which may fail when $m > m_0$. Thus, we set $P'_{\ell_s\ell_t}(m) = Q_{\ell_s\ell_t}(m)/\gamma$ for $m > m_0$. The remaining probability mass $a_{\ell_s\ell_t}$ is added to $P'_{\ell_s\ell_t}(0)$ so that $P'_{\ell_s\ell_t}$ is a well-defined probability distribution.
It is straightforward to verify that $Q'_{\ell_s\ell_t}$ and $P'_{\ell_s\ell_t}$ are well-defined probability distributions, and
$$d_{\mathrm{TV}}\big(P'_{\ell_s\ell_t},\ P_{\ell_s\ell_t}\big) \leq 4\big(8q\ell^2\big)^{m_0+1} \tag{13.36}$$
as long as $\ell_s, \ell_t \leq 2\ell$ and $16q\ell^2 \leq 1$, where $\ell \triangleq N/n$. Then, for $1 \leq s < t \leq n$, the conditional distribution of $E(V_s,V_t)$ given $\ell_s$, $\ell_t$, and $A_{s,t}$ is given by
$$E(V_s,V_t) \sim \begin{cases} P'_{\ell_s\ell_t} & \text{if } A_{st} = 1,\ \ell_s,\ell_t \leq 2\ell,\\ Q'_{\ell_s\ell_t} & \text{if } A_{st} = 0,\ \ell_s,\ell_t \leq 2\ell,\\ Q_{\ell_s\ell_t} & \text{if } \max\{\ell_s,\ell_t\} > 2\ell.\end{cases} \tag{13.37}$$
Next we show that the randomized reduction defined above maps $G(n,\gamma)$ into $G(N,q)$ under the null hypothesis and $G(n,k,\gamma)$ approximately into $G(N,K,p,q)$ under the alternative hypothesis. By construction, $(1-\gamma)Q'_{\ell_s\ell_t} + \gamma P'_{\ell_s\ell_t} = Q_{\ell_s\ell_t} = \mathrm{Binom}(\ell_s\ell_t, q)$, and therefore the null distribution of the PC problem is exactly matched to that of the PDS problem, i.e., $P_{\widetilde{G}|H_0^C} = P_0$. The core of the proof lies in establishing that the alternative distributions are approximately matched. The key observation is that, by (13.36), $P'_{\ell_s\ell_t}$ is close to $P_{\ell_s\ell_t} = \mathrm{Binom}(\ell_s\ell_t, p)$ and thus, for nodes with distinct parents $s \neq t$ in the planted clique, the number of edges $E(V_s,V_t)$ is approximately distributed as the desired $\mathrm{Binom}(\ell_s\ell_t, p)$; for nodes with the same parent $s$ in the planted clique, even though $E(V_s,V_s)$ is distributed as $\mathrm{Binom}\big(\binom{\ell_s}{2}, q\big)$, which is not sufficiently close to the desired $\mathrm{Binom}\big(\binom{\ell_s}{2}, p\big)$, after averaging over the random partition $\{V_s\}$, the total variation distance becomes negligible. More formally, we have the following proposition; the proof is postponed to the next section.
proposition 13.1 Let $\ell, n \in \mathbb{N}$, $k \in [n]$, and $\gamma \in (0, \frac{1}{2}]$. Let $N = \ell n$, $K = k\ell$, $p = 2q$, and $m_0 = \log_2(1/\gamma)$. Assume that $16q\ell^2 \leq 1$ and $k \geq 6e\ell$. If $G \sim G(n,\gamma)$, then $\widetilde{G} \sim G(N,q)$, i.e., $P_{\widetilde{G}|H_0^C} = P_0$. If $G \sim G(n,k,1,\gamma)$, then
$$d_{\mathrm{TV}}\big(P_{\widetilde{G}|H_1^C},\ P_1\big) \leq e^{-K/12} + 1.5\,k e^{-\ell/18} + 2k^2\big(8q\ell^2\big)^{m_0+1} + 0.5\sqrt{e^{72e^2q^2\ell^2} - 1} + 0.5\,k e^{-\ell/36}. \tag{13.38}$$
Reduction Scheme in the Gaussian Case
The same reduction scheme can be tweaked slightly to work for the Gaussian case, which, in fact, needs only the PC hypothesis for $\gamma = \frac{1}{2}$.⁶ In this case, we aim to map an adjacency matrix $A \in \{0,1\}^{n\times n}$ to a symmetric data matrix $\widetilde{A} \in \mathbb{R}^{N\times N}$ with zero diagonal, or, equivalently, a weighted complete graph $\widetilde{G}$.
For any $1 \leq s \leq t \leq n$, we let $E(V_s,V_t)$ denote the average weight of the edges between $V_s$ and $V_t$ in $\widetilde{G}$. As for the Bernoulli model, we will first generate $E(V_s,V_t)$ randomly with a properly chosen conditional probability distribution. Since $E(V_s,V_t)$ is a sufficient statistic for the set of Gaussian edge weights, the specific weight assignment can be generated from the average weight using the same kernel both for the null and for the alternative.
To see how this works, consider a general setup where $X_1,\ldots,X_n \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$. Let $\bar{X} = (1/n)\sum_{i=1}^n X_i$. Then we can simulate $X_1,\ldots,X_n$ on the basis of the sufficient statistic $\bar{X}$ as follows. Let $[v_0, v_1, \ldots, v_{n-1}]$ be an orthonormal basis for $\mathbb{R}^n$, with $v_0 = (1/\sqrt{n})\mathbf{1}$ and
6 The original reduction proof in [9] for the submatrix detection problem crucially relies on the Gaussianity
and the reduction maps a bigger planted-clique instance into a smaller instance for submatrix detection by
means of averaging.
$\mathbf{1} = (1,\ldots,1)^\top$. Generate $Z_1,\ldots,Z_{n-1} \overset{\text{i.i.d.}}{\sim} N(0,1)$. Then $\bar{X}\mathbf{1} + \sum_{i=1}^{n-1} Z_i v_i \sim N(\mu\mathbf{1}, I_n)$. Using this general procedure, we can generate the weights $\widetilde{A}_{V_s,V_t}$ on the basis of $E(V_s,V_t)$.
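The sufficient-statistic simulation just described can be implemented directly; in the sketch below the orthonormal basis is obtained via a QR decomposition, which is an implementation choice not specified in the text.

import numpy as np

rng = np.random.default_rng(0)
n, mu = 6, 1.5                                    # illustrative sample size and mean

# Orthonormal basis [v0, ..., v_{n-1}] of R^n with v0 = (1/sqrt(n)) * 1.
B = np.eye(n)
B[:, 0] = 1.0 / np.sqrt(n)
V, _ = np.linalg.qr(B)                            # columns 1..n-1 are orthogonal to the all-ones vector
V[:, 0] = 1.0 / np.sqrt(n)                        # fix the sign/normalization of v0 explicitly

x_bar = rng.normal(mu, 1.0 / np.sqrt(n))          # sufficient statistic X_bar ~ N(mu, 1/n)
Z = rng.standard_normal(n - 1)
X = x_bar * np.ones(n) + V[:, 1:] @ Z             # X_bar * 1 + sum_i Z_i v_i  ~  N(mu * 1, I_n)

print(X.mean(), x_bar)                            # the reconstruction preserves the sample mean exactly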
It remains to specify, for $1 \leq s \leq t \leq n$, the conditional distribution of $E(V_s,V_t)$ given $\ell_s$, $\ell_t$, and $A_{s,t}$. As for the Bernoulli case, conditioned on $\ell_s$ and $\ell_t$, ideally we would want to find a Markov kernel from $A_{s,t}$ to $E(V_s,V_t)$ which maps $\mathrm{Bern}(1)$ to the desired distribution $N(\mu, 1/(\ell_s\ell_t))$ and $\mathrm{Bern}(1/2)$ to $N(0, 1/(\ell_s\ell_t))$, depending on whether both $s$ and $t$ are in the clique or not, respectively. This objective can be accomplished approximately in terms of the total variation. For $s = t \in [n]$, let $E(V_s,V_t) \sim N(0, 1/(\ell_s\ell_t))$. For $1 \leq s < t \leq n$, denote $P_{\ell_s\ell_t} \triangleq N(\mu, 1/(\ell_s\ell_t))$ and $Q_{\ell_s\ell_t} \triangleq N(0, 1/(\ell_s\ell_t))$, with density functions $p_{\ell_s\ell_t}(x)$ and $q_{\ell_s\ell_t}(x)$, respectively. Fix $\gamma = \frac{1}{2}$. Note that
$$\frac{q_{\ell_s\ell_t}(x)}{p_{\ell_s\ell_t}(x)} = \exp\big(\ell_s\ell_t\,\mu(\mu/2 - x)\big) \geq \gamma$$
asymptotically equivalent to the original model in the sense of Le Cam [4] and hence
preserves the statistical difficulty of the problem. In other words, the continuous model
and its appropriately discretized counterpart are statistically indistinguishable and, more
importantly, the computational complexity of tests on the latter is well defined. More
precisely, for the submatrix detection model, provided that each entry of the n × n matrix
A is quantized by Θ(log n) bits, the discretized model is asymptotically equivalent to the
previous model (cf. Section 3 and Theorem 1 of [9] for a precise bound on the Le Cam
distance). With a slight modification, the above reduction scheme can be applied to the
discretized model (cf. Section 4.2 of [9]).
Second, we comment on the distinctions between the reduction scheme here and the
prior work that relies on the planted clique as the hardness assumption. Most previous
work [63, 64, 68, 81] in the theoretical computer science literature uses the reduction
from the PC problem to generate computationally hard instances of other problems and
establish worst-case hardness results; the underlying distributions of the instances could
be arbitrary. The idea of proving the hardness of a hypothesis-testing problem by means
of approximate reduction from the planted-clique problem such that the reduced instance
is close to the target hypothesis in total variation originates from the seminal work by
Berthet and Rigollet [8] and the subsequent paper by Ma and Wu [9]. The main dis-
tinction between these works and the results presented here, which are based on the
techniques in [10], is that Berthet and Rigollet [8] studied a composite-versus-composite
testing problem and Ma and Wu [9] studied a simple-versus-composite testing problem,
both in the minimax sense, as opposed to the simple-versus-simple hypothesis consid-
ered here and in [10], which constitutes a stronger hardness result. For the composite
hypothesis, a reduction scheme works as long as the distribution of the reduced instance
is close to some mixture distribution under the hypothesis. This freedom is absent in
constructing reduction for the simple hypothesis, which renders the reduction scheme as
well as the corresponding calculation of the total variation considerably more difficult.
In contrast, for the simple-versus-simple hypothesis, the underlying distributions of the
problem instances generated from the reduction must be close to the desired distributions
in total variation both under the null hypothesis and under the alternative hypothesis.
It is straightforward to verify that the null distributions are exactly matched by the
reduction scheme. Henceforth, we consider the alternative hypothesis, under which G is
drawn from the planted-clique model $G(n,k,\gamma)$. Let $C \subset [n]$ denote the planted clique. Define $S = \cup_{t\in C} V_t$ and recall that $K = k\ell$. Then $|S| \sim \mathrm{Binom}(N, K/N)$ and, conditional on $|S|$, $S$ is uniformly distributed over all possible subsets of size $|S|$ in $[N]$. By the symmetry of the vertices of $\widetilde{G}$, the distribution of $\widetilde{A}$ conditional on $C$ does not depend on $C$. Hence, without loss of generality, we shall assume that $C = [k]$ henceforth. The distribution of $\widetilde{A}$ can be written as a mixture distribution indexed by the random set $S$ as
$$\widetilde{P}_1 \triangleq \mathbb{E}_S\Bigg[\widetilde{P}_{SS} \times \bigotimes_{[i,j]\in E(S)^c}\mathrm{Bern}(q)\Bigg].$$
By the definition of $P_1$,
$$d_{\mathrm{TV}}(\widetilde{P}_1, P_1) = d_{\mathrm{TV}}\Bigg(\mathbb{E}_S\Bigg[\widetilde{P}_{SS}\times\bigotimes_{[i,j]\in E(S)^c}\mathrm{Bern}(q)\Bigg],\ \mathbb{E}_S\Bigg[\bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\ \bigotimes_{[i,j]\in E(S)^c}\mathrm{Bern}(q)\Bigg]\Bigg)$$
$$\leq \mathbb{E}_S\Bigg[d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS}\times\bigotimes_{[i,j]\in E(S)^c}\mathrm{Bern}(q),\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\ \bigotimes_{[i,j]\in E(S)^c}\mathrm{Bern}(q)\Bigg)\Bigg]$$
$$= \mathbb{E}_S\Bigg[d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\Bigg)\Bigg]$$
$$\leq \mathbb{E}_S\Bigg[d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\Bigg)\mathbf{1}_{\{|S|\leq 1.5K\}}\Bigg] + \exp(-K/12), \tag{13.41}$$
where the first inequality follows from the convexity of $(P,Q)\mapsto d_{\mathrm{TV}}(P,Q)$, and the last inequality follows from applying the Chernoff bound to $|S|$. Fix an $S\subset[N]$ such that $|S|\leq 1.5K$. Define $P_{V_tV_t} \triangleq \bigotimes_{[i,j]\in E(V_t)}\mathrm{Bern}(q)$ for $t\in[k]$ and $P_{V_sV_t} \triangleq \bigotimes_{(i,j)\in V_s\times V_t}\mathrm{Bern}(p)$ for $1\leq s<t\leq k$. By the triangle inequality,
$$d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\Bigg) \leq d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\,\Bigg|\,S\Bigg]\Bigg) + d_{\mathrm{TV}}\Bigg(\mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\,\Bigg|\,S\Bigg],\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\Bigg). \tag{13.42}$$
To bound the first term on the right-hand side of (13.42), first note that, conditioned on the set $S$, $\{V_1^k\}$ can be generated as follows. Throw balls indexed by $S$ into bins indexed by $[k]$ independently and uniformly at random, and let $V_t$ be the set of balls in the $t$th bin. Define the event $\mathcal{E} = \{V_1^k : |V_t| \leq 2\ell,\ t \in [k]\}$. Since $|V_t| \sim \mathrm{Binom}(|S|, 1/k)$ is stochastically dominated by $\mathrm{Binom}(1.5K, 1/k)$ for each fixed $1 \leq t \leq k$, it follows from the Chernoff bound and the union bound that $P\{\mathcal{E}^c\} \leq k\exp(-\ell/18)$. Then we have
$$d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\,\Bigg|\,S\Bigg]\Bigg) \overset{(a)}{=} d_{\mathrm{TV}}\Bigg(\mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}\widetilde{P}_{V_sV_t}\,\Bigg|\,S\Bigg],\ \mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\,\Bigg|\,S\Bigg]\Bigg)$$
$$\leq \mathbb{E}_{V_1^k}\Bigg[d_{\mathrm{TV}}\Bigg(\prod_{1\leq s\leq t\leq k}\widetilde{P}_{V_sV_t},\ \prod_{1\leq s\leq t\leq k}P_{V_sV_t}\Bigg)\,\Bigg|\,S\Bigg]$$
$$\leq \mathbb{E}_{V_1^k}\Bigg[d_{\mathrm{TV}}\Bigg(\prod_{1\leq s\leq t\leq k}\widetilde{P}_{V_sV_t},\ \prod_{1\leq s\leq t\leq k}P_{V_sV_t}\Bigg)\mathbf{1}_{\{V_1^k\in\mathcal{E}\}}\,\Bigg|\,S\Bigg] + k\exp(-\ell/18),$$
where (a) holds because, conditional on $V_1^k$, $\{\widetilde{A}_{V_sV_t} : s,t\in[k]\}$ are independent, and $\widetilde{P}_{V_sV_t}$ denotes the conditional distribution of $\widetilde{A}_{V_sV_t}$ given $V_1^k$. Recall that $\ell_t = |V_t|$. For any fixed $V_1^k\in\mathcal{E}$, we have
$$d_{\mathrm{TV}}\Bigg(\prod_{1\leq s\leq t\leq k}\widetilde{P}_{V_sV_t},\ \prod_{1\leq s\leq t\leq k}P_{V_sV_t}\Bigg) \overset{(a)}{=} d_{\mathrm{TV}}\Bigg(\prod_{1\leq s<t\leq k}\widetilde{P}_{V_sV_t},\ \prod_{1\leq s<t\leq k}P_{V_sV_t}\Bigg)$$
$$\overset{(b)}{=} d_{\mathrm{TV}}\Bigg(\prod_{1\leq s<t\leq k}P'_{\ell_s\ell_t},\ \prod_{1\leq s<t\leq k}P_{\ell_s\ell_t}\Bigg) \leq \sum_{1\leq s<t\leq k} d_{\mathrm{TV}}\big(P'_{\ell_s\ell_t},\ P_{\ell_s\ell_t}\big) \overset{(c)}{\leq} 2k^2\big(8q\ell^2\big)^{m_0+1},$$
where (a) follows since $\widetilde{P}_{V_tV_t} = P_{V_tV_t}$ for all $t\in[k]$; (b) is because the number of edges $E(V_s,V_t)$ is a sufficient statistic for testing $\widetilde{P}_{V_sV_t}$ versus $P_{V_sV_t}$ on the submatrix $\widetilde{A}_{V_sV_t}$ of the adjacency matrix; and (c) follows from the total variation bound (13.36).
Therefore,
$$d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\,\Bigg|\,S\Bigg]\Bigg) \leq 2k^2\big(8q\ell^2\big)^{m_0+1} + k\exp(-\ell/18). \tag{13.43}$$
To bound the second term on the right-hand side of (13.42), applying Lemma 9 of [10], which is a conditional version of the second-moment method, yields
$$d_{\mathrm{TV}}\Bigg(\mathbb{E}_{V_1^k}\Bigg[\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\,\Bigg|\,S\Bigg],\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\Bigg) \leq \frac{1}{2}P\{\mathcal{E}^c\} + \frac{1}{2}\sqrt{\mathbb{E}_{V_1^k;\widetilde{V}_1^k}\Big[g\big(V_1^k,\widetilde{V}_1^k\big)\mathbf{1}_{\{V_1^k\in\mathcal{E}\}}\mathbf{1}_{\{\widetilde{V}_1^k\in\mathcal{E}\}}\,\Big|\,S\Big] - 1} + 2P\{\mathcal{E}^c\}, \tag{13.44}$$
where $\widetilde{V}_1^k$ denotes an independent copy of $V_1^k$ and
$$g\big(V_1^k,\widetilde{V}_1^k\big) = \int\frac{\prod_{1\leq s\leq t\leq k}P_{V_sV_t}\cdot\prod_{1\leq s\leq t\leq k}P_{\widetilde{V}_s\widetilde{V}_t}}{\bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)} = \prod_{s,t=1}^k\Bigg(\frac{q^2}{p} + \frac{(1-q)^2}{1-p}\Bigg)^{\binom{|V_s\cap\widetilde{V}_t|}{2}} = \prod_{s,t=1}^k\Bigg(\frac{1-\frac{3}{2}q}{1-2q}\Bigg)^{\binom{|V_s\cap\widetilde{V}_t|}{2}}. \tag{13.45}$$
where (a) follows from $1 + x \leq e^x$ for all $x \geq 0$ and $q < 1/4$; (b) follows from the negative-association property of $\{|V_s\cap\widetilde{V}_t| : s,t\in[k]\}$ proved in Lemma 10 of [10], in view of the monotonicity of $x\mapsto e^{q\binom{x\wedge 2\ell}{2}}$ on $\mathbb{R}_+$; (c) follows because $|V_s\cap\widetilde{V}_t|$ is stochastically dominated by $\mathrm{Binom}(1.5K, 1/k^2)$ for all $(s,t)\in[k]^2$; (d) follows from Lemma 11 of [10]; and (e) follows from Lemma 12 of [10] with $\lambda = q/2$ and $q \leq 1/8$. Therefore, by (13.44),
$$d_{\mathrm{TV}}\Bigg(\widetilde{P}_{SS},\ \bigotimes_{[i,j]\in E(S)}\mathrm{Bern}(p)\Bigg) \leq 0.5\,ke^{-\ell/18} + 0.5\sqrt{e^{72e^2q^2\ell^2} - 1} + 2ke^{-\ell/18}$$
$$\leq 0.5\,ke^{-\ell/18} + 0.5\sqrt{e^{72e^2q^2\ell^2} - 1} + 0.5\,ke^{-\ell/36}. \tag{13.47}$$
The following theorem establishes the computational hardness of the PDS problem in
the interior of the hard region in Fig. 13.1.
theorem 13.3 Assume the PC hypothesis holds for all $0 < \gamma \leq 1/2$. Let $\alpha > 0$ and $0 < \beta < 1$ be such that
$$\beta^\sharp \triangleq \max\{0,\ 2\alpha - 1\} < \beta < \frac{\alpha}{2}. \tag{13.48}$$
Then there exists a sequence $\{(N_\ell, K_\ell, q_\ell)\}_{\ell\in\mathbb{N}}$ satisfying
$$\lim_{\ell\to\infty}\frac{\log(1/q_\ell)}{\log N_\ell} = 2\beta, \qquad \lim_{\ell\to\infty}\frac{\log K_\ell}{\log N_\ell} = \alpha$$
such that, for any sequence of randomized polynomial-time tests $\phi_\ell : \{0,1\}^{\binom{N_\ell}{2}} \to \{0,1\}$ for the $\mathrm{PDS}(N_\ell, K_\ell, 2q_\ell, q_\ell)$ problem, the type-I-plus-type-II error probability is lower-bounded by
$$\liminf_{\ell\to\infty}\ \big(P_0\{\phi_\ell(G') = 1\} + P_1\{\phi_\ell(G') = 0\}\big) \geq 1,$$
Then
$$\lim_{\ell\to\infty}\frac{\log(1/q_\ell)}{\log N_\ell} = \frac{2+\delta}{(2+\delta)/(2\beta) - 1 + 1} = 2\beta, \qquad \lim_{\ell\to\infty}\frac{\log K_\ell}{\log N_\ell} = \frac{(2+\delta)\alpha/(2\beta) - 1 + 1}{(2+\delta)/(2\beta) - 1 + 1} = \alpha. \tag{13.52}$$
Suppose, for the sake of contradiction, that there exist a small $\epsilon > 0$ and a sequence of randomized polynomial-time tests $\{\phi_\ell\}$ for $\mathrm{PDS}(N_\ell, K_\ell, 2q_\ell, q_\ell)$ such that
$$P_0\{\phi_{N_\ell,K_\ell}(G') = 1\} + P_1\{\phi_{N_\ell,K_\ell}(G') = 0\} \leq 1 - \epsilon$$
holds for arbitrarily large $\ell$, where $G'$ is the graph in the $\mathrm{PDS}(N_\ell, K_\ell, 2q_\ell, q_\ell)$ problem. Since $\alpha > 2\beta$, we have $k_\ell \geq \ell^{1+\delta}$. Therefore, $16 q_\ell \ell^2 \leq 1$ and $k_\ell \geq 6e\ell$ for all sufficiently large $\ell$. Applying Proposition 13.1, we conclude that $G \mapsto \phi_\ell(\widetilde{G})$ is a randomized polynomial-time test for $\mathrm{PC}(n_\ell, k_\ell, \gamma)$ whose type-I-plus-type-II error probability satisfies
$$P_{H_0^C}\{\phi_\ell(\widetilde{G}) = 1\} + P_{H_1^C}\{\phi_\ell(\widetilde{G}) = 0\} \leq 1 - \epsilon + \xi, \tag{13.53}$$
where the above inequality follows from (13.50). Therefore, (13.53) contradicts the assumption that the PC hypothesis holds for $\gamma$.
Recent years have witnessed a great deal of progress on understanding the information-
theoretical and computational limits of various statistical problems with planted struc-
tures. As outlined in this survey, various techniques to identify the information-theoretic
limits are available. In some cases, polynomial-time procedures have been shown to
achieve the information-theoretic limits. However, in many other cases, it is believed
that there exists a wide gap between the information-theoretic limits and the compu-
tational limits. For the planted-clique problem, a recent exciting line of research has
identified the performance limits of a sum-of-squares hierarchy [71–73, 82, 83]. Under
the PC hypothesis, complexity-theoretic computational lower bounds have been derived
for sparse PCA [8], submatrix location [9], single-community detection [10], and var-
ious other detection problems with planted structures [13]. Despite these encouraging
results, a variety of interesting questions remain open. Below we list a few representative
problems. Closing the observed computational gap, or, equally importantly, disprov-
ing the possibility thereof on rigorous complexity-theoretic grounds, is an exciting new
topic at the intersection of high-dimensional statistics, information theory, and computer
science.
Computational Lower Bounds for Recovering the Planted Dense Subgraph
Closely related to the PDS detection problem is the recovery problem, where, given a
graph generated from G(N, K, p, q), the task is to recover the planted dense subgraph.
Consider the asymptotic regime depicted in Fig. 13.1. It has been shown in [16, 84] that
exact recovery is information-theoretically possible if and only if β < α/2 and can be achieved in polynomial time if β < α − 1/2. Our computational lower bounds for the PDS detection problem imply that the planted dense subgraph is hard to approximate to any constant factor if max(0, 2α − 1) < β < α/2 (the hard regime in Fig. 13.1). Whether the planted dense subgraph is hard to approximate to any constant factor in the regime α − 1/2 ≤ β ≤ min{2α − 1, α/2} is an interesting open problem. For the Gaussian case, Cai et al. [23] showed that exact recovery is computationally hard for β > α − 1/2, by assuming a variant of the standard PC hypothesis (see p. 1425 of [23]).
Finally, we note that in order to prove our computational lower bounds for the planted
dense subgraph detection problem in Theorem 13.3, we have assumed that the PC detec-
tion problem is hard for any constant γ > 0. An important open problem is to show by
means of reduction that, if the PC detection problem is hard with γ = 0.5, then it is also
hard with γ = 0.49.
Computational Lower Bounds within the Sum-of-Squares Hierarchy
For the single-community model, Hajek et al. [56] obtained a tight characterization of
the performance limits of semidefinite programming (SDP) relaxations, corresponding
to the sum-of-squares hierarchy with degree 2. In particular, (1) if K = ω(n/log n), SDP
attains the information-theoretic threshold with sharp constants; (2) if K = Θ(n/log n),
Estimating Graphons
The graphon is a powerful model for studying large networks [85]. Concretely, given $n$ vertices, the edges are generated independently, connecting each pair of distinct vertices $i$ and $j$ with probability $M_{ij}=f(x_i,x_j)$, where $x_i\in[0,1]$ is the latent feature of vertex $i$ that captures its various characteristics, and $f:[0,1]\times[0,1]\to[0,1]$ is a symmetric function called a graphon. The problem of interest is to estimate either the edge probability matrix $M$ or the graphon $f$ on the basis of the observed graph.
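To make the generative model concrete, here is a small illustrative Python sketch (ours, not from [85–88]; the graphon $f$, the block count, and the degree-based clustering are arbitrary choices): it samples a graph from a graphon and forms a crude $k$-block estimate of the edge-probability matrix $M$.

```python
import numpy as np

def sample_graphon_graph(f, n, rng):
    """Sample x_i ~ Unif[0,1], set M_ij = f(x_i, x_j), and draw A_ij ~ Bern(M_ij)."""
    x = rng.uniform(size=n)
    M = f(x[:, None], x[None, :])           # edge-probability matrix
    U = rng.uniform(size=(n, n))
    A = (np.triu(U, 1) < np.triu(M, 1)).astype(int)
    A = A + A.T                              # symmetric adjacency matrix, zero diagonal
    return x, M, A

def block_estimate(A, k):
    """Crude k-block estimate of M: group vertices by degree quantiles and
    average the adjacency matrix within each pair of blocks."""
    n = A.shape[0]
    order = np.argsort(A.sum(axis=1))
    blocks = np.array_split(order, k)
    M_hat = np.zeros((n, n))
    for a in range(k):
        for b in range(k):
            sub = A[np.ix_(blocks[a], blocks[b])]
            M_hat[np.ix_(blocks[a], blocks[b])] = sub.mean()
    np.fill_diagonal(M_hat, 0.0)
    return M_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = lambda x, y: 0.2 + 0.6 * x * y       # a smooth symmetric graphon
    x, M, A = sample_graphon_graph(f, n=500, rng=rng)
    M_hat = block_estimate(A, k=8)
    print(f"mean squared error of the block estimate: {np.mean((M_hat - M) ** 2):.4f}")
```

The bullet points below summarize how accurately $M$ and $f$ can be estimated in the minimax sense and in polynomial time.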
• When $f$ is a step function, which corresponds to the stochastic block model with $k$ blocks for some $k$, the minimax optimal estimation error rate is shown to be on the order of $k^2/n^2+\log k/n$ [86], while the currently best error rate achievable in polynomial time is $k/n$ [87].
• When $f$ belongs to a Hölder or Sobolev space with smoothness index $\alpha$, the minimax optimal rate is shown to be $n^{-2\alpha/(\alpha+1)}$ for $\alpha<1$ and $\log n/n$ for $\alpha>1$ [86], while the best error rate achievable in polynomial time that is known in the literature is $n^{-2\alpha/(2\alpha+1)}$ [88].
For both cases, it remains to be determined whether the minimax optimal rate can be
achieved in polynomial time.
Sparse PCA
Consider the following spiked Wigner model, where the underlying signal is a rank-one
matrix:
$$
X=\frac{\lambda}{\sqrt{n}}\,vv^{\mathsf T}+W. \qquad (13.54)
$$
Here, $v\in\mathbb{R}^n$, $\lambda>0$, and $W\in\mathbb{R}^{n\times n}$ is a Wigner random matrix with $W_{ii}\overset{\text{i.i.d.}}{\sim}N(0,2)$ and $W_{ij}=W_{ji}\overset{\text{i.i.d.}}{\sim}N(0,1)$ for $i<j$. We assume that, for some $\gamma\in[0,1]$, the support of $v$ is drawn uniformly from all $\binom{n}{\gamma n}$ subsets $S\subset[n]$ with $|S|=\gamma n$. Once the support has been chosen, each non-zero component $v_i$ is drawn independently and uniformly from $\{\pm\gamma^{-1/2}\}$, so that $\|v\|_2^2=n$. When $\gamma$ is small, the data matrix $X$ is a sparse, rank-one matrix contaminated by Gaussian noise. For detection, we also consider a null model with $\lambda=0$, where $X=W$.
One natural approach for this problem is PCA: that is, diagonalize $X$ and use its leading eigenvector $\widehat v$ as an estimate of $v$. Using the theory of random matrices with rank-one
perturbations [27–29], both detection and correlated recovery of v are possible if and
only if λ > 1. Intuitively, PCA exploits only the low-rank structure of the underlying
signal, and not the sparsity of v; it is natural to ask whether one can succeed in detection
or reconstruction for some λ < 1 by taking advantage of this additional structure.
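The following numerical sketch (a hypothetical illustration of the spectral phenomenon described above; the parameters n, γ, and λ are arbitrary) generates the spiked Wigner matrix (13.54) with a sparse v and reports the top eigenvalue and the overlap of the leading eigenvector with v below and above λ = 1.

```python
import numpy as np

def spiked_wigner(n, lam, gamma, rng):
    """X = (lam/sqrt(n)) v v^T + W with sparse v, ||v||_2^2 = n, and Wigner noise W."""
    k = max(1, int(round(gamma * n)))
    v = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    v[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(gamma)
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2.0)             # off-diagonal N(0,1), diagonal N(0,2)
    return (lam / np.sqrt(n)) * np.outer(v, v) + W, v

def pca_summary(X, v):
    n = X.shape[0]
    evals, evecs = np.linalg.eigh(X / np.sqrt(n))   # rescale so the bulk edge is ~2
    return evals[-1], abs(evecs[:, -1] @ v) / np.linalg.norm(v)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, gamma = 2000, 0.05
    for lam in [0.5, 1.5]:
        X, v = spiked_wigner(n, lam, gamma, rng)
        top, corr = pca_summary(X, v)
        # Below lam = 1 the top eigenvalue sticks to the bulk edge ~2 and the
        # overlap is negligible; above it, the eigenvalue pops out of the bulk.
        print(f"lambda={lam}: top eigenvalue {top:.2f}, overlap {corr:.2f}")
```

Note that the sparsity of v plays no role in this spectral computation, which is exactly the point made in the text: PCA exploits only the low-rank structure.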
Through analysis of an approximate message-passing algorithm and the free energy, it
has been conjectured [46, 89] that there exists a critical sparsity threshold γ∗ ∈ (0, 1)
such that, if γ ≥ γ∗ , then both the information-theoretic threshold and the computational
threshold are given by λ = 1; if γ < γ∗ , then the computational threshold is given by
λ = 1, but the information-theoretic threshold for λ is strictly smaller. A recent series of
papers has identified the sharp information-theoretic threshold for correlated recovery
through the Guerra interpolation technique and the cavity method [46, 49, 50, 90]. Also,
the sharp information-theoretic threshold for detection has recently been determined
in [25]. However, there is no rigorous evidence justifying the claim that λ = 1 is the computational threshold.
Tensor PCA
We can also consider a planted tensor model, in which we observe an order-k tensor
X = λv⊗k + W, (13.55)
where v is uniformly distributed over the unit sphere in Rn
and W ∈ (Rn )⊗k
is a totally
symmetric noise tensor with Gaussian entries N(0, 1/n) (see Section 3.1 of [91] for
a precise definition). This model is known as the p-spin model in statistical physics,
and is widely used in machine learning and data analysis to model high-order correla-
tions in a=dataset.
> A natural approach is tensor PCA, which coincides with the MLE:
⊗k
min u 2 =1 X, u . When k = 2, this reduces to standard PCA, which can be efficiently
computed by singular value decomposition; however, as soon as k ≥ 3, tensor PCA
becomes NP-hard in the worst case [92].
Previous work [24, 91, 93] has shown that tensor PCA achieves consistent estimation of $v$ if $\lambda\gtrsim\sqrt{k\log k}$, while this is information-theoretically impossible if $\lambda\lesssim\sqrt{k\log k}$. The exact location of the information-theoretic threshold for any $k$ was determined recently in [94], but all known polynomial-time algorithms fail far from this threshold. A "tensor unfolding" algorithm is shown in [93] to succeed if $\lambda\gtrsim n^{(k/2-1)/2}$. In the special case $k=3$, it is further shown in [95] that a degree-4 sum-of-squares relaxation succeeds if $\lambda=\omega\big((n\log n)^{1/4}\big)$ and fails if $\lambda=O\big((n/\log n)^{1/4}\big)$. More recent work [96] shows that a spectral method achieves consistent estimation provided that $\lambda=\Omega(n^{1/4})$, improving the positive result in [95] by a poly-logarithmic factor. It remains to be determined whether any polynomial-time algorithm succeeds in the regime $1\ll\lambda\ll n^{1/4}$. Under a hypergraph version of the planted-clique detection hypothesis, it is shown in [96] that no polynomial-time algorithm can succeed when $\lambda\le n^{1/4-\epsilon}$ for an arbitrarily small constant $\epsilon>0$. It remains to be determined whether the usual planted-clique problem can be reduced to the hypergraph version.
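As an illustration of the unfolding idea mentioned above (a sketch under our own simplifying assumptions: k = 3, non-symmetrized Gaussian noise, and power iteration in place of a full SVD), the following snippet matricizes the observed tensor into an n × n² matrix and uses its top left singular vector as the estimate of v.

```python
import numpy as np

def spiked_tensor(n, lam, rng):
    """X = lam * v^{(x)3} + W with v uniform on the unit sphere and
    i.i.d. N(0, 1/n) noise entries (not symmetrized, for simplicity)."""
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    signal = lam * np.einsum("i,j,k->ijk", v, v, v)
    W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n, n))
    return signal + W, v

def unfold_estimate(X):
    """Tensor unfolding: top left singular vector of the n x n^2 matricization."""
    n = X.shape[0]
    M = X.reshape(n, n * n)
    u = np.random.default_rng(0).normal(size=n)
    for _ in range(50):                      # power iteration on M M^T
        u = M @ (M.T @ u)
        u /= np.linalg.norm(u)
    return u

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 60
    for lam in [0.5, 10.0]:
        X, v = spiked_tensor(n, lam, rng)
        v_hat = unfold_estimate(X)
        print(f"lambda={lam}: |<v_hat, v>| = {abs(v_hat @ v):.2f}")
```

In this toy run the overlap is essentially zero for the small value of λ and close to one for the large value, consistent with the n^{1/4}-type scaling of the unfolding threshold discussed above.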
We consider a general setup. Let the number of communities $k$ be a constant. Denote the membership vector by $\sigma=(\sigma_1,\ldots,\sigma_n)\in[k]^n$ and the observation by $A=(A_{ij}:1\le i<j\le n)$. Assume the following conditions.
A1 For any permutation $\pi\in S_k$, $(\sigma,A)$ and $(\pi(\sigma),A)$ are equal in law, where $\pi(\sigma)\triangleq(\pi(\sigma_1),\ldots,\pi(\sigma_n))$.
A2 For any $i\neq j\in[n]$, $I(\sigma_i,\sigma_j;A)=I(\sigma_1,\sigma_2;A)$.
A3 For any $z_1,z_2\in[k]$, $P\{\sigma_1=z_1,\sigma_2=z_2\}=1/k^2+o(1)$ as $n\to\infty$.
These assumptions are satisfied, for example, by the $k$-community SBM (where each pair of vertices $i$ and $j$ is connected independently with probability $p$ if $\sigma_i=\sigma_j$ and $q$ otherwise), in which the membership vector $\sigma$ can be uniformly distributed either on $[k]^n$ or on the set of equal-sized $k$-partitions of $[n]$.
Recall that correlated recovery entails the following. For any $\sigma,\widehat\sigma\in[k]^n$, define the overlap
$$
o(\sigma,\widehat\sigma)\triangleq\max_{\pi\in S_k}\frac{1}{n}\sum_{i\in[n]}\Big(\mathbf{1}_{\{\pi(\widehat\sigma_i)=\sigma_i\}}-\frac{1}{k}\Big). \qquad (13.57)
$$
We say that an estimator $\widehat\sigma=\widehat\sigma(A)$ achieves correlated recovery if
$$
\mathbb{E}\big[o(\sigma,\widehat\sigma)\big]=\Omega(1), \qquad (13.58)
$$
that is, the misclassification rate, up to a global permutation, outperforms random guessing. Under the above three assumptions, we have the following characterization of correlated recovery.
lemma A13.1 Correlated recovery is possible if and only if I(σ1 , σ2 ; A) = Ω(1).
Proof We start by recalling the relation between the mutual information and the total variation. For any pair of random variables $(X,Y)$, define the so-called $T$-information [99]: $T(X;Y)\triangleq d_{\mathrm{TV}}(P_{XY},P_XP_Y)=\mathbb{E}\big[d_{\mathrm{TV}}(P_{Y|X},P_Y)\big]$. For $X\sim\mathrm{Bern}(p)$, this simply reduces to
$$
T(X;Y)=2p(1-p)\,d_{\mathrm{TV}}(P_{Y|X=0},P_{Y|X=1}). \qquad (13.59)
$$
Furthermore, the mutual information can be bounded by the $T$-information, by Pinsker's and Fano's inequalities, as follows (from Equation (84) and Proposition 12 of [100]):
$$
2\,T(X;Y)^2\le I(X;Y)\le\log(M-1)\,T(X;Y)+h(T(X;Y)), \qquad (13.60)
$$
where in the upper bound $M$ is the number of possible values of $X$, and $h$ is the binary entropy function in (13.34).
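A quick numerical sanity check of (13.59) and (13.60) (our own toy example, in nats): take X ~ Bern(p) observed through a binary symmetric channel and compare T(X;Y), 2T(X;Y)², I(X;Y), and h(T(X;Y)).

```python
import numpy as np

def t_and_i(p, delta):
    """X ~ Bern(p), Y = X flipped with probability delta (a binary symmetric channel)."""
    P = np.array([[(1 - p) * (1 - delta), (1 - p) * delta],
                  [p * delta,             p * (1 - delta)]])   # joint PMF P_XY
    Px = P.sum(axis=1, keepdims=True)
    Py = P.sum(axis=0, keepdims=True)
    T = 0.5 * np.abs(P - Px * Py).sum()                # T(X;Y) = dTV(P_XY, P_X P_Y)
    I = np.sum(P * np.log(P / (Px * Py)))              # mutual information in nats
    dtv_cond = 0.5 * np.abs(P[0] / P[0].sum() - P[1] / P[1].sum()).sum()
    h = lambda t: -t * np.log(t) - (1 - t) * np.log(1 - t)
    return T, I, 2 * p * (1 - p) * dtv_cond, h(T)

if __name__ == "__main__":
    T, I, T_alt, hT = t_and_i(p=0.5, delta=0.1)
    print(f"T(X;Y) = {T:.4f}   (via (13.59): {T_alt:.4f})")
    print(f"lower bound 2T^2 = {2*T*T:.4f} <= I = {I:.4f} <= h(T) = {hT:.4f}   (M = 2)")
```

For p = 0.5 and delta = 0.1 this prints T = 0.4 with 0.32 <= 0.368 <= 0.673, so both inequalities in (13.60) hold with room to spare.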
We prove the “if” part. Suppose $I(\sigma_1,\sigma_2;A)=\Omega(1)$. We first claim that assumption A1 implies that
$$
I(\mathbf{1}_{\{\sigma_1=\sigma_2\}};A)=I(\sigma_1,\sigma_2;A), \qquad (13.61)
$$
that is, $A$ is independent of $(\sigma_1,\sigma_2)$ conditional on $\mathbf{1}_{\{\sigma_1=\sigma_2\}}$. Indeed, for any $z\neq z'\in[k]$, let $\pi$ be any permutation such that $\pi(z')=z$. Since $P_{\sigma,A}=P_{\pi(\sigma),A}$, we have $P_{A|\sigma_1=z,\sigma_2=z}=P_{A|\pi(\sigma_1)=z,\pi(\sigma_2)=z}$, i.e., $P_{A|\sigma_1=z,\sigma_2=z}=P_{A|\sigma_1=z',\sigma_2=z'}$. Similarly, one can show that $P_{A|\sigma_1=z_1,\sigma_2=z_2}=P_{A|\sigma_1=z_1',\sigma_2=z_2'}$ for any $z_1\neq z_2$ and $z_1'\neq z_2'$, and this proves the claim.
Let $x_j=\mathbf{1}_{\{\sigma_1=\sigma_j\}}$. By the symmetry assumption A2, $I(x_j;A)=I(x_2;A)=\Omega(1)$ for all $j\neq1$. Since $P\{x_j=1\}=1/k+o(1)$ by assumption A3, applying (13.60) with $M=2$ and in view of (13.59), we have $d_{\mathrm{TV}}(P_{A|x_j=0},P_{A|x_j=1})=\Omega(1)$. Thus, there exists an estimator $\widehat x_j\in\{0,1\}$ as a function of $A$ such that
$$
P\{\widehat x_j=1\mid x_j=1\}+P\{\widehat x_j=0\mid x_j=0\}\ge1+d_{\mathrm{TV}}(P_{A|x_j=0},P_{A|x_j=1})=1+\Omega(1). \qquad (13.62)
$$
and
$$
P\{\pi(\widehat\sigma_j)=\sigma_j,\ \widehat x_j=0\}=P\{\pi(\widehat\sigma_j)=\sigma_j,\ \widehat x_j=0,\ x_j=0\}=\frac{1}{k-1}\,P\{\widehat x_j=0,\ x_j=0\},
$$
where the last step is because, conditional on $\widehat x_j=0$, $\widehat\sigma_j$ is chosen from $\{2,\ldots,k\}$ uniformly and independently of everything else. Since $P\{x_j=1\}=1/k+o(1)$, we have
$$
P\{\pi(\widehat\sigma_j)=\sigma_j\}=\frac{1}{k}\Big(P\{\widehat x_j=1\mid x_j=1\}+P\{\widehat x_j=0\mid x_j=0\}\Big)+o(1)\overset{(13.62)}{\ge}\frac{1}{k}+\Omega(1). \qquad (13.63)
$$
By (13.63), we conclude that $\widehat\sigma$ achieves correlated recovery of $\sigma$.
Next we prove the “only if” part. Suppose $I(\sigma_1,\sigma_2;A)=o(1)$; we aim to show that $\mathbb{E}\big[o(\sigma,\widehat\sigma)\big]=o(1)$ for any estimator $\widehat\sigma$. By the definition of the overlap, we have
$$
o(\sigma,\widehat\sigma)\le\frac{1}{n}\sum_{\pi\in S_k}\Bigg|\sum_{i\in[n]}\Big(\mathbf{1}_{\{\pi(\widehat\sigma_i)=\sigma_i\}}-\frac{1}{k}\Big)\Bigg|.
$$
Since there are $k!=O(1)$ permutations in $S_k$, it suffices to show that, for any fixed permutation $\pi$,
$$
\mathbb{E}\Bigg[\bigg|\sum_{i\in[n]}\Big(\mathbf{1}_{\{\pi(\widehat\sigma_i)=\sigma_i\}}-\frac{1}{k}\Big)\bigg|\Bigg]=o(n).
$$
Since $I(\pi(\sigma_i),\pi(\sigma_j);A)=I(\sigma_i,\sigma_j;A)$, without loss of generality we assume $\pi=\mathrm{id}$ in the following. By the Cauchy–Schwarz inequality, it further suffices to show
$$
\mathbb{E}\Bigg[\bigg(\sum_{i\in[n]}\Big(\mathbf{1}_{\{\widehat\sigma_i=\sigma_i\}}-\frac{1}{k}\Big)\bigg)^2\Bigg]=o(n^2). \qquad (13.64)
$$
Note that
$$
\mathbb{E}\Bigg[\bigg(\sum_{i\in[n]}\Big(\mathbf{1}_{\{\widehat\sigma_i=\sigma_i\}}-\frac{1}{k}\Big)\bigg)^2\Bigg]
=\sum_{i,j\in[n]}\mathbb{E}\bigg[\Big(\mathbf{1}_{\{\widehat\sigma_i=\sigma_i\}}-\frac{1}{k}\Big)\Big(\mathbf{1}_{\{\widehat\sigma_j=\sigma_j\}}-\frac{1}{k}\Big)\bigg]
=\sum_{i,j\in[n]}P\{\widehat\sigma_i=\sigma_i,\ \widehat\sigma_j=\sigma_j\}-\frac{2n}{k}\sum_{i\in[n]}P\{\widehat\sigma_i=\sigma_i\}+\frac{n^2}{k^2}.
$$
For the first term in the last displayed equation, let $\widetilde\sigma$ be identically distributed as $\sigma$ but independent of $\widehat\sigma$. Since $I(\sigma_i,\sigma_j;\widehat\sigma_i,\widehat\sigma_j)\le I(\sigma_i,\sigma_j;A)=o(1)$ by the data-processing inequality, it follows from the lower bound in (13.60) that $d_{\mathrm{TV}}\big(P_{\sigma_i,\sigma_j,\widehat\sigma_i,\widehat\sigma_j},P_{\widetilde\sigma_i,\widetilde\sigma_j,\widehat\sigma_i,\widehat\sigma_j}\big)=o(1)$. Since $P\{\widetilde\sigma_i=\widehat\sigma_i,\ \widetilde\sigma_j=\widehat\sigma_j\}\le\max_{a,b\in[k]}P\{\widetilde\sigma_i=a,\ \widetilde\sigma_j=b\}\le1/k^2+o(1)$ by assumption A3, we have
$$
P\{\widehat\sigma_i=\sigma_i,\ \widehat\sigma_j=\sigma_j\}\le\frac{1}{k^2}+o(1).
$$
Similarly, for the second term, we have
$$
P\{\widehat\sigma_i=\sigma_i\}=\frac{1}{k}+o(1),
$$
where the last equality holds due to $I(\sigma_i;A)=o(1)$. Combining the last three displayed equations gives (13.64) and completes the proof.
Combining (13.61) with (13.60) and (13.59), we have $I(\sigma_1,\sigma_2;A)=o(1)$ if and only if $d_{\mathrm{TV}}(P_+,P_-)=o(1)$, where $P_+=P_{A|\sigma_1=\sigma_2}$ and $P_-=P_{A|\sigma_1\neq\sigma_2}$. Note the following characterization of the total variation distance, which follows simply from the Cauchy–Schwarz inequality:
$$
d_{\mathrm{TV}}(P_+,P_-)=\frac{1}{2}\inf_{Q}\sqrt{\int\frac{(P_+-P_-)^2}{Q}}, \qquad (13.65)
$$
where the infimum is taken over all probability distributions $Q$. Therefore (13.7) implies (13.6).
Finally, we consider the binary symmetric SBM and show that, below the correlated recovery threshold $\tau=(a-b)^2/(2(a+b))<1$, (13.7) is satisfied if the reference distribution $Q$ is the distribution of $A$ in the null (Erdős–Rényi) model. Note that
$$
\frac{(P_+-P_-)^2}{Q}=\frac{P_+^2}{Q}+\frac{P_-^2}{Q}-2\,\frac{P_+P_-}{Q}.
$$
Hence, it is sufficient to show
$$
\int\frac{P_zP_{z'}}{Q}=C+o(1),\qquad\forall z,z'\in\{\pm\},
$$
for some constant $C$ that is independent of $z$ and $z'$. Specifically, following the derivations in (13.4), we have
$$
\int\frac{P_zP_{z'}}{Q}=\mathbb{E}\Bigg[\prod_{i<j}\big(1+\sigma_i\sigma_j\widetilde\sigma_i\widetilde\sigma_j\rho\big)\ \Bigg|\ \sigma_1\sigma_2=z,\ \widetilde\sigma_1\widetilde\sigma_2=z'\Bigg]
=(1+o(1))\,e^{-\tau^2/4-\tau/2}\times\mathbb{E}\Bigg[\exp\Big(\frac{\rho}{2}\langle\sigma,\widetilde\sigma\rangle^2\Big)\ \Bigg|\ \sigma_1\sigma_2=z,\ \widetilde\sigma_1\widetilde\sigma_2=z'\Bigg], \qquad (13.66)
$$
where the last equality holds for $\rho=\tau/n+O(1/n^2)$ and uses $\log(1+x)=x-x^2/2+O(x^3)$.
Write $\sigma=2\xi-1$ for $\xi\in\{0,1\}^n$ and let
$$
H_1\triangleq\xi_1\widetilde\xi_1+\xi_2\widetilde\xi_2\qquad\text{and}\qquad H_2\triangleq\sum_{j\ge3}\xi_j\widetilde\xi_j.
$$
$$
=\frac{1+o(1)}{\sqrt{1-\tau}},
$$
where the last equality holds due to $n\rho=\tau+o(1/n)$, $\tau<1$, and the convergence of the moment-generating function.
References
[1] E. L. Lehmann and G. Casella, Theory of point estimation, 2nd edn. Springer, 1998.
[2] L. D. Brown and M. G. Low, “Information inequality bounds on the minimax risk (with
an application to nonparametric regression),” Annals Statist., vol. 19, no. 1, pp. 329–337,
1991.
[3] A. W. Van der Vaart, Asymptotic statistics. Cambridge University Press, 2000.
[4] L. Le Cam, Asymptotic methods in statistical decision theory. Springer, 1986.
[5] I. A. Ibragimov and R. Z. Khas'minskii, Statistical estimation: Asymptotic theory. Springer, 1981.
[6] L. Birgé, “Approximation dans les espaces métriques et théorie de l’estimation,” Z.
Wahrscheinlichkeitstheorie verwandte Gebiete, vol. 65, no. 2, pp. 181–237, 1983.
[7] Y. Yang and A. R. Barron, “Information-theoretic determination of minimax rates of
convergence,” Annals Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[8] Q. Berthet and P. Rigollet, “Complexity theoretic lower bounds for sparse principal com-
ponent detection,” Journal of Machine Learning Research: Workshop and Conference
Proceedings, vol. 30, pp. 1046–1066, 2013.
[9] Z. Ma and Y. Wu, “Computational barriers in minimax submatrix detection,” Annals
Statist., vol. 43, no. 3, pp. 1089–1116, 2015.
[10] B. Hajek, Y. Wu, and J. Xu, “Computational lower bounds for community detection on
random graphs,” in Proc. COLT 2015, 2015, pp. 899–928.
[11] T. Wang, Q. Berthet, and R. J. Samworth, “Statistical and computational trade-offs in
estimation of sparse principal components,” Annals Statist., vol. 44, no. 5, pp. 1896–1930,
2016.
[12] C. Gao, Z. Ma, and H. H. Zhou, “Sparse CCA: Adaptive estimation and computational
barriers,” Annals Statist., vol. 45, no. 5, pp. 2074–2101, 2017.
[13] M. Brennan, G. Bresler, and W. Huleihel, “Reducibility and computational lower bounds
for problems with planted sparse structure,” in Proc. COLT 2018, 2018, pp. 48–166.
[14] F. McSherry, “Spectral partitioning of random graphs,” in 42nd IEEE Symposium on
Foundations of Computer Science, 2001, pp. 529–537.
[15] E. Arias-Castro and N. Verzelen, “Community detection in dense random networks,”
Annals Statist., vol. 42, no. 3, pp. 940–969, 2014.
[16] Y. Chen and J. Xu, “Statistical–computational tradeoffs in planted problems and subma-
trix localization with a growing number of clusters and submatrices,” in Proc. ICML 2014,
2014, arXiv:1402.1267.
[17] A. Montanari, “Finding one community in a sparse random graph,” J. Statist. Phys., vol.
161, no. 2, pp. 273–299, arXiv:1502.05680, 2015.
[18] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,”
Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
[19] A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel, “Finding large average
submatrices in high dimensional data,” Annals Appl. Statist., vol. 3, no. 3, pp. 985–1012,
2009.
[20] M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh, “Minimax localization of struc-
tural information in large noisy matrices,” in Advances in Neural Information Processing
Systems, 2011.
[21] C. Butucea and Y. I. Ingster, “Detection of a sparse submatrix of a high-dimensional noisy
matrix,” Bernoulli, vol. 19, no. 5B, pp. 2652–2688, 2013.
[22] C. Butucea, Y. Ingster, and I. Suslina, “Sharp variable selection of a sparse submatrix in a
high-dimensional noisy matrix,” ESAIM: Probability and Statistics, vol. 19, pp. 115–134,
2015.
[23] T. T. Cai, T. Liang, and A. Rakhlin, “Computational and statistical boundaries for subma-
trix localization in a large noisy matrix,” Annals Statist., vol. 45, no. 4, pp. 1403–1430,
2017.
[24] A. Perry, A. S. Wein, and A. S. Bandeira, “Statistical limits of spiked tensor models,”
arXiv:1612.07728, 2016.
[25] A. E. Alaoui, F. Krzakala, and M. I. Jordan, “Finite size corrections and likelihood ratio
fluctuations in the spiked Wigner model,” arXiv:1710.02903, 2017.
[26] E. Mossel, J. Neeman, and A. Sly, “A proof of the block model threshold conjecture,”
Combinatorica, vol. 38, no. 3, pp. 665–708, 2013.
[27] J. Baik, G. Ben Arous, and S. Péché, “Phase transition of the largest eigenvalue
for nonnull complex sample covariance matrices,” Annals Probability, vol. 33, no. 5,
pp. 1643–1697, 2005.
[28] S. Péché, “The largest eigenvalue of small rank perturbations of hermitian random
matrices,” Probability Theory Related Fields, vol. 134, no. 1, pp. 127–173, 2006.
[29] F. Benaych-Georges and R. R. Nadakuditi, “The eigenvalues and eigenvectors of finite,
low rank perturbations of large random matrices,” Adv. Math., vol. 227, no. 1, pp. 494–
521, 2011.
[30] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, “Spec-
tral redemption in clustering sparse networks,” Proc. Natl. Acad. Sci. USA, vol. 110,
no. 52, pp. 20 935–20 940, 2013.
[31] L. Massoulié, “Community detection thresholds and the weak Ramanujan property,” in
Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, 2014,
pp. 694–703, arXiv:1109.3318.
[32] C. Bordenave, M. Lelarge, and L. Massoulié, “Non-backtracking spectrum of random
graphs: Community detection and non-regular Ramanujan graphs,” in 2015 IEEE 56th
Annual Symposium on Foundations of Computer Science (FOCS), 2015, pp. 1347–1357,
arXiv: 1501.06087.
[33] J. Banks, C. Moore, R. Vershynin, N. Verzelen, and J. Xu, “Information-theoretic bounds
and phase transitions in clustering, sparse PCA, and submatrix localization,” IEEE Trans.
Information Theory, vol. 64, no. 7, pp. 4872–4894, 2018.
[34] Y. I. Ingster and I. A. Suslina, Nonparametric goodness-of-fit testing under Gaussian
models. Springer, 2003.
[35] Y. Wu, “Lecture notes on information-theoretic methods for high-dimensional statistics,”
2017, www.stat.yale.edu/~yw562/teaching/598/it-stats.pdf.
[36] I. Vajda, “On metric divergences of probability measures,” Kybernetika, vol. 45, no. 6,
pp. 885–900, 2009.
[37] W. Feller, An introduction to probability theory and its applications, 3rd edn. Wiley, 1970,
vol. I.
[38] W. Kozakiewicz, “On the convergence of sequences of moment generating functions,”
Annals Math. Statist., vol. 18, no. 1, pp. 61–69, 1947.
[39] E. Mossel, J. Neeman, and A. Sly, “Reconstruction and estimation in the planted partition
model,” Probability Theory Related Fields, vol. 162, nos. 3–4, pp. 431–461, 2015.
[40] Y. Polyanskiy and Y. Wu, “Application of information-percolation method to reconstruc-
tion problems on graphs,” arXiv:1804.05436, 2018.
[41] E. Abbe and E. Boix, “An information-percolation bound for spin synchronization on
general graphs,” arXiv:1806.03227, 2018.
[42] J. Banks, C. Moore, J. Neeman, and P. Netrapalli, “Information-theoretic thresholds for
community detection in sparse networks,” in Proc. 29th Conference on Learning Theory,
COLT 2016, 2016, pp. 383–416.
[43] D. Guo, S. Shamai, and S. Verdú, “Mutual information and minimum mean-square error
in Gaussian channels,” IEEE Trans. Information Theory, vol. 51, no. 4, pp. 1261–1282,
2005.
[44] Y. Deshpande and A. Montanari, “Information-theoretically optimal sparse PCA,” in
IEEE International Symposium on Information Theory, 2014, pp. 2197–2201.
[45] Y. Deshpande, E. Abbe, and A. Montanari, “Asymptotic mutual information for the two-
groups stochastic block model,” arXiv:1507.08685, 2015.
[46] F. Krzakala, J. Xu, and L. Zdeborová, “Mutual information in rank-one matrix esti-
mation,” in 2016 IEEE Information Theory Workshop (ITW), 2016, pp. 71–75, arXiv:
1603.08447.
[88] J. Xu, “Rates of convergence of spectral methods for graphon estimation,” in Proc. 35th
International Conference on Machine Learning, 2018, arXiv:1709.03183.
[89] T. Lesieur, F. Krzakala, and L. Zdeborová, “Phase transitions in sparse PCA,” in IEEE
International Symposium on Information Theory, 2015, pp. 1635–1639.
[90] A. E. Alaoui and F. Krzakala, “Estimation in the spiked Wigner model: A short proof of
the replica formula,” arXiv:1801.01593, 2018.
[91] A. Montanari, D. Reichman, and O. Zeitouni, “On the limitation of spectral methods:
From the Gaussian hidden clique problem to rank one perturbations of Gaussian ten-
sors,” in Advances in Neural Information Processing Systems, 2015, pp. 217–225, arXiv:
1411.6149.
[92] C. J. Hillar and L.-H. Lim, “Most tensor problems are NP-hard,” J. ACM, vol. 60, no. 6,
pp. 45:1–45:39, 2013.
[93] A. Montanari and E. Richard, “A statistical model for tensor PCA,” in Proc. 27th Inter-
national Conference on Neural Information Processing Systems, 2014, pp. 2897–2905.
[94] T. Lesieur, L. Miolane, M. Lelarge, F. Krzakala, and L. Zdeborová, “Statistical and
computational phase transitions in spiked tensor estimation,” arXiv:1701.08010, 2017.
[95] S. B. Hopkins, J. Shi, and D. Steurer, “Tensor principal component analysis via sum-of-
square proofs,” in COLT, 2015, pp. 956–1006.
[96] A. Zhang and D. Xia, “Tensor SVD: Statistical and computational limits,”
arXiv:1703.02724, 2017.
[97] D. Paul, “Asymptotics of sample eigenstructure for a large dimensional spiked covariance model,” Statistica Sinica, vol. 17, no. 4, pp. 1617–1642, 2007.
[98] T. Lesieur, C. D. Bacco, J. Banks, F. Krzakala, C. Moore, and L. Zdeborová, “Phase
transitions and optimal algorithms in high-dimensional Gaussian mixture clustering,”
arXiv:1610.02918, 2016.
[99] I. Csiszár, “Almost independence and secrecy capacity,” Problemy peredachi informatsii,
vol. 32, no. 1, pp. 48–57, 1996.
[100] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,”
IEEE Trans. Information Theory, vol. 62, no. 1, pp. 35–55, 2016.
14 Distributed Statistical Inference with
Compressed Data
Wenwen Zhao and Lifeng Lai
Summary
This chapter introduces the basic ideas of information-theoretic models for distributed statistical inference problems with compressed data, and discusses current and future research directions and challenges in applying these models to various statistical learning problems. In these applications, data are distributed in multiple terminals, which can
communicate with each other via limited-capacity channels. Instead of recovering data
at a centralized location first and then performing inference, this chapter describes
schemes that can perform statistical inference without recovering the underlying data.
Information-theoretic tools are borrowed to characterize the fundamental limits of the
classical statistical inference problems using compressed data directly. In this chapter,
distributed statistical learning problems are first introduced. Then, models and results of
distributed inference are discussed. Finally, new directions that generalize and improve
the basic scenarios are described.
14.1 Introduction
Nowadays, large amounts of data are collected by devices and sensors in multiple termi-
nals. In many scenarios, it is risky and expensive to store all of the data in a centralized
location, due to the data size, privacy concerns, etc. Thus, distributed inference, a class
of problems aiming to infer useful information from these distributed data without col-
lecting all of the data in a centralized location, has attracted significant attention. In these
problems, communication among different terminals is usually beneficial but the com-
munication channels between terminals are typically of limited capacity. Compared with
the relatively well-studied centralized setting, where data are stored in one terminal, the
distributed problems with a limited communication budget are more challenging.
Two main scenarios are considered in the existing works on distributed inference. In
the first scenario, named sample partitioning, each terminal has data samples related
to all random variables [1, 2], as shown in Fig. 14.1. In this figure, we use a matrix to
represent the available data. Here, different columns in the matrix denote corresponding
random variables to which the samples are related. The data matrix is partitioned in a
row-wise manner, and each terminal observes a subset of the samples, which relates
to all random variables $(X_1, X_2, \ldots, X_L)$. This scenario is quite common in real life. For example, there are large quantities of voice and image data stored in personal smart devices but, owing to the sensitive nature of the data, we cannot ask all users to send their voice messages or photos to a centralized location. Generally, in this scenario, even though each terminal has fewer data than in the centralized setting, terminals can still apply learning methods to their local data. Certainly, communicating and combining learning results from distributed terminals may improve the performance.
In the second scenario, named feature partitioning, the data stored in each terminal are related to only a subset, not all, of the random variables. Fig. 14.2 illustrates the
feature partitioning scenario, in which the data matrix is partitioned in a column-wise
manner and each terminal observes the data related to a subset of random variables.
For example, terminal X1 has all observations related to random variable X1 . This
scenario is also quite common in practice. For example, different information about
each patient is typically stored in different locations as patients may go to different
departments or different hospitals for different tests. In general, this scenario is more
challenging than sample partitioning as each terminal in the feature partitioning scenario
is not able to obtain meaningful information from its local data alone. Moreover, owing to the limited communication budget (the rate can even be vanishing), recovering the data first and then conducting inference is neither optimal nor necessary. Thus, we need to design inference algorithms that can deal with compressed data directly, which is a much more complicated and challenging problem than the problems discussed above.
This chapter will focus on the inference problems for the feature partitioning
scenario, which was first proposed by Berger [3] in 1979. Many existing works
on the classic distributed inference problem focus on the following three branches:
distributed hypothesis testing or distributed detection, distributed pattern classification,
and distributed estimation [4–14]. Using powerful information-theoretic tools such
as typical sequences, the covering lemma, etc., good upper and lower bounds on
the inference performance are derived for the basic model and many special cases.
Moreover, some results have been extended to more general cases. In this chapter,
we focus on the distributed hypothesis-testing problem. In Section 14.2, we study
the basic distributed hypothesis-testing model with non-interactive communica-
tion in the sense that each terminal sends only one message that is a function of
its local data to the decision-maker [4–9, 11, 15]. More details are introduced in
Section 14.2.
In Section 14.3, we consider more sophisticated models that allow interactive com-
munications. In the interactive communication cases, there can be multiple rounds of
communication among different terminals. In each round, each terminal can utilize
messages received from other terminals in previous rounds along with its own data
to determine the transmitted message. We start with a special form of interaction, i.e.,
cascaded communication among terminals [16, 17], in which terminals broadcast their
messages in a sequential order and each terminal uses all messages received so far along
with its own observations for encoding. Then we discuss the full interaction between
two terminals and analyze their performance [18–20]. More details will be introduced
in Section 14.3.
In Sections 14.2 and 14.3, the probability mass functions (PMFs) are fully specified.
In Section 14.4, we generalize the discussion to the scenario with model uncertainty. In
particular, we will discuss the identity-testing problem, in which the goal is to determine
whether given samples are generated from a certain distribution or not. By interpreting
the distributed identity-testing problem as composite hypothesis-testing problems, the
type-2 error exponent can be characterized using information-theoretic tools. We will
introduce more details in Section 14.4.
In this section, we review basic information-theoretic models for distributed hypothesis-testing prob-
lems. We first introduce the general model, then discuss basic ideas and results for two
special cases: (1) hypothesis testing against independence; and (2) zero-rate data com-
pression. In this chapter, we will present some results and the main ideas behind these
results. For detailed proofs, interested readers can refer to [4–11].
14.2.1 Model
Consider a system with L terminals: Xi , i = 1, . . . , L and a decision-maker Y. Each ter-
minal and the decision-maker observe a component of the random vector (X1 , . . . , XL , Y)
that takes values in a finite set X1 × · · · × XL × Y and admits a PMF with two possible
forms:
$$
H_0:\ P_{X_1\cdots X_LY},\qquad H_1:\ Q_{X_1\cdots X_LY}.
$$
With a slight abuse of notation, Xi is used to denote both the terminal and the alphabet set
from which the random variable Xi takes values. (X1n , . . . , XLn , Y n ) are independently and
identically (i.i.d.) generated according to one of the above joint PMFs. In other words,
(X1n , . . . , XLn , Y n ) is generated by either PnX1 ···XL Y or QnX1 ···XL Y . In a typical hypothesis-
testing problem, one determines which hypothesis is true under the assumption that
(X1n , . . . , XLn , Y n ) are fully available at the decision-maker. In the distributed setting, Xin ,
i = 1, . . . , L and Y n are at different locations. In particular, terminal Xi observes only
Xin and terminal Y observes only Y n . Terminals Xi are allowed to send messages to the
decision-maker Y. Using Y n and the received messages, Y determines which hypothesis
is true. We denote this system as S X1 ···XL |Y . Fig. 14.3 illustrates the system model. In the
following, we will use the term “decision-maker” and terminal Y interchangeably. Here,
Y n is used to model any side information available at the decision-maker. If Y is defined
to be an empty set, then the decision-maker does not have side information.
After observing the data sequence $x_i^n\in\mathcal{X}_i^n$, terminal $X_i$ will use an encoder $f_i$ to transform the sequence $x_i^n$ into a message $f_i(x_i^n)$, which takes values from the message set $\mathcal{M}_i=\{1,\ldots,M_i\}$, subject to the communication-rate constraints
$$
\limsup_{n\to\infty}\frac{1}{n}\log M_i\le R_i,\qquad i=1,\ldots,L. \qquad (14.3)
$$
Using $Y^n$ and the received messages, the decision-maker employs a decision function
$$
\psi:\ \mathcal{M}_1\times\cdots\times\mathcal{M}_L\times\mathcal{Y}^n\to\{H_0,H_1\}. \qquad (14.4)
$$
Figure 14.3 The basic model (terminals $X_1,\ldots,X_L$ send messages $f_1(X_1^n),\ldots,f_L(X_L^n)$ to the decision-maker $Y$).
For any given encoding functions $f_i$, $i=1,\ldots,L$, and decision function $\psi$, one can define the acceptance region $\mathcal{A}_n$ as the set of tuples $(f_1(x_1^n),\ldots,f_L(x_L^n),y^n)$ on which $\psi$ declares $H_0$, the type-1 error probability as $\alpha_n\triangleq P^n_{X_1\cdots X_LY}(\mathcal{A}_n^c)$, in which $\mathcal{A}_n^c$ denotes the complement of $\mathcal{A}_n$, and the type-2 error probability as $\beta_n\triangleq Q^n_{X_1\cdots X_LY}(\mathcal{A}_n)$.
The goal is to design the encoding functions fi , i = 1, . . . , L and the decision function
ψ to maximize the type-2 error exponent under certain type-1 error and communication-
rate constraints (14.3).
More specifically, one can consider two kinds of type-1 error constraint, namely the
following.
• Constant-type constraint
$$
\alpha_n\le\epsilon \qquad (14.8)
$$
for a prefixed $\epsilon>0$, which implies that the type-1 error probability must be smaller than a given threshold.
• Exponential-type constraint
αn ≤ exp(−nr) (14.9)
for a given r > 0, which implies that the type-1 error probability must decrease expo-
nentially fast with an exponent no less than r. Hence the exponential-type constraint
is stricter than the constant-type constraint.
To distinguish these two different type-1 error constraints, we use different notations to denote the corresponding type-2 error exponents, and we use the subscript 'b' to denote that the error exponent is under the basic model.
• Under the constant-type constraint, we define the type-2 error exponent as
$$
\theta_b(R_1,\ldots,R_L,\epsilon)=\liminf_{n\to\infty}-\frac{1}{n}\log\min_{f_1,\ldots,f_L,\psi}\beta_n,
$$
in which the minimization is over all $f_i$s and $\psi$ satisfying conditions (14.3) and (14.8).
• Under the exponential-type constraint, we define the type-2 error exponent as
$$
\sigma_b(R_1,\ldots,R_L,r)=\liminf_{n\to\infty}-\frac{1}{n}\log\min_{f_1,\ldots,f_L,\psi}\beta_n,
$$
in which the minimization is over all $f_i$s and $\psi$ satisfying conditions (14.3) and (14.9).
According to the Slepian–Wolf theorem [21], the decoder can recover the original sequences
with diminishing error probability when the compression rates are larger than certain
values. Hence, when the compression rates are sufficiently large, we can adopt the source
coding method to first recover the original source sequences and then perform infer-
ences. However, in general, the rate constraints are typically too strict for the decision-
maker to fully recover $\{X_l^n\}_{l=1}^L$ in the inference problem. Moreover, in the inference problem, recovery of source sequences is not its goal and typically is not necessary.
On the other hand, this inference problem is closely connected to distributed source coding problems. In particular, the general idea of the existing schemes in distributed inference problems is to mimic the schemes used in distributed source coding problems. In the existing studies [4–7, 22], each terminal $X_l$ compresses its sequence $X_l^n$ into $U_l^n$. Then these terminals send the auxiliary sequences $\{U_l^n\}_{l=1}^L$ to the decision-maker using source coding ideas so that the decision-maker can obtain $\{\widehat U_l^n\}_{l=1}^L$, which has a high probability of being the same as $\{U_l^n\}_{l=1}^L$. The compression step is to make sure each terminal sends enough information for one to recover $U_l^n$ but does not exceed the rate constraint. Finally, the decision-maker will decide between the two hypotheses using $\{\widehat U_l^n\}_{l=1}^L$. Hence, we can see that, even though the decision-maker does not need to recover the sequences $\{X_l^n\}_{l=1}^L$, it does need to recover $\{U_l^n\}_{l=1}^L$ from the compressed messages.
Figure 14.4 A canonical example for the source coding problem (terminals $X_1,\ldots,X_L$ send $f_1(X_1^n),\ldots,f_L(X_L^n)$ to a decoder that reconstructs $\widehat X_1^n,\ldots,\widehat X_L^n$).
is denoted by $\mathcal{P}_n(\mathcal{X})$. Furthermore, we call a random variable $X^{(n)}$ that has the same distribution as $\mathrm{tp}(x^n)$ the type variable of $x^n$.
For any given sequence $x^n$, we use typicality to measure how likely it is that this sequence was generated from a PMF $P_X$.
definition 14.1 (Definition 1 of [6]) For a given type $P_X\in\mathcal{P}_n(\mathcal{X})$ and a constant $\eta$, we denote by $T_\eta^n(X)$ the set of $(P_X,\eta)$-typical sequences in $\mathcal{X}^n$:
$$
T_\eta^n(X)\triangleq\big\{x^n\in\mathcal{X}^n:\ |\pi(a|x^n)-P_X(a)|\le\eta P_X(a),\ \forall a\in\mathcal{X}\big\}.
$$
In the same manner, we use $T_\eta^n(\widehat X)$ to denote the set of $(\widehat P_X,\eta)$-typical sequences. Note that, when $\eta=0$, $T_0^n(X)$ denotes the set of sequences $x^n\in\mathcal{X}^n$ of type $P_X$, and we use $T^n(P_X)$ to denote this set.
Similarly, for multiple random variables, define their joint empirical PMF as
$$
\pi(a,b|x^n,y^n)\triangleq\frac{n(a,b|x^n,y^n)}{n},\qquad\forall(a,b)\in\mathcal{X}\times\mathcal{Y}. \qquad (14.13)
$$
For a given type $P_{XY}\in\mathcal{P}_n(\mathcal{X}\times\mathcal{Y})$ and a constant $\eta$, we denote by $T_\eta^n(XY)$ the set of jointly $(P_{XY},\eta)$-typical sequences in $\mathcal{X}^n\times\mathcal{Y}^n$:
$$
T_\eta^n(XY)\triangleq\big\{(x^n,y^n)\in\mathcal{X}^n\times\mathcal{Y}^n:\ |\pi(a,b|x^n,y^n)-P_{XY}(a,b)|\le\eta P_{XY}(a,b),\ \forall(a,b)\in\mathcal{X}\times\mathcal{Y}\big\}. \qquad (14.14)
$$
Furthermore, for $y^n\in\mathcal{Y}^n$, we define $T_\eta^n(X|y^n)$ as the set of all $x^n$s that are jointly typical with $y^n$:
$$
T_\eta^n(X|y^n)=\{x^n\in\mathcal{X}^n:\ (x^n,y^n)\in T_\eta^n(XY)\}. \qquad (14.15)
$$
More details and properties can be found in [6, 21].
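As a small illustration of Definition 14.1 (the function names and parameters below are ours), the following Python fragment computes the empirical PMF π(·|x^n) and checks (P_X, η)-typicality, and does the same for pairs of sequences.

```python
import numpy as np
from collections import Counter

def empirical_pmf(xn, alphabet):
    """pi(a | x^n) = n(a | x^n) / n for every a in the alphabet."""
    counts = Counter(xn)
    n = len(xn)
    return {a: counts.get(a, 0) / n for a in alphabet}

def is_typical(xn, P, eta):
    """(P_X, eta)-typicality: |pi(a|x^n) - P(a)| <= eta * P(a) for all a."""
    pi = empirical_pmf(xn, P.keys())
    return all(abs(pi[a] - P[a]) <= eta * P[a] for a in P)

def is_jointly_typical(xn, yn, Pxy, eta):
    """(P_XY, eta)-typicality of the pair sequence (x^n, y^n)."""
    pi = empirical_pmf(list(zip(xn, yn)), Pxy.keys())
    return all(abs(pi[ab] - Pxy[ab]) <= eta * Pxy[ab] for ab in Pxy)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = {0: 0.7, 1: 0.3}
    xn = rng.choice([0, 1], size=1000, p=[0.7, 0.3]).tolist()
    print("typical w.r.t. P:", is_typical(xn, P, eta=0.1))
    Q = {0: 0.4, 1: 0.6}
    print("typical w.r.t. Q:", is_typical(xn, Q, eta=0.1))
```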
Note that the marginal distributions of $(X_1,\ldots,X_L)$ and $Y$ are the same under both hypotheses in the case of testing against independence. The problem has been studied in [4, 9].
When $L=1$ (the system is then denoted as $S_{X_1|Y}$), this problem can be shown to have a close connection with the problem of source coding with a single helper studied in [22]. Building on this connection and the results in [22], Ahlswede and Csiszár [4] provided a single-letter characterization of the optimal type-2 error exponent.
theorem 14.1 (Theorem 2 of [4]) In the system S X1 |Y with R1 ≥ 0, when the con-
straint on the type-1 error probability (14.8) and communication constraints (14.3) are
satisfied, the best error exponent for the type-2 error probability satisfies
where
When $L\ge2$, one can follow a similar approach to that in [4] to connect the testing-against-independence problem to the problem of source coding with multiple helpers. However, unlike the problem of source coding with a single helper, the general problem of source coding with multiple helpers is still open. Hence, a different approach is needed to tackle this more complicated problem. First, we provide a lower bound on the type-2 error exponent by generalizing Theorem 6 in [9] to the case of $L$ terminals.
theorem 14.2 (Theorem 6 of [9]) In the system S X1 ···XL |Y with Ri > 0, i = 1, . . . , L,
when the constraint on the type-1 error probability (14.8) and communication con-
straints (14.3) are satisfied, the error exponent of the type-2 error probability is
lower-bounded by
in which the maximization is over the PUi |Xi s such that I(Ui ; Xi ) ≤ Ri and |Ui | ≤ |Xi | + 1
for i = 1, . . . , L.
The lower bound in Theorem 14.2 can be viewed as a generalization of the bound in Theorem 14.1. The constraints in Theorem 14.2 can be interpreted as the following Markov-chain conditions on the auxiliary random variables: $U_i\leftrightarrow X_i\leftrightarrow(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_L,Y)$ for $i=1,\ldots,L$.
To achieve this lower bound, we design the following encoding/decoding scheme. For a given rate constraint $R_i$, terminal $X_i$ first generates a quantization codebook containing $2^{nR_i}$ quantization sequences $u_i^n$. After observing $x_i^n$, terminal $X_i$ picks one sequence $u_i^n$ from the quantization codebook to describe $x_i^n$ and sends this sequence to the decision-maker. After receiving the descriptions from the terminals, the decision-maker will declare that the hypothesis $H_0$ is true if the descriptions from these terminals and the side information at the decision-maker are correlated. Otherwise, the decision-maker will declare $H_1$ true.
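The numerical example in Section 14.3 evaluates bounds of this kind by a grid search over the auxiliary channels. The sketch below is ours: it assumes L = 1, binary alphabets, and takes the achievable type-2 error exponent to be the Ahlswede–Csiszár testing-against-independence quantity I(U_1;Y), maximized over channels P_{U_1|X_1} with I(U_1;X_1) ≤ R_1 and U_1 ↔ X_1 ↔ Y.

```python
import numpy as np

def mutual_info(Pxy):
    """I(X;Y) in nats for a joint PMF given as a 2-D array."""
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    mask = Pxy > 0
    return float(np.sum(Pxy[mask] * np.log(Pxy[mask] / (Px * Py)[mask])))

def best_exponent(Pxy, R1, grid=101):
    """Grid search over binary U1: maximize I(U1;Y) s.t. I(U1;X1) <= R1."""
    best = 0.0
    ts = np.linspace(0.0, 1.0, grid)
    Px = Pxy.sum(axis=1)
    for a in ts:            # P(U1 = 0 | X1 = 0)
        for b in ts:        # P(U1 = 0 | X1 = 1)
            W = np.array([[a, 1 - a], [b, 1 - b]])   # rows: x1, cols: u1
            Pux = (W * Px[:, None]).T                # joint PMF P_{U1 X1}
            if mutual_info(Pux) > R1:
                continue
            Puy = W.T @ Pxy                          # P_{U1 Y} via the chain U1 - X1 - Y
            best = max(best, mutual_info(Puy))
    return best

if __name__ == "__main__":
    Pxy = np.array([[0.4, 0.1],
                    [0.1, 0.4]])                     # a toy correlated pair (X1, Y)
    for R1 in [0.05, 0.2, np.log(2)]:
        print(f"R1 = {R1:.3f}: achievable type-2 exponent ~ {best_exponent(Pxy, R1):.4f}")
```

At R_1 = log 2 the search recovers I(X_1;Y) itself (U_1 = X_1 is then admissible), and the value decreases as the rate budget shrinks.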
Then, we establish an upper bound on the type-2 error exponent that any scheme can
achieve by generalizing Theorem 7 in [9] to the case of L terminals.
in which the maximization is over the Ui s such that Ri ≥ I(Ui ; Xi ), |Ui | ≤ |Xi | + 1, Ui ↔
Xi ↔ (X1 , . . . , Xi−1 , Xi+1 , . . . , XL , Y) for i = 1, . . . , L.
We note that the constraints on auxiliary random variables in Theorems 14.2 and 14.3
are different. In particular, the Markov constraints in Theorem 14.3 are less strict than
those in Theorem 14.2. This implies that the lower bound and the upper bound do not
match with each other. Hence, more exploration of this problem is needed.
For the case of testing against independence under an exponential-type constraint on
the type-1 error probability, only a lower bound is established in [7], which is stated in
the following theorem.
First, let $\Phi$ denote the set of all continuous mappings from $\mathcal{P}(\mathcal{X}_1)$ to $\mathcal{P}(\mathcal{U}_1|\mathcal{X}_1)$. $\omega(\widetilde X_1)$ is an element in $\Phi$ for one particular $\widetilde X_1$. $\widetilde U_1$ is an auxiliary random variable that satisfies $P_{\widetilde U_1|\widetilde X_1}=\omega(\widetilde X_1)$. Then, define
$$
\phi_{X_1}(R_1,r)=\Bigg\{\omega\in\Phi:\ \max_{\substack{\widetilde X_1:\ D(\widetilde X_1\|X_1)\le r\\ P_{\widetilde U_1|\widetilde X_1}=\omega(\widetilde X_1)}} I(\widetilde U_1;\widetilde X_1)\le R_1\Bigg\}. \qquad (14.22)
$$
theorem 14.4 (Corollary 2 of [7]) In the system $S_{X_1|Y}$ with $R_1\ge0$, when the constraint on the type-1 error probability (14.9) and communication constraints (14.3) are satisfied, the best error exponent for the type-2 error probability is lower-bounded by
$$
\sigma_b(R_1,r)\ \ge\ \max_{\omega\in\phi_{X_1}(R_1,r)}\ \ \min_{\substack{\widetilde U_1\widetilde X_1\widetilde Y:\ |\widetilde{\mathcal{U}}_1|\le|\mathcal{X}_1|+2\\ D(\widetilde U_1\widetilde X_1\widetilde Y\|U_1X_1Y)\le r\\ P_{\widetilde U_1|\widetilde X_1}=P_{U_1|X_1}=\omega(\widetilde X_1)\\ \widetilde U_1\leftrightarrow\widetilde X_1\leftrightarrow\widetilde Y}}\Big[D(\widetilde X_1\|X_1)+I(\widetilde U_1;\widetilde Y)\Big]. \qquad (14.23)
$$
If we let $r=0$, then $D(\widetilde X_1\|X_1)+I(\widetilde U_1;\widetilde Y)$ reduces to $I(\widetilde U_1;\widetilde Y)$, which is the same as in (14.19). Unlike Theorem 14.1, a matching upper bound is hard to establish owing to the stronger constraint on the type-1 error probability.
In this case, σb (R1 , . . . , RL , r) will be denoted as σb (0, . . . , 0, r). This zero-rate compres-
sion is of practical interest, as the normalized (by the length of the data) communication
cost is minimal. It is well known that this kind of zero-rate information is not useful in
the traditional distributed source coding with side-information problems [21, 23], whose
goal is to recover (X1n , . . . , XLn ) at terminal Y. However, in the distributed inference setup,
the goal is only to determine which hypothesis is true. The limited information from
zero-rate compressed messages will be very useful. A clear benefit of this zero-rate
compression approach is that the terminals need to consume only a limited amount of
communication resources.
To achieve this bound, one can use the following simple encoding/decoding scheme: if the observed sequence $x_1^n\in T^n(P_{X_1})$, i.e., it is a typical sequence of $P_{X_1}$, then we send 1 to the decision-maker $Y$; otherwise, we send 0. If terminal $Y$ receives 1, then it decides that $H_0$ is true; otherwise, it decides that $H_1$ is true. Using the properties of typical sequences, we
can easily get the lower bound on the type-2 error exponent. To get the matching upper
bound, the condition D(PX1 Y ||QX1 Y ) < +∞ is required. Readers can refer to [6] for more
details.
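A toy Monte Carlo sketch of this one-bit scheme (hypothetical marginals and a small typicality slack η; the side information Y^n plays no role, exactly as in the decision rule described above): terminal X_1 sends a single bit indicating whether x_1^n is typical for P_{X_1}, and we estimate the two error probabilities.

```python
import numpy as np

def send_bit(x1n, P, eta=0.05):
    """1 if x_1^n is (P_{X1}, eta)-typical, else 0 (the zero-rate encoder)."""
    freq = np.array([np.mean(x1n == a) for a in range(len(P))])
    return int(np.all(np.abs(freq - P) <= eta * P))

def error_probs(P_x, Q_x, n=500, trials=2000, rng=None):
    rng = rng or np.random.default_rng(0)
    alpha = beta = 0
    for _ in range(trials):
        x_h0 = rng.choice(len(P_x), size=n, p=P_x)   # data under H0
        x_h1 = rng.choice(len(Q_x), size=n, p=Q_x)   # data under H1
        alpha += send_bit(x_h0, P_x) == 0            # decide H1 although H0 is true
        beta += send_bit(x_h1, P_x) == 1             # decide H0 although H1 is true
    return alpha / trials, beta / trials

if __name__ == "__main__":
    P_x = np.array([0.6, 0.4])      # marginal of X1 under H0
    Q_x = np.array([0.45, 0.55])    # marginal of X1 under H1
    a, b = error_probs(P_x, Q_x)
    print(f"estimated type-1 error {a:.3f}, type-2 error {b:.3f}")
```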
For the general zero-rate data compression, as we have Mi ≥ 2, more information is
sent to the decision-maker, which may lead to a better performance. Hence, the following
inequality holds:
The scenario with general zero-rate data compression under a constant-type constraint
has been considered in [8]. Shalaby adopted the blowing-up lemma [24] to give a tight
upper bound on the type-2 error exponent.
theorem 14.6 (Theorem 1 of [8]) Let QX1 Y > 0. In the system S X1 |Y with R1 ≥ 0, when
the constraint on the type-1 error probability (14.8) is satisfied for all $\epsilon\in(0,1)$, the best
error exponent for the type-2 error probability is upper-bounded by
Here, the positive condition QX1 Y (x1 , y) > 0, ∀(x1 , y) ∈ X1 × Y is required by the
blowing-up lemma.
Using the inequality (14.26), the combination of Theorems 14.5 and 14.6 yields the
following theorem.
theorem 14.7 (Theorem 2 of [8]) Let QX1 Y > 0. In the system S X1 |Y with R1 ≥ 0, when
the constraint on the type-1 error probability (14.8) is satisfied for all $\epsilon\in(0,1)$, the best
error exponent for the type-2 error probability is
The above results are given for the case $L=1$; in [8], Shalaby and Papamarcou also discussed the case of general $L$.
in which
$$
\sigma_{\mathrm{opt}}\triangleq\min_{\widetilde P_{X_1Y}\in\mathcal{H}_r}D\big(\widetilde P_{X_1Y}\,\big\|\,Q_{X_1Y}\big) \qquad (14.30)
$$
with
$$
\mathcal{H}_r=\Big\{\widetilde P_{X_1Y}:\ \widetilde P_{X_1}=\widehat P_{X_1},\ \widetilde P_Y=\widehat P_Y\ \text{for some}\ \widehat P_{X_1Y}\in\varphi_r\Big\}, \qquad (14.31)
$$
where
$$
\varphi_r=\Big\{\widehat P_{X_1Y}:\ D\big(\widehat P_{X_1Y}\,\big\|\,P_{X_1Y}\big)\le r\Big\}. \qquad (14.32)
$$
To show this bound, the following coding scheme is adopted. After observing $x_1^n$, terminal $X_1$ knows the type $\mathrm{tp}(x_1^n)$ and sends $\mathrm{tp}(x_1^n)$ (or an approximation of it, see below) to the decision-maker. Terminal $Y$ does the same. As there are at most $n^{|\mathcal{X}_1|}$ types [25], the rate required for sending the type from terminal $X_1$ is $(|\mathcal{X}_1|\log n)/n$, which goes to zero as $n$ increases. After receiving all type information from the terminals, the decision-maker will check whether there is a joint type $\widetilde P_{X_1Y}\in\mathcal{H}_r$ such that its marginal types are the same as the information received from the terminals. If so, the decision-maker declares $H_0$ to be true; otherwise it declares $H_1$ to be true. If the message size $M_1$ is less than $n^{|\mathcal{X}_1|}$, then, instead of the exact type information $\mathrm{tp}(x_1^n)$, each terminal will send an approximated version. For more details, please refer to [7].
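The following fragment (illustrative only; the source PMF is arbitrary) computes the type tp(x_1^n) that terminal X_1 would send and the corresponding per-symbol rate (|X_1| log n)/n, showing how quickly the rate vanishes as n grows.

```python
import numpy as np
from collections import Counter

def type_of(xn, alphabet_size):
    """Empirical type tp(x^n) as a vector of counts / n."""
    n = len(xn)
    counts = Counter(xn)
    return np.array([counts.get(a, 0) / n for a in range(alphabet_size)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = [0.5, 0.3, 0.2]
    for n in [100, 10_000, 1_000_000]:
        xn = rng.choice(len(P), size=n, p=P)
        tp = type_of(xn, len(P))
        rate = len(P) * np.log(n) / n      # enough bits to index all <= n^{|X1|} types
        print(f"n={n:>9}: type {np.round(tp, 3)}, rate {rate:.2e} nats/symbol")
```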
Later, Han and Amari [5] proved an upper bound that matched the lower bound in
Theorem 14.8 by converting the problem under an exponential-type constraint to the
problem under a constant-type constraint.
theorem 14.9 (Theorem 5.5 of [5]) Let PX1 Y be arbitrary and QX1 Y > 0. For zero-rate
compression in S X1 Y with R1 = RY = 0, the error exponent is upper-bounded by
with
$$
\mathcal{H}_r=\Big\{\widetilde P_{X_1\cdots X_LY}:\ \widetilde P_{X_i}=\widehat P_{X_i},\ \widetilde P_Y=\widehat P_Y,\ i=1,\ldots,L,\ \text{for some}\ \widehat P_{X_1\cdots X_LY}\in\varphi_r\Big\}, \qquad (14.36)
$$
where
$$
\varphi_r=\Big\{\widehat P_{X_1\cdots X_LY}:\ D\big(\widehat P_{X_1\cdots X_LY}\,\big\|\,P_{X_1\cdots X_LY}\big)\le r\Big\}. \qquad (14.37)
$$
Thus we have obtained a single-letter characterization of the distributed testing problem under zero-rate data compression with an exponential-type constraint on the type-1 error probability. Furthermore, the minimization problems in (14.30) and (14.35) are convex optimization problems, which can be solved efficiently.
corollary 14.1 Given PX1 ···XL Y and QX1 ···XL Y , the problem of finding σopt defined in
(14.35) is a convex optimization problem.
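For instance, for L = 1 the optimization in (14.30)–(14.32) can be written directly as a convex program. The sketch below uses the cvxpy package (our own formulation; the PMFs P and Q are toy choices): it minimizes D(P̃_{X1Y}||Q_{X1Y}) over joint PMFs P̃ whose marginals match those of some P̂ lying in the KL ball of radius r around P_{X1Y}.

```python
import numpy as np
import cvxpy as cp

def sigma_opt(P, Q, r):
    """min D(Ptilde || Q) over Ptilde whose X1- and Y-marginals equal those of
    some Phat with D(Phat || P) <= r.  All PMFs are |X1| x |Y| matrices."""
    Pt = cp.Variable(P.shape, nonneg=True)   # Ptilde
    Ph = cp.Variable(P.shape, nonneg=True)   # Phat
    constraints = [
        cp.sum(Pt) == 1,
        cp.sum(Ph) == 1,
        cp.sum(Pt, axis=1) == cp.sum(Ph, axis=1),   # matching X1-marginals
        cp.sum(Pt, axis=0) == cp.sum(Ph, axis=0),   # matching Y-marginals
        cp.sum(cp.rel_entr(Ph, P)) <= r,            # Phat in the KL ball around P
    ]
    prob = cp.Problem(cp.Minimize(cp.sum(cp.rel_entr(Pt, Q))), constraints)
    prob.solve()
    return prob.value

if __name__ == "__main__":
    P = np.array([[0.4, 0.1],
                  [0.1, 0.4]])               # joint PMF under H0
    Q = np.array([[0.10, 0.30],
                  [0.35, 0.25]])             # joint PMF under H1
    for r in [0.0, 0.01, 0.05]:
        print(f"r = {r:.2f}: sigma_opt ~ {sigma_opt(P, Q, r):.4f}")
```

As expected, the optimal value decreases as the radius r of the exponential-type constraint grows.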
theorem 14.12 (Theorem 1 of [6]) In the system $S_{X_1|Y}$ with $R_1\ge0$, when the constraint on the type-1 error probability (14.8) and communication constraints (14.3) are satisfied, the best error exponent for the type-2 error probability satisfies
$$
\theta_b(R_1,\epsilon)\ \ge\ \max_{U_1\in\varphi_0}\ \min_{\widetilde P_{U_1X_1Y}\in\xi_0} D\big(\widetilde P_{U_1X_1Y}\,\big\|\,Q_{U_1X_1Y}\big), \qquad (14.38)
$$
where
For the case of an exponential-type constraint on the type-1 error probability, we first define
$$
\phi(\omega)=\Big\{\widetilde U\widetilde X_1\widetilde Y:\ D\big(\widetilde U\widetilde X_1\widetilde Y\,\big\|\,UX_1Y\big)\le r,\ \ P_{\widetilde U|\widetilde X_1}=P_{U|X_1}=\omega(\widetilde X_1),\ \ \widetilde U\leftrightarrow\widetilde X_1\leftrightarrow\widetilde Y\Big\} \qquad (14.39)
$$
and
$$
\widehat\phi(\omega)=\Big\{\widehat U\widehat X_1\widehat Y:\ P_{\widehat U\widehat X_1}=P_{\widetilde U\widetilde X_1},\ \ P_{\widehat U\widehat Y}=P_{\widetilde U\widetilde Y}\ \ \text{for some}\ \widetilde U\widetilde X_1\widetilde Y\in\phi(\omega)\Big\}. \qquad (14.40)
$$
theorem 14.13 (Theorem 1 of [7]) In the system S X1 |Y with R1 ≥ 0, when the con-
straint on the type-1 error probability (14.9) and communication constraints (14.3) are
satisfied, the best error exponent for the type-2 error probability satisfies
$(X_1^n,\ldots,X_L^n,Y^n)$ are i.i.d. generated according to one of the above joint PMFs, and are observed at different terminals. These terminals broadcast messages in sequential order from terminal $X_1$ to terminal $X_L$, and each terminal uses all messages received so far along with its own observations for encoding. More specifically, terminal $X_1$ first broadcasts its encoded message, which depends only on $X_1^n$; then terminal $X_2$ broadcasts its encoded message, which depends not only on its own observations $X_2^n$ but also on the message received from terminal $X_1$. The process continues until terminal $X_L$, which uses the messages received from $X_1$ through $X_{L-1}$ and its own observations $X_L^n$ for encoding. Finally, terminal $Y$ decides which hypothesis is true on the basis of its own information and the messages received from terminals $X_1,\ldots,X_L$. The system model
is illustrated in Fig. 14.5.
Using its own observations and messages received from encoders, terminal Y will use
a decoding function ψ to decide which hypothesis is true:
Given the encoding and decoding functions, we can define the acceptance region and
corresponding type-1 error probability, type-2 error probability, and type-2 error expo-
nents under different types of constraints on the type-1 error probability in a similar
manner to those in Section 14.2. To distinguish this case from the basic model, we use
θc and σc to denote the type-2 error exponent under a constant-type constraint and the
type-2 error exponent under an exponential-type constraint, respectively.
where
ϕ0 = {U1 U2 : R1 ≥ I(U1 ; X1 ), R2 ≥ I(U2 ; X2 |U1 ),
U1 ↔ X1 ↔ (X2 , Y),
U2 ↔ (X2 , U1 ) ↔ (X1 , Y),
|U1 | ≤ |X1 | + 1, |U2 | ≤ |X2 | · |U1 | + 1}. (14.48)
To achieve this bound, we employ the following coding scheme. For a given rate
constraint R1 , terminal X1 first generates a quantization codebook containing 2nR1 quan-
tization sequences un1 that are based on x1n . After observing x1n , terminal X1 picks
one sequence un1 from the quantization codebook which is jointly typical with x1n and
broadcasts this sequence. After receiving un1 from terminal X1 , terminal X2 generates
a quantization codebook containing 2nR2 quantization sequences un2 that are based on
both x2n and un1 . Then after observing x2n , terminal X2 picks one sequence un2 such that it
is jointly typical with un1 and x2n and broadcasts this sequence. Upon receiving both un1
and un2 , the decision-maker will declare that the hypothesis H0 is true if the descriptions
from these terminals and the side information at the decision-maker are correlated. Oth-
erwise, the decision-maker will declare H1 true. To prove the converse part, please refer
to [17] for more details.
From the description above, we can get an intuitive idea that the decision-maker in
the interactive case receives more information than does the decision-maker in the non-
interactive case; thus a better performance is expected. In the following, we provide a
numerical example to illustrate the gain obtained from interactive communications.
In the example, we let X1 , X2 , and Y be binary random variables with joint PMF
PX1 X2 Y , which is shown in Table 14.1. With QX1 X2 Y = PX1 X2 PY and increasing com-
munication constraints R = R1 = R2 , we use Theorem 14.14 to find the best value of
the type-2 error exponent that we can achieve using our cascaded scheme. For compar-
ison, we also use Theorem 14.2 to find an upper bound on the type-2 error exponent
of the non-interactive case. By applying a grid search, we find the optimal conditional
distributions PU1 |X1 and PU2 |X2 for the non-interactive case and the optimal conditional
distributions PU1 |X1 and PU2 |X2 U1 for the cascaded case. We then calculate the bound on
the type-2 error exponent for both cases. For R = 0.48, we list the conditional distribu-
tions PU1 |X1 and PU2 |X2 for the non-interactive case in Table 14.2 and the conditional
distributions PU1 |X1 and PU2 |X2 U1 for the cascaded case in Table 14.3.
The simulation results for different Rs are shown in Fig. 14.6. From Fig. 14.6, we
can see that the type-2 error exponents in both cases increase with the increasing R,
which makes sense as the more information we can send, the fewer errors we will make.
Figure 14.6 Simulation results (type-2 error exponent versus the rate R; curves: non-interactive and cascaded).
We also observe that the type-2 error exponent achieved using our cascaded communi-
cation scheme is even larger than an upper bound on the type-2 error exponent of any
non-interactive schemes. Hence, we confirm the intuitive idea that the greater amount of
information offered by cascaded communication facilitates better decision-making for
certain forms of testing against independence cases with positive communication rates.
where
$$
\mathcal{L}=\Big\{\widetilde P_{X_1\cdots X_LY}:\ \widetilde P_{X_i}=P_{X_i},\ \widetilde P_Y=P_Y,\ i=1,\ldots,L\Big\}. \qquad (14.50)
$$
On comparing Theorem 14.15 with Theorem 14.5, we can see that the upper bound
on the type-2 error exponent for the cascaded communication scheme is the same as
the type-2 error exponent achievable by the non-interactive communication scheme.
This implies that the performance of the cascaded communication scheme is the same
as that of the non-interactive communication scheme in the zero-rate data compression
case.
The conclusion that cascaded communication does not improve the type-2 error
exponent in the zero-rate data compression case also holds for the scenario with an
exponential-type constraint on the type-1 error probability. In the cascaded communi-
cation case, according to the results in Theorem 14.15, we can use a similar strategy
to that in [5] to convert the problem under the exponential-type constraint (14.9) to the
corresponding problem under the constraint in (14.8). As the converting strategy is inde-
pendent of the communication style, it will be the same as that in Theorem 14.11. Then
an upper bound on the type-2 error exponent under the exponential-type constraint can
be easily derived without going into details, which will be shown in what follows.
theorem 14.16 (Theorem 8 of [17]) Letting $P_{X_1\cdots X_LY}$ be arbitrary and $Q_{X_1\cdots X_LY}>0$, the best type-2 error exponent for the zero-rate compression case under the type-1 error constraint (14.9), with $L$ cascaded encoders, satisfies
with
$$
\mathcal{H}_r=\Big\{\widetilde P_{X_1\cdots X_LY}:\ \widetilde P_{X_l}=\widehat P_{X_l},\ \widetilde P_Y=\widehat P_Y,\ l=1,\ldots,L,\ \text{for some}\ \widehat P_{X_1\cdots X_LY}\in\varphi_r\Big\}, \qquad (14.52)
$$
where
$$
\varphi_r=\Big\{\widehat P_{X_1\cdots X_LY}:\ D\big(\widehat P_{X_1\cdots X_LY}\,\big\|\,P_{X_1\cdots X_LY}\big)\le r\Big\}. \qquad (14.53)
$$
On comparing Theorem 14.16 with Theorem 14.11, where a matching upper and
lower bound are provided for the non-interactive scheme, we can conclude that there
is no gain in performance on the type-2 error exponent under zero-rate compression
with the exponential-type constraint on the type-1 error probability.
Model
As in Section 14.2, we consider the following hypothesis-testing problem:
$$
H_0:\ P_{X_1Y},\qquad H_1:\ Q_{X_1Y}. \qquad (14.56)
$$
$(X_1^n,Y^n)$ are i.i.d. generated according to one of the above joint PMFs, and are observed at two different terminals. Terminal $X_1$ first encodes its local information into a message $M_{11}$ and broadcasts it. Terminal $Y$ utilizes the message $M_{11}$ to encode its own information as $M_{21}$ and broadcasts it. This is called one round of interactive communication between $X_1$ and $Y$. After receiving $M_{21}$, terminal $X_1$ can further encode its local information as $M_{12}$ and broadcast it. This process continues until $N$ rounds of interactive communication have been carried out. After receiving all messages from terminals $X_1$ and $Y$, terminal $X_1$ acts as the decision-maker and makes a decision about the joint PMF of $(X_1,Y)$. The system model is illustrated in Fig. 14.7.
More specifically, terminal $X_1$ uses the encoding functions
$$
f_{1i}:\ \mathcal{X}_1^n\times\mathcal{M}_{2(i-1)}\times\cdots\times\mathcal{M}_{21}\to\mathcal{M}_{1i}=\{1,\ldots,M_{1i}\},\qquad i=1,\ldots,N, \qquad (14.57)
$$
each of which maps $\mathcal{X}_1^n$ and the messages previously received from $Y$ to $\mathcal{M}_{1i}$. Terminal $Y$ uses the encoders
$$
f_{2i}:\ \mathcal{Y}^n\times\mathcal{M}_{1i}\times\cdots\times\mathcal{M}_{11}\to\mathcal{M}_{2i}=\{1,\ldots,M_{2i}\},\qquad i=1,\ldots,N, \qquad (14.58)
$$
with rate $R$ such that
$$
\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{N}\log(M_{1i}M_{2i})\le R. \qquad (14.59)
$$
The corresponding acceptance region, error probabilities, and error exponents are defined in a manner similar to what we did in Section 14.2. To distinguish this case from the basic model and the cascaded model, we use $\theta_i$ and $\sigma_i$ to denote the type-2 error exponent under a constant-type constraint and the type-2 error exponent under an exponential-type constraint, respectively.
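The round structure of (14.57) and (14.58) can be made explicit with the following schematic Python loop (the encoders and the final test are placeholders of our own; no attempt is made to implement the optimal quantization).

```python
from typing import Callable, List, Sequence

def interactive_protocol(x1n: Sequence[int],
                         yn: Sequence[int],
                         f1: List[Callable],   # f1[i](x1n, past Y-messages) -> M_{1,i+1}
                         f2: List[Callable],   # f2[i](yn, past X1-messages) -> M_{2,i+1}
                         test: Callable) -> str:
    """N rounds of interaction; terminal X1 finally acts as the decision-maker."""
    msgs_from_x1, msgs_from_y = [], []
    for i in range(len(f1)):
        m1 = f1[i](x1n, tuple(msgs_from_y))     # X1 encodes from its data and past Y-messages
        msgs_from_x1.append(m1)
        m2 = f2[i](yn, tuple(msgs_from_x1))     # Y encodes from its data and all X1-messages so far
        msgs_from_y.append(m2)
    return test(x1n, tuple(msgs_from_y))        # decide H0 or H1 from local data and transcript

if __name__ == "__main__":
    # a trivial one-round example: exchange empirical frequencies of the symbol 1
    f1 = [lambda x, t: round(sum(x) / len(x), 2)]
    f2 = [lambda y, t: round(sum(y) / len(y), 2)]
    test = lambda x, t: "H0" if abs(sum(x) / len(x) - t[0]) < 0.1 else "H1"
    print(interactive_protocol([0, 1, 1, 0], [1, 1, 0, 1], f1, f2, test))
```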
$$
\lim_{\epsilon\to0}\theta_i\ \ge\ \max_{U_{[1:N]}V_{[1:N]}\in\varphi(R)}\ \sum_{k=1}^{N}\Big[I\big(U_{[k]};Y\,\big|\,U_{[1:k-1]}V_{[1:k-1]}\big)+I\big(V_{[k]};Y\,\big|\,U_{[1:k]}V_{[1:k-1]}\big)\Big], \qquad (14.61)
$$
where
$$
\varphi(R)\triangleq\Big\{U_{[1:N]}V_{[1:N]}:\ R\ge\sum_{k=1}^{N}\Big[I\big(U_{[k]};Y\,\big|\,U_{[1:k-1]}V_{[1:k-1]}\big)+I\big(V_{[k]};Y\,\big|\,U_{[1:k]}V_{[1:k-1]}\big)\Big],\ U_{[k]}\leftrightarrow(X,U_{1:k-1},V_{1:k-1})\leftrightarrow Y,\ |\mathcal{U}_{[k]}|<\infty,\ V_{[k]}\leftrightarrow(Y,U_{1:k},V_{1:k-1})\leftrightarrow X,\ |\mathcal{V}_{[k]}|<\infty,\ k=1,\ldots,N\Big\}. \qquad (14.62)
$$
For N = 1, which means there is only one round of communication between terminals X_1 and Y, one can also prove a matching upper bound on the type-2 error exponent. Hence one has the following theorem.
theorem 14.19 (Theorem 3 of [20]) For the system S^{X_1 Y} with N rounds of communication, the type-2 error exponent satisfies
To achieve this bound, one can employ the following coding scheme. Terminal X_1 first generates a quantization codebook containing 2^{n(I(U;X)+η)} quantization sequences u^n. After observing x_1^n, terminal X_1 picks a sequence u^n from the quantization codebook that is jointly typical with x_1^n and broadcasts it. After receiving u^n from terminal X_1, terminal Y generates a quantization codebook containing 2^{n(I(V;Y)+η)} quantization sequences v^n based on both y^n and the received u^n. Then, after observing y^n, terminal Y picks a sequence v^n that is jointly typical with u^n and y^n and broadcasts it. Upon receiving v^n, the decision-maker declares hypothesis H_0 true if the descriptions from terminal Y and the information at terminal X_1 are correlated; otherwise, it declares H_1 true. For the proof of the converse part, please refer to [20] for more details.
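The quantization and decision steps above both rest on joint-typicality checks. The following is a minimal illustrative sketch of such a check based on empirical joint types; the function names, the tolerance eps, and the toy PMF are our own choices, not part of the scheme in [20].

```python
import numpy as np
from itertools import product

def empirical_joint_type(xs, ys, X, Y):
    """Empirical joint type (relative frequencies) of two equal-length sequences."""
    counts = np.zeros((len(X), len(Y)))
    for x, y in zip(xs, ys):
        counts[X.index(x), Y.index(y)] += 1
    return counts / len(xs)

def jointly_typical(xs, ys, p_xy, X, Y, eps=0.05):
    """Strong-typicality test: every entry of the empirical joint type is within
    eps of the reference joint PMF p_xy."""
    return bool(np.all(np.abs(empirical_joint_type(xs, ys, X, Y) - p_xy) <= eps))

# Toy usage: binary alphabets, correlated pair under H0.
X_alpha, Y_alpha = [0, 1], [0, 1]
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                 # reference joint PMF under H0
rng = np.random.default_rng(0)
pairs = list(product(X_alpha, Y_alpha))
idx = rng.choice(len(pairs), size=10_000, p=p_xy.flatten())
xs = [pairs[i][0] for i in idx]
ys = [pairs[i][1] for i in idx]
print(jointly_typical(xs, ys, p_xy, X_alpha, Y_alpha))   # True with high probability
```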
where

$$\mathcal{L}_0 = \bigl\{\tilde{P}_{X_1 Y} : \tilde{P}_{X_1} = P_{X_1},\ \tilde{P}_{Y} = P_{Y}\bigr\}.$$
In Sections 14.2 and 14.3, we dealt with the scenarios where the PMF under each hypoth-
esis is fully specified. However, there are practical scenarios where the probabilistic
models are not fully specified. One of these problems is the identity-testing problem,
in which the goal is to determine whether given samples are generated from a certain
distribution or not.
In this section, we discuss the identity-testing problem in the feature partitioning
scenario with non-interactive communication of the encoders. As in Section 14.2, we
consider a setup with L terminals (encoders) Xl , l = 1, . . . , L and a decision-making
terminal Y. (X1n , . . . , XLn , Y n ) are generated according to some unknown PMF PX1 ···XL Y .
Terminals {X_l}_{l=1}^{L} can send compressed messages related to their own data with limited rates to the decision-maker; the decision-maker then performs statistical inference on the basis of the messages received from terminals {X_l}_{l=1}^{L} and its local information related to Y. In particular, we focus on the problem in which the decision-maker tries to decide whether P_{X_1···X_L Y} is the same as a given distribution Q_{X_1···X_L Y}, i.e., P_{X_1···X_L Y} = Q_{X_1···X_L Y}, or they are λ-far apart, i.e., ||P_{X_1···X_L Y} − Q_{X_1···X_L Y}||_1 ≥ λ (λ > 0), where || · ||_1 denotes the ℓ_1
norm of its argument. This identity-testing problem can be interpreted as the following
two hypothesis-testing problems.
• Problem 1:
H0 : ||PX1 ···XL Y − QX1 ···XL Y ||1 ≥ λ versus H1 : PX1 ···XL Y = QX1 ···XL Y . (14.67)
• Problem 2:
H0 : PX1 ···XL Y = QX1 ···XL Y versus H1 : ||PX1 ···XL Y − QX1 ···XL Y ||1 ≥ λ. (14.68)
In both problems, our goal is to characterize the type-2 error exponent under the
constraints on the communication rates and type-1 error probability.
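To make the λ-far criterion in (14.67) and (14.68) concrete, the short sketch below computes the ℓ1 distance between two joint PMFs on a finite alphabet; the PMFs and the value of λ are illustrative placeholders, not values taken from the chapter.

```python
import numpy as np

def l1_distance(p: np.ndarray, q: np.ndarray) -> float:
    """ell-1 distance between two joint PMFs stored as arrays of the same shape."""
    return float(np.abs(p - q).sum())

# Illustrative joint PMFs on a 2 x 2 alphabet (rows: X, columns: Y).
q_xy = np.array([[0.25, 0.25],
                 [0.25, 0.25]])     # reference distribution Q
p_xy = np.array([[0.40, 0.10],
                 [0.10, 0.40]])     # candidate distribution P

lam = 0.5
dist = l1_distance(p_xy, q_xy)
print(dist)                          # 0.6
print("lambda-far" if dist >= lam else "not lambda-far")
```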
This distributed identity-testing problem with composite hypotheses can be viewed
as a generalization of the problems considered in Section 14.2. Between the two possi-
ble problems defined in (14.67) and (14.68), Problem 2 is relatively simple and it can
be solved using similar schemes to those proposed in Section 14.2. In particular, the
encoding schemes and the definition of the acceptance regions at the decision-maker in
Section 14.2 depend only on the form of the PMF under H0 . Since the form of the PMF
under H0 in Problem 2 is known, we can apply the existing coding/decoding schemes
such as that in Section 14.2 and take the type-2 error probability as the supremum of the
type-2 error probabilities under each PX1 ···XL Y that satisfies ||PX1 ···XL Y − QX1 ···XL Y ||1 ≥ λ.
Moreover, it can be shown that these schemes are optimal for Problem 2. However, in
Problem 1, as H0 is composite, we need to design universal encoding/decoding schemes
so that our schemes can provide a guaranteed performance regardless of the true PMF
under H0 . In this section, we will focus on the more challenging Problem 1 with L = 1,
and, to simplify our presentation, we use terminal X and terminal Y to denote the two
terminals.
14.4.1 Model
To simplify the presentation, we assume that there are only terminal X and terminal Y.
Our goal is to determine whether the true joint distribution is the same as the given
distribution QXY or far away from it. We interpret this problem as a hypothesis-testing
problem with a composite null hypothesis and a simple alternative hypothesis:

$$H_0 : P_{XY} \in \Pi \quad \text{versus} \quad H_1 : P_{XY} = Q_{XY}, \qquad (14.69)$$

where Π = {P_{XY} ∈ P_{XY} : ||P_{XY} − Q_{XY}||_1 ≥ λ} and λ is some fixed positive number. The
model is shown in Fig. 14.8. In a typical identity-testing problem, one determines which
Figure 14.8 Model.
hypothesis is true under the assumption that (X n , Y n ) are fully available at the decision-
maker. We assume terminal X observes only X n and terminal Y observes only Y n .
Terminals X and Y are allowed to send encoded messages to the decision-maker, which
then decides which hypothesis is true using the encoded messages directly. We denote
the system as S XY in what follows.
More formally, the system consists of two encoders, f1 , f2 , and one decision function,
ψ. Terminal X has the encoder f1 , terminal Y has the encoder f2 , and the decision-
maker has the decision function ψ. After observing the data sequences xn ∈ Xn and
yn ∈ Yn , encoders f1 and f2 transform sequences xn and yn into messages f1 (xn ) and
f_2(y^n), respectively, which take values from the message sets M_1 and M_2:

$$f_1 : \mathcal{X}^n \to \mathcal{M}_1 = \{1, 2, \ldots, M_1\}, \qquad (14.70)$$
$$f_2 : \mathcal{Y}^n \to \mathcal{M}_2 = \{1, 2, \ldots, M_2\}. \qquad (14.71)$$
Using the messages f1 (X n ) and f2 (Y n ), the decision-maker will use the decision
function ψ to determine which hypothesis is true:
For any given decision function ψ, one can define the acceptance region as
For any given fi , i = 1, 2, and ψ, the type-1 error probability αn and the type-2 error
probability βn are defined as follows:
Given the type-1 error probability and type-2 error probability, we define the type-2
error exponents under different types of constraints on the type-1 error probability in a
similar way to that in Section 14.2. To distinguish this case from the basic model, we
use θid and σid to denote the type-2 error exponent under a constant-type constraint and
the type-2 error exponent under an exponential-type constraint, respectively.
where

$$\varphi_{P_{XY}} = \bigl\{ UV : I(U; X) \le R_1,\ I(V; Y) \le R_2,\ U \leftrightarrow X \leftrightarrow Y \leftrightarrow V \bigr\}. \qquad (14.78)$$
To achieve this lower bound, we design the following encoding/decoding scheme. For a given rate constraint R_1, terminal X first generates a quantization codebook containing 2^{nR_1} quantization sequences u^n. After observing x^n, terminal X picks one sequence u^n from the quantization codebook to describe x^n and sends this sequence to the decision-maker. Terminal Y then uses a similar scheme and sends its quantization sequence v^n. After receiving the descriptions from both terminals, the decision-maker employs a universal decoding method to declare either H_0 or H_1 true. More details about the universal decoding method can be found in [26].
To show a matching upper bound and lower bound, we assume the uniform positivity constraint

$$\rho_{\inf} \triangleq \inf_{Q_{XY} \in \Xi}\ \min_{(x,y) \in \mathcal{X}\times\mathcal{Y}} Q_{XY}(x, y) > 0. \qquad (14.81)$$
with

$$\mathcal{H}_r^{P_{XY}} = \bigl\{\hat{P}_{XY} : \hat{P}_X = \tilde{P}_X,\ \hat{P}_Y = \tilde{P}_Y \ \text{for some } \tilde{P}_{XY} \in \varphi_r^{P_{XY}}\bigr\},$$

where

$$\varphi_r^{P_{XY}} = \bigl\{\tilde{P}_{XY} : D(\tilde{P}_{XY} \,\|\, P_{XY}) \le r\bigr\}.$$
To achieve this bound, we use the same encoding scheme as in Section 14.2, but
employ a universal decoding scheme at the decision-maker so that the type-1 error con-
straint is satisfied regardless of what the true value of PXY is. One can certainly design
an individual acceptance region that satisfies the type-1 error constraint for each possi-
ble value of PXY ∈ Π using the approach in the simple hypothesis case, and then take
the union of these individual regions as the final acceptance region. This will clearly sat-
isfy the type-1 error constraint regardless of the true value of PXY . This approach might
work if the number of possible distributions P_{XY} were finite or grew polynomially in n. However, in our case, there are uncountably many possible P_{XY} ∈ Π, and this approach would lead to a very loose performance bound. Hence, we need to design a new approach that
will lead to a performance bound matching with the converse bound. More details can
be found in [26].
where

$$\varphi_{P_{XY}} = \Bigl\{ (\omega, \nu) :\ \tilde{R}_1 \ge I(X; U),\ \tilde{R}_2 \ge I(Y; V),\ \tilde{R}_1 - R_1 \le I(U; V),\ \tilde{R}_2 - R_2 \le I(U; V),\ \tilde{R}_1 - R_1 + \tilde{R}_2 - R_2 \le I(U; V),\ P_{U|X} = \omega(u|x; P_X),\ P_{V|Y} = \nu(v|y; P_Y),\ U \leftrightarrow X \leftrightarrow Y \leftrightarrow V \Bigr\} \qquad (14.85)$$

and

$$\xi_{P_{XY}} = \bigl\{\tilde{P}_{UVXY} : \tilde{P}_{UX} = P_{UX},\ \tilde{P}_{VY} = P_{VY},\ \tilde{P}_{UV} = P_{UV}\bigr\}. \qquad (14.86)$$

Here φ_{P_{XY}} denotes the set of pairs (ω, ν) when the distribution of (X, Y) is P_{XY}. The notation ξ_{P_{XY}} has a similar interpretation.
To achieve this bound, we employ a universal encoding/decoding scheme with a
binning method. More details can be found in [26].
where

$$\phi_{P_{XY}}(R_1, r) = \Bigl\{ \omega \in \Phi :\ \max_{\tilde{X}:\ D(\tilde{X}\|X) \le r,\ P_{U|\tilde{X}} = \omega(\tilde{X})} I(U; \tilde{X}) \le R_1 \Bigr\},$$

$$\Xi_{P_{XY}}(\omega) = \Bigl\{ \tilde{P}_{UXY} :\ D(\tilde{P}_{UXY} \,\|\, P_{UXY}) \le r,\ \tilde{P}_{U|X} = P_{U|X} = \omega(X),\ U \leftrightarrow X \leftrightarrow Y \Bigr\},$$
$$\hat{\Xi}_{P_{XY}}(\omega) = \Bigl\{ \hat{P}_{UXY} :\ \hat{P}_{UX} = \tilde{P}_{UX},\ \hat{P}_{UY} = \tilde{P}_{UY}\ \text{for some } \tilde{P}_{UXY} \in \Xi_{P_{XY}}(\omega) \Bigr\},$$
14.5.1 Conclusion
This chapter has explored distributed inference problems from an information-theoretic
perspective. First, we discussed distributed inference problems with non-interactive
encoders. Second, we considered distributed testing problems with interactive encoders.
We investigated the case of cascaded communication among multiple terminals and
then discussed the fully interactive communication between two terminals. Finally, we
studied the distributed identity-testing problem, in which the decision-maker decides
whether the distribution indirectly revealed from the compressed data from multiple
distributed terminals is the same as or λ-far from a given distribution.
Acknowledgments
The work of W. Zhao and L. Lai was supported by the National Science Foundation
under grants CNS-1660128, ECCS-1711468, and CCF-1717943.
References
[29] S. Salehkalaibar, M. Wigger, and R. Timo, “On hypothesis testing against conditional inde-
pendence with multiple decision centers,” IEEE Trans. Communications, vol. 66, no. 6,
pp. 2409–2420, 2018.
[30] J. D. Lee, Y. Sun, Q. Liu, and J. E. Taylor, “Communication-efficient sparse regression: A
one-shot approach,” arXiv:1503.04337, 2015.
[31] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas,
“Communication-efficient learning of deep networks from decentralized data,”
arXiv:1602.05629, 2016.
[32] J. Konečný, “Stochastic, distributed and federated optimization for machine learning,”
arXiv:1707.01155, 2017.
[33] V. M. A. Martin, K. David, and B. Merlinsuganthi, “Distributed data clustering: A compar-
ative analysis,” Int. J. Sci. Res. Computer Science, Engineering and Information Technol.,
vol. 3, no. 3, article CSEITI83376, 2018.
15 Network Functional Compression
Soheil Feizi and Muriel Médard
Summary
In this chapter, we study the problem of compressing for function computation across
a network from an information-theoretic point of view. We refer to this problem as
network functional compression. In network functional compression, computation of a
function (or some functions) of sources located at certain nodes in a network is desired
at receiver(s). The rate region of this problem has been considered in the literature
under certain restrictive assumptions, particularly in terms of the network topology,
the functions, and the characteristics of the sources. In this chapter, we present results
that significantly relax these assumptions. For a one-stage tree network, we character-
ize a rate region by introducing a necessary and sufficient condition for any achievable
coloring-based coding scheme called the coloring connectivity condition (CCC). We
also propose a modularized coding scheme based on graph colorings to perform arbi-
trarily closely to derived rate lower bounds. For a general tree network, we provide a rate
lower bound based on graph entropies and show that this bound is tight in the case of hav-
ing independent sources. In particular, we show that, in a general tree network case with
independent sources, to achieve the rate lower bound, intermediate nodes should per-
form computations. However, for a family of functions and random variables, which we
call chain-rule proper sets, it is sufficient to have no computations at intermediate nodes
in order for the system to perform arbitrarily closely to the rate lower bound. Moreover,
we consider practical issues of coloring-based coding schemes and propose an efficient
algorithm to compute a minimum-entropy coloring of a characteristic graph under some
conditions on source distributions and/or the desired function. Finally, extensions of
these results for cases with feedback and lossy function computations are discussed.
15.1 Introduction
In modern applications, data are often stored in clouds in a distributed way (i.e., differ-
ent portions of data are located in different nodes in the cloud/network). Therefore, to
compute certain functions of the data, nodes in the network need to communicate with
each other. Depending on the desired computation, however, nodes can first compress
the data before transmitting them over the network, thereby reducing communication costs. We refer to this problem as network functional compression.
In this chapter, we consider different aspects of this problem from an information-
theoretic point of view. In the network functional compression problem, we would like to
compress source random variables for the purpose of computing a deterministic function
(or some deterministic functions) at the receiver(s), when these sources and receivers
are nodes in a network. Traditional data-compression schemes are special cases of func-
tional compression, where the desired function is the identity function. However, if the
receiver is interested in computing a function (or some functions) of sources, further
compression is possible.
Several approaches have been applied to investigate different aspects of this prob-
lem. One class of works considered the functional computation problem for specific
functions. For example, Kowshik and Kumar [2] investigated computation of symmet-
ric Boolean functions in tree networks, and Shenvi and Dey [3] and Ramamoorthy [4]
studied the sum network with three sources and three terminals. Other authors investi-
gated the asymptotic analysis of the transmission rate in noisy broadcast networks [5],
and also in random geometric graph models (e.g., [6, 7]). Also, Ma et al. [8] investi-
gated information-theoretic bounds for multiround function computation in collocated
networks. Network flow techniques (also known as multi-commodity methods) have
been used to study multiple unicast problems [9, 10]. Shah et al. [11] used this frame-
work, with some modifications, for function computation considering communication
constraints.
A major body of work on in-network computation investigates information-theoretic
rate bounds, when a function of sources is desired to be computed at the receiver. These
works can be categorized into the study of lossless functional compression and that
of functional compression with distortion. By lossless computation, we mean asymp-
totically lossless computation of a function: the error probability goes to zero as the
blocklength goes to infinity. However, there are several works investigating zero-error
computation of functions (e.g., [12, 13]).
Shannon was the first to consider the function computation problem in [14] for a spe-
cial case when f (X1 , X2 ) = (X1 , X2 ) (the identity function) and for the network topology
depicted in Fig. 15.1(a) (the side-information problem). For a general function, Orlitsky
and Roche provided a single-letter characterization of the rate region in [15]. In [16],
Doshi et al. proposed an optimal coding scheme for this problem.
For the network topology depicted in Fig. 15.1(b), and for the case in which the
desired function at the receiver is the identity function (i.e., f (X1 , X2 ) = (X1 , X2 )), Slepian
Figure 15.1 (a) Functional compression with side information. (b) A distributed functional compression problem with two transmitters and a receiver.
and Wolf [17] provided a characterization of the rate region and an optimal achievable
coding scheme. Some other practical but sub-optimal coding schemes were proposed
by Pradhan and Ramchandran [18]. Also, a rate-splitting technique for this problem
was developed in [19, 20]. Special cases when f (X1 , X2 ) = X1 and f (X1 , X2 ) = (X1 + X2 )
mod 2 have been investigated by Ahlswede and Körner [21] and by Körner and Marton
[22], respectively. Under some special conditions on source distributions, Doshi et al.
[16] investigated this problem for a general function and proposed some achievable
coding schemes.
There have been several prior works that studied lossy functional compression where
the function at the receiver is desired to be computed within a distortion level. Wyner and
Ziv [23] considered the side-information problem for computing the identity function at
the receiver within some distortion. Yamamoto [24] solved this problem for a general
function f (X1 , X2 ). Doshi et al. [16] gave another characterization of the rate-distortion
function given by Yamamoto. Feng et al. [25] considered the side-information problem
for a general function at the receiver in the case in which the encoder and decoder have
some noisy information. For the distributed function computation problem and for a
general function, the rate-distortion region remains unknown, but some bounds have
been given by Berger and Yeung [26], Barros and Servetto [27], and Wagner et al. [28],
who considered a specific quadratic distortion function.
In this chapter, we present results that significantly relax previously considered restric-
tive assumptions, particularly in terms of the network topology, the functions, and the
characteristics of the sources. For a one-stage tree network, we introduce a necessary and
sufficient condition for any achievable coloring-based coding scheme called the coloring
connectivity condition (CCC), thus relaxing the previous sufficient zigzag condition of
Doshi et al. [16]. By using the CCC, we characterize a rate region for distributed func-
tional compression and propose a modularized coding scheme based on graph colorings
in order for the system to perform arbitrarily closely to rate lower bounds. These results
are presented in Section 15.3.1.
In Section 15.3.2, we consider a general tree network and provide a rate lower bound
based on graph entropies. We show that this bound is tight in the case with independent
sources. In particular, we show that, to achieve the rate lower bound, intermediate nodes
should perform computations. However, for a family of functions and random variables,
which we call chain-rule proper sets, it is sufficient to have intermediate nodes act like
relays (i.e., no computations are performed at intermediate nodes) in order for the system
to perform arbitrarily closely to the rate lower bound.
In Section 15.3.3, we discuss practical issues of coloring-based coding schemes and
propose an efficient algorithm to compute a minimum-entropy coloring of a character-
istic graph under some conditions on source distributions and/or the desired function.
Finally, extensions of proposed results for cases with feedback and lossy function
computations are discussed in Section 15.4. In particular, we show that, in functional
compression, unlike in the Slepian–Wolf case, by having feedback, one may outperform
the rate bounds of the case without feedback. These results extend those of Bakshi
et al. We also present a practical coding scheme for the distributed lossy functional
compression problem with a non-trivial performance guarantee.
In this section, we set up the functional compression problem and review some
prior work.
A rate tuple of the network is the set of rates of its edges (i.e., {R_i} for all valid i). We say a rate tuple is achievable iff there exists a coding scheme operating at these rates so that P^n_e → 0 as n → ∞. The achievable rate region is the set closure of the set of all achievable rate tuples.
Example 15.1 To illustrate the idea of confusability and the characteristic graph,
consider two random variables X1 and X2 such that X1 = {0, 1, 2, 3} and X2 = {0, 1},
where they are uniformly and independently distributed on their own supports. Sup-
pose f (X1 , X2 ) = (X1 + X2 ) mod 2 is to be perfectly reconstructed at the receiver. Then,
the characteristic graph of X1 with respect to X2, p(x1, x2) = 1/8, and f is as shown in
Fig. 15.2(a). Similarly, G X2 is depicted in Fig. 15.2(b).
definition 15.3 Given a graph G_{X_1} = (V_{X_1}, E_{X_1}) and a distribution on its vertices V_{X_1}, the graph entropy is

$$H_{G_{X_1}}(X_1) = \min_{X_1 \in W_1 \in \Gamma(G_{X_1})} I(W_1; X_1), \qquad (15.2)$$

where Γ(G_{X_1}) denotes the set of all maximal independent sets of G_{X_1}.
Figure 15.2 Characteristic graphs (a) G_{X_1} and (b) G_{X_2} for the setup of Example 15.1. (Different letters written over graph vertices indicate different colors.)
Example 15.2 Consider the scenario described in Example 15.1. For the charac-
teristic graph of X1 shown in Fig. 15.2(a), the set of maximal independent sets is
W1 = {{0, 2}, {1, 3}}. To minimize I(X1 ; W1 ) = H(X1 ) − H(X1 |W1 ) = log(4) − H(X1 |W1 ),
one should maximize H(X1 |W1 ). Because of the symmetry of the problem, to maxi-
mize H(X1 |W1 ), p(w1 ) must be uniform over two possible maximal independent sets
of G X1. Since each maximal independent set w1 ∈ W1 has two X1 values, H(X1 |w1 ) =
log(2) bit, and since p(w1 ) is uniform, H(X1 |W1 ) = log(2) bit. Therefore, HG X1 (X1 ) =
log(4) − log(2) = 1 bit. One can see that, if we want to encode X1 ignoring the effect of
the function f , we need H(X1 ) = log(4) = 2 bits. We will show that, for this example,
functional compression saves us 1 bit in every 2 bits compared with the traditional data
compression.
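The characteristic graph in Example 15.1 is small enough to build by brute force. The following sketch is our own illustrative code (with a naive maximal-independent-set search); it constructs G_{X_1} for f(x_1, x_2) = (x_1 + x_2) mod 2 and recovers the sets {0, 2} and {1, 3} used in Example 15.2.

```python
from itertools import combinations

X1 = [0, 1, 2, 3]
X2 = [0, 1]
p = {(x1, x2): 1 / 8 for x1 in X1 for x2 in X2}   # uniform, independent
f = lambda x1, x2: (x1 + x2) % 2

# Two vertices x1, x1' are connected iff some x2 with positive probability
# distinguishes them through f (the confusability condition).
edges = {
    (a, b)
    for a, b in combinations(X1, 2)
    if any(p[(a, x2)] * p[(b, x2)] > 0 and f(a, x2) != f(b, x2) for x2 in X2)
}
print(sorted(edges))            # [(0, 1), (0, 3), (1, 2), (2, 3)]

def independent(s):
    return all((a, b) not in edges and (b, a) not in edges
               for a, b in combinations(s, 2))

# Maximal independent sets: independent sets to which no vertex can be added.
mis = [set(s) for r in range(1, 5) for s in combinations(X1, r)
       if independent(s) and not any(independent(set(s) | {v})
                                     for v in X1 if v not in s)]
print(mis)                      # [{0, 2}, {1, 3}]
```

Assigning one color to each of these two sets and compressing the resulting color random variable requires 1 bit per symbol, matching H_{G_{X_1}}(X_1) = 1 bit from Example 15.2.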
Witsenhausen [30] showed that the chromatic number of the strong graph-product
characterizes the minimum rate at which a single source can be encoded so that the
identity function of that source can be computed with zero distortion. Orlitsky and
Roche [15] defined an extension of Körner’s graph entropy, the conditional graph
entropy.
definition 15.4 The conditional graph entropy is
HG X1 (X1 |X2 ) = min I(W1 ; X1 |X2 ). (15.3)
X1 ∈W1 ∈Γ(G X1 )
W1 −X1 −X2
The notation W1 −X1 −X2 indicates a Markov chain. If X1 and X2 are independent,
HG X1 (X1 |X2 ) = HG X1 (X1 ). To illustrate this concept, let us consider an example borrowed
from [15].
connected to each other (i.e., this set is a clique of G X1 ). Since the intersection of a
clique and a maximal independent set is a singleton, X2 and the maximal independent
set W1 containing X1 determine X1 . So,
$$H_{G_{X_1}}(X_1 \mid X_2) = \min_{\substack{X_1 \in W_1 \in \Gamma(G_{X_1}) \\ W_1 - X_1 - X_2}} I(W_1; X_1 \mid X_2)$$
Example 15.4 Consider again the random variable X1 described in Example 15.1,
whose characteristic graph G X1 and its valid coloring are shown in Fig. 15.2a. One
can see that, in this coloring, two connected vertices are assigned to different colors.
Specifically, cG X1 (X1 ) = {r, b}. Therefore, p(cG X1 (x1i ) = r) = p(x1i = 0) + p(x1i = 2) and
p(cG X1 (x1i ) = b) = p(x1i = 1) + p(x1i = 3).
definition 15.6 The nth power of a graph G_{X_1} is a graph G^n_{X_1} = (V^n_{X_1}, E^n_{X_1}) such that V^n_{X_1} = X^n_1 and (x_1^1, x_1^2) ∈ E^n_{X_1} when there exists at least one i such that (x_{1i}^1, x_{1i}^2) ∈ E_{X_1}. We denote a valid coloring of G^n_{X_1} by c_{G^n_{X_1}}(X_1).
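A brute-force construction of the power graph for small n follows directly from this definition. The sketch below is illustrative code that reuses the edge set of Example 15.1 and builds G^2_{X_1} for the graph of Fig. 15.2(a).

```python
from itertools import product

# Edge set of G_X1 from Example 15.1 (vertices 0..3, stored as frozensets).
base_edges = {frozenset(e) for e in [(0, 1), (0, 3), (1, 2), (2, 3)]}
X1 = [0, 1, 2, 3]
n = 2

# Vertices of the n-th power graph are length-n sequences over X1.
vertices = list(product(X1, repeat=n))

# Two sequences are connected iff at least one coordinate is an edge of G_X1.
power_edges = {
    (u, v)
    for u in vertices for v in vertices if u < v
    and any(frozenset((a, b)) in base_edges for a, b in zip(u, v))
}
print(len(vertices), len(power_edges))   # 16 vertices; 96 edges
```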
One may ignore atypical sequences in a sufficiently large power graph of a conflict
graph and then color that graph. This coloring is called an ε-coloring of a graph and is
defined as follows.
definition 15.8

$$H^{\chi}_{G_{X_1}}(X_1) = \min_{c_{G_{X_1}}\ \text{is an}\ \epsilon\text{-coloring of}\ G_{X_1}} H\bigl(c_{G_{X_1}}(X_1)\bigr).$$

definition 15.9

$$H^{\chi}_{G_{X_1}}(X_1 \mid X_2) = \min_{c_{G_{X_1}}\ \text{is an}\ \epsilon\text{-coloring of}\ G_{X_1}} H\bigl(c_{G_{X_1}}(X_1) \,\big|\, X_2\bigr).$$

Regardless of ε, the above optimizations are minima, rather than infima, because there are finitely many subgraphs of any fixed graph G_{X_1}, and therefore there are only finitely many ε-colorings, regardless of ε.
In general, these optimizations are NP-hard [33]. But, depending on the desired func-
tion f , there are some interesting cases for which optimal solutions can be computed
efficiently. We discuss these cases in Section 15.3.3.
Körner showed in [31] that, in the limit of large n, there is a relation between the
chromatic entropy and the graph entropy.
theorem 15.1

$$\lim_{n\to\infty} \frac{1}{n} H^{\chi}_{G^n_{X_1}}(X_1) = H_{G_{X_1}}(X_1). \qquad (15.5)$$
This theorem implies that the receiver can asymptotically compute a deterministic
function of a discrete memoryless source. The source first colors a sufficiently large
power of the characteristic graph of the random variable with respect to the func-
tion, and then encodes achieved colors using any encoding scheme that achieves the
entropy bound of the coloring random variable. In the previous approach, to achieve
the encoding rate close to graph entropy of X1 , one should find the optimal distribu-
tion over the set of maximal independent sets of G X1. However, this theorem allows
us to find the optimal coloring of GnX , instead of the optimal distribution on maxi-
1
mal independent sets. One can see that this approach modularizes the encoding scheme
into two parts, a graph-coloring module, followed by a Slepian–Wolf compression
module.
The conditional version of the above theorem is proven in [16].
theorem 15.2

$$\lim_{n\to\infty} \frac{1}{n} H^{\chi}_{G^n_{X_1}}(X_1 \mid X_2) = H_{G_{X_1}}(X_1 \mid X_2). \qquad (15.6)$$
This theorem implies a practical encoding scheme for the problem of functional com-
pression with side information where the receiver wishes to compute f (X1 , X2 ), when
X2 is available at the receiver as the side information. Orlitsky and Roche showed in
[15] that HG X1 (X1 |X2 ) is the minimum achievable rate for this problem. Their proof uses
random coding arguments and shows the existence of an optimal coding scheme. This
theorem presents a modularized encoding scheme where one first finds the minimum-entropy coloring of G^n_{X_1} for large enough n, and then uses a compression scheme on the coloring random variable (such as the Slepian–Wolf scheme in [17]) to achieve a rate arbitrarily close to H(c_{G^n_{X_1}}(X_1) | X_2). This encoding scheme guarantees computation of the function at the receiver with a vanishing probability of error.
All these results considered only functional compression with side information at the
receiver (Fig. 15.1(a)). In general, the rate region of the distributed functional com-
pression problem (Fig. 15.1(b)) has not been determined. However, Doshi et al. [16]
characterized a rate region of this network when source random variables satisfy a
condition called the zigzag condition, defined below.
We refer to the ε-joint-typical set of sequences of random variables X_1, ..., X_k as T^n_ε; k is implied in this notation for simplicity. T^n_ε can be considered as a strong or weak typical set [29].
definition 15.10 A discrete memoryless source {(X_{1i}, X_{2i})}_{i∈N} with a distribution p(x_1, x_2) satisfies the zigzag condition if for any ε and some n, and any (x_1^1, x_2^1), (x_1^2, x_2^2) ∈ T^n_ε, there exists some (x_1^3, x_2^3) ∈ T^n_{ε/2} such that (x_1^3, x_2^i), (x_1^i, x_2^3) ∈ T^n_ε for each i ∈ {1, 2}, and (x_{1j}^3, x_{2j}^3) = (x_{1j}^i, x_{2j}^{3−i}) for some i ∈ {1, 2} for each j.
In fact, the zigzag condition forces many source sequences to be typical. Doshi et al.
[16] show that, if the source random variables satisfy the zigzag condition, an achievable
rate region for this network is the set of all rates that can be achieved through graph
colorings. The zigzag condition is a restrictive condition which does not depend on the
desired function at the receiver. This condition is not necessary, but it is sufficient. In the
next section, we relax this condition by introducing a necessary and sufficient condition
for any achievable coloring-based coding scheme and characterize a rate region for the
distributed functional compression problem.
In this section, we present the main results for network functional compression.
Example 15.5 Suppose we have two random variables X1 and X2 with characteris-
tic graphs G X1 and G X2 . Let us assume cG X1 and cG X2 are two valid colorings of G X1
and G_{X_2}, respectively. Assume c_{G_{X_1}}(x_1^1) = c_{G_{X_1}}(x_1^2) and c_{G_{X_2}}(x_2^1) = c_{G_{X_2}}(x_2^2). Suppose j_1^c represents this joint coloring class; in other words, j_1^c = {(x_1^i, x_2^j)} for all 1 ≤ i, j ≤ 2 such that p(x_1^i, x_2^j) > 0. Figure 15.3 considers two different cases. The first case is when p(x_1^1, x_2^2) = 0 and all other points have non-zero probability. It is illustrated in Fig. 15.3(a). One can see that there exists a path between any two points in this joint coloring class. Therefore, this joint coloring class satisfies the CCC. If the other joint coloring classes of c_{G_{X_1}} and c_{G_{X_2}} satisfy the CCC, we say c_{G_{X_1}} and c_{G_{X_2}} satisfy the CCC. Now, consider the second case depicted in Fig. 15.3(b). In this case, we have p(x_1^1, x_2^2) = 0, p(x_1^2, x_2^1) = 0, and all other points have non-zero probability. One can see that there is no path between (x_1^1, x_2^1) and (x_1^2, x_2^2) in j_1^c. So, though these two points belong to the same joint coloring class, their corresponding function values can be different from each other. Thus, j_1^c does not satisfy the CCC for this example. Therefore, c_{G_{X_1}} and c_{G_{X_2}} do not satisfy the CCC.
Figure 15.3 Two examples of a joint coloring class: (a) satisfying the CCC and (b) not satisfying the CCC. Dark squares indicate points with zero probability. Function values are depicted in the picture.
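The connectivity check behind the CCC can be illustrated with a few lines of code. The sketch below is our own illustration of Example 15.5: it treats two points of a joint coloring class as adjacent when they have positive probability and agree in exactly one coordinate (our reading of the path notion used in the example), and declares the class CCC-satisfying when the resulting graph is connected.

```python
from collections import deque

def satisfies_ccc(points):
    """Check whether all positive-probability points of a joint coloring class
    are connected, where two points are adjacent iff they agree in exactly
    one of the two coordinates."""
    pts = list(points)
    if len(pts) <= 1:
        return True
    adj = {p: [q for q in pts if q != p and
               (q[0] == p[0]) != (q[1] == p[1])] for p in pts}
    seen, queue = {pts[0]}, deque([pts[0]])
    while queue:
        for q in adj[queue.popleft()]:
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return len(seen) == len(pts)

# Fig. 15.3(a): only (x1^1, x2^2) has zero probability -> CCC holds.
case_a = [("x1_1", "x2_1"), ("x1_2", "x2_1"), ("x1_2", "x2_2")]
# Fig. 15.3(b): (x1^1, x2^2) and (x1^2, x2^1) have zero probability -> CCC fails.
case_b = [("x1_1", "x2_1"), ("x1_2", "x2_2")]
print(satisfies_ccc(case_a), satisfies_ccc(case_b))   # True False
```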
There are several examples of source distributions and functions that satisfy the
CCC.
lemma 15.1 Consider two random variables X1 and X2 with characteristic graphs G X1
and G X2 and any valid colorings cG X1 (X1 ) and cG X2 (X2 ), respectively, where cG X2 (X2 ) is
a trivial coloring, assigning different colors to different vertices (to simplify the notation,
we use cG X2 (X2 ) = X2 to refer to this coloring). These colorings satisfy the CCC. Also,
c_{G^n_{X_1}}(X_1) and c_{G^n_{X_2}}(X_2) = X_2 satisfy the CCC, for any n.
lemma 15.4 If two random variables X1 and X2 with characteristic graphs G X1 and G X2
satisfy the zigzag condition, any valid colorings cG X1 and cG X2 of G X1 and G X2 satisfy
the CCC, but not vice versa.
Proof of this lemma is presented in Section 15.5.4.
We use the CCC to characterize a rate region of functional compression for a one-stage
tree network as follows.
definition 15.15 For random variables X1 , . . . , Xk with characteristic graphs G X1 , . . . ,
G Xk , the joint graph entropy is defined as follows:
$$H_{G_{X_1},\ldots,G_{X_k}}(X_1, \ldots, X_k) \triangleq \lim_{n\to\infty}\ \min_{c_{G^n_{X_1}},\ldots,c_{G^n_{X_k}}} \frac{1}{n} H\bigl(c_{G^n_{X_1}}(X_1), \ldots, c_{G^n_{X_k}}(X_k)\bigr), \qquad (15.7)$$

in which c_{G^n_{X_1}}(X_1), ..., c_{G^n_{X_k}}(X_k) are ε-colorings of G^n_{X_1}, ..., G^n_{X_k} satisfying the CCC.
We refer to this joint graph entropy as H[G Xi ]i∈S , where S = {1, 2, . . . , k}. Note that this
limit exists because we have a monotonically decreasing sequence bounded from below.
Similarly, we can define the conditional graph entropy.
definition 15.16 For random variables X1 , . . . , Xk with characteristic graphs G X1 , . . . ,
G Xk , the conditional graph entropy can be defined as follows:
where the minimization is over c_{G^n_{X_1}}(X_1), ..., c_{G^n_{X_k}}(X_k), which are ε-colorings of G^n_{X_1}, ..., G^n_{X_k} satisfying the CCC.
lemma 15.5 For k = 2, Definitions 15.4 and 15.14 are the same.
Proof of this lemma is presented in Section 15.5.5.
Note that, by this definition, the graph entropy does not satisfy the chain rule.
Suppose S(k) denotes the power set of the set {1, 2, . . . , k} excluding the empty subset.
Then, for any S ∈ S(k),
X_S ≜ {X_i : i ∈ S}.
Let S^c denote the complement of S in {1, 2, ..., k}. For S = {1, 2, ..., k}, denote S^c as the
empty set. To simplify the notation, we refer to a subset of sources by XS . For instance,
S(2) = {{1}, {2}, {1, 2}}, and for S = {1, 2}, we write H[G Xi ]i∈S (XS ) instead of HG X1 ,G X2
(X1 , X2 ).
theorem 15.3 A rate region of a one-stage tree network is characterized by the
following conditions:
$$\forall S \in S(k) \;\Longrightarrow\; \sum_{i\in S} R_i \ge H_{[G_{X_i}]_{i\in S}}(X_S \mid X_{S^c}). \qquad (15.9)$$
corollary 15.1 A rate region of the network shown in Fig. 15.1(b) is determined by the following three conditions:

$$R_1 \ge H_{G_{X_1}}(X_1 \mid X_2), \qquad R_2 \ge H_{G_{X_2}}(X_2 \mid X_1), \qquad R_1 + R_2 \ge H_{G_{X_1},G_{X_2}}(X_1, X_2). \qquad (15.10)$$
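The region in (15.9) has one constraint per non-empty subset of sources, and its structure can be enumerated mechanically. The sketch below is illustrative code with placeholder symbolic output; for k = 2 it reproduces exactly the three conditions of Corollary 15.1.

```python
from itertools import combinations

def rate_constraints(k):
    """Enumerate the constraints of (15.9): one per non-empty subset S of {1,...,k}."""
    sources = range(1, k + 1)
    out = []
    for r in range(1, k + 1):
        for S in combinations(sources, r):
            Sc = tuple(i for i in sources if i not in S)
            lhs = " + ".join(f"R{i}" for i in S)
            rhs = f"H_graph(X{S}" + (f" | X{Sc})" if Sc else ")")
            out.append(f"{lhs} >= {rhs}")
    return out

print(*rate_constraints(2), sep="\n")
# R1 >= H_graph(X(1,) | X(2,))
# R2 >= H_graph(X(2,) | X(1,))
# R1 + R2 >= H_graph(X(1, 2))
```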
Algorithm 15.1
The following algorithm proposes a modularized coding scheme which performs arbitrarily closely to the rate bounds of Theorem 15.3.
• Source nodes compute ε-colorings of a sufficiently large power of their characteristic graphs satisfying the CCC, followed by Slepian–Wolf compression.
• The receiver first uses a Slepian–Wolf decoder to decode the transmitted coloring variables. Then, it uses a look-up table to compute the function values.
The achievability proof of this algorithm directly follows from the proof of Theorem 15.3.
Figure 15.4 A general tree network with source nodes X1, ..., X4 (rates R1, ..., R4), two intermediate nodes (rates R5 and R6), and a receiver.
The problem of function computations for a general tree network has been considered
in [13, 34]. Kowshik and Kumar [13] derive a necessary and sufficient condition for the
encoders on each edge of the tree for a zero-error computation of the desired function.
Appuswamy et al. [34] show that, for a tree network with independent sources, a min-
cut rate is a tight upper bound. Here, we consider an asymptotically lossless functional
compression problem. For a general tree network with correlated sources, we derive rate
bounds using graph entropies. We show that these rates are achievable for the case of
independent sources and propose a modularized coding scheme based on graph color-
ings that performs arbitrarily closely to rate bounds. We also show that, for a family of
functions and random variables, which we call chain-rule proper sets, it is sufficient to
have no computations at intermediate nodes in order for the system to perform arbitrarily
closely to the rate lower bound.
In the tree network depicted in Fig. 15.4, nodes {1, . . . , 4} represent source nodes,
nodes {5, 6} are intermediate nodes, and node 7 is the receiver. The receiver wishes to
compute a deterministic function of source random variables. Intermediate nodes have
no demand of their own, but they are allowed to perform computation. Computing the
desired function f at the receiver is the only demand of the network. For this network, we
compute a rate lower bound and show that this bound is tight in the case of independent
sources. We also propose a modularized coding scheme to perform arbitrarily closely to
derived rate lower bounds in this case.
Sources transmit variables M1 , . . . , M4 through links e1 , . . . , e4 , respectively. Interme-
diate nodes transmit variables M5 and M6 over e5 and e6 , respectively, where M5 =
g5 (M1 , M2 ) and M6 = g6 (M3 , M4 ).
Let S(4) and S(5, 6) be the power sets of the set {1, . . . , 4} and the set {5, 6} except the
empty set, respectively.
theorem 15.4 A rate lower bound for the tree network of Fig. 15.4 can be characterized
as follows:
$$\forall S \in S(4) \;\Longrightarrow\; \sum_{i\in S} R_i \ge H_{[G_{X_i}]_{i\in S}}(X_S \mid X_{S^c}),$$
$$\forall S \in S(5, 6) \;\Longrightarrow\; \sum_{i\in S} R_i \ge H_{[G_{X_i}]_{i\in S}}(X_S \mid X_{S^c}). \qquad (15.11)$$
Proof of this theorem is presented in Section 15.5.7. Note that the result of
Theorem 15.4 can be extended to a general tree network topology.
In the following, we show that, for independent source variables, the rate bounds
of Theorem 15.4 are tight, and we propose a coding scheme that performs arbitrarily
closely to these bounds.
satisfying the CCC. The following coding scheme performs arbitrarily closely to rate
bounds of Theorem 15.4.
Source nodes first compute colorings of high-probability subgraphs of their characteristic graphs satisfying the CCC, and then perform source coding on these coloring random variables. Intermediate nodes first compute their parents' coloring random variables and then, by using a look-up table, find the source values corresponding to the received colorings. Then, they compute ε-colorings of their own characteristic graphs. The source values corresponding to the received colorings form an independent set in the graph. If all of them are assigned to a single color in the minimum-entropy coloring, intermediate nodes send this coloring random variable followed by source coding. However, if the vertices of this independent set are assigned to different colors, intermediate nodes send the coloring with the lowest entropy followed by source coding (Slepian–Wolf). The receiver first
performs a minimum-entropy decoding [29] on its received information and achieves
coloring random variables. Then, it uses a look-up table to compute its desired function
by using the achieved colorings.
In the following, we summarize this proposed algorithm.
Algorithm 15.2
The following algorithm proposes a modularized coding scheme which performs
arbitrarily closely to the rate bounds of Theorem 15.4 when the sources are independent.
• Source nodes compute ε-colorings of a sufficiently large power of their characteristic graphs satisfying the CCC, followed by Slepian–Wolf compression.
• Intermediate nodes compute ε-colorings of a sufficiently large power of their characteristic graphs by using their parents' colorings.
• The receiver first uses a Slepian–Wolf decoder to decode transmitted coloring
variables. Then, it uses a look-up table to compute the function values.
The achievability proof of this algorithm is presented in Section 15.5.8. Also, in Sec-
tion 15.3.3, we show that minimum-entropy colorings of independent random variables
can be computed efficiently.
Figure 15.5 A non-zero joint probability condition is necessary for Theorem 15.6. A dark square represents a zero-probability point.
Quantization Functions
In this section, we consider some special functions which lead to practical minimum-
entropy coloring computation.
An interesting function in this context is a quantization function. A natural quan-
tization function is a function which separates the X1 −X2 plane into some rectangles
so that each rectangle corresponds to a different value of that function. The sides of
these rectangles are parallel to the plane axes. Figure 15.6(a) depicts such a quantization
function.
Figure 15.6 (a) A quantization function. Function values are depicted in the figure on each rectangle. (b) By extending the sides of rectangles, the plane is covered by some function regions.
Given a quantization function, one can extend different sides of each rectangle in
the X1 −X2 plane. This may make some new rectangles. We call each of them a func-
tion region. Each function region can be determined by two subsets of X1 and X2 . For
example, in Fig. 15.6(b), one of the function regions is distinguished by the shaded area.
definition 15.18 Consider two function regions X_1^1 × X_2^1 and X_1^2 × X_2^2. If, for any x_1^1 ∈ X_1^1 and x_1^2 ∈ X_1^2, there exists x_2^1 such that p(x_1^1, x_2^1) p(x_1^2, x_2^1) > 0 and f(x_1^1, x_2^1) ≠ f(x_1^2, x_2^1), we say these two function regions are pair-wise X_1-proper.
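A direct check of Definition 15.18 for two candidate function regions can be written as below. This is illustrative code: the regions, PMF, and function are placeholders (not taken from Fig. 15.6), and scanning the X2-part of the second region for the witness x2 is our own illustrative choice.

```python
def pairwise_x1_proper(region_a, region_b, p, f):
    """Definition 15.18: for every x1 in region_a's X1-part and every x1' in
    region_b's X1-part, some x2 has positive probability with both and
    distinguishes them through f."""
    X1_a, _ = region_a
    X1_b, X2_b = region_b
    return all(
        any(p.get((a, x2), 0) * p.get((b, x2), 0) > 0 and f(a, x2) != f(b, x2)
            for x2 in X2_b)
        for a in X1_a for b in X1_b
    )

# Placeholder example on small alphabets.
p = {(x1, x2): 1 / 6 for x1 in (0, 1, 2) for x2 in (0, 1)}
f = lambda x1, x2: int(x1 + x2 >= 2)          # a toy quantization-like function
region_a = ({0}, {0, 1})
region_b = ({2}, {0, 1})
print(pairwise_x1_proper(region_a, region_b, p, f))   # True: some x2 separates 0 and 2
```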
theorem 15.7 Consider a quantization function f such that its function regions are
pair-wise X_1-proper. Then, G_{X_1} (and G^n_{X_1}, for any n) is formed of some non-overlapping fully connected maximally independent sets, and its minimum-entropy coloring can be achieved by assigning different colors to different maximally independent sets.
Proof of this theorem is presented in Section 15.5.12.
Note that, without the X_1-proper condition of Theorem 15.7, assigning different colors to
different partitions still leads to an achievable coloring scheme. However, it is not nec-
essarily a minimum-entropy coloring. In other words, without this condition, maximally
independent sets may overlap.
corollary 15.2 If a function f is strictly monotonic with respect to X_1, and p(x_1, x_2) ≠ 0 for all x_1 ∈ X_1 and x_2 ∈ X_2, then G_{X_1} (and G^n_{X_1}, for any n) is a complete graph.
Under the conditions of Corollary 15.2, functional compression does not give us
any gain, because, in a complete graph, one should assign different colors to different
vertices. Traditional compression where f is the identity function is a special case of
Corollary 15.2.
Section 15.3.3 presents conditions on either source probability distributions and/or the desired function such that characteristic graphs of random variables are composed of fully connected, non-overlapping maximally independent sets.
Algorithm 15.3
Suppose G_{X_1} = (V, E) is a graph composed of fully connected non-overlapping maximally independent sets and Ḡ_{X_1} = (V, Ē) represents its complement, where E and Ē partition the edges of the complete graph. Say C is the set of used colors, formed as follows.
• Choose a node v ∈ V.
• Color node v and its neighbors in the graph Ḡ_{X_1} by a color c_v such that c_v ∉ C.
• Add c_v to C. Repeat until all nodes are colored.
This algorithm finds a minimum-entropy coloring of G_{X_1} in polynomial time with respect to the number of vertices of G_{X_1}.
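A direct transcription of Algorithm 15.3 into code can look as follows (our own sketch; the example edge set is the characteristic graph of Example 15.1, whose complement is the two cliques {0, 2} and {1, 3}).

```python
def greedy_mis_coloring(vertices, edges):
    """Algorithm 15.3: color a graph whose complement is a disjoint union of
    cliques by giving each node and its complement-neighbors one fresh color."""
    edges = {frozenset(e) for e in edges}
    # Complement adjacency: v ~ u in the complement iff {v, u} is not an edge of G.
    comp_adj = {v: {u for u in vertices if u != v and frozenset((u, v)) not in edges}
                for v in vertices}
    color, next_color = {}, 0
    for v in vertices:
        if v in color:
            continue
        # Color v and its complement-neighbors with a color not used before.
        for u in {v} | comp_adj[v]:
            color[u] = next_color
        next_color += 1
    return color

# Placeholder graph: complement consists of the cliques {0, 2} and {1, 3}
# (this is the characteristic graph of Example 15.1).
V = [0, 1, 2, 3]
E = [(0, 1), (0, 3), (1, 2), (2, 3)]
print(greedy_mis_coloring(V, E))   # {0: 0, 2: 0, 1: 1, 3: 1}
```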
In this section, we discuss other aspects of the functional compression problem such
as the effect of having feedback and lossy computations. First, by presenting an exam-
ple, we show that, unlike in the Slepian–Wolf case, by having feedback in functional
compression, one can outperform the rate bounds of the case without feedback. Then,
we investigate the problem of distributed functional compression with distortion, where
computation of a function within a distortion level is desired at the receiver. Here, we
propose a simple sub-optimal coding scheme with a non-trivial performance guarantee.
Finally, we explain some future research directions.
Example 15.6 Consider a distributed functional compression problem with two sources
and a receiver as depicted in Fig. 15.7(a). Suppose each source has one byte (8 bits) to
transmit to the receiver. Bits are sorted from the most significant bit (MSB) to the least significant bit (LSB). Bits can be 0 or 1 with the same probability. The desired function
at the receiver is f (X1 , X2 ) = max(X1 , X2 ).
Figure 15.7 A distributed functional compression network (a) without feedback and (b) with feedback.
In the case without feedback, the characteristic graphs of sources are trivially com-
plete graphs. Therefore, each source should transmit all bits to the receiver (i.e., the
un-scaled rates are R1 = 8 and R2 = 8).
Now, suppose the receiver can broadcast some feedback bits to sources. In the fol-
lowing, we propose a communication scheme that has a reduced sum transmission rate
compared with the case without feedback:
First, each source transmits its MSB. The receiver compares two received bits. If they
are the same, the receiver broadcasts 0 to sources; otherwise it broadcasts 1. If sources
receive 1 from feedback, they stop transmitting. Otherwise, they transmit their next sig-
nificant bits. For this communication scheme, the un-scaled sum rate of the forward links
can be calculated as follows:
$$R_{f_1} + R_{f_2} = \underbrace{\frac{1}{2}(2 \times 1)}_{(a)} + \underbrace{\frac{1}{4}(2 \times 2)}_{(b)} + \cdots + \frac{1}{2^n}(2 \times n) \overset{(c)}{=} 2\left(2 - \frac{n+2}{2^n}\right), \qquad (15.12)$$
where n is the blocklength (in this example, n = 8), and Rf1 and Rf2 are the transmission
rates of sources X1 and X2 , respectively. Sources stop transmitting after the first bit
transmission if these bits are not equal. The probability of this event is 1/2 (term (a)
in equation (15.12)). Similarly, forward transmissions stop in the second round with
probability 1/4 (term (b) in equation (15.12)) and so on. Equality (c) follows from a
closed-form solution for the series i (i/2i ). For n = 8, this rate is around 3.92, which
is less than the sum rate in the case without feedback. With similar calculations, the
feedback rate is around 1.96. Hence, the total forward and feedback transmission rate is
around 5.88, which is less than that in the case without feedback.
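A quick numerical check of the closed form in (15.12) (a small illustrative script, not part of the original example):

```python
n = 8
# Left-hand side of (15.12): sum_{i=1}^{n} (1/2^i) * (2 * i).
forward_rate = sum((1 / 2**i) * (2 * i) for i in range(1, n + 1))
# Right-hand side: the closed form 2 * (2 - (n + 2) / 2^n).
closed_form = 2 * (2 - (n + 2) / 2**n)
print(round(forward_rate, 4), round(closed_form, 4))   # 3.9219 3.9219
```

Both evaluate to about 3.92 for n = 8, matching the value quoted in the example.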
Here, we present a practical coding scheme for this problem with a non-trivial perfor-
mance guarantee. All discussions can be extended to more general networks similar to
results of Section 15.3.
Consider two sources as described in Section 15.2.1. Here, we assume that the receiver
wants to compute a deterministic function f : X1 × X2 → Z or f : Xn1 × Xn2 → Zn , its
vector extension up to distortion D with respect to a given distortion function d : Z ×
Z → [0, ∞). A vector extension of the distortion function is defined as follows:
$$d(z^1, z^2) = \frac{1}{n}\sum_{i=1}^{n} d(z^1_i, z^2_i), \qquad (15.13)$$

$$P^n_e = \Pr\bigl[\{(x_1, x_2) : d\bigl(f(x_1, x_2),\, r(e^n_{X_1}(x_1), e^n_{X_2}(x_2))\bigr) > D\}\bigr].$$

We say a rate pair (R_1, R_2) is achievable up to distortion D if there exist e^n_{X_1}, e^n_{X_2}, and r such that P^n_e → 0 when n → ∞.
Yamamoto gives a characterization of a rate-distortion function for the side-
information functional compression problem (i.e., X2 is available at the receiver) in
[24]. The rate-distortion function proposed in [24] is a generalization of the Wyner–
Ziv side-information rate-distortion function [23]. Another multi-letter characterization
of the rate-distortion function for the side-information problem given by Yamamoto was
discussed in [16]. The multi-letter characterization of [16] can be extended naturally to
a distributed functional compression case by using results of Section 15.3.
Here, we present a practical coding scheme with a non-trivial performance gaurantee
for a given distributed lossy functional compression setup.
Define the D-characteristic graph of X_1 with respect to X_2, p(x_1, x_2), and f(X_1, X_2) as having vertices V = X_1, where the pair (x_1^1, x_1^2) is an edge if there exists some x_2^1 ∈ X_2 such that p(x_1^1, x_2^1) p(x_1^2, x_2^1) > 0 and d(f(x_1^1, x_2^1), f(x_1^2, x_2^1)) > D, as in [16]. Denote this graph as G_{X_1}(D). Similarly, we define G_{X_2}(D).
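Following this definition, a brute-force construction of G_{X_1}(D) for finite alphabets can look as follows (illustrative code; the PMF, function, and distortion below are placeholders, not taken from the chapter).

```python
from itertools import combinations

def d_characteristic_edges(X1, X2, p, f, d, D):
    """Edges of the D-characteristic graph G_X1(D): x1, x1' are connected iff
    some x2 with positive probability makes their function values differ by
    more than D under the distortion measure d."""
    return {
        (a, b)
        for a, b in combinations(X1, 2)
        if any(p.get((a, x2), 0) * p.get((b, x2), 0) > 0
               and d(f(a, x2), f(b, x2)) > D
               for x2 in X2)
    }

# Placeholder setup: f(x1, x2) = x1 + x2, absolute-error distortion.
X1, X2 = [0, 1, 2, 3], [0, 1]
p = {(x1, x2): 1 / 8 for x1 in X1 for x2 in X2}
f = lambda x1, x2: x1 + x2
d = lambda z1, z2: abs(z1 - z2)
print(d_characteristic_edges(X1, X2, p, f, d, D=1))
# three pairs: (0, 2), (0, 3), (1, 3) (set order may vary)
```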
theorem 15.8 For the network depicted in Fig. 15.1(b) with independent sources, if the
distortion function is a metric, then the following rate pair (R1 , R2 ) is achievable for the
distributed lossy functional compression problem with distortion D:
R1 ≥ HG X1 (D/2) (X1 ),
R2 ≥ HG X2 (D/2) (X2 ),
R1 + R2 ≥ HG X1 (D/2),G X2 (D/2) (X1 , X2 ). (15.14)
15.5 Proofs
$$H_{G_{X_1}}(X_1 \mid X_2) = \lim_{n\to\infty}\ \min_{c_{G^n_{X_1}},\, c_{G^n_{X_2}}} \frac{1}{n} H\bigl(c_{G^n_{X_1}}(X_1) \,\big|\, c_{G^n_{X_2}}(X_2)\bigr) = \lim_{n\to\infty}\ \min_{c_{G^n_{X_1}}} \frac{1}{n} H\bigl(c_{G^n_{X_1}}(X_1) \,\big|\, X_2\bigr).$$

Then, Lemma 15.1 implies that c_{G^n_{X_1}}(X_1) and c_{G^n_{X_2}}(X_2) = X_2 satisfy the CCC. A direct application of Theorem 15.2 completes the proof.
$$\hat{f} : c_{G^n_{X_1}}(\mathcal{X}_1) \times \cdots \times c_{G^n_{X_k}}(\mathcal{X}_k) \to \mathcal{Z}^n \qquad (15.15)$$

such that

$$\hat{f}\bigl(c_{G^n_{X_1}}(x_1), \ldots, c_{G^n_{X_k}}(x_k)\bigr) = f(x_1, \ldots, x_k), \quad \text{for all } (x_1, \ldots, x_k) \in T^n_\epsilon.$$

Proof Suppose the joint coloring family for these colorings is J_C = {j_i^c : i}. We proceed by constructing f̂. Assume (x_1^1, ..., x_k^1) ∈ j_i^c and c_{G^n_{X_1}}(x_1^1) = σ_1, ..., c_{G^n_{X_k}}(x_k^1) = σ_k. Define f̂(σ_1, ..., σ_k) = f(x_1^1, ..., x_k^1).
To show that this function is well defined on elements in its support, we should show that, for any two points (x_1^1, ..., x_k^1) and (x_1^2, ..., x_k^2) in T^n_ε, if c_{G^n_{X_1}}(x_1^1) = c_{G^n_{X_1}}(x_1^2), ..., c_{G^n_{X_k}}(x_k^1) = c_{G^n_{X_k}}(x_k^2), then f(x_1^1, ..., x_k^1) = f(x_1^2, ..., x_k^2).
Since c_{G^n_{X_1}}(x_1^1) = c_{G^n_{X_1}}(x_1^2), ..., c_{G^n_{X_k}}(x_k^1) = c_{G^n_{X_k}}(x_k^2), these two points belong to a joint coloring class such as j_i^c. Since c_{G^n_{X_1}}, ..., c_{G^n_{X_k}} satisfy the CCC, we have, by using Lemma 15.3, f(x_1^1, ..., x_k^1) = f(x_1^2, ..., x_k^2). Therefore, our function f̂ is well defined and has the desired property.
Lemma 15.7 implies that, given ε-colorings of characteristic graphs of random variables satisfying the CCC, at the receiver we can successfully compute the desired function f with a vanishing probability of error as n goes to infinity. Thus, if the decoder at the receiver is given the colors, it can look up f from its table of f̂. It remains to be ascertained at which rates the encoders can transmit these colors to the receiver faithfully (with a probability of error less than ε).
lemma 15.8 (Slepian–Wolf theorem) A rate region of a one-stage tree network with the
desired identity function at the receiver is characterized by the following conditions:
$$\forall S \in S(k) \;\Longrightarrow\; \sum_{i\in S} R_i \ge H(X_S \mid X_{S^c}). \qquad (15.16)$$
We now use the Slepian–Wolf (SW) encoding/decoding scheme on the achieved coloring random variables. Suppose the probability of error in each decoder of SW is less than ε/k. Then, the total error in the decoding of colorings at the receiver is less than ε. Therefore, the total error in the coding scheme of first coloring G^n_{X_1}, ..., G^n_{X_k} and then encoding those colors by using the SW encoding/decoding scheme is upper-bounded by the sum of the errors in each stage. By using Lemmas 15.7 and 15.8, we find that the total error is less than ε and goes to zero as n goes to infinity. By applying Lemma 15.8 to the achieved coloring random variables, we have

$$\forall S \in S(k) \;\Longrightarrow\; \sum_{i\in S} R_i \ge \frac{1}{n} H\bigl(c_{G^n_{X_S}} \,\big|\, c_{G^n_{X_{S^c}}}\bigr), \qquad (15.17)$$

where c_{G^n_{X_S}} and c_{G^n_{X_{S^c}}} are ε-colorings of characteristic graphs satisfying the CCC. Thus, using Definition 15.6 completes the achievability part.
(2) Converse. Here, we show that any distributed functional source coding scheme with a small probability of error induces ε-colorings on characteristic graphs of random variables satisfying the CCC. Suppose ε > 0. Define F^ε_n for all (n, ε) as follows:

$$F^\epsilon_n = \bigl\{\hat{f} : \Pr[\hat{f}(X_1, \ldots, X_k) \ne f(X_1, \ldots, X_k)] < \epsilon\bigr\}. \qquad (15.18)$$

In other words, F^ε_n is the set of all functions that differ from f with probability less than ε. Suppose f̂ is an achievable code with vanishing error probability, where

$$\hat{f}(x_1, \ldots, x_k) = r_n\bigl(e^n_{X_1}(x_1), \ldots, e^n_{X_k}(x_k)\bigr), \qquad (15.19)$$

where n is the blocklength. Then there exists n_0 such that, for all n > n_0, Pr(f̂ ≠ f) < ε. In other words, f̂ ∈ F^ε_n. We call these codes ε-error functional codes.
lemma 15.9 Consider some function f : X1 × · · · × Xk → Z. Any distributed functional
code which reconstructs this function with zero-error probability induces colorings on
G X1 , . . . ,G Xk satisfying the CCC with respect to this function.
Proof Say we have a zero-error distributed functional code represented by encoders e^n_{X_1}, ..., e^n_{X_k} and a decoder r. For any two points (x_1^1, ..., x_k^1) and (x_1^2, ..., x_k^2) with positive probabilities, if their encoded values are the same (i.e., e^n_{X_1}(x_1^1) = e^n_{X_1}(x_1^2), ..., e^n_{X_k}(x_k^1) = e^n_{X_k}(x_k^2)), their function values will be the same as well, since it is an error-free scheme:

$$f(x_1^1, \ldots, x_k^1) = r\bigl(e^n_{X_1}(x_1^1), \ldots, e^n_{X_k}(x_k^1)\bigr) = r\bigl(e^n_{X_1}(x_1^2), \ldots, e^n_{X_k}(x_k^2)\bigr) = f(x_1^2, \ldots, x_k^2). \qquad (15.20)$$
We show that e^n_{X_1}, ..., e^n_{X_k} are in fact some valid colorings of G_{X_1}, ..., G_{X_k} satisfying the CCC. We demonstrate this argument for X_1; the argument for the other random variables is analogous. First, we show that e^n_{X_1} induces a valid coloring on G_{X_1}, and then we show that this coloring satisfies the CCC.
Let us proceed by contradiction. If e^n_{X_1} did not induce a coloring on G_{X_1}, there must be some edge in G_{X_1} connecting two vertices with the same color. Let us call these vertices x_1^1 and x_1^2. Since these vertices are connected in G_{X_1}, there must exist an (x_2^1, ..., x_k^1) such that p(x_1^1, x_2^1, ..., x_k^1) p(x_1^2, x_2^1, ..., x_k^1) > 0, e^n_{X_1}(x_1^1) = e^n_{X_1}(x_1^2), and f(x_1^1, x_2^1, ..., x_k^1) ≠ f(x_1^2, x_2^1, ..., x_k^1). Taking x_2^1 = x_2^2, ..., x_k^1 = x_k^2 in equation (15.20) leads to a contradiction. Therefore, the contradiction assumption is wrong and e^n_{X_1} induces a valid coloring on G_{X_1}.
Now, we show that these induced colorings satisfy the CCC. If this were not true, there must exist two points (x_1^1, ..., x_k^1) and (x_1^2, ..., x_k^2) in a joint coloring class j_i^c such that there is no path between them in j_i^c. So, Lemma 15.2 says that the function f can acquire different values at these two points. In other words, it is possible to have f(x_1^1, ..., x_k^1) ≠ f(x_1^2, ..., x_k^2), where c_{G_{X_1}}(x_1^1) = c_{G_{X_1}}(x_1^2), ..., c_{G_{X_k}}(x_k^1) = c_{G_{X_k}}(x_k^2), which is in contradiction with equation (15.20). Thus, these colorings satisfy the CCC.
In the last step, we must show that any achievable functional code represented by F^ε_n induces ε-colorings on characteristic graphs satisfying the CCC.
lemma 15.10 Consider random variables X_1, ..., X_k. All ε-error functional codes of these random variables induce ε-colorings on characteristic graphs satisfying the CCC.
Proof Suppose f̂(x_1, ..., x_k) = r(e^n_{X_1}(x_1), ..., e^n_{X_k}(x_k)) ∈ F^ε_n is such a code. If the function it is desired to compute is f̂, then, according to Lemma 15.9, a zero-error reconstruction of f̂ induces colorings on characteristic graphs satisfying the CCC with respect to f̂. Let the set of all points (x_1, ..., x_k) such that f̂(x_1, ..., x_k) ≠ f(x_1, ..., x_k) be denoted by C. Since Pr(f̂ ≠ f) < ε, we have Pr[C] < ε. Therefore, the functions e^n_{X_1}, ..., e^n_{X_k} restricted to the complement of C are ε-colorings of characteristic graphs satisfying the CCC with respect to f.
According to Lemmas 15.9 and 15.10, any distributed functional source code with vanishing error probability induces ε-colorings on characteristic graphs of source variables satisfying the CCC with respect to the desired function f.
Then, according to the Slepian–Wolf theorem (Lemma 15.8), we have

$$\forall S \in S(k) \;\Longrightarrow\; \sum_{i\in S} R_i \ge \frac{1}{n} H\bigl(c_{G^n_{X_S}} \,\big|\, c_{G^n_{X_{S^c}}}\bigr), \qquad (15.21)$$

where c_{G^n_{X_S}} and c_{G^n_{X_{S^c}}} are ε-colorings of characteristic graphs satisfying the CCC with respect to f. Using Definition 15.6 completes the converse part.
R5 = R1 + R2 = HG X1 ,G X2 (X1 , X2 |X3 , X4 ).
which is the same condition as that of Theorem 15.4. Repeating this argument for R6
and R5 + R6 establishes the proof.
Figure 15.8 An example of G_{X_1,X_2} satisfying the conditions of Lemma 15.6, when X_2 has two members.
Consider Fig. 15.8, which illustrates the conditions of this lemma. Under these con-
ditions, since all x2 in X2 have different function values, the graph G X1 ,X2 can be
decomposed into subgraphs that have the same topology as G_{X_1} (i.e., are isomorphic to G_{X_1}), corresponding to each x_2 in X_2. These subgraphs are fully connected to each
other under the conditions of this lemma. Thus, any coloring of this graph can be rep-
resented as two colorings of G X1 (within each subgraph) and G X2 (across subgraphs).
Therefore, the minimum-entropy coloring of G X1 ,X2 is equal to the minimum-entropy
coloring of (G X1 ,G X2 ), i.e., HG X1 ,G X2 (X1 , X2 ) = HG X1 ,X2 (X1 , X2 ).
Figure 15.9 With a non-zero joint probability distribution, (a) maximally independent sets cannot overlap with each other (this figure depicts the contradiction); and (b) maximally independent sets should be fully connected to each other. In this figure, a solid line represents a connection, and a dashed line means that no connection exists.
Since there is no edge between x11 and x12 , for any x21 ∈ X2 , p(x11 , x21 )p(x12 , x21 ) > 0 and
f (x11 , x21 ) = f (x12 , x21 ). A similar argument can be expressed for x12 and x13 . In other words,
for any x21 ∈ X2 , p(x12 , x21 )p(x13 , x21 ) > 0 and f (x12 , x21 ) = f (x13 , x21 ). Thus, for all x21 ∈ X2 ,
p(x11 , x21 )p(x13 , x21 ) > 0 and f (x11 , x21 ) = f (x13 , x21 ). However, since x11 and x13 are connected
to each other, there should exist an x21 ∈ X2 such that f (x11 , x21 ) ≠ f (x13 , x21 ), which is
not possible. So, the contradiction assumption is not correct and these two maximally
independent sets do not overlap with each other.
We showed that maximally independent sets cannot have overlaps with each other.
Now, we want to show that they are also fully connected to each other. Again, let us
proceed by contradiction. Consider Fig. 15.9(b). Suppose w1 and w2 are two different
non-overlapping maximally independent sets. Suppose there exists an element in w2 (call
it x13 ) which is connected to one of the elements in w1 (call it x11 ) and is not connected to
another element of w1 (call it x12 ). By using a similar argument to the one in the previous
paragraph, we may show that it is not possible. Thus, x13 should be connected to x11 .
Therefore, if, for all (x1 , x2 ) ∈ X1 × X2 , p(x1 , x2 ) > 0, then the maximally independent
sets of G X1 are some separate fully connected sets. In other words, the complement of
G X1 is formed by some non-overlapping cliques. Finding the minimum-entropy coloring
of this graph is trivial and can be achieved by assigning different colors to these non-
overlapping fully connected maximally independent sets.
This argument also holds for any power of G_{X1}. Suppose x_1^1, x_1^2, and x_1^3 are some
typical sequences in X_1^n. If x_1^1 is not connected to x_1^2 and x_1^3, it is not possible to have
x_1^2 and x_1^3 connected. Therefore, one can apply a similar argument to prove the theorem for
G^n_{X1}, for some n. This completes the proof.
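To make the final step concrete, here is a minimal Python sketch (not from the chapter; the graph, probabilities, and function name are illustrative) of the minimum-entropy coloring in this special case: since the complement of G_{X1} is a disjoint union of cliques, one simply assigns one color to each maximally independent set and computes the entropy of the induced color distribution.

```python
import math
from collections import defaultdict

def min_entropy_coloring_entropy(vertices, edges, p):
    """Entropy (in bits) of the coloring that assigns one color to each maximally
    independent set; this is the minimum-entropy coloring when the complement of
    G_{X1} is a disjoint union of cliques, as in the argument above."""
    vertices = list(vertices)
    # Adjacency of the complement graph: u ~ v iff {u, v} is NOT an edge of G_{X1}.
    comp = {v: {u for u in vertices if u != v and frozenset((u, v)) not in edges}
            for v in vertices}
    # Connected components of the complement = the (disjoint) maximally independent sets.
    color, next_color = {}, 0
    for v in vertices:
        if v in color:
            continue
        stack = [v]
        while stack:
            u = stack.pop()
            if u in color:
                continue
            color[u] = next_color
            stack.extend(w for w in comp[u] if w not in color)
        next_color += 1
    # Probability mass of each color class, and the entropy of the induced coloring.
    q = defaultdict(float)
    for v in vertices:
        q[color[v]] += p[v]
    return -sum(pi * math.log2(pi) for pi in q.values() if pi > 0)

# Toy example: X1 = {a, b, c, d}; {a, b} and {c, d} are maximally independent sets
# that are fully connected to each other.
edges = {frozenset(e) for e in [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]}
probs = {"a": 0.4, "b": 0.1, "c": 0.3, "d": 0.2}
print(min_entropy_coloring_entropy("abcd", edges, probs))  # 1.0 bit
```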
d(f(x_1^1, x_2^1), f(x_1^2, x_2^2)) ≤ d(f(x_1^1, x_2^1), f(x_1^2, x_2^1)) + d(f(x_1^2, x_2^1), f(x_1^2, x_2^2))
≤ D/2 + D/2 = D. (15.22)
16 An Introductory Guide to Fano’s
Inequality with Applications in
Statistical Estimation
Jonathan Scarlett and Volkan Cevher
Summary
16.1 Introduction
The tremendous progress in large-scale statistical inference and learning in recent years
has been spurred by both practical and theoretical advances, with strong interactions
between the two: algorithms that come with a priori performance guarantees are clearly
desirable, if not crucial, in practical applications, and practical issues are indispensable
in guiding the theoretical studies.
A key role, complementary to that of performance bounds for specific algorithms, is
played by algorithm-independent impossibility results, stating conditions under which
one cannot hope to achieve a certain goal. Such results provide definitive benchmarks
for practical methods, serve as certificates for near-optimality, and help guide practical
developments toward directions where the greatest improvements are possible.
Since its introduction in 1948, the field of information theory has continually provided
such benefits for the problems of storing and transmitting data, and has accordingly
shaped the design of practical communication systems. In addition, recent years have
seen mounting evidence that the tools and methodology of information theory reach
far beyond communication problems, and can provide similar benefits within the entire
data-processing pipeline.
Table 16.1. Examples of applications for which impossibility results have been derived using
Fano's inequality

Sparse and low-rank problems: Group testing [2, 3]; Compressive sensing [4, 5]; Sparse Fourier transform [6, 7]; Principal component analysis [8, 9]; Matrix completion [10, 11]
Other estimation problems: Regression [12, 13]; Density estimation [13, 14]; Kernel methods [15, 16]; Distributed estimation [17, 18]; Local privacy [19]
Sequential decision problems: Convex optimization [20, 21]; Active learning [22]; Multi-armed bandits [23]; Bayesian optimization [24]; Communication complexity [25]
Other learning problems: Graph learning [26, 27]; Ranking [28, 29]; Classification [30, 31]; Clustering [32]; Phylogeny [33]
While many information-theoretic tools have been proposed for establishing impossi-
bility results, the oldest one remains arguably the most versatile and widespread: Fano’s
inequality [1]. This fundamental inequality is not only ubiquitous in studies of com-
munication, but has also been applied extensively in statistical inference and learning
problems; several examples are given in Table 16.1.
In applying Fano’s inequality to such problems, one typically encounters a number of
distinct challenges different from those found in communication problems. The goal of
this chapter is to introduce the reader to some of the key tools and techniques, explain
their interactions and connections, and provide several representative examples.
Figure 16.1 Reduction of minimax estimation to multiple hypothesis testing: an index V selects a
parameter θV, samples Y are generated (possibly from inputs X), an algorithm produces the output
θ̂, and an estimate V̂ of the index is inferred from θ̂. The gray boxes are fixed as part of the
problem statement, whereas the white boxes are constructed to our liking for the purpose of
proving a converse bound. The dashed line marked with X is optional, depending on whether inputs
are present.
In other words, if the algorithm yields ‖θ̂ − θv‖2² < (√2/2)ε, then V can be identified as
the index corresponding to the closest vector to θ̂. Thus, sufficiently accurate estimation
implies success in identifying V.
Discussion. Selecting the hard subset {θ1 , . . . , θ M } of parameters is often considered
something of an art. While the proofs of existing converse bounds may seem easy in
hindsight when the hard subset is known, coming up with a suitable choice for a new
problem usually requires some creativity and/or exploration. Despite this, there exist
general approaches that have proved to be effective in a wide range of problems, which
we exemplify in Sections 16.4 and 16.6.
In general, selecting the hard subset requires balancing conflicting goals: increasing
M so that the hypothesis test is more difficult, keeping the elements “close” so that
they are difficult to distinguish, and keeping the elements “sufficiently distant” so that
one can recover V from θ̂. Typically, one of the following three approaches is adopted:
(i) explicitly construct a set whose elements are known or believed to be difficult to
distinguish; (ii) prove the existence of such a set using probabilistic arguments; or (iii)
consider packing as many elements as possible into the entire space. We will provide
examples of all three kinds.
In the Bayesian setting, θ is already random, so we cannot use the above-mentioned
method of lower-bounding the worst-case performance by the average. Nevertheless, if
Θ is discrete, we can still use the trivial reduction V = θ to form a multiple-hypothesis-
testing problem with a possibly non-uniform prior. In the continuous Bayesian setting,
one typically requires more advanced methods that are not covered in this chapter; we
provide further discussion in Section 16.7.2.
P[V̂ ≠ V] ≥ 1 − (I(V; V̂) + log 2)/(log M). (16.4)
The intuition is as follows. The term log M represents the prior uncertainty (i.e., entropy)
of V, and the mutual information I(V; V̂) represents how much information V̂ reveals
about V. In order to have a small probability of error, we require that the information
revealed is close to the prior uncertainty.
Beyond the standard form of Fano’s inequality (16.4), it is useful to consider other
variants, including approximate recovery and conditional versions. These are the topic
of Section 16.2, and we discuss other alternatives in Section 16.7.2.
In this section, we state various forms of Fano's inequality that will form the basis for
the results in the remainder of the chapter. In the setting described in Section 16.1, the
estimate V̂ depends on V only through the samples, so that we have the Markov chain
V → Y → V̂, where Y is the collection of samples; we will exploit this fact in Section
16.3, but, for now, one can think of V̂ being randomly generated by any means given V.
The two fundamental quantities appearing in Fano's inequality are the conditional
entropy H(V|V̂), representing the uncertainty of V given its estimate, and the error
probability:
Pe = P[V̂ ≠ V]. (16.5)
Since H(V|V̂) = H(V) − I(V; V̂), the conditional entropy is closely related to the mutual
information, representing how much information V̂ reveals about V.
theorem 16.1 (Fano's inequality) For any discrete random variables V and V̂ on a
common finite alphabet V, we have
H(V|V̂) ≤ H2(Pe) + Pe log(|V| − 1), (16.6)
where H2(α) = α log(1/α) + (1 − α) log(1/(1 − α)) is the binary entropy function. In
particular, if V is uniform on V, we have
I(V; V̂) ≥ (1 − Pe) log|V| − log 2, (16.7)
or equivalently,
Pe ≥ 1 − (I(V; V̂) + log 2)/(log|V|). (16.8)
Since the proof of Theorem 16.1 is widely accessible in standard references such as
[34], we provide only an intuitive explanation of (16.6). To resolve the uncertainty in V
given V̂, we can first ask whether the two are equal, which bears uncertainty H2(Pe). If
they differ, which only occurs a fraction Pe of the time, the remaining uncertainty is at
most log(|V| − 1).
remark 16.1 For uniform V, we obtain (16.7) by upper-bounding |V| − 1 ≤ |V| and
H2(Pe) ≤ log 2 in (16.6), and subtracting H(V) = log|V| on both sides. While these addi-
tional bounds have a minimal impact for moderate to large values of |V|, a notable
case where one should use (16.6) is the binary setting, i.e., |V| = 2. In this case, (16.7)
is meaningless due to the right-hand side being negative, whereas (16.6) yields the
following for uniform V:
I(V; V̂) ≥ log 2 − H2(Pe). (16.9)
It follows that the error probability is lower-bounded as
Pe ≥ H2^{-1}(log 2 − I(V; V̂)), (16.10)
where H2^{-1}(·) ∈ [0, 1/2] is the inverse of H2(·) ∈ [0, log 2] on the domain [0, 1/2].
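As a quick numerical companion (illustrative only; not part of the chapter), the following Python sketch evaluates the lower bounds (16.8) and (16.10), inverting the binary entropy function by bisection; all logarithms are natural.

```python
import math

def H2(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def H2_inv(y):
    """Inverse of H2 on [0, 1/2] via bisection; y must lie in [0, log 2]."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if H2(mid) < y else (lo, mid)
    return (lo + hi) / 2

def fano_bound(I, M):
    """Lower bound on Pe for V uniform on an alphabet of size M, given I(V; V_hat) in nats."""
    if M == 2:  # use the tighter binary form (16.10)
        return H2_inv(max(math.log(2) - I, 0.0))
    return max(1.0 - (I + math.log(2)) / math.log(M), 0.0)  # standard form (16.8)

print(fano_bound(I=0.5, M=16))   # e.g., I = 0.5 nats and 16 hypotheses
print(fano_bound(I=0.1, M=2))    # binary case
```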
for some real-valued function d(v, v̂) and threshold t ∈ R. In contrast to the exact recovery
setting, there are interesting cases where V and V̂ are defined on different alphabets, so
we denote these by V and V̂, respectively.
One can interpret (16.11) as requiring V̂ to be within a "distance" t of V. How-
ever, d need not be a true distance function, and it need not even be symmetric or take
non-negative values. This definition of the error probability in fact entails no loss of gen-
erality, since one can set t = 0 and d(V, V̂) = 1{(V, V̂) ∈ E} for an arbitrary set E containing
the pairs that are considered errors.
In the following, we make use of the quantities
Nmax(t) = max_{v̂∈V̂} N_{v̂}(t),   Nmin(t) = min_{v̂∈V̂} N_{v̂}(t), (16.12)
where
N_{v̂}(t) = ∑_{v∈V} 1{d(v, v̂) ≤ t} (16.13)
Pe(t) ≥ 1 − (I(V; V̂) + log 2)/(log(|V|/Nmax(t))). (16.16)
The proof is similar to that of Theorem 16.1, and can be found in [35].
theorem 16.3 (Conditional Fano inequality) For any discrete random variables V and
V̂ on a common alphabet V, any discrete random variable A on an alphabet A, and any
subset A′ ⊆ A, the error probability Pe = P[V̂ ≠ V] satisfies
Pe ≥ ∑_{a∈A′} P[A = a] (H(V|V̂, A = a) − log 2)/(log(|V_a| − 1)), (16.17)
where V_a = {v ∈ V : P[V = v | A = a] > 0}. For possibly continuous A, the same holds
true with ∑_{a∈A′} P[A = a]( · · · ) replaced by E[1{A ∈ A′}( · · · )].
Proof We write Pe ≥ ∑_{a∈A′} P[A = a] P[V̂ ≠ V | A = a], and lower-bound the conditional
error probability using Fano's inequality (Theorem 16.1) under the joint distribution of
(V, V̂) conditioned on A = a.
remark 16.2 Our main use of Theorem 16.3 will be to average over the input X
(Fig. 16.1) in the case in which it is random and independent of V. In such cases, by
setting A = X in (16.17) and letting A′ contain all possible outcomes, we simply recover
Theorem 16.1 with conditioning on X in the conditional entropy and mutual informa-
tion terms. The approximate recovery version, Theorem 16.2, extends in the same way.
In Section 16.4, we will discuss more advanced applications of Theorem 16.3, including
(i) genie arguments, in which some information about V is revealed to the decoder, and
(ii) typicality arguments, where we condition on V falling in some high-probability set.
We saw in Section 16.2 that the mutual information I(V; V̂) naturally arises from Fano's
inequality when V is uniform. More generally, we have H(V|V̂) = H(V) − I(V; V̂), so
we can characterize the conditional entropy by characterizing both the entropy and the
mutual information. In this section, we provide some of the main useful tools for upper-
bounding the mutual information. For brevity, we omit the proofs of standard results
commonly found in information-theory textbooks, or simple variations thereof.
Throughout the section, the random variables V and V̂ are assumed to be discrete,
whereas the other random variables involved, including the inputs X = (X1 , . . . , Xn ) and
samples Y = (Y1 , . . . , Yn ), may be continuous. Hence, notation such as PY (y) may repre-
sent either a probability mass function (PMF) or a probability density function (PDF).
16.3.2 Tensorization
One of the most useful properties of mutual information is tensorization. Under
suitable conditional independence assumptions, mutual information terms containing
length-n sequences (e.g., Y = (Y1 , . . . , Yn )) can be upper-bounded by a sum of n mutual
information terms, the ith of which contains the corresponding entry of each associated
vector (e.g., Yi ). Thus, we can reduce a complicated mutual information term containing
sequences to a sum of simpler terms containing individual elements. The following
lemma provides some of the most common scenarios in which such tensorization can
be performed.
lemma 16.2 (Tensorization of mutual information) (i) If the entries of Y = (Y1, . . . , Yn)
are conditionally independent given V, then
I(V; Y) ≤ ∑_{i=1}^n I(V; Yi). (16.18)
(ii) If the entries of Y are conditionally independent given (V, X), and Yi depends on
(V, X) only through (V, Xi), then
I(V; Y|X) ≤ ∑_{i=1}^n I(V; Yi|Xi). (16.19)
(iii) If, in addition to the assumptions in part (ii), Yi depends on (V, Xi) only through
Ui = ψi(V, Xi) for some deterministic function ψi, then
I(V; Y|X) ≤ ∑_{i=1}^n I(Ui; Yi). (16.20)
The proof is based on the sub-additivity of entropy, along with the conditional indepen-
dence assumptions given. We will use the first part of the lemma when X is absent or
deterministic, and the second and third parts for random non-adaptive X. When X can
be chosen adaptively on the basis of the past samples (Section 16.1.1), the following
variant is used.
lemma 16.3 (Tensorization of mutual information for adaptive settings) (i) If Xi is a
function of (X_1^{i−1}, Y_1^{i−1}), and Yi is conditionally independent of (X_1^{i−1}, Y_1^{i−1}) given (V, Xi),
then
I(V; X, Y) ≤ ∑_{i=1}^n I(V; Yi|Xi). (16.21)
(ii) If, in addition to the assumptions in part (i), Yi depends on (V, Xi) only through
Ui = ψi(V, Xi) for some deterministic function ψi, then
I(V; X, Y) ≤ ∑_{i=1}^n I(Ui; Yi). (16.22)
The proof is based on the chain rule for mutual information, i.e., I(V; X, Y) =
∑_{i=1}^n I(Xi, Yi; V | X_1^{i−1}, Y_1^{i−1}), as well as suitable simplifications via the conditional
independence assumptions.
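As a sanity check on Lemma 16.2(i) (an illustrative sketch, not from the chapter), the following Python code computes I(V; Y1, Y2) and I(V; Y1) + I(V; Y2) exactly for a binary V observed through two conditionally independent binary symmetric channels, confirming the tensorization bound (16.18); the crossover probabilities are arbitrary.

```python
import itertools, math

def mutual_information(pxy):
    """I(X; Y) in nats for a joint pmf given as a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

eps1, eps2 = 0.1, 0.2        # crossover probabilities of the two BSCs (illustrative)
pv = {0: 0.5, 1: 0.5}        # V uniform on {0, 1}

# Joint pmf of (V, (Y1, Y2)) with Y1, Y2 conditionally independent given V.
joint = {}
for v, y1, y2 in itertools.product([0, 1], repeat=3):
    p1 = eps1 if y1 != v else 1 - eps1
    p2 = eps2 if y2 != v else 1 - eps2
    joint[(v, (y1, y2))] = pv[v] * p1 * p2

# Marginal joints of (V, Y1) and (V, Y2).
joint1, joint2 = {}, {}
for (v, (y1, y2)), p in joint.items():
    joint1[(v, y1)] = joint1.get((v, y1), 0) + p
    joint2[(v, y2)] = joint2.get((v, y2), 0) + p

lhs = mutual_information(joint)                       # I(V; Y1, Y2)
rhs = mutual_information(joint1) + mutual_information(joint2)
print(lhs, "<=", rhs)                                 # tensorization bound (16.18)
```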
remark 16.3 The mutual information bounds in Lemma 16.3 are analogous to those
used in the problem of communication with feedback (see Section 7.12 of [34]). A key
difference is that, in the latter setting, the channel input Xi is a function of (V, X_1^{i−1}, Y_1^{i−1}),
with V representing the message. In statistical estimation problems, the quantity V
being estimated is typically unknown to the decision-maker, so the input Xi is only a
function of (X_1^{i−1}, Y_1^{i−1}).
remark 16.4 Lemma 16.3 should be applied with care, since, even if V is uniform
on some set a priori, it may not be uniform conditioned on Xi. This is because, in the
adaptive setting, Xi depends on Y_1^{i−1}, which in turn depends on V.
and, in addition,
I(V; Y) ≤ ∑_{v,v′} PV(v) PV(v′) D(P_{Y|V}(· | v) ‖ P_{Y|V}(· | v′)) (16.26)
The upper bounds in (16.24)–(16.27) are closely related, and often essentially
equivalent in the sense that they lead to very similar converse bounds. In the authors’
experience, it is usually slightly simpler to choose a suitable auxiliary distribution
QY and apply (16.25), rather than bounding the pair-wise divergences as in (16.27).
Examples will be given in Sections 16.4 and 16.6.
remark 16.5 We have used the generic notation Y in Lemma 16.4, but in applications
this may represent either the entire vector Y, or a single one of its entries Yi . Hence, the
lemma may be used to bound I(V; Y) directly, or one may first apply tensorization and
then use the lemma to bound each I(V; Yi ).
remark 16.6 Lemma 16.4 can also be used to bound conditional mutual information
terms such as I(V; Y|X). Conditioned on any X = x, we can upper-bound I(V; Y|X = x)
using Lemma 16.4, with an auxiliary distribution QY|X=x that may depend on x. For
instance, doing this for (16.25) and then averaging over X, we obtain for any QY|X that
I(V; Y|X) ≤ max_v D(P_{Y|X,V}(· | ·, v) ‖ Q_{Y|X} | PX). (16.28)
The bound (16.25) in Lemma 16.4 is useful when there exists a single auxiliary
distribution QY that is "close" to each P_{Y|V}(·|v) in KL divergence, i.e., D(P_{Y|V}(· | v) ‖ QY)
is small. It is natural to extend this idea by introducing multiple auxiliary distributions,
and requiring only that any one of them is close to a given P_{Y|V}(·|v). This can be viewed
as "covering" the conditional distributions {P_{Y|V}(·|v)}_{v∈V} with "KL-divergence balls,"
and we will return to this viewpoint in Section 16.5.3.
lemma 16.5 (Mutual information bound via covering) Under the setup of Lemma 16.4,
suppose there exist N distributions Q1(y), . . . , QN(y) such that, for all v and some ε > 0,
it holds that
min_{j=1,...,N} D(P_{Y|V}(· | v) ‖ Qj) ≤ ε. (16.30)
Then we have
I(V; Y) ≤ log N + ε. (16.31)
The proof is based on applying (16.24) with QY(y) = (1/N) ∑_{j=1}^N Qj(y), and then
lower-bounding this summation over j by the value j*(v) achieving the minimum in
(16.30). We observe that setting N = 1 in Lemma 16.5 simply yields (16.25).
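The role of the auxiliary distribution is easy to illustrate numerically. The sketch below (all distributions are arbitrary illustrative choices, not from the chapter) computes I(V; Y) for a small discrete model and compares it with the bound max_v D(P_{Y|V}(·|v) ‖ QY) of (16.25), taking QY to be the uniform distribution.

```python
import math

def kl(p, q):
    """KL divergence in nats between two pmfs given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy model: V uniform on {0, 1, 2}, with P_{Y|V=v} given by the rows of P.
P = [[0.7, 0.2, 0.1],
     [0.6, 0.3, 0.1],
     [0.5, 0.3, 0.2]]
pv = [1 / 3] * 3

# Exact mutual information: I(V; Y) = sum_v P_V(v) D(P_{Y|V=v} || P_Y).
py = [sum(pv[v] * P[v][y] for v in range(3)) for y in range(3)]
mi = sum(pv[v] * kl(P[v], py) for v in range(3))

# Bound (16.25) with an auxiliary Q_Y, here taken to be uniform.
q = [1 / 3] * 3
bound = max(kl(P[v], q) for v in range(3))

print(mi, "<=", bound)
```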
where Zi ∼ Bernoulli(ε) for some ε ∈ (0, 1/2), ⊕ denotes modulo-2 addition, and ∨ is
the "OR" operation. In the channel coding terminology, this corresponds to passing
the noiseless test outcome ∨_{j∈S} X_{ij} through a binary symmetric channel. We assume
that the noise variables Zi are independent of each other and of X, and we define the
vector of test outcomes Y = (Y1, . . . , Yn).
• Given X and Y, a decoder forms an estimate Ŝ of S. We initially consider the exact
recovery criterion, in which the error probability is given by
Pe = P[Ŝ ≠ S], (16.33)
where we have also upper-bounded I(S; Ŝ|X) ≤ I(S; Y|X) using the data-processing
inequality (from the second part of Lemma 16.1), which in turn uses the fact that
S → Y → Ŝ conditioned on X.
Let Ui = ∨_{j∈S} X_{ij} denote the hypothetical noiseless outcome. Since the noise vari-
ables {Zi}_{i=1}^n are independent and Yi depends on (S, X) only through Ui (see (16.32)),
we can apply tensorization (from the third part of Lemma 16.2) to obtain
I(S; Y|X) ≤ ∑_{i=1}^n I(Ui; Yi) (16.36)
≤ n (log 2 − H2(ε)), (16.37)
where d(S , L) = |S \L|, and t = αk. Notice that a higher value of L means that more
non-defective items may be included in the list, whereas a higher value of α means that
more defective items may be absent.
theorem 16.5 (Group testing with approximate recovery) Under the preceding noisy
group-testing setup with list size L ≥ k, in order to achieve Pe(αk) ≤ δ for some α ∈ (0, 1)
(not depending on p), it is necessary that
n ≥ ((1 − α) k log(p/L))/(log 2 − H2(ε)) (1 − δ − o(1)) (16.39)
as p → ∞, k → ∞, and L → ∞ simultaneously with L = o(p).
Proof We apply the approximate recovery version of Fano's inequality (Theorem 16.2)
with d(S, L) = |S\L| and t = αk as above. For any L with cardinality L, the number of
S with d(S, L) ≤ αk is given by Nmax(t) = ∑_{j=0}^{αk} C(p−L, j) C(L, k−j), which follows by counting
the number of ways to place k − j defective items in L, and the remaining j defective
items in the other p − L entries. Hence, using Theorem 16.2 with conditioning on X (see
Section 16.2.3), and applying the data-processing inequality (from the second part of
Lemma 16.1), we obtain
I(S; Y|X) ≥ (1 − δ) log( C(p, k) / ∑_{j=0}^{αk} C(p−L, j) C(L, k−j) ) − log 2. (16.40)
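For concreteness, the following Python sketch evaluates the counting term appearing in (16.40) and the leading term of the sample-size lower bound (16.39); the function names and parameter values are illustrative assumptions, and the o(1) term is omitted.

```python
from math import comb, log

def H2(eps):
    """Binary entropy in nats."""
    return -eps * log(eps) - (1 - eps) * log(1 - eps)

def gt_sample_lower_bound(p, k, L, alpha, eps, delta):
    """Leading term of (16.39): (1 - alpha) k log(p / L) / (log 2 - H2(eps)) * (1 - delta)."""
    return (1 - alpha) * k * log(p / L) / (log(2) - H2(eps)) * (1 - delta)

def n_max(p, k, L, alpha):
    """Counting term in (16.40): number of sets S with |S \\ L| <= alpha * k."""
    return sum(comb(p - L, j) * comb(L, k - j) for j in range(int(alpha * k) + 1))

# Illustrative numbers (not from the chapter).
p, k, L, alpha, eps, delta = 10_000, 50, 100, 0.1, 0.11, 0.1
print(gt_sample_lower_bound(p, k, L, alpha, eps, delta))
print(log(comb(p, k) / n_max(p, k, L, alpha)))   # log of the ratio inside (16.40)
```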
Adaptive Testing
Next, we discuss the adaptive-testing setting, in which a given input vector Xi ∈ {0, 1}^p,
corresponding to a single row of X, is allowed to depend on the previous inputs and
outcomes, i.e., X_1^{i−1} = (X1, . . . , X_{i−1}) and Y_1^{i−1} = (Y1, . . . , Y_{i−1}). In fact, it turns out that
Theorems 16.4 and 16.5 still apply in this setting. Establishing this simply requires
making the following modifications to the above analysis.
• Apply the data-processing inequality in the form of the third part of Lemma 16.1,
yielding (16.35) and (16.40) with I(S ; X, Y) in place of I(S ; Y|X).
• Apply tensorization via Lemma 16.3 to deduce (16.36) and (16.37) with I(S ; X, Y)
in place of I(S ; Y|X).
In the regimes where Theorems 16.4 and/or 16.5 are known to have matching upper
bounds with non-adaptive designs, we can clearly deduce that adaptivity provides no
asymptotic gain. However, as with approximate recovery, adaptivity can significantly
broaden the conditions under which matching achievability bounds are known, at least
in the noiseless setting [40].
Figure 16.2 Two examples of graphs that are forests (i.e., acyclic graphs); the graph on the right is
also a tree (i.e., a connected acyclic graph).
PG(y1) = (1/Z) exp(λ ∑_{(i,j)∈E} y_{1i} y_{1j}), (16.42)
where a cycle is defined to be a path of distinct edges leading back to the start node,
e.g., (1, 4), (4, 2), (2, 1). A special case of a forest is a tree, which is an acyclic graph for
which a path exists between any two nodes. One can view any forest as being a dis-
joint union of trees, each defined on some subset of V. See Fig. 16.2 for an illustration.
• Let Y ∈ {−1, 1}^{n×p} be the matrix whose ith row contains the p entries of the ith
sample. Given Y, a decoder forms an estimate Ĝ of G, or equivalently, an estimate Ê
of E. We initially focus on the exact-recovery criterion, in which the minimax error
probability is given by
Mn(Gforest, λ) = inf_{Ĝ} sup_{G∈Gforest} PG[Ĝ ≠ G], (16.44)
where PG denotes the probability when the true graph is G, and the infimum is over
all estimators.
To the best of our knowledge, Fano’s inequality has not been applied previously in this
exact setup; we do so using the general tools for Ising models given in [26, 27, 42, 43].
Exact Recovery
Under the exact-recovery criterion, we have the following.
theorem 16.6 (Exact recovery of forest graphical models) Under the preceding Ising
graphical model selection setup with a given edge parameter λ > 0, in order to achieve
Mn (Gforest , λ) ≤ δ, it is necessary that
n ≥ max{ (log p)/(log 2), (2 log p)/(λ tanh λ) } (1 − δ − o(1)) (16.45)
as p → ∞.
Proof Recall from Section 16.1.1 that we can lower-bound the worst-case error
probability over Gforest by the average error probability over any subset of Gforest . This
gives us an important degree of freedom in the reduction to multiple hypothesis testing,
and corresponds to selecting a hard subset θ1 , . . . , θ M as described in Section 16.1.1. We
refer to a given subset G ⊆ Gforest as a graph ensemble, and provide two choices that
lead to the two terms in (16.45).
For any choice of G ⊆ Gforest, Fano's inequality (Theorem 16.1) gives
n ≥ ((1 − δ) log|G| − log 2)/I(G; Y1), (16.46)
for G uniform on G, where we used I(G; Ĝ) ≤ I(G; Y) ≤ n I(G; Y1) by the data-processing
inequality and tensorization (from the first parts of Lemmas 16.1 and 16.2).
Restricted Ensemble 1. Let G1 be the set of all trees. It is well known from graph
theory that the number of trees on p nodes is |G1| = p^{p−2} [44]. Moreover, since Y1 is
a length-p binary sequence, we have I(G; Y1) ≤ H(Y1) ≤ p log 2. Hence, (16.46) yields
n ≥ ((1 − δ)(p − 2) log p − log 2)/(p log 2), implying the first bound in (16.45).
Restricted Ensemble 2. Let G2 be the set of graphs containing a single edge, so that
|G2| = C(p, 2). We will upper-bound the mutual information using (16.25) in Lemma 16.4,
choosing the auxiliary distribution QY to be P_{G′}, with G′ being the empty graph. Thus,
we need to bound D(PG ‖ P_{G′}) for each G ∈ G2.
We first give an upper bound on D(PG ‖ P_{G′}) for any two graphs (G, G′). We start with
the trivial bound
D(PG ‖ P_{G′}) ≤ D(PG ‖ P_{G′}) + D(P_{G′} ‖ PG). (16.47)
Recall the definition D(P ‖ Q) = E_P[log(P(Y)/Q(Y))], and consider the substitution of
PG and P_{G′} according to (16.42), with different normalizing constants ZG and Z_{G′}. We
see that when we sum the two terms in (16.47), the normalizing constants inside the
logarithms cancel out, and we are left with
D(PG ‖ P_{G′}) ≤ λ ∑_{(i,j)∈E\E′} (E_G[Y_{1i} Y_{1j}] − E_{G′}[Y_{1i} Y_{1j}])
+ λ ∑_{(i,j)∈E′\E} (E_{G′}[Y_{1i} Y_{1j}] − E_G[Y_{1i} Y_{1j}]). (16.48)
In the case that G has a single edge (i.e., G ∈ G2) and G′ is the empty graph, we can
easily compute E_{G′}[Y_{1i} Y_{1j}] = 0, and (16.48) simplifies to
D(PG ‖ P_{G′}) ≤ λ E_G[Y_{1i} Y_{1j}], (16.49)
where (i, j) is the unique edge in G. Since Y_{1i} and Y_{1j} only take values in {−1, 1}, we
have E_G[Y_{1i} Y_{1j}] = (+1) P[Y_{1i} = Y_{1j}] + (−1) P[Y_{1i} ≠ Y_{1j}] = 2 P[Y_{1i} = Y_{1j}] − 1, and letting
E have a single edge in (16.42) yields PG[(Y_{1i}, Y_{1j}) = (y_i, y_j)] = e^{λ y_i y_j}/(2e^λ + 2e^{−λ}), and
hence PG[Y_{1i} = Y_{1j}] = e^λ/(e^λ + e^{−λ}). Combining this with E_G[Y_{1i} Y_{1j}] = 2 P[Y_{1i} = Y_{1j}] − 1
yields E_G[Y_{1i} Y_{1j}] = 2e^λ/(e^λ + e^{−λ}) − 1 = tanh λ. Hence, using (16.49) along with
(16.25) in Lemma 16.4, we obtain I(G; Y1) ≤ λ tanh λ. Substitution into (16.46) (with
log|G| = (2 log p)(1 + o(1))) yields the second bound in (16.45).
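As an illustrative numerical check (not part of the chapter; the edge parameter is an arbitrary choice), the following Python sketch computes, for a single-edge Ising model on a pair of variables, the exact correlation and the KL divergence to the empty-graph (uniform) model, confirming E_G[Y_i Y_j] = tanh λ and D(P_G ‖ P_{G′}) ≤ λ tanh λ.

```python
import math
from itertools import product

lam = 0.8  # edge parameter lambda (illustrative)

# Single-edge Ising model on the pair (Y_i, Y_j): P_G(y_i, y_j) proportional to exp(lam * y_i * y_j).
Z = sum(math.exp(lam * yi * yj) for yi, yj in product([-1, 1], repeat=2))
P_G = {(yi, yj): math.exp(lam * yi * yj) / Z for yi, yj in product([-1, 1], repeat=2)}

# Empty graph: the pair is uniform on {-1, 1}^2.
P_empty = {yy: 0.25 for yy in product([-1, 1], repeat=2)}

corr = sum(yi * yj * p for (yi, yj), p in P_G.items())
kl = sum(p * math.log(p / P_empty[yy]) for yy, p in P_G.items())

print(corr, math.tanh(lam))             # E_G[Y_i Y_j] equals tanh(lambda)
print(kl, "<=", lam * math.tanh(lam))   # D(P_G || P_empty) <= lambda * tanh(lambda)
```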
Theorem 16.6 is known to be tight up to constant factors whenever λ = O(1) [44, 45].
When λ is constant, the lower bound becomes n = Ω(log p), whereas for asymptotically
vanishing λ it simplifies to n = Ω((1/λ²) log p).
Approximate Recovery
We consider the approximate recovery of G = (V, E) with respect to the edit distance
d(G, Ĝ) = |E\Ê| + |Ê\E|, which is the number of edge additions and removals needed
to transform Ĝ into G or vice versa. Since any forest can have at most p − 1 edges, it is
natural to consider the case in which an edit distance of up to αp is permitted, for some
α > 0. Hence, the minimax risk is given by
Mn(Gforest, λ, α) = inf_{Ĝ} sup_{G∈Gforest} PG[d(G, Ĝ) > αp]. (16.50)
While the decoder may output a graph Ĝ not lying in G1, we can assume with-
out loss of generality that Ĝ is always selected such that d(Ĝ, G*) ≤ αp for some
G* ∈ G1; otherwise, an error would be guaranteed. As a result, for any Ĝ, and
any G ∈ G1 such that d(G, Ĝ) ≤ αp, we have from the triangle inequality that
d(G, G*) ≤ d(G, Ĝ) + d(Ĝ, G*) ≤ 2αp, which implies that
Nmax(αp) ≤ ∑_{G∈G1} 1{d(G, G*) ≤ 2αp}. (16.53)
Now observe that, since all graphs in G1 have exactly p − 1 edges, transforming G to G*
requires removing j edges and adding j different edges, for some j ≤ αp. Hence, we have
Nmax(αp) ≤ ∑_{j=0}^{αp} C(p−1, j) · C(C(p, 2) − p + 1, j). (16.54)
Adaptive Sampling
We now return to the exact-recovery setting, and consider a modification in which we
have an added degree of freedom in the form of adaptive sampling.
• The algorithm proceeds in rounds; in round i, the algorithm queries a subset of the
p nodes indexed by Xi ∈ {0, 1} p , and the corresponding sample Yi is generated as
follows.
where the infimum is over all adaptive algorithms that observe at most nnode nodes
in total.
theorem 16.8 (Adaptive sampling for forest graphical models) Under the preceding
Ising graphical model selection problem with adaptive sampling and a given parameter
λ > 0, in order to achieve Mnnode (Gforest , λ) ≤ δ, it is necessary that
n_node ≥ max{ (p log p)/(log 2), (2p log p)/(λ tanh λ) } (1 − δ − o(1)) (16.56)
as p → ∞.
Proof We prove the result using Ensemble 1 and Ensemble 2a above. We let N denote
the number of rounds; while this quantity is allowed to vary, we can assume without loss
of generality that N = n_node by adding or removing rounds where no nodes are queried.
For any subset G ⊆ Gforest, applying Fano's inequality (Theorem 16.1) and tensorization
(from the first part of Lemma 16.3) yields
∑_{i=1}^N I(G; Yi|Xi) ≥ (1 − δ) log|G| − log 2, (16.57)
where G is uniform on G.
Restricted Ensemble 1. We again let G1 be the set of all trees, for which we know
that |G1| = p^{p−2}. Since the n(Xi) entries of Yi differing from ∗ are binary, and those
equaling ∗ are deterministic given Xi, we have I(G; Yi|Xi = xi) ≤ n(xi) log 2. Averaging
over Xi and summing over i yields ∑_{i=1}^N I(G; Yi|Xi) ≤ ∑_{i=1}^N E[n(Xi)] log 2 ≤ n_node log 2,
and substitution into (16.57) yields the first bound in (16.56).
Restricted Ensemble 2. We again use the above-defined ensemble G2a of graphs with
p/2 isolated edges, for which we know that log|G2a| ≥ p log p (1 + o(1)). In this case,
when we observe n(Xi ) nodes, the subgraph corresponding to these observed nodes
has at most n(Xi )/2 edges, all of which are isolated. Hence, using Lemma 16.4, the
above-established fact that the KL divergence from a single-edge graph to the empty
graph is at most λ tanh λ, and the additivity of KL divergence for product distributions,
we deduce that I(G; Yi|Xi = xi) ≤ (n(xi)/2) λ tanh λ. Averaging over Xi and summing over
i yields ∑_{i=1}^N I(G; Yi|Xi) ≤ (1/2) n_node λ tanh λ, and substitution into (16.57) yields the second
bound in (16.56).
The threshold in Theorem 16.8 matches that of Theorem 16.6, and, in fact, a similar
analysis under approximate recovery also recovers the threshold in Theorem 16.7. This
suggests that adaptivity is of limited help in the minimax sense for the Ising model
and forest graph class. There are, however, other instances of graphical model selection
where adaptivity provably helps [43, 46].
Thus far, we have focused on using Fano’s inequality to provide converse bounds
for the estimation of discrete quantities. In many, if not most, statistical applications,
one is instead interested in estimating continuous quantities; examples include linear
regression, covariance estimation, density estimation, and so on. It turns out that the
discrete form of Fano’s inequality is still broadly applicable in such settings. The idea,
as outlined in Section 16.1, is to choose a finite subset that still captures the inherent
difficulty in the problem. In this section, we present several tools used for this purpose.
where ρ(θ, θ̂) is a metric, and Φ(·) is an increasing function from R+ to R+. For instance,
the squared-ℓ2 loss ℓ(θ, θ̂) = ‖θ − θ̂‖2² clearly takes this form.
We focus on the minimax setting, defining the minimax risk as follows:
Mn(Θ, ℓ) = inf_{θ̂} sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)], (16.61)
Then, we have
Mn(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2)/(log M)), (16.63)
where V is uniform on {1, . . . , M}, and the mutual information is with respect to
V → θV → Y. Moreover, in the special case M = 2, we have
Mn(Θ, ℓ) ≥ Φ(ε/2) H2^{-1}(log 2 − I(V; Y)), (16.64)
where H2^{-1}(·) ∈ [0, 0.5] is the inverse binary entropy function.
Proof As illustrated in Fig. 16.1, the idea is to reduce the estimation problem to
a multiple-hypothesis-testing problem. As an initial step, we note from Markov's
inequality that, for any ε0 > 0,
sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)] ≥ sup_{θ∈Θ} Φ(ε0) P_θ[ℓ(θ, θ̂) ≥ Φ(ε0)] (16.65)
= Φ(ε0) sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε0], (16.66)
where (16.66) uses (16.60) and the assumption that Φ(·) is increasing.
Suppose that a random index V is drawn uniformly from {1, . . . , M}, the samples
Y are drawn from the distribution P_θ^n corresponding to θ = θV, and the estimator is
applied to produce θ̂. Let V̂ correspond to the closest θj according to the metric ρ, i.e.,
V̂ = arg min_{v=1,...,M} ρ(θv, θ̂). Using the triangle inequality and the assumption (16.62), if
ρ(θv, θ̂) < ε/2 then we must have V̂ = v; hence,
Pv[ρ(θv, θ̂) ≥ ε/2] ≥ Pv[V̂ ≠ v], (16.67)
where Pv is a shorthand for P_{θv}.
With the above tools in place, we proceed as follows:
sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε/2] ≥ max_{v=1,...,M} Pv[ρ(θv, θ̂) ≥ ε/2] (16.68)
≥ max_{v=1,...,M} Pv[V̂ ≠ v] (16.69)
≥ (1/M) ∑_{v=1,...,M} Pv[V̂ ≠ v] (16.70)
≥ 1 − (I(V; Y) + log 2)/(log M), (16.71)
where (16.68) follows upon maximizing over a smaller set, (16.69) follows from (16.67),
(16.70) lower-bounds the maximum by the average, and (16.71) follows from Fano's
inequality (Theorem 16.1) and the fact that I(V; V̂) ≤ I(V; Y) by the data-processing
inequality (Lemma 16.1).
The proof of (16.63) is concluded by substituting (16.71) into (16.66) with ε0 = ε/2,
and taking the infimum over all estimators θ̂. For M = 2, we obtain (16.64) in the same
way upon replacing (16.71) by the version of Fano's inequality for M = 2 given in
Remark 16.1.
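To summarize how the pieces fit together, the following small Python helper (an illustrative sketch; the function name and numbers are assumptions) evaluates the bound (16.63), given a packing of M parameters with separation ε, an upper bound on the mutual information, and a loss of the form Φ(ρ).

```python
import math

def minimax_lower_bound(M, eps, I, Phi=lambda r: r ** 2):
    """Evaluate (16.63): Phi(eps / 2) * (1 - (I + log 2) / log M), clipped at zero.
    M: number of packed parameters; eps: their separation under rho;
    I: an upper bound on I(V; Y) in nats; Phi: the increasing loss transform."""
    frac = 1.0 - (I + math.log(2)) / math.log(M)
    return Phi(eps / 2) * max(frac, 0.0)

# Example: M = 1024 hypotheses, separation eps = 0.2, I(V; Y) <= 2 nats, squared loss.
print(minimax_lower_bound(M=1024, eps=0.2, I=2.0))
```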
We return to this result in Section 16.5.3, where we introduce and compare some
of the most widely used approaches to choosing the set {θ1 , . . . , θ M } and bounding the
mutual information.
Mn(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2)/(log(M/Nmax(t)))), (16.73)
where V is uniform on {1, . . . , M}, the mutual information is with respect to V → θV → Y,
and Nmax(t) = max_{v′∈V} ∑_{v∈V} 1{d(v, v′) ≤ t}.
The proof is analogous to that of Theorem 16.9, and can be found in [35].
Moreover, the same bound holds true when min_v D(P_{θv}^n ‖ Q^n) is replaced by any one of
(1/M) ∑_v D(P_{θv}^n ‖ Q^n), (1/M²) ∑_{v,v′} D(P_{θv}^n ‖ P_{θ_{v′}}^n), or max_{v,v′} D(P_{θv}^n ‖ P_{θ_{v′}}^n).
Figure 16.3 Examples of ε-packing (left) and ε-covering (right) sets in the case in which ρ0 is the
Euclidean distance in R². Since ρ0 is a metric, a set of points is an ε-packing if and only if their
corresponding ε/2-balls do not intersect.
Observe that assumption (16.62) of Theorem 16.9 precisely states that {θ1, . . . , θM} is
an ε-packing set, though the result is often applied with M far smaller than the ε-packing
number. The logarithm of the covering number is often referred to as the metric entropy.
The notions of packing and covering are illustrated in Fig. 16.3. We do not explore
the properties of packing and covering numbers in detail in this chapter; the interested
reader is referred to [48, 49] for a more detailed treatment. We briefly state the following
useful property, showing that the two definitions are closely related in the case in which
ρ0 is a metric.
lemma 16.7 (Packing versus covering numbers) If ρ0 is a metric, then M*_{ρ0}(Θ, 2ε) ≤
N*_{ρ0}(Θ, ε) ≤ M*_{ρ0}(Θ, ε).
We now show how to use Theorem 16.9 to construct a lower bound on the minimax
risk in terms of certain packing and covering numbers. For the packing number, we
will directly consider the metric ρ used in Theorem 16.9. On the other hand, for the
covering number, we consider the density P_θ^n(y) associated with each θ ∈ Θ, and use
the associated KL divergence measure:
N*_{KL,n}(Θ, ε) = N*_{ρ^n_{KL}}(Θ, ε),   ρ^n_{KL}(θ, θ′) = D(P_θ^n ‖ P_{θ′}^n). (16.75)
Proof Since Theorem 16.9 holds for any packing set, it holds for the maximal packing
set. Moreover, using Lemma 16.5, we have I(V; Y) ≤ log N*_{KL,n}(Θ, ε_{c,n}) + ε_{c,n} in (16.63),
since covering the entire space Θ is certainly enough to cover the elements in the
packing set. By combining these, we obtain the first part of the corollary. The second
part follows directly from the first part on choosing ε_{c,n} = n ε_c and noting that the KL
divergence is additive for product distributions.
Corollary 16.2 has been used as the starting point to derive minimax lower bounds for
a wide range of problems [13]; see Section 16.6 for an example. It has been observed
that the global approach is mainly useful for infinite-dimensional problems such as
density estimation and non-parametric regression, with the local approach typically
being superior for finite-dimensional problems such as vector or matrix estimation.
Mn(F) = inf_{X̂} sup_{f∈F} E_f[f(X̂)], (16.78)
where the infimum is over all optimization algorithms that iteratively query the
function n times and return a final point x̂ as above, and E_f denotes the expectation
when the underlying function is f.
In the following, we let X = (X1 , . . . , Xn ) and Y = (Y1 , . . . , Yn ) denote the queried
locations and samples across the n rounds.
theorem 16.11 (Minimax bound for noisy optimization) Fix ε > 0, and let
{f1, . . . , fM} ⊆ F be a finite subset of F such that, for each x ∈ X, we have fv(x) ≤ ε for
at most one value of v ∈ {1, . . . , M}. Then we have
Mn(F) ≥ ε · (1 − (I(V; X, Y) + log 2)/(log M)), (16.79)
where V is uniform on {1, . . . , M}, and the mutual information is with respect to
V → fV → (X, Y). Moreover, in the special case M = 2, we have
Mn(F) ≥ ε · H2^{-1}(log 2 − I(V; X, Y)), (16.80)
where H2^{-1}(·) ∈ [0, 0.5] is the inverse binary entropy function.
Proof By Markov’s inequality, we have
≥ sup · P f [ f (X)
sup E f [ f (X)] ≥ ]. (16.81)
f ∈F f ∈F
Suppose that a random index V is drawn uniformly from {1, . . . , M}, and the triplet
is generated by running the optimization algorithm on fV . Given X
(X, Y, X) = x,
let
V index the function among { f1 , . . . , f M } with the lowest corresponding value:
V = arg minv=1,...,M fv (
x).
By the assumption that any x satisfies fv (x) ≤ for at most one of the M functions,
x) ≤ implies
we find that the condition fv ( V = v. Hence, we have
Pv fv (X) > ≥ P fv [
V v]. (16.82)
The remainder of the proof follows (16.68)–(16.71) in the proof of Theorem 16.9. We
lower-bound the minimax risk sup f ∈F P f f (X) ≥ by the average over V, and apply
Fano’s inequality (Theorem 16.1 and Remark 16.1) and the data-processing inequality
(from the third part of Lemma 16.3).
remark 16.7 Theorem 16.11 is based on reducing the optimization problem to a
multiple-hypothesis-testing problem with exact recovery. One can derive an analogous
result reducing to approximate recovery, but we are unaware of any works making use
of such a result for optimization.
In this section, we present three applications of the tools introduced in Section 16.5:
sparse linear regression, density estimation, and convex optimization. Similarly to the
discrete case, our examples are chosen to permit a relatively simple analysis, while still
effectively exemplifying the key concepts and tools.
• Given knowledge of X and Y, an estimate θ̂ is formed, and the loss is given by the
squared ℓ2-error, ℓ(θ, θ̂) = ‖θ − θ̂‖2², corresponding to (16.60) with ρ(θ, θ̂) = ‖θ − θ̂‖2 and
Φ(·) = (·)². Overloading the general notation Mn(Θ, ℓ), we write the minimax risk as
Mn(k, X) = inf_{θ̂} sup_{θ∈R^p : ‖θ‖0 ≤ k} E_θ[‖θ − θ̂‖2²], (16.83)
Minimax Bound
The lower bound on the minimax risk is formally stated as follows. To simplify the
analysis slightly, we state the result in an asymptotic form for the sparse regime
k = o(p); with only minor changes, one can attain a non-asymptotic variant attaining the
same scaling laws for more general choices of k [35].
theorem 16.12 (Sparse linear regression) Under the preceding sparse linear regression
problem with k = o(p) and a fixed regression matrix X, we have
Mn(k, X) ≥ (σ² k p log(p/k))/(32 ‖X‖_F²) (1 + o(1)) (16.84)
as p → ∞. In particular, under the constraint ‖X‖_F² ≤ npΓ for some Γ > 0, achieving
Mn(k, X) ≤ δ requires n ≥ (σ² k log(p/k)/(32δΓ))(1 + o(1)).
Proof We present a simple proof based on a reduction to approximate recovery (The-
orem 16.10). In Section 16.6.1, we discuss an alternative proof based on a reduction to
exact recovery (Theorem 16.9).
We define the set
V = {v ∈ {−1, 0, 1}^p : ‖v‖0 = k}, (16.85)
and with each v ∈ V we associate a vector θv = εv for some ε > 0. Letting d(v, v′)
denote the Hamming distance, we have the following properties.
• For v, v′ ∈ V, if d(v, v′) > t, then ‖θv − θ_{v′}‖2 > ε√t.
• The cardinality of V is |V| = 2^k C(p, k), yielding log|V| ≥ log C(p, k) ≥ k log(p/k).
• The quantity Nmax(t) in Theorem 16.10 is the maximum possible number of
v′ ∈ V such that d(v, v′) ≤ t for a fixed v. Setting t = k/2, a simple counting
argument gives Nmax(t) ≤ ∑_{j=0}^{k/2} 2^j C(p, j) ≤ (k/2 + 1) · 2^{k/2} · C(p, k/2), which simplifies to
log Nmax(t) ≤ (k/2) log(p/k) (1 + o(1)) due to the assumption k = o(p).
From these observations, applying Theorem 16.10 with t = k/2 and separation ε√(k/2) yields
Mn(k, X) ≥ (k ε²/8) (1 − (I(V; Y) + log 2)/((k/2) log(p/k) (1 + o(1)))). (16.86)
Note that we do not condition on X in the mutual information, since we have assumed
that X is deterministic.
To bound the mutual information, we first apply tensorization (from the first part
of Lemma 16.2) to obtain I(V; Y) ≤ ∑_{i=1}^n I(V; Yi), and then bound each I(V; Yi) using
equation (16.24) in Lemma 16.4. We let QY be the N(0, σ2 ) density function, and we let
P_{v,i} denote the density function of N(Xi^T θv, σ²), where Xi^T is the transpose of the ith row
of X. Since the KL divergence between the N(μ0, σ²) and N(μ1, σ²) density functions
is (μ1 − μ0)²/(2σ²), we have D(P_{v,i} ‖ QY) = |Xi^T θv|²/(2σ²). As a result, Lemma 16.4
yields I(V; Yi) ≤ (1/|V|) ∑_v D(P_{v,i} ‖ QY) = (1/(2σ²)) E[|Xi^T θV|²] for uniform V. Summing
over i and recalling that θv = εv, we deduce that
I(V; Y) ≤ (ε²/(2σ²)) E[‖XV‖2²]. (16.87)
From the choice of V in (16.85), we can easily compute Cov[V] = (k/p) I_p, which
implies that E[‖XV‖2²] = (k/p) ‖X‖_F². Substitution into (16.87) yields I(V; Y) ≤ (ε²/
(2σ²)) · (k/p) ‖X‖_F², and we conclude from (16.86) that
Mn(k, X) ≥ (k ε²/8) (1 − ((ε²/(2σ²)) (k/p) ‖X‖_F² + log 2)/((k/2) log(p/k) (1 + o(1)))).
of the condition on the empirical covariance matrix requires a careful application of the
non-elementary matrix Bernstein inequality.
Overall, while the two approaches yield the same result up to constant factors in this
example, the approach based on approximate recovery is entirely elementary and avoids
the preceding difficulties.
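To make the counting in the proof of Theorem 16.12 concrete, the following Python sketch (parameter values and function names are illustrative assumptions) computes log|V|, the bound on log Nmax(k/2), and the leading term of the lower bound (16.84).

```python
from math import comb, log

def sparse_regression_lower_bound(p, k, sigma2, X_frob_sq):
    """Leading term of (16.84): sigma^2 * k * p * log(p / k) / (32 * ||X||_F^2)."""
    return sigma2 * k * p * log(p / k) / (32 * X_frob_sq)

def log_packing_terms(p, k):
    """log |V| for V = {v in {-1, 0, 1}^p : ||v||_0 = k}, and the bound on log Nmax(k/2)."""
    log_V = k * log(2) + log(comb(p, k))
    t = k // 2
    log_nmax = log(sum(2 ** j * comb(p, j) for j in range(t + 1)))
    return log_V, log_nmax

# Illustrative numbers (not from the chapter); ||X||_F^2 = n * p * Gamma as in the theorem.
p, k, sigma2, n, Gamma = 2000, 20, 1.0, 500, 1.0
print(sparse_regression_lower_bound(p, k, sigma2, X_frob_sq=n * p * Gamma))
print(log_packing_terms(p, k))
```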
The minimax risk is given by
Mn(η, Γ) = inf_{f̂} sup_{f∈F_{η,Γ}} E_f[‖f − f̂‖2²], (16.90)
Minimax Bound
The minimax lower bound is given as follows.
theorem 16.13 (Density estimation) Consider the preceding density estimation setup
with some η ∈ (0, 1) and Γ > 0 not depending on n. There exists a constant c > 0
(depending on η and Γ) such that, in order to achieve Mn (η, Γ) ≤ δ, it is necessary that
n ≥ c · (1/δ)^{3/2} (16.91)
when δ is sufficiently small. In other words, Mn(η, Γ) = Ω(n^{−2/3}).
Proof We specialize the general analysis of [13] to the class F_{η,Γ}. Recalling
the packing and covering numbers from Definitions 16.1 and 16.2, we adopt the
shorthand notation M*_2(ε_p) = M*_ρ(F_{η,Γ}, ε_p) with ρ(f, f′) = ‖f − f′‖2, and similarly
We now apply the following bounds on the packing number of Fη,Γ , which we state
from [13] without proof:
for some constants c, c′ > 0 and sufficiently small ε > 0. It follows that
Mn(η, Γ) ≥ (ε_p/2)² (1 − (c · (η ε_c)^{−1/2} + n ε_c + log 2)/(c′ · ε_p^{−1})). (16.98)
The remainder of the proof amounts to choosing ε_p and ε_c to balance the terms
appearing in this expression.
First, choosing ε_c to equate the terms c · (η ε_c)^{−1/2} and n ε_c leads to ε_c = c″/n^{2/3} with
c″ = (c η^{−1/2})^{2/3}, yielding (c · (η ε_c)^{−1/2} + n ε_c + log 2)/(c′ · ε_p^{−1}) = (2c″ n^{1/3} + log 2)/(c′ · ε_p^{−1}).
Next, choosing ε_p to make this fraction equal to 1/2 yields ε_p^{−1} = (2/c′)(2c″ n^{1/3} +
log 2), which means that ε_p ≥ c‴ · n^{−1/3} for a suitable c‴ > 0 and sufficiently large n.
Finally, since we made the fraction equal to 1/2, (16.98) yields Mn(η, Γ) ≥ ε_p²/8 ≥
(c‴)² n^{−2/3}/8. Setting Mn(η, Γ) = δ and solving for n yields the desired result.
The scaling given in Theorem 16.13 cannot be improved; a matching upper bound is
given in [13], and can be achieved even when η = 0.
where Z and Z′ are independent N(0, σ²) random variables, for some σ² > 0. This is
commonly referred to as the noisy first-order oracle.
Minimax Bound
The following theorem lower-bounds the number of queries required to achieve
δ-optimality. The proof is taken from [20] with only minor modifications.
theorem 16.14 (Stochastic optimization of strongly convex functions) Under the
preceding convex optimization setting with noisy first-order oracle information, in order
to achieve Mn (Fscv ) ≤ δ, it is necessary that
n ≥ (σ² log 2)/(40δ) (16.101)
when δ is sufficiently small.
Proof We construct a set of two functions satisfying the assumptions of Theorem
16.11. Specifically, we fix (ε, ε′) such that 0 < ε < ε′ < 1/8, define x1* = 1/2 − √(2ε′) and
x2* = 1/2 + √(2ε′), and set
fv(x) = (1/2)(x − xv*)², v = 1, 2. (16.102)
Figure 16.4 Construction of two functions in Fscv that are difficult to distinguish, and such that
any point x ∈ [0, 1] can be ε-optimal for only one of the two functions. (The plot shows the two
quadratics against the input x ∈ [0, 1], with the function value on the vertical axis.)
By construction, any point x ∈ [0, 1] that is ε-optimal for one of the two functions cannot be
ε-optimal for the other function. This is the condition needed in order for us to apply
Theorem 16.11, yielding from (16.80) that
Mn(Fscv) ≥ ε · H2^{-1}(log 2 − I(V; X, Y)). (16.103)
To bound the mutual information, we first apply tensorization (from the first part
of Lemma 16.3) to obtain I(V; X, Y) ≤ ∑_{i=1}^n I(V; Yi|Xi). We proceed by bounding
I(V; Yi|Xi) for any given i. Fix x ∈ [0, 1], let P_x^Y and P_x^{Y′} be the density functions of
the noisy samples of f1(x) and f1′(x), and let Q_x^Y and Q_x^{Y′} be defined similarly for
f0(x) = (1/2)(x − 1/2)². We have
D(P_x^Y × P_x^{Y′} ‖ Q_x^Y × Q_x^{Y′}) = D(P_x^Y ‖ Q_x^Y) + D(P_x^{Y′} ‖ Q_x^{Y′}) (16.104)
= (f1(x) − f0(x))²/(2σ²) + (f1′(x) − f0′(x))²/(2σ²), (16.105)
where (16.104) holds since the KL divergence is additive for product distributions, and
(16.105) uses the fact that the divergence between the N(μ0, σ²) and N(μ1, σ²) density
functions is (μ1 − μ0)²/(2σ²).
Recalling that f1(x) = (1/2)(x − 1/2 + √(2ε′))² and f0(x) = (1/2)(x − 1/2)², we have
(f1(x) − f0(x))² = (ε′ + √(2ε′)(x − 1/2))² ≤ (√(ε′/2) + ε′)² ≤ 2ε′, (16.106)
where the first inequality uses the fact that x ∈ [0, 1], and the second inequality follows
since ε′ < 1/8 and hence ε′ = √(ε′) · √(ε′) ≤ √(ε′)/√8 (note that (1/√8 + 1/√2)² ≤ 2). Moreover,
taking the derivatives of f0 and f1 gives (f1′(x) − f0′(x))² = 2ε′, and substitution into
(16.105) yields D(P_x^Y × P_x^{Y′} ‖ Q_x^Y × Q_x^{Y′}) ≤ 2ε′/σ².
The preceding analysis applies in a near-identical manner when f2 is used in
place of f1, and yields the same KL divergence bound when (P_x^Y, P_x^{Y′}) is defined
with respect to f2. As a result, for any x ∈ [0, 1], we obtain from (16.25) in Lemma
16.4 that I(V; Yi|Xi = x) ≤ 2ε′/σ². Averaging over Xi, we obtain I(V; Yi|Xi) ≤ 2ε′/σ²,
and substitution into the above-established bound I(V; X, Y) ≤ ∑_{i=1}^n I(V; Yi|Xi) yields
I(V; X, Y) ≤ 2nε′/σ². Hence, (16.103) yields
Mn(Fscv) ≥ ε · H2^{-1}(log 2 − 2nε′/σ²). (16.107)
Now observe that if nε′ ≤ σ²(log 2)/4 then the argument to H2^{-1}(·) is at least (log 2)/2. It
is easy to verify that H2^{-1}((log 2)/2) > 1/10, from which it follows that Mn(Fscv) > ε/10.
Setting ε = 10δ and noting that ε′ can be chosen arbitrarily close to ε, we conclude that
the required number of samples σ²(log 2)/(4ε′) recovers (16.101).
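The construction in the proof is easy to examine numerically. The sketch below (illustrative; the specific parameter values are assumptions) builds f1, f2, and the reference f0, checks that no x ∈ [0, 1] is ε-optimal for both functions, and evaluates the per-query KL divergence against the bound 2ε′/σ².

```python
import math

eps, eps_p, sigma2 = 0.01, 0.012, 1.0   # 0 < eps < eps' < 1/8 (illustrative choices)
x1, x2 = 0.5 - math.sqrt(2 * eps_p), 0.5 + math.sqrt(2 * eps_p)

f1 = lambda x: 0.5 * (x - x1) ** 2
f2 = lambda x: 0.5 * (x - x2) ** 2
f0 = lambda x: 0.5 * (x - 0.5) ** 2     # reference function playing the role of Q

# No x in [0, 1] is eps-optimal for both f1 and f2.
grid = [i / 1000 for i in range(1001)]
assert all(not (f1(x) <= eps and f2(x) <= eps) for x in grid)

# Per-query KL divergence: value and gradient observations, each Gaussian with variance
# sigma2, so D = ((f1 - f0)^2 + (f1' - f0')^2) / (2 * sigma2) <= 2 * eps' / sigma2.
d1 = lambda x: (x - x1)    # f1'(x)
d0 = lambda x: (x - 0.5)   # f0'(x)
kl = max(((f1(x) - f0(x)) ** 2 + (d1(x) - d0(x)) ** 2) / (2 * sigma2) for x in grid)
print(kl, "<=", 2 * eps_p / sigma2)
```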
Theorem 16.14 provides tight scaling laws, since stochastic gradient descent is
known to achieve δ-optimality for strongly convex functions using O(σ2 /δ) queries.
Analogous results for the multidimensional setting can be found in [20].
16.7 Discussion
work with, or can be used to derive tighter results. Generalizations of Fano’s inequality
have been proposed specifically for this purpose, as we discuss in the following
section.
where D2(p ‖ q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) is the binary KL divergence
function. Observe that, if V is uniform and E is the event that V̂ ≠ V, then we have
P_{V V̂}[E] = Pe and (P_V × P_{V̂})[E] = 1 − 1/|V|, and Fano's inequality (Theorem 16.1)
follows on substituting the definition of D2(· ‖ ·) into (16.108) and rearranging. This
proof lends itself to interesting generalizations, including the following.
Continuum version. Consider a continuous random variable V taking values on
V ⊆ R^p for some p ≥ 1, and an error probability of the form Pe(t) = P[d(V, V̂) > t] for
some real-valued function d on R^p × R^p. This is the same formula as (16.11), which we
previously introduced for the discrete setting. Defining the "ball" B_d(v̂, t) = {v ∈ R^p :
d(v, v̂) ≤ t} centered at v̂, (16.108) leads to the following for V uniform on V:
Pe(t) ≥ 1 − (I(V; V̂) + log 2)/(log(Vol(V)/sup_{v̂∈R^p} Vol(V ∩ B_d(v̂, t)))), (16.109)
where Vol(·) denotes the volume of a set. This result provides a continuous counterpart
to the final part of Theorem 16.2, in which the cardinality ratio is replaced by a volume
ratio. We refer the reader to [35] for example applications, and to [62] for the simple
proof outlined above.
Beyond KL divergence. The key step (16.108) extends immediately to other
measures that satisfy the data-processing inequality. A useful class of such measures is
the class of f-divergences: D_f(P ‖ Q) = E_Q[f(P(Y)/Q(Y))] for some convex function f
satisfying f(1) = 0. Special cases include KL divergence (f(z) = z log z), total variation
(f(z) = (1/2)|z − 1|), squared Hellinger distance (f(z) = (√z − 1)²), and χ²-divergence
( f (z) = (z − 1)2 ). It was shown in [60] that alternative choices beyond the KL divergence
can provide improved bounds in some cases. Generalizations of Fano’s inequality
beyond f -divergences can be found in [61].
Non-uniform priors. The first form of Fano’s inequality in Theorem 16.1 does not
require V to be uniform. However, in highly non-uniform cases where H(V) ≪ log|V|,
the term Pe log(|V| − 1) may be too large for the bound to be useful. In such cases, it
is often useful to use different Fano-like bounds that are based on the alternative proof
above. In particular, the step (16.108) makes no use of uniformity, and continues to
hold even in the non-uniform case. In [57], this bound was further weakened to provide
simpler lower bounds for non-uniform settings with discrete alphabets. Fano-type
lower bounds in continuous Bayesian settings with non-uniform priors arose more
recently, and are typically more technically challenging; the interested reader is referred
to [18, 63].
Acknowledgments
J. Scarlett was supported by an NUS startup grant. V. Cevher was supported by the
European Research Council (ERC) under the European Union’s Horizon 2020 research
and innovation programme (grant agreement 725594 – time-data).
References
[1] R. M. Fano, “Class notes for MIT course 6.574: Transmission of information,” 1952.
[2] M. B. Malyutov, “The separating property of random matrices,” Math. Notes Academy
Sci. USSR, vol. 23, no. 1, pp. 84–91, 1978.
[3] G. Atia and V. Saligrama, “Boolean compressed sensing and noisy group testing,” IEEE
Trans. Information Theory, vol. 58, no. 3, pp. 1880–1901, 2012.
[4] M. J. Wainwright, “Information-theoretic limits on sparsity recovery in the high-
dimensional and noisy setting,” IEEE Trans. Information Theory, vol. 55, no. 12, pp.
5728–5741, 2009.
[5] E. J. Candès and M. A. Davenport, “How well can we estimate a sparse vector?” Appl.
Comput. Harmonic Analysis, vol. 34, no. 2, pp. 317–323, 2013.
[6] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, “Nearly optimal sparse Fourier transform,”
in Proc. 44th Annual ACM Symposium on Theory of Computation, 2012, pp. 563–578.
[7] V. Cevher, M. Kapralov, J. Scarlett, and A. Zandieh, “An adaptive sublinear-time
block sparse Fourier transform,” in Proc. 49th Annual ACM Symposium on Theory of
Computation, 2017, pp. 702–715.
[8] A. A. Amini and M. J. Wainwright, “High-dimensional analysis of semidefinite relaxations
for sparse principal components,” Annals Statist., vol. 37, no. 5B, pp. 2877–2921, 2009.
[9] V. Q. Vu and J. Lei, “Minimax rates of estimation for sparse PCA in high dimensions,”
in Proc. 15th International Conference on Artificial Intelligence and Statistics, 2012, pp.
1278–1286.
[10] S. Negahban and M. J. Wainwright, “Restricted strong convexity and weighted matrix
completion: Optimal bounds with noise,” J. Machine Learning Res., vol. 13, no. 5, pp.
1665–1697, 2012.
[11] M. A. Davenport, Y. Plan, E. Van Den Berg, and M. Wootters, “1-bit matrix completion,”
Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.
[12] I. Ibragimov and R. Khasminskii, “Estimation of infinite-dimensional parameter in
Gaussian white noise,” Soviet Math. Doklady, vol. 236, no. 5, pp. 1053–1055, 1977.
[13] Y. Yang and A. Barron, “Information-theoretic determination of minimax rates of
convergence,” Annals Statist., vol. 27, no. 5, pp. 1564–1599, 1999.
[14] L. Birgé, “Approximation dans les espaces métriques et théorie de l’estimation,”
Probability Theory and Related Fields, vol. 65, no. 2, pp. 181–237, 1983.
[15] G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax-optimal rates for sparse additive
models over kernel classes via convex programming,” J. Machine Learning Res., vol. 13,
no. 2, pp. 389–427, 2012.
[16] Y. Yang, M. Pilanci, and M. J. Wainwright, “Randomized sketches for kernels: Fast and
optimal nonparametric regression,” Annals Statist., vol. 45, no. 3, pp. 991–1023, 2017.
[17] Y. Zhang, J. Duchi, M. I. Jordan, and M. J. Wainwright, “Information-theoretic lower
bounds for distributed statistical estimation with communication constraints,” in Advances
in Neural Information Processing Systems, 2013, pp. 2328–2336.
[18] A. Xu and M. Raginsky, “Information-theoretic lower bounds on Bayes risk in decentral-
ized estimation,” IEEE Trans. Information Theory, vol. 63, no. 3, pp. 1580–1600, 2017.
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax
rates,” in Proc. 54th Annual IEEE Symposium on Foundations of Computer Science, 2013,
pp. 429–438.
[20] M. Raginsky and A. Rakhlin, “Information-based complexity, feedback and dynamics in
convex programming,” IEEE Trans. Information Theory, vol. 57, no. 10, pp. 7036–7056,
2011.
[21] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright, “Information-theoretic
lower bounds on the oracle complexity of stochastic convex optimization,” IEEE Trans.
Information Theory, vol. 58, no. 5, pp. 3235–3249, 2012.
[22] M. Raginsky and A. Rakhlin, “Lower bounds for passive and active learning,” in Advances
in Neural Information Processing Systems, 2011, pp. 1026–1034.
[23] A. Agarwal, S. Agarwal, S. Assadi, and S. Khanna, “Learning with limited rounds of
adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons,”
in Proc. Conference on Learning Theory, 2017, pp. 39–75.
[24] J. Scarlett, “Tight regret bounds for Bayesian optimization in one dimension,” in Proc.
International Conference on Machine Learning, 2018, pp. 4507–4515.
[25] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar, “Information theory methods
in communication complexity,” in Proc. 17th IEEE Annual Conference on Computational
Complexity, 2002, pp. 93–102.
[26] N. Santhanam and M. Wainwright, “Information-theoretic limits of selecting binary
graphical models in high dimensions,” IEEE Trans. Information Theory, vol. 58, no. 7, pp.
4117–4134, 2012.
[27] K. Shanmugam, R. Tandon, A. Dimakis, and P. Ravikumar, “On the information theoretic
limits of learning Ising models,” in Advances in Neural Information Processing Systems,
2014, pp. 2303–2311.
[28] N. B. Shah and M. J. Wainwright, “Simple, robust and optimal ranking from pairwise
comparisons,” J. Machine Learning Res., vol. 18, no. 199, pp. 1–38, 2018.
[29] A. Pananjady, C. Mao, V. Muthukumar, M. J. Wainwright, and T. A. Courtade,
“Worst-case vs average-case design for estimation from fixed pairwise comparisons,”
http://arxiv.org/abs/1707.06217.
[30] Y. Yang, “Minimax nonparametric classification. I. Rates of convergence,” IEEE Trans.
Information Theory, vol. 45, no. 7, pp. 2271–2284, 1999.
[31] M. Nokleby, M. Rodrigues, and R. Calderbank, “Discrimination on the Grassmann
manifold: Fundamental limits of subspace classifiers,” IEEE Trans. Information Theory,
vol. 61, no. 4, pp. 2133–2147, 2015.
[32] A. Mazumdar and B. Saha, “Query complexity of clustering with side information,” in
Advances in Neural Information Processing Systems, 2017, pp. 4682–4693.
[33] E. Mossel, “Phase transitions in phylogeny,” Trans. Amer. Math. Soc., vol. 356, no. 6, pp.
2379–2404, 2004.
[34] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2006.
[35] J. C. Duchi and M. J. Wainwright, “Distance-based and continuum Fano inequalities with
applications to statistical estimation,” http://arxiv.org/abs/1311.2669.
[36] I. Sason and S. Verdú, “ f -divergence inequalities,” IEEE Trans. Information Theory,
vol. 62, no. 11, pp. 5973–6006, 2016.
[37] R. Dorfman, “The detection of defective members of large populations,” Annals Math.
Statist., vol. 14, no. 4, pp. 436–440, 1943.
[38] J. Scarlett and V. Cevher, “Phase transitions in group testing,” in Proc. ACM-SIAM
Symposium on Discrete Algorithms, 2016, pp. 40–53.
[39] J. Scarlett and V. Cevher, “How little does non-exact recovery help in group testing?” in
Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2017,
pp. 6090–6094.
[40] L. Baldassini, O. Johnson, and M. Aldridge, “The capacity of adaptive group testing,” in
Proc. IEEE International Symposium on Information Theory, 2013, pp. 2676–2680.
[41] J. Scarlett and V. Cevher, “Converse bounds for noisy group testing with arbitrary
measurement matrices,” in Proc. IEEE International Symposium on Information Theory,
2016, pp. 2868–2872.
[42] J. Scarlett and V. Cevher, “On the difficulty of selecting Ising models with approximate
recovery,” IEEE Trans. Signal Information Processing over Networks, vol. 2, no. 4, pp.
625–638, 2016.
[43] J. Scarlett and V. Cevher, “Lower bounds on active learning for graphical model selection,”
in Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.
[44] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky, “Learning high-dimensional Markov
forest distributions: Analysis of error rates,” J. Machine Learning Res., vol. 12, no. 5, pp.
1617–1653, 2011.
[45] A. Anandkumar, V. Y. F. Tan, F. Huang, and A. S. Willsky, “High-dimensional structure
estimation in Ising models: Local separation criterion,” Annals Statist., vol. 40, no. 3, pp.
1346–1375, 2012.
[46] G. Dasarathy, A. Singh, M.-F. Balcan, and J. H. Park, “Active learning algorithms
for graphical model selection,” in Proc. 19th International Conference on Artificial
Intelligence and Statistics, 2016, pp. 1356–1364.
[47] B. Yu, “Assouad, Fano, and Le Cam,” in Festschrift for Lucien Le Cam. Springer, 1997,
pp. 423–435.
[48] J. Duchi, “Lecture notes for Statistics 311/Electrical Engineering 377 (Stanford University),”
http://stanford.edu/class/stats311/.
[49] Y. Wu, “Lecture notes for ECE598YW: Information-theoretic methods for
high-dimensional statistics,” www.stat.yale.edu/~yw562/ln.html.
[50] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength
regime,” IEEE Trans. Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
[51] O. Johnson, “Strong converses for group testing from finite blocklength results,” IEEE
Trans. Information Theory, vol. 63, no. 9, pp. 5923–5933, 2017.
[52] R. Venkataramanan and O. Johnson, “A strong converse bound for multiple hypothesis
testing, with applications to high-dimensional estimation,” Electron. J. Statistics, vol. 12,
no. 1, pp. 1126–1149, 2018.
[53] P.-L. Loh, “On lower bounds for statistical learning theory,” Entropy, vol. 19, no. 11,
p. 617, 2017.
[54] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances
Appl. Math., vol. 6, no. 1, pp. 4–22, 1985.
[55] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino:
The adversarial multi-armed bandit problem,” in Proc. 36th Annual IEEE Symposium on
Foundations of Computer Science, 1995, pp. 322–331.
[56] E. Arias-Castro, E. J. Candès, and M. A. Davenport, “On the fundamental limits of
adaptive sensing,” IEEE Trans. Information Theory, vol. 59, no. 1, pp. 472–481, 2013.
[57] T. S. Han and S. Verdú, “Generalizing the Fano inequality,” IEEE Trans. Information
Theory, vol. 40, no. 4, pp. 1247–1251, 1994.
[58] L. Birgé, “A new lower bound for multiple hypothesis testing,” IEEE Trans. Information
Theory, vol. 51, no. 4, pp. 1611–1615, 2005.
[59] A. A. Gushchin, “On Fano’s lemma and similar inequalities for the minimax risk,”
Probability Theory and Math. Statistics, vol. 2003, no. 67, pp. 26–37, 2004.
[60] A. Guntuboyina, “Lower bounds for the minimax risk using f -divergences, and
applications,” IEEE Trans. Information Theory, vol. 57, no. 4, pp. 2386–2399, 2011.
[61] Y. Polyanskiy and S. Verdú, “Arimoto channel coding converse and Rényi divergence,”
in Proc. 48th Annual Allerton Conference on Communication, Control, and Computing,
2010, pp. 1327–1333.
[62] G. Braun and S. Pokutta, “An information diffusion Fano inequality,”
http://arxiv.org/abs/1504.05492.
[63] X. Chen, A. Guntuboyina, and Y. Zhang, “On Bayes risk lower bounds,” J. Machine
Learning Res., vol. 17, no. 219, pp. 1–58, 2016.
Index
α dimension, 77, 80, 86, 87, 94
χ2-divergence, 389, 499, 524
f-divergence, 524
  data compression, 72

adaptive composition, 326
ADX, 46
  achievability, 57
  converse, 56
Akaike information criterion, 366
algorithmic stability, 304, 308
aliasing, 65
almost exact recovery, 386, 394–401
  necessary conditions, 398–401
  sufficient conditions, 396–398
analog-to-digital
  compression, 46
  conversion, 46
approximate message passing (AMP), 74, 200, 202, 206–207
approximation error, 338, 365
arithmetic coding, 266
artificial intelligence, 264
Assouad’s method, 513
asymptotic efficiency, 363
asymptotic equipartition property, 6

barrier-based second-order methods, 110
basis pursuit, 74
Bayes decision rule, 333
Bayes risk, 337
Bayesian information criterion, 366
Bayesian network, 281
belief propagation, 200
bias, 365
bilinear problem, 223
binary symmetric community detection problem, 386–393
binary symmetric community model, 385
binary symmetric community recovery problem, 386–393
bitrate, 45
  unlimited, 57
  unrestricted, 53
Blahut–Arimoto algorithm, 283, 346
boosting, 327
Bradley–Terry model, 231
Bregman divergence, 277
Bregman information, 277
Bridge criterion, 371

capacity, 267, 325, 346
  channel, 346
cavity method, 393, 414
channel, 3, 325
  binary symmetric, 8
  capacity, 10, 11
  discrete memoryless, 10
  Gaussian, 200, 203, 210
channel coding, 339
  achievability, 9
  blocklength, 10
  capacity-achieving code, 15
  codebook, 10, 13
  codewords, 3, 10
  converse, 11
  decoder, 3
  encoder, 3
  error-correction codes, 10
  linear codes, 12
    generator matrix, 13
    parity-check matrix, 13
  maximum-likelihood decoding, 10
  polar codes, 14
  sphere packing, 10
  syndrome decoding, 12
characteristic (conflict) graph, 459
characteristic graph, 459

divergence
  Bregman divergence, 274, 277
  f-divergence, 274, 278
  I-divergence, 277
  Kullback–Leibler divergence, 11, 277

empirical loss, 303, 361
empirical risk, 303, 336
empirical risk minimization, 139, 153, 304, 337
  noisy, 324
encoder, 51, 335, 339
  randomized, 335
    capacity, 351
entropy, 5, 6
  conditional entropy, 7
entropy rate, 96
erasure T-information, 306
erasure mutual information, 306
error
  absolute generalization error, 313
  approximation error, 338, 365
  Bayes error, 333
  estimation error, 338
  expected generalization error, 304
  generalization error, 303
  prediction error, 364, 366, 368, 377, 379
  testing error, 331
  training error, 331
error probability, 493
  approximate recovery, 494
estimation, 105, 201, 265
  Bayesian, 489
  continuous, 510
  minimax, 489, 510
estimation bias, 365
estimation error, 338
estimation phase, 378
estimation variance, 365
exact recovery, 386, 394–401
  necessary conditions, 398–401
  sufficient conditions, 396–398
expectation propagation, 200
exponential family of distributions, 232

false-positive rate, 201
Fano’s inequality, 7, 11, 116, 134, 137, 141, 491
  applications, 488, 499, 516
  approximate recovery, 494
  conditional version, 495
  continuum version, 524
  generalizations, 524
  genie argument, 503
  limitations, 523
  standard version, 493
Fano’s lemma, 352
fast iterative shrinkage-thresholding algorithm, 74
filter
  pre-sampling, 63
  Wiener, 54, 64
first-moment method, 387–388
Fisher information, 209
frame, 136, 138, 148
  tight, 144, 145
frequency H(f), 51
frequency response, 51
functional compression
  distributed functional compression with distortion, 474
  feedback in functional compression, 473
  functional compression with distortion, 456
  lossless functional compression, 456
  network functional compression, 456
fundamental limit, 201, 265, 330

Gauss–Markov
  process, 62
Gaussian
  process, 50
  vector, 48
Gaussian complexity, 111
  localized Gaussian complexity, 105, 111
Gaussian mixture clustering, 415
generalization error, 303
generalization gap, 337
  bounds, 348–349
  information bounds, 348
generalized information criterion, 368
generalized likelihood ratio, 268
generalized likelihood ratio test, 268
generalized linear model, 203
Gibbs algorithm, 320
good codes, 210
graph
  forest, 504
  tree, 504
  weighted undirected graph, 108
graph coloring, 461
graph entropy, 459
graph sparsification via sub-sampling, 108
graphical model, 281
  Gaussian, 281
graphical model selection, 503
  adaptive sampling, 507
  approximate recovery, 506
  Bayesian, 509
graphons, 413
group testing, 500
  adaptive, 502
  approximate recovery, 501
  non-adaptive, 501
Guerra interpolation method, 384, 393, 414